Big data is defined as data whose scale exceeds the capacity of existing database (DB) management tools for collecting, storing, managing, and analyzing data. Healthcare big data has recently been established in various fields. It is mainly divided into electronic medical records, claims, genomic data, and patient-generated health data. Health care data is essential for improving the quality of treatment and reducing medical costs. It encompasses various data types and sources, including all health-related genotypes, family and friend relationships, biological phenotypes, environmental exposures, behaviors, and lifestyles, based on medical data.
Big data is divided into primary data (acquired for research) and secondary data (collected for claims). The use of secondary data enables thorough analyses; however, it carries a considerable risk of systematic and random errors. In Korea, the Health Insurance Review and Assessment Service (HIRA) and National Health Insurance Service (NHIS) established large-capacity health and healthcare big data open systems in 2011 and 2013, respectively, and provide researchers with various types of secured healthcare big data. The data can be used through on-site visits or remote access, which enables the customized analyses that industry and researchers require. Recently, a sample cohort DB based on claims data was built and made available, and it is being used in numerous clinical studies. This paper details the step-by-step knowledge required to design a clinical trial, write a thesis, and apply claims data to clinical practice using healthcare big data.
The representative public big data in the health and medical fields used by Korean researchers are the NHIS and HIRA DB data (Table 1). The NHIS and HIRA DBs contain the same information. However, there is a slight difference in the research data services provided by these institutions.
Data from 2002 on qualifications and insurance from birth to death, hospital usages, national health exam results, rare incurable cancer registration information, medical benefits, and elderly long-term care data are available from the NHIS DBs (NHIS data sharing service: https://nhiss.nhis.or.kr/bd/ay/bdaya001iv.do). The DBs also include treatment, medical checkups, and medical care history.
Research data is primarily provided as “customized research” and “sample cohort” DBs. The customized research DB refers to data that has been processed and delivered as customized data so that it can be used for policy decisions and academic research. It can be reviewed and analyzed using SAS or R in the data analysis room within the NHIS or at other designated local centers. The sample cohort DB comprises the sample, medical checkup, geriatric, children’s screening, and working women cohort DBs, which distinguishes it from the HIRA’s data provisioning service. The sample cohort DB 2.0, for example, contains a sample of approximately one million people, covers the entire country as of 2006, and provides data from 2002 to 2015, allowing for longitudinal research. The customized data provides only the date of death, whereas the sample cohort DB offers both the date and cause of death.
Because the customized research DB is a representative DB consisting of the medical usage behaviors, diagnostic codes, prescription codes, and drug codes of Koreans, all types of research, including research on rare diseases, are possible. If needed, a control group can be derived based on specified control group selection conditions. However, a customized research database poses considerable challenges related to access, cost, space, and data analysis. SPSS is difficult to use for analysis, so R or SAS must be used instead.
HIRA’s customized dataset has been available since 2007 and includes information on general specifications, medical treatment, disease, outpatient prescriptions, and the state of health care institutions (open data system of HIRA: http://opendata.hira.or.kr/home.do). Although it differs from the cohort DB of the NHIS, the HIRA also provides sample data, as well as inpatient, total patient, elderly patient, and pediatric patient datasets. In the pediatric patient dataset, the extraction rate of pediatric patients (under 20) after 2009 is 10%; therefore, the data of approximately one million children are provided. Since these are not longitudinal data, they are more suitable for cross-sectional studies such as prevalence studies. The difference from the NHIS is that HIRA provides additional information related to drugs: data on drug utilization reviews (DURs) and medication distribution management, information on treatment adequacy evaluation, and some non-reimbursable drugs. The researcher is given remote access rights for some customized datasets, which may then be examined via the HIRA server. Unlike NHIS, HIRA does not offer a date of death.
The disadvantages include the difficulty of obtaining an accurate clinical diagnosis and the exclusion of non-reimbursable prescription medications or treatments. Because this information is used to process insurance claims, doctors may enter disease codes that are more severe than the actual condition in order to avoid reimbursement cuts. In contrast, even for severe diseases, it may be difficult to accurately identify the code if it is not related to reimbursement. In the case of the NHIS data, analysis is only feasible by visiting the main center of the NHIS directly or the restricted rooms in the regional analysis centers. Additionally, individual patient data cannot be taken out of the analysis center; only results processed from the raw data into a summarized format, such as a table, can be exported. Since data is stored as of the invoice issuance date, there may be a time difference from when care was actually administered.
Another health and medical statistics source is the Korean National Institute of Health’s Korean Genome Analysis Project; this data is subject to limited disclosure. Additionally, the Korea Disease Control and Prevention Agency collects the Korea National Health & Nutrition Examination Survey, which is used to determine the current state and trends in people’s health and nutritional status, as well as the Community Health Survey, which is used to develop a community health care plan and evaluate health projects. If such healthcare big data are used in various ways, more diverse research can be conducted. However, since each sample datum is not linked with the National Cancer Center and Statistics Korea data, it has limitations as data from a single institution. Cancer registration data containing personal information cannot be used except in exceptional circumstances.
Prior to starting research using healthcare big data, it is critical to understand the characteristics of the data and the types of research that could be conducted using it. Case-control and cohort studies are the most frequently used research designs for health insurance data. The two designs differ in where the clinical outcome falls relative to the start of the study. A case-control study groups participants into those with and without the disease after the clinical outcome has occurred and then ascertains prior exposure to risk factors. A cohort study divides participants into intervention and non-intervention groups and compares the risk of future disease occurrence between the two groups. A cross-sectional study, a relatively simple design, extracts a sample and simultaneously investigates the presence of the targeted disease and its risk factors.
First, select the customized database or a sample cohort according to the study design. For diseases with a relatively high prevalence, researchers are recommended to use a sample cohort because most research objectives can be achieved with sampled cohort data alone. For diseases with a low prevalence or incidence, it is advantageous to utilize customized data targeting the entire population.
Researchers unfamiliar with the data have difficulty comprehending the classification and qualities, making it difficult to initiate the investigation. It is necessary to understand the characteristics of the claim code, and until now, analysis was possible only through SAS or R. To understand and analyze the data, a collaborator who can interpret it is essential.
In pediatric cancer research, public healthcare big data includes data from the entire population, so it can solve unmet research needs that are difficult to conclude with limited data. Since it is possible to check the medical records of each individual, it is an excellent data source for researching pediatric cancer survivors. When combined with national cancer registration data, it may help identify pediatric cancer patients and clarify their diagnostic name and date of diagnosis, allowing for world-class epidemiologic research of pediatric cancer patients.
Since such healthcare big data is based on claims for actual treatment, reimbursement-related factors affect data accuracy. As a result, the element of the study that requires the most effort is diagnostic accuracy. Because most research employs ICD-10 codes to define a specific disease, the number of patients with a given condition can frequently be higher or lower than expected. Thus, supporting evidence, such as the use of medications and surgery, can be evaluated concurrently to improve diagnostic accuracy. Even so, the data may not contain information about the use of essential drugs or procedures.
Unlike medical records, health insurance data lacks comprehensive clinical information, making it impossible to establish the temporal sequence of medical practices that could explain a causal relationship. For example, suppose a pediatric cancer patient has surgery because of a perforation caused by a colonoscopy. It is hard to tell whether the operation was for cancer therapy or for the colonoscopy perforation because only the codes for colonoscopy and surgery can be checked. The results of health checkups performed at patients’ own expense cannot be determined, and non-covered therapies such as new surgeries or medications are not included. While gender and age can be validated in the claims data, physical characteristics such as height and blood pressure and lifestyle factors such as drinking history, smoking history, and activity level are not included. Therefore, it is impossible to account for multiple risk variables in the data analysis, which reduces the precision and trustworthiness of the research.
It is necessary to select an appropriate research topic after thoroughly examining, with expert advice, whether data suitable for the research can be extracted. Because an operational diagnosis cannot be defined thoroughly using only a few claim items, the operational diagnosis of a targeted disease should be confirmed against prior research. It is preferable to first validate the accuracy of the operational diagnosis by comparing it to the institution’s health records or a cohort built by researchers’ organizations.
Research using public claims data is performed as a retrospective study. Compared to randomized clinical trials, a retrospective database study generally reflects routine care and allows long-term follow-up of large-scale patient data to determine the clinical effect. It is useful when timely research results are needed because research can be carried out in a relatively short time and at minimal cost.
By combining pre- and post-illness data, healthcare big data can be used to generate a disease cohort. An important advantage of such healthcare big data is that no patients drop out halfway through treatment, owing to the nature of Korea’s medical system, which is population-based. Because healthcare big data was initially designed to charge for medical treatment and treatment expenses, it enables numerous cost analyses.
When establishing a research hypothesis, it is necessary to consider whether it can be elucidated with this data. In studies involving child and adolescent cancer patients, topics that depend on non-reimbursable items such as immunotherapy or targeted anticancer drugs are not appropriate. It also needs to be taken into account that clinical results other than death are not confirmed. It is unknown whether over-the-counter medications are used. The limitations owing to insufficient information about the cancer stage and accurate histotype at the time of diagnosis must be considered. Since cancer diagnosis claims continue even after treatment is finished, the claims data does not precisely define recurrence or the occurrence of secondary cancer, which is a drawback.
To obtain the best results from claims data, research subjects must be chosen carefully, and operational definitions based on codes must be determined.
Recent analysis of operational definitions in cancer research (Table 2) has shown that both the prevalence and incidence rates were closest to the actual rates when the operational definition was “A person with a ‘C’ code in the primary diagnosis (SICK_SYM1) among inpatients (FORM_CD)”. In the case of breast, prostate, and cervical cancer, whose incidence is limited according to gender, these definitions are generally sufficient to estimate the real numbers. However, in other representative cancers, the overall prevalence tends to be overestimated.
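The operational definition quoted above can be sketched as a simple filter over claim records. This is a minimal illustration in Python, not the procedure used in the cited analysis: the field names SICK_SYM1 and FORM_CD come from the text, whereas the PERSON_ID field, the inpatient value "02", and the sample records are assumptions that must be checked against the data provider's code book.

```python
from typing import Iterable

# Assumption: the FORM_CD values that denote inpatient claims must be taken
# from the data provider's code book; "02" is a placeholder for illustration.
INPATIENT_FORM_CODES = {"02"}


def is_cancer_case(claim: dict, inpatient_codes=INPATIENT_FORM_CODES) -> bool:
    """True when the primary diagnosis starts with 'C' on an inpatient claim."""
    primary = (claim.get("SICK_SYM1") or "").strip().upper()
    return primary.startswith("C") and claim.get("FORM_CD") in inpatient_codes


def cancer_patients(claims: Iterable[dict]) -> set:
    """Distinct person IDs with at least one qualifying claim.

    PERSON_ID is a hypothetical identifier field used for illustration.
    """
    return {c["PERSON_ID"] for c in claims if is_cancer_case(c)}


# Fabricated toy records, for illustration only.
claims = [
    {"PERSON_ID": 1, "SICK_SYM1": "C910", "FORM_CD": "02"},
    {"PERSON_ID": 2, "SICK_SYM1": "C509", "FORM_CD": "03"},  # not inpatient
    {"PERSON_ID": 3, "SICK_SYM1": "J189", "FORM_CD": "02"},  # not a C code
    {"PERSON_ID": 1, "SICK_SYM1": "C910", "FORM_CD": "02"},  # repeat claim
]
patients = cancer_patients(claims)
print(sorted(patients))
```

Returning a set of person IDs rather than a list of claims reflects that prevalence and incidence are counted per person, so repeated claims for the same patient are counted once.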
Childhood and adolescent cancer can be operationally defined by cross-searching the specific symbolized type, which is a mark that provides selective insurance benefits for cancer patients, together with the KCD-10 code corresponding to cancer (Table 3). However, because diagnostic codes alone do not allow for the selection of patients who have actually received chemotherapy, subjects can also be selected by including all drug codes of chemotherapeutic agents or the treatment codes used when administering an anticancer drug (Table 4).
Disease, drug, treatment, material, and other codes should be used to define treatment, outcome variables, and confounding variables. In the case of a study that includes drug or treatment codes, changes in an insurance policy must be considered because changes in the new drug or treatment codes appear following changes in the insurance policy. Claim codes for hematopoietic stem cell transplantation, for example, vary by age, donor, and billing period, and some codes are no longer valid (Table 5). As a result, if the researchers want to add patients from previous periods, they should include outdated codes. To select patients who have received chemotherapy, it may be more effective to use the treatment code required for anticancer drug administration rather than the diagnostic codes (Table 4).
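One way to handle codes that change with insurance policy is to keep a lookup of each code together with its validity period and to test every claim against it. The sketch below assumes such a lookup exists; the code names, dates, and field names are hypothetical and do not correspond to any real claim code.

```python
from datetime import date

# Hypothetical codes and validity periods, for illustration only. Real values
# must come from the fee schedule in force for each billing period.
CODE_PERIODS = [
    ("OLD_CODE", date(2002, 1, 1), date(2010, 12, 31)),
    ("NEW_CODE", date(2011, 1, 1), None),  # None means the code is still valid
]


def code_is_valid(code, claim_date, periods=CODE_PERIODS):
    """True if `code` was a valid claim code on `claim_date`."""
    for known_code, start, end in periods:
        if known_code != code:
            continue
        if start <= claim_date and (end is None or claim_date <= end):
            return True
    return False


def select_patients(claims, periods=CODE_PERIODS):
    """Person IDs with at least one claim whose code was valid on its date."""
    return {
        c["PERSON_ID"]
        for c in claims
        if code_is_valid(c["code"], c["date"], periods)
    }


# Fabricated toy claims, for illustration only.
claims = [
    {"PERSON_ID": 1, "code": "OLD_CODE", "date": date(2008, 5, 1)},
    {"PERSON_ID": 2, "code": "NEW_CODE", "date": date(2015, 3, 9)},
    {"PERSON_ID": 3, "code": "OLD_CODE", "date": date(2015, 3, 9)},  # expired
    {"PERSON_ID": 4, "code": "OTHER", "date": date(2015, 3, 9)},     # unrelated
]
selected = select_patients(claims)
print(sorted(selected))
```

Keeping the outdated code in the lookup is what allows patient 1, treated in an earlier period, to be included; searching for the current code alone would miss that patient.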
The quality of research can be improved if various data sources are combined with the claims data. It is impossible to ascertain the exact date of diagnosis, cancer stage, and pathology results while researching cancer patients using claims data. If this information is required for the study, the National Cancer Center’s cancer registration data can be used by requesting a service that integrates it with insurance claims data. This linkage compensates for the limitation of diagnostic codes, which cannot accurately identify a patient even with an operational definition, and makes it possible to specify only actual patients with confirmed cancer. However, only national cancer registration data from 2011 can be used for combination data, and only data up to 10 years old can be requested.
If the outcome variable is in-hospital death, it can be defined using a medical outcome variable or a diagnostic code. However, the definition is unclear when death occurs outside of a medical institution, because such deaths cannot be identified from claims data. If the death data from Statistics Korea are linked with the customized research data, the exact date of death and cause of death listed in the death certificate can be utilized.
A confounding variable, which distorts the relationship between treatment and outcome variables, can be defined as a factor that is related to treatment and also affects outcomes [8-10]. It is necessary to define confounding variables related to treatment and apply a proper analysis method that controls the defined confounders. The usual methods of controlling confounding variables in study design are restriction and matching [5,9,12]. The restriction approach limits the study subjects by using appropriate inclusion/exclusion criteria to ensure the homogeneity of subjects included in the study. The matching process selects a comparison group whose distribution of the confounding variable is the same as or similar to that of the reference group, so that the two groups to be compared have comparable characteristics.
The methods of controlling confounding variables at the analysis stage are stratification, multiple regression models, and propensity score matching [9,10]. The restriction, matching, and stratification methods are appropriate when the number of confounding variables is small. If the number of confounding variables is large, it is more suitable to summarize the confounding variables as propensity scores and then apply the matching or stratification method.
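As a minimal sketch of stratification, the following Python example pools 2x2 tables across strata of a single confounder using the Mantel-Haenszel odds ratio. The choice of this estimator, the function name, and the counts are assumptions made for illustration; the text does not prescribe a specific stratified estimator.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata of a confounder.

    Each stratum is a tuple (a, b, c, d):
      a = exposed cases,   b = exposed non-cases,
      c = unexposed cases, d = unexposed non-cases.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these strata")
    return numerator / denominator


# Fabricated counts for two strata (for example, two age groups).
strata = [(10, 20, 5, 40), (30, 10, 20, 20)]
pooled_or = mantel_haenszel_or(strata)
print(round(pooled_or, 3))
```

Because each stratum contributes its own comparison of exposed and unexposed subjects, the pooled estimate is not distorted by differences in the confounder's distribution between the two groups. This only remains practical when the number of strata is small, which is consistent with the recommendation to switch to propensity scores when there are many confounders.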
Case-control studies are frequently used as the initial step in research to evaluate whether exposure factors increase disease risk [10,13]. They are less expensive than cohort studies and can be completed more quickly. They are also useful when the disease to be investigated is rare.
A nested case-control study is conducted by extracting cases and controls according to disease state from cohort data (Fig. 1). For example, suppose that 1,000 out of 100,000 people developed acute lymphoblastic leukemia after ten years of follow-up in a cohort study. Comparing gene expression between properly extracted cases and controls is an example of a nested case-control study. If the gene expression dataset were built at the early stage of the cohort, blood would have to be drawn from all 100,000 people to construct the gene expression DB. With the nested case-control method, time and money can be saved by extracting one to four times as many controls as cases and analyzing gene expression only in those subjects.
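The extraction of controls from a cohort can be sketched as follows. This Python example uses risk-set sampling, in which controls for each case are drawn from cohort members still under follow-up at the case's event time; that sampling scheme is an assumption, since the text specifies only the ratio of controls to cases. The field names and the toy cohort are hypothetical.

```python
import random


def sample_nested_controls(cohort, k=4, seed=0):
    """Draw up to k controls per case from the case's risk set.

    cohort: list of dicts with keys
      'id'    - subject identifier,
      'event' - True if the subject developed the disease,
      'time'  - years of follow-up until the event or censoring.
    Returns a dict mapping each case id to a list of control ids.
    """
    rng = random.Random(seed)
    matched = {}
    for case in cohort:
        if not case["event"]:
            continue
        # Risk set: everyone else still under follow-up at the case's event
        # time. A subject who becomes a case later may serve as a control.
        risk_set = [
            p["id"]
            for p in cohort
            if p["id"] != case["id"] and p["time"] >= case["time"]
        ]
        matched[case["id"]] = rng.sample(risk_set, min(k, len(risk_set)))
    return matched


# Fabricated toy cohort of 100 subjects, for illustration only.
cohort = [
    {"id": i, "event": i % 17 == 0, "time": 1 + (i % 10)}
    for i in range(100)
]
matched = sample_nested_controls(cohort, k=4)
for case_id, control_ids in sorted(matched.items()):
    print(case_id, control_ids)
```

With the 1,000 cases in the example above and four controls per case, at most 5,000 subjects would need gene expression profiling instead of all 100,000.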
Research using healthcare big data is a representative observational study, and since the data is already established, it is a retrospective cohort study. In other words, after observation of the subjects is completed, the results of observation are constructed as data, and then the research begins with the data (Fig. 2). It was the most frequent design among studies using the National Health Information DB in Korea, accounting for 56.7% of the total reports. The study time frame is after, not before, cohort recruitment, similar to nested case-control research. A retrospective cohort study is essentially a cohort study; unlike in a case-control study, the research group is not chosen based on disease status.
Suppose miners working in a mine were followed for ten years. A retrospective cohort study design can be used to compare these miners’ lung cancer death rate to that of the general population. Within this cohort of miners, a nested case-control study is one in which cases and controls are extracted based on lung cancer status, and exposure is then compared between the cases and controls.
Propensity score matching (PSM) is a method of selecting a comparison group with the same distribution of confounding factors as the reference group, so that the two groups to be compared have similar characteristics (Fig. 3) [10,14]. Confounding factors are usually selected as matching variables. Among all studies using the National Health Information DB in Korea, 18.58% used a matching method, and PSM was the most common, used in 68.25% of those studies.
The PSM method is a non-parametric method that creates conditions similar to those of a randomized trial in observational studies where randomized trials are impossible. PSM minimizes selection bias in control selection by controlling the influence of confounding variables through multiple covariates. The covariate should not be a parameter, and the propensity scores of the two groups should overlap substantially. When there are many unexposed subjects in a cohort format, PSM matches approximately 1 to 10 unexposed patients to each exposed patient, choosing those with demographic and socioeconomic characteristics, medical evaluations, and comorbidities similar to those of the exposed participants. Incidence and mortality are then measured during the follow-up period, as in a cohort study.
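The matching step can be sketched in Python as greedy nearest-neighbor matching without replacement. Several assumptions apply: the propensity scores are taken as already estimated (commonly by logistic regression on the covariates, which is not shown), and the greedy algorithm, the caliper of 0.05, the function name, and the scores themselves are illustrative choices rather than the procedure of any cited study.

```python
def greedy_match(exposed, unexposed, k=1, caliper=0.05):
    """Match up to k unexposed subjects to each exposed subject.

    exposed, unexposed: dicts mapping subject id -> propensity score.
    A candidate is eligible only if its score lies within `caliper` of the
    exposed subject's score; each unexposed subject is used at most once.
    Returns a dict mapping exposed id -> list of matched unexposed ids.
    """
    available = dict(unexposed)
    matches = {}
    for exposed_id, score in sorted(exposed.items(), key=lambda kv: kv[1]):
        candidates = sorted(
            (uid for uid, s in available.items() if abs(s - score) <= caliper),
            key=lambda uid: abs(available[uid] - score),
        )[:k]
        for uid in candidates:
            del available[uid]  # matching without replacement
        matches[exposed_id] = candidates
    return matches


# Fabricated propensity scores, for illustration only.
exposed = {"e1": 0.30, "e2": 0.62, "e3": 0.90}
unexposed = {"u1": 0.28, "u2": 0.33, "u3": 0.60, "u4": 0.65, "u5": 0.10}
matches = greedy_match(exposed, unexposed, k=2, caliper=0.05)
for exposed_id, matched_ids in matches.items():
    print(exposed_id, matched_ids)
```

In this toy example, subject e3 receives no match because no unexposed score falls within the caliper, which illustrates why the propensity scores of the two groups need to overlap: exposed subjects outside the region of overlap drop out of the matched analysis.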
In research using big data, it is necessary to choose a topic on which existing small-scale studies cannot easily draw conclusions and to sufficiently justify the use of big data. The operational definition is the most critical component of big data research. If there is potential for error or confusion, the analysis results will be difficult to trust. Therefore, it is critical to validate the trustworthiness of the big data observational study by comparing the identification code or algorithm used to identify the target patients to actual hospital data [16,17]. If previously published articles have validated the criteria for patient selection, these can be cited instead. The methods used in the study for determining risk factor exposure, clinical outcome, operation or surgery, or administration of drugs should be detailed. The patient data initially extracted, and the subjects remaining after the selection and exclusion criteria are applied, should be presented in a flow chart for easy understanding.
Owing to the retrospective cohort nature of the secondary healthcare DB, numerous biases can emerge when performing pharmacoepidemiological investigations (Table 6). Bias is primarily classified into confounding, selection, measurement, and time-related biases. Confounding factors impair the accuracy of experimental design and outcomes when they are not controlled. Therefore, it is critical to identify confounding variables and appropriately correct for bias.
Among the time-related biases, immortal time bias is a representative one to be aware of in cohort studies. Immortal time means the follow-up time during which death (or a specific event) cannot occur according to the study design (Fig. 4).
As follow-up time increases, treatment decisions are often delayed, or patients are followed without treatment. Therefore, as the observation period elapses, the composition of the patient and control groups defined at the start of treatment may change. Immortal time bias includes misclassification bias and selection bias.
Immortal time bias can be reduced by using the landmark method or the Mantel-Byar method (Fig. 5) [20,21]. In the Mantel-Byar method, time starts at the moment of therapy initiation with all patients in the “non-response” state. Those who eventually respond to therapy enter the “response” state at the time of response and remain there until death or censoring, and those who do not respond always remain in the “non-response” state. This method removes the bias because patients are compared according to their response status at various periods during their follow-up. Landmark analysis is a method of analyzing patients who survived beyond the landmark period. In this method, time starts being measured at a fixed time after the initiation of therapy. This fixed time is arbitrary but must be clinically meaningful.
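The bookkeeping behind the two approaches can be sketched in Python as follows. The first function applies a landmark: patients whose follow-up ended before the landmark are excluded, group membership is fixed by response status at the landmark, and time is counted from the landmark. The second function splits each patient's follow-up into "non-response" and "response" person-time, which is the time-dependent classification underlying the Mantel-Byar method; it does not compute the Mantel-Byar test statistic itself. The field names and patient records are hypothetical.

```python
def landmark_groups(patients, landmark):
    """Classify patients by response status at the landmark time."""
    groups = {"response": [], "non-response": []}
    for p in patients:
        if p["follow_up"] < landmark:
            continue  # died or censored before the landmark: excluded
        responded = (
            p["response_time"] is not None and p["response_time"] <= landmark
        )
        groups["response" if responded else "non-response"].append(
            {
                "id": p["id"],
                "time": p["follow_up"] - landmark,  # clock starts at landmark
                "died": p["died"],
            }
        )
    return groups


def person_time_by_state(patients):
    """Split follow-up into non-response and response person-time."""
    totals = {
        "non-response": {"time": 0, "deaths": 0},
        "response": {"time": 0, "deaths": 0},
    }
    for p in patients:
        r = p["response_time"]
        if r is None:
            totals["non-response"]["time"] += p["follow_up"]
            final_state = "non-response"
        else:
            # Time before the response counts as non-response time.
            totals["non-response"]["time"] += r
            totals["response"]["time"] += p["follow_up"] - r
            final_state = "response"
        if p["died"]:
            totals[final_state]["deaths"] += 1
    return totals


# Fabricated patients (times in months from therapy initiation).
patients = [
    {"id": 1, "follow_up": 24, "died": False, "response_time": 3},
    {"id": 2, "follow_up": 4, "died": True, "response_time": None},
    {"id": 3, "follow_up": 18, "died": True, "response_time": 8},
    {"id": 4, "follow_up": 30, "died": False, "response_time": None},
]
groups = landmark_groups(patients, landmark=6)
totals = person_time_by_state(patients)
print(groups)
print(totals)
```

In the toy data, patient 3 responds at month 8. A naive analysis would count all 18 months of that patient's follow-up in the response group, including the first 8 months during which the patient had to be alive in order to respond later. Both functions avoid crediting that immortal time to the response group: the landmark version classifies patient 3 as a non-responder at month 6, and the person-time version assigns the first 8 months to the non-response state.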
Globally, interest in and use of big data in health care has risen in recent years. In Korea, the use of big data in health care is accelerating due to the opening of big data through the NHIS and HIRA open data systems. However, because the data are collected for the purpose of reimbursement review, medical research requires a reprocessing procedure. There are restrictions, such as on linking with external data (from other institutions, nations, etc.) and the lack of non-reimbursement information. Nevertheless, a comprehensive study is attainable by using a proper operational definition to define pediatric cancer patients and by combining accurate cancer diagnosis data from national cancer registration data with mortality data from Statistics Korea.
This work was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI21C2046).
The author has no conflict of interest to declare.