Poster Session A   |   11:45am Expo - Hall A & C   |   Poster ID #422

Enhancing Pediatric Cancer Prevention Research: A Reusable Pipeline for Data Integration and Harmonization

Program: Prevention
Category: Tertiary Prevention
FDA Status: Not Applicable
CPRIT Grant:
Cancer Site(s): All Cancers
Authors:
Shiming Zhang
The University of Texas Health Science Center at Houston
Maria C Swartz
The University of Texas M.D. Anderson Cancer Center
Keri Schadler
The University of Texas M.D. Anderson Cancer Center
Clark Andersen
The University of Texas M.D. Anderson Cancer Center
Donna Kelly
The University of Texas M.D. Anderson Cancer Center
Alakh P Rajan
The University of Texas M.D. Anderson Cancer Center
Eduardo Gonzalez Villarreal
The University of Texas M.D. Anderson Cancer Center
Stephanie J Wells
The University of Texas M.D. Anderson Cancer Center
Amy Heaton
The University of Texas M.D. Anderson Cancer Center
Michael D Swartz
The University of Texas Health Science Center at Houston
Kelly W Merriman
The University of Texas M.D. Anderson Cancer Center
Karen Moody
The University of Texas M.D. Anderson Cancer Center

Introduction

Pediatric cancer researchers often struggle to acquire sufficient data because the diseases they study are rare and the patient population is vulnerable. To address this, they often combine data from multiple sources, such as tumor registries and electronic medical records (EMR), to develop insights into cancer incidence, treatment response, and survivorship outcomes. Such integration requires extensive data cleaning and quality assurance to harmonize disparate data sets, yet these processes are often time-consuming and lack standardization. We propose a reusable data integration and harmonization pipeline for pediatric cancer research to streamline the process. Adopting this approach can save researchers time and improve the rigor and reproducibility of their studies.

Methods

Initially, pediatric cancer patient data were collected from the MD Anderson cancer registry. To enhance the depth of the dataset, we linked the registry data to other datasets, including oncology, patient billing, patient acute care, hospital administration, and patient referral, using the unique patient ID. We developed R code to identify, harmonize, replace, transform, and standardize variable formats. Additionally, we collaborated with clinical practitioners with diverse expertise to establish rules for reclassifying variables to facilitate analyses; these rules were retained for constructing classification models for future variables. For example, we consulted pediatric oncologists to reduce the number of cancer diagnosis categories from 291 to 5. We also conducted data validation procedures to ensure the accuracy, consistency, and reliability of the finalized data.
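The linkage, standardization, and reclassification steps described above can be sketched as follows. This is a minimal illustration only, written in Python with the standard library rather than the R code the authors developed. All field names, the sex and date formats, the sample records, and the diagnosis-to-group rules are hypothetical assumptions, not the study's actual variables or clinician-approved rules.

```python
from datetime import datetime

# Hypothetical rule table standing in for clinician-agreed reclassification
# rules (the study collapsed 291 diagnosis categories into 5; these entries
# are illustrative only).
DIAGNOSIS_GROUPS = {
    "acute lymphoblastic leukemia": "leukemia",
    "osteosarcoma": "non-neural solid tumor",
    "medulloblastoma": "neural tumor",
}

# Assumed source encodings for sex and dates.
SEX_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def standardize_date(value):
    """Return an ISO date string, or None if the value cannot be parsed."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except (ValueError, AttributeError):
            continue
    return None


def link_by_patient_id(registry, *other_sources):
    """Left-join other sources onto registry rows using patient_id."""
    lookups = [{r["patient_id"]: r for r in src} for src in other_sources]
    linked = []
    for row in registry:
        merged = dict(row)
        for lookup in lookups:
            # Registry values win; other sources only fill in new fields.
            for key, val in lookup.get(row["patient_id"], {}).items():
                merged.setdefault(key, val)
        linked.append(merged)
    return linked


def harmonize(record):
    """Standardize formats and collapse categories for one linked record."""
    sex = (record.get("sex") or "").strip().lower()
    diagnosis = (record.get("diagnosis") or "").strip().lower()
    return {
        "patient_id": record["patient_id"],
        "sex": SEX_CODES.get(sex),
        "diagnosis_date": standardize_date(record.get("diagnosis_date")),
        # Unmapped diagnoses are flagged for review rather than guessed.
        "diagnosis_group": DIAGNOSIS_GROUPS.get(diagnosis, "needs review"),
        "insurance": record.get("insurance"),
    }


def completeness(records, field):
    """Fraction of records with a non-missing value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


# Made-up example records (not study data).
registry = [
    {"patient_id": "P001", "sex": "M", "diagnosis": "Osteosarcoma",
     "diagnosis_date": "3/6/2016"},
    {"patient_id": "P002", "sex": "female", "diagnosis": "Medulloblastoma ",
     "diagnosis_date": "2017-11-02"},
    {"patient_id": "P003", "sex": "F", "diagnosis": "Rare entity",
     "diagnosis_date": "unknown"},
]
billing = [
    {"patient_id": "P001", "insurance": "private"},
    {"patient_id": "P003", "insurance": "public"},
]

final = [harmonize(r) for r in link_by_patient_id(registry, billing)]
```

Keeping the reclassification rules in a lookup table, separate from the code that applies them, is one way to make such rules reusable across datasets; flagging unmapped values as "needs review" keeps the clinician in the loop instead of silently assigning a category.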

Results

A cohort of 1,335 pediatric patients (aged newborn to 19 years) between 3/6/2016 and 12/31/2019 was extracted from the MD Anderson cancer registry, with 84% completeness in demographics. After integrating data from five additional datasets and harmonizing, the final dataset included 1,076 pediatric patients with 40 demographic and clinical variables. Among these variables, 5 covariates with more than 10 categories were reclassified into a smaller number of analyzable levels, and 10 demographic/clinical covariates with varied formats were standardized. As a result, the final demographic variables achieved 100% completeness. The mean age at cancer diagnosis was 10.9 years, the mean age at data extraction was 15.6 years, 55.3% were male, 40.8% were non-Hispanic White, 74.3% were English speakers, and 56.2% held private insurance. The sample included blood and solid tumor cancers, with 40.8% diagnosed as non-neural solid tumors. Among childhood cancer survivors, 34.9% had inpatient rehabilitation consultations, and 32.7% completed them. Among the treatment modalities, 71.3% had chemotherapy, 75.7% had radiation therapy, 4.3% received hormonal therapy, and 65.5% underwent surgery.

Conclusion

Our study developed and applied a rigorous and replicable pipeline for integrating and harmonizing pediatric cancer research data, allowing the processing of most EMR data to be automated and standardized. For instance, this method could be used to quickly analyze factors related to inpatient rehabilitation services for childhood cancer survivors. By employing this method, we enhanced research efficiency, ensured rigorous investigations, and lowered technical obstacles for non-technical researchers.