2023 Innovations in Cancer Prevention and Research Conference

Poster Session A | 11:45am Expo - Hall A & C | Poster ID #246

Accelerating Cancer Data Standardization and Sharing through CancerOntoGPT: A Novel Architecture Leveraging Large Language Models

Program:

Academic Research

Category:

Bioinformatics and Computational Biology

FDA Status:

Not Applicable

CPRIT Grant:

RR180012

Cancer Site(s):

All Cancers

Authors:

Xiaoqian Jiang
The University of Texas Health Science Center at Houston

Kai Zhang
The University of Texas Health Science Center at Houston

Introduction

The heterogeneity of cancer data makes standardization and sharing difficult, which hinders progress in cancer research and patient care. The mCODE™ initiative addresses this issue by establishing a core set of structured data elements for oncology electronic health records (EHRs). Rapid advances in large language models (LLMs) such as GPT-4 and LLaMA offer new opportunities to accelerate this process. In this work, we introduce CancerOntoGPT, a novel architecture that leverages LLMs for cancer data standardization and sharing.

Methods

CancerOntoGPT combines mCODE™ with LLMs to make data standardization more efficient and effective. By integrating LLMs into the standardization process, CancerOntoGPT can better interpret the complex language and semantics of oncology data, enabling more accurate and consistent data extraction and transformation. The architecture uses several novel components, including a code interpreter and nested parsing, together with the natural language processing capabilities of LLMs. These components are intended to facilitate data sharing and collaboration among researchers and clinicians and to improve the overall quality and accessibility of cancer data.
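
The abstract does not describe how nested parsing is implemented. As one possible reading, the minimal sketch below assumes the LLM returns a nested JSON object and flattens it into dotted paths that could then be mapped to structured data elements. The function name `flatten`, the example response, and every field name in it are hypothetical illustrations, not part of CancerOntoGPT or the mCODE™ specification.

```python
import json
from typing import Any, Dict


def flatten(node: Any, prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten nested dicts and lists into dotted-path keys."""
    flat: Dict[str, Any] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(value, path))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        # Leaf value: record it under the path accumulated so far.
        flat[prefix] = node
    return flat


# Hypothetical LLM response for one synthetic note (field names are assumed).
llm_output = """
{
  "patient": {"gender": "female"},
  "disease": {
    "primary_cancer_condition": "breast carcinoma",
    "stage": {"t": "T2", "n": "N1", "m": "M0"}
  }
}
"""

elements = flatten(json.loads(llm_output))
for path, value in elements.items():
    print(f"{path} = {value}")
```

Flattening to paths keeps each extracted value addressable on its own, which makes field-by-field comparison against a reference record straightforward.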

Results

We evaluated CancerOntoGPT on 200 synthetic oncology notes, achieving an average accuracy of 88.76% in data extraction and standardization, measured against the ground truth. This demonstrates the potential of our architecture for handling complex cancer data. To showcase the capabilities of CancerOntoGPT, we have also developed a demo hosted on the Hugging Face platform (https://mcodegpt.org).
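
The abstract does not define how average accuracy is computed. The sketch below shows one plausible definition, assumed here for illustration only: per-note accuracy is the fraction of ground-truth fields whose extracted value matches after case and whitespace normalization, and the reported figure is the mean over notes. The function names and the two toy notes are hypothetical and are not the evaluation data.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def note_accuracy(predicted: Record, truth: Record) -> float:
    """Fraction of ground-truth fields that the extraction got right."""
    if not truth:
        return 0.0
    correct = sum(
        1
        for field, expected in truth.items()
        if predicted.get(field, "").strip().lower() == expected.strip().lower()
    )
    return correct / len(truth)


def average_accuracy(pairs: List[Tuple[Record, Record]]) -> float:
    """Mean per-note accuracy over (predicted, truth) pairs."""
    if not pairs:
        return 0.0
    return sum(note_accuracy(pred, truth) for pred, truth in pairs) / len(pairs)


# Toy example with made-up values, for illustration only.
pairs = [
    ({"gender": "Female", "stage.t": "T2"}, {"gender": "female", "stage.t": "T2"}),
    ({"gender": "male", "stage.t": "T1"}, {"gender": "male", "stage.t": "T3"}),
]
print(f"average accuracy: {average_accuracy(pairs):.2%}")
```

Averaging per note weights every note equally regardless of how many fields it contains; pooling all fields before dividing would be an equally reasonable alternative.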

Conclusion

Through the implementation of CancerOntoGPT, we aim to accelerate the standardization and sharing of cancer data, ultimately contributing to the advancement of cancer research and the improvement of patient care. We hope to engage with leading experts in the field and foster collaborations that will drive the development and adoption of CancerOntoGPT in the oncology community.