From The Editor | March 13, 2026

Building Harmonized Real‑World Data In Oncology

By Erin Harris, Editor-In-Chief, Cell & Gene


During the “Data to Decisions: Building Regulatory-Grade RWE from EHR Systems in Oncology” session at PMWC 2026, panelists discussed why the future of regulatory‑grade real‑world data (RWD) will depend on prospective, harmonized data. Mayo Clinic’s Nadia Poluhina moderated the session, which featured Moffitt Cancer Center’s Alyssa Pybus, Ph.D., Flatiron Health’s Kate Estep, Mayo Clinic’s Jeremy Jones, M.D., and Julie Stein Deutsch, M.D. The panelists explained why data collection should depend not on physicians acting as data entry clerks but on subject‑matter expertise and relational data models that tie genomics, imaging, pathology, and outcomes together at the patient level.

Building Prospective, Scalable Data Pipelines

The discussion opened with a contrast between retrospective analyses of existing data and deliberately designed, prospective data collection. Retrospective RWD remains indispensable for clinical research, but the panel emphasized that it is no longer sufficient on its own for answering the most pressing questions in precision oncology. At Moffitt Cancer Center, the Total Cancer Care (TCC) protocol was highlighted as proof of concept that large‑scale, standardized, prospective data collection is both feasible and scientifically productive.

TCC began roughly two decades ago as a prospective, longitudinal outcomes study and has since expanded into a multi‑institutional consortium across the southeastern and midwestern United States. Patients consent to lifetime follow‑up, release of clinical data, and use of biospecimens for molecular characterization, enabling linked clinical, pathologic, and genomic datasets that now include hundreds of thousands of participants and have supported numerous publications. Panelists framed TCC and similar Institutional Review Board (IRB)–approved protocols as operational blueprints for how oncology can systematically generate research‑grade data within routine care.

Physicians Are Not Data Entry Specialists

A recurring theme was the misalignment between current electronic health record (EHR) workflows and the realities of clinical practice. Dr. Jones noted that physicians are not data entry specialists and will never excel at detailed, structured annotations beyond what is essential to direct patient care. The burden of clicking through templates and filling research‑oriented fields competes directly with time spent counseling patients, managing complex therapies, or addressing acute issues.

Panelists warned that any model that relies on busy clinicians to perform extensive manual standardization at scale will fail, especially outside academic centers. Instead, they urged:

Limiting mandatory standardization to elements intrinsic to clinical care.
Extracting additional structure from data that are already generated (e.g., notes, reports, images, lab feeds) using informatics and AI tools.
Designing workflows where research data capture is as invisible as possible to front‑line clinicians.

Their message to data architects and regulators was that high‑quality RWD must be engineered into systems and workflows, not tacked onto the end of a clinic visit.

As the conversation shifted to tissue‑based and imaging‑based endpoints in oncology trials, the panel stressed that harmonization across tumor types and modalities has become essential for regulatory use. Here, harmonization means more than vocabulary alignment; it requires shared frameworks for how endpoints are defined, measured, and reported across diseases and therapies. Pathologists and radiologists were singled out as the source‑of‑truth experts for tissue‑based biomarkers, pathology endpoints, and radiology readouts, yet they are often engaged only after trial design is complete.

The panel advocated for earlier and deeper involvement of these specialists when:

Selecting which pathology or imaging variables truly matter.
Defining how those variables should be captured and structured.
Streamlining reporting workflows so that key fields can be reliably populated.

They warned that without that input, trial sponsors risk mining data that were never collected in a way that supports regulatory decisions, no matter how sophisticated their downstream analytics may be.

Integrating Multimodal Data at Scale

When the discussion turned to integrating genomics, imaging, and other modalities beyond the EHR, panelists highlighted the challenge of sheer data volume and heterogeneity. They noted that clinicians read radiology reports instead of re‑examining every CT scan because the underlying images are simply too large to review manually, especially across decades of care. At Mayo Clinic, de‑identified, longitudinal data from decades of patients are being organized into standardized, open‑source–compatible formats to enable cohort identification followed by drill‑down to source data.

One concrete example centered on BRAF V600E‑mutated colorectal cancer, a biologically and clinically distinct subset of colorectal cancer in which the oncogenic BRAF mutation typically occurs in right‑sided tumors. Using standardized data, a colon cancer specialist can rapidly identify the less common patients with left‑sided, BRAF V600E‑mutated tumors and then review, step by step, their colonoscopy reports and, ultimately, the original endoscopic images. This funnel approach (i.e., starting with harmonized structured data, then selectively returning to source‑level images) illustrates why a standard data model is needed as a starting point, even when the goal is deep phenotyping on raw images.
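The funnel described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PatientRecord` schema, its field names, the `funnel` helper, and the synthetic records are all hypothetical and do not represent Mayo Clinic's actual data model or tooling.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PatientRecord:
    """One harmonized, patient-level row (hypothetical schema)."""
    patient_id: str
    tumor_side: str   # "left" or "right"
    braf_status: str  # "V600E" or "wild-type"
    # Pointers back to source-level artifacts, not the artifacts themselves.
    colonoscopy_report_ids: List[str] = field(default_factory=list)
    endoscopy_image_ids: List[str] = field(default_factory=list)


def funnel(
    records: List[PatientRecord], side: str, braf_status: str
) -> Tuple[List[PatientRecord], List[str], List[str]]:
    """Narrow from structured data to reports to raw images."""
    # Step 1: identify the cohort using harmonized structured fields only.
    cohort = [
        r for r in records
        if r.tumor_side == side and r.braf_status == braf_status
    ]
    # Step 2: collect the colonoscopy reports for that cohort.
    report_ids = [rid for r in cohort for rid in r.colonoscopy_report_ids]
    # Step 3: collect the source images, now a much smaller set to review.
    image_ids = [iid for r in cohort for iid in r.endoscopy_image_ids]
    return cohort, report_ids, image_ids


# Synthetic records for illustration only; not real patient data.
RECORDS = [
    PatientRecord("P001", "right", "V600E", ["RPT-1"], ["IMG-1", "IMG-2"]),
    PatientRecord("P002", "left", "V600E", ["RPT-2", "RPT-3"], ["IMG-3"]),
    PatientRecord("P003", "left", "wild-type", ["RPT-4"], ["IMG-4"]),
]

cohort, report_ids, image_ids = funnel(RECORDS, side="left", braf_status="V600E")
print([r.patient_id for r in cohort], report_ids, image_ids)
```

The point of the sketch is the ordering: the inexpensive filter on structured fields runs first, so only the reports and images belonging to the matching patients ever need a specialist's attention.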

Open, shared standards were described as key to enabling not only intra‑institutional analyses but also collaborations across centers, increasing statistical power and diversity of patient populations.

How RWD Changes Clinical Understanding

Several panelists offered case examples where RWD reshaped clinical assumptions. One oncologist recounted early experiences with BRAF inhibitors in BRAF V600E‑mutant colorectal cancer, inspired by the striking response rates of BRAF‑targeted therapies in BRAF V600E‑mutant melanoma. While BRAF inhibition produced response rates of roughly 70–80% in melanoma, off‑label use in colorectal cancer patients at a major academic center yielded responses in fewer than 5% of cases.

Because the institution had systematically captured molecular and outcomes data, the team could rapidly assemble a cohort of BRAF V600E‑mutated colorectal cancer patients treated with BRAF inhibitors and quantify this poor response rate. Subsequent translational and clinical research revealed that feedback activation and up‑regulation of the epidermal growth factor receptor (EGFR) pathway underlies much of the resistance in colorectal tumors, leading to combination regimens that pair BRAF inhibitors with anti‑EGFR antibodies as a new standard of care in this subset. The panelists argued that if colorectal cancer had been the first setting in which BRAF inhibitors were tested, these drugs might have been abandoned prematurely, underscoring the importance of both disease context and real‑world observational data.

Accessibility, AI, and Relational Data Models

Returning to institutional infrastructure, the panel described how harmonized, institution‑wide databases linked by medical record number now allow clinicians to assemble cohorts quickly by combining genetic profiles, labs, vitals, and treatment histories through web‑based cohort builders. When these tools are accessible to clinicians, they allow practicing oncologists to answer complex questions quickly and at scale rather than waiting months for bespoke analyst support.
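A minimal sketch of what such a cohort query might look like underneath, assuming a relational layout in which every table shares a medical record number key. The table names, column names, `build_cohort` helper, and sample rows are hypothetical and use Python's built-in `sqlite3` module purely for illustration; the panel did not describe any institution's actual schema.

```python
import sqlite3

# In-memory database with synthetic rows; every table is keyed by MRN.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients   (mrn TEXT PRIMARY KEY, diagnosis TEXT);
    CREATE TABLE genomics   (mrn TEXT, gene TEXT, variant TEXT);
    CREATE TABLE treatments (mrn TEXT, drug_class TEXT);

    INSERT INTO patients VALUES
        ('MRN001', 'colorectal cancer'),
        ('MRN002', 'colorectal cancer'),
        ('MRN003', 'melanoma'),
        ('MRN004', 'colorectal cancer');
    INSERT INTO genomics VALUES
        ('MRN001', 'BRAF', 'V600E'),
        ('MRN002', 'BRAF', 'V600E'),
        ('MRN003', 'BRAF', 'V600E'),
        ('MRN004', 'KRAS', 'G12D');
    INSERT INTO treatments VALUES
        ('MRN001', 'BRAF inhibitor'),
        ('MRN002', 'chemotherapy'),
        ('MRN003', 'BRAF inhibitor'),
        ('MRN004', 'chemotherapy');
    """
)


def build_cohort(conn, diagnosis, gene, variant, drug_class):
    """Return MRNs matching a diagnosis, a variant, and a treatment class."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.mrn
        FROM patients p
        JOIN genomics   g ON g.mrn = p.mrn
        JOIN treatments t ON t.mrn = p.mrn
        WHERE p.diagnosis = ?
          AND g.gene = ? AND g.variant = ?
          AND t.drug_class = ?
        ORDER BY p.mrn
        """,
        (diagnosis, gene, variant, drug_class),
    ).fetchall()
    return [row[0] for row in rows]


print(build_cohort(conn, "colorectal cancer", "BRAF", "V600E", "BRAF inhibitor"))
```

A web-based cohort builder would presumably generate queries of this kind from a clinician's menu selections, which is why the shared key across genetic, lab, and treatment tables matters more than any single tool.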

AI was highlighted as an accelerator for this vision. AI‑assisted coding and data wrangling tools already make data scientists substantially more productive. The panelists argued that similar approaches could help extract structured elements from unstructured notes, imaging reports, and pathology narratives, further reducing manual burden.

A Vision for Regulatory-Grade Data

In closing, the moderator asked each panelist what single change, whether technical, regulatory, or cultural, would most accelerate the generation of high‑quality algorithms and regulatory‑grade data in oncology. One health‑system leader imagined erasing the divide between messy EHR data used for care and the curated datasets built later for regulatory submissions and research. The goal would be to generate research‑grade data in the flow of clinical care, without placing that work on physicians, through better technology, workflow redesign, and organizational change management.

Panelists also called for national or even global standardized data collection protocols in oncology, flexible enough to adapt as new biomarkers and endpoints emerge. They urged a stronger emphasis on implementation science for data: funding and recognizing the work of building, maintaining, and governing harmonized data systems, even if it seems less glamorous than discovering new drugs or biomarkers. They also wanted interfaces that present concise, clinically meaningful information to patients and physicians, while still feeding rich, structured datasets on the back end.

Finally, the panelists agreed that discovering powerful drugs and biomarkers is not sufficient; without robust, harmonized, and relational data infrastructure, those discoveries cannot be fully implemented or fairly evaluated in the real world.