Why Big Medical Data Fails ML/AI

May 14, 2025 By Scott

Question: How do you solve a problem or fix something when you’re not even sure exactly what the problem is, only that you have a vague sense that there is one? This is something AI can supposedly help with. By dumping in piles of data, you can find insights and patterns. Sounds great. Until you run into the problem of missing critically meaningful data. For all the stunning benefits Artificial Intelligence (AI) and Machine Learning (ML) tools offer healthcare, we’re collectively missing something. And that something is comprehensive outcome data.

What is outcome data? Quite simply, it’s asking: what happened? Not just after a singular healthcare encounter, but longer term. And what about at a population level? Think about your own experiences in the healthcare system. Ideally you’re being reasonably proactive and preventative with good behavior and checkups. (Because, of course, we all eat perfectly and work out just as we should.) Your other experiences were because something went wrong. You got sick or hurt. What happened? You went in (physically or virtually), you got diagnosed (ideally properly), and left with some treatment and perhaps a prescription. Then what? Often nothing. In some cases, you will have follow-ups. And your Electronic Health Record (EHR) will be updated accordingly. But much of the time? Not much happens. Maybe that’s fine for you. Who wants to be bothered with a checkup for nothing, or yet another survey? Still, is there just an assumption that you got better? If so, was it quick, or did it come after a months-long struggle? Maybe you went to another doctor. Maybe you died! (Well, not YOU obviously.)

Why Does This Matter?

Machine learning tools may be used for all manner of purposes. Perhaps the most effective are point solutions that work on particular issues, rather than larger-scale questions.

Some of the main benefits of ML/AI, though, are insights from Big Data that would not emerge via other means. We all know correlation does not necessarily mean causation. However, finding even seemingly plausible dependencies can lead to two obvious questions. The first is simply, “Is this really true?” And the second, more important question, is “Why is this the case?” It’s been said that some of the greatest breakthroughs in science came from humble beginnings. That is, the breakthrough didn’t happen with a researcher’s study having great success and a scientist shouting, “Eureka!” It’s been more subtle. It’s been a test result that caused someone to remark, “Well, that’s interesting.” Or ask, “Hmmm… why’s it doing that?” Big data analysis may not always find root-cause answers. Part of the value is simply surfacing insights that lead to really interesting questions.

The core issue for healthcare often isn’t that we don’t have enough data. It’s that we don’t have the right kind. Even though AI tools are increasingly good at dealing with unlabeled data, machine learning thrives on clear patterns, well-labeled outcomes, and representative distributions. But healthcare data is riddled with bias, fragmentation, and missing context. EHRs are full of diagnosis and billing codes but can be sparse on actual clinical nuance. Devices collect signals but not outcomes. And most critically, patient journeys often end in the dataset just when the most important questions begin: Did the treatment work? Did the patient recover? For how long? And though many systems are optimized for cost management and billing, even claims data has major disadvantages here.

Without longitudinal, outcome-linked “ground truth” data, ML/AI can’t fulfill its promise. It can optimize billing, suggest likely diagnoses, or flag anomalies. But to transform care? It’s still guessing in the dark.

So let’s sum this up. In patient records, we have ICD codes (International Classification of Diseases), CPT codes (Current Procedural Terminology), HCPCS codes (Healthcare Common Procedure Coding System), vitals, labs, prescriptions, encounter notes (often unstructured), and more.

What we don’t typically have are patient-reported outcomes, including actual functional status (e.g., “I can walk again”). We don’t have quality-of-life measures, long-term follow-ups beyond the immediate care window, or recovery trajectories (full, partial, or relapsed, whatever those might mean). Maybe this is understandable. EHRs were built mainly for billing and documentation, not clinical research. Longitudinal tracking across multiple providers is rare due to system fragmentation. There is no widespread standard for capturing patient-reported outcome measures (PROMs). And finally, providers lack incentives to track or input outcomes unless required by regulation or payment models.
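To make the gap concrete, here’s a sketch of a hypothetical, heavily simplified encounter record. The field names and codes are illustrative, not from any real EHR schema; the point is the shape: the event side is dense, the outcome side is empty.

```python
# A hypothetical, simplified encounter record. Field names and codes
# are illustrative, not drawn from any real EHR schema.
encounter = {
    # Routinely captured: events and billables.
    "patient_id": "P-001",
    "encounter_date": "2025-03-02",
    "icd10_codes": ["S93.401A"],        # e.g., an ankle sprain, initial encounter
    "cpt_codes": ["99203", "73610"],    # e.g., office visit, ankle X-ray
    "vitals": {"bp": "124/78", "hr": 72},
    "rx": ["naproxen 500mg BID"],
    "note": "Pt advised rest, ice, follow up PRN.",

    # Almost never captured: outcomes.
    "functional_status_90d": None,      # e.g., "walking without pain"
    "patient_reported_outcome": None,   # no PROM instrument administered
    "recovery_trajectory": None,        # full / partial / relapsed
    "quality_of_life_delta": None,
}
```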

The bottom line: Most EHRs capture events, not results. That’s a core reason why AI in healthcare, while perhaps amazing in many respects, will likely underdeliver for the time being in terms of broad-scope, population-health-level insights.

Go Back to the Basics: Why AI At All?

AI tools in general, and Large Language Models (LLMs) in particular, have all manner of use cases. Core among them is synthesizing massive amounts of data to possibly offer insights. (Rather, they can synthesize summaries. And of course, non-LLM ML tools will continue to deliver their traditional value, from predictive modeling to classification, clustering, and so on.) As of this writing, some of the most interesting research (my opinion) concerns things like more advanced reasoning, and trying to get more visibility into why an LLM might be reaching its conclusions.

Really though, one main purpose is to try to make sense of the vast corpora of data we continue to amass in a world where structured data has… “lost.” Once upon a time, research papers were painstakingly categorized and tagged, with abstracts professionally written. This was done largely for findability. In other words, primarily for indexed search. Along our evolution, at least three things have happened: 1) There’s become so very much that even if such document metadata were perfectly crafted, the results of even a meticulously done expert query are often still overwhelming. 2) Structured data takes so many different forms (spreadsheets to relational databases to graph databases and so on) that it’s effectively incomparable with just about any type of traditional search. 3) Our overflowing storage media are chock full of information that is simply unstructured or poorly tagged in any case. Of course, medical data is often highly structured. Not always, but usually. Things are measured by all manner of instruments and reported upon accordingly. In fact, there are standards for this. (See LOINC, Logical Observation Identifiers Names and Codes, a standard for identifying health measurements, observations, and such.) Though even this has challenges, given that new tools (sometimes not widely deployed) might have novel measures we don’t fully understand yet. (E.g., UV imaging for skin health, AI-powered digital stethoscopes capturing waveforms not yet fully categorized, various wearables like breath analyzers, retinal scanners, portable EEG headbands, and more.)
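For instance, a LOINC-coded observation, reduced to essentials, might look like this sketch. The field names are illustrative; 8867-4 is LOINC’s identifier for heart rate.

```python
# A minimal, illustrative observation keyed to a LOINC code.
# 8867-4 is the LOINC identifier for heart rate.
observation = {
    "subject": "P-001",
    "code": {
        "system": "http://loinc.org",
        "code": "8867-4",
        "display": "Heart rate",
    },
    "value": 72,
    "unit": "beats/min",
    "effective": "2025-03-02T10:15:00Z",
}
```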

So what have we done so far? We haven’t gone back to every person or machine that’s spewing data and said: “You really should study this massive taxonomy we’ve built and get your data in order. There are tagging standards, you know. Show a little classification in your work. Don’t be a data slob.” No. That’s not even close to viable, for all manner of what should be obvious reasons. Our solution so far has been to try to build something that can reason more like a human brain (neural networks), only give it piles more memory and orders of magnitude more storage. (Such as the Foresight AI model in the UK being trained with data from 57 million people!) And yet, for all this, we’re still missing things.

Structural & Economic Barriers to Outcome Data

Why can’t we get outcome data? At least partly because there’s no real incentive for anyone to gather it. And doing so would be challenging at scale, and likely impossible to do comprehensively.

For some tasks, we can perhaps invent synthetic data or do feature engineering. While some can argue this is artificial and might abstract away the real world to the point of any insights being fanciful, we can see on the face of it why it is sometimes necessary. For an example from vehicle research, we can gather some crash test information from real cars and instrumented crash test dummies. However, to get more at scale, it’s more practical to generate various crash scenarios and feed that information into crash models. In this example, we have two things: 1) A sensible path to get what should be reasonable data. Perhaps not as wholly realistic as real-world data, but a step-change better than nothing. 2) Incentives to make these products better, both financial and perhaps to avoid regulatory issues. But can we engineer any healthcare data for outcomes like this? There may be proxies we can use and assumptions we can make, but any such attempts carry obvious accuracy risks.
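As a toy illustration of that approach (not any real crash model), synthetic scenario generation can be as simple as sampling parameters and pushing them through a model:

```python
import numpy as np

rng = np.random.default_rng(42)

def toy_injury_risk(speed_kph, angle_deg, mass_kg):
    # A made-up stand-in for a real crash model: risk rises with speed,
    # frontal impacts, and (slightly) lighter vehicles. Purely illustrative.
    frontal = np.cos(np.radians(angle_deg)) ** 2
    score = 0.08 * speed_kph * frontal - 0.0005 * (mass_kg - 1500) - 4
    return 1 / (1 + np.exp(-score))

# Generate 10,000 synthetic crash scenarios by sampling parameters.
n = 10_000
speeds = rng.uniform(20, 120, n)     # impact speed, km/h
angles = rng.uniform(0, 90, n)       # degrees off head-on
masses = rng.normal(1500, 300, n)    # vehicle mass, kg

risks = toy_injury_risk(speeds, angles, masses)
print(f"Mean synthetic injury risk: {risks.mean():.3f}")
```

The healthcare analogue would be inventing proxy outcome labels the same way, which is exactly where the accuracy risks come in.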

Again, for gathering outcome data, we generally have few real incentives at the individual, practice, or facility level, even if collectively we might. Clearly, at a population level there would be value, and potentially massive cost savings, if we could find ways to interdict emergent problems earlier. But there’s no clear economic incentive for any single commercial entity to embark on such a project for patient well-being.

What Data Do We Have?

General population health data. General data from EHRs. Public studies done for point solutions from drug trials or general studies. Yes, there are plenty of data-sharing platforms: PatientPing (now Pings from Bamboo Health) and such for hospital readmissions and other events, and coordination platforms like Cosmos from Epic, Health Gorilla, PointClickCare, and more. But actual outcome data? Not specifically. From another vector altogether is data from Emergency Medical Services (EMS) systems, which can be wholly de-coupled from these other sources. Add to this the increasing use of Mobile Integrated Health, and that may be another source. (Mobile health is deeper at-home care that allows for interdiction in what otherwise might be an Emergency Room visit. With varying degrees of increased capability, medics may visit patients at their residence and treat in place. This is potentially much better for patients, healthcare systems, and costs. I can inject here personally, as a long-term community EMS volunteer, that a great number of patients likely do not need to go to the ER. But until recently, with these systems, there was no legal or protocol clarity on how to offer more definitive treatment without transport to a healthcare facility.)

It’s not as if there’s been no work here. Some health systems or insurers are large enough that their data corpus alone may support solid insights. Kaiser Permanente (KP) is a leading example of an integrated healthcare system that combines health coverage and care delivery into a coordinated experience. This model allows for seamless data collection and utilization across various care settings. See how they think about integrated care. Look at the Surveillance, Epidemiology, and End Results (SEER) Program, managed by the National Cancer Institute: a premier source of population-based information on cancer incidence and survival in the United States. SEER’s extensive datasets have been instrumental in cancer research, enabling studies on cancer trends, survival rates, and disparities among different populations. We have the NIH’s “All of Us” Research Program. And then there’s the Patient-Reported Outcomes Measurement Information System (PROMIS).

What you can see is that not only is outcome data either not collected at all, or sporadic and captured by happenstance in an EHR narrative, but there’s not even a standard taxonomy. While we have ICD and CPT, and LOINC for data (not to mention SNOMED for general healthcare terminology), what about outcomes? There’s ICHOM (International Consortium for Health Outcomes Measurement) and the aforementioned PROMIS, and there are also CMS Measures for hospital readmissions and merit-based incentive payments, but no real standard. (Though SNOMED does include some concepts for outcomes.) There are the Patient-Reported Indicator Surveys (PaRIS) that the OECD is working on, but again, yet another disparate solution.

What do all these have in common? The answer is, “Not much.” Yes, all have laudable goals. And there are certainly overlaps in methodologies; that would stand to reason. But are all of their datasets normalized? Are they focused broadly, or scoped to particular care categories? It might arguably be good to have a variety of decentralized approaches, in order to seek out what might be best practices in discovery. The bottom line is that I’ll stand with my basic assertion that most healthcare data, originally focused more on the needs of insurance and billing than anything else, remains a stunningly messy environment. (See the NIH paper, Administrative Database Studies: Goldmine or Goose Chase?)

By the way, if you look at the last few paragraphs and reconsider the question, “What do all these have in common?”, an answer from one particular perspective might be, “The things clearly related to revenue are fairly well defined.”

It’s Not Just About EHR Data

Electronic Health Records (EHRs) are often cited as the holy grail of medical data, but they represent only a fraction of what’s needed for truly comprehensive outcome analysis. EHRs typically capture clinical encounters and interventions but miss critical contextual information that influences health outcomes.

What’s missing? Social determinants of health (SDOH) data, including housing stability, food security, transportation access, and social support networks. (Though this may exist in various forms.) Longitudinal behavioral data tracking patients’ adherence to treatment plans, lifestyle modifications, and preventative measures. Environmental factors like air quality, water safety, and neighborhood conditions that significantly impact chronic conditions.

Moreover, EHRs still often exist as siloed systems across different healthcare providers with limited interoperability, even within a particular healthcare system. A patient might receive care from multiple specialists, each with their own record system, creating a fragmented picture of their health journey. Without standardized data formats and robust integration, even the most sophisticated ML algorithms struggle to generate meaningful insights. We have a variety of technical interoperability standards like HL7, FHIR, and more. But they’re implemented at varying levels across the industry, though FHIR seems to be becoming more common, if only because it’s increasingly mandated for patient access and third-party integrations. Regardless of the existence of some standards, there is an entire cottage industry of providers, practitioners, and job openings focused on these issues, which suggests there’s still a lot of work to be done in these areas. Also, let’s again recall that these standards are primarily about formats and exchange structures, not the actual content of the data. There are minimal content standards, and the few that exist are really more about interoperability than clinical care.
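To make the exchange side concrete, here’s a minimal sketch of a standard FHIR search for a patient’s heart-rate observations. The base URL and patient ID are hypothetical placeholders; the query parameters are standard FHIR R4 search syntax. Notice it retrieves measurements, that is, events, not outcomes.

```python
import requests

# Hypothetical FHIR server base URL and patient ID.
BASE = "https://fhir.example.org/baseR4"

resp = requests.get(
    f"{BASE}/Observation",
    params={
        "patient": "Patient/123",
        "code": "http://loinc.org|8867-4",  # LOINC: heart rate
        "_sort": "-date",
        "_count": 10,
    },
    headers={"Accept": "application/fhir+json"},
    timeout=10,
)
resp.raise_for_status()

# The response is a FHIR Bundle of matching Observation resources.
bundle = resp.json()
for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs.get("effectiveDateTime"),
          obs.get("valueQuantity", {}).get("value"))
```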

EHRs were supposed to support a Common Clinical Data Set (CCDS), which evolved into the USCDI (U.S. Core Data for Interoperability). The transition from CCDS to USCDI officially took place with the release of the Office of the National Coordinator for Health Information Technology (ONC) Cures Act Final Rule on March 9, 2020, and became mandatory for certified health IT systems starting in 2022. The USCDI includes basic elements like demographics, diagnoses, medications, and similar. This is another great step toward structured, reportable data, and will help with the ongoing interoperability challenge, but it still isn’t concerned with outcomes or long-term tracking.

As alluded to, EHR data often prioritizes billable events rather than patient-centered outcomes. This isn’t to say that clinical data isn’t important or isn’t captured. It’s just that the core components of diagnostics and treatments are the key required elements, so everything else somewhat drives or hangs off of these. The data captures what was done to the patient (procedures, medications prescribed) but doesn’t necessarily record how effectively these interventions improved quality of life, functional ability, or patient satisfaction. Sometimes this may exist, and a discharge note or follow-up may indicate something or other. And yet, it’s the point at which data capture just goes silent, for the legitimately obvious reason that the patient may have just departed the system.

How Can Value-Based Care Thrive?

Without comprehensive outcome data, value-based care models remain fundamentally handicapped. That may seem a bold statement, since this is just my admittedly challenging-to-prove opinion anyway. But I think it’s fair to argue the case from the perspective of logic alone. Recall that value-based care models are healthcare payment systems that reward providers based on patient health outcomes rather than the volume of services delivered. They aim to improve quality, reduce costs, and incentivize preventive care and long-term recovery instead of fee-for-service treatment. Value-based care models began gaining traction in the early 2010s, following the passage of the Affordable Care Act (ACA) in 2010, which introduced key programs like Accountable Care Organizations (ACOs) and bundled payment initiatives. They became more widely adopted around 2015–2016, when CMS (Centers for Medicare & Medicaid Services) began tying a larger share of Medicare payments to value-based models through programs like the Hospital Readmissions Reduction Program, the Merit-based Incentive Payment System (MIPS), and the Value-Based Purchasing Program.

At the very least, it seems challenging to prove value-based care is working without broad-ranging outcome data, though several of the above programs do collect various types of quality data. Yes, of course, one could commission studies with statistical validity. And yet, isn’t this still somewhat unsatisfying, given that even though we are awash in data, gathering common data on simple endpoints is just not endemically part of existing collection and datasets?

Value-based care models aim to reward healthcare providers for improving patient outcomes rather than simply performing procedures, but how can we measure “value” when our outcome data is incomplete or widely disparate across programs? What seems to be occurring now is primarily the use of proxy metrics like hospital readmissions, emergency room utilization, rates of screenings, medication adherence (via refill data), and things like follow-up appointments. And yet, do any of these actually tell us how well a patient may be doing? Patient satisfaction surveys like HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems) may be used, but once again… you might get perceived quality without actual clinical outcomes.
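To show how crude these proxies are, here’s a minimal pandas sketch (with toy data) of the classic 30-day readmission flag, computed from nothing more than admission and discharge dates. Note that it says nothing about whether the patient actually got better:

```python
import pandas as pd

# Toy admissions data; real claims extracts have the same basic shape.
adm = pd.DataFrame({
    "patient_id": ["A", "A", "A", "B"],
    "admit":     pd.to_datetime(["2025-01-05", "2025-02-01", "2025-06-10", "2025-03-03"]),
    "discharge": pd.to_datetime(["2025-01-09", "2025-02-06", "2025-06-12", "2025-03-04"]),
}).sort_values(["patient_id", "admit"])

# Days from each discharge to that patient's next admission.
adm["next_admit"] = adm.groupby("patient_id")["admit"].shift(-1)
adm["days_to_next"] = (adm["next_admit"] - adm["discharge"]).dt.days
adm["readmit_30d"] = adm["days_to_next"].le(30)

print(adm[["patient_id", "discharge", "days_to_next", "readmit_30d"]])
```

A patient can sail past this metric and still be unable to climb their own stairs.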

For value-based care to truly succeed, we need holistic outcome metrics that span beyond traditional clinical indicators. This includes patient-reported outcome measures (PROMs) that capture subjective improvement in symptoms, functional status, and quality of life. It requires long-term tracking of actual outcomes beyond the 30-90 day windows typically measured in current systems. (The 30-90 day window is my personal guesstimate, not something I can wholly prove. It’s based on looking at several standards, such as the CMS readmission program HRRP focusing on 30-day readmissions, MIPS measures using reporting periods of 30-90 days, and others that are similar.) As a practical matter, this might not always be possible. But minimally, on unrelated follow-up visits to physicians, a care practitioner could be prompted to check on a prior issue, as in the sketch below.
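That prompting could start as something as simple as the following rule. The thresholds and field names are hypothetical; the idea is just to surface prior issues with no recorded outcome:

```python
from datetime import date

def issues_to_check(prior_issues, today, min_days=30, max_days=365):
    """Return prior issues worth a quick outcome question at an
    otherwise unrelated visit. Thresholds are purely illustrative."""
    due = []
    for issue in prior_issues:
        age = (today - issue["onset"]).days
        if issue["outcome"] is None and min_days <= age <= max_days:
            due.append(issue["description"])
    return due

prior = [
    {"description": "ankle sprain", "onset": date(2025, 3, 2), "outcome": None},
    {"description": "sinusitis", "onset": date(2024, 1, 10), "outcome": "resolved"},
]
print(issues_to_check(prior, date(2025, 5, 14)))  # -> ['ankle sprain']
```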

Healthcare systems attempting to implement value-based care often find themselves making decisions based on proxy metrics or financial outcomes rather than true health outcomes. Without robust data connecting interventions to long-term results, organizations struggle to identify which practices genuinely drive value versus those that merely reduce short-term costs.

The transition to value-based care also requires (or I should say at least suggests) predictive analytics to identify high-risk patients and personalize interventions. Yet current ML models trained on incomplete data sets may perpetuate existing healthcare disparities or miss critical opportunities for early intervention. What appears “valuable” in our current limited data frameworks may not represent true value from a patient perspective or population health standpoint.

Furthermore, value-based care depends on effective risk adjustment mechanisms, which themselves require comprehensive outcome data to accurately predict resource utilization and set fair benchmarks. Without this foundation, value-based payment models risk creating perverse incentives that reward avoiding complex patients rather than improving care quality. (These, among others, are not new concerns, of course. See: MedPAC March 2022 Report to the Congress, Current Challenges in Risk Adjustment, Challenges of Risk Adjustment in Value-Based Care.)

Policy, Technology, and Economic Solutions

This is one of those items for which there is likely little alternative other than government incentives, either reward- or punishment-based. But government intervention alone won’t solve the problem. We need a multi-faceted approach:

First, we need standardized outcome measures that matter to patients, clinicians, and payers alike. Organizations like ICHOM (International Consortium for Health Outcomes Measurement) have begun this work, but widespread adoption requires regulatory support and financial incentives to implement. And yet, ICHOM was only founded in 2012. It might seem like more than a decade is a long time, but not for corralling this level of complexity. And for all their great work, their outcome measures are still constrained to particular taxonomic sets.

Second, we must build infrastructure for longitudinal patient tracking that follows individuals across care settings and throughout their lifetimes. This requires not just technical solutions for data interoperability but also privacy frameworks that balance data utility with patient confidentiality. Though you could certainly accomplish such goals without it, something like blockchain technology seems, at least theoretically, well suited to this kind of thing.
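One common building block here, blockchain or not, is a privacy-preserving linkage token: a keyed hash of normalized identifiers lets two systems match records for the same person without exchanging the identifiers themselves. Here’s a toy sketch, assuming a shared secret key; real deployments use vetted tokenization services and careful key management:

```python
import hmac
import hashlib

SHARED_KEY = b"replace-with-a-managed-secret"  # hypothetical shared secret

def linkage_token(first, last, dob):
    """Keyed hash of normalized identifiers: the same person yields the
    same token, but the token alone reveals nothing about the inputs."""
    normalized = f"{first.strip().lower()}|{last.strip().lower()}|{dob}"
    return hmac.new(SHARED_KEY, normalized.encode(), hashlib.sha256).hexdigest()

# Two systems compute the same token independently and can link records.
print(linkage_token("Ada ", "Lovelace", "1815-12-10"))
print(linkage_token("ada", "LOVELACE ", "1815-12-10"))  # identical token
```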

Third, healthcare systems should partner with technology companies to develop ML models specifically designed to work with sparse, heterogeneous medical data. Rather than waiting for perfect datasets, we need algorithms that can learn incrementally and transparently communicate their confidence levels when making predictions.
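“Transparently communicate confidence” can start as simply as a probability estimate plus a refusal threshold, as in this sketch (toy data; the 0.8 threshold is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for sparse, noisy clinical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)

def predict_or_defer(x, threshold=0.8):
    """Return a label only when the model is confident enough;
    otherwise defer to a human. Threshold is illustrative."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= threshold:
        return int(proba.argmax()), proba.max()
    return "defer", proba.max()

print(predict_or_defer(X[0]))
```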

Fourth, we must create sustainable economic models for outcome data collection. This might be the hardest part. It is essentially (again, just my opinion) why such data does not exist now. There’s just no real economic incentive for any private organization to try to get at it. Or even if there is, it’s such a large and expensive problem that there are major barriers to anyone taking it on. Then there is, unfortunately, a cynical view as well: there may be a subtle disincentive to do better in some ways. After all, as an industry, the healthcare system is more successful in providing treatments than cures. Changing this might require shared data cooperatives, where multiple stakeholders contribute to and benefit from pooled outcome data, or public-private partnerships that fund large-scale outcome registries for common conditions.

Finally, patients themselves must be empowered as data stewards. With appropriate digital tools and education, patients can contribute valuable outcome information directly, supplementing clinical data with real-world experiences that capture what matters most: whether healthcare interventions actually improve their lives. Here again, it is likely fair to adopt a cynical view. While healthtech devices and apps have proliferated greatly over the past couple of decades, they are primarily in use by more socioeconomically successful cohorts, and even then, unevenly and inconsistently. The “digital divide” literature consistently shows that access to and effective use of digital health technologies follows socioeconomic gradients. A 2019 “Watched by Apple” study in the New England Journal of Medicine found that wearable users were more likely to be younger, and to have higher incomes and education levels, than the general population. Research indicates that digital health interventions, such as apps and wearables aimed at increasing physical activity, tend to benefit individuals from higher socioeconomic backgrounds more than those from lower socioeconomic statuses. And access and utilization in the U.S. show gaps as well.

See Also:
  • Patterns of digital health access and use among US adults: a latent class analysis
  • People With Money Benefit Most From Health Apps, Wearables
  • Sociodemographic determinants of digital health literacy: A systematic review and meta-analysis

Wrapping Up

The future of AI in healthcare depends not just on algorithmic innovations but on our collective will to build comprehensive outcome data systems. Without this foundation, even the most sophisticated ML tools will continue to build impressive castles on shifting sands. Even the best tools can only go so far when working on limited raw materials.

If we want AI to change healthcare, we have to start with a more complete picture of what happens to the people we treat. That means finally putting outcomes first.

Filed Under: Analytics, Product Management, Tech / Business / General
