Building better models with better data: Our journey with quality data and rapid model development

Published:

October 28, 2025

Healthcare automation depends on reliability. This tech blog is where we share how our teams design, build, and refine the systems behind our platform—from the engineering decisions that shape performance to the data quality investments that ensure accuracy and scalability across healthcare.

It’s a well-known truth in machine learning that most of the effort is spent not on modeling but on gathering, cleaning, and labeling high-quality data. This imbalance becomes especially visible with unstructured inputs like documents, forms, or faxes. Turning these inputs into structured data is the broader task of information extraction (IE), which our ML team approaches with a range of methods.

At Cohere Health, our fax intake pipeline directly highlights these challenges. Extracting structured information from scanned faxes is full of obstacles: inconsistent formats, noisy inputs, and ambiguous context. At its core, ML-assisted fax intake is an information extraction problem: pulling a wide variety of medical fields out of each document.

The complexity and diversity of clinical data pose unique challenges, and traditional extraction approaches often struggle with two recurring problems:

  • Label quality issues: Ground truth labels frequently diverge from the source text, reflecting the gap between what humans can intuitively infer and what machines can extract literally.
  • Slow experimentation cycles: With dozens of fields (date of service, diagnosis code, NPI, etc.), teams must build and validate models individually, creating bottlenecks and delaying iteration.

These limitations result in inconsistent accuracy, prolonged deployment timelines, and limited scalability, making it harder to deliver reliable automation at scale.
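To make the label quality issue concrete, here is a minimal sketch of a grounding check that flags labels whose value does not appear literally in the source text. The function names, field names, and sample fax text are hypothetical and chosen only for illustration; this is not EMLIE's actual implementation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def label_is_grounded(label: str, source_text: str) -> bool:
    """Return True if the label appears literally in the source text."""
    normalized = normalize(label)
    if not normalized:
        # An empty label cannot be traced back to the document.
        return False
    return normalized in normalize(source_text)


def ungrounded_fields(labels: dict[str, str], source_text: str) -> list[str]:
    """List the fields whose labels a literal extractor could not recover."""
    return sorted(
        field
        for field, value in labels.items()
        if not label_is_grounded(value, source_text)
    )


# Hypothetical fax text and labels, for illustration only.
fax_text = "Patient: Jane Doe\nDate of Service: 01/15/2025\nDx:  M54.5"
labels = {
    "date_of_service": "2025-01-15",  # a human reformatted the date
    "diagnosis_code": "M54.5",
    "patient_name": "jane  doe",
}

print(ungrounded_fields(labels, fax_text))
```

In this toy case only the date is flagged: the labeler wrote it in a different format than the one on the page, which is exactly the kind of gap between human inference and literal extraction described above.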

The birth of EMLIE

To overcome these barriers, we built the EMLIE (Enterprise Machine Learning Information Extraction) framework, pronounced “Emily” (or “MLE”). This end-to-end experimentation framework unifies labeling, model iteration, and evaluation into a single streamlined process, enabling cleaner labels, faster experimentation, and more reliable extraction across diverse field types. EMLIE combines three critical components:

  • Label Data Quality Processor: ensures the data feeding our models is trustworthy and representative of real-world inputs.
  • Model Iteration Engine: enables rapid experimentation across model types, from rules and boosting algorithms to transformers and LLMs.
  • Holdout Evaluation Suite: provides consistent, reliable performance measurement before deployment.

Think of EMLIE as both a laboratory, where we experiment and refine ideas quickly, and a production line, where standardized processes ensure consistency and scalability. With EMLIE in place, we can tackle our first major challenge, label quality, so that ground-truth labels accurately reflect the information we want to extract from documents.

Label data quality

At Cohere, we know that high-quality labels are the foundation of every model we deploy. Missing, ambiguous, or inconsistent labels create ripple effects beyond model accuracy, affecting analytics, iteration speed, and scalability.

We ensure label data quality through a robust process, combining clear guidelines with automated validation checks. Additionally, expert oversight helps resolve edge cases and maintain consistency. This results in high-quality, model-ready labels that drive a continuous cycle of improvement.
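As an illustration of what automated validation checks might look like, the sketch below validates three of the fields mentioned earlier. The NPI check uses the standard Luhn check-digit rule with the 80840 prefix. The diagnosis code pattern is a deliberately simplified ICD-10-style shape check, and the MM/DD/YYYY date format, the field names, and the `CHECKS` registry are assumptions made for this example rather than a description of our production rules.

```python
import re
from datetime import datetime


def is_valid_npi(value: str) -> bool:
    """Check a 10-digit NPI using the Luhn algorithm with the 80840 prefix."""
    if not re.fullmatch(r"\d{10}", value):
        return False
    digits = [int(c) for c in "80840" + value]
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_icd10(value: str) -> bool:
    """Simplified shape check only; it does not confirm the code exists."""
    return re.fullmatch(r"[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?", value) is not None


def is_valid_date(value: str) -> bool:
    """Assumes labels store dates as MM/DD/YYYY."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
    except ValueError:
        return False
    return True


# Hypothetical registry mapping field names to their validation check.
CHECKS = {
    "npi": is_valid_npi,
    "diagnosis_code": looks_like_icd10,
    "date_of_service": is_valid_date,
}


def failed_checks(labels: dict[str, str]) -> list[str]:
    """Return the fields whose labels fail their validation check."""
    return sorted(
        field
        for field, check in CHECKS.items()
        if field in labels and not check(labels[field])
    )


print(failed_checks({
    "npi": "1234567893",
    "diagnosis_code": "M54.5",
    "date_of_service": "02/30/2025",  # not a real calendar date
}))
```

Labels that fail a check like this can be routed to expert review rather than silently entering a training set.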

Building reliable models faster with EMLIE

As the number of fields we extract grows, so does the need for a faster and more reliable development strategy. EMLIE addresses this through a standardized model development pipeline for rapid, iterative experimentation across diverse modeling approaches.

At its core, EMLIE connects data ingestion and preprocessing, experiment orchestration, evaluation, and error analysis in a seamless loop. Together, these components act as interconnected nodes in a pipeline. This approach allows teams to iterate quickly, maintain reliability, and scale model development without compromising quality.
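The loop described above can be sketched in a few lines. The example below is a toy illustration, not EMLIE's actual interface: it treats every candidate model as a plain function from text to a predicted value, so that a rule, a boosted model, or an LLM call could in principle sit behind the same signature. All names, both extractors, and the sample data are hypothetical.

```python
import re
from typing import Callable, Optional

# Any candidate model is modeled as a function from document text to a value.
Extractor = Callable[[str], Optional[str]]
Example = tuple[str, str]  # (raw document text, expected field value)


def preprocess(text: str) -> str:
    """Preprocessing node: collapse whitespace left over from OCR."""
    return re.sub(r"\s+", " ", text).strip()


def regex_npi_extractor(text: str) -> Optional[str]:
    """Rule-based candidate: a 10-digit number right after an 'NPI' marker."""
    match = re.search(r"NPI[:#\s]*(\d{10})", text, re.IGNORECASE)
    return match.group(1) if match else None


def first_ten_digit_extractor(text: str) -> Optional[str]:
    """Naive baseline candidate: the first standalone 10-digit number."""
    match = re.search(r"\b\d{10}\b", text)
    return match.group(0) if match else None


def evaluate(extractor: Extractor, examples: list[Example]) -> float:
    """Evaluation node: exact-match accuracy over labeled examples."""
    correct = sum(
        extractor(preprocess(text)) == expected for text, expected in examples
    )
    return correct / len(examples)


def run_experiment(
    candidates: dict[str, Extractor], examples: list[Example]
) -> dict[str, float]:
    """Orchestration node: score every candidate on the same examples."""
    return {name: evaluate(fn, examples) for name, fn in candidates.items()}


def error_analysis(extractor: Extractor, examples: list[Example]):
    """Error analysis node: collect (text, expected, predicted) for misses."""
    misses = []
    for text, expected in examples:
        predicted = extractor(preprocess(text))
        if predicted != expected:
            misses.append((text, expected, predicted))
    return misses


examples = [
    ("Provider NPI: 1234567893 Phone 5551234567", "1234567893"),
    ("Phone 5551234567\nNPI# 1987654321", "1987654321"),
    ("Rendering provider 1122334455", "1122334455"),
    ("Fax 5559876543 NPI 1029384756", "1029384756"),
]
candidates = {
    "regex_npi": regex_npi_extractor,
    "first_ten_digits": first_ten_digit_extractor,
}

results = run_experiment(candidates, examples)
misses = error_analysis(regex_npi_extractor, examples)
print(results)
print(misses)
```

Because every candidate shares one signature and one evaluation path, swapping in a new modeling approach only means registering another function, which is the property that keeps iteration fast as the number of fields grows.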

Holdout evaluation generator

High-quality evaluation is just as critical as high-quality training data. To ensure our models are tested on the most representative data, EMLIE includes an automated Holdout Evaluation Generator.

New holdout sets are generated automatically on a regular schedule, drawing from the latest labeled faxes in our pipeline. By refreshing these holdouts regularly, we can detect drift early and maintain confidence that a model’s performance in EMLIE will translate into performance in production.
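A minimal sketch of this idea follows, under stated assumptions: the holdout takes the most recently received labeled faxes that were not used for training, and drift is flagged when accuracy on the fresh holdout drops by more than a fixed tolerance. The `LabeledFax` type, the function names, and the 0.02 tolerance are hypothetical choices for illustration, not the generator's real logic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LabeledFax:
    fax_id: str
    received: date


def build_holdout(
    faxes: list[LabeledFax], training_ids: set[str], size: int
) -> list[LabeledFax]:
    """Pick the newest labeled faxes that the model never saw in training."""
    candidates = [fax for fax in faxes if fax.fax_id not in training_ids]
    candidates.sort(key=lambda fax: fax.received, reverse=True)
    return candidates[:size]


def drift_detected(
    previous_accuracy: float, current_accuracy: float, tolerance: float = 0.02
) -> bool:
    """Flag drift when accuracy on the fresh holdout drops beyond tolerance."""
    return previous_accuracy - current_accuracy > tolerance


# Hypothetical labeled faxes, for illustration only.
faxes = [
    LabeledFax("f1", date(2025, 9, 1)),
    LabeledFax("f2", date(2025, 9, 10)),
    LabeledFax("f3", date(2025, 9, 20)),
    LabeledFax("f4", date(2025, 10, 1)),
    LabeledFax("f5", date(2025, 10, 15)),
]

holdout = build_holdout(faxes, training_ids={"f4"}, size=2)
print([fax.fax_id for fax in holdout])
print(drift_detected(previous_accuracy=0.95, current_accuracy=0.90))
```

Excluding training documents keeps the holdout honest, and sorting by recency keeps it representative of what the pipeline is seeing now rather than what it saw when the model was first built.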

How this strengthens our products

Our fax intake automation products depend on reliable and consistent field extraction. By embedding label data quality, model iteration, and holdout evaluation inside EMLIE, we’ve created a foundation that directly improves our products:

  • Faster model development: Standardized pipelines and automated workflows cut prototype time by ~50%. 
  • Higher accuracy: Models trained on consistently validated data and evaluated on fresh holdout sets have delivered a 5-8% lift in accuracy.
  • Cross-payer scalability: New payer? New field? EMLIE enables rapid spin-up of experiments.
  • Product credibility: Accuracy, consistency, and repeatability have translated into stronger trust with customers and partners.

In short, EMLIE and our investments in data quality directly enhance our products and make automation more reliable, scalable, and impactful.

Closing thoughts

While the current design addresses the immediate need and establishes a strong foundation, we continue to pursue enhancements, such as integrating EMLIE with continuous learning.

Real-world machine learning is not glamorous. The hard truth is that models live or die by the quality of their training data. At Cohere Health, we embraced this reality and turned it into a strength: by investing in label data quality initiatives and building EMLIE, we’ve created a virtuous cycle of better data → better models → better automation.

EMLIE allows us to scale quickly, experiment broadly, and deploy confidently. Most importantly, it builds credibility into our products for providers and payers. In healthcare, accuracy is not optional; it’s essential.

Visit our Platform page to learn more about how we use clinical intelligence in utilization management, payment integrity, and more.

Written by

Cohere Health

Cohere Health’s clinical intelligence platform delivers AI-powered solutions that streamline access to quality care by improving collaboration between physicians and health plans. Cohere works with 660,000 providers and processes millions of prior authorization requests annually. Its AI auto-approves up to 90% of requests for millions of health plan members. Cohere has been recognized in the Gartner® Hype Cycle™ for U.S. Healthcare Payers in 2024 and 2025, named a Top 5 LinkedIn™ Startup in 2023 and 2024, and is a three-time KLAS Points of Light award recipient.
