August 8, 2019

MIT Model Aims to Break Through AI 'Overfitting'

Better automating machine learning to tackle tough-to-find diseases.

The voicing patterns of patients with vocal cord nodules are shown in spectrograms. A new MIT model attempts to cut through the data noise and eliminate a problem common in machine learning: the phenomenon of “overfitting.” Image/Thumb have been modified. Courtesy of MIT.

Artificial intelligence has virtually unlimited upside in tackling human health problems, almost all experts agree. But the machine needs guidance. The sheer scope of data can make manual training almost impossibly work-intensive. At the same time, a machine left to learn by itself will eventually just memorize its sample set instead of picking out relevant points, leading to “overfitting” and inaccurate results.

A model using data manipulation to pick out vocal cord disorders in a limited set of subjects may hold the key to eliminating the “overfitting” bugaboo, according to a new paper by data scientists from the Massachusetts Institute of Technology, Harvard, and the University of Toronto.

Getting the machine to better learn what it’s looking for could be useful in a wide range of applications where subjects are few but the data mound is huge, according to the work, which is scheduled for presentation at the Machine Learning for Healthcare conference later this week.

“If you have few subjects and lots of data, there’s a failure model that’s just memorizing who’s who,” said Jose Javier Gonzalez Ortiz, the lead author, a Ph.D. student at the MIT Computer Science and Artificial Intelligence Laboratory, in a phone interview with Inside Digital Health™ this week. “What we saw is, if you just split this process in two, you have better odds… By learning the features first, without knowing who is who, and then performing the classification, you have a better chance of not failing that way.”

The paper outlines the model employed by Gonzalez Ortiz and the rest of the team. The study involved a group of 104 subjects, half of them diagnosed with vocal cord nodules (a growth somewhat like a callus in the throat). Each subject was fitted with an accelerometer, a node affixed to the neck, which recorded entire days’ worth of data every time they spoke.

That data trove was huge — billions of time samples.

So the machine was tasked with picking out which of the patients had vocal cord nodules — but most importantly, picking out the features that identified them as such.

To automate the normally manual aspects of “feature engineering,” that is, picking the most pertinent decisive criteria, Gonzalez Ortiz and his colleagues used a two-step data analysis to better discriminate among the data.

The voicing segments were converted into spectrograms, which are visual representations of the frequencies in speech over time. These, in turn, yielded huge, complex matrices.
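To make the idea of a spectrogram as a matrix concrete, here is a minimal sketch using a short-time Fourier transform in NumPy. The sample rate, frame length, hop size, and the 200 Hz test tone are illustrative assumptions and are not taken from the paper; the function name `spectrogram` is likewise hypothetical.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One FFT per frame, then transpose so that time runs along the columns.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Assumed values for illustration only: one second of a 200 Hz tone at 8 kHz.
sample_rate = 8000
frame_len = 256
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 200.0 * t)

spec = spectrogram(tone, frame_len=frame_len, hop=128)
# Frequency (in Hz) of the strongest bin, averaged over time.
peak_hz = np.argmax(spec.mean(axis=1)) * sample_rate / frame_len
print(spec.shape, peak_hz)
```

Even one second of signal yields a matrix with thousands of entries, which gives a sense of why days of recordings per subject add up to such a large data set.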

From there, to help the machine learn the data inside and out, the model was instructed to perform two operations.

First, it uses an autoencoder to compress the spectrograms down to 30 values.

Second, it reverses course, decompressing those values back into an entirely new spectrogram, according to the paper.

After the second operation, the model is instructed to make sure the new spectrogram resembles the initial data input. In this step, by being forced to learn to tell the spectrograms apart from one another, the machine better learns what separates them. That means distinguishing different spectrograms coming from the same patient, in addition to determining differences between subjects.
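The compress-then-reconstruct idea can be sketched with a very small autoencoder. This is not the authors' architecture: it is a single-layer toy written in NumPy, trained on synthetic data, and every size except the 30-value code is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened spectrogram patches (hypothetical sizes).
n_samples, n_features, code_size = 200, 64, 30
latent = rng.normal(size=(n_samples, 5))
X = latent @ rng.normal(size=(5, n_features))
X = (X - X.mean(axis=0)) / X.std()

# Small random initial weights for the encoder and decoder.
W_enc = rng.normal(scale=0.1, size=(n_features, code_size))
b_enc = np.zeros(code_size)
W_dec = rng.normal(scale=0.1, size=(code_size, n_features))
b_dec = np.zeros(n_features)

def encode(x):
    """Compress each input row down to `code_size` values."""
    return np.tanh(x @ W_enc + b_enc)

def decode(code):
    """Expand a code back into a reconstructed input."""
    return code @ W_dec + b_dec

losses = []
learning_rate = 0.1
for step in range(1000):
    code = encode(X)
    err = decode(code) - X
    losses.append(float(np.mean(err ** 2)))

    # Backpropagate the mean squared reconstruction error.
    grad_recon = 2.0 * err / err.size
    grad_W_dec = code.T @ grad_recon
    grad_b_dec = grad_recon.sum(axis=0)
    grad_pre = (grad_recon @ W_dec.T) * (1.0 - code ** 2)
    grad_W_enc = X.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)

    # Plain gradient descent update.
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec

codes = encode(X)
print(codes.shape, losses[0], losses[-1])
```

Note that no diagnosis labels appear anywhere in this training loop. In a two-step setup like the one the article describes, the resulting `codes` would then be handed to a separate classifier.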

The authors concluded that the two-step method largely eliminated “overfitting.”

“By decoupling the feature extraction from downstream learning tasks, our learned representation prevents common overfitting issues that approaches with direct supervision experience,” they wrote. “The features generalize across subjects, while capturing relevant patterns for downstream clinical prediction tasks.”

Gonzalez Ortiz said that this vocal-cord scenario, with few subjects and lots of data, made “overfitting” extremely likely. But the model could have many more applications, especially when it comes to wearable devices, the researcher added. Monitoring for Parkinson’s disease or sleep disorders, where long periods of observation are punctuated by fleeting data points, could benefit from the two-step process of having machines distinguish the criteria, he said.

“You always have to account for overfitting, because pretty much all systems and algorithms, if given enough time and parameters, they will memorize the training dataset, and will fail to generalize to the test data,” said Gonzalez Ortiz.
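One common way to check whether a model has merely memorized “who’s who” is to hold out entire subjects, so that nobody in the test set was seen during training. The sketch below illustrates that general practice and is not a description of the paper's evaluation protocol; the helper name `split_by_subject`, the 25 percent test fraction, and the 50 recordings per subject are all assumptions.

```python
import numpy as np

def split_by_subject(subject_ids, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) such that no subject appears in both."""
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = subjects[:n_test]
    # Every sample from a held-out subject goes to the test set.
    is_test = np.isin(subject_ids, test_subjects)
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)

# Hypothetical layout: 104 subjects with 50 recordings each.
subject_ids = np.repeat(np.arange(104), 50)
train_idx, test_idx = split_by_subject(subject_ids)

shared = set(subject_ids[train_idx]) & set(subject_ids[test_idx])
print(len(train_idx), len(test_idx), len(shared))
```

If accuracy is high on a random split of recordings but drops sharply on a subject-wise split like this one, that gap is a sign the model is recognizing individuals rather than the condition.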
