MIT Model Aims to Break Through AI 'Overfitting'

Seth Augenstein

Better automating machine learning to tackle tough-to-find diseases.

The voicing patterns of patients with vocal cord nodules are shown in spectrograms. A new MIT model attempts to cut through the data noise and eliminate a problem common in machine learning: the phenomenon of “overfitting.” Image/Thumb have been modified. Courtesy of MIT.

Artificial intelligence has virtually unlimited potential for tackling human health problems, almost all experts agree. But the machine needs guidance. The sheer volume of data can make manual training almost impossibly labor-intensive. At the same time, a machine left to learn by itself will eventually just memorize its sample set instead of picking out the relevant patterns, leading to “overfitting” and inaccurate results on new data.

A model using data manipulation to pick out vocal cord disorders in a limited set of subjects may hold the key to eliminating the “overfitting” bugaboo, according to a new paper by data scientists from the Massachusetts Institute of Technology, Harvard, and the University of Toronto.

Getting the machine to better learn what it’s looking for could be useful in a wide range of fields where subjects are few but the volume of data is huge, according to the work, which is scheduled for presentation at the Machine Learning for Healthcare conference later this week.

“If you have few subjects and lots of data, there’s a failure model that’s just memorizing who’s who,” said Jose Javier Gonzalez Ortiz, the lead author, a Ph.D. student at the MIT Computer Science and Artificial Intelligence Laboratory, in a phone interview with Inside Digital Health™ this week. “What we saw is, if you just split this process in two, you have better odds… By learning the features first, without knowing who is who, and then performing the classification, you have a better chance of not failing that way.”
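One common way to check whether a model is "just memorizing who's who" is to test it only on people it has never seen. The sketch below is an illustration, not the evaluation protocol from the paper, which the article does not describe. The function name `split_by_subject` and the toy data are hypothetical.

```python
import random

def split_by_subject(samples, test_fraction=0.25, seed=0):
    """Split (subject_id, features, label) samples so that no subject
    appears in both the training set and the test set."""
    subjects = sorted({s[0] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_set = [s for s in samples if s[0] not in test_subjects]
    test_set = [s for s in samples if s[0] in test_subjects]
    return train_set, test_set

# Hypothetical toy data: 8 subjects with 5 recordings each.
samples = [(subj, [float(subj), float(i)], subj % 2)
           for subj in range(8) for i in range(5)]
train_set, test_set = split_by_subject(samples)

train_ids = {s[0] for s in train_set}
test_ids = {s[0] for s in test_set}
print(len(train_set), len(test_set), train_ids & test_ids)
```

Because whole subjects are held out, a model that only learned to recognize individuals gets no credit at test time.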

The paper outlines the model employed by Gonzalez Ortiz and the rest of the team. The study involved a group of 104 subjects, half of them diagnosed with vocal cord nodules (growths somewhat like calluses in the throat). Each subject was fitted with an accelerometer, a node affixed to the neck, which tracked entire days’ worth of data for every time they spoke.

That data trove was huge — billions of time samples.

So the machine was tasked with picking out which of the subjects had vocal cord nodules and, most importantly, with picking out the features that identified them as such.

To automate the normally manual work of “feature engineering” (picking the most pertinent decisive criteria), Gonzalez Ortiz and the team used a two-step data analysis to better discriminate among the data.

The voicing segments were converted into spectrograms, which are visual representations of the frequencies present in speech over time. These, in turn, formed huge, complex matrices.
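For readers unfamiliar with spectrograms, here is a minimal sketch of how one can be computed: the signal is cut into short overlapping windows, and each window is broken down into its frequency components. The window length, hop size, and test signal are assumptions chosen for illustration. The article does not give the paper's settings, and a production version would normally use an FFT library and a tapering window.

```python
import cmath
import math

def spectrogram(signal, window=16, hop=8):
    """Return a list of frames; each frame holds the magnitude of
    frequency bins 0..window//2 for one slice of the signal."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window]
        frame = []
        for k in range(window // 2 + 1):
            # Naive discrete Fourier transform for bin k.
            total = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / window)
                        for n in range(window))
            frame.append(abs(total))
        frames.append(frame)
    return frames

# Hypothetical signal: a pure tone that completes 4 cycles per window.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = spectrogram(signal)

# The strongest bin in every frame should be bin 4.
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each frame is one column of the matrix the article mentions, which is why long recordings produce very large matrices.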

From there, to help the machine learn the data inside and out, it was instructed to perform two operations.

First, it uses an autoencoder to compress each spectrogram down to 30 values.

Second, it reverses course, decompressing those 30 values back into a new spectrogram, according to the paper.

After the second operation, the model is instructed to make sure the new spectrogram resembles the original input. In this step, by being forced to tell the spectrograms apart from one another, the machine better learns what separates them. That means discriminating among different spectrograms coming from the same patient, in addition to determining differences between subjects.
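The compress-then-rebuild idea can be sketched with a tiny linear autoencoder. This is a toy under stated assumptions, not the paper's architecture: the sizes (6 inputs squeezed to 2 values rather than a spectrogram squeezed to 30), the learning rate, and the synthetic data are all hypothetical. Note that no diagnosis labels are used anywhere in the training loop.

```python
import random

D, K = 6, 2  # toy sizes; the article's model compresses to 30 values

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def reconstruct(enc, dec, x):
    code = matvec(enc, x)           # compress: D values -> K values
    return code, matvec(dec, code)  # decompress: K values -> D values

def mean_error(enc, dec, data):
    """Average squared difference between inputs and reconstructions."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(enc, dec, x)
        total += sum((h - t) ** 2 for h, t in zip(x_hat, x))
    return total / len(data)

def init_weights(seed=0):
    rng = random.Random(seed)
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(K)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(D)]
    return enc, dec

def train(enc, dec, data, epochs=300, lr=0.01):
    """Gradient descent on the squared reconstruction error."""
    for _ in range(epochs):
        for x in data:
            code, x_hat = reconstruct(enc, dec, x)
            err = [h - t for h, t in zip(x_hat, x)]
            # Gradient with respect to the code, taken before dec changes.
            g_code = [sum(err[i] * dec[i][j] for i in range(D))
                      for j in range(K)]
            for i in range(D):
                for j in range(K):
                    dec[i][j] -= lr * 2 * err[i] * code[j]
            for j in range(K):
                for m in range(D):
                    enc[j][m] -= lr * 2 * g_code[j] * x[m]

# Hypothetical data: 6-value inputs that depend on only two numbers.
rng = random.Random(1)
data = []
for _ in range(50):
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append([a if i % 2 == 0 else b for i in range(D)])

enc, dec = init_weights()
error_before = mean_error(enc, dec, data)
train(enc, dec, data)
error_after = mean_error(enc, dec, data)
code, _ = reconstruct(enc, dec, data[0])
print(error_before, error_after, len(code))
```

The short `code` list plays the role of the compressed representation; in a two-step setup, those values rather than the raw input would be handed to a separate classifier.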

The authors concluded that the two-step method largely eliminated “overfitting.”
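The decoupling itself can be illustrated in a few lines. The sketch below swaps in much simpler techniques than the paper's: step one learns a single feature without labels using the first principal component (found by power iteration) in place of an autoencoder, and step two fits a nearest-centroid classifier on that feature. All function names and data points are hypothetical.

```python
import math

def first_component(points, iters=50):
    """Step 1 (no labels): top principal direction via power iteration."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    v = [1.0] + [0.0] * (dim - 1)
    for _ in range(iters):
        # Multiply v by the unnormalized covariance matrix.
        w = [0.0] * dim
        for c in centered:
            dot = sum(ci * vi for ci, vi in zip(c, v))
            for i in range(dim):
                w[i] += dot * c[i]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def encode(point, mean, v):
    """Project a point onto the learned direction (one feature value)."""
    return sum((p - m) * vi for p, m, vi in zip(point, mean, v))

def fit_centroids(features, labels):
    """Step 2 (labels used only here): one centroid per class."""
    centroids = {}
    for label in set(labels):
        vals = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = sum(vals) / len(vals)
    return centroids

def predict(feature, centroids):
    return min(centroids, key=lambda label: abs(feature - centroids[label]))

# Hypothetical training data: two well-separated groups.
points = [(-2 + 0.1 * i, -2 - 0.1 * i) for i in range(5)] + \
         [(2 + 0.1 * i, 2 - 0.1 * i) for i in range(5)]
labels = [0] * 5 + [1] * 5

mean, direction = first_component(points)            # labels never seen
features = [encode(p, mean, direction) for p in points]
centroids = fit_centroids(features, labels)          # labels used here

new_points = [(-1.5, -2.5), (2.5, 1.5)]
predictions = [predict(encode(p, mean, direction), centroids)
               for p in new_points]
print(predictions)
```

Because the feature extractor never sees a label, it has no opportunity to memorize which recording belongs to which diagnosis; only the small second-stage classifier does.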

“By decoupling the feature extraction from downstream learning tasks, our learned representation prevents common overfitting issues that approaches with direct supervision experience,” they wrote. “The features generalize across subjects, while capturing relevant patterns for downstream clinical prediction tasks.”

Gonzalez Ortiz said the vocal-cord scenario, with few subjects and lots of data, meant “overfitting” was extremely likely. But the model could have many more applications, especially when it comes to wearable devices, the researcher added. Monitoring for Parkinson’s disease or sleep disorders, where long periods of observation are punctuated by fleeting data points, could benefit from the two-step process of having machines distinguish the criteria, he said.

“You always have to account for overfitting, because pretty much all systems and algorithms, if given enough time and parameters, they will memorize the training dataset, and will fail to generalize to the test data,” said Gonzalez Ortiz.
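The gap between memorizing and generalizing can be shown with a deliberately extreme toy. Everything here is hypothetical: the "true" rule (odd versus even), the lookup-table model, and the data are invented for illustration and have nothing to do with the vocal-cord study.

```python
def true_label(x):
    return x % 2  # hypothetical underlying rule: odd vs. even

train_data = [(x, true_label(x)) for x in range(10)]
test_data = [(x, true_label(x)) for x in range(10, 20)]

# A "model" that simply stores every training example it has seen.
memory = dict(train_data)

def memorizer(x):
    # Perfect recall on known inputs, a fixed guess of 0 otherwise.
    return memory.get(x, 0)

def rule_learner(x):
    # A model that has captured the actual pattern.
    return x % 2

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

memorizer_train = accuracy(memorizer, train_data)
memorizer_test = accuracy(memorizer, test_data)
rule_test = accuracy(rule_learner, test_data)
print(memorizer_train, memorizer_test, rule_test)  # 1.0 0.5 1.0
```

The memorizer looks flawless on the data it was trained on and no better than a coin flip on anything new, which is the signature of overfitting.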
