How NLP and Genomics Can Scrape Psychiatric Insights Out of Unstructured EHR Data

A team of researchers from Harvard and Brigham and Women's says that their new methodology will be made freely available to other researchers.

A common knock on electronic health records (EHRs) is that they can be difficult to mine for meaningful insights. That might have less to do with the information they contain, however, than the ways they are traditionally processed. For complex conditions like mental disorders, the problems are particularly pronounced.

A team of researchers in Boston, however, is exploring how natural language processing (NLP) and genomics can be combined to develop a solution. According to the lead author, the software they have developed will be made freely available to other researchers.

"Many efforts to use clinical documentation in electronic health records for research aim to identify individual symptoms, like the presence or absence of psychosis," Thomas McCoy Jr., MD, of Massachusetts General Hospital and Harvard Medical School, said. "My co-authors and I developed a method that instead captures symptom dimensions, or sets of symptoms."

The team based their categories on National Institute of Mental Health Research Domain Criteria standards, and today they published 2 new studies. The first used NLP to specifically extract symptom information from the unstructured data buried within EHRs of over 3,600 adults with psychiatric hospitalizations for a range of conditions, schizophrenia, major depressive disorder, and post-traumatic stress disorder among them.

The researchers developed a list of “seed words” that appeared in between 10% and 90% of the EHRs, and for each seed term they generated 50 similar unigrams and bigrams. Unrelated or ancillary terms were then pruned, allowing the team to characterize condition severity across the cohort based on which phrases appeared in the notes.
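The filtering and scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' software: the notes, candidate terms, domain labels, and function names are all made up for the example, and the step that expands each seed word into similar terms is left out.

```python
import re


def tokenize(text):
    """Lowercase a note and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens):
    """Return the unigrams and bigrams of a token list as strings."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def filter_by_document_frequency(notes, candidates, low=0.10, high=0.90):
    """Keep candidate terms found in between `low` and `high` of all notes."""
    doc_sets = [set(ngrams(tokenize(note))) for note in notes]
    kept = []
    for term in candidates:
        share = sum(term in doc for doc in doc_sets) / len(doc_sets)
        if low <= share <= high:
            kept.append(term)
    return kept


def score_note(note, domain_terms):
    """Count how many distinct terms from each domain appear in one note."""
    present = set(ngrams(tokenize(note)))
    return {
        domain: sum(term in present for term in terms)
        for domain, terms in domain_terms.items()
    }


# Toy notes, invented for illustration only.
notes = [
    "Patient reports low mood and poor sleep.",
    "Patient is anxious, with poor sleep and racing thoughts.",
    "Patient denies low mood; memory intact.",
    "Patient anxious and forgetful, poor memory noted.",
]
candidates = ["patient", "low mood", "anxious", "poor sleep", "hallucinations"]
seed_terms = filter_by_document_frequency(notes, candidates)

# Hypothetical term lists; the domain labels are placeholders.
domain_terms = {
    "negative_valence": ["low mood", "anxious"],
    "arousal": ["poor sleep", "racing thoughts"],
}
scores = score_note(notes[1], domain_terms)
```

Here "patient" is dropped because it appears in every note and "hallucinations" because it appears in none. Note that this simple counting treats "denies low mood" as a match, so a real pipeline would need some handling of negation.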

Traditionally, when health systems are looking to use EHRs to predict condition severity and related metrics (like length of hospital stay), they pair the data with billing information. By focusing instead on the language within the detailed physician notes in the EHRs, the researchers developed a system whose symptom scores correlated with length of stay and cognitive performance scores, as validated by adjusted Cox regression models.

That study, the team concluded, “shows that natural language processing can be used to efficiently and transparently score clinical notes in terms of cognitive and psychopathologic domains.”

A second study by the same set of authors tried to further that effort by applying genomics.

"The recognition that the genetic basis of psychiatric illness crosses traditional boundaries has encouraged efforts to understand psychopathology according to dimensions, rather than simply presence or absence of symptoms," McCoy said.

The group drew from the Partners Biobank program, a sequencing collaboration between Brigham and Women’s Hospital and Massachusetts General Hospital, and applied the NLP methodology developed in the earlier study to extract symptom dimensions from the population. They outlined loci based on the EHR symptom sets and went to work checking for them in the genomic records.

Four of the loci exceeded a genome-wide threshold for statistical significance. “Two of these span genes [that] are associated with neurodevelopment (RFPL3) or neurodegeneration (PFR3),” the authors wrote. “While both are known to be brain expressed, neither has previously been strongly associated with neuropsychiatric disease, suggesting the potential utility of the approach we describe in understanding brain function in a manner that is unbiased by traditional nosology.”
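The article does not state which threshold was used. As a rough sketch of what such a check looks like, the Python below assumes the widely used convention of p < 5e-8, which approximates a 0.05 significance level corrected for about one million independent tests. The locus names and p-values are invented placeholders, not results from the study.

```python
# Conventional genome-wide significance level (an assumption here; the
# article does not give the value the authors applied).
GENOME_WIDE_ALPHA = 5e-8


def significant_loci(p_values, alpha=GENOME_WIDE_ALPHA):
    """Return the names of loci whose p-value falls below the threshold."""
    return sorted(locus for locus, p in p_values.items() if p < alpha)


# Made-up p-values for illustration only.
example_p_values = {
    "locus_a": 3e-9,
    "locus_b": 4e-8,
    "locus_c": 6e-8,
    "locus_d": 1e-3,
}
hits = significant_loci(example_p_values)
```

In this toy input, only locus_a and locus_b pass; locus_c narrowly misses the cutoff, which shows how strict the threshold is compared with the usual 0.05 level.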

Both studies were published today in Biological Psychiatry. “The ability to combine large DNA data sets with meaningful psychiatric information from the electronic health record is an important step in facilitating large scale medical genetics research in psychiatry,” the editor of the journal, John Krystal, MD, said in a statement.

Citing the decision to make the software available to other researchers, McCoy said that he and his team “hope this work will enable transdiagnostic dimensional phenotypes to be used in efforts to achieve precision psychiatry.”
