What Healthcare Can Learn From a Proud Data Parasite

The creator of a digital genetics database says it’s all about harmonization.

Some might call him a data parasite, but Paul Pavlidis, PhD, doesn’t mind. “It’s a slur that we now embrace,” he tells Healthcare Analytics News™. “It’s a good thing.”

He borrowed the title from a 2016 New England Journal of Medicine op-ed in which its editor-in-chief described the potential for “research parasites” to take advantage of an open data-sharing system, though forms of the label had been around before that article. So, what is a data parasite? “We don’t generate data; we just take it from other people,” says Pavlidis, a psychiatry professor at the University of British Columbia in Vancouver, Canada.

And that is a good thing: It’s researchers like him who scrutinize the work of others, ensuring information reliability and integrity, and examine data sets to identify other uses that the original investigators might have overlooked.

Data parasites may also compile disparate data and build new databases, as Pavlidis and his colleagues did when they created NeuroExpresso, a searchable, open-access, online repository of gene expression profiles for 36 types of brain cells, based on mouse data. Healthcare Analytics News™ caught up with Pavlidis late last year after he published a corresponding paper, “Cross-Laboratory Analysis of Brain Cell Type Transcriptomes with Applications to Interpretation of Bulk Tissue Data,” in the journal eNeuro.

As the conversation progressed, it kept returning to 1 theme: What would a data parasite recommend? Do data parasites have any advice for this particular sort of institution or researcher or healthcare organization? So, as healthcare stakeholders of all backgrounds grapple with ever-growing piles of data on their plates (from electronic medical records, wearables, genes, lab tests, research, biospecimens, and more), what can they learn from someone who’s just far enough removed from the data-gathering work to see the strengths and flaws?

Go Public

When researchers and healthcare institutions place de-identified information in public databases, it benefits future studies, Pavlidis says. NeuroExpresso drew much of its data from the Gene Expression Omnibus run by the National Center for Biotechnology Information. Sharing data this way is great for investigators who are generating data and want its value to be “greater than what they write in their own publication.”

Another plus is that public databases do a certain amount of work to harmonize the data. That means that the gene expression profiles might match up, and particular genes may be comparable.

Still, it does not always work out so well. Pavlidis and his colleagues often must harmonize data that live in public registries. “The numbers might not be on the same scale. We have to normalize it,” he says. “That’s somewhat inherently an imperfect process.”

Keep It Simple

Quality control is the first step toward keeping data simple. Too often, data parasites encounter samples that are flawed in some way. For example, a cell type might be contaminated with another cell type, Pavlidis says. It is crucial that data generators strive to keep the data simple, meaning that the samples are actually what they are supposed to be.

The same goes for digital data interfaces like NeuroExpresso. Data generators and parasites alike should build software that is simple, easy to use, and intuitive. The shiniest bell and whistle should be the inclusion of an access point to the underlying data. “That’s what I think is going to be the big win here,” Pavlidis says of NeuroExpresso’s use of that feature.

Embrace the Data Parasites

Pavlidis and his ilk are trying to improve the data situation, whether that be in terms of quality or access. He wants to be open about what exactly he is doing, providing tools and resources along the way. And he’s happy to teach data generators about his work—and how they can help improve it. “It’s something a lot of scientists are realizing adds value to their work,” he says. “We’re hoping it’s a positive thing.”