February 20, 2018

The Data Toolkit That Can Analyze More Than 1M Cells

Why the technology is empowering researchers to analyze once-impenetrable data sets.

F. Alexander Wolf, PhD, and his team at the Institute of Computational Biology (ICB) at Helmholtz Zentrum München, the German Research Center for Environmental Health, have already been to the “Data Science Bowl”—a sort of “championship game” for machine learning.

Now, they may be at the forefront of a monumental breakthrough in the analysis of single-cell gene expression data with the advent of SCANPY, a “scalable toolkit” offering “preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks.” In fact, SCANPY, which stands for Single-Cell Analysis in Python (the programming language), may be the only currently available software package that can analyze data sets containing more than 1 million cells, including the Human Cell Atlas, a reference database of maps designed to describe and define the cellular basis of health and disease, which was developed by an international team of researchers.

A summary of Wolf’s work with SCANPY to date was published on February 6 in the journal Genome Biology.

“The Human Cell Atlas could profit from SCANPY,” Wolf, team leader in machine learning at ICB, told Healthcare Analytics News™. “Generating a cell atlas of the whole human body poses unseen computational challenges; we're talking about analyzing millions and millions of cells here. SCANPY makes a very good effort of resolving this.”

Wolf—who developed the software with his colleague Philipp Angerer in the Machine Learning Group of Fabian Theis, PhD, professor of mathematical modelling of biological systems at the Technical University of Munich—said the team has been asked to present SCANPY to the computational analysis committee of the Human Cell Atlas later this year. The Human Cell Atlas is only 1 of many “exploding” data sets (to use Wolf’s description) in healthcare research that, to date, have confounded investigators. Currently available software systems for gene-expression analysis simply haven’t been able to process data sets of this magnitude.

A key to SCANPY’s capabilities lies in the programming language on which it is based. Python, which is more commonly used in the machine learning field, enables software to be more intuitive than conventional biostatistics packages, which are typically written in the R programming language. With Python as its base, SCANPY combines preprocessing, cell visualization, and “pseudotemporal ordering,” tasks previously handled by separate systems, in a single platform. Conventional systems treat each cell as a point in a coordinate system, characterized by its expression values for thousands of genes. SCANPY instead uses algorithms (modelled on those used by social media platforms) that place cells on a graph, mapping each cell by identifying its closest neighbors.
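To make the neighbor-graph idea concrete, here is a minimal sketch in plain Python. It is not SCANPY’s code or API. The function name `knn_graph`, the toy expression values, and the choice of Euclidean distance are all assumptions made for illustration. Each cell is linked to the `k` cells whose expression profiles are closest to its own.

```python
import math

def knn_graph(cells, k):
    """Map each cell index to the indices of its k nearest cells.

    Hypothetical helper for illustration only; uses brute-force
    Euclidean distance over every pair of cells.
    """
    graph = {}
    for i, a in enumerate(cells):
        # Distance from cell i to every other cell, paired with that cell's index
        dists = [(math.dist(a, b), j) for j, b in enumerate(cells) if j != i]
        dists.sort()
        # Keep only the k closest neighbors
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Made-up expression profiles: 5 cells, 3 genes each
cells = [
    (0.0, 0.1, 5.0),
    (0.2, 0.0, 4.8),
    (0.1, 0.3, 5.1),
    (6.0, 5.9, 0.2),
    (5.8, 6.1, 0.0),
]

graph = knn_graph(cells, k=2)
for cell, neighbors in graph.items():
    print(cell, "->", neighbors)
```

Once the graph is built, later steps such as clustering can work from neighbor links alone instead of the full table of gene values. This brute-force version compares every pair of cells, so it would not scale to a million cells. It only shows the data structure.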

In assessing its capabilities for the Genome Biology paper, Wolf and his colleagues found that SCANPY could perform specific cell analysis steps several times faster than existing platforms. They believe the platform is capable of analyzing 1.3 million cells in just a few hours, without subsampling.

“Quite generally, as soon as large data sets with many observations arise, [or] when you want to integrate data sets from many studies, SCANPY will either enable this or, if it’s possible already, make it much faster,” Wolf said. “Another goal is to use SCANPY as a back-end for data portals that are now created to simplify analyzing data for non-computational-expert users: visualizing cells, clustering them to find new cell types, finding trajectories and branchings, be it in the context of development, disease progression, or dose response, and finding the genes that mark all these effects in an interactive data exploration.”

Although SCANPY is still very much in the developmental stage, experts within the field believe it could have a significant impact on research in the short term. Martin Hemberg, PhD, of the Wellcome Trust Sanger Institute, Cambridge, in the United Kingdom, who has expertise in bioinformatics, systems biology, and applied mathematics, told Healthcare Analytics News™ that he can see the software playing a role in “every area” of basic research because “it provides broad support for processing scRNA-seq data.”

“Processing scRNA-seq data remains challenging today for 2 main reasons: 1) the field has not reached a consensus for what is the best practice; and 2) large volumes of data are computationally challenging to analyze,” continued Hemberg, who was not involved with the SCANPY project. “SCANPY provides a massive step forward, and it makes it much more feasible for researchers to analyze data sets that previously were intractable.”
