For Better or Worse, Big Data's Here to Stay

Lorne Basskin, PharmD

Although it is true that data streams are now quite voluminous and opportunities abound, there are plenty of caveats.

“Big data” has been described by various futurists and health industry pundits as the source of the answers to all problems in health care. The promise of vast amounts of data about patients, doctors, and hospitals that can be sorted and summarized numerous ways has led some stakeholders to conclude that the reasons for any problem can be determined and the solution predicted with great certainty. The reality: although it is true that the data streams are now quite voluminous and opportunities abound, there are plenty of caveats to the notion of elegant and accurate data sorting and analyzing, as well as the prediction of outcomes.

Big data describes any large data set that has the potential to be mined for information. It is high volume, has a wide variety, can be quickly generated and aggregated, and is often (regrettably) incompatible with different databases. Big data allows for faster identification of high-risk patients, more effective interventions, and closer monitoring. It has earned the label of “big” because it comes from so many more sources than in the past. Everything everyone does is now capable of being stored in, and potentially recalled from, a computer, cell phone, tablet, etc. Every clinical outcome or lab result, charge, cost, and provider identity can be both stored and combined with another database for any patient, disease, or doctor. Detailed reports can be prepared for any category and drilled down to any population subset within seconds rather than days.

Big data now comes from many diverse corners of the health care system: research from drug manufacturers, digitized patient records, clinical trial information, and claims databases from public payers such as Medicare and Medicaid. In addition, an individual patient’s clinical data now come from a variety of sources as well: payers, hospitals, outpatient clinics, doctors’ offices, and patients themselves. Electronic medical records (EMRs) have become a major source of data thanks to federal incentives. With EMRs, every lab, drug, intervention, order sheet, physician order, progress note, and (potential) clinical outcome is available for aggregation by population and identification of future trends.

Big Data Takeaways

  • Big data streams are high volume and quickly generated and aggregated.
  • Big data allows for faster identification of high-risk and high-cost patients.
  • Self-reported data by patients are particularly powerful in predictive terms and useful for patient satisfaction, quality of life, and correlation with clinical data; they come via cellphone, social media, or online surveys.
  • Big data enables searching of data for relationships and trends between outcomes, costs, providers, hospitals, and certain disease states that can be used to predict future behavior and performance or identify areas for improvement.

Big Data Caveats

  • Big data streams often reside in separate databases that may be incompatible with other databases.
  • Missing, unverified, or incomplete data can limit usefulness. Different databases may lack standardization of definitions and terminology.
  • Big data doesn’t equal big evidence; well-designed research to build a case is still needed. Big data correlations do not necessarily establish cause and effect, and can result in ridiculous conclusions. The potential for serious sampling errors exists with retrospective analysis of big data as opposed to more rigorous (but also more expensive and time-consuming) clinical trials.
  • Data mining (also known as data dredging) may enable business intelligence, but may be problematic when attempting to establish relationships between costs, outcomes, and providers. It is easy to reach “accidental,” and incorrect, conclusions with this approach.

Examples of Contemporary Uses of Big Data

  • Prediction of patient behavior (adherence, emergency department utilization, and other outcomes and behaviors of interest).
  • Establish cost-effectiveness and use patterns among competing hospitals, drugs, and providers by comparing costs and clinical outcomes across providers and facilities, grouping patients by characteristics they have in common (eg, location, disease state, age, and gender).
  • Develop recommendations for clinical pathways, clinical guidelines, protocols, and formularies for better outcomes based on past experience.
  • Enable the targeting of patient groups and focused interventions for patients who are the most expensive in a system (eg, high-cost patients, readmissions, patients whose condition worsens, adverse events, and patients with complicated, multiorgan diseases).
  • Determine which drugs are associated with high rates of adverse events.
  • Monitor patients and providers for compliance with treatment guidelines, and educate or penalize those who fail to comply (or reward the compliant ones).
  • Pharmacoeconomic analysis to determine which drug, device, or service is the most cost-effective.

Big Data vs Big Evidence: What’s the Difference?

Doing research is just like cooking food your guests will enjoy and want more of: you need the right ingredients (data) and a method of preparing and combining them (research and statistics) to create the end product (evidence). Until all the ingredients come together properly, you do not have something worthy of presentation. According to the International Society for Pharmacoeconomics and Outcomes Research Task Force Report on Real World Data, “Evidence is generated according to a research plan and interpreted accordingly, whereas data is but one component of the research plan. Evidence is shaped, while data simply are raw materials and alone are noninformative.”1

Much has been written in recent years suggesting that medical decision making should be evidence-based. Evidence-based medicine (EBM) has been defined as “the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.”2 Although EBM requires data, it also requires rigor in both the collection and the analysis of the data so that the conclusions are not due to bias, statistical sampling error, or errors in data analysis. Simply copying data to a spreadsheet and looking at the average values may be quick, but it may also be inaccurate. No matter how much data exist, researchers still need to ask the right questions to create a hypothesis, design a test, and use the data to determine whether their hypothesis is true.

Self-Reported Data

Big data includes data from patients or unverified sources: patient registries, social media, and government sites that allow users and providers to enter data directly. These data can be aggregated and sorted anonymously or, with patient permission, be tied directly to objective clinical data, charges, cost of care, their disease states, and the medications they are taking. They can be used to measure a patient’s quality of life; their experience with physicians, hospitals, or other providers; or even home monitoring. Using GPS-enabled devices and smartphone apps, it is possible to directly report heart rate, blood pressure, arrhythmias, medication use or refill information, and blood glucose levels.

Data Mining and Correlations

A trend seen among less-experienced database users is the improper use of data mining. Individuals may search and re-sort the database until they find something that looks significant, even if it seems illogical. There are 3 problems with this approach:

  • As you data mine, you tend to shrink the size of the sample because fewer and fewer people have all the characteristics you add. If you search long enough, you will find results that look meaningful but may not be statistically significant.
  • It is easy to bias the data mining by first looking at which drugs had desirable outcomes and then choosing patients to match.
  • It is possible to generate spurious correlations, or things that correlate with each other but in fact have no relationship with each other. There is an entire website devoted to such correlations between unrelated findings, such as the number of films a particular actor appeared in compared with the number of drowning deaths during a given time period.

The biggest risk of error from these correlation conclusions is inferring cause and effect. When 2 things occur together (which is all that correlation confirms), the researcher has the chance to show bias by declaring which happened first, naming which is the cause and which is the effect.
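To illustrate how repeated searching produces findings by chance alone, the following is a minimal simulation sketch, not an analysis of real data. It assumes a hypothetical drug with no effect at all, outcomes drawn from a standard normal distribution (so the standard deviation is known to be 1), and 1000 small subgroups each tested at the conventional 0.05 level. The function names and parameters are invented for this example.

```python
import math
import random


def two_sided_p(group_a, group_b):
    """Two-sided z-test p-value for a difference in means.

    Assumes each outcome has a known standard deviation of 1,
    which holds here only because the data are simulated that way.
    """
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    standard_error = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    z = abs(diff / standard_error)
    return math.erfc(z / math.sqrt(2))


def count_chance_findings(n_subgroups=1000, patients_per_arm=20,
                          alpha=0.05, seed=0):
    """Count subgroups that look 'significant' although the drug does nothing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_subgroups):
        # Both arms come from the same distribution: there is no true effect.
        drug = [rng.gauss(0, 1) for _ in range(patients_per_arm)]
        comparator = [rng.gauss(0, 1) for _ in range(patients_per_arm)]
        if two_sided_p(drug, comparator) < alpha:
            hits += 1
    return hits


hits = count_chance_findings()
print(f"{hits} of 1000 subgroups looked significant with no real effect")
```

With a 0.05 threshold, roughly 1 in 20 of these comparisons would be expected to look significant even though nothing is going on, which is why a result found only after extensive re-sorting deserves skepticism.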

Predictive Modelling and Confounders

Big data by itself has limited value. The usefulness lies in the ability of pharmacy stakeholders to determine the trends and relationships between data points for any single population member. There are always “hidden variables,” or confounders, that may not be seen in the data but could serve to be important predictors of the outcome. These confounders include other concurrent therapy, severity of an illness, standard of care, concurrent diseases, and a patient’s genetic makeup.
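A small worked example may make the effect of a hidden variable concrete. The sketch below uses invented counts (not data from any study) in which sicker patients are more likely to receive a hypothetical treatment. Severity of illness is the confounder: if it is ignored, the treatment appears harmful, even though treated patients do better within each severity group.

```python
# Hypothetical (recovered, total) counts by severity and treatment arm.
COUNTS = {
    "mild":   {"treated": (90, 100),  "untreated": (340, 400)},
    "severe": {"treated": (200, 400), "untreated": (40, 100)},
}


def crude_rate(counts, arm):
    """Recovery rate for one arm, ignoring severity altogether."""
    recovered = sum(stratum[arm][0] for stratum in counts.values())
    total = sum(stratum[arm][1] for stratum in counts.values())
    return recovered / total


# Comparison that ignores the confounder.
crude = {arm: crude_rate(COUNTS, arm) for arm in ("treated", "untreated")}

# Comparison within each severity group.
by_severity = {
    severity: {arm: rec / total for arm, (rec, total) in arms.items()}
    for severity, arms in COUNTS.items()
}

print("Crude recovery rates:", crude)
for severity, rates in by_severity.items():
    print(f"{severity}: {rates}")
```

In these invented numbers the crude recovery rate is 58% for treated patients versus 76% for untreated patients, yet treated patients recover more often in both the mild group (90% vs 85%) and the severe group (50% vs 40%). A database that does not record severity could not reveal this reversal.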

A field of science called “predictive analytics” is used to predict how a situation will play out in the future based on results from the past. A prediction can be used to treat an entire population similar to the one studied or can be used to tailor treatments for individual patients based on determining the provider, hospitals, or medications most likely to achieve a given outcome. In essence, big data serves to substitute the experience of thousands of similar patients who had a variety of outcomes for the clinical judgment of the treating physician.
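The idea of drawing on the experience of similar past patients can be sketched in a few lines. This is a deliberately simplified illustration, assuming a made-up set of historical records and a crude definition of "similar" (same severity, age within 10 years); real predictive models use far more data and adjust for confounders. All records, drug labels, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical historical records: (age, severity, drug, recovered).
HISTORY = [
    (62, "severe", "A", True),
    (65, "severe", "A", True),
    (70, "severe", "A", False),
    (61, "severe", "B", False),
    (68, "severe", "B", True),
    (66, "severe", "B", False),
    (34, "mild", "A", True),
    (30, "mild", "B", True),
]


def success_rates(history, age, severity, age_window=10):
    """Recovery rate per drug among past patients similar to the new one."""
    counts = defaultdict(lambda: [0, 0])  # drug -> [recovered, total]
    for past_age, past_severity, drug, recovered in history:
        if past_severity == severity and abs(past_age - age) <= age_window:
            counts[drug][0] += int(recovered)
            counts[drug][1] += 1
    return {drug: rec / total for drug, (rec, total) in counts.items()}


# New patient: 64 years old with severe disease.
rates = success_rates(HISTORY, age=64, severity="severe")
best = max(rates, key=rates.get)
print(f"Past success rates: {rates}; highest for drug {best}")
```

With only 3 similar patients per drug, a ranking like this is far too fragile to guide treatment; the sketch shows the mechanics of the approach, not a usable model.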

Can Big Data Be Used to Tell Us Which Care Is Cost Effective?

To choose a cost-effective intervention (whether medications, clinical services, or devices), provider, or facility for a patient or group of patients, providers need to know the cost from the perspective of the user. Costs differ from the provider and payer perspectives, and a hospital's costs and its charges are not the same. We also need to know how effective an intervention is in achieving the primary clinical outcome, whatever that may be: a quicker cure, a longer life, a disability prevented, a successful surgery. Costs and efficacy may then be compared, and the following rules developed by the author of this article may be used to determine the most cost-effective treatment:

  • If 2 drugs have the same cost, choose the more effective drug.
  • If 2 drugs have equal efficacy, choose the less expensive drug.
  • If 1 drug costs less and is more effective, choose it because it is dominant (a no-brainer).
  • If 1 drug costs more and is more effective, the more expensive drug is considered cost effective if the extra benefits are worth the extra cost (ie, it has greater value).
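The 4 rules above can be expressed as a short decision function. This is a sketch under stated assumptions: costs and effectiveness are each reduced to a single number measured from one perspective, and "worth the extra cost" is represented by a hypothetical threshold, max_cost_per_unit, for the extra cost a decision maker will accept per extra unit of effect. The article does not specify such a threshold, so it is a parameter the user must supply.

```python
def compare_drugs(cost_a, effect_a, cost_b, effect_b, max_cost_per_unit):
    """Apply the four decision rules to drug A versus drug B.

    max_cost_per_unit is a hypothetical willingness-to-pay threshold:
    the most extra cost accepted per extra unit of effectiveness.
    """
    if cost_a == cost_b and effect_a == effect_b:
        return "equivalent"

    # Rules 1-3: one drug is at least as good on both cost and effect
    # (same cost and more effective, same effect and cheaper, or dominant).
    if cost_a <= cost_b and effect_a >= effect_b:
        return "choose A"
    if cost_b <= cost_a and effect_b >= effect_a:
        return "choose B"

    # Rule 4: the more expensive drug is also the more effective one.
    if cost_a > cost_b:
        pricier, cheaper = "A", "B"
        extra_cost, extra_effect = cost_a - cost_b, effect_a - effect_b
    else:
        pricier, cheaper = "B", "A"
        extra_cost, extra_effect = cost_b - cost_a, effect_b - effect_a

    cost_per_extra_unit = extra_cost / extra_effect
    if cost_per_extra_unit <= max_cost_per_unit:
        return f"choose {pricier}"
    return f"choose {cheaper}"


# Drug A costs 500 more and raises the success rate from 0.8 to 0.9.
print(compare_drugs(1500, 0.9, 1000, 0.8, max_cost_per_unit=10000))
```

The first three rules need no threshold at all; only the last case, where better outcomes cost more, depends on a judgment about value, which is why the threshold is left as an explicit input.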

Potential sources of error in cost-effectiveness research include:

  • Mixing perspectives.
  • Failure to capture costs outside the area served by the database, such as the cost of care of the home-care provider or a physician paid directly by the patient rather than the insurer.
  • Insufficient clinical data or faulty assumptions for missing data.
  • Cost figures that are an ill-defined mix of direct and indirect costs, and fixed and variable costs.


Big data seems to offer some real potential to improve the quality of care and related outcomes by trying to determine which procedures and providers offer cost-effective treatment. Database users need to know the information they will be accessing is complete and accurate. Big data needs to provide enough detail so that when users need to “drill down” to a specific treatment, patient category, or provider, the data can be accessed and summarized. Providers will, assuming that all the relevant databases can be tied together, have more information from multiple places where care has been provided: pharmacists, labs, physicians, hospitals, nursing homes, emergency departments, and outpatient surgery centers.

Privacy and security will remain concerns: significant database breaches are reported weekly among large companies, so it will be important to ensure that data do not fall into the wrong hands. Consumers will always be concerned that some of these data may fall into the hands of an employer, insurance company, or even an ex-spouse and be used in a prejudicial manner. How to collect and sort these data while keeping them away from the “wrong” people and getting them to the “right” people will be a challenge for years to come.

The potential reward of big data is tremendous, but it coexists with the possibility of serious problems resulting from its misuse. Unverified data from patients; databases that cannot communicate and share information; the risk of missing, incomplete, or inaccurate information; and a lack of rigor in research, including false conclusions of cause and effect based on incorrect association or correlation, are all issues that must be addressed.

Lorne Basskin, PharmD, is a consultant on outcomes research, formulary decision making and pharmacoeconomics, and teaches in the School of Public Health at Brown University.