By Dr. Jans Aasman, CEO, Franz Inc.
Analyzing massive stores of medical data can be overwhelming. Still, it’s an important mission: data analysis could provide new, more tailored treatments. Terms like “personalized medicine,” “precision medicine,” and “individualized medicine” all refer to a data-driven approach toward the goal of customizing medical treatment for every patient’s unique genetic and molecular composition. However noble, that goal is somewhat limited.
Personalized medicine, often described as a way to provide “the right patient with the right drug at the right dose at the right time,” in fact goes beyond custom treatment – it encompasses the entire healthcare process, from prevention, to treatment, to disease management, and considers each patient as an individual.
In his 2015 State of the Union address, President Barack Obama promised to dedicate $215 million in the 2016 budget to the Precision Medicine Initiative. This project, which Obama says has bipartisan support in Congress, will fund research aimed at understanding how best to develop personalized medical treatments.
Planned disbursements include $130 million for the National Institutes of Health (NIH) to form a research group of one million volunteers who will contribute their health data for analysis, $10 million to the U.S. Food and Drug Administration (FDA) to build the necessary databases, and $5 million to secure patient data. The budget also includes $70 million for a pilot project by the National Cancer Institute (NCI), aimed at developing precision treatments for cancer.
The President can expect a sizable return on his investment. The McKinsey Global Institute estimates data-driven decision-making could generate up to $100 billion in value annually for the U.S. healthcare system.
Personalized medicine is neither a new idea nor wishful thinking. Molecular testing is a growing part of cancer patient care (and has been used in cancer care since the 1990s). Also, processes like the genetic profiling of tumors are already helping healthcare professionals develop treatment plans that boost the chances of an optimum outcome and reduce unpleasant side effects. Still, we’re not leveraging the full potential of personalized medicine, due in part to a lack of the proper tools: we need solutions that can handle huge sets of structured, semi-structured, and unstructured data from a myriad of sources.
Graph Databases and Their Role in Personalized Medicine
Personalized medicine gives clinicians tools to better understand the complex mechanisms underlying a patient’s health, disease, or condition, and to better predict which treatments will be most effective. Typically, diagnostic testing is used to determine the best-suited therapies for a particular patient based on their genetic makeup, age, gender, family history, disease(s) to be treated, other health issues, and general physical condition.
Diagnostic testing generates vast quantities of data. In 2011, when medical data was still in the early stages of being digitized, the U.S. healthcare system alone was estimated to encompass 150 exabytes (150 billion gigabytes) of data. IDC predicts that U.S. healthcare data will grow to 2,314 exabytes by 2020. Being able to securely store, easily access, and analyze all this data in real-time is the key to taking full advantage of personalized medicine.
The traditional ways of collecting and consuming data for analysis don’t fully support the needs of personalized medicine research. One primary issue is that medical diagnostic data increasingly includes semi-structured and unstructured data – images, data from sensors and mobile devices, text messages, notes, verbal conversations, graphics, videos and other types of “freeform” data – that can’t be organized into the “rows and columns” format used by relational databases.
Additionally, given the structure that relational databases use to link associated data, searching for a specific pattern across numerous fields in a large database often results in performance problems.
NoSQL databases were developed to address the limitations of relational databases. They are highly scalable, work well with structured, semi-structured, and unstructured data, and can deliver high performance on commodity servers, making storage of vast volumes of data cost-effective. This is a great solution for companies working with unstructured big data, but medical data tends to be too structured to take full advantage of NoSQL databases.
Graph databases provide the best of the SQL (relational) and NoSQL worlds for many medical use cases. Graph databases feature unique networks of relationships that link nodes (concepts, words, objects, conditions, systems, etc.), which can be labeled with metadata or domain-role information. Relationships are defined semantically, adding the shareable descriptive structure that unstructured and semi-structured data lack, and each node can have any number and type of relationships.
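To make that node-and-relationship model concrete, here is a minimal sketch in Python using the rdflib library. The namespace, predicate names, and identifiers are purely illustrative assumptions, not a published medical vocabulary.

```python
# A minimal sketch of modeling medical facts as nodes and labeled relationships.
# All URIs and names below are illustrative placeholders.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/medrec/")  # hypothetical namespace

g = Graph()

# Nodes: a patient, a condition, and a drug
g.add((EX.patient42, RDF.type, EX.Patient))
g.add((EX.melanoma, RDF.type, EX.Condition))
g.add((EX.vemurafenib, RDF.type, EX.Drug))

# Relationships (edges) carry the semantics that rows and columns cannot
g.add((EX.patient42, EX.diagnosedWith, EX.melanoma))
g.add((EX.patient42, EX.hasGeneticVariant, EX.BRAF_V600E))
g.add((EX.vemurafenib, EX.targetsVariant, EX.BRAF_V600E))

# Metadata attached to a node, analogous to a label or domain-role annotation
g.add((EX.patient42, EX.age, Literal(57)))

print(g.serialize(format="turtle"))
```

Because each edge carries its own semantics, adding a new fact (say, a second genetic variant) is just one more triple; no schema change is required.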
This architecture makes a graph database particularly well suited for exploring huge sets of highly connected data to find commonalities and anomalies, and also for quickly producing highly relevant results to queries. Like personalized medicine, graph databases aren’t new – they’re well-established in the data science industry for use in advanced analytics, and have recently gone into wider use as a way of working with data gleaned from social media. Given their flexibility in optimizing search and uniting data from disparate sources, graph databases are becoming the preferred solution for scientific and medical big data. The market has grown by 400 percent in the past two years, according to a recent DBMS ranking by DB Engines.
Graph Databases and the Semantic Data Lake
The data we need to power personalized medicine may exist in the data stores scattered across a single medical facility—more likely, it exists in isolation on multiple devices, data marts, and warehouses across numerous facilities. This creates data silos, which complicates the process of gaining access to the information, and of conducting a complete analysis.
Data silos are not cost-effective – they require a significant budget and vast human resources to maintain, secure, and manage, as explained in a video by Dr. Parsa Mirhaji, lead of a personalized medicine project at Montefiore Medical Center, speaking at the Intel Precision Medicine panel at HIMSS.
Data lakes were created to address the data silo problem. Rather than transforming data for analysis and then storing it in a specific data mart, the data is placed, in its original format, into a data lake. This reduces the cost of storage and the time spent on data prep and maintenance, and improves self-service accessibility for analytics.
A properly built data lake provides both accessibility and data protection, as defined by the data governance rules of the organization. Using semantic analytics tools, healthcare providers can quickly and securely analyze electronic medical records (EMRs) along with semi-structured and unstructured data, as well as distributed data (Linked Open Data). Robust healthcare-specific solutions standardize the terminology and “languages” commonly used in the medical industry, allowing users to include broader datasets in queries. The final result should be a vast collection of linked data from both external and internal stores that can be transparently accessed for analysis in real time.
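As a rough illustration of that terminology standardization, the sketch below maps two facilities’ local diagnosis codes to a single shared concept, so a query (or a reasoner that follows the equivalence links) can span both silos. The namespaces and code names are hypothetical placeholders, not real code systems.

```python
# Illustrative sketch: two facilities use different local codes for the same
# condition; mapping both to one shared concept lets analysis span both silos.
from rdflib import Graph, Namespace, OWL

HOSP_A = Namespace("http://example.org/hospitalA/")
HOSP_B = Namespace("http://example.org/hospitalB/")
SHARED = Namespace("http://example.org/shared-vocabulary/")

g = Graph()
# Each facility's local term is declared equivalent to one shared concept,
# so records coded either way can be treated as the same condition.
g.add((HOSP_A.dx_250_00, OWL.sameAs, SHARED.Type2Diabetes))
g.add((HOSP_B.diabetes_mellitus_type_2, OWL.sameAs, SHARED.Type2Diabetes))
```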
The rule of thumb in analytics is that bigger datasets provide more reliable results to queries, unless the query in question is hyper-focused. Linking common datasets will be crucial in the development of personalized medicine—we need to identify similarities and differences across the broadest possible data pool in order to understand what works best within a specific set of genetic constraints.
These analytics tools, called semantic graph databases, use a W3C standard called RDF (Resource Description Framework) to name and link different nodes. Developers working with semantic graph databases adhere to industry-standard vocabularies and taxonomies to facilitate the linking process. In other words: as long as you use the same terminology to describe the same concepts and relationships between concepts, data integration isn’t an issue. There are literally thousands of data repositories, collectively known as Linked Open Data, that can be read directly by semantic graph databases. A recent scientific article, “KaBOB: Ontology-based semantic integration of biomedical databases,” describes how to use these repositories in healthcare and life sciences by showing how a semantic graph database can link drugs, diseases, side effects, clinical trials, genes, and cellular pathways in a single graph.
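The toy example below sketches that kind of linking in the spirit of what the KaBOB authors describe: a drug, a disease indication, a side effect, and a gene target joined in one graph, with an external export merged in. The URIs and the file name are illustrative assumptions, not the actual KaBOB ontologies.

```python
# Illustrative only: a toy graph joining a drug, a disease indication, a side
# effect, and a gene target; the URIs are placeholders, not KaBOB's ontologies.
from rdflib import Graph, Namespace

BIO = Namespace("http://example.org/bio/")

g = Graph()
g.add((BIO.imatinib, BIO.indicatedFor, BIO.chronic_myeloid_leukemia))
g.add((BIO.imatinib, BIO.hasSideEffect, BIO.nausea))
g.add((BIO.imatinib, BIO.inhibits, BIO.BCR_ABL1))

# Because everything is identified by URI, a Linked Open Data export that uses
# (or maps to) the same identifiers can simply be parsed into the same graph.
g.parse("drug_target_subset.ttl", format="turtle")  # hypothetical local export
```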
Semantic Data Analysis
Semantic graph analytics tools enable the fast integration and analysis of large, complex datasets, including heterogeneous clinical and biomedical data, for predictive modeling and machine learning. The W3C recognizes SPARQL, a SQL-like query language, as the standard query language for semantic graph databases.
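As a minimal sketch of what such a query looks like, the SPARQL example below (run here through rdflib) asks which drugs target a genetic variant recorded for a patient. The predicates reuse the illustrative names from the earlier sketches; they are assumptions, not a standard medical vocabulary.

```python
# Minimal SPARQL sketch over a small in-memory graph; predicate names are the
# same illustrative ones used earlier, not a standard medical vocabulary.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/medrec/")

g = Graph()
g.add((EX.patient42, EX.hasGeneticVariant, EX.BRAF_V600E))
g.add((EX.vemurafenib, EX.targetsVariant, EX.BRAF_V600E))

query = """
PREFIX ex: <http://example.org/medrec/>
SELECT ?patient ?drug WHERE {
    ?patient ex:hasGeneticVariant ?variant .
    ?drug    ex:targetsVariant    ?variant .
}
"""

for row in g.query(query):
    print(f"{row.patient} may respond to {row.drug}")
```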
The connective nature of semantic graph databases enables the trustworthy predictive and prescriptive analysis necessary for personalized medicine. Basic analytics focuses on reporting—confirming what we already know or providing an answer to a specific question. Advanced analytics is based in discovery, shaping the exploration process and revealing questions (and answers) as we progress through a query. This process requires the ability to add new data sources instantly, and to explore new relationships between data—both are provided by the semantic graph database. It is a dynamic solution that will undoubtedly change the way we treat patients, and propel the future of personalized medicine.
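To illustrate how quickly a new source can join an analysis, the sketch below merges a newly released dataset into an existing RDF graph; because graphs combine by simple union, no schema migration is needed. The file names are hypothetical placeholders.

```python
# Sketch: RDF graphs merge by set union, so a new dataset can be pulled into an
# ongoing analysis and queried immediately. File names are hypothetical.
from rdflib import Graph

g = Graph()
g.parse("existing_clinical_graph.ttl", format="turtle")
print(len(g), "triples before")

g.parse("newly_released_trial_results.ttl", format="turtle")  # the new source
print(len(g), "triples after; ready to query immediately")
```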
About the Author
Dr. Jans Aasman, Ph.D., is the CEO of Franz Inc. Dr. Aasman’s previous experience and educational background include: KPN Research, the research lab of the major Dutch telecommunications company; a tenured professorship in Industrial Design at the Technical University of Delft; a visiting scientist position in the Computer Science Department of Prof. Dr. Allen Newell at Carnegie Mellon University; research at the Traffic Research Center of the University of Groningen (The Netherlands); and studies in experimental and cognitive psychology at the University of Groningen, specializing in psychophysiology and cognitive psychology.