Topic 1: Introduction to Health Data Science

Overview

What is data science? What kinds of data are there in the healthcare sector and what can one do with them? The purpose of this topic is to introduce you to key concepts in health data science. You will also be introduced to the course team and learn how the course works.

Health data: what, why, where and how

What is health data

Health data refers to any kind of data related to health conditions, reproductive outcomes, causes of death and quality of life for an individual or population. This includes biometric information, prescription data, diagnoses and screening test results, as well as data about hospital visits or sleeping and exercise patterns. Health is increasingly thought of more broadly, so in addition to clinical data, health data can also include environmental (e.g. pollution levels), socioeconomic (e.g. income level) and lifestyle information. In fact, it is no longer uncommon to consider sources that are not traditionally associated with health, such as social media posts (e.g. researchers have used data science methods to detect signs of depression in tweets). In this course, we will adopt this broad view of health data.

Sources of health data

There are many different sources of health data. Clinical data is produced in hospitals and stored in hospital information systems, picture archiving and communication systems, laboratory databases, etc. Clinical data is also produced in primary care (e.g. GP practices in the UK) and stored in local systems.

Biomedical research data, such as clinical trial data, is produced by academia and pharmaceutical companies and it is stored in databases and libraries.

Health business data (e.g. around accounting, billing, management, etc.) is produced by healthcare providers, insurance companies and other organisations, and is typically stored in local information and transaction systems.

Finally, there is a growing volume of patient-generated data. This includes data produced by wearable devices (e.g. activity trackers that capture data including heart rate, steps and calories) and web-based apps and platforms (e.g. self-management platforms for conditions like diabetes or cancer), among others. This data is typically stored in the cloud or in local databases of the corresponding companies.

One of the challenges in health data science is that data currently sits in silos: data from different sources is not integrated. In fact, even data within a single healthcare provider is often fragmented and not joined up.
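To make the idea of joining up siloed data concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that two separate systems happen to share a common patient identifier. The identifiers, field names and values are all invented, and real-world linkage is usually much harder because such a shared identifier is often missing or inconsistent.

```python
# Hypothetical records held in two separate systems ("silos").
# All identifiers, field names and values are invented for illustration.
gp_records = {
    "P001": {"diagnosis": "type 2 diabetes"},
    "P002": {"diagnosis": "asthma"},
}
wearable_records = {
    "P001": {"mean_daily_steps": 4200},
    "P003": {"mean_daily_steps": 9100},
}


def link_records(*sources):
    """Merge per-patient records from several sources by patient identifier."""
    linked = {}
    for source in sources:
        for patient_id, fields in source.items():
            # Create an empty record the first time we see a patient,
            # then add whatever fields this source holds about them.
            linked.setdefault(patient_id, {}).update(fields)
    return linked


linked = link_records(gp_records, wearable_records)
for patient_id in sorted(linked):
    print(patient_id, linked[patient_id])
```

Only patient P001 appears in both sources, so only that record ends up with both a diagnosis and activity data. The other two patients remain partially described, which is the practical consequence of fragmented data.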

Health data levels

Health data can be captured at different levels, from the microscopic world (e.g. genomics, epigenomics, metagenomics, proteomics, metabolomics) to individuals (e.g. data captured in electronic patient records) and populations (e.g. data about disease spreading across different communities). The data published on the WHO Coronavirus Disease (COVID-19) Dashboard is an example of population-level data.

A large volume of data is produced at all these levels. For example, by some estimates the genomic data for a single individual can run to terabytes. Considering the size of the world population, this is really “big data”. Wearable devices also produce very large datasets, because their sensors generate data continuously. Medical imaging data (e.g. X-rays) is also large, which makes it challenging to store and computationally expensive to analyse.
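A quick back-of-the-envelope calculation shows how continuous sensing adds up. The sketch below is in Python, and every number in it is an assumption chosen for illustration (a 3-axis accelerometer sampled 50 times per second at 2 bytes per reading, and a cohort of 10,000 devices). None of these are measurements from a real device.

```python
def bytes_per_day(samples_per_second, bytes_per_sample, channels=1):
    """Raw (uncompressed) data volume produced by one sensor in 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return samples_per_second * bytes_per_sample * channels * seconds_per_day


# Assumed sensor: 3 channels, 50 samples per second, 2 bytes per reading.
daily = bytes_per_day(samples_per_second=50, bytes_per_sample=2, channels=3)
print(f"{daily / 1e6:.1f} MB per device per day")

# Assumed cohort: 10,000 devices worn for a full year.
yearly_for_cohort = daily * 365 * 10_000
print(f"{yearly_for_cohort / 1e12:.1f} TB per year for 10,000 devices")
```

Under these assumptions a single device produces only tens of megabytes per day, but a modest cohort reaches tens of terabytes per year. Changing the parameters lets you explore other scenarios.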

Data forms

Health data comes in many different forms, from highly structured to unstructured. Relational databases (e.g. in hospital information systems) give a clear structure to the data captured. You can think of a relational database as a collection of tables with rows and columns and links between them. For example, you could have a Patient table containing demographic information about thousands of patients, a Hospital table with key information about different hospitals in Scotland, and an Admissions table with information about hospital admissions, where each admission points to a row in each of the other two tables.
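The three-table example can be sketched with Python's built-in sqlite3 module. The table names follow the text, but the column names and the handful of rows are assumptions invented for illustration. A real hospital information system would have a far richer schema.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column names are hypothetical; only the table names come from the text.
conn.executescript("""
CREATE TABLE Patient (
    patient_id    INTEGER PRIMARY KEY,
    year_of_birth INTEGER,
    sex           TEXT
);
CREATE TABLE Hospital (
    hospital_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE Admissions (
    admission_id   INTEGER PRIMARY KEY,
    patient_id     INTEGER REFERENCES Patient(patient_id),
    hospital_id    INTEGER REFERENCES Hospital(hospital_id),
    admission_date TEXT
);
""")

# Invented example rows.
conn.executemany("INSERT INTO Patient VALUES (?, ?, ?)",
                 [(1, 1950, "F"), (2, 1984, "M")])
conn.executemany("INSERT INTO Hospital VALUES (?, ?)",
                 [(10, "Hospital A"), (20, "Hospital B")])
conn.executemany("INSERT INTO Admissions VALUES (?, ?, ?, ?)",
                 [(100, 1, 10, "2020-01-15"),
                  (101, 1, 20, "2020-03-02"),
                  (102, 2, 10, "2020-03-09")])

# Follow the links from each admission to its patient and hospital.
rows = conn.execute("""
    SELECT p.patient_id, h.name, a.admission_date
    FROM Admissions AS a
    JOIN Patient  AS p ON p.patient_id  = a.patient_id
    JOIN Hospital AS h ON h.hospital_id = a.hospital_id
    ORDER BY a.admission_date
""").fetchall()

for row in rows:
    print(row)
```

Each admission row stores only the identifiers of the patient and the hospital. The demographic and hospital details live in their own tables and are brought together by the join, which is what the "links between tables" amount to in practice.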

Data captured in spreadsheets also has some structure, but very often there is redundancy or duplication of information captured.

Even less structure is present in clinical notes, which are natural language data, or in medical images. The lack of clear structure in the way this data is captured makes it harder to analyse and make sense of in an automated fashion. It is also important to note that the choice of a data representation paradigm, and its underlying structure or lack thereof, can play a key role in data integration and linkage.
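The difference between structured and unstructured data can be illustrated with a small Python sketch. The record, the two notes and the pattern below are all invented for illustration. Real clinical text is far more varied, and simple pattern matching like this is not how clinical text is usually processed in practice.

```python
import re

# Structured: the value sits in a named field and can be read directly.
structured_record = {"patient_id": "P001", "systolic_bp": 142, "diastolic_bp": 90}
systolic_from_record = structured_record["systolic_bp"]

# Unstructured: the same fact is buried in free text (hypothetical notes).
clinical_note = "Pt seen in clinic today. BP 142/90, advised to reduce salt intake."
other_note = "Blood pressure was one forty-two over ninety."

# A naive pattern that only recognises the form "BP <number>/<number>".
bp_pattern = re.compile(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})")


def systolic_from_text(note):
    """Return the systolic value if the note matches the pattern, else None."""
    match = bp_pattern.search(note)
    return int(match.group(1)) if match else None


systolic_from_note = systolic_from_text(clinical_note)
systolic_from_other_note = systolic_from_text(other_note)

print(systolic_from_record, systolic_from_note, systolic_from_other_note)
```

The structured lookup always works. The text pattern finds the value in the first note but returns None for the second, even though a human reader would understand both. This is why free text is harder to analyse automatically.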

Benefits of analysing data

Why is it important to analyse health data? Learning more from data can contribute to important discoveries that allow us to better understand disease and improve the way we deliver care. This includes improved diagnostics, better decision-making and more accurate predictions, among others.

Data Science & Data Analytics

Watch the following 2 videos to find out what data science is, and what people mean when they use the term “data analytics”.

Precision Medicine and Stratified Healthcare

Precision medicine is an emerging field and it is transforming the medical approach to disease treatment and prevention. It focuses on identifying the most effective strategy for each patient, based on genetic, environmental, and lifestyle factors.

In turn, the ability to characterise individuals much more precisely allows us to identify key differences across human populations and to act accordingly in healthcare provision; this is stratified healthcare. Stratified healthcare and precision medicine can therefore be seen as going hand in hand.

Health data science is a key enabler of precision medicine and stratified healthcare. By bringing different types of data together for each patient, we can take a more personalised approach to therapies, tailoring them to suit each individual. This is a rapidly changing environment, and a very exciting one!