Harmonising electronic health records for reproducible research: challenges, solutions and recommendations

The UK-wide Cardiovascular COVID-19 research collaboration known as the CVD-COVID-UK Consortium has published new research in BMC Medical Informatics and Decision Making, describing the implementation of an efficient, transparent, scalable, and reproducible health data harmonisation method.

This new method enables multi-nation collaborative research across the UK’s Trusted Research Environments (TREs): highly secure computing environments that give approved researchers remote access to health and administrative data for research that improves lives.

The CVD-COVID-UK consortium was established by the British Heart Foundation (BHF) Data Science Centre at Health Data Research UK (HDR UK). The consortium aims to understand the relationship between COVID-19 and cardiovascular disease through analyses of harmonised electronic health records (EHRs) across the four UK nations.

This research was led by Population Data Science at Swansea University using anonymised data held within the SAIL Databank and NHS Digital TRE for England. The main linkable data sources included in this research are primary and secondary care data, critical and intensive care data, prescribing and dispensing records, COVID-19 testing and vaccination data, mortality records, maternity services and a range of other data sources.

Harmonisation of EHRs poses several technical challenges as healthcare data generally differ by underlying healthcare systems, type of information collected, drug/vaccine and medical event coding systems and language. In this research, the main challenges for harmonising EHRs are characterised as follows:

  • 1. Achieving consistent definition and derivation of analysis variables: Different healthcare systems use different coding systems and clinical terminologies (e.g. SNOMED and Read codes used in primary care data in the UK). This results in many permutations and options for the code lists needed for research using data sources within a single TRE, and even more so across multiple TREs with more diverse data.

  • 2. Developing a reliable population denominator: Using EHRs to achieve a reliable population denominator with a consistent set of demographic characteristics is challenging for two reasons. Multiple recordings of characteristics such as date of birth, sex, and ethnic group can conflict, and the longitudinal nature of health records leads to an accumulation of individuals that exceeds the general population in number.

  • 3. Establishing transparency and effective communication of approaches: Closely documenting the processes used for data harmonisation and establishing effective communication among project members and stakeholders can be challenging, but it is essential.

  • 4. IT infrastructure: When harmonising EHRs from different TREs, the main aspects of the IT infrastructure to consider are the version control system, the data storage platform, the statistical analysis software, and the availability of performant hardware. Any differences in these aspects mean that divergences in how data preparation and analysis are implemented should be expected, and greater levels of programming expertise may be required.

  • 5. Disclosure control and pooling analyses: To combine analysis results from each TRE, researchers need to be aware of the disclosure control processes each TRE has in place, and how they differ. The main principle behind each disclosure process is to minimise the risk of publishing identifiable data. However, there will be fundamental differences in the restrictions over content, format, structure and granularity of the results.
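To make the second challenge more concrete, the following is a minimal Python sketch of how conflicting recordings of one demographic characteristic might be reduced to a single value per person. The resolution rule (most frequent value, ties broken by the most recent recording), the function names and the example rows are all assumptions made for illustration; the article does not state which rule the consortium used.

```python
from collections import Counter
from datetime import date


def resolve_demographic(records):
    """Pick one value from several (value, record_date) recordings.

    Assumed rule, for illustration only: the most frequently recorded
    value wins, and ties are broken by the most recent recording date.
    """
    counts = Counter(value for value, _ in records)
    latest = {}
    for value, recorded in records:
        if value not in latest or recorded > latest[value]:
            latest[value] = recorded
    return max(counts, key=lambda v: (counts[v], latest[v]))


def build_denominator(rows):
    """Collapse (person_id, value, record_date) rows to one row per person."""
    by_person = {}
    for person_id, value, recorded in rows:
        by_person.setdefault(person_id, []).append((value, recorded))
    return {pid: resolve_demographic(recs) for pid, recs in by_person.items()}


# Hypothetical recordings of sex drawn from several data sources.
rows = [
    ("a", "F", date(2019, 1, 1)),
    ("a", "M", date(2020, 1, 1)),
    ("a", "F", date(2021, 1, 1)),
    ("b", "M", date(2020, 5, 1)),
    ("b", "F", date(2021, 5, 1)),
]
denominator = build_denominator(rows)
```

Each person appears exactly once in the result, which is the property a population denominator needs; a real implementation would also have to handle missing values and individuals who have died or moved out of the population.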

Lead researcher Hoda Abbasizanjani tells us how they achieved this new harmonisation method:

“The harmonisation method was implemented as a four-layer process to achieve reproducibility, reusability and scalability. The first layer consists of raw data sources. Then each of the layers two to four is framed by, but not limited to, the characterised challenges. We curated data as part of our second layer, followed by extracting phenotyped data in the third layer. We captured any project-specific requirements in the fourth layer.”

“We have used best practices, recommendations and rules established within the SAIL Databank for Wales to ensure the effective organisation of files and folders and any data assets created and maintained for wider understanding for users who may be actively working on a proposal or wish to learn or reuse existing components of the resources.”

“Using the implemented four-layer harmonisation method, we retrieved approximately 100 health-related variables for the 3.2 million individuals in Wales, which are harmonised with corresponding data for > 56 million individuals in England. Harmonised variables were grouped into the following categories: demographic variables, ethnic group, socio-economic and geographical characteristics, disease phenotypes including COVID-19 related and CVD related phenotypes, biomarkers, lifestyle risk factors, comorbidity indexes, hospital interventions and procedures, and medications. We processed 13 data sources into the first layer of our harmonisation method: five of these are updated daily or weekly, and the rest at various frequencies providing sufficient data flow updates for frequent capturing of up-to-date demographic, administrative and clinical information.”
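The four layers described above can be pictured with a short Python sketch. It is illustrative only and not the consortium's code: the function names, the row format, and the clinical codes are hypothetical placeholders, and the real layers operate on database tables inside each TRE.

```python
# Layer 1: raw records as they arrive from each source (hypothetical rows).
RAW = [
    {"person_id": " p1 ", "code": "x100", "system": "read"},
    {"person_id": "p2", "code": "S200", "system": "snomed"},
    {"person_id": "p2", "code": "S200", "system": "snomed"},  # exact duplicate
    {"person_id": "p3", "code": "Z999", "system": "snomed"},
]

# Hypothetical code list mapping (coding system, code) to a phenotype name.
# The codes are placeholders, not real Read or SNOMED codes.
CODE_LIST = {
    ("read", "X100"): "hypertension",
    ("snomed", "S200"): "hypertension",
}


def curate(raw):
    """Layer 2: standardise formatting and drop exact duplicates."""
    seen, curated = set(), []
    for row in raw:
        clean = (
            row["person_id"].strip(),
            row["system"].lower(),
            row["code"].strip().upper(),
        )
        if clean not in seen:
            seen.add(clean)
            curated.append(clean)
    return curated


def phenotype(curated, code_list):
    """Layer 3: translate codes from any coding system into phenotypes."""
    phenotyped = {}
    for person_id, system, code in curated:
        name = code_list.get((system, code))
        if name is not None:
            phenotyped.setdefault(person_id, set()).add(name)
    return phenotyped


def project_cohort(phenotyped, wanted):
    """Layer 4: apply a project-specific requirement, here one phenotype."""
    return sorted(pid for pid, names in phenotyped.items() if wanted in names)


curated = curate(RAW)
cohort = project_cohort(phenotype(curated, CODE_LIST), "hypertension")
```

Because layers two and three do not depend on any one project, their outputs can be reused by other studies, and only the fourth layer needs to be rewritten for each new research question.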

Associate Professor and Senior Author Ashley Akbari tells us:

“Sharing our knowledge, experience, code, method and best practices is key to delivering research in an ever more scalable, fast-paced and responsive research environment. In this paper, we have summarised how through harmonisation and embedding these key practices and principles, reproducible research can be achieved across a community of researchers and projects, both in the delivery of COVID-19 research, and more widely can be embedded longer term to give opportunities and set reproducible foundations to all researchers within and between any trusted research environments. These principles are key ways of working at Population Data Science Swansea University, the BHF Data Science Centre, and our many collaborators and partners.”

Professor Angela Wood, Associate Director and Theme Lead for Structured Data at the BHF Data Science Centre tells us:

“This work is the result of some fantastic team science efforts as we all pull together across the nations of the UK to better our approaches to data science research. Improvements in harmonisation and the ways we work will ultimately make safe and secure data research more efficient for researchers working to improve the lives of patients and the public.”

Read the paper in full on the BMC website: https://doi.org/10.1186/s12911-022-02093-0