Registry Data usage (ESR6)

How do we link health registry data to environmental exposures?

WP1
python
data
registry
health
environment
exposure
Author

ESR6: Alejandro Fontal

Introduction

In this blog post I showcase the basic use of registry data and its linkage to environmental data, as I typically do in my work as a member of HELICAL's Work Package 1, whose main objective is to help understand the relationship between environmental exposures and vasculitis onset.

The example is a simplified version of how I use healthcare registry data. I use individual-level data only as a basis for aggregation: from each patient's place of residence and date of onset/diagnosis, I obtain incidence counts per spatial unit (zip code, province, electoral district) and time unit (day, week, month).

To illustrate the linkage process, I will generate a toy environmental dataset and a toy healthcare records dataset, and link them as I usually would.

import numpy as np
import pandas as pd

Environmental dataset

In general, I fetch datasets of daily observations, either publicly available or self-generated, covering several kinds of environmental variables:

  • Weather
  • Pollution
  • Airborne biological diversity
  • Chemical composition (via LIDAR or in situ sampling)

A toy example would be the following table, spanning only 5 days for two different regions, A and B:

environment_df = pd.DataFrame({
    'Date': np.repeat(pd.date_range('2021-01-01', '2021-01-05'), 2),
    'Region': np.tile(['A', 'B'], 5),
    'Temperature': np.random.normal(20, 5, 10).round(2),
    'NO₂': np.random.normal(5, 1, 10).round(2),
    'Fungal Sp. 1': np.random.normal(1000, 100, 10).astype(int),
    'Bacterial Sp. 2': np.random.normal(750, 75, 10).astype(int)
})


# this is just for styling the table on the blog

(environment_df.style
 .hide(axis='index')
 .format({'Temperature': '{:.2f}',
          'NO₂': '{:.2f}',
          'Date': '{:%Y-%m-%d}'})
 .set_table_attributes("class='dataframe table-hover' style='text-align: center'")
)
Date        Region  Temperature  NO₂   Fungal Sp. 1  Bacterial Sp. 2
2021-01-01  A       21.21        6.15  959           723
2021-01-01  B       25.24        5.42  981           705
2021-01-02  A       19.03        5.26  1046          790
2021-01-02  B       24.04        5.68  1168          752
2021-01-03  A       21.56        5.70  1061          755
2021-01-03  B       13.82        6.04  964           759
2021-01-04  A       22.23        3.55  924           649
2021-01-04  B       26.51        4.06  996           804
2021-01-05  A       12.87        6.35  862           626
2021-01-05  B       24.84        4.01  939           650
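The toy values above come from unseeded `np.random` calls, so they change every time the code is run. If a reproducible table is wanted, one option is to draw from a seeded generator. The following is only a sketch: the helper name `make_environment_df` and the seed value `42` are my own assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd


def make_environment_df(seed=42):
    """Build the toy environmental table from a seeded random generator.

    Hypothetical helper: the name and the default seed are arbitrary choices.
    """
    rng = np.random.default_rng(seed)
    dates = pd.date_range('2021-01-01', '2021-01-05')
    regions = ['A', 'B']
    n_rows = len(dates) * len(regions)

    return pd.DataFrame({
        # each date appears once per region: d1, d1, d2, d2, ...
        'Date': dates.repeat(len(regions)),
        # regions alternate within each date: A, B, A, B, ...
        'Region': np.tile(regions, len(dates)),
        'Temperature': rng.normal(20, 5, n_rows).round(2),
        'NO₂': rng.normal(5, 1, n_rows).round(2),
        'Fungal Sp. 1': rng.normal(1000, 100, n_rows).astype(int),
        'Bacterial Sp. 2': rng.normal(750, 75, n_rows).astype(int),
    })


environment_df = make_environment_df()
```

Calling `make_environment_df()` twice with the same seed returns identical tables, which keeps every table in a write-up consistent across re-runs.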

Healthcare records dataset

A minimal healthcare records dataset of the kind I use contains, at the individual level, each patient's region of residence and the recorded date of (vasculitis) onset.

healthcare_records = pd.DataFrame({
    'Patient ID': range(1, 16),
    'Region': np.random.choice(['A', 'B'], 15),
    'Onset Date': np.random.choice(pd.date_range('2021-01-01', '2021-01-05'), 15)
})

# this is just for styling the table on the blog

(healthcare_records
 .style
 .hide(axis='index')
 .format({'Onset Date': '{:%Y-%m-%d}'})
 .set_table_attributes("class='dataframe table-hover'")
)
Patient ID  Region  Onset Date
1           B       2021-01-03
2           B       2021-01-05
3           B       2021-01-03
4           A       2021-01-03
5           A       2021-01-02
6           B       2021-01-03
7           A       2021-01-02
8           A       2021-01-03
9           B       2021-01-03
10          A       2021-01-02
11          B       2021-01-02
12          A       2021-01-01
13          A       2021-01-02
14          B       2021-01-01
15          B       2021-01-01

I then go from individual-level records to population-level records by aggregating by date and region, so that the data table I use looks like the following:

daily_cases = (healthcare_records
               .groupby(['Onset Date', 'Region'])
               .size()
               .rename('Cases')
               .astype(int)
               .reset_index()
               .rename(columns={'Onset Date': 'Date'})
)

# this is just for styling the table on the blog

(daily_cases
 .style
 .hide(axis='index')
 .format({'Date': '{:%Y-%m-%d}'})
 .set_table_attributes("class='dataframe table-hover'")
)
Date        Region  Cases
2021-01-01  A       1
2021-01-01  B       2
2021-01-02  A       4
2021-01-02  B       1
2021-01-03  A       2
2021-01-03  B       4
2021-01-05  B       1
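The same aggregation can be done at a coarser time unit, such as the weekly counts mentioned in the introduction. Below is a minimal sketch using `pd.Grouper`. The `records` table and its values are made up for illustration, and closing weeks on Sunday (`freq='W-SUN'`) is an assumption rather than a convention taken from the registry.

```python
import pandas as pd

# Hypothetical individual-level records, same layout as the toy table above.
records = pd.DataFrame({
    'Patient ID': [1, 2, 3, 4, 5],
    'Region': ['A', 'A', 'B', 'B', 'A'],
    'Onset Date': pd.to_datetime([
        '2021-01-01', '2021-01-03', '2021-01-04', '2021-01-12', '2021-01-13',
    ]),
})

weekly_cases = (records
    # bin onset dates into weeks ending on Sunday, then split by region
    .groupby([pd.Grouper(key='Onset Date', freq='W-SUN'), 'Region'])
    .size()
    .rename('Cases')
    .reset_index()
    # each bin is labelled by the Sunday that closes the week
    .rename(columns={'Onset Date': 'Week ending'})
)
```

Here patients 1 and 2 both fall in the week ending 2021-01-03, so region A gets a count of 2 for that week. For monthly counts, grouping on `records['Onset Date'].dt.to_period('M')` would be one comparable option.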

Linkage

The final linkage produces the table on which most of the analyses are run. It merges the environmental data and the daily incidence counts into a single table, matching rows on the Date and Region columns. A left merge keeps every date-region combination present in the environmental data, and combinations with no recorded cases get a count of zero, such that:

(environment_df
 .merge(daily_cases, on=['Date', 'Region'], how='left')
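 # fill only the missing case counts with 0, so that any gaps in the
 # environmental columns are not silently turned into zeros
 .assign(Cases=lambda df: df['Cases'].fillna(0).astype(int))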
 .sort_values(['Region', 'Date'])
 .set_index(['Region', 'Date'])
)
Region  Date        Temperature  NO₂   Fungal Sp. 1  Bacterial Sp. 2  Cases
A       2021-01-01  15.31        5.12  981           758              1
A       2021-01-02  20.83        3.09  1086          782              4
A       2021-01-03  22.09        4.66  903           801              2
A       2021-01-04  25.52        4.47  823           716              0
A       2021-01-05  27.61        4.41  1059          763              0
B       2021-01-01  20.98        5.58  1008          692              2
B       2021-01-02  18.13        4.02  1206          779              1
B       2021-01-03  17.26        5.11  879           748              4
B       2021-01-04  21.32        5.40  1124          882              0
B       2021-01-05  18.66        5.42  897           660              1
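With real data, two things can quietly go wrong in a merge like this: duplicated Date/Region keys can multiply rows, and cases whose date or region has no environmental record are dropped by the left merge. The sketch below shows one way to guard against both with the `validate` and `indicator` options of `DataFrame.merge`. The two small tables and their values are invented for illustration only.

```python
import pandas as pd

# Hypothetical inputs: the environmental table lacks 2021-01-03.
environment = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02']),
    'Region': ['A', 'B', 'A', 'B'],
    'Temperature': [15.3, 21.0, 20.8, 18.1],
})
cases = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
    'Region': ['A', 'B', 'B'],
    'Cases': [1, 2, 4],
})

keys = ['Date', 'Region']

# validate='one_to_one' raises pandas.errors.MergeError if either table
# contains a duplicated Date/Region pair.
linked = environment.merge(cases, on=keys, how='left', validate='one_to_one')

# indicator=True adds a '_merge' column; 'left_only' marks case rows
# that have no matching environmental record.
checked = cases.merge(environment[keys], on=keys, how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
```

In this toy input `unmatched` holds the single 2021-01-03 row for region B, whose 4 cases would otherwise disappear from the linked table without any warning.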