Data harmonization
Data harmonization in DiNetxify means loading phenotype data and medical-record data into a single DiseaseNetworkData object. During this process, diagnosis codes are mapped to phecodes, dates are standardized, follow-up windows are applied, and phenotype covariates are converted into analysis-ready variables.
Initializing the data object
First, import DiNetxify and create a DiseaseNetworkData object with your study design and phecode settings.
import DiNetxify as dnt
# Matched cohort
data = dnt.DiseaseNetworkData(
    study_design="matched cohort",
    phecode_level=1,
)
# Standard cohort
data = dnt.DiseaseNetworkData(
    study_design="cohort",
    phecode_level=1,
)
# Exposed-only cohort
data = dnt.DiseaseNetworkData(
    study_design="exposed-only cohort",
    phecode_level=1,
)
study_design: One of 'cohort', 'matched cohort', or 'exposed-only cohort'. Default is 'cohort'.
phecode_level: Use 1 or 2. Level 1 is broader and usually more stable for smaller datasets; level 2 is more granular and often more suitable for larger cohorts.
Optional parameters:
min_required_icd_codes: Minimum number of ICD records mapping to the same phecode before that phecode is counted as present for an individual. Default is 1.
date_fmt: Date format used in the phenotype file. This is also the default format for medical-record files unless overridden later. Default is '%Y-%m-%d'.
phecode_version: Phecode mapping version. '1.2' is the recommended general-purpose choice. '1.3a' is also available for special use cases.
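The effect of min_required_icd_codes can be illustrated with a small standalone sketch. This is not DiNetxify's internal code: the function name phecodes_present and the toy records are hypothetical, and it assumes every input record counts once toward the threshold (how the package treats duplicate records is not stated here).

```python
from collections import Counter


def phecodes_present(records, min_required_icd_codes=1):
    """Return {participant_id: set of phecodes} that meet the record threshold.

    `records` is an iterable of (participant_id, phecode) pairs, one pair per
    ICD record that has already been mapped to a phecode.
    """
    counts = Counter(records)  # (participant_id, phecode) -> number of records
    present = {}
    for (pid, phecode), n in counts.items():
        if n >= min_required_icd_codes:
            present.setdefault(pid, set()).add(phecode)
    return present


# Made-up records: participant "A" has two records for phecode "250".
records = [("A", "250"), ("A", "250"), ("A", "401"), ("B", "401")]
print(phecodes_present(records, min_required_icd_codes=1))
print(phecodes_present(records, min_required_icd_codes=2))
```

With a threshold of 2, only phecodes backed by at least two records survive, which is why a higher value trades sensitivity for fewer spurious diagnoses.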
Load phenotype data
After creating the data object, use phenotype_data() to load the cohort phenotype data.
# Matched cohort
col_dict = {
    "Participant ID": "ID",
    "Exposure": "exposure",
    "Sex": "sex",
    "Index date": "date_start",
    "End date": "date_end",
    "Match ID": "group_id",
}
covariates = ["age", "BMI"]
data.phenotype_data(
    phenotype_data_path="tests/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=covariates,
)
# Standard cohort
col_dict = {
    "Participant ID": "ID",
    "Exposure": "exposure",
    "Sex": "sex",
    "Index date": "date_start",
    "End date": "date_end",
}
data.phenotype_data(
    phenotype_data_path="tests/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=["age", "BMI"],
)
# Exposed-only cohort
col_dict = {
    "Participant ID": "ID",
    "Sex": "sex",
    "Index date": "date_start",
    "End date": "date_end",
}
data.phenotype_data(
    phenotype_data_path="tests/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=["age", "BMI"],
)
phenotype_data_path: Path to a CSV or TSV phenotype file.
column_names: Dictionary mapping required DiNetxify field names to the corresponding column names in your dataset.
covariates: List of additional phenotype variables to load.
Optional parameters:
is_single_sex: Set to True if the cohort contains only one sex. Default is False.
force: If True, overwrite phenotype and medical-record information already stored in the object. Default is False.
Important phenotype input rules:
For cohort and matched cohort, Exposure must be coded as 1 for exposed and 0 for unexposed.
Sex must be coded as 1 for female and 0 for male.
Required columns cannot contain missing values.
Participants with a missing value in any continuous covariate are removed during loading.
Categorical covariates with missing values are retained and treated as an "NA" category.
Column names used for covariates must not conflict with DiNetxify reserved variables.
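Before calling phenotype_data(), it can help to check a file against the coding rules above. The following is a minimal sketch using only the standard library. The helper check_phenotype_rows is hypothetical (it is not part of DiNetxify), and the column names follow the standard-cohort example file; for an exposed-only cohort the exposure column would be dropped from the checks.

```python
import csv
import io
from datetime import datetime

# File column names taken from the standard-cohort example above.
REQUIRED = ["ID", "exposure", "sex", "date_start", "date_end"]


def check_phenotype_rows(rows, date_fmt="%Y-%m-%d"):
    """Return (row_number, message) pairs for rows that break the input rules."""
    problems = []
    for i, row in enumerate(rows, start=1):
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if missing:
            problems.append((i, f"missing required value(s): {missing}"))
            continue
        for col in ("exposure", "sex"):
            if row[col].strip() not in {"0", "1"}:
                problems.append((i, f"{col} must be coded 0 or 1, got {row[col]!r}"))
        for col in ("date_start", "date_end"):
            try:
                datetime.strptime(row[col].strip(), date_fmt)
            except ValueError:
                problems.append((i, f"{col} does not match {date_fmt!r}: {row[col]!r}"))
    return problems


# Made-up rows: row 2 has an invalid exposure code, row 3 a missing sex,
# and row 4 a start date in the wrong format.
sample = io.StringIO(
    "ID,exposure,sex,date_start,date_end\n"
    "1,1,0,2010-01-01,2015-06-30\n"
    "2,2,1,2010-01-01,2015-06-30\n"
    "3,0,,2010-01-01,2015-06-30\n"
    "4,0,1,01/02/2010,2015-06-30\n"
)
problems = check_phenotype_rows(csv.DictReader(sample))
for row_number, message in problems:
    print(row_number, message)
```

To check a real file, pass csv.DictReader(open(path, newline="")) instead of the in-memory sample.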
After loading phenotype data:
You can inspect the cohort summary by printing the object:
print(data)
You can also generate a phenotype summary table with:
table1_df = data.Table1()
print(table1_df.head())
Table1() summarizes sex, the included covariates, and follow-up time. For cohort and matched cohort, the table compares exposed and unexposed groups; for exposed-only cohort, it produces a single-group summary.
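As a rough cross-check of the follow-up time reported in the summary table, follow-up can be computed directly from the index and end dates. This sketch is an assumption-laden illustration: the helper follow_up_years is hypothetical, and the unit (years of 365.25 days) is chosen here for convenience, since the exact definition used by Table1() is not stated in this section.

```python
from datetime import datetime


def follow_up_years(date_start, date_end, date_fmt="%Y-%m-%d"):
    """Return follow-up length in years (365.25-day years, an assumption)."""
    start = datetime.strptime(date_start, date_fmt)
    end = datetime.strptime(date_end, date_fmt)
    if end < start:
        raise ValueError("date_end precedes date_start")
    return (end - start).days / 365.25


# Made-up participant followed from index date to end date.
print(round(follow_up_years("2010-01-01", "2015-06-30"), 2))
```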
Load medical record data
After the phenotype data are loaded, merge one or more diagnosis files with merge_medical_records(). Call the method once per file if your records use different ICD systems.
data.merge_medical_records(
    medical_records_data_path="tests/data/dummy_EHR_ICD10.csv",
    diagnosis_code="ICD-10-WHO",
    column_names={
        "Participant ID": "ID",
        "Diagnosis code": "diag_icd10",
        "Date of diagnosis": "dia_date",
    },
)
data.merge_medical_records(
    medical_records_data_path="tests/data/dummy_EHR_ICD9.csv",
    diagnosis_code="ICD-9-WHO",
    column_names={
        "Participant ID": "ID",
        "Diagnosis code": "diag_icd9",
        "Date of diagnosis": "dia_date",
    },
    diagnosis_code_exclusion=[],
)
medical_records_data_path: Path to a CSV or TSV diagnosis file.
diagnosis_code: One ICD system per file. Supported values are 'ICD-9-CM', 'ICD-9-WHO', 'ICD-10-CM', and 'ICD-10-WHO'.
column_names: Dictionary mapping 'Participant ID', 'Diagnosis code', and 'Date of diagnosis' to the actual columns in your file.
Optional parameters:
date_fmt: Date format for the diagnosis file. If None, DiNetxify uses the date_fmt defined in the DiseaseNetworkData object.
chunksize: Number of rows processed per chunk. Default is 1000000.
diagnosis_code_exclusion: List of diagnosis codes to exclude before phecode mapping.
Important medical-record input rules:
Use the same participant IDs as the phenotype file.
Do not mix ICD-9 and ICD-10 codes in one file.
Do not mix CM and WHO coding systems in one file.
Do not restrict records to first occurrences only.
Do not pre-filter records to the follow-up period; DiNetxify handles follow-up filtering internally.
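The first rule, shared participant IDs, is easy to verify before merging. A minimal sketch, assuming both files use an "ID" column as in the examples above; the helper unmatched_ids and the in-memory data are hypothetical and not part of DiNetxify.

```python
import csv
import io


def unmatched_ids(phenotype_rows, record_rows, pheno_id_col="ID", record_id_col="ID"):
    """Return sorted participant IDs found in the records but not in the phenotype data."""
    pheno_ids = {row[pheno_id_col] for row in phenotype_rows}
    record_ids = {row[record_id_col] for row in record_rows}
    return sorted(record_ids - pheno_ids)


# Made-up data: participant "9" appears only in the diagnosis file.
phenotype = io.StringIO("ID,sex\n1,0\n2,1\n")
diagnoses = io.StringIO(
    "ID,diag_icd10,dia_date\n"
    "1,E11,2012-03-01\n"
    "9,I10,2013-07-15\n"
)
extra = unmatched_ids(csv.DictReader(phenotype), csv.DictReader(diagnoses))
print(extra)
```

A long list of unmatched IDs usually points to a formatting mismatch (for example leading zeros or a different ID column) rather than genuinely missing participants.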
During loading, the package reports:
how many rows were read
how many rows were excluded because of missing values or ID/code filtering
how many diagnosis codes mapped directly or after truncation
how many codes were not mapped to any phecode
how many mapped records were invalid because the diagnosis date fell outside the usable follow-up window
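The distinction between codes mapped "directly" and "after truncation" can be pictured as a prefix fallback lookup. The sketch below is only an illustration of that idea: the function map_with_truncation, the dot removal, the minimum prefix length of 3, and the toy mapping table with placeholder phecode names are all assumptions, and DiNetxify's actual truncation rules may differ.

```python
from collections import Counter


def map_with_truncation(icd_code, icd_to_phecode, min_length=3):
    """Look up a code, then progressively shorter prefixes of it.

    Returns (phecode or None, status) where status is "direct",
    "truncated", or "unmapped".
    """
    code = icd_code.replace(".", "").upper()
    if code in icd_to_phecode:
        return icd_to_phecode[code], "direct"
    for end in range(len(code) - 1, min_length - 1, -1):
        prefix = code[:end]
        if prefix in icd_to_phecode:
            return icd_to_phecode[prefix], "truncated"
    return None, "unmapped"


# Toy mapping with placeholder phecode names, not the real phecode table.
toy_map = {"E11": "phecode_A", "I10": "phecode_B"}
codes = ["E11.9", "I10", "Z99"]
results = [map_with_truncation(c, toy_map) for c in codes]
print(results)
print(Counter(status for _, status in results))
```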
After loading medical record data:
Print the object again to see the merged-data summary:
print(data)
This summary includes the number of merged files, the total number of processed diagnosis records, mean numbers of recorded phecodes during and before follow-up, and any mapping or follow-up warnings collected during harmonization.
Save DiseaseNetworkData object
Once phenotype data and medical records have been harmonized, you can save the DiseaseNetworkData object for later reuse.
# Save as gzip-compressed pickle
data.save("results/cohort_data")
# Save as compressed NumPy archive
data.save_npz("results/cohort_data")
You do not need to add the extension yourself. save() appends .pkl.gz, and save_npz() writes a .npz file.
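Because the extensions are added automatically, a script that later looks for the saved files has to account for them. A small sketch of that bookkeeping, assuming the prefix is passed without any extension; the helper expected_save_paths is hypothetical and simply mirrors the naming rule described above.

```python
from pathlib import Path


def expected_save_paths(prefix):
    """Return the file paths expected on disk for a given save prefix."""
    return {
        "save": Path(prefix + ".pkl.gz"),   # written by save()
        "save_npz": Path(prefix + ".npz"),  # written by save_npz()
    }


paths = expected_save_paths("results/cohort_data")
for method, path in paths.items():
    print(method, path.name, "exists" if path.exists() else "not found")
```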
Reload DiseaseNetworkData object
To reload a saved object, create a new DiseaseNetworkData instance and call load() or load_npz().
import DiNetxify as dnt
data = dnt.DiseaseNetworkData()
# Load from gzip-compressed pickle
data.load("results/cohort_data")
# Or load from compressed NumPy archive
data.load_npz("results/cohort_data")
Use force=True only if you are loading into an object that already contains data and you want to overwrite it.