Input data preparation
Overview
DiNetxify requires two types of input:
A phenotype file describing the cohort
One or more medical-record files containing diagnosis codes and diagnosis dates
The package supports three study designs:
cohort
matched cohort
exposed-only cohort
Before loading any data, create a DiseaseNetworkData object.
import DiNetxify as dnt

data = dnt.DiseaseNetworkData(
    study_design="cohort",
    phecode_level=1,
    min_required_icd_codes=1,
    date_fmt="%Y-%m-%d",
    phecode_version="1.2",
)
Key initialization arguments:
study_design: One of 'cohort', 'matched cohort', or 'exposed-only cohort'.
phecode_level: Use 1 or 2.
min_required_icd_codes: Minimum number of mapped ICD codes required for a phecode to count as valid.
date_fmt: Date format used in the phenotype file. This is also the default for medical-record files unless you override it later.
phecode_version: Version 1.2 is the recommended general-purpose option.
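The date_fmt value shown above uses Python's strptime directives. As a quick sanity check before loading, you can try parsing a sample date from your own file with the standard library. This is a minimal sketch: the helper name matches_format and the sample date strings are made up for illustration and are not part of DiNetxify.

```python
from datetime import datetime


def matches_format(date_string: str, date_fmt: str) -> bool:
    """Return True if date_string parses under date_fmt (strptime notation)."""
    try:
        datetime.strptime(date_string, date_fmt)
        return True
    except ValueError:
        return False


# Hypothetical sample values, not taken from any real file.
print(matches_format("2016-03-15", "%Y-%m-%d"))  # True
print(matches_format("15/03/2016", "%Y-%m-%d"))  # False
print(matches_format("15/03/2016", "%d/%m/%Y"))  # True
```

If the check fails for your dates, adjust date_fmt rather than rewriting the file.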
Phenotype data
Phenotype data must be provided as a CSV or TSV file with a header row and one row per participant.
Required columns
For cohort:
Participant ID
Exposure
Sex
Index date
End date
For matched cohort:
Participant ID
Exposure
Sex
Index date
End date
Match ID
For exposed-only cohort:
Participant ID
Sex
Index date
End date
You may also provide any number of additional covariates, such as age, BMI, smoking, or education.
Input rules
Required columns cannot contain missing values.
Exposure must be coded as 1 for exposed and 0 for unexposed when the study design includes an exposure group.
Sex must be coded as 1 for female and 0 for male.
Dates must use a consistent format.
Covariate types are detected automatically and converted internally.
Participants with a missing value in any continuous covariate are removed during loading.
Covariate names must not conflict with reserved internal variable names.
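The rules above can be checked on a file before it is handed to the package. The following is a rough sketch using only the standard library; it is not DiNetxify's own validation. The function check_phenotype_rows is hypothetical, the column names follow the dummy dataset, and the four sample rows are invented to trigger each kind of problem.

```python
import csv
import io
from datetime import datetime

# Column names as used in the dummy dataset; adjust to your own file.
REQUIRED = ["ID", "exposure", "sex", "date_start", "date_end"]


def check_phenotype_rows(rows, required=REQUIRED, date_fmt="%Y-%m-%d"):
    """Return a list of (row_number, message) problems found in the rows."""
    problems = []
    for n, row in enumerate(rows, start=1):
        # Required columns cannot contain missing values.
        missing = [c for c in required if not (row.get(c) or "").strip()]
        if missing:
            problems.append((n, f"missing value in {missing}"))
            continue
        # Exposure and sex must be coded as 1 or 0.
        for col in ("exposure", "sex"):
            if col in required and row[col] not in ("0", "1"):
                problems.append((n, f"{col} must be 0 or 1, got {row[col]!r}"))
        # Dates must parse under one consistent format.
        for col in ("date_start", "date_end"):
            try:
                datetime.strptime(row[col], date_fmt)
            except ValueError:
                problems.append((n, f"{col} does not match {date_fmt}"))
    return problems


# Invented sample: row 2 has a bad exposure code, row 3 a missing sex,
# row 4 a start date in a different format.
sample = io.StringIO(
    "ID,exposure,sex,date_start,date_end\n"
    "1,1,0,2016-01-01,2020-12-31\n"
    "2,2,1,2016-01-01,2020-12-31\n"
    "3,0,,2016-01-01,2020-12-31\n"
    "4,0,1,01/01/2016,2020-12-31\n"
)
problems = check_phenotype_rows(csv.DictReader(sample))
for row_number, message in problems:
    print(row_number, message)
```

For an exposed-only cohort, pass a required list without the exposure column; the exposure coding check is then skipped.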
Load phenotype data
col_dict = {
    "Participant ID": "ID",
    "Exposure": "exposure",
    "Sex": "sex",
    "Index date": "date_start",
    "End date": "date_end",
}
covariates = ["age", "BMI"]

data.phenotype_data(
    phenotype_data_path="tests/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=covariates,
    is_single_sex=False,
    force=False,
)
Notes:
For a matched cohort, include "Match ID" in column_names.
For an exposed-only cohort, omit "Exposure" from column_names.
If your cohort contains only one sex, set is_single_sex=True.
If you want to overwrite already loaded data, set force=True.
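To make the per-design differences in column_names concrete, here is a small sketch that builds the mapping for each study design. The helper build_column_names is hypothetical and not part of DiNetxify; the right-hand column names (including group_id for the match identifier) are those of the dummy dataset and should be replaced with the headers in your own file.

```python
def build_column_names(study_design: str) -> dict:
    """Hypothetical helper: map required column labels to dummy-dataset names."""
    column_names = {
        "Participant ID": "ID",
        "Exposure": "exposure",
        "Sex": "sex",
        "Index date": "date_start",
        "End date": "date_end",
    }
    if study_design == "matched cohort":
        # Matched designs additionally need the matching group identifier.
        column_names["Match ID"] = "group_id"
    elif study_design == "exposed-only cohort":
        # No unexposed group, so there is no exposure column to map.
        del column_names["Exposure"]
    elif study_design != "cohort":
        raise ValueError(f"unknown study design: {study_design!r}")
    return column_names


for design in ("cohort", "matched cohort", "exposed-only cohort"):
    print(design, "->", sorted(build_column_names(design)))
```

The resulting dictionary is what would be passed as column_names, provided the study_design given at initialization matches.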
Medical records data
Medical records must be provided in long format, with one diagnosis event per row.
Required columns
Each medical-record file must contain:
Participant ID
Diagnosis code
Date of diagnosis
Supported diagnosis code systems
Each file must use exactly one supported coding system:
ICD-9-CM
ICD-9-WHO
ICD-10-CM
ICD-10-WHO
If you have multiple coding systems, load them as separate files by calling merge_medical_records() multiple times.
Input rules
Use the same participant IDs as in the phenotype file.
Do not mix ICD-9 and ICD-10 codes in one file.
Do not mix CM and WHO code systems in one file.
Do not pre-filter diagnoses to first occurrence only.
Do not pre-filter diagnoses to the follow-up period; DiNetxify handles follow-up filtering internally.
ICD-10 codes may be provided with or without decimal points.
ICD-9 codes may be provided in decimal or short format.
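Two of these rules are easy to verify before loading: that participant IDs line up with the phenotype file and that diagnosis dates follow the expected format. The sketch below is one possible pre-check using the standard library, not something DiNetxify provides. The function check_medical_records, the set known_ids, and the sample records are all invented for illustration; column names follow the dummy dataset.

```python
import csv
import io
from datetime import datetime


def check_medical_records(rows, known_ids, id_col="ID", date_col="dia_date",
                          date_fmt="%Y-%m-%d"):
    """Count records with unknown participant IDs or unparseable dates."""
    unknown_ids, bad_dates = 0, 0
    for row in rows:
        # The ID should match one found in the phenotype file.
        if row[id_col] not in known_ids:
            unknown_ids += 1
        # The diagnosis date should parse under the expected format.
        try:
            datetime.strptime(row[date_col], date_fmt)
        except ValueError:
            bad_dates += 1
    return {"unknown_ids": unknown_ids, "bad_dates": bad_dates}


# IDs assumed to have been collected from the phenotype file.
known_ids = {"1", "2", "3"}

# Invented long-format records: repeated diagnoses per participant are kept,
# and ICD-10 codes appear both with and without a decimal point.
records = io.StringIO(
    "ID,diag_icd10,dia_date\n"
    "1,I10,2017-05-02\n"
    "1,E11.9,2018-11-20\n"
    "2,E119,2019/01/07\n"
    "9,J45,2019-03-14\n"
)
summary = check_medical_records(csv.DictReader(records), known_ids)
print(summary)  # {'unknown_ids': 1, 'bad_dates': 1}
```

Non-zero counts are a prompt to review the file, for example an ID column stored with a different prefix or a date column that needs its own date_fmt.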
Load medical records
data.merge_medical_records(
    medical_records_data_path="tests/data/dummy_EHR_ICD9.csv",
    diagnosis_code="ICD-9-WHO",
    column_names={
        "Participant ID": "ID",
        "Diagnosis code": "diag_icd9",
        "Date of diagnosis": "dia_date",
    },
    date_fmt=None,
    chunksize=1000000,
    diagnosis_code_exclusion=[],
)

data.merge_medical_records(
    medical_records_data_path="tests/data/dummy_EHR_ICD10.csv",
    diagnosis_code="ICD-10-WHO",
    column_names={
        "Participant ID": "ID",
        "Diagnosis code": "diag_icd10",
        "Date of diagnosis": "dia_date",
    },
)
Notes:
If date_fmt=None, the medical-record file uses the same date format as the phenotype file.
chunksize controls how many rows are processed at a time and is useful for large files.
diagnosis_code_exclusion lets you exclude specific diagnosis codes before mapping.
After each merge, the package updates diagnosis, diagnosis-count, and history information inside the DiseaseNetworkData object.
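To illustrate what processing a file in chunks means, the sketch below reads a small in-memory CSV a fixed number of rows at a time. It shows the general idea only and makes no claim about how DiNetxify implements chunksize internally; the helper iter_chunks and the seven sample records are invented for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunksize):
    """Yield lists of at most chunksize rows from an iterator of rows."""
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


# Seven invented ICD-9 records, one diagnosis event per row.
records = io.StringIO(
    "ID,diag_icd9,dia_date\n"
    + "".join(f"{i},250.0,2015-01-{i:02d}\n" for i in range(1, 8))
)
reader = csv.DictReader(records)

# Only one chunk is held in memory at a time.
sizes = [len(chunk) for chunk in iter_chunks(reader, chunksize=3)]
print(sizes)  # [3, 3, 1]
```

A smaller chunk size lowers peak memory use at the cost of more iterations, which is the trade-off to keep in mind when choosing chunksize for large files.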
Dummy dataset
A dummy dataset is included under tests/data so you can test the full workflow before using your own data.
The dataset contains:
dummy_phenotype.csv: 60,000 participants in a matched-cohort-style example
dummy_EHR_ICD9.csv: 10,188 ICD-9 diagnosis records
dummy_EHR_ICD10.csv: 1,668,795 ICD-10 diagnosis records
Important columns in the dummy phenotype file:
ID: Participant identifier
group_id: Matching group identifier
exposure: Exposure status
date_start: Follow-up start date
date_end: Follow-up end date
age: Baseline age
sex: Biological sex
BMI: BMI category
Important columns in the dummy medical-record files:
ID: Participant identifier
dia_date: Diagnosis date
diag_icd9 or diag_icd10: Diagnosis code
Note: The dummy data are simulated for demonstration only. They are useful for learning the workflow and checking that the software runs, but they should not be interpreted as real clinical findings.