# Data harmonization

Data harmonization involves loading and merging the **phenotype data** and **medical record data** into a single `DiseaseNetworkData` object for analysis. During this process, the software ensures consistent coding (e.g., converting diagnosis codes to phecodes) and standardized formatting (e.g., datetime parsing for diagnosis and follow-up periods).

## Initializing the data object

First, import the ***DiNetxify*** package and instantiate a `DiseaseNetworkData` object with your chosen study design, phecode level, and any optional parameters. For example:

```python
import DiNetxify as dnt

# For a standard cohort study
data = dnt.DiseaseNetworkData(
    study_design='cohort',
    phecode_level=1,
)

# For a matched cohort study
data = dnt.DiseaseNetworkData(
    study_design='matched cohort',
    phecode_level=1,
)

# For an exposed-only cohort study
data = dnt.DiseaseNetworkData(
    study_design='exposed-only cohort',
    phecode_level=1,
)
```

- **study_design** – Type of study design. Options: `'cohort'`, `'matched cohort'`, or `'exposed-only cohort'`. *(Default: 'cohort')*.
- **phecode_level** – Level of phecode to use for grouping diagnoses. Level 1 provides broader categories (~585 conditions) while level 2 offers more detailed categories (~1257 conditions). For smaller datasets, level 1 is recommended to maintain statistical power; for larger datasets, level 2 can provide finer granularity. *(Options: 1 or 2; Default: 1)*.

**Optional parameters:**

- **min_required_icd_codes** – Minimum number of ICD diagnosis records mapping to the same phecode for that phecode to be considered “present” in an individual. For example, `min_required_icd_codes=2` means a single occurrence of a code isn’t enough to count the person as having that phecode; at least two occurrences are required. If you use this parameter, ensure your medical records are comprehensive (not limited to first occurrences). *(Default: 1)*.
- **date_fmt** – Date format of the **Index date** and **End date** columns in your phenotype data. *(Default: '%Y-%m-%d', i.e. YYYY-MM-DD)*.
- **phecode_version** – Phecode version for mapping diagnosis codes. Currently, version `'1.2'` is the recommended official version (with mapping files for ICD-9-CM/WHO and ICD-10-CM/WHO). An unofficial `'1.3a'` is available in the package for special use cases but is **not** recommended for general use. *(Default: '1.2')*.

## Load phenotype data

After initializing the `DiseaseNetworkData` object, use the `phenotype_data()` method to load your phenotype data file. You need to provide the file path, a dictionary mapping the required column names to your file’s column headers, and a list of any additional covariate column names.

Below are examples demonstrating how to load the dummy phenotype dataset under different study designs. The dummy file is structured for a matched cohort study, but it can be adapted for other designs by dropping certain columns when mapping: if you omit the **Match ID** column in the mapping, the data will be treated as a standard cohort (ignoring the matching groups); if you omit both **Match ID** and **Exposure**, it will be treated as an exposed-only cohort (all individuals considered “exposed”).

```python
# Load phenotype data for a matched cohort study
col_dict = {
    'Participant ID': 'ID',
    'Exposure': 'exposure',
    'Sex': 'sex',
    'Index date': 'date_start',
    'End date': 'date_end',
    'Match ID': 'group_id'
}
covariates_list = ['age', 'BMI']
data.phenotype_data(
    phenotype_data_path=r"/test/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=covariates_list
)

# Load phenotype data for a standard cohort study (no matching)
col_dict = {
    'Participant ID': 'ID',
    'Exposure': 'exposure',
    'Sex': 'sex',
    'Index date': 'date_start',
    'End date': 'date_end'
}
covariates_list = ['age', 'BMI']
data.phenotype_data(
    phenotype_data_path=r"/test/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=covariates_list
)

# Load phenotype data for an exposed-only cohort study (only exposed group, no comparator)
col_dict = {
    'Participant ID': 'ID',
    'Sex': 'sex',
    'Index date': 'date_start',
    'End date': 'date_end'
}
covariates_list = ['age', 'BMI']
data.phenotype_data(
    phenotype_data_path=r"/test/data/dummy_phenotype.csv",
    column_names=col_dict,
    covariates=covariates_list
)
```

- **phenotype_data_path** – Path to your phenotype data file (CSV or TSV).
- **column_names** – Dictionary mapping the required column names (`'Participant ID'`, `'Index date'`, `'End date'`, `'Sex'`, and depending on design `'Exposure'` and `'Match ID'`) to the corresponding column headers in your file. Include `'Exposure'` for cohort and matched cohort designs, and include `'Match ID'` only for matched cohorts.
- **covariates** – List of additional covariate column names to load (if any). Use an empty list `[]` if there are none. The function will automatically detect each covariate’s type and process it appropriately (e.g., encode categorical variables). For continuous covariates, any rows with missing values will be dropped; for categorical covariates, missing values will be categorized as "NA".
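
Before calling `phenotype_data()`, it can help to confirm that every header named in `column_names` and `covariates` actually exists in the file. The following is a minimal sketch using only the Python standard library. `missing_columns` is a hypothetical helper, not part of ***DiNetxify***, and the in-memory sample rows are made up for illustration.

```python
import csv
import io

def missing_columns(csv_file, column_names, covariates, delimiter=','):
    """Return mapped headers and covariates absent from the file's header row.

    Hypothetical helper (not a DiNetxify function). `csv_file` is any open
    text file object whose first row is the header.
    """
    header = next(csv.reader(csv_file, delimiter=delimiter), [])
    present = {name.strip() for name in header}
    wanted = list(column_names.values()) + list(covariates)
    return [name for name in wanted if name not in present]

# Small in-memory stand-in for a phenotype file (made-up rows)
sample = io.StringIO(
    "ID,exposure,sex,date_start,date_end,age\n"
    "1,1,Female,2010-01-01,2020-01-01,57\n"
)
col_dict = {
    'Participant ID': 'ID',
    'Exposure': 'exposure',
    'Sex': 'sex',
    'Index date': 'date_start',
    'End date': 'date_end',
}
print(missing_columns(sample, col_dict, ['age', 'BMI']))  # prints ['BMI']
```

With a real file, pass a handle opened with `open(path, newline='')`, and set `delimiter='\t'` for a TSV file. An empty result means every mapped column and covariate was found in the header.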

**Optional parameters:**

- **is_single_sex** – If your cohort contains only one sex (all male or all female), set this to `True` so the software knows to treat the Sex column accordingly. *(Default: False)*.
- **force** – If `False`, the method will raise an error if phenotype data has already been loaded into this `DiseaseNetworkData` object (to prevent accidental overwrite). Setting `force=True` will overwrite any existing data in the object with the new data. *(Default: False)*.

**After loading phenotype data:** Once the phenotype data is loaded, you can inspect the basic characteristics by printing the `data` object:

```python
print(data)
# Example output (for a matched cohort study):
"""
DiNetxify.DiseaseNetworkData
Study design: matched cohort

Phenotype data
Total number of individuals: 60,000 (10,000 exposed and 50,000 unexposed)
The average group size is: 6.00
Average follow-up years: 10.44 (exposed) and 10.46 (unexposed)

Warning: 102 exposed individuals and 440 unexposed individuals have negative or zero follow-up time.
Consider removing them before merge.
"""
```

The printed summary confirms the number of individuals, breakdown by exposure, average matching group size (for matched cohorts), and average follow-up times. Warnings are provided if any participants have non-positive follow-up lengths, which you may want to address (e.g., by removing those individuals) before proceeding.

Additionally, you can generate a basic descriptive table (Table 1) of the phenotype data using the `Table1()` method.
This returns a pandas DataFrame summarizing each variable (e.g., medians/IQRs for continuous variables, counts/percentages for categorical variables) and performing simple statistical comparisons between exposed and unexposed groups:

```python
table1_df = data.Table1()
print(table1_df)
# Example (truncated) output:
"""
                   Variable  exposure=1 (n=10,000)  exposure=0 (n=50,000)         Test and p-value
0         age (median, IQR)    57.08 (48.91–65.32)    57.05 (48.87–65.35)  Mann-Whitney U p=0.9824
1   follow_up (median, IQR)      9.18 (5.77–13.70)      9.22 (5.80–13.75)  Mann-Whitney U p=0.6806
2                sex (n, %)
3                sex=Female         5,045 (50.45%)        25,225 (50.45%)
…                       ...
"""
```

This Table 1 gives a quick overview of how the exposed and unexposed groups compare on key variables. You can save this `DataFrame` to a CSV/TSV/Excel file using pandas if needed.

## Load medical record data

After loading the phenotype data, use the `merge_medical_records()` method to load and merge each medical record file. Call this method once for each separate file (e.g., one for ICD-10 and one for ICD-9 in our dummy data). Provide the file path, the ICD coding standard used in that file, and a dictionary mapping the required columns.

The following example code shows how to load the dummy EHR ICD-10 and ICD-9 files:

```python
# Merge the first medical record file (dummy_EHR_ICD10.csv)
data.merge_medical_records(
    medical_records_data_path=r"/test/data/dummy_EHR_ICD10.csv",
    diagnosis_code='ICD-10-WHO',
    column_names={
        'Participant ID': 'ID',
        'Diagnosis code': 'diag_icd10',
        'Date of diagnosis': 'dia_date'
    }
)

# Merge the second medical record file (dummy_EHR_ICD9.csv)
data.merge_medical_records(
    medical_records_data_path=r"/test/data/dummy_EHR_ICD9.csv",
    diagnosis_code='ICD-9-WHO',
    column_names={
        'Participant ID': 'ID',
        'Diagnosis code': 'diag_icd9',
        'Date of diagnosis': 'dia_date'
    }
)
```

- **medical_records_data_path** – Path to a medical record data file (CSV or TSV).
- **diagnosis_code** – The diagnosis coding system used in that file. Options include `'ICD-9-CM'`, `'ICD-9-WHO'`, `'ICD-10-CM'`, `'ICD-10-WHO'` (case-sensitive).
- **column_names** – Dictionary mapping the required column names (`'Participant ID'`, `'Diagnosis code'`, `'Date of diagnosis'`) to your file’s column headers.

**Optional parameters:**

- **date_fmt** – Date format of the **Date of diagnosis** column in this file. If not provided, it defaults to the same format used for phenotype dates (`date_fmt` specified in the `DiseaseNetworkData` initialization).
- **chunksize** – If the file is very large, you can specify a number of rows to read per chunk (the function will stream through the file in chunks to manage memory usage). *(Default: 1,000,000 rows per chunk.)*

**During data loading:** As each medical record file is processed, ***DiNetxify*** will output progress messages and basic stats. For example:

```python
"""
1,000,000 records read (1,000,000 included after filtering on participant ID), 0 records with missing values excluded.
1,668,795 records read (1,668,795 included after filtering on participant ID), 0 records with missing values excluded.
Total: 1,668,795 diagnosis records processed, 0 records with missing values were excluded.
1,286,386 diagnosis records mapped to phecode without truncating.
0 diagnosis records mapped to phecode after truncating to 4 digits.
72,073 diagnosis records mapped to phecode after truncating to 3 digits.
302,908 diagnosis records not mapped to any phecode.
Phecode diagnosis records successfully merged (18,486 invalid records were not merged, typically due to diagnosis date beyond follow-up end).

1 medical record file already merged, merging with a new one.
10,188 records read (10,188 included after filtering on participant ID), 0 records with missing values excluded.
Total: 10,188 diagnosis records processed, 0 records with missing values were excluded.
9,711 diagnosis records mapped to phecode without truncating.
0 diagnosis records mapped to phecode after truncating to 4 digits.
266 diagnosis records mapped to phecode after truncating to 3 digits.
211 diagnosis records not mapped to any phecode.
Phecode diagnosis records successfully merged (0 invalid records were not merged).
"""
```

From these logs, you can see how many records were read and included, how many were excluded (e.g., missing values or out-of-follow-up-range dates), and how many diagnosis codes were successfully mapped to phecodes versus not mapped. The logs also indicate when multiple files are being merged sequentially.

**After loading medical record data:** After merging all medical record files, you can print the `data` object again to see a summary of the combined dataset:

```python
print(data)
# Example output (matched cohort study):
"""
Merged Medical records
2 medical record files with 1,678,983 diagnosis records were merged (0 with missing values).
Average number of disease diagnoses during follow-up: 18.99 (exposed) and 7.31 (unexposed)
Average number of disease diagnoses before follow-up: 8.40 (exposed) and 3.46 (unexposed)

Warning: 102 exposed individuals and 440 unexposed individuals have negative or zero follow-up time.
Consider removing them before merge.
Warning: 18.15% of ICD-10-WHO codes were not mapped to phecodes for file /test/data/dummy_EHR_ICD10.csv.
Warning: 2.07% of ICD-9-WHO codes were not mapped to phecodes for file /test/data/dummy_EHR_ICD9.csv.
"""
```

This output confirms the number of diagnosis records merged and provides average counts of diagnoses per person (during and before follow-up, by exposure group). Warnings indicate the percentage of codes that could not be mapped to a phecode for each file, so you’re aware of any unmapped codes.

## Save DiseaseNetworkData object

At this stage, after loading phenotype and medical record data, you may want to save the `DiseaseNetworkData` object for later use.
Saving allows you to reuse the prepared data without re-reading and processing raw files each time, facilitating reproducibility and easy sharing of the processed data. ***DiNetxify*** provides two methods: `save()` (which uses Python’s pickle serialization, saving to a compressed `.pkl.gz` file) and `save_npz()` (which saves to a compressed NumPy `.npz` file). You can use either or both depending on your needs. For example:

```python
# Save the data object to a gzipped pickle file
data.save('/your/project/path/cohort_data')
# (This will produce a file named "cohort_data.pkl.gz")

# Save the data object to a NumPy .npz file
data.save_npz('/your/project/path/cohort_data')
# (This will produce a file named "cohort_data.npz")
```

You do not need to add the file extension in the path; the functions will append `.pkl.gz` or `.npz` automatically. Make sure to choose a directory where you have write permissions and enough storage space (the files can be large if your dataset is large).

## Reload DiseaseNetworkData object

If you have previously saved a `DiseaseNetworkData` object, you can reload it instead of re-reading all input files. This is especially useful for large datasets or when sharing the processed object with collaborators. To reload, first instantiate a new `DiseaseNetworkData` object with the same `study_design` and `phecode_level` that the data was created with, then call the corresponding load function (`load()` or `load_npz()`). For example:

```python
import DiNetxify as dnt

# Create a new DiseaseNetworkData object with the same design/parameters
data = dnt.DiseaseNetworkData(
    study_design='cohort',
    phecode_level=1,
)

# Load from a .pkl.gz file
data.load('/your/project/path/cohort_data')

# Or load from a .npz file
data.load_npz('/your/project/path/cohort_data')
```
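
Because `save()` and `save_npz()` append their extensions automatically, a reload script may want to check which saved file is actually present before choosing between `load()` and `load_npz()`. The following is a minimal sketch using only the standard library, assuming the same extension-free path prefix that was used when saving. `find_saved_formats` is a hypothetical helper, not part of ***DiNetxify***, and the demonstration uses an empty placeholder file rather than a real saved object.

```python
from pathlib import Path
import tempfile

def find_saved_formats(prefix):
    """Return which of '.pkl.gz' and '.npz' exist for an extension-free path prefix.

    Hypothetical helper (not a DiNetxify function).
    """
    return [ext for ext in ('.pkl.gz', '.npz') if Path(str(prefix) + ext).is_file()]

# Demonstration with a temporary directory and an empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / 'cohort_data'
    Path(str(prefix) + '.npz').touch()
    print(find_saved_formats(prefix))  # prints ['.npz']
```

If `'.pkl.gz'` appears in the result, `data.load(...)` with the same prefix is the matching call; if only `'.npz'` is present, use `data.load_npz(...)`. An empty list means nothing has been saved under that prefix.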