API Reference
Below is a concise reference for DiNetxify’s classes and functions, summarizing their signatures and parameters. This is useful when writing your own scripts or if you need to quickly recall how to call a function.
Class DiseaseNetworkData
class DiseaseNetworkData(
study_design: str = 'cohort',
phecode_level: int = 1,
min_required_icd_codes: int = 1,
date_fmt: str = '%Y-%m-%d',
phecode_version: str = '1.2'
)
A class for creating and operating on disease network data within the DiNetxify module.
Parameters:
study_design (str): The study design, one of 'cohort', 'matched cohort', or 'exposed-only cohort'. Defaults to 'cohort'.
phecode_level (int): The phecode level to use for analysis. Level 1 (585 medical conditions in total) corresponds to 3-digit ICD-10 codes, and level 2 (1257 medical conditions in total) to 4-digit ICD-10 codes. Level 2 phecodes offer a more granular analysis, with potentially smaller sample sizes per disease category. For larger studies, level 2 phecodes may enhance result interpretation. For smaller studies, level 1 is recommended to maintain statistical power. Defaults to 1.
min_required_icd_codes (int): The minimum number of ICD codes mapping to a specific phecode required for the phecode to be considered valid. For example, if set to 2, a single diagnosis record is not sufficient to count as an occurrence. When using this parameter, ensure that your medical records are complete (i.e., not limited to the first occurrence of each code). Defaults to 1.
date_fmt (str): The format of the date fields in your phenotype and medical record data. Defaults to '%Y-%m-%d'.
phecode_version (str): The version of the phecode system used for converting diagnosis codes. Version 1.2 is the official version of the phecode system, with mapping files available for ICD-9-CM, ICD-9-WHO, ICD-10-CM, and ICD-10-WHO codes. Version 1.3a is also provided, but it is unofficial and not recommended for general use. Defaults to '1.2'.
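To make the effect of min_required_icd_codes concrete, here is a minimal sketch of the counting rule described above. It assumes the diagnosis records have already been mapped to phecodes, and valid_phecodes is a hypothetical helper written for this illustration, not a DiNetxify function.

```python
from collections import Counter


def valid_phecodes(records, min_required_icd_codes=1):
    """Return the (participant, phecode) pairs with enough supporting records.

    `records` is an iterable of (participant_id, phecode) tuples, one per
    diagnosis record. Hypothetical helper, not part of DiNetxify.
    """
    counts = Counter(records)
    return {key for key, n in counts.items() if n >= min_required_icd_codes}


# Toy records: P1 has two records for phecode 250 and one for 401.
records = [("P1", "250"), ("P1", "250"), ("P1", "401"), ("P2", "250")]

# With a minimum of 2, only pairs backed by at least two records remain.
print(sorted(valid_phecodes(records, min_required_icd_codes=2)))
```

With the default of 1, every recorded pair counts as an occurrence, which is why incomplete records (first occurrence only) are safe only at that setting.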
Instance Methods
phenotype_data
phenotype_data(
self,
phenotype_data_path: str,
column_names: dict,
covariates: list,
is_single_sex: bool = False,
force: bool = False
) -> None
Load phenotype data into the object.
Parameters:
phenotype_data_path (str): Path to a CSV/TSV phenotype file with a header row.
column_names (dict): Mapping of dataset column names. Required keys: 'Participant ID', 'Index date', 'End date', 'Exposure', 'Sex', 'Match ID'.
covariates (list): List of additional covariate names (e.g., ['age', 'BMI']).
is_single_sex (bool): True if the dataset contains only one sex. Defaults to False.
force (bool): If True, overwrite existing data attributes. Defaults to False.
Returns:
None
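Before calling phenotype_data, it can help to check that the column_names mapping covers every key listed above. The sketch below assumes the dictionary maps each required key to the matching column in your file. The column names on the right-hand side and the missing_keys helper are made up for this illustration.

```python
# Keys listed as required in the reference above.
REQUIRED_KEYS = [
    "Participant ID", "Index date", "End date", "Exposure", "Sex", "Match ID",
]


def missing_keys(column_names, required=REQUIRED_KEYS):
    """Return the required keys that are absent from `column_names`."""
    return [key for key in required if key not in column_names]


# Hypothetical dataset column names on the right-hand side.
column_names = {
    "Participant ID": "eid",
    "Index date": "date_start",
    "End date": "date_end",
    "Exposure": "exposed",
    "Sex": "sex",
    "Match ID": "group_id",
}

print(missing_keys(column_names))            # nothing missing
print(missing_keys({"Participant ID": "eid"}))  # the five other keys
```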
Table1
Table1(
self,
continuous_stat_mode: str = 'auto'
) -> pd.DataFrame
Generate a descriptive summary table of phenotype data.
Parameters:
continuous_stat_mode (str): Method for continuous variable statistics. Defaults to 'auto'. Choices:
auto: Automatic normality-based choice.
normal: Mean and standard deviation.
nonnormal: Median and interquartile range.
Returns:
pd.DataFrame
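The difference between the 'normal' and 'nonnormal' modes can be sketched with the standard library. This is an illustration of the two summaries only. The summarize helper is hypothetical, and the 'auto' mode is left out because the reference does not say which normality rule it applies.

```python
import statistics


def summarize(values, mode="normal"):
    """Summarize a continuous variable in one of two styles.

    'normal'    -> (mean, standard deviation)
    'nonnormal' -> (median, (first quartile, third quartile))
    Hypothetical helper, not part of DiNetxify.
    """
    if mode == "normal":
        return statistics.mean(values), statistics.stdev(values)
    if mode == "nonnormal":
        q1, median, q3 = statistics.quantiles(values, n=4)
        return median, (q1, q3)
    raise ValueError("mode must be 'normal' or 'nonnormal'")


# A skewed toy variable: one extreme value pulls the mean far from the median.
values = [1, 2, 3, 4, 100]
print(summarize(values, "normal"))
print(summarize(values, "nonnormal"))
```

For skewed variables such as this one, the median and interquartile range describe the bulk of the data better than the mean and standard deviation.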
merge_medical_records
merge_medical_records(
self,
medical_records_data_path: str,
diagnosis_code: str,
column_names: dict,
date_fmt: str = None,
chunksize: int = 1000000
) -> None
Load one or more medical record datasets.
Parameters:
medical_records_data_path (str): Path to a CSV/TSV medical record file.
diagnosis_code (str): Diagnosis code type: 'ICD-9-CM', 'ICD-9-WHO', 'ICD-10-CM', or 'ICD-10-WHO'.
column_names (dict): Mapping for dataset columns. Required keys: 'Participant ID', 'Diagnosis code', 'Date of diagnosis'.
date_fmt (str): Date format of the medical record file. If None, the phenotype data format is used. Defaults to None.
chunksize (int): Rows per chunk for large files. Defaults to 1000000.
Returns:
None
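The date_fmt strings use the standard Python strftime/strptime directives. The short sketch below shows how the same calendar date is parsed under two formats. parse_date is a helper written for this example, and how DiNetxify parses dates internally is not stated in this reference.

```python
from datetime import datetime


def parse_date(value, date_fmt="%Y-%m-%d"):
    """Parse a date string with a strptime-style format (illustrative helper)."""
    return datetime.strptime(value, date_fmt).date()


# The same day written in ISO order and in day/month/year order.
iso_date = parse_date("2019-03-07")
dmy_date = parse_date("07/03/2019", date_fmt="%d/%m/%Y")
print(iso_date, dmy_date, iso_date == dmy_date)
```

If a medical record file uses a different format from the phenotype file, passing an explicit date_fmt avoids silently swapping day and month.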
get_attribute
get_attribute(
self,
attr_name: str
) -> any
Retrieve the value of a private or protected attribute.
Parameters:
attr_name
(str
): Name of the attribute to retrieve.
Returns:
Attribute value (any)
medical_records_to_dataframe
medical_records_to_dataframe(
self,
phecode_list: list,
medical_history: bool = False
) -> pd.DataFrame
Convert the stored medical records into a tidy pandas DataFrame.
Parameters:
phecode_list (list): List of phecodes to extract from the medical records. Only phecodes valid for the current phecode_level are accepted.
medical_history (bool): If True, include a binary history column for each phecode. Defaults to False.
Returns:
pd.DataFrame
modify_phecode_level
modify_phecode_level(
self,
phecode_level: int
) -> None
Update the phecode level setting.
Parameters:
phecode_level
(int
): New phecode level (1 or 2).
Returns:
None
disease_pair
disease_pair(
self,
phewas_result: pd.DataFrame,
min_interval_days: int = 0,
max_interval_days: float = float('inf'),
force: bool = False,
n_process: int = 1,
**kwargs
) -> None
Construct temporal and non-temporal disease pairs.
Parameters:
phewas_result (pd.DataFrame): DataFrame from phewas().
min_interval_days (int): Minimum days between diagnoses. Defaults to 0.
max_interval_days (float): Maximum days between diagnoses. Defaults to inf.
force (bool): Overwrite existing data. Defaults to False.
n_process (int): Number of parallel processes. Defaults to 1.
**kwargs: Additional mappings:
phecode_col (str): Column for phecode. Defaults to 'phecode'.
significance_col (str): Column for significance. Defaults to 'phewas_p_significance'.
Returns:
None
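The role of the two interval parameters can be sketched as follows. The exact boundary rule is an assumption made for this example (a pair counts as temporal when the gap is strictly greater than min_interval_days and at most max_interval_days), and classify_pair is a hypothetical helper, not a DiNetxify function.

```python
from datetime import date


def classify_pair(date_d1, date_d2, min_interval_days=0,
                  max_interval_days=float("inf")):
    """Label a pair of diagnosis dates as 'temporal' or 'non-temporal'.

    Assumed rule (not confirmed by the reference): temporal when
    min_interval_days < gap <= max_interval_days, otherwise non-temporal.
    """
    gap = abs((date_d2 - date_d1).days)
    if min_interval_days < gap <= max_interval_days:
        return "temporal"
    return "non-temporal"


# Same-day diagnoses carry no ordering information under the defaults.
print(classify_pair(date(2020, 1, 1), date(2020, 1, 1)))
# A 60-day gap falls inside a 30 to 365 day window.
print(classify_pair(date(2020, 1, 1), date(2020, 3, 1), 30, 365))
# A two-year gap falls outside the same window.
print(classify_pair(date(2020, 1, 1), date(2022, 1, 1), 30, 365))
```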
save
save(
self,
file: str
) -> None
Save object state to a gzip-compressed pickle file (.pkl.gz).
Parameters:
file (str): Filename or prefix (adds .pkl.gz).
Returns:
None
load
load(
self,
file: str,
force: bool = False
) -> None
Load object state from a gzip-compressed pickle file.
Parameters:
file (str): Filename or prefix (adds .pkl.gz).
force (bool): Overwrite if True. Defaults to False.
Returns:
None
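For readers unfamiliar with the .pkl.gz format, the round trip below shows a gzip-compressed pickle written and read back with the standard library. It is a sketch of the file format only, not the implementation behind save and load. save_state, load_state, and the toy state dictionary are assumptions for this example.

```python
import gzip
import os
import pickle
import tempfile


def save_state(obj, prefix):
    """Pickle `obj` into a gzip file, appending '.pkl.gz' when it is missing."""
    path = prefix if prefix.endswith(".pkl.gz") else prefix + ".pkl.gz"
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_state(path):
    """Read back an object written by save_state."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# A toy stand-in for object state.
state = {"phecode_level": 1, "date_fmt": "%Y-%m-%d"}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_state(state, os.path.join(tmp_dir, "dn_data"))
    restored = load_state(saved_path)

print(restored == state)
```

As with any pickle file, only load files from sources you trust.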
save_npz
save_npz(
self,
file: str
) -> None
Save object state to a NumPy .npz file.
Parameters:
file (str): Filename or prefix (adds .npz).
Returns:
None
load_npz
load_npz(
self,
file: str,
force: bool = False
) -> None
Load object state from a NumPy .npz file.
Parameters:
file (str): Filename or prefix (adds .npz).
force (bool): Overwrite if True. Defaults to False.
Returns:
None
Analysis Functions
Function: disease_network_pipeline
disease_network_pipeline(
data: DiseaseNetworkData,
n_process: int,
n_threshold_phewas: int,
n_threshold_comorbidity: int,
output_dir: str,
project_prefix: str,
keep_positive_associations: bool = False,
save_intermediate_data: bool = False,
system_exl: list = None,
pipeline_mode: str = 'v1',
method: str = 'RPCN',
covariates: list = None,
matching_var_dict: dict = {'sex':'exact'},
matching_n: int = 2,
min_interval_days: int = 0,
max_interval_days: float = float('inf'),
enforce_temporal_order: bool = False,
correction: str = 'bonferroni',
cutoff: float = 0.05,
**kwargs
) -> dict
Parameters:
data (DiseaseNetworkData): The DiseaseNetworkData object.
n_process (int): Number of parallel processes to use. Required.
n_threshold_phewas (int): Minimum cases in the exposed group for PheWAS inclusion. Passed to phewas().
n_threshold_comorbidity (int): Minimum co-occurrences for comorbidity strength. Passed to comorbidity_strength().
output_dir (str): Directory path for pipeline outputs.
project_prefix (str): Prefix for naming outputs.
keep_positive_associations (bool): Retain only positive associations. Defaults to False.
save_intermediate_data (bool): Save intermediate data objects. Defaults to False.
system_exl (list): Phecode systems to exclude. Defaults to None.
pipeline_mode (str): Analysis order mode ('v1' or 'v2'). Defaults to 'v1'.
method (str): Comorbidity network / trajectory method ('RPCN', 'PCN_PCA', 'CN'). Defaults to 'RPCN'.
covariates (list): Covariates for models. Defaults to None.
matching_var_dict (dict): Matching variables and criteria. Defaults to {'sex':'exact'}.
matching_n (int): Number of matched controls per case. Defaults to 2.
min_interval_days (int): Minimum days between diagnoses. Defaults to 0.
max_interval_days (float): Maximum days between diagnoses. Defaults to inf.
enforce_temporal_order (bool): Enforce temporal order in testing. Defaults to False.
correction (str): p-value correction method. Defaults to 'bonferroni'.
cutoff (float): Significance threshold. Defaults to 0.05.
**kwargs:
alpha (float): L1 penalty weight. Defaults per method.
auto_penalty (bool): Auto-select alpha. Defaults to True.
alpha_range (tuple): Search range for alpha. Defaults to (1,15).
scaling_factor (float): Scaling factor for alpha. Defaults to 1.
n_PC (int): Number of principal components. Defaults to 5.
explained_variance (float): Variance threshold for PCs.
Returns:
dict: Summary of significant results count.
Function: phewas
phewas(
data: DiseaseNetworkData,
covariates: list = None,
proportion_threshold: float = None,
n_threshold: int = None,
n_process: int = 1,
correction: str = 'bonferroni',
cutoff: float = 0.05,
system_inc: list = None,
system_exl: list = None,
phecode_inc: list = None,
phecode_exl: list = None,
log_file: str = None,
lifelines_disable: bool = False
) -> pd.DataFrame
Parameters:
data (DiseaseNetworkData): Input data object.
covariates (list): Phenotypic covariates. Defaults to None.
proportion_threshold (float): Minimum proportion of cases. Mutually exclusive with n_threshold. Defaults to None.
n_threshold (int): Minimum case count. Mutually exclusive with proportion_threshold. Defaults to None.
n_process (int): Parallel processes. Defaults to 1.
correction (str): p-value correction method. Defaults to 'bonferroni'.
cutoff (float): Significance threshold. Defaults to 0.05.
system_inc (list): Systems to include. Defaults to None.
system_exl (list): Systems to exclude. Defaults to None.
phecode_inc (list): Specific phecodes to include. Defaults to None.
phecode_exl (list): Specific phecodes to exclude. Defaults to None.
log_file (str): Log file prefix. Defaults to None.
lifelines_disable (bool): Disable lifelines. Defaults to False.
Returns:
pd.DataFrame: PheWAS results.
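Because proportion_threshold and n_threshold are mutually exclusive, a caller supplies at most one of them. The sketch below shows one way such a rule can be expressed. passes_threshold is a hypothetical helper, and treating "neither given" as "no filtering" is an assumption, since the reference does not describe that case.

```python
def passes_threshold(n_cases, n_exposed, proportion_threshold=None,
                     n_threshold=None):
    """Check whether a disease has enough cases in the exposed group.

    Exactly one of the two thresholds may be supplied. When neither is
    given, every disease passes (an assumption made for this sketch).
    """
    if proportion_threshold is not None and n_threshold is not None:
        raise ValueError(
            "proportion_threshold and n_threshold are mutually exclusive")
    if proportion_threshold is not None:
        return n_cases / n_exposed >= proportion_threshold
    if n_threshold is not None:
        return n_cases >= n_threshold
    return True


# 50 cases among 200 exposed individuals is exactly a proportion of 0.25.
print(passes_threshold(50, 200, proportion_threshold=0.25))
# 49 cases fall short of a count threshold of 50.
print(passes_threshold(49, 200, n_threshold=50))
```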
Function: phewas_multipletests
phewas_multipletests(
df: pd.DataFrame,
correction: str = 'bonferroni',
cutoff: float = 0.05
) -> pd.DataFrame
Parameters:
df (pd.DataFrame): Input results from phewas().
correction (str): p-value correction method. Defaults to 'bonferroni'.
cutoff (float): Significance threshold. Defaults to 0.05.
Returns:
pd.DataFrame: Adjusted results.
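The correction methods named throughout this reference are those of statsmodels.stats.multitest.multipletests. To show what two of them do, here is a dependency-free sketch of the Bonferroni and Benjamini/Hochberg adjustments. These are plain reimplementations for illustration, not the code DiNetxify runs, and the p-values are toy numbers.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


def fdr_bh(p_values):
    """Benjamini/Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.20]
cutoff = 0.05
print([p <= cutoff for p in bonferroni(p_values)])
print([p <= cutoff for p in fdr_bh(p_values)])
```

Bonferroni controls the family-wise error rate and is the stricter of the two in general, while fdr_bh controls the false discovery rate and usually retains more associations in large PheWAS scans.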
Function: comorbidity_strength
comorbidity_strength(
data: DiseaseNetworkData,
proportion_threshold: float = None,
n_threshold: int = None,
n_process: int = 1,
log_file: str = None,
correction_phi: str = 'bonferroni',
cutoff_phi: float = 0.05,
correction_RR: str = 'bonferroni',
cutoff_RR: float = 0.05
) -> pd.DataFrame
Parameters:
data (DiseaseNetworkData): DiseaseNetworkData object.
proportion_threshold (float): The minimum proportion of individuals in the exposed group in which a disease pair must co-occur (temporal or non-temporal) to be included in the comorbidity strength estimation. If the proportion of co-occurrence is below this threshold, the disease pair is excluded from the analysis. proportion_threshold and n_threshold are mutually exclusive.
n_threshold (int): The minimum number of individuals in the exposed group in which a disease pair must co-occur (temporal or non-temporal) to be included in the comorbidity strength estimation. If the number of co-occurrences is below this threshold, the disease pair is excluded from the analysis. n_threshold and proportion_threshold are mutually exclusive.
n_process (int, default=1): Specifies the number of parallel processes to use for the analysis. Multiprocessing is enabled when n_process is set to a value greater than one.
correction_phi (str, default='bonferroni'): Method for phi-correlation p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff_phi (float, default=0.05): The significance threshold for adjusted phi-correlation p-values.
correction_RR (str, default='bonferroni'): Method for RR p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff_RR (float, default=0.05): The significance threshold for adjusted RR p-values.
log_file (str, default=None): Path and prefix for the text file where the log will be recorded. If None, the log is written to the temporary files directory with the file prefix DiseaseNet_com_strength_.
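The two comorbidity strength measures referred to above, the phi-correlation and the relative risk (RR), can be computed from four counts. The formulas below are the definitions commonly used in comorbidity network studies. Whether DiNetxify computes them in exactly this way is an assumption, and the counts are toy numbers.

```python
from math import sqrt


def relative_risk(n_both, n_d1, n_d2, n_total):
    """Observed co-occurrence divided by that expected under independence."""
    return n_both * n_total / (n_d1 * n_d2)


def phi_correlation(n_both, n_d1, n_d2, n_total):
    """Pearson correlation between two binary disease indicators."""
    numerator = n_both * n_total - n_d1 * n_d2
    denominator = sqrt(n_d1 * n_d2 * (n_total - n_d1) * (n_total - n_d2))
    return numerator / denominator


# Toy counts: 1000 individuals, 100 with D1, 200 with D2, 40 with both.
rr = relative_risk(40, 100, 200, 1000)
phi = phi_correlation(40, 100, 200, 1000)
print(rr, phi)
```

An RR above 1 and a positive phi both indicate that the two diseases co-occur more often than independence would predict.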
Function: comorbidity_strength_multipletests
comorbidity_strength_multipletests(
df: pd.DataFrame,
correction_phi: str = 'bonferroni',
cutoff_phi: float = 0.05,
correction_RR: str = 'bonferroni',
cutoff_RR: float = 0.05
) -> pd.DataFrame
Parameters:
df (pd.DataFrame): DataFrame containing the results from the comorbidity_strength function.
correction_phi (str, default='bonferroni'): Method for phi-correlation p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff_phi (float, default=0.05): The significance threshold for adjusted phi-correlation p-values.
correction_RR (str, default='bonferroni'): Method for RR p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff_RR (float, default=0.05): The significance threshold for adjusted RR p-values.
Function: binomial_test
binomial_test(
data: DiseaseNetworkData,
comorbidity_strength_result: pd.DataFrame,
comorbidity_network_result: pd.DataFrame = None,
n_process: int = 1,
log_file: str = None,
correction: str = 'bonferroni',
cutoff: float = 0.05,
enforce_temporal_order: bool = False,
**kwargs
) -> pd.DataFrame
Parameters:
data (DiseaseNetworkData): DiseaseNetworkData object.
comorbidity_strength_result (pd.DataFrame): DataFrame containing comorbidity strength analysis results produced by the DiNetxify.comorbidity_strength function.
comorbidity_network_result (pd.DataFrame, default=None): DataFrame containing comorbidity network analysis results produced by the DiNetxify.comorbidity_network function. When provided, the binomial test is limited to disease pairs deemed significant in the comorbidity network analysis.
n_process (int, default=1): Multiprocessing is disabled for this analysis.
correction (str, default='bonferroni'): Method for binomial p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff (float, default=0.05): The significance threshold for adjusted binomial p-values.
log_file (str, default=None): Path and prefix for the text file where the log will be recorded. If None, the log is written to the temporary files directory with the file prefix DiseaseNet_binomial_test_.
enforce_temporal_order (bool, default=False): If True, exclude individuals with a non-temporal D1-D2 pair when performing the test. If False, include all individuals, including those with a non-temporal D1-D2 pair.
**kwargs:
phecode_d1_col (str, default='phecode_d1'): Name of the column in comorbidity_strength_result and comorbidity_network_result that specifies the phecode identifiers for disease 1 of the disease pair.
phecode_d2_col (str, default='phecode_d2'): Name of the column in comorbidity_strength_result and comorbidity_network_result that specifies the phecode identifiers for disease 2 of the disease pair.
n_nontemporal_col (str, default='n_d1d2_nontemporal'): Name of the column in comorbidity_strength_result that specifies the number of individuals with a non-temporal d1-d2 disease pair.
n_temporal_d1d2_col (str, default='n_d1d2_temporal'): Name of the column in comorbidity_strength_result that specifies the number of individuals with a temporal d1->d2 disease pair.
n_temporal_d2d1_col (str, default='n_d2d1_temporal'): Name of the column in comorbidity_strength_result that specifies the number of individuals with a temporal d2->d1 disease pair.
significance_phi_col (str, default='phi_p_significance'): Name of the column in comorbidity_strength_result that indicates the significance of phi-correlation for each disease pair.
significance_RR_col (str, default='RR_p_significance'): Name of the column in comorbidity_strength_result that indicates the significance of RR for each disease pair.
significance_coef_col (str, default='comorbidity_p_significance'): Name of the column in comorbidity_network_result that indicates the significance of comorbidity network analysis for each disease pair.
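As an illustration of what a binomial test on temporal direction can look like, the sketch below computes an exact two-sided p-value for the d1->d2 count against a null probability of 0.5. That this is the comparison binomial_test performs is an assumption, since the reference only names the count columns involved. binomial_two_sided_p is a hypothetical helper and the counts are toy numbers.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of every outcome that is no more likely than the
    observed count k (with a small tolerance for floating-point ties).
    """
    probabilities = [
        comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)
    ]
    observed = probabilities[k]
    total = sum(q for q in probabilities if q <= observed * (1 + 1e-7))
    return min(1.0, total)


# Toy counts: 30 individuals with d1->d2 and 10 with d2->d1.
n_d1d2_temporal, n_d2d1_temporal = 30, 10
p_value = binomial_two_sided_p(
    n_d1d2_temporal, n_d1d2_temporal + n_d2d1_temporal)
print(p_value)
```

A small p-value suggests that one ordering of the two diagnoses is more common than the other, which is the evidence a trajectory analysis needs before testing a directed D1→D2 association.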
Function: binomial_multipletests
binomial_multipletests(
df: pd.DataFrame,
correction: str = 'bonferroni',
cutoff: float = 0.05
) -> pd.DataFrame
Parameters:
df (pd.DataFrame): DataFrame containing the results from the binomial_test function.
correction (str, default='bonferroni'): Method for binomial p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff (float, default=0.05): The significance threshold for adjusted binomial p-values.
Function: comorbidity_network
comorbidity_network(
data: DiseaseNetworkData,
comorbidity_strength_result: pd.DataFrame,
binomial_test_result: pd.DataFrame = None,
method: str = 'RPCN',
covariates: list = None,
n_process: int = 1,
log_file: str = None,
correction: str = 'bonferroni',
cutoff: float = 0.05,
**kwargs
) -> pd.DataFrame
Parameters:
data (DiseaseNetworkData): DiseaseNetworkData object.
comorbidity_strength_result (pd.DataFrame): DataFrame containing comorbidity strength analysis results produced by the DiNetxify.comorbidity_strength function.
binomial_test_result (pd.DataFrame, default=None): DataFrame containing binomial test analysis results produced by the DiNetxify.binomial_test function.
method (str, default='RPCN'): Specifies the comorbidity network analysis method to use. Choices are:
'RPCN': Regularized Partial Correlation Network.
'PCN_PCA': Partial Correlation Network with PCA.
'CN': Correlation Network.
Additional options for RPCN:
alpha: non-negative scalar. The weight multiplying the l1 penalty term for other diseases covariates. Ignored if auto_penalty is enabled.
auto_penalty: bool, default=True. If True, automatically determine the optimal alpha based on model AIC value.
alpha_range: tuple, default=(1,15). When auto_penalty is True, search the optimal alpha in this range.
scaling_factor: positive scalar, default=1. The scaling factor for the alpha when auto_penalty is True.
Additional options for PCN_PCA:
n_PC: int, default=5. Fixed number of principal components to include in each model.
explained_variance: float. Determines the number of principal components based on the cumulative explained variance. Overrides n_PC if specified.
covariates (list, default=None): List of phenotypic covariates to include in the model. By default, includes ['sex'] and all covariates specified in the DiNetxify.DiseaseNetworkData.phenotype_data() function. To include the required variable sex as a covariate, always use 'sex' instead of its original column name. For other covariates specified in the DiNetxify.DiseaseNetworkData.phenotype_data() function, use their original column names.
n_process (int, default=1): Specifies the number of parallel processes to use for the analysis. Multiprocessing is enabled when n_process is set to a value greater than one.
correction (str, default='bonferroni'): Method for comorbidity network analysis p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff (float, default=0.05): The significance threshold for adjusted comorbidity network analysis p-values.
log_file (str, default=None): Path and prefix for the text file where the log will be recorded. If None, the log is written to the temporary files directory with the file prefix DiseaseNet_comorbidity_network_.
**kwargs:
phecode_d1_col (str, default='phecode_d1'): Name of the column in comorbidity_strength_result and binomial_test_result that specifies the phecode identifiers for disease 1 of the disease pair.
phecode_d2_col (str, default='phecode_d2'): Name of the column in comorbidity_strength_result and binomial_test_result that specifies the phecode identifiers for disease 2 of the disease pair.
significance_phi_col (str, default='phi_p_significance'): Name of the column in comorbidity_strength_result that indicates the significance of phi-correlation for each disease pair.
significance_RR_col (str, default='RR_p_significance'): Name of the column in comorbidity_strength_result that indicates the significance of RR for each disease pair.
significance_binomial_col (str, default='binomial_p_significance'): Name of the column in binomial_test_result that indicates the significance of the binomial test for each disease pair.
alpha (non-negative scalar): The weight multiplying the l1 penalty term for other diseases covariates. Ignored if auto_penalty is enabled.
auto_penalty (bool, default=True): If True, automatically determines the best alpha based on model AIC value.
alpha_range (tuple, default=(1,15)): When auto_penalty is True, search the optimal alpha in this range.
scaling_factor (positive scalar, default=1): The scaling factor for the alpha when auto_penalty is True.
n_PC (int, default=5): Fixed number of principal components to include in each model.
explained_variance (float): Cumulative explained variance threshold to determine the number of principal components. Overrides n_PC if specified.
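The relationship between explained_variance and n_PC can be sketched as picking the smallest number of components whose cumulative explained variance reaches the threshold. That selection rule is an assumption made for this example. The variance ratios are toy numbers and n_components_for is a hypothetical helper, not part of DiNetxify.

```python
from itertools import accumulate


def n_components_for(variance_ratios, explained_variance):
    """Smallest number of leading components reaching the variance threshold.

    `variance_ratios` lists the explained variance ratio of each principal
    component in decreasing order. Falls back to all components when the
    threshold is never reached.
    """
    for count, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= explained_variance:
            return count
    return len(variance_ratios)


# Toy explained variance ratios that sum to 1.
variance_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(n_components_for(variance_ratios, explained_variance=0.8))
```

With a fixed n_PC every model uses the same number of components, whereas a variance threshold lets the number adapt to how concentrated the variance is.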
Function: comorbidity_multipletests
comorbidity_multipletests(
df: pd.DataFrame,
correction: str = 'bonferroni',
cutoff: float = 0.05
) -> pd.DataFrame
Parameters:
df (pd.DataFrame): DataFrame containing the results from the comorbidity_network function.
correction (str, default='bonferroni'): Method for comorbidity network analysis p-value correction, as implemented in statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step-down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff (float, default=0.05): The significance threshold for adjusted comorbidity network analysis p-values.
Function: disease_trajectory
disease_trajectory(
data: DiseaseNetworkData,
comorbidity_strength_result: pd.DataFrame,
binomial_test_result: pd.DataFrame,
method: str = 'RPCN',
matching_var_dict: dict = {'sex':'exact'},
matching_n: int = 2,
max_n_cases: float = np.inf,
global_sampling: bool = False,
covariates: list = None,
n_process: int = 1,
log_file: str = None,
correction: str = 'bonferroni',
cutoff: float = 0.05,
**kwargs
) -> pd.DataFrame
Parameters:
data
(DiseaseNetworkData): DESCRIPTION.comorbidity_strength_result
(pd.DataFrame): DataFrame containing comorbidity strength analysis results produced by theDiNetxify.comorbidity_strength()
function.binomial_test_result
(pd.DataFrame): DataFrame containing binomial test analysis results produced by theDiNetxify.binomial_test()
function.method
(str, default=’RPCN’): Specifies the comorbidity network analysis method to use. Choices are:'RPCN'
: Regularized Partial Correlation Network.'PCN_PCA'
: Partial Correlation Network with PCA.'CN'
: Correlation Network.
matching_var_dict
(dict, default={‘sex’:’exact’}): Specifies the matching variables and the criteria used for incidence density sampling. For categorical and binary variables, the matching criteria should always be'exact'
. For continuous variables, provide a scalar greater than 0 as the matching criterion, indicating the maximum allowed difference when matching. To include the required variable sex as a matching variable, always use'sex'
instead of its original column name. For other covariates specified in theDiNetxify.DiseaseNetworkData.phenotype_data()
function, use their original column names.matching_n
(int, default=2): Specifies the maximum number of matched controls for each case.max_n_cases
(int, default=np.inf): Specifies the maximum number of D2 cases to include in the analysis. If the number of D2 cases exceeds this value, a random sample of cases will be selected.global_sampling
(bool, default=False): Indicates whether to perform independent incidence density sampling for each D1→D2 pair (if False), or to perform a single incidence density sampling for all Dx→D2 pairs with separate regression models for each D1→D2 pair (if True). Global sampling is recommended when processing large datasets, though it might reduce result heterogeneity.covariates
(list, default=None): List of phenotypic covariates to include in the model. By default, includes all covariates specified in theDiNetxify.DiseaseNetworkData.phenotype_data()
function. Categorical and binary variables used for matching should not be included as covariates. Continuous variables used for matching can be included as covariates, but caution is advised. To include the required variable sex as a covariate, always usesex
instead of its original column name. For other covariates specified in theDiNetxify.DiseaseNetworkData.phenotype_data()
function, use their original column names.n_process
(int, default=1): Specifies the number of parallel processes to use for the analysis. Multiprocessing is enabled whenn_process
is set to a value greater than one.correction
(str, default=’bonferroni’): Method for binomial p-value correction from the statsmodels.stats.multitest.multipletests.Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative) fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff (float, default=0.05): The significance threshold for adjusted comorbidity network analysis p-values.
log_file (str, default=None): Path and prefix for the text file where the log will be recorded. If None, the log is written to the temporary files directory with the file prefix DiseaseNet_trajectory_.
**kwargs
Analysis option
enforce_time_interval (bool, default=True): If True, applies the specified minimum and maximum time intervals when determining the D2 outcome among individuals diagnosed with D1. Define these time interval requirements with the DiNetxify.DiseaseNetworkData.disease_pair() function.
phecode_d1_col (str, default='phecode_d1'): Name of the column in comorbidity_strength_result and binomial_test_result that specifies the phecode identifiers for disease 1 of the disease pair.
phecode_d2_col (str, default='phecode_d2'): Name of the column in comorbidity_strength_result and binomial_test_result that specifies the phecode identifiers for disease 2 of the disease pair.
significance_phi_col (str, default='phi_p_significance'): Name of the column in comorbidity_strength_result that indicates the significance of the phi-correlation for each disease pair.
significance_RR_col (str, default='RR_p_significance'): Name of the column in comorbidity_strength_result that indicates the significance of the RR for each disease pair.
significance_binomial_col (str, default='binomial_p_significance'): Name of the column in binomial_test_result that indicates the significance of the binomial test for each disease pair.
alpha (non-negative scalar): The weight multiplying the L1 penalty term for the covariates representing other diseases. Ignored if auto_penalty is enabled.
auto_penalty (bool, default=True): If True, automatically determines the best alpha based on the model AIC value.
alpha_range (tuple, default=(1, 15)): When auto_penalty is True, the range in which to search for the optimal alpha.
scaling_factor (positive scalar, default=1): The scaling factor for alpha when auto_penalty is True.
n_PC (int, default=5): Fixed number of principal components to include in each model.
explained_variance (float): Cumulative explained variance threshold used to determine the number of principal components. Overrides n_PC if specified.
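The interplay between n_PC and explained_variance can be illustrated with a small sketch. It assumes the threshold is applied to the cumulative explained-variance ratios of the components, sorted in descending order, which is a common convention. The helper n_components_for and the example ratios are hypothetical; this is not DiNetxify's internal code.

```python
from itertools import accumulate


def n_components_for(variance_ratios, explained_variance=None, n_PC=5):
    """Return how many principal components to keep.

    Assumption: `variance_ratios` holds per-component explained-variance
    ratios in descending order. When `explained_variance` is given it takes
    precedence over `n_PC`, mirroring the documented override.
    """
    if explained_variance is None:
        # Fixed number of components, capped by what is available.
        return min(n_PC, len(variance_ratios))
    for k, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= explained_variance:
            return k
    # Threshold never reached: keep every component.
    return len(variance_ratios)


ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]  # made-up example values
fixed = n_components_for(ratios)  # uses n_PC
by_threshold = n_components_for(ratios, explained_variance=0.75)  # overrides n_PC
```

With these made-up ratios, the first three components together exceed the 0.75 threshold, so the threshold-based choice keeps fewer components than the fixed n_PC setting.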
Function: trajectory_multipletests
trajectory_multipletests(
df: pd.DataFrame,
correction: str = 'bonferroni',
cutoff: float = 0.05
) -> pd.DataFrame
Parameters:
df (pd.DataFrame): DataFrame containing the results from the disease_trajectory function.
correction (str, default='bonferroni'): Method for binomial p-value correction, taken from statsmodels.stats.multitest.multipletests. Available methods are:
none : no correction
bonferroni : one-step correction
sidak : one-step correction
holm-sidak : step down method using Sidak adjustments
holm : step-down method using Bonferroni adjustments
simes-hochberg : step-up method (independent)
hommel : closed method based on Simes tests (non-negative)
fdr_bh : Benjamini/Hochberg (non-negative)
fdr_by : Benjamini/Yekutieli (negative)
fdr_tsbh : two stage fdr correction (non-negative)
fdr_tsbky : two stage fdr correction (non-negative)
See https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html for more details.
cutoff (float, default=0.05): The significance threshold for adjusted binomial p-values.
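To make the effect of correction and cutoff concrete, the sketch below re-implements two of the listed methods (bonferroni and fdr_bh) in plain Python and flags p-values whose adjusted value falls at or below the cutoff. It is an illustration under stated assumptions, not the library's code: the p-values are made up, and whether DiNetxify compares with <= or < is not documented here.

```python
def bonferroni(p_values):
    """One-step Bonferroni adjustment: multiply by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def fdr_bh(p_values):
    """Benjamini/Hochberg step-up adjustment (adjusted p-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.001, 0.02, 0.04, 0.30]  # made-up raw p-values
cutoff = 0.05

# Assumption: a pair is flagged significant when adjusted p <= cutoff.
sig_bonf = [q <= cutoff for q in bonferroni(p)]
sig_bh = [q <= cutoff for q in fdr_bh(p)]
```

On this toy input the FDR-based method flags one more p-value than Bonferroni, which is the usual trade-off between the two families of corrections.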
Class Plot
class Plot(
comorbidity_result: pd.DataFrame,
trajectory_result: pd.DataFrame,
phewas_result: pd.DataFrame,
exposure_name: str = None,
exposure_location: Tuple[float, float, float] = None,
exposure_size: float = None,
source: str = 'phecode_d1',
target: str = 'phecode_d2',
phewas_phecode: str = 'phecode',
phewas_number: str = 'N_cases_exposed',
system_col: str = 'system',
col_disease_pair: str = 'name_disease_pair',
filter_phewas_col: str = 'phewas_p_significance',
filter_comorbidity_col: str = 'comorbidity_p_significance',
filter_trajectory_col: str = 'trajectory_p_significance',
)
A class for integrating and visualizing disease relationships from PheWAS, comorbidity network, and trajectory analyses.
Constructor Parameters:
comorbidity_result (pd.DataFrame): Non-temporal disease pairs with association metrics and a significance flag.
trajectory_result (pd.DataFrame): Temporal disease pairs (source→target) with metrics and a significance flag.
phewas_result (pd.DataFrame): PheWAS results including phecode, effect sizes, case counts, and system classifications.
exposure_name (str, optional): Name of the exposure. Defaults to None, which indicates an exposed-only cohort study.
exposure_location (Tuple[float, float, float], optional): 3D coordinates for the exposure node. Defaults to the origin if None.
exposure_size (float, optional): Scaling factor for the exposure node size. Defaults to automatic.
source (str): Column name for the source disease (default: 'phecode_d1').
target (str): Column name for the target disease (default: 'phecode_d2').
phewas_phecode (str): Column for the phecode in the PheWAS results (default: 'phecode').
phewas_number (str): Column for case counts (default: 'N_cases_exposed').
system_col (str): Column for the disease system (default: 'system').
col_disease_pair (str): Column for the disease pair identifier (default: 'name_disease_pair').
filter_phewas_col (str): Column for the PheWAS significance filter (default: 'phewas_p_significance').
filter_comorbidity_col (str): Column for the comorbidity significance filter (default: 'comorbidity_p_significance').
filter_trajectory_col (str): Column for the trajectory significance filter (default: 'trajectory_p_significance').
**kwargs
SYSTEM (List[str], optional): List of systems to visualize; defaults to all systems in the PheWAS results if None.
COLOR (List[str], optional): Colors corresponding to the systems; a default palette is used if None.
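Because the constructor locates its inputs by column name, it can help to check the three result tables before building a Plot. The sketch below is a hypothetical pre-flight helper, not part of DiNetxify. The grouping of columns per table is an assumption inferred from the default argument values above, and col_disease_pair is left out because the reference does not say which table carries it.

```python
# Assumption: which table needs which columns is inferred from the documented
# constructor defaults; edit this mapping if you pass non-default names.
REQUIRED_COLUMNS = {
    "comorbidity_result": ["phecode_d1", "phecode_d2", "comorbidity_p_significance"],
    "trajectory_result": ["phecode_d1", "phecode_d2", "trajectory_p_significance"],
    "phewas_result": ["phecode", "N_cases_exposed", "system", "phewas_p_significance"],
}


def missing_columns(tables):
    """Map each table name to the required columns it lacks.

    `tables` maps a table name to an iterable of its column names,
    for example {"phewas_result": phewas_df.columns}.
    """
    report = {}
    for name, required in REQUIRED_COLUMNS.items():
        present = set(tables.get(name, ()))
        absent = [col for col in required if col not in present]
        if absent:
            report[name] = absent
    return report


# Toy column lists standing in for real DataFrame columns.
example = {
    "comorbidity_result": ["phecode_d1", "phecode_d2", "comorbidity_p_significance"],
    "trajectory_result": ["phecode_d1", "phecode_d2"],
    "phewas_result": ["phecode", "system", "N_cases_exposed", "phewas_p_significance"],
}
problems = missing_columns(example)
```

An empty result means every assumed column is present; a non-empty one typically indicates that a multiple-testing step was skipped or that a column-name argument should be passed to the constructor.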
Instance Methods
three_dimension_plot
three_dimension_plot(
self,
path: str,
max_radius: float = 180.0,
min_radius: float = 35.0,
line_color: str = 'black',
line_width: float = 1.0,
size_reduction: float = 0.5,
cluster_reduction_ratio: float = 0.4,
cluster_weight: str = 'comorbidity_beta',
layer_distance: float = 40.0,
layout_width: float = 900.0,
layout_height: float = 900.0,
font_style: str = 'Times New Roman',
font_size: float = 15.0
) -> None
Generate and save a 3D interactive HTML visualization.
Parameters:
path: File path to save the HTML visualization
max_radius: Maximum radial distance for node placement (default: 180.0)
min_radius: Minimum radial distance for node placement (default: 35.0)
line_color: Color for trajectory lines (default: 'black')
line_width: Width for trajectory lines (default: 1.0)
size_reduction: Scaling factor for node sizes (default: 0.5)
cluster_reduction_ratio: Cluster compression factor for layout (default: 0.4)
cluster_weight: Edge weight metric used for clustering (default: 'comorbidity_beta')
layer_distance: Vertical distance between layers (default: 40.0)
layout_width: Figure width in pixels (default: 900.0)
layout_height: Figure height in pixels (default: 900.0)
font_style: Font family for text elements (default: 'Times New Roman')
font_size: Base font size in points (default: 15.0)
comorbidity_network_plot
comorbidity_network_plot(
self,
path: str,
max_radius: float = 180.0,
min_radius: float = 35.0,
size_reduction: float = 0.5,
cluster_reduction_ratio: float = 0.4,
cluster_weight: str = 'comorbidity_beta',
line_width: float = 1.0,
line_color: str = 'black',
layer_distance: float = 40.0,
font_style: str = 'Times New Roman'
) -> None
Generate and save a 2D HTML visualization of the comorbidity network.
Parameters:
path: Output file path for saving the HTML visualization
max_radius: Maximum radial position for nodes (default: 180.0)
min_radius: Minimum radial position for nodes (default: 35.0)
size_reduction: Scaling factor for node sizes (default: 0.5)
cluster_reduction_ratio: Compression factor for cluster layout (default: 0.4)
cluster_weight: Edge weight metric for clustering (default: 'comorbidity_beta')
line_width: Width of comorbidity lines (default: 1.0)
line_color: Color of comorbidity lines (default: 'black')
layer_distance: Distance between concentric circles (default: 40.0)
font_style: Font family for text elements (default: 'Times New Roman')
trajectory_plot
trajectory_plot(
self,
path: str,
cluster_weight: str = 'comorbidity_beta',
source: str='phecode_d1',
target: str='phecode_d2',
dpi: float=500
) -> None
Generate and save trajectory plots per cluster as .png files.
Parameters:
path: Directory path to save output images
cluster_weight: Edge weight metric used for clustering (default: 'comorbidity_beta')
source: Column name representing source nodes (disease onset points) in trajectory data (default: 'phecode_d1')
target: Column name representing target nodes (subsequent disease points) in trajectory data (default: 'phecode_d2')
dpi: Image resolution in dots per inch for output files (default: 500)
phewas_plot
phewas_plot(
self,
path: str,
system_font_size: float=17,
disease_font_size: float=10,
col_coef: str = 'phewas_coef',
col_system: str = 'system',
col_se: str = 'phewas_se',
col_disease: str = 'disease',
is_exposure_only: bool = False,
col_exposure: str = 'N_cases_exposed',
dpi: float=200
) -> None
Create a polar bar plot visualizing disease associations across disease systems (categories).
Parameters:
path: Output file path for saving the plot
system_font_size: Font size for disease system/category labels (default: 17)
disease_font_size: Font size for disease labels (default: 10)
col_coef: Column name for effect size coefficients (default: 'phewas_coef')
col_system: Column name for disease system/category (default: 'system')
col_se: Column name for standard errors (default: 'phewas_se')
col_disease: Column name for disease names (default: 'disease')
is_exposure_only: Flag indicating an exposed-only cohort design (default: False)
col_exposure: Column name for the number of cases among the exposed (default: 'N_cases_exposed')
dpi: Image resolution in dots per inch for output files (default: 200)
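The four instance methods can be driven from a single helper, sketched below. It assumes plot is an already-constructed Plot instance and uses only keyword arguments listed in the signatures above. The function name save_all_figures, the output file names, and the .png extension for the PheWAS figure are hypothetical choices; the sketch also assumes the output directories already exist.

```python
import os


def save_all_figures(plot, out_dir):
    """Call the four documented plotting methods and return the paths used.

    Assumption: `plot` exposes three_dimension_plot, comorbidity_network_plot,
    trajectory_plot and phewas_plot with the signatures in this reference.
    """
    outputs = {
        "three_dimension_plot": os.path.join(out_dir, "network_3d.html"),
        "comorbidity_network_plot": os.path.join(out_dir, "comorbidity_network.html"),
        # trajectory_plot expects a directory, since it writes one image per cluster.
        "trajectory_plot": os.path.join(out_dir, "trajectories"),
        "phewas_plot": os.path.join(out_dir, "phewas.png"),
    }
    plot.three_dimension_plot(
        path=outputs["three_dimension_plot"],
        layout_width=900.0,
        layout_height=900.0,
    )
    plot.comorbidity_network_plot(
        path=outputs["comorbidity_network_plot"],
        line_color="black",
    )
    plot.trajectory_plot(path=outputs["trajectory_plot"], dpi=500)
    plot.phewas_plot(path=outputs["phewas_plot"], dpi=200)
    return outputs
```

A call such as save_all_figures(my_plot, "figures") would then write all four outputs under one folder, where my_plot is whatever Plot instance was built from the analysis results.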