napistu.ingestion.harmonizome
Ingestion utilities for Harmonizome datasets such as Achilles and the Replogle et al. (Cell 2022) Perturb-seq datasets.
Functions
load_harmonizome_datasets | Load multiple datasets into a nested dictionary of DataFrames.
process_harmonizome_datasets | Download and process multiple datasets with statistics.
Classes
HarmonizomeDataset | Configuration and methods for a Harmonizome dataset.
- class napistu.ingestion.harmonizome.HarmonizomeDataset(*, name: str, short_name: str, description: str, download_files: ~typing.List[str] = <factory>, custom_urls: ~typing.Dict[str, str] = <factory>, custom_loaders: ~typing.Dict[str, ~typing.Callable[[~pathlib.Path], ~pandas.core.frame.DataFrame]] = <factory>)
Bases: BaseModel
Configuration and methods for a Harmonizome dataset.
- classmethod ensure_harmonizome_dataset(dataset: HarmonizomeDataset | Dict) HarmonizomeDataset
Ensure input is a HarmonizomeDataset instance.
If a dict is provided, it must have a ‘short_name’ key.
- Parameters:
dataset (HarmonizomeDataset | Dict) – Either a HarmonizomeDataset instance or a config dict
- Returns:
A HarmonizomeDataset instance
- Return type:
HarmonizomeDataset
Examples
>>> # Pass through existing instance
>>> ds = HarmonizomeDataset(name='test', short_name='test', description='test')
>>> result = HarmonizomeDataset.ensure_harmonizome_dataset(ds)
>>> assert result is ds

>>> # Convert from dict
>>> config = {'short_name': 'test', 'name': 'Test Dataset', 'description': 'A test'}
>>> result = HarmonizomeDataset.ensure_harmonizome_dataset(config)
>>> isinstance(result, HarmonizomeDataset)
True
- classmethod from_dict(short_name: str, config: Dict) HarmonizomeDataset
Create a HarmonizomeDataset from a configuration dictionary.
- Parameters:
short_name (str) – The short name/key for the dataset
config (Dict) – Configuration dictionary with keys: name, description, download_files
- Returns:
A new dataset instance
- Return type:
HarmonizomeDataset
Examples
>>> config = {
...     'name': 'My Dataset',
...     'description': 'A test dataset',
...     'download_files': ['interactions', 'genes']
... }
>>> dataset = HarmonizomeDataset.from_dict('mydataset', config)
- classmethod validate_download_files(v: List[str]) List[str]
Ensure download_files contains valid file types.
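A minimal sketch of the kind of check such a validator might perform, using only the standard library. The allowed set is an assumption taken from the file types mentioned elsewhere on this page ('interactions', 'genes', 'attributes'). The name `check_download_files` and the error message are hypothetical; the library's actual validator is a classmethod on the Pydantic model and may behave differently.

```python
from typing import List

# Assumed set of valid file types, taken from the options listed on this page.
ALLOWED_FILE_TYPES = {"interactions", "genes", "attributes"}


def check_download_files(file_types: List[str]) -> List[str]:
    """Return file_types unchanged, or raise ValueError on unknown entries."""
    invalid = sorted(set(file_types) - ALLOWED_FILE_TYPES)
    if invalid:
        raise ValueError(f"Invalid file types: {invalid}")
    return file_types


# Valid input passes through untouched.
print(check_download_files(["interactions", "genes"]))
```

Returning the input unchanged mirrors the usual validator convention of handing the accepted value back to the caller.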
- static _default_loader(filepath: Path) DataFrame
Default file loading logic.
- download(output_dir: Path, overwrite: bool = False) Dict[str, Path]
Download all files for this dataset using napistu’s download_wget utility.
- download_and_process(output_dir: Path, overwrite: bool = False) Dict[str, any]
Download all files and process them.
- get_download_urls() Dict[str, str]
Generate download URLs, using custom URLs when provided.
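The override behavior described here can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `build_download_urls` and `URL_TEMPLATE` are hypothetical names, and the template points at a placeholder host because the real Harmonizome URL layout is not given on this page.

```python
from typing import Dict, List

# Hypothetical URL template; the real Harmonizome download layout may differ.
URL_TEMPLATE = "https://example.org/harmonizome/{short_name}/{file_type}.txt.gz"


def build_download_urls(
    short_name: str,
    download_files: List[str],
    custom_urls: Dict[str, str],
) -> Dict[str, str]:
    """Map each file type to a URL, preferring an entry in custom_urls."""
    return {
        file_type: custom_urls.get(
            file_type,
            URL_TEMPLATE.format(short_name=short_name, file_type=file_type),
        )
        for file_type in download_files
    }


urls = build_download_urls(
    "achilles",
    ["interactions", "genes"],
    custom_urls={"genes": "https://example.org/mirror/achilles_genes.txt.gz"},
)
for file_type, url in urls.items():
    print(file_type, url)
```

Only 'genes' has a custom entry in this example, so 'interactions' falls back to the template.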
- load(output_dir: Path, file_type: str) DataFrame
Load a dataset file as a DataFrame.
- Parameters:
output_dir (Path) – Directory containing the dataset files
file_type (str) – Type of file to load (e.g., ‘interactions’, ‘genes’, ‘attributes’)
- Returns:
The loaded data
- Return type:
pd.DataFrame
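Since the class carries both a `custom_loaders` mapping and a `_default_loader`, one plausible reading is that loading dispatches on `file_type` and falls back to the default. The sketch below illustrates that pattern with the standard library only: rows are plain dicts rather than a DataFrame, and `load_file`, `default_loader`, `uppercase_genes`, and the `<file_type>.tsv` naming are assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]
Loader = Callable[[Path], List[Row]]


def default_loader(filepath: Path) -> List[Row]:
    """Stand-in default loader: read a tab-separated file into a list of rows."""
    with filepath.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def load_file(output_dir: Path, file_type: str, custom_loaders: Dict[str, Loader]) -> List[Row]:
    """Use the custom loader registered for file_type, else the default loader."""
    filepath = output_dir / f"{file_type}.tsv"  # assumed file naming
    loader = custom_loaders.get(file_type, default_loader)
    return loader(filepath)


def uppercase_genes(filepath: Path) -> List[Row]:
    """Example custom loader: default parsing plus an upper-cased gene column."""
    return [{**row, "gene": row["gene"].upper()} for row in default_loader(filepath)]


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "genes.tsv").write_text("gene\tid\ntp53\t7157\n")
    plain = load_file(data_dir, "genes", custom_loaders={})
    custom = load_file(data_dir, "genes", custom_loaders={"genes": uppercase_genes})

print(plain)
print(custom)
```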
- process_edge_list(dataset_dir: Path) dict
Parse gene-attribute edge list and return statistics.
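As a rough illustration of the statistics involved, the sketch below counts edges, distinct genes, and distinct attributes in a toy edge list. The key names follow the 'stats' entries documented for process_harmonizome_datasets on this page; the `summarize_edges` helper and the example values are hypothetical, and the real method reads files from `dataset_dir` rather than an in-memory list.

```python
from typing import Dict, List, Tuple

# Toy gene-attribute edge list as (gene, attribute) pairs; values are made up.
edges: List[Tuple[str, str]] = [
    ("TP53", "cell_line_A"),
    ("TP53", "cell_line_B"),
    ("MYC", "cell_line_A"),
]


def summarize_edges(edge_list: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count edges, distinct genes, and distinct attributes."""
    genes = {gene for gene, _ in edge_list}
    attributes = {attribute for _, attribute in edge_list}
    return {
        "n_edges": len(edge_list),
        "n_genes": len(genes),
        "n_attributes": len(attributes),
    }


stats = summarize_edges(edges)
print(stats)
```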
- _abc_impl = <_abc._abc_data object>
- custom_loaders: Dict[str, Callable[[Path], pd.DataFrame]]
- custom_urls: Dict[str, str]
- description: str
- download_files: List[str]
- model_config = {'arbitrary_types_allowed': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- name: str
- short_name: str
- napistu.ingestion.harmonizome._apply_custom_loaders(dataset: HarmonizomeDataset, custom_loaders_registry: Dict[str, Dict[str, Callable[[Path], DataFrame]]]) HarmonizomeDataset
Apply custom loaders to a HarmonizomeDataset instance if available.
This creates a modified copy of the dataset with custom loaders added.
- Parameters:
dataset (HarmonizomeDataset) – The dataset instance to potentially add custom loaders to
custom_loaders_registry (Dict[str, Dict[str, Callable]]) – Registry mapping dataset short_name to file_type loaders
- Returns:
Dataset instance with custom loaders applied (if any exist for this dataset)
- Return type:
HarmonizomeDataset
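The copy-with-loaders behavior can be sketched with a frozen dataclass standing in for the Pydantic model. `DatasetConfig` and `apply_custom_loaders` are hypothetical names. Returning the original object when nothing is registered, and letting registry entries override existing loaders, are both assumptions rather than documented behavior.

```python
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Callable, Dict

Loader = Callable[[Path], object]


@dataclass(frozen=True)
class DatasetConfig:
    """Simplified stand-in for HarmonizomeDataset (hypothetical)."""

    short_name: str
    custom_loaders: Dict[str, Loader] = field(default_factory=dict)


def apply_custom_loaders(
    dataset: DatasetConfig, registry: Dict[str, Dict[str, Loader]]
) -> DatasetConfig:
    """Return a copy of dataset with registry loaders merged in, if any exist."""
    loaders = registry.get(dataset.short_name)
    if not loaders:
        return dataset  # assumption: nothing registered means no copy is made
    # Assumption: registry entries take precedence over existing loaders.
    merged = {**dataset.custom_loaders, **loaders}
    return replace(dataset, custom_loaders=merged)


def load_stub(filepath: Path) -> str:
    """Placeholder loader used only for this example."""
    return filepath.name


registry = {"achilles": {"interactions": load_stub}}
original = DatasetConfig("achilles")
updated = apply_custom_loaders(original, registry)

print(sorted(updated.custom_loaders))
print(original.custom_loaders)
```

The original instance keeps its empty loader mapping, which is the point of working on a copy.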
- napistu.ingestion.harmonizome._format_perturbatalas_perturbations(interactions: DataFrame) DataFrame
- napistu.ingestion.harmonizome._load_achilles_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_clinvar_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_dbgap_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_drugbank_attributes(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_drugbank_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_gad_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_perturbatlas_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome._load_replogle_interactions(filepath: Path) DataFrame
- napistu.ingestion.harmonizome.load_harmonizome_datasets(dataset_short_names: List[str], output_dir: Path, datasets_dict: Dict[str, HarmonizomeDataset | Dict] = None, file_types: List[str] | None = None, custom_loaders_registry: Dict[str, Dict[str, Callable]] | None = None) Dict[str, Dict[str, DataFrame]]
Load multiple datasets into a nested dictionary of DataFrames.
- Parameters:
dataset_short_names (List[str]) – List of dataset short names from datasets_dict to load
output_dir (Path) – Directory containing the downloaded files
datasets_dict (Optional[Dict[str, Union[HarmonizomeDataset, Dict]]], optional) – Dictionary of available datasets. Defaults to HARMONIZOME_DATASETS if None.
file_types (Optional[List[str]], optional) – List of file types to load. If None, loads all available files. Options: ‘interactions’, ‘genes’, ‘attributes’
custom_loaders_registry (Optional[Dict[str, Dict[str, Callable]]], optional) – Registry of custom loader functions. If None, uses CUSTOM_LOADERS. Maps dataset short_name -> file_type -> loader function.
- Returns:
Nested dictionary: {dataset_short_name: {file_type: DataFrame}}
- Return type:
Dict[str, Dict[str, pd.DataFrame]]
Examples
>>> data = load_harmonizome_datasets(['reploglek562essential', 'achilles'], Path('data'))
>>> interactions = data['reploglek562essential']['interactions']
>>> genes = data['achilles']['genes']

>>> # Load only specific file types
>>> data = load_harmonizome_datasets(
...     ['reploglek562essential'],
...     Path('data'),
...     file_types=['interactions']
... )

>>> # Provide custom loaders
>>> my_loaders = {
...     'achilles': {'interactions': load_achilles_interactions}
... }
>>> data = load_harmonizome_datasets(
...     ['achilles'],
...     Path('data'),
...     custom_loaders_registry=my_loaders
... )
- napistu.ingestion.harmonizome.process_harmonizome_datasets(dataset_short_names: List[str], output_dir: Path, datasets_dict: Dict[str, HarmonizomeDataset | Dict] = {
    'achilles': {'description': 'CRISPR knockout essentiality - called (-1, 1) fitnesses for individual cell lines.', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Achilles Cell Line Gene Essentiality Profiles', 'short_name': 'achilles'},
    'clinvar25': {'description': 'SNP-phenotype associations curated by ClinVar users from various sources', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'ClinVar', 'short_name': 'clinvar25'},
    'dbgap': {'description': 'Database of gene-trait associations curated from genetic association studies', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Database of Genotypes and Phenotypes', 'short_name': 'dbgap'},
    'drugbank': {'description': 'Protein-drug associations by manual literature curation', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'DrugBank', 'short_name': 'drugbank'},
    'gad': {'description': 'Gene-disease associations curated from genetic association studies', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Genetic Association Database (GAD)', 'short_name': 'gad'},
    'perturbatlas': {'description': 'Gene expression profiles for cell lines, cell types, tissues, and models following genetic perturbation ', 'download_files': ['interactions', 'genes'], 'name': 'PerturbAtlas', 'short_name': 'perturbatlas'},
    'perturbatlasmouse': {'description': 'Gene expression profiles for mouse cell lines, cell types, tissues, and models following genetic perturbation', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'PerturbAtlas Mouse', 'short_name': 'perturbatlasmouse'},
    'reploglek562essential': {'description': 'K562 CRISPRi essential genes', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Replogle et al., Cell 2022: K562 Essential Perturb-seq Gene Perturbation Signatures', 'short_name': 'reploglek562essential'},
    'reploglek562genomewide': {'description': 'K562 CRISPRi gene perturbation signatures', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Replogle et al., Cell 2022: K562 Genome-wide Perturb-seq Gene Perturbation Signatures', 'short_name': 'reploglek562genomewide'},
    'reploglerpe1essential': {'description': 'RPE1 CRISPRi essential genes', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Replogle et al., Cell 2022: RPE1 Essential Perturb-seq Gene Perturbation Signatures', 'short_name': 'reploglerpe1essential'}
}, overwrite: bool = False, custom_loaders_registry: Dict[str, Dict[str, Callable]] | None = None) Dict[str, Dict]
Download and process multiple datasets with statistics.
This is the main function for downloading Harmonizome datasets.
- Parameters:
dataset_short_names (List[str]) – List of dataset short names from datasets_dict to process
output_dir (Path) – Directory to save downloaded files
datasets_dict (Optional[Dict]) – Dictionary of available datasets. Defaults to HARMONIZOME_DATASETS if None.
overwrite (bool, optional) – Whether to overwrite existing files
custom_loaders_registry (Optional[Dict[str, Dict[str, Callable]]], optional) – Registry of custom loader functions. If None, uses CUSTOM_LOADERS. Maps dataset short_name -> file_type -> loader function.
- Returns:
Nested dictionary with files and stats for each dataset:
{
    'dataset_short_name': {
        'files': {file_type: filepath},
        'stats': {
            'n_edges': int,
            'n_genes': int,
            'n_attributes': int,
            'dataframe': pd.DataFrame
        }
    }
}
- Return type:
Dict[str, Dict]
Examples
>>> results = process_harmonizome_datasets(['reploglek562essential'], Path('data'))
>>> results['reploglek562essential']['stats']['n_edges']
12345
>>> results['reploglek562essential']['files']['interactions']
PosixPath('data/reploglek562essential/interactions.tsv')

>>> # Provide custom loaders
>>> my_loaders = {
...     'achilles': {'interactions': load_achilles_interactions}
... }
>>> results = process_harmonizome_datasets(
...     ['achilles'],
...     Path('data'),
...     custom_loaders_registry=my_loaders
... )
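To show how the documented return layout might be consumed, the sketch below walks a mock result and collects one summary row per dataset. The mock values are illustrative only and the 'dataframe' entry is left out so the example has no dependencies. `summarize_results` is a hypothetical helper, not part of napistu.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Mock result following the documented layout; the numbers are made up.
results: Dict[str, Dict] = {
    "achilles": {
        "files": {"interactions": Path("data/achilles/interactions.tsv")},
        "stats": {"n_edges": 10, "n_genes": 4, "n_attributes": 3},
    },
    "gad": {
        "files": {"interactions": Path("data/gad/interactions.tsv")},
        "stats": {"n_edges": 6, "n_genes": 5, "n_attributes": 2},
    },
}


def summarize_results(results: Dict[str, Dict]) -> List[Tuple[str, int, int, int]]:
    """Collect (dataset, n_edges, n_genes, n_attributes), sorted by dataset name."""
    rows = []
    for name, entry in sorted(results.items()):
        stats = entry["stats"]
        rows.append((name, stats["n_edges"], stats["n_genes"], stats["n_attributes"]))
    return rows


summary = summarize_results(results)
for row in summary:
    print(*row, sep="\t")
```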