napistu.ingestion.harmonizome

Ingestion utilities for Harmonizome datasets, such as Achilles and the Perturb-seq datasets from Replogle et al., Cell 2022.

Functions

`load_harmonizome_datasets`(...[, ...])	Load multiple datasets into nested dictionary of DataFrames.
`process_harmonizome_datasets`(...[, ...])	Download and process multiple datasets with statistics.

Classes

`HarmonizomeDataset`(*, name, short_name, ...)	Configuration and methods for a Harmonizome dataset.

class napistu.ingestion.harmonizome.HarmonizomeDataset(*, name: str, short_name: str, description: str, download_files: ~typing.List[str] = <factory>, custom_urls: ~typing.Dict[str, str] = <factory>, custom_loaders: ~typing.Dict[str, ~typing.Callable[[~pathlib.Path], ~pandas.core.frame.DataFrame]] = <factory>)

Bases: BaseModel

Configuration and methods for a Harmonizome dataset.

classmethod ensure_harmonizome_dataset(dataset: HarmonizomeDataset | Dict) → HarmonizomeDataset

Ensure input is a HarmonizomeDataset instance.

If a dict is provided, it must have a ‘short_name’ key.

Parameters:

dataset (HarmonizomeDataset | Dict) – Either a HarmonizomeDataset instance or a config dict

Returns:

A HarmonizomeDataset instance

Return type:

HarmonizomeDataset

Examples

>>> # Pass through existing instance
>>> ds = HarmonizomeDataset(name='test', short_name='test', description='test')
>>> result = HarmonizomeDataset.ensure_harmonizome_dataset(ds)
>>> assert result is ds

>>> # Convert from dict
>>> config = {'short_name': 'test', 'name': 'Test Dataset', 'description': 'A test'}
>>> result = HarmonizomeDataset.ensure_harmonizome_dataset(config)
>>> isinstance(result, HarmonizomeDataset)
True

classmethod from_dict(short_name: str, config: Dict) → HarmonizomeDataset

Create a HarmonizomeDataset from a configuration dictionary.

Parameters:

short_name (str) – The short name/key for the dataset
config (Dict) – Configuration dictionary with keys: name, description, download_files

Returns:

A new dataset instance

Return type:

HarmonizomeDataset

Examples

>>> config = {
...     'name': 'My Dataset',
...     'description': 'A test dataset',
...     'download_files': ['interactions', 'genes']
... }
>>> dataset = HarmonizomeDataset.from_dict('mydataset', config)

classmethod validate_download_files(v: List[str]) → List[str]: Ensure download_files contains valid file types.

static _default_loader(filepath: Path) → DataFrame: Default file loading logic.

download(output_dir: Path, overwrite: bool = False) → Dict[str, Path]: Download all files for this dataset using napistu’s download_wget utility.

download_and_process(output_dir: Path, overwrite: bool = False) → Dict[str, any]: Download all files and process them.

get_download_urls() → Dict[str, str]: Generate download URLs, using custom URLs when provided.
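
The precedence of custom URLs over generated ones can be pictured with the sketch below. It is not napistu's implementation: `BASE_URL_TEMPLATE`, `build_download_urls`, and the example.org addresses are hypothetical, and the real Harmonizome URL pattern is not given in this reference.

```python
from typing import Dict, List

# Hypothetical template; the real Harmonizome URL layout is not documented here.
BASE_URL_TEMPLATE = "https://example.org/{short_name}/{file_type}.txt.gz"


def build_download_urls(
    short_name: str,
    download_files: List[str],
    custom_urls: Dict[str, str],
) -> Dict[str, str]:
    """Map each file type to a URL, preferring an entry in custom_urls."""
    urls = {}
    for file_type in download_files:
        default_url = BASE_URL_TEMPLATE.format(
            short_name=short_name, file_type=file_type
        )
        # A custom URL, when present, replaces the templated default.
        urls[file_type] = custom_urls.get(file_type, default_url)
    return urls


urls = build_download_urls(
    "achilles",
    ["interactions", "genes"],
    {"genes": "https://example.org/mirror/achilles_genes.txt.gz"},
)
```

Here only the ‘genes’ entry is overridden; ‘interactions’ falls back to the templated address.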

load(output_dir: Path, file_type: str) → DataFrame

Load a dataset file as a DataFrame.

Parameters:

output_dir (Path) – Directory containing the dataset files
file_type (str) – Type of file to load (e.g., ‘interactions’, ‘genes’, ‘attributes’)

Returns:

The loaded data

Return type:

pd.DataFrame
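
The class carries both a `custom_loaders` mapping and a `_default_loader`, which suggests that loading dispatches on file type. The sketch below illustrates that idea under stated assumptions: the `<file_type>.tsv` file layout, the helper names, and the column names are all hypothetical, and plain lists of row dicts stand in for DataFrames so the example needs only the standard library.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]  # dependency-free stand-in for a DataFrame


def default_loader(filepath: Path) -> Rows:
    """Read a tab-separated file that has a header row."""
    with filepath.open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def load_file(
    output_dir: Path,
    file_type: str,
    custom_loaders: Dict[str, Callable[[Path], Rows]],
) -> Rows:
    # Hypothetical on-disk layout: one "<file_type>.tsv" per file type.
    filepath = output_dir / f"{file_type}.tsv"
    # Use the loader registered for this file type, else the default.
    loader = custom_loaders.get(file_type, default_loader)
    return loader(filepath)


def uppercase_genes(filepath: Path) -> Rows:
    """Example custom loader: default parse, then upper-case the gene column."""
    return [{**row, "gene": row["gene"].upper()} for row in default_loader(filepath)]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "interactions.tsv").write_text("gene\tattribute\ntp53\tK562\n")
    plain = load_file(out, "interactions", {})
    custom = load_file(out, "interactions", {"interactions": uppercase_genes})
```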

process_edge_list(dataset_dir: Path) → dict: Parse gene-attribute edge list and return statistics.
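
As a rough illustration of the statistics involved, the sketch below counts edges, distinct genes, and distinct attributes in a small in-memory edge list. The key names follow the ‘stats’ entry documented for `process_harmonizome_datasets`; the function name, the tuple representation, and the sample edges are assumptions, and the real method reads from `dataset_dir` and also returns a DataFrame.

```python
from typing import Dict, Iterable, Tuple


def edge_list_stats(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count rows, distinct genes, and distinct attributes in (gene, attribute) pairs."""
    edge_rows = list(edges)
    genes = {gene for gene, _ in edge_rows}
    attributes = {attribute for _, attribute in edge_rows}
    return {
        "n_edges": len(edge_rows),
        "n_genes": len(genes),
        "n_attributes": len(attributes),
    }


# Made-up edges, for illustration only.
edges = [("TP53", "K562"), ("TP53", "RPE1"), ("MYC", "K562")]
stats = edge_list_stats(edges)
```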

_abc_impl = <_abc._abc_data object>

custom_loaders: Dict[str, Callable[[Path], pd.DataFrame]]

custom_urls: Dict[str, str]

description: str

download_files: List[str]

model_config = {'arbitrary_types_allowed': True}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

name: str

short_name: str

napistu.ingestion.harmonizome._apply_custom_loaders(dataset: HarmonizomeDataset, custom_loaders_registry: Dict[str, Dict[str, Callable[[Path], DataFrame]]]) → HarmonizomeDataset

Apply custom loaders to a HarmonizomeDataset instance if available.

This creates a modified copy of the dataset with custom loaders added.

Parameters:

dataset (HarmonizomeDataset) – The dataset instance to potentially add custom loaders to
custom_loaders_registry (Dict[str, Dict[str, Callable]]) – Registry mapping dataset short_name to file_type loaders

Returns:

Dataset instance with custom loaders applied (if any exist for this dataset)

Return type:

HarmonizomeDataset
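
The copy-then-extend behaviour described above can be sketched with a frozen dataclass standing in for the Pydantic model. `DatasetConfig` and `apply_custom_loaders` are hypothetical names, and returning the original instance when the registry has no entry for the dataset is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field, replace
from pathlib import Path
from typing import Callable, Dict

Loader = Callable[[Path], object]


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical stand-in for HarmonizomeDataset."""

    short_name: str
    custom_loaders: Dict[str, Loader] = field(default_factory=dict)


def apply_custom_loaders(
    dataset: DatasetConfig,
    registry: Dict[str, Dict[str, Loader]],
) -> DatasetConfig:
    loaders = registry.get(dataset.short_name)
    if not loaders:
        # Assumption: nothing registered, so the input is returned as-is.
        return dataset
    # Build a new mapping so the original instance is never mutated.
    merged = {**dataset.custom_loaders, **loaders}
    return replace(dataset, custom_loaders=merged)


def fake_loader(filepath: Path) -> str:
    """Placeholder loader used only to show registration."""
    return filepath.name


original = DatasetConfig("achilles")
registry = {"achilles": {"interactions": fake_loader}}
updated = apply_custom_loaders(original, registry)
untouched = apply_custom_loaders(DatasetConfig("gad"), registry)
```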

napistu.ingestion.harmonizome._format_perturbatalas_perturbations(interactions: DataFrame) → DataFrame

napistu.ingestion.harmonizome._load_achilles_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_clinvar_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_dbgap_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_drugbank_attributes(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_drugbank_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_gad_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_perturbatlas_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome._load_replogle_interactions(filepath: Path) → DataFrame

napistu.ingestion.harmonizome.load_harmonizome_datasets(dataset_short_names: List[str], output_dir: Path, datasets_dict: Dict[str, HarmonizomeDataset | Dict] = None, file_types: List[str] | None = None, custom_loaders_registry: Dict[str, Dict[str, Callable]] | None = None) → Dict[str, Dict[str, DataFrame]]

Load multiple datasets into nested dictionary of DataFrames.

Parameters:

dataset_short_names (List[str]) – List of dataset short names from datasets_dict to load
output_dir (Path) – Directory containing the downloaded files
datasets_dict (Optional[Dict[str, Union[HarmonizomeDataset, Dict]]], optional) – Dictionary of available datasets. Defaults to HARMONIZOME_DATASETS if None.
file_types (Optional[List[str]], optional) – List of file types to load. If None, loads all available files. Options: ‘interactions’, ‘genes’, ‘attributes’
custom_loaders_registry (Optional[Dict[str, Dict[str, Callable]]], optional) – Registry of custom loader functions. If None, uses CUSTOM_LOADERS. Maps dataset short_name -> file_type -> loader function.

Returns:

Nested dictionary: {dataset_short_name: {file_type: DataFrame}}

Return type:

Dict[str, Dict[str, pd.DataFrame]]

Examples

>>> data = load_harmonizome_datasets(['reploglek562essential', 'achilles'], Path('data'))
>>> interactions = data['reploglek562essential']['interactions']
>>> genes = data['achilles']['genes']

>>> # Load only specific file types
>>> data = load_harmonizome_datasets(
...     ['reploglek562essential'],
...     Path('data'),
...     file_types=['interactions']
... )

>>> # Provide custom loaders
>>> my_loaders = {
...     'achilles': {'interactions': load_achilles_interactions}
... }
>>> data = load_harmonizome_datasets(
...     ['achilles'],
...     Path('data'),
...     custom_loaders_registry=my_loaders
... )
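
Because the return value is nested as {dataset_short_name: {file_type: DataFrame}}, a two-level loop is enough to inspect it. The sketch below assumes only that shape; `summarize_loaded` is a hypothetical helper, and lists stand in for DataFrames (`len()` gives the row count for either).

```python
from typing import Dict, List, Sized, Tuple


def summarize_loaded(data: Dict[str, Dict[str, Sized]]) -> List[Tuple[str, str, int]]:
    """Return (dataset, file_type, n_rows) triples in sorted order."""
    return [
        (short_name, file_type, len(table))
        for short_name, tables in sorted(data.items())
        for file_type, table in sorted(tables.items())
    ]


# Made-up stand-in for the nested result; real values would be DataFrames.
data = {
    "reploglek562essential": {"interactions": [("A", "B"), ("C", "D")]},
    "achilles": {"genes": [("TP53",)], "interactions": []},
}
summary = summarize_loaded(data)
```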

napistu.ingestion.harmonizome.process_harmonizome_datasets(dataset_short_names: List[str], output_dir: Path, datasets_dict: Dict[str, HarmonizomeDataset | Dict] = {'achilles': {'description': 'CRISPR knockout essentiality - called (-1, 1) fitnesses for individual cell lines.', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Achilles Cell Line Gene Essentiality Profiles', 'short_name': 'achilles'}, 'clinvar25': {'description': 'SNP-phenotype associations curated by ClinVar users from various sources', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'ClinVar', 'short_name': 'clinvar25'}, 'dbgap': {'description': 'Database of gene-trait associations curated from genetic association studies', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Database of Genotypes and Phenotypes', 'short_name': 'dbgap'}, 'drugbank': {'description': 'Protein-drug associations by manual literature curation', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'DrugBank', 'short_name': 'drugbank'}, 'gad': {'description': 'Gene-disease associations curated from genetic association studies', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Genetic Association Database (GAD)', 'short_name': 'gad'}, 'perturbatlas': {'description': 'Gene expression profiles for cell lines, cell types, tissues, and models following genetic perturbation ', 'download_files': ['interactions', 'genes'], 'name': 'PerturbAtlas', 'short_name': 'perturbatlas'}, 'perturbatlasmouse': {'description': 'Gene expression profiles for mouse cell lines, cell types, tissues, and models following genetic perturbation', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'PerturbAtlas Mouse', 'short_name': 'perturbatlasmouse'}, 'reploglek562essential': {'description': 'K562 CRISPRi essential genes', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Replogle et al., Cell 2022: K562 Essential Perturb-seq Gene Perturbation Signatures', 'short_name': 'reploglek562essential'}, 'reploglek562genomewide': {'description': 'K562 CRISPRi gene perturbation signatures', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Replogle et al., Cell 2022: K562 Genome-wide Perturb-seq Gene Perturbation Signatures', 'short_name': 'reploglek562genomewide'}, 'reploglerpe1essential': {'description': 'RPE1 CRISPRi essential genes', 'download_files': ['interactions', 'genes', 'attributes'], 'name': 'Replogle et al., Cell 2022: RPE1 Essential Perturb-seq Gene Perturbation Signatures', 'short_name': 'reploglerpe1essential'}}, overwrite: bool = False, custom_loaders_registry: Dict[str, Dict[str, Callable]] | None = None) → Dict[str, Dict]

Download and process multiple datasets with statistics.

This is the main function for downloading Harmonizome datasets.

Parameters:

dataset_short_names (List[str]) – List of dataset short names from datasets_dict to process
output_dir (Path) – Directory to save downloaded files
datasets_dict (Optional[Dict]) – Dictionary of available datasets. Defaults to HARMONIZOME_DATASETS if None.
overwrite (bool, optional) – Whether to overwrite existing files
custom_loaders_registry (Optional[Dict[str, Dict[str, Callable]]], optional) – Registry of custom loader functions. If None, uses CUSTOM_LOADERS. Maps dataset short_name -> file_type -> loader function.

Returns:

Nested dictionary with files and stats for each dataset:

    {
        'dataset_short_name': {
            'files': {file_type: filepath},
            'stats': {
                'n_edges': int,
                'n_genes': int,
                'n_attributes': int,
                'dataframe': pd.DataFrame
            }
        }
    }

Return type:

Dict[str, Dict]

Examples

>>> results = process_harmonizome_datasets(['reploglek562essential'], Path('data'))
>>> results['reploglek562essential']['stats']['n_edges']
12345
>>> results['reploglek562essential']['files']['interactions']
PosixPath('data/reploglek562essential/interactions.tsv')

>>> # Provide custom loaders
>>> my_loaders = {
...     'achilles': {'interactions': load_achilles_interactions}
... }
>>> results = process_harmonizome_datasets(
...     ['achilles'],
...     Path('data'),
...     custom_loaders_registry=my_loaders
... )
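
Given the documented return shape, the per-dataset counts can be pulled into a flat list for quick comparison. The sketch below assumes only the ‘stats’ keys listed under Returns; `stats_table` is a hypothetical helper and the numbers in `results` are placeholders, not real dataset sizes.

```python
from typing import Any, Dict, List

COUNT_KEYS = ("n_edges", "n_genes", "n_attributes")


def stats_table(results: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten the per-dataset 'stats' counts into one row per dataset."""
    rows = []
    for short_name, entry in results.items():
        stats = entry["stats"]
        row = {"dataset": short_name}
        # Copy only the integer counts; the 'dataframe' entry is left out.
        row.update({key: stats[key] for key in COUNT_KEYS})
        rows.append(row)
    return rows


# Placeholder values shaped like the documented return value.
results = {
    "achilles": {
        "files": {"interactions": "data/achilles/interactions.tsv"},
        "stats": {"n_edges": 10, "n_genes": 4, "n_attributes": 3, "dataframe": None},
    }
}
table = stats_table(results)
```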