napistu.ingestion.perturbseq

Ingestion and formatting utilities for Perturb-seq datasets.

Functions

assign_predicted_direction(df[, ...])

Assign predicted direction categories based on perturbation type and fold-change.

ingest_replogle_pvalues(target_uri)

Ingest Replogle et al. Perturb-seq p-values.

load_harmonizome_perturbseq_datasets(...[, ...])

Load aggregated perturbseq data with species IDs.

load_replogle_pvalues_with_species_ids(...)

Load Replogle et al. Perturb-seq p-values with species IDs.

napistu.ingestion.perturbseq._categorize_perturbseq_row(row: Series, perturbation_type_col: str, standardized_value_col: str, threshold_value_col: str) → str

Categorize a row of perturbseq data into a direction category.

Parameters:
  • row (pd.Series) – Row of perturbseq data

  • perturbation_type_col (str) – Column name for perturbation type

  • standardized_value_col (str) – Column name for standardized value

  • threshold_value_col (str) – Column name for threshold value

Returns:

Direction category for the row

Return type:

str

Examples

row = pd.Series({
    'perturbation_type': 'OE',
    'standardized_value': 1.0,
    'threshold_value': 0.5,
})
_categorize_perturbseq_row(row, 'perturbation_type', 'standardized_value', 'threshold_value')

napistu.ingestion.perturbseq._format_harmonizome_replogle_with_species_ids(harmonizome_replogle_interactions: DataFrame, species_identifiers: DataFrame) → DataFrame

Format Replogle interactions from Harmonizome with species IDs.

Parameters:
  • harmonizome_replogle_interactions (pd.DataFrame) – Harmonizome’s Replogle interactions dataframe.

  • species_identifiers (pd.DataFrame) – Species identifiers dataframe.

Returns:

Replogle interactions with species IDs.

Return type:

pd.DataFrame

Examples

datasets = [HARMONIZOME_DATASET_SHORTNAMES.REPLOGLE_K562_ESSENTIAL]
_ = process_harmonizome_datasets(datasets, "/tmp/harmonizome_data")
perturbseq_data = load_harmonizome_datasets(datasets, "/tmp/harmonizome_data")
harmonizome_replogle_interactions_with_species_ids = _format_harmonizome_replogle_with_species_ids(
    perturbseq_data[datasets[0]]["interactions"], species_identifiers
)

napistu.ingestion.perturbseq._format_perturbatlas_with_species_ids(perturbatlas_interactions: DataFrame, species_identifiers: DataFrame) → DataFrame

Format PerturbAtlas interactions with species IDs.

Parameters:
  • perturbatlas_interactions (pd.DataFrame) – PerturbAtlas interactions dataframe.

  • species_identifiers (pd.DataFrame) – Species identifiers dataframe.

Returns:

PerturbAtlas interactions with species IDs.

Return type:

pd.DataFrame

Examples

datasets = [HARMONIZOME_DATASET_SHORTNAMES.PERTURB_ATLAS_MOUSE]
_ = process_harmonizome_datasets(datasets, "/tmp/harmonizome_data")
perturbseq_data = load_harmonizome_datasets(datasets, "/tmp/harmonizome_data")
perturbatlas_interactions_with_species_ids = _format_perturbatlas_with_species_ids(
    perturbseq_data[datasets[0]]["interactions"], species_identifiers
)

napistu.ingestion.perturbseq._get_distinct_harmonizome_perturbseq_interactions(aggregated_perturbseq_data_with_species_ids: DataFrame) → DataFrame

Reduce the Harmonizome perturbseq data to a single entry per combination of study, perturbation type, perturbed gene, and target gene.
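The reduction described above can be illustrated with plain pandas. This is only a minimal sketch, not the library's implementation: the toy values are invented, and the tie-breaking rule (keep the row with the largest absolute standardized value) is an assumption, since the documentation does not say which duplicate is retained. Column names follow the output schema documented for load_harmonizome_perturbseq_datasets.

```python
import pandas as pd

# Toy long-format table; values are invented for illustration only.
interactions = pd.DataFrame({
    "perturbation_study": ["s1", "s1", "s1", "s2"],
    "perturbation_type": ["KO", "KO", "KO", "KD"],
    "perturbed_species_id": ["S001", "S001", "S001", "S001"],
    "target_species_id": ["S002", "S002", "S003", "S002"],
    "standardized_value": [0.4, -1.2, 0.8, 0.3],
})

# Columns that jointly identify one study/type/perturbed/target entry.
key_cols = [
    "perturbation_study",
    "perturbation_type",
    "perturbed_species_id",
    "target_species_id",
]

# ASSUMPTION: among duplicates, keep the row with the largest |standardized_value|.
distinct = (
    interactions
    .assign(abs_value=lambda d: d["standardized_value"].abs())
    .sort_values("abs_value", ascending=False)
    .drop_duplicates(subset=key_cols, keep="first")
    .drop(columns="abs_value")
    .sort_values(key_cols)
    .reset_index(drop=True)
)

print(distinct)
```

In this toy table the two s1/KO/S001/S002 rows collapse into one, leaving three rows in total.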

napistu.ingestion.perturbseq._get_distinct_replogle_pvalues(replogle_pvalues_with_species_ids: DataFrame) → DataFrame

Reduce the significance values reported by Replogle et al. to a single entry per perturbed-target pair.

napistu.ingestion.perturbseq.assign_predicted_direction(df, perturbation_type_col='perturbation_type', standardized_value_col='Standardized Value', threshold_value_col='Threshold Value')

Assign predicted direction categories based on perturbation type and fold-change.

For OE (overexpression):
  • standardized_value > threshold: strong activation

  • 0 < standardized_value <= threshold: weak activation

  • -threshold <= standardized_value < 0: weak repression

  • standardized_value < -threshold: strong repression

For KD/KO (knockdown/knockout) - directions are flipped:
  • standardized_value > threshold: strong repression

  • 0 < standardized_value <= threshold: weak repression

  • -threshold <= standardized_value < 0: weak activation

  • standardized_value < -threshold: strong activation

Parameters:
  • df (pd.DataFrame) – DataFrame with perturbation data

  • perturbation_type_col (str) – Column name for perturbation type (should contain 'KD', 'KO', or 'OE')

  • standardized_value_col (str) – Column name for standardized fold-change values

  • threshold_value_col (str) – Column name for threshold values (absolute value)

Returns:

Series with predicted direction categories

Return type:

pd.Series
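The threshold rules listed for assign_predicted_direction can be restated as a small standalone function. This is a minimal sketch under stated assumptions, not the library's code: the function name categorize_direction is hypothetical, the returned label strings simply mirror the wording of the rules, and returning None for a value of exactly zero, a NaN, or an unrecognized perturbation type is an assumption because the documented rules do not cover those cases.

```python
from typing import Optional


def categorize_direction(
    perturbation_type: str,
    standardized_value: float,
    threshold_value: float,
) -> Optional[str]:
    """Hypothetical restatement of the documented OE / KD / KO rules."""
    threshold = abs(threshold_value)  # the threshold is documented as an absolute value

    # Step 1: strength and sign of the observed change.
    if standardized_value > threshold:
        strength, sign = "strong", 1
    elif standardized_value > 0:
        strength, sign = "weak", 1
    elif standardized_value < -threshold:
        strength, sign = "strong", -1
    elif standardized_value < 0:
        strength, sign = "weak", -1
    else:
        # ASSUMPTION: exactly zero (or NaN) is not covered by the documented rules.
        return None

    # Step 2: overexpression keeps the sign; knockdown/knockout flips it.
    if perturbation_type in ("KD", "KO"):
        sign = -sign
    elif perturbation_type != "OE":
        # ASSUMPTION: unknown perturbation types are left uncategorized.
        return None

    effect = "activation" if sign > 0 else "repression"
    return f"{strength} {effect}"


print(categorize_direction("OE", 1.0, 0.5))
print(categorize_direction("KO", 1.0, 0.5))
```

A function of this shape could be applied row by row with DataFrame.apply(..., axis=1) to produce a Series of categories, which matches the documented return type.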

napistu.ingestion.perturbseq.ingest_replogle_pvalues(target_uri: str) → None

Ingest Replogle et al. Perturb-seq p-values.

Parameters:

target_uri (str) – Target URI to download the Replogle et al. Perturb-seq p-values to.

Return type:

None

napistu.ingestion.perturbseq.load_harmonizome_perturbseq_datasets(harmonizome_data_dir: str, species_identifiers: DataFrame, datasets_w_formatters: Dict[str, Callable] | None = None, return_distinct_interactions: bool = False) → DataFrame

Load aggregated perturbseq data with species IDs.

Parameters:
  • harmonizome_data_dir (str) – Directory containing harmonizome data.

  • species_identifiers (pd.DataFrame) – Species identifiers dataframe.

  • datasets_w_formatters (Optional[Dict[str, Callable]]) – Dictionary mapping dataset shortnames to formatter functions. Defaults to the mapping of human perturbseq datasets to their formatters.

  • return_distinct_interactions (bool) – Whether to return distinct interactions. Default is False.

Returns:

Aggregated perturbseq data with the following columns:
  • perturbed_species_id: the species ID of the perturbed gene

  • target_species_id: the species ID of the target gene

  • perturbation_type: the type of perturbation (for perturbatlas, e.g., KO for knockout)

  • perturbation_study: the study that reported the perturbation (for perturbatlas, a study code)

  • standardized_value: the standardized value of the perturbation

  • thresholded_value: the thresholded value of the perturbation

  • dataset_shortname: the shortname of the dataset

Return type:

pd.DataFrame

napistu.ingestion.perturbseq.load_replogle_pvalues_with_species_ids(path_to_wide_replogle_pvalues: str | Path, species_identifiers: DataFrame, return_distinct_interactions: bool = False) → DataFrame

Load Replogle et al. Perturb-seq p-values with species IDs.

Parameters:
  • path_to_wide_replogle_pvalues (Union[str, Path]) – Path to the wide Replogle et al. Perturb-seq p-values file.

  • species_identifiers (pd.DataFrame) – Species identifiers dataframe.

  • return_distinct_interactions (bool) – Whether to return distinct interactions. Default is False.

Returns:

Replogle et al. Perturb-seq p-values with species IDs.

Return type:

pd.DataFrame