napistu.ingestion.intact

Functions

`download_intact_xmls`(output_dir_path, ...[, ...])	Download IntAct Species
`intact_to_sbml_dfs`(intact_summaries, ...)	Convert IntAct summaries to SBML_dfs

napistu.ingestion.intact._build_psi_mi_ontology_graph(ontology_url: str = 'https://raw.githubusercontent.com/MICommunity/miscore/refs/heads/master/miscore/src/main/resources/psimiOntology.json') → Graph: Parse MI ontology JSON from URL and build igraph directed graph.

napistu.ingestion.intact._calculate_all_scores_vectorized(counts_df: DataFrame, n_publications_series: Series, max_pubs: int = 7, weights: Dict[str, float] = {'interaction_method_score': 1.0, 'interaction_type_score': 1.0, 'publication_score': 1.0}) → DataFrame

Calculate all MIscore components using vectorized operations.

Note: This implementation follows the MIscore mathematical formulas from Villaveces et al. (2015), but has not been validated against published IntAct scores due to lack of detailed worked examples showing:
- Specific interaction evidence (X studies of type Y using method Z)
- The resulting component scores
- The final MIscore

Parameters:

counts_df (pd.DataFrame) – DataFrame containing interaction counts and scores
n_publications_series (pd.Series) – Series containing publication counts for each interaction
max_pubs (int, optional) – Maximum publication threshold for scoring, by default INTACT_PUBLICATION_SCORE_THRESHOLD
weights (dict[str, float], optional) – Dictionary of weights for different score components, by default DEFAULT_INTACT_RELATIVE_WEIGHTS

Returns:

DataFrame containing all calculated scores for each interaction

Return type:

pd.DataFrame

Raises:

ValueError – If the weights dictionary does not contain the expected keys
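
The scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not napistu's implementation: it assumes the publication score is the logarithm of (n + 1) in base (max_pubs + 1), capped at 1, and that the final score is a weighted mean of the three components, which is how the MIscore is commonly summarized. The helper names `publication_score` and `combine_scores` are hypothetical.

```python
import numpy as np
import pandas as pd

# Keys taken from the default `weights` argument shown in the signature above.
EXPECTED_KEYS = {
    "interaction_method_score",
    "interaction_type_score",
    "publication_score",
}


def publication_score(n_publications: pd.Series, max_pubs: int = 7) -> pd.Series:
    """Assumed form: log_(max_pubs + 1)(n + 1), saturating at 1 (hypothetical helper)."""
    capped = n_publications.clip(upper=max_pubs)
    return np.log(capped + 1) / np.log(max_pubs + 1)


def combine_scores(components: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted mean of the component columns (hypothetical helper)."""
    missing = EXPECTED_KEYS - set(weights)
    if missing:
        raise ValueError(f"weights is missing expected keys: {sorted(missing)}")
    w = pd.Series(weights)[sorted(EXPECTED_KEYS)]
    # DataFrame * Series aligns the weights on the column labels.
    return (components[w.index] * w).sum(axis=1) / w.sum()


components = pd.DataFrame(
    {
        "interaction_method_score": [0.5],
        "interaction_type_score": [1.0],
        "publication_score": publication_score(pd.Series([0])),
    }
)
equal_weights = {key: 1.0 for key in EXPECTED_KEYS}
final_score = combine_scores(components, equal_weights)
```

With equal weights the final score is simply the mean of the three components; raising one weight pulls the result toward that component.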

napistu.ingestion.intact._calculate_category_scores_vectorized(counts_df: DataFrame, attribute_type: str) → DataFrame

Calculate method or type scores for all interactions using vectorized operations.

Parameters:

counts_df (pd.DataFrame) – DataFrame containing interaction data
attribute_type (str) – Type of attribute to calculate scores for (e.g., ‘interaction_method’, ‘interaction_type’)

Returns:

DataFrame containing category scores for each interaction

Return type:

pd.DataFrame

napistu.ingestion.intact._count_studies_with_scored_attributes(standardized_interaction_attrs: DataFrame) → DataFrame

Count the number of studies which report an interaction based on scored attributes.

Parameters:

standardized_interaction_attrs (pd.DataFrame) – Long-form dataframe with columns: upstream_name, downstream_name, study_id, attribute_type, scored_term, score

Returns:

scored_attribute_counts – The number of studies and score for each interaction-attribute_type-scored_term combination.

Return type:

pd.DataFrame
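
One way to picture the counting step is a pandas groupby over the columns listed above. The sketch below is illustrative only: the rows and score values are made up, the output column name `n_studies` is an assumption, and the real function may differ in detail.

```python
import pandas as pd

# Toy long-form input with the documented columns; values are invented.
standardized_interaction_attrs = pd.DataFrame(
    {
        "upstream_name": ["A", "A", "A", "A", "A"],
        "downstream_name": ["B", "B", "B", "B", "C"],
        "study_id": ["s1", "s2", "s2", "s1", "s3"],
        "attribute_type": [
            "interaction_method",
            "interaction_method",
            "interaction_method",
            "interaction_type",
            "interaction_method",
        ],
        "scored_term": ["term_x", "term_x", "term_x", "term_y", "term_z"],
        "score": [1.0, 1.0, 1.0, 0.66, 0.33],
    }
)

group_cols = [
    "upstream_name",
    "downstream_name",
    "attribute_type",
    "scored_term",
    "score",
]

# Count distinct studies so repeated rows from one study are counted once.
scored_attribute_counts = (
    standardized_interaction_attrs.groupby(group_cols)["study_id"]
    .nunique()
    .rename("n_studies")  # hypothetical column name
    .reset_index()
)
```

Here the A-B interaction has two distinct studies supporting `term_x`, even though study `s2` appears twice in the input.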

napistu.ingestion.intact._create_basic_edgelist(intact_summaries: Dict[str, DataFrame], lookup_table: Series) → DataFrame

Create a basic edgelist from the IntAct summaries and lookup table.

The edgelist is created by merging the IntAct summaries and lookup table on the study id and interaction id. It is then filtered to interactions in which both a bait and a prey are present, and pivoted according to the hub-and-spoke model used by IntAct, in which a single bait is connected to one or more prey.

Parameters:

intact_summaries (Dict[str, pd.DataFrame]) – A dictionary of IntAct summaries, keyed by study id.
lookup_table (pd.Series) – A lookup table of interaction ids and their corresponding interaction names.

Returns:

edgelist_df – A dataframe of the edgelist with the following columns:
- upstream_name: the name of the upstream node
- downstream_name: the name of the downstream node
- interaction_name: the name of the interaction
- study_id: the id of the study
- interaction_type: the type of interaction

Return type:

pd.DataFrame

Notes

The convention of associating each bait with many prey follows yeast two-hybrid screens, but it is applied across the board, even for technologies where bait-prey relationships are not appropriate (e.g., purifying a whole complex). In these cases IntAct chooses a random component to serve as the bait. This makes that component more closely related, in a network sense, to its interactors than they are to one another. This could be addressed by expanding interactions into all pairwise edges, but that would be quite tricky because some interactions have hundreds of prey and (N choose 2) edges would be cumbersome. Expansion could be limited to certain types of annotations, but that seems like a big headache for very little practical gain.
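
To make the hub-and-spoke construction and the cost of full pairwise expansion concrete, here is a small sketch. The `participants` table and its `role` and `name` columns are hypothetical stand-ins for the merged IntAct summaries, not napistu's actual schema.

```python
from math import comb

import pandas as pd

# Hypothetical participant table: one row per interactor in an interaction.
participants = pd.DataFrame(
    {
        "study_id": ["s1", "s1", "s1", "s1", "s1", "s1"],
        "interaction_name": ["i1", "i1", "i1", "i1", "i2", "i2"],
        "name": ["P1", "P2", "P3", "P4", "P5", "P6"],
        "role": ["bait", "prey", "prey", "prey", "prey", "prey"],
    }
)

keys = ["study_id", "interaction_name"]
baits = participants.loc[participants["role"] == "bait", keys + ["name"]].rename(
    columns={"name": "upstream_name"}
)
preys = participants.loc[participants["role"] == "prey", keys + ["name"]].rename(
    columns={"name": "downstream_name"}
)

# The inner merge keeps only interactions with both a bait and a prey,
# producing one edge per bait-prey pair (the spokes). i2 has no bait and is dropped.
edgelist_df = baits.merge(preys, on=keys, how="inner")

# Spoke edges grow linearly with participants; full expansion grows quadratically.
n_participants = 200
n_spoke_edges = n_participants - 1
n_pairwise_edges = comb(n_participants, 2)
```

For an interaction with 200 participants the spoke model yields 199 edges, while expanding to all pairs would yield 19,900.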

napistu.ingestion.intact._create_r_identifiers(group_data: DataFrame) → Identifiers

Create a list of identifiers for an experiment’s interactions.

Parameters:

group_data (pd.DataFrame) – A dataframe with the study metadata.

Returns:

An Identifiers object containing the interaction identifiers

Return type:

Identifiers

napistu.ingestion.intact._create_species_df(raw_species_df: DataFrame, raw_species_identifiers_df: DataFrame, organismal_species: str | OrganismalSpeciesValidator) → Tuple[Series, DataFrame]

Create a species dataframe from the raw species dataframe and the raw species identifiers dataframe.

Parameters:

raw_species_df (pd.DataFrame) – The raw species dataframe.
raw_species_identifiers_df (pd.DataFrame) – The raw species identifiers dataframe.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species pertaining to the IntAct interactions

Returns:

lookup_table (pd.Series) – A lookup table mapping study_id and interactor_id to the molecular species name.
species_df (pd.DataFrame) – The molecular species dataframe.

Raises:

ValueError – If the provided species is not supported by IntAct

napistu.ingestion.intact._define_edgelist_df_ids_and_counts(edgelist_w_study_metadata: DataFrame, alias_mapping: Dict[str, str], scored_attribute_counts: DataFrame) → DataFrame

Add attributes to the edgelist.

Parameters:

edgelist_w_study_metadata (pd.DataFrame) – The edgelist with study metadata.
alias_mapping (dict[str, str]) – A dictionary mapping from ontology aliases to the Napistu controlled vocabulary.
scored_attribute_counts (pd.DataFrame) – A dataframe with the number of studies with each attribute.

Returns:

A dataframe with the edgelist, citation counts, and identifiers.

Return type:

pd.DataFrame

napistu.ingestion.intact._filter_intact_xrefs(intact_summaries: Dict[str, DataFrame], alias_mapping: Dict[str, str], organismal_species: str | OrganismalSpeciesValidator, valid_secondary_ontologies: Set[str] = {'intact'}) → DataFrame

Filter IntAct species identifiers to only those which should be added as s_Identifiers.

Parameters:

intact_summaries (Dict[str, pd.DataFrame]) – The IntAct summaries table.
alias_mapping (Dict[str, str]) – A dictionary mapping from ontology aliases to the Napistu controlled vocabulary.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species pertaining to the IntAct interactions
valid_secondary_ontologies (Set[str], optional) – A set of ontologies which are valid secondary references, by default VALID_INTACT_SECONDARY_ONTOLOGIES

Returns:

A DataFrame of IntAct species identifiers which should be added as s_Identifiers.

Return type:

pd.DataFrame

Raises:

ValueError – If ontologies listed as valid secondary references are not in the Napistu controlled vocabulary

napistu.ingestion.intact._get_intact_scored_term(term_name: str, term_lookup: Dict[str, str]) → str: Get the scored ancestor term name for a given term.

napistu.ingestion.intact._get_intact_species_basename(latin_species: str) → str

napistu.ingestion.intact._get_intact_term_with_score(ontology_graph: Graph, scored_terms: Dict[str, float]) → Dict[str, str]: Build lookup mapping all terms to their scored ancestor names.
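
The idea of mapping every term to its scored ancestor can be sketched without igraph. The example below is a simplification under stated assumptions: it represents the ontology as a child-to-parent dictionary with a single parent per term (the real ontology graph may allow several), and the term names and scores are toy values rather than entries copied from the PSI-MI ontology. The helper name `build_scored_term_lookup` is hypothetical.

```python
from typing import Dict, Optional


def build_scored_term_lookup(
    parents: Dict[str, Optional[str]], scored_terms: Dict[str, float]
) -> Dict[str, str]:
    """Map each term to itself or its nearest ancestor that carries a score."""
    lookup: Dict[str, str] = {}
    for term in parents:
        current: Optional[str] = term
        seen = set()  # guards against accidental cycles in the toy hierarchy
        while current is not None and current not in scored_terms and current not in seen:
            seen.add(current)
            current = parents.get(current)
        if current is not None and current in scored_terms:
            lookup[term] = current
    return lookup


# Toy hierarchy: child -> parent (None marks the root).
parents = {
    "root": None,
    "scored_parent_a": "root",
    "child_of_a": "scored_parent_a",
    "grandchild_of_a": "child_of_a",
    "scored_parent_b": "root",
}
scored_terms = {"scored_parent_a": 1.0, "scored_parent_b": 0.5}

term_lookup = build_scored_term_lookup(parents, scored_terms)
```

Terms with no scored ancestor, such as `root` here, are left out of the lookup; how the real functions treat such terms is not stated in this reference.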

napistu.ingestion.intact._log_invalid_primary_refs(primary_refs: DataFrame, invalid_ontologies: Set[str]) → None

napistu.ingestion.intact._process_ensembl_ids(ensembl_primary_xrefs, latin_species: str) → DataFrame

Process ensembl IntAct references to convert them from the meta ensembl ontology to the Napistu controlled vocabulary.

Parameters:

ensembl_primary_xrefs (pd.DataFrame) – The standardized IntAct references filtered to the “ensembl” ontology.
latin_species (str) – The latin species name to filter ensembl ids to.

Returns:

The processed ensembl IntAct references.

Return type:

pd.DataFrame
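
A rough sketch of what splitting a generic "ensembl" ontology into more specific ones could look like is shown below. It relies on the standard Ensembl stable ID layout (`ENS`, an optional species code, a feature letter, then 11 digits). The target ontology names (`ensembl_gene`, `ensembl_transcript`, `ensembl_protein`), the three-letter species code assumption, and the helper `classify_ensembl_id` are illustrative assumptions, not confirmed napistu behavior.

```python
import re
from typing import Optional

# ENS + optional 3-letter species code + feature letter + 11 digits.
ENSEMBL_PATTERN = re.compile(r"^ENS(?P<species>[A-Z]{3})?(?P<kind>[GTP])\d{11}$")

# Assumed target ontology names.
KIND_TO_ONTOLOGY = {
    "G": "ensembl_gene",
    "T": "ensembl_transcript",
    "P": "ensembl_protein",
}


def classify_ensembl_id(identifier: str, species_code: str = "") -> Optional[str]:
    """Return the specific ontology for an Ensembl ID, or None if it does not match.

    species_code is the infix expected for the species of interest
    ("" for human IDs such as ENSG..., "MUS" for mouse IDs such as ENSMUSG...).
    """
    stable_id = identifier.split(".")[0]  # drop a version suffix if present
    match = ENSEMBL_PATTERN.match(stable_id)
    if match is None:
        return None
    if (match.group("species") or "") != species_code:
        return None  # ID belongs to a different species
    return KIND_TO_ONTOLOGY[match.group("kind")]


human_gene = classify_ensembl_id("ENSG00000139618")
mouse_gene_as_human = classify_ensembl_id("ENSMUSG00000017167")
mouse_gene = classify_ensembl_id("ENSMUSG00000017167", species_code="MUS")
```

Identifiers that return None would be candidates for filtering, which mirrors the species filtering described for the `latin_species` parameter.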

napistu.ingestion.intact._sanitize_identifiers(standardized_intact_xrefs, organismal_species: str | OrganismalSpeciesValidator, ensembl_ontology_name: str = 'ensembl', chebi_ontology_name: str = 'chebi', rna_central_ontology_name: str = 'rnacentral') → DataFrame

Sanitizes the identifiers in the standardized IntAct references.

This function applies ontology-specific manipulations and whitelists non-genic molecular species, such as metabolites, so they aren’t filtered downstream.

Parameters:

standardized_intact_xrefs (pd.DataFrame) – The standardized IntAct references.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species to filter ensembl ids to.
ensembl_ontology_name (str) – The name of the Ensembl ontology to convert from, by default ‘ensembl’.
chebi_ontology_name (str) – The name of the ChEBI ontology to convert from, by default ‘chebi’.
rna_central_ontology_name (str) – The name of the RNAcentral ontology to convert from, by default ‘rnacentral’.

Returns:

The sanitized IntAct references.

Return type:

pd.DataFrame

napistu.ingestion.intact._standardize_interaction_attrs(edgelist_w_study_metadata: DataFrame, ontology_url: str = 'https://raw.githubusercontent.com/MICommunity/miscore/refs/heads/master/miscore/src/main/resources/psimiOntology.json') → DataFrame

napistu.ingestion.intact.download_intact_xmls(output_dir_path: str, organismal_species: str | OrganismalSpeciesValidator, overwrite: bool = False) → None

Download IntAct Species

Download the PSI-MI 3.0 XML files from IntAct for a species of interest.

Parameters:

output_dir_path (str) – Local directory to create and unzip files into
organismal_species (str | OrganismalSpeciesValidator) – The species (e.g., “Homo sapiens”) to work with
overwrite (bool, optional) – Overwrite an existing output directory, by default False

Returns:

Files are downloaded and extracted to the specified directory

Return type:

None

napistu.ingestion.intact.intact_to_sbml_dfs(intact_summaries: dict[str, DataFrame], organismal_species: str | OrganismalSpeciesValidator) → SBML_dfs

Convert IntAct summaries to SBML_dfs

Parameters:

intact_summaries (dict[str, pd.DataFrame]) – A dictionary of IntAct summaries.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species pertaining to the IntAct interactions

Returns:

sbml_dfs – SBML_dfs object containing the converted IntAct data

Return type:

SBML_dfs

Raises:

ValueError – If intact_summaries does not contain the required tables
ValueError – If the provided species is not supported by IntAct
ValueError – If ontologies listed as valid secondary references are not in the Napistu controlled vocabulary