napistu.ingestion.intact
Functions
download_intact_xmls – Download IntAct Species
intact_to_sbml_dfs – Convert IntAct summaries to SBML_dfs
- napistu.ingestion.intact._build_psi_mi_ontology_graph(ontology_url: str = 'https://raw.githubusercontent.com/MICommunity/miscore/refs/heads/master/miscore/src/main/resources/psimiOntology.json') → Graph
Parse MI ontology JSON from URL and build igraph directed graph.
- napistu.ingestion.intact._calculate_all_scores_vectorized(counts_df: DataFrame, n_publications_series: Series, max_pubs: int = 7, weights: Dict[str, float] = {'interaction_method_score': 1.0, 'interaction_type_score': 1.0, 'publication_score': 1.0}) → DataFrame
Calculate all MIscore components using vectorized operations.
Note: This implementation follows the MIscore mathematical formulas from Villaveces et al. (2015), but it has not been validated against published IntAct scores because no detailed worked examples are available showing (1) specific interaction evidence (X studies of type Y using method Z), (2) the resulting component scores, and (3) the final MIscore.
- Parameters:
counts_df (pd.DataFrame) – DataFrame containing interaction counts and scores
n_publications_series (pd.Series) – Series containing publication counts for each interaction
max_pubs (int, optional) – Maximum publication threshold for scoring, by default INTACT_PUBLICATION_SCORE_THRESHOLD
weights (dict[str, float], optional) – Dictionary of weights for different score components, by default DEFAULT_INTACT_RELATIVE_WEIGHTS
- Returns:
DataFrame containing all calculated scores for each interaction
- Return type:
pd.DataFrame
- Raises:
ValueError – If the weights dictionary does not contain the expected keys
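As a rough illustration of how the components might be combined, the sketch below uses plain Python rather than the vectorized pandas implementation. It assumes that the publication score has the form log base (max_pubs + 1) of (n + 1), with n capped at max_pubs, and that the final score is a weighted mean of the three components, as described by Villaveces et al. (2015). The helper names `publication_score` and `combine_scores` are hypothetical and are not part of napistu.

```python
import math
from typing import Dict

# Keys taken from the default `weights` argument in the signature above.
EXPECTED_KEYS = {
    "interaction_method_score",
    "interaction_type_score",
    "publication_score",
}


def publication_score(n_publications: int, max_pubs: int = 7) -> float:
    """Assumed form: log_(max_pubs + 1)(n + 1), with n capped at max_pubs."""
    n = min(n_publications, max_pubs)
    return math.log(n + 1) / math.log(max_pubs + 1)


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of the component scores (hypothetical helper)."""
    if set(weights) != EXPECTED_KEYS:
        raise ValueError(
            f"weights must contain exactly these keys: {sorted(EXPECTED_KEYS)}"
        )
    total_weight = sum(weights.values())
    return sum(weights[key] * scores[key] for key in EXPECTED_KEYS) / total_weight


example_scores = {
    "interaction_method_score": 1.0,
    "interaction_type_score": 0.5,
    "publication_score": publication_score(0),  # no publications -> 0.0
}
equal_weights = {key: 1.0 for key in EXPECTED_KEYS}
example_total = combine_scores(example_scores, equal_weights)
```

With equal weights the result is simply the mean of the three components. Capping n at max_pubs means that additional publications beyond the threshold do not raise the score further.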
- napistu.ingestion.intact._calculate_category_scores_vectorized(counts_df: DataFrame, attribute_type: str) → DataFrame
Calculate method or type scores for all interactions using vectorized operations.
- Parameters:
counts_df (pd.DataFrame) – DataFrame containing interaction data
attribute_type (str) – Type of attribute to calculate scores for (e.g., ‘interaction_method’, ‘interaction_type’)
- Returns:
DataFrame containing category scores for each interaction
- Return type:
pd.DataFrame
- napistu.ingestion.intact._count_studies_with_scored_attributes(standardized_interaction_attrs: DataFrame) → DataFrame
Count the number of studies which report an interaction based on scored attributes.
- Parameters:
standardized_interaction_attrs (pd.DataFrame) – Long-form dataframe with columns: upstream_name, downstream_name, study_id, attribute_type, scored_term, score
- Returns:
scored_attribute_counts – The number of studies and score for each interaction-attribute_type-scored_term combination.
- Return type:
pd.DataFrame
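The counting step can be pictured with a small standard-library sketch. It is not the napistu implementation, which operates on a pandas DataFrame; it only shows the idea of counting distinct studies per interaction, attribute type, and scored term. The function name `count_studies`, the output key `n_studies`, and the assumption that each scored term carries a single score are all illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str, str]


def count_studies(rows: Iterable[dict]) -> Dict[Key, dict]:
    """Count distinct study ids for each interaction/attribute/term combination."""
    studies = defaultdict(set)
    scores = {}
    for row in rows:
        key = (
            row["upstream_name"],
            row["downstream_name"],
            row["attribute_type"],
            row["scored_term"],
        )
        # A set ignores repeated rows from the same study.
        studies[key].add(row["study_id"])
        # Assumption: the score depends only on the scored term.
        scores[key] = row["score"]
    return {
        key: {"n_studies": len(study_ids), "score": scores[key]}
        for key, study_ids in studies.items()
    }


# Toy long-form rows using the column names listed above; values are made up.
toy_rows = [
    {"upstream_name": "A", "downstream_name": "B", "study_id": "s1",
     "attribute_type": "interaction_method", "scored_term": "biochemical", "score": 1.0},
    {"upstream_name": "A", "downstream_name": "B", "study_id": "s1",
     "attribute_type": "interaction_method", "scored_term": "biochemical", "score": 1.0},
    {"upstream_name": "A", "downstream_name": "B", "study_id": "s2",
     "attribute_type": "interaction_method", "scored_term": "biochemical", "score": 1.0},
]
toy_counts = count_studies(toy_rows)
```

In pandas the same idea would typically be a group-by on the four key columns followed by a distinct count of study_id.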
- napistu.ingestion.intact._create_basic_edgelist(intact_summaries: Dict[str, DataFrame], lookup_table: Series) → DataFrame
Create a basic edgelist from the IntAct summaries and lookup table.
The edgelist is built by merging the IntAct summaries with the lookup table on the study id and interaction id. It is then filtered to interactions where both a bait and a prey are present, and pivoted according to the hub-and-spoke model used by IntAct, in which a single bait is connected to one or more prey.
- Parameters:
intact_summaries (Dict[str, pd.DataFrame]) – A dictionary of IntAct summaries, keyed by study id.
lookup_table (pd.Series) – A lookup table of interaction ids and their corresponding interaction names.
- Returns:
edgelist_df – A dataframe of the edgelist with the following columns: upstream_name (the name of the upstream node), downstream_name (the name of the downstream node), interaction_name (the name of the interaction), study_id (the id of the study), and interaction_type (the type of interaction).
- Return type:
pd.DataFrame
Notes
The convention of associating each bait with many prey follows the conventions set by yeast two-hybrid screens, but it is applied across the board, even for technologies where bait-prey relationships are not appropriate (e.g., purifying a whole complex). In these cases IntAct chooses a random component to serve as the bait. This makes that component more closely related, in a network sense, to its interactors than they are to one another. This could be addressed by expanding interactions into all pairwise edges, but that would be tricky because some interactions have hundreds of prey and the (N choose 2) expansion would be cumbersome. It could be done for only certain types of annotations, but it seems like a big headache for very little practical gain.
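To make the size argument concrete, the sketch below compares the number of edges produced by the hub-and-spoke representation with the number produced by a full pairwise expansion of the same complex. The function names `spoke_edges` and `matrix_edges` and the 200-prey example are illustrative only.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple


def spoke_edges(bait: str, preys: List[str]) -> List[Tuple[str, str]]:
    """Hub-and-spoke: one edge from the bait to each prey."""
    return [(bait, prey) for prey in preys]


def matrix_edges(members: List[str]) -> List[Tuple[str, str]]:
    """Full expansion: one edge for every unordered pair of members."""
    return list(combinations(sorted(members), 2))


preys = [f"prey_{i}" for i in range(200)]
n_spoke = len(spoke_edges("bait", preys))        # 200 edges
n_matrix = len(matrix_edges(["bait"] + preys))   # comb(201, 2) = 20100 edges
assert n_matrix == comb(len(preys) + 1, 2)
```

For a single interaction with 200 prey, the pairwise expansion yields roughly 100 times as many edges as the spoke form, which is the cost the note above refers to.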
- napistu.ingestion.intact._create_r_identifiers(group_data: DataFrame) → Identifiers
Create a list of identifiers for an experiment’s interactions.
- Parameters:
group_data (pd.DataFrame) – A dataframe with the study metadata.
- Returns:
An Identifiers object containing the interaction identifiers
- Return type:
Identifiers
- napistu.ingestion.intact._create_species_df(raw_species_df: DataFrame, raw_species_identifiers_df: DataFrame, organismal_species: str | OrganismalSpeciesValidator) → Tuple[Series, DataFrame]
Create a species dataframe from the raw species dataframe and the raw species identifiers dataframe.
- Parameters:
raw_species_df (pd.DataFrame) – The raw species dataframe.
raw_species_identifiers_df (pd.DataFrame) – The raw species identifiers dataframe.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species pertaining to the IntAct interactions
- Returns:
lookup_table (pd.Series) – A lookup table mapping study_id and interactor_id to the molecular species name.
species_df (pd.DataFrame) – The molecular species dataframe.
- Raises:
ValueError – If the provided species is not supported by IntAct
- napistu.ingestion.intact._define_edgelist_df_ids_and_counts(edgelist_w_study_metadata: DataFrame, alias_mapping: Dict[str, str], scored_attribute_counts: DataFrame) → DataFrame
Add attributes to the edgelist.
- Parameters:
edgelist_w_study_metadata (pd.DataFrame) – The edgelist with study metadata.
alias_mapping (dict[str, str]) – A dictionary mapping from ontology aliases to the Napistu controlled vocabulary.
scored_attribute_counts (pd.DataFrame) – A dataframe with the number of studies with each attribute.
- Returns:
A dataframe with the edgelist, citation counts, and identifiers.
- Return type:
pd.DataFrame
- napistu.ingestion.intact._filter_intact_xrefs(intact_summaries: Dict[str, DataFrame], alias_mapping: Dict[str, str], organismal_species: str | OrganismalSpeciesValidator, valid_secondary_ontologies: Set[str] = {'intact'}) → DataFrame
Filter IntAct species identifiers to only those which should be added as s_Identifiers.
- Parameters:
intact_summaries (Dict[str, pd.DataFrame]) – The IntAct summaries table.
alias_mapping (Dict[str, str]) – A dictionary mapping from ontology aliases to the Napistu controlled vocabulary.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species pertaining to the IntAct interactions
valid_secondary_ontologies (Set[str], optional) – A set of ontologies which are valid secondary references, by default VALID_INTACT_SECONDARY_ONTOLOGIES
- Returns:
A DataFrame of IntAct species identifiers which should be added as s_Identifiers.
- Return type:
pd.DataFrame
- Raises:
ValueError – If ontologies listed as valid secondary references are not in the Napistu controlled vocabulary
- napistu.ingestion.intact._get_intact_scored_term(term_name: str, term_lookup: Dict[str, str]) → str
Get the scored ancestor term name for a given term.
- napistu.ingestion.intact._get_intact_species_basename(latin_species: str) → str
- napistu.ingestion.intact._get_intact_term_with_score(ontology_graph: Graph, scored_terms: Dict[str, float]) → Dict[str, str]
Build lookup mapping all terms to their scored ancestor names.
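The idea of mapping each term to its scored ancestor can be sketched without igraph. The example below assumes a simplified hierarchy in which each term has at most one parent (the real PSI-MI ontology is a graph and terms may have several), and it resolves each term to the nearest ancestor that has a score. The function name, the toy hierarchy, and the score values are hypothetical and do not reproduce the real ontology.

```python
from typing import Dict, Optional


def build_scored_ancestor_lookup(
    parents: Dict[str, Optional[str]],
    scored_terms: Dict[str, float],
) -> Dict[str, str]:
    """Map each term to its nearest scored ancestor (or itself if scored)."""
    lookup: Dict[str, str] = {}
    for term in set(parents) | set(scored_terms):
        current: Optional[str] = term
        seen = set()
        # Walk up the hierarchy until a scored term, the root, or a cycle is hit.
        while current is not None and current not in scored_terms and current not in seen:
            seen.add(current)
            current = parents.get(current)
        if current is not None and current in scored_terms:
            lookup[term] = current
    return lookup


# Toy hierarchy loosely modelled on PSI-MI term names; child -> parent.
toy_parents = {
    "two hybrid": "protein complementation assay",
    "protein complementation assay": "interaction detection method",
    "pull down": "affinity chromatography",
    "affinity chromatography": "biochemical",
    "biochemical": "interaction detection method",
    "interaction detection method": None,
}
toy_scored_terms = {"protein complementation assay": 0.66, "biochemical": 1.0}
toy_lookup = build_scored_ancestor_lookup(toy_parents, toy_scored_terms)
```

A lookup of this shape is what `_get_intact_scored_term` takes as `term_lookup`: a term name goes in, and the name of its scored ancestor comes out. Terms with no scored ancestor, such as the root in the toy data, are left out of the lookup.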
- napistu.ingestion.intact._log_invalid_primary_refs(primary_refs: DataFrame, invalid_ontologies: Set[str]) → None
- napistu.ingestion.intact._process_ensembl_ids(ensembl_primary_xrefs, latin_species: str) → DataFrame
Process ensembl IntAct references to convert them from the meta ensembl ontology to the Napistu controlled vocabulary.
- Parameters:
ensembl_primary_xrefs (pd.DataFrame) – The standardized IntAct references filtered to the “ensembl” ontology.
latin_species (str) – The latin species name to filter ensembl ids to.
- Returns:
The processed ensembl IntAct references.
- Return type:
pd.DataFrame
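One plausible reading of converting from a "meta" Ensembl ontology is that generic Ensembl identifiers are assigned to a more specific ontology based on the feature-type letter in the stable ID (G for gene, T for transcript, P for protein). The sketch below illustrates only that classification step, under that assumption. The target names `ensembl_gene`, `ensembl_transcript`, and `ensembl_protein` and the helper `classify_ensembl_id` are assumptions, and the species filtering performed by the real function is not shown.

```python
import re
from typing import Optional

# Assumed mapping from the feature-type letter to a specific ontology name.
FEATURE_TO_ONTOLOGY = {
    "G": "ensembl_gene",
    "T": "ensembl_transcript",
    "P": "ensembl_protein",
}

# "ENS", an optional species prefix (e.g. "MUS"), a feature letter,
# digits, and an optional ".version" suffix.
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*([GTP])\d+(\.\d+)?$")


def classify_ensembl_id(identifier: str) -> Optional[str]:
    """Return the assumed specific ontology for an Ensembl stable ID, else None."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        return None
    return FEATURE_TO_ONTOLOGY[match.group(1)]


examples = {
    identifier: classify_ensembl_id(identifier)
    for identifier in ["ENSG00000139618", "ENST00000380152.8", "ENSMUSP00000000001", "P04637"]
}
```

The pattern is deliberately loose: it keys on the last letter before the digits, so other Ensembl ID families that end in G, T, or P (for example gene-tree IDs) would need to be excluded separately.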
- napistu.ingestion.intact._sanitize_identifiers(standardized_intact_xrefs, organismal_species: str | OrganismalSpeciesValidator, ensembl_ontology_name: str = 'ensembl', chebi_ontology_name: str = 'chebi', rna_central_ontology_name: str = 'rnacentral') → DataFrame
Sanitizes the identifiers in the standardized IntAct references.
This function applies ontology-specific manipulations and whitelists non-genic molecular species, such as metabolites, so that they are not filtered out downstream.
- Parameters:
standardized_intact_xrefs (pd.DataFrame) – The standardized IntAct references.
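organismal_species (str | OrganismalSpeciesValidator) – The organismal species to filter Ensembl ids to.
ensembl_ontology_name (str) – The name of the Ensembl ontology to convert from, by default 'ensembl'.
chebi_ontology_name (str) – The name of the ChEBI ontology in the standardized references, by default 'chebi'.
rna_central_ontology_name (str) – The name of the RNAcentral ontology in the standardized references, by default 'rnacentral'.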
- Returns:
The sanitized IntAct references.
- Return type:
pd.DataFrame
- napistu.ingestion.intact._standardize_interaction_attrs(edgelist_w_study_metadata: DataFrame, ontology_url: str = 'https://raw.githubusercontent.com/MICommunity/miscore/refs/heads/master/miscore/src/main/resources/psimiOntology.json') → DataFrame
- napistu.ingestion.intact.download_intact_xmls(output_dir_path: str, organismal_species: str | OrganismalSpeciesValidator, overwrite: bool = False) → None
Download IntAct Species
Download the PSI-MI 3.0 XML files from IntAct for a species of interest.
- Parameters:
output_dir_path (str) – Local directory to create and unzip files into
organismal_species (str | OrganismalSpeciesValidator) – The species (e.g., “Homo sapiens”) to work with
overwrite (bool, optional) – Overwrite an existing output directory, by default False
- Returns:
Files are downloaded and extracted to the specified directory
- Return type:
None
- napistu.ingestion.intact.intact_to_sbml_dfs(intact_summaries: dict[str, DataFrame], organismal_species: str | OrganismalSpeciesValidator) → SBML_dfs
Convert IntAct summaries to SBML_dfs
- Parameters:
intact_summaries (dict[str, pd.DataFrame]) – A dictionary of IntAct summaries.
organismal_species (str | OrganismalSpeciesValidator) – The organismal species pertaining to the IntAct interactions
- Returns:
sbml_dfs – SBML_dfs object containing the converted IntAct data
- Return type:
SBML_dfs
- Raises:
ValueError – If intact_summaries does not contain the required tables
ValueError – If the provided species is not supported by IntAct
ValueError – If ontologies listed as valid secondary references are not in the Napistu controlled vocabulary