napistu.source
The Source class for tracking the model(s) an entity (i.e., a compartment, species, reaction) came from.
Classes
- Source
A class for tracking the model(s) an entity (i.e., a compartment, species, reaction) came from.
Functions
- create_source_table
Create Source Table
- merge_sources
Merge Sources
- source_set_coverage
Greedy Set Coverage of Sources
- unnest_sources
Unnest Sources - Optimized Version
Classes
- Source
An Entity's Source
- class napistu.source.Source(source_df: DataFrame, pw_index: PWIndex | None = None)
Bases:
object
An Entity’s Source
- source
A dataframe containing the model source and other optional variables
- Type:
pd.DataFrame
- empty() : classmethod
Create an empty Source object
- single_entry(model, pathway_id, **kwargs) : classmethod
Create a Source object with a single entry
- validate_single_source() : bool
Check whether the Source object contains exactly 1 entry
- _validate_source_df(source_df) : None
Validate that source_df is a pandas DataFrame with required columns
- _validate_pathway_index(pw_index, source_df) : None
Validate pathway index and check for missing pathways
- _process_source_df(source_df, pw_index) : pd.DataFrame
Process source DataFrame by merging with pathway index if provided
- _validate_final_df(df, required_columns) : None
Validate that the final DataFrame has all required columns
- classmethod empty() Source
Create an empty Source object.
This is typically used when creating an SBML_dfs object from a single source.
- Returns:
An empty Source instance with source attribute set to None
- Return type:
Source
- classmethod single_entry(model: str, pathway_id: str | None = None, file: str | None = None, data_source: str | None = None, organismal_species: str | None = None, name: str | None = None, date: str | None = None) Source
Create a Source object with a single entry.
Convenience method for creating a Source with one row containing the core attributes from the pathway index schema.
- Parameters:
model (str) – The model identifier (required)
pathway_id (str, optional) – The pathway identifier. Defaults to same as model if not provided
file (str, optional) – Source file path or identifier
data_source (str, optional) – Source database or origin (e.g., ‘Reactome’, ‘KEGG’)
organismal_species (str, optional) – Species the pathway is from
name (str, optional) – Human-readable pathway/model name
date (str, optional) – Date of pathway/model creation or last update
- Returns:
A Source instance with a single-row DataFrame
- Return type:
Source
Examples
>>> source = Source.single_entry(
...     model="R-HSA-123",
...     name="Glycolysis",
...     data_source="Reactome",
...     organismal_species="Homo sapiens"
... )
- __init__(source_df: DataFrame, pw_index: PWIndex | None = None) None
Tracks the model(s) an entity (i.e., a compartment, species, reaction) came from.
By convention sources exist only for the models that an entity came from rather than the current model they are part of. For example, when combining Reactome models into a consensus, a molecule which existed in multiple models would have a source entry for each, but it would not have a source entry for the consensus model itself.
- Parameters:
source_df (pd.DataFrame) – A dataframe containing the model source and other optional variables
pw_index (PWIndex, optional) – A pathway index object containing the pathway_id and other metadata
- Return type:
None.
- Raises:
ValueError – If pw_index is not a PWIndex
ValueError – If required columns are not present in source_df
TypeError – If source_df is not a pd.DataFrame
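The validation and merge steps that the constructor describes (check the pathway index for missing pathways, then merge its metadata into the source rows) can be pictured with plain Python data. This is a minimal sketch under stated assumptions, not napistu's implementation: the helper `attach_pathway_metadata`, the dictionary-based index, the sample rows, and the choice to raise `ValueError` for a missing pathway are all hypothetical.

```python
# Hypothetical stand-in for a pathway index: pathway_id -> metadata.
pw_index = {
    "R-HSA-1": {"name": "Glycolysis", "data_source": "Reactome"},
    "R-HSA-2": {"name": "Gluconeogenesis", "data_source": "Reactome"},
}

# Hypothetical source rows: one per originating model.
source_rows = [
    {"model": "R-HSA-1", "pathway_id": "R-HSA-1"},
    {"model": "R-HSA-2", "pathway_id": "R-HSA-2"},
]


def attach_pathway_metadata(rows, index):
    """Check that every pathway_id is indexed, then add its metadata."""
    missing = sorted({row["pathway_id"] for row in rows} - set(index))
    if missing:
        # Assumption: a pathway absent from the index is treated as an error.
        raise ValueError(f"pathways missing from the index: {missing}")
    # Equivalent to a left join of the source rows onto the index.
    return [{**row, **index[row["pathway_id"]]} for row in rows]


merged = attach_pathway_metadata(source_rows, pw_index)
```

Each merged row keeps its `model` and `pathway_id` and gains the metadata recorded for that pathway.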
- _process_source_df(source_df: DataFrame, pw_index: PWIndex | None) DataFrame
Process source DataFrame by merging with pathway index if provided.
- _validate_final_df(df: DataFrame, required_columns: list[str]) None
Validate that the final DataFrame has all required columns.
- _validate_pathway_index(pw_index: PWIndex, source_df: DataFrame) None
Validate pathway index and check for missing pathways.
- _validate_source_df(source_df: DataFrame) None
Validate that source_df is a pandas DataFrame with required columns.
- validate_single_source() bool
Check whether the Source object contains exactly 1 entry.
- Returns:
True if the Source contains exactly one row, False otherwise
- Return type:
bool
- Raises:
ValueError – If the Source object is empty or contains more than one row
- napistu.source._collapse_by_membership_string(membership_string: str, membership_categories: DataFrame, table_schema: dict) DataFrame
Assign each member of a membership-string to a set of pathways.
- napistu.source._collapse_source_df(source_df: DataFrame | Series) Series
Collapse a source_df table into a single entry.
Combines multiple source entries into a single entry by joining values with “ OR “ separators. Handles None values by filtering them out before joining.
- Parameters:
source_df (pd.DataFrame or pd.Series) – Source data to collapse. Must contain required columns MODEL and PATHWAY_ID.
- Returns:
Collapsed source entry with joined values and count of collapsed pathways.
- Return type:
pd.Series
- Raises:
TypeError – If source_df is not a DataFrame or Series.
ValueError – If required columns MODEL or PATHWAY_ID are missing.
Notes
None values are filtered out before joining
For DataFrame input, unique values are used for DATA_SOURCE and ORGANISMAL_SPECIES
The N_COLLAPSED_PATHWAYS field tracks how many entries were collapsed
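The collapsing rule described above can be sketched without pandas. This is an illustration only: `collapse_entries` is a hypothetical helper, the lowercase field names stand in for the MODEL, PATHWAY_ID, DATA_SOURCE, ORGANISMAL_SPECIES and N_COLLAPSED_PATHWAYS constants, and the sample entries are made up.

```python
def collapse_entries(entries, unique_fields=("data_source", "organismal_species")):
    """Join each field across entries with ' OR ', skipping None values."""
    if not entries:
        raise ValueError("nothing to collapse")
    # Collect field names in first-seen order.
    fields = []
    for entry in entries:
        for key in entry:
            if key not in fields:
                fields.append(key)
    collapsed = {}
    for field in fields:
        values = [e.get(field) for e in entries if e.get(field) is not None]
        if field in unique_fields:
            values = list(dict.fromkeys(values))  # de-duplicate, keep order
        collapsed[field] = " OR ".join(values) if values else None
    # Record how many entries went into the collapsed record.
    collapsed["n_collapsed_pathways"] = len(entries)
    return collapsed


entries = [
    {"model": "R-HSA-1", "pathway_id": "R-HSA-1",
     "data_source": "Reactome", "name": "Glycolysis"},
    {"model": "R-HSA-2", "pathway_id": "R-HSA-2",
     "data_source": "Reactome", "name": None},
]
collapsed = collapse_entries(entries)
```

Here the two model identifiers are joined with " OR ", the repeated data source appears once, and the `None` name is dropped before joining.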
- napistu.source._deduplicate_source_df(source_df: DataFrame) DataFrame
Combine entries in a source table when multiple models have the same members.
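One way to picture this deduplication is to group pathways by their exact member set. The sketch below is an assumption about the idea, not the library's code: the `(entity_id, pathway_id)` rows are invented, and labelling a merged group by joining pathway ids with " OR " is borrowed from the collapsing convention rather than confirmed for this function.

```python
from collections import defaultdict

# Hypothetical unnested rows: (entity_id, pathway_id).
memberships = [
    ("S1", "pw_A"), ("S2", "pw_A"),
    ("S1", "pw_B"), ("S2", "pw_B"),
    ("S3", "pw_C"),
]

# Step 1: collect the members of each pathway.
members_by_pathway = defaultdict(set)
for entity_id, pathway_id in memberships:
    members_by_pathway[pathway_id].add(entity_id)

# Step 2: group pathways that share exactly the same member set.
pathways_by_members = defaultdict(list)
for pathway_id, members in members_by_pathway.items():
    pathways_by_members[frozenset(members)].append(pathway_id)

# Step 3: emit one combined entry per distinct member set.
deduplicated = {
    " OR ".join(sorted(pathways)): sorted(members)
    for members, pathways in pathways_by_members.items()
}
```

In this example `pw_A` and `pw_B` have identical members, so they end up as a single combined entry, while `pw_C` stays on its own.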
- napistu.source._ensure_source_total_counts(source_total_counts: Series | DataFrame | None, verbose: bool = False) Series | None
- napistu.source._safe_source_merge(member_Sources: Source | list) Source
Combine either a Source or pd.Series of Sources into a single Source object.
- napistu.source._select_top_pathway_by_enrichment(unaccounted_for_members: DataFrame, source_total_counts: Series, n_total_entities: int, table_pk: str, min_pw_size: int = 3) str
- napistu.source._select_top_pathway_by_size(unaccounted_for_members: DataFrame, min_pw_size: int = 3) str
- napistu.source._update_unaccounted_for_members(top_pathway, unaccounted_for_members) DataFrame
Update the unaccounted for members dataframe by removing the members associated with the top pathway.
- Parameters:
top_pathway (str) – the pathway to remove from the unaccounted for members
unaccounted_for_members (pd.DataFrame) – the dataframe of unaccounted for members
- Returns:
unaccounted_for_members – the dataframe of unaccounted for members with the top pathway removed
- Return type:
pd.DataFrame
- napistu.source.create_source_table(lookup_table: Series, table_schema: dict, pw_index: PWIndex | None) DataFrame
Create Source Table
Create a table with one row per “new_id” and a Source object created from the union of “old_id” Source objects
- Parameters:
lookup_table (pd.Series) – a pd.Series containing the index of the table to create a source table for
table_schema (dict) – a dictionary containing the schema of the table to create a source table for
pw_index (PWIndex) – a pathway index object containing the pathway_id and other metadata
- Returns:
source_table – a pd.DataFrame containing the index of the table to create a source table for with one row per “new_id” and a Source object created from the union of “old_id” Source objects
- Return type:
pd.DataFrame
- Raises:
ValueError – If SOURCE_SPEC.DATA_SOURCE is not present in table_schema
- napistu.source.merge_sources(source_list: list | Series) Source
Merge Sources
Merge a list of Source objects into a single Source object
- Parameters:
source_list (list | pd.Series) – a list of Source objects or a pd.Series of Source objects
- Returns:
source – a Source object created from the union of the Source objects in source_list
- Return type:
Source
- Raises:
TypeError – If source_list is not a list or pd.Series
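The union behaviour can be sketched with lists of dictionaries standing in for Source objects. This is a hedged illustration: `merge_source_rows` is a hypothetical helper, it accepts only a list (the real function also accepts a pd.Series), and dropping exact duplicate rows is an assumption about what "union" means here.

```python
def merge_source_rows(source_list):
    """Union the rows of several sources, dropping exact duplicates."""
    if not isinstance(source_list, list):
        raise TypeError(f"expected a list, got {type(source_list).__name__}")
    merged, seen = [], set()
    for rows in source_list:
        for row in rows:
            key = tuple(sorted(row.items()))  # hashable fingerprint of the row
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


# Two hypothetical Reactome models that both contain the same molecule.
glycolysis = [{"model": "R-HSA-1", "pathway_id": "R-HSA-1"}]
gluconeogenesis = [{"model": "R-HSA-2", "pathway_id": "R-HSA-2"}]

# The molecule's source in a consensus model lists both originating models
# and, following the convention on Source, no entry for the consensus itself.
consensus_source = merge_source_rows([glycolysis, gluconeogenesis, glycolysis])
```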
- napistu.source.source_set_coverage(select_sources_df: pd.DataFrame, source_total_counts: pd.Series | pd.DataFrame | None = None, sbml_dfs: SBML_dfs | None = None, min_pw_size: int = 3, verbose: bool = False) pd.DataFrame
Greedy Set Coverage of Sources
Find the set of pathways covering select_sources_df. If source_total_counts is provided, pathways are selected iteratively based on statistical enrichment. If source_total_counts is not provided, the largest pathways are chosen iteratively.
- Parameters:
select_sources_df (pd.DataFrame) – A pd.DataFrame containing the index of source_table, expanded to include one row per source, as produced by source.unnest_sources()
source_total_counts (pd.Series | pd.DataFrame) – A pd.Series containing the total counts of each source, as produced by source.get_source_total_counts(), or a pd.DataFrame with two columns: pathway_id and total_counts
sbml_dfs (SBML_dfs) – If source_total_counts is provided, sbml_dfs must also be provided so that the total number of entities in the table can be calculated
min_pw_size (int) – the minimum size of a pathway to be considered
verbose (bool) – Whether to print verbose output
- Returns:
minimal_sources – The pathway_ids of the minimal source set
- Return type:
pd.DataFrame
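The size-based branch of the greedy procedure can be sketched in plain Python. This is a minimal sketch, not the library's implementation: `greedy_cover_by_size` and the sample pathways are hypothetical, ties are broken alphabetically for determinism, and applying min_pw_size to the number of still-uncovered members is an assumption. The enrichment-based branch is not shown.

```python
def greedy_cover_by_size(members_by_pathway, min_pw_size=3):
    """Repeatedly pick the pathway covering the most uncovered entities."""
    unaccounted = set().union(*members_by_pathway.values())
    selected = []
    while unaccounted:
        # How many still-uncovered entities would each pathway add?
        gains = {
            pathway: len(members & unaccounted)
            for pathway, members in members_by_pathway.items()
        }
        # Iterating over sorted keys makes ties resolve alphabetically.
        top_pathway = max(sorted(gains), key=gains.get)
        if gains[top_pathway] < min_pw_size:
            break  # nothing left is large enough to be worth adding
        selected.append(top_pathway)
        # Remove the members accounted for by the chosen pathway.
        unaccounted -= members_by_pathway[top_pathway]
    return selected


pathways = {
    "pw_A": {1, 2, 3, 4},
    "pw_B": {3, 4, 5, 6, 7},
    "pw_C": {1, 2},
    "pw_D": {8},
}
strict_cover = greedy_cover_by_size(pathways, min_pw_size=3)
full_cover = greedy_cover_by_size(pathways, min_pw_size=1)
```

With the default threshold only `pw_B` is selected, because after it is chosen no remaining pathway adds three or more uncovered entities. Lowering the threshold to 1 lets the loop continue until every entity is covered.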
- napistu.source.unnest_sources(source_table: DataFrame) DataFrame
Unnest Sources - Optimized Version
Take a pd.DataFrame containing an array of Sources and return one-row per source.
- Parameters:
source_table (pd.DataFrame) – a table containing an array of Sources
- Returns:
A pd.DataFrame containing the index of source_table, expanded to include one row per source
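The reshaping performed by unnesting can be illustrated with a dictionary standing in for the source table. This is a sketch under assumptions: the `entity_id` key, the sample rows, and the choice to emit no rows for an entity whose source is empty are all hypothetical.

```python
# Hypothetical source table: entity id -> list of source rows (or None).
source_table = {
    "S1": [{"model": "R-HSA-1"}, {"model": "R-HSA-2"}],
    "S2": [{"model": "R-HSA-1"}],
    "S3": None,  # an entity with an empty source
}

# One output row per (entity, source) pair, keeping the entity id
# so the original index can be recovered.
unnested = [
    {"entity_id": entity_id, **row}
    for entity_id, rows in source_table.items()
    for row in (rows or [])
]
```

`S1` contributes two rows, `S2` one, and `S3` none, so the unnested table has three rows in total.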