napistu.source

The Source class for tracking the model(s) an entity (i.e., a compartment, species, reaction) came from.

Classes

Source: A class for tracking the model(s) an entity (i.e., a compartment, species, reaction) came from.

Functions

`create_source_table`(lookup_table, ...)	Create Source Table
`merge_sources`(source_list)	Merge Sources
`source_set_coverage`(select_sources_df[, ...])	Greedy Set Coverage of Sources
`unnest_sources`(source_table)	Unnest Sources - Optimized Version

Classes

Source(source_df[, pw_index])

An Entity's Source

class napistu.source.Source(source_df: DataFrame, pw_index: PWIndex | None = None)

Bases: object

An Entity’s Source

source

A dataframe containing the model source and other optional variables

Type: pd.DataFrame

empty() : classmethod: Create an empty Source object

single_entry(model, pathway_id, **kwargs) : classmethod: Create a Source object with a single entry

validate_single_source() : bool: Check whether the Source object contains exactly 1 entry

_validate_source_df(source_df) : None: Validate that source_df is a pandas DataFrame with required columns

_validate_pathway_index(pw_index, source_df) : None: Validate pathway index and check for missing pathways

_process_source_df(source_df, pw_index) : pd.DataFrame: Process source DataFrame by merging with pathway index if provided

_validate_final_df(df, required_columns) : None: Validate that the final DataFrame has all required columns

classmethod empty() → Source

Create an empty Source object.

This is typically used when creating an SBML_dfs object from a single source.

Returns: An empty Source instance with source attribute set to None
Return type: Source

classmethod single_entry(model, pathway_id, **kwargs) → Source

Create a Source object with a single entry.

Convenience method for creating a Source with one row containing the core attributes from the pathway index schema.

Parameters:

model (str) – The model identifier (required)
pathway_id (str, optional) – The pathway identifier. Defaults to same as model if not provided
file (str, optional) – Source file path or identifier
data_source (str, optional) – Source database or origin (e.g., ‘Reactome’, ‘KEGG’)
organismal_species (str, optional) – Species the pathway is from
name (str, optional) – Human-readable pathway/model name
date (str, optional) – Date of pathway/model creation or last update

Returns:

A Source instance with a single-row DataFrame

Return type:

Source

Examples

>>> source = Source.single_entry(
...     model="R-HSA-123",
...     name="Glycolysis",
...     data_source="Reactome",
...     organismal_species="Homo sapiens"
... )

__init__(source_df: DataFrame, pw_index: PWIndex | None = None) → None

Tracks the model(s) an entity (i.e., a compartment, species, reaction) came from.

By convention sources exist only for the models that an entity came from rather than the current model they are part of. For example, when combining Reactome models into a consensus, a molecule which existed in multiple models would have a source entry for each, but it would not have a source entry for the consensus model itself.

Parameters:

source_df (pd.DataFrame) – A dataframe containing the model source and other optional variables
pw_index (PWIndex, optional) – A pathway index object containing the pathway_id and other metadata

Return type:

None.

Raises:

ValueError – If pw_index is not a PWIndex
ValueError – If required columns are not present in source_df
TypeError – If source_df is not a pd.DataFrame

_process_source_df(source_df: DataFrame, pw_index: PWIndex | None) → DataFrame: Process source DataFrame by merging with pathway index if provided.

_validate_final_df(df: DataFrame, required_columns: list[str]) → None: Validate that the final DataFrame has all required columns.

_validate_pathway_index(pw_index: PWIndex, source_df: DataFrame) → None: Validate pathway index and check for missing pathways.

_validate_source_df(source_df: DataFrame) → None: Validate that source_df is a pandas DataFrame with required columns.

validate_single_source() → bool

Check whether the Source object contains exactly 1 entry.

Returns: True if the Source contains exactly one row, False otherwise
Return type: bool
Raises: ValueError – If the Source object is empty or contains more than one row

napistu.source._collapse_by_membership_string(membership_string: str, membership_categories: DataFrame, table_schema: dict) → DataFrame: Assign each member of a membership-string to a set of pathways.

napistu.source._collapse_source_df(source_df: DataFrame | Series) → Series

Collapse a source_df table into a single entry.

Combines multiple source entries into a single entry by joining values with “ OR “ separators. Handles None values by filtering them out before joining.

Parameters:

source_df (pd.DataFrame or pd.Series) – Source data to collapse. Must contain required columns MODEL and PATHWAY_ID.

Returns:

Collapsed source entry with joined values and count of collapsed pathways.

Return type:

pd.Series

Raises:

TypeError – If source_df is not a DataFrame or Series.
ValueError – If required columns MODEL or PATHWAY_ID are missing.

Notes

None values are filtered out before joining
For DataFrame input, unique values are used for DATA_SOURCE and ORGANISMAL_SPECIES
The N_COLLAPSED_PATHWAYS field tracks how many entries were collapsed
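
The joining behaviour described above can be sketched without pandas. This is an illustration only, not napistu's implementation: `collapse_entries`, the plain-dict rows, the lowercase keys, and the placeholder identifiers are all assumptions made for the example.

```python
def collapse_entries(rows, sep=" OR "):
    """Collapse several source rows (plain dicts) into a single entry.

    Hypothetical helper mirroring the documented behaviour: drop None
    values, join the rest with " OR ", and count the collapsed rows.
    """
    if not rows:
        raise ValueError("at least one row is required")
    for required in ("model", "pathway_id"):
        if any(required not in row for row in rows):
            raise ValueError(f"missing required column: {required}")

    # Collect keys in first-seen order across all rows
    keys = list(dict.fromkeys(key for row in rows for key in row))
    collapsed = {}
    for key in keys:
        values = [row.get(key) for row in rows if row.get(key) is not None]
        if key in ("data_source", "organismal_species"):
            # Unique values only, preserving order
            values = list(dict.fromkeys(values))
        collapsed[key] = sep.join(values) if values else None
    collapsed["n_collapsed_pathways"] = len(rows)
    return collapsed


rows = [
    {"model": "R-HSA-1", "pathway_id": "R-HSA-1", "name": "Glycolysis", "data_source": "Reactome"},
    {"model": "R-HSA-2", "pathway_id": "R-HSA-2", "name": None, "data_source": "Reactome"},
]
entry = collapse_entries(rows)
```

Here `entry["model"]` becomes a single joined string, the missing name is skipped rather than rendered as text, and the repeated data source appears once.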

napistu.source._deduplicate_source_df(source_df: DataFrame) → DataFrame: Combine entries in a source table when multiple models have the same members.
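
One way to picture this deduplication is to group pathways whose member sets are identical. The sketch below assumes a plain mapping from pathway to member IDs; `group_by_membership` and the identifiers are hypothetical, and napistu's internal representation (including its membership strings) may differ.

```python
from collections import defaultdict


def group_by_membership(pathway_members, sep=" OR "):
    """Group pathways that share exactly the same members.

    Returns a dict mapping a joined pathway label to its sorted member list.
    """
    groups = defaultdict(list)
    for pathway, members in pathway_members.items():
        # A sorted tuple acts as an order-independent membership key
        key = tuple(sorted(members))
        groups[key].append(pathway)
    return {sep.join(sorted(pathways)): list(key) for key, pathways in groups.items()}


pathway_members = {
    "pw_A": {"s1", "s2"},
    "pw_B": {"s2", "s1"},
    "pw_C": {"s3"},
}
deduplicated = group_by_membership(pathway_members)
```

Pathways with the same members end up under one combined label, so downstream steps see each distinct member set once.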

napistu.source._ensure_source_total_counts(source_total_counts: Series | DataFrame | None, verbose: bool = False) → Series | None

napistu.source._safe_source_merge(member_Sources: Source | list) → Source: Combine either a Source or pd.Series of Sources into a single Source object.

napistu.source._select_top_pathway_by_enrichment(unaccounted_for_members: DataFrame, source_total_counts: Series, n_total_entities: int, table_pk: str, min_pw_size: int = 3) → str

napistu.source._select_top_pathway_by_size(unaccounted_for_members: DataFrame, min_pw_size: int = 3) → str

napistu.source._update_unaccounted_for_members(top_pathway, unaccounted_for_members) → DataFrame

Update the unaccounted for members dataframe by removing the members associated with the top pathway.

Parameters:

top_pathway (str) – the pathway to remove from the unaccounted for members
unaccounted_for_members (pd.DataFrame) – the dataframe of unaccounted for members

Returns:

unaccounted_for_members – the dataframe of unaccounted for members with the top pathway removed

Return type:

pd.DataFrame
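
A minimal sketch of this update step, assuming memberships are held as (member, pathway) pairs rather than a DataFrame. `drop_members_of` and the identifiers are hypothetical; napistu's own function operates on pandas objects.

```python
def drop_members_of(top_pathway, memberships):
    """Remove every membership row whose member is covered by top_pathway."""
    covered = {member for member, pathway in memberships if pathway == top_pathway}
    return [(member, pathway) for member, pathway in memberships if member not in covered]


memberships = [
    ("s1", "pw_A"), ("s2", "pw_A"),
    ("s2", "pw_B"), ("s3", "pw_B"),
]
remaining = drop_members_of("pw_A", memberships)
```

Note that a member covered by the top pathway is dropped from every other pathway as well, which is what lets the next round of selection count only still-uncovered members.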

napistu.source.create_source_table(lookup_table: Series, table_schema: dict, pw_index: PWIndex | None) → DataFrame

Create Source Table

Create a table with one row per “new_id” and a Source object created from the union of “old_id” Source objects

Parameters:

lookup_table (pd.Series) – a pd.Series containing the index of the table to create a source table for
table_schema (dict) – a dictionary containing the schema of the table to create a source table for
pw_index (PWIndex or None) – a pathway index object containing the pathway_id and other metadata

Returns:

source_table – a pd.DataFrame containing the index of the table to create a source table for with one row per “new_id” and a Source object created from the union of “old_id” Source objects

Return type:

pd.DataFrame

Raises:

ValueError – if SOURCE_SPEC.DATA_SOURCE is not present in table_schema
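
The "one row per new_id, union of old_id sources" idea can be sketched with plain dictionaries. This assumes the lookup maps old IDs to new IDs, which is inferred from the description above; `build_source_table` and all identifiers are hypothetical, and no napistu or pandas objects are involved.

```python
def build_source_table(lookup, old_sources):
    """Build new_id -> list of source rows from the union of old_id sources.

    lookup: dict mapping old_id -> new_id
    old_sources: dict mapping old_id -> list of source rows (dicts)
    """
    table = {}
    for old_id, new_id in lookup.items():
        rows = table.setdefault(new_id, [])
        for row in old_sources.get(old_id, []):
            if row not in rows:  # union: skip rows already recorded
                rows.append(row)
    return table


lookup = {"old_1": "new_A", "old_2": "new_A", "old_3": "new_B"}
old_sources = {
    "old_1": [{"model": "model_x", "pathway_id": "pw_1"}],
    "old_2": [{"model": "model_y", "pathway_id": "pw_2"}],
    "old_3": [{"model": "model_x", "pathway_id": "pw_1"}],
}
source_table = build_source_table(lookup, old_sources)
```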

napistu.source.merge_sources(source_list: list | Series) → Source

Merge Sources

Merge a list of Source objects into a single Source object

Parameters: source_list (list | pd.Series) – a list of Source objects or a pd.Series of Source objects
Returns: source – a Source object created from the union of the Source objects in source_list
Return type: Source
Raises: TypeError – if source_list is not a list or pd.Series
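
As a rough illustration of the union, the sketch below merges lists of source rows and drops exact duplicates. It is not napistu's code: `merge_rows` is a hypothetical helper, each Source is stood in for by a list of dicts (or None for an empty Source), and whether napistu removes duplicates in this way is an assumption.

```python
def merge_rows(source_list):
    """Merge several lists of source rows into one list without duplicates."""
    if not isinstance(source_list, list):
        raise TypeError("source_list must be a list")
    merged, seen = [], set()
    for rows in source_list:
        if rows is None:  # stand-in for an empty Source
            continue
        for row in rows:
            # Dict keys are unique, so sorting the items never compares values
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


a = {"model": "model_x", "pathway_id": "pw_1"}
b = {"model": "model_y", "pathway_id": "pw_2"}
c = {"model": "model_z", "pathway_id": "pw_3"}
merged = merge_rows([[a, b], None, [b, c]])
```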

napistu.source.source_set_coverage(select_sources_df: pd.DataFrame, source_total_counts: pd.Series | pd.DataFrame | None = None, sbml_dfs: SBML_dfs | None = None, min_pw_size: int = 3, verbose: bool = False) → pd.DataFrame

Greedy Set Coverage of Sources

Find the set of pathways covering select_sources_df. If source_total_counts is provided, pathways are selected iteratively based on statistical enrichment. Otherwise, the largest pathways are chosen iteratively.

Parameters:

select_sources_df (pd.DataFrame) – a pd.DataFrame containing the index of source_table, expanded to include one row per source, as produced by source.unnest_sources()
source_total_counts (pd.Series | pd.DataFrame, optional) – a pd.Series containing the total counts of each source, as produced by source.get_source_total_counts(), or a pd.DataFrame with two columns: pathway_id and total_counts
sbml_dfs (SBML_dfs) – if source_total_counts is provided then sbml_dfs must be provided to calculate the total number of entities in the table.
min_pw_size (int) – the minimum size of a pathway to be considered
verbose (bool) – Whether to print verbose output

Returns:

minimal_sources – A list of pathway_ids of the minimal source set

Return type:

[str]
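
The size-based variant of the greedy selection can be sketched in plain Python. This is a simplified illustration, not napistu's implementation: `greedy_cover_by_size` is hypothetical, pathways are given as sets of member IDs, ties are broken alphabetically, and applying `min_pw_size` to the count of still-uncovered members is an assumption. The enrichment-based variant is not shown.

```python
def greedy_cover_by_size(pathway_members, min_pw_size=3):
    """Repeatedly pick the pathway covering the most uncovered members."""
    # Copy so the caller's sets are not mutated
    unaccounted = {pw: set(members) for pw, members in pathway_members.items()}
    selected = []
    while True:
        candidates = [
            (len(members), pw)
            for pw, members in unaccounted.items()
            if len(members) >= min_pw_size
        ]
        if not candidates:
            break
        top_size = max(size for size, _ in candidates)
        # Break ties alphabetically so the result is deterministic
        top_pathway = min(pw for size, pw in candidates if size == top_size)
        selected.append(top_pathway)
        covered = unaccounted.pop(top_pathway)
        # Members covered by the chosen pathway no longer count for the others
        for members in unaccounted.values():
            members -= covered
    return selected


pathway_members = {
    "pw_A": {"s1", "s2", "s3", "s4"},
    "pw_B": {"s3", "s4", "s5", "s6", "s7"},
    "pw_C": {"s1", "s2"},
}
minimal = greedy_cover_by_size(pathway_members, min_pw_size=2)
```

In this toy input `pw_B` is picked first as the largest pathway, after which `pw_A` and `pw_C` each cover the same two remaining members, so only one of them is needed.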

napistu.source.unnest_sources(source_table: DataFrame) → DataFrame

Unnest Sources - Optimized Version

Take a pd.DataFrame containing an array of Sources and return one row per source.

Parameters:

source_table (pd.DataFrame) – a table containing an array of Sources

Returns:

a pd.DataFrame containing the index of source_table, expanded to include one row per source
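
The reshaping can be pictured with a small dependency-free sketch. It assumes each entity maps to a list of source rows (or None when it has no source); `unnest` and the identifiers are hypothetical stand-ins for the pandas-based function documented above.

```python
def unnest(source_table):
    """Expand entity_id -> [source rows] into one flat row per source."""
    rows = []
    for entity_id, sources in source_table.items():
        for source in sources or []:  # entities without sources yield no rows
            rows.append({"entity_id": entity_id, **source})
    return rows


source_table = {
    "S1": [
        {"model": "m1", "pathway_id": "p1"},
        {"model": "m2", "pathway_id": "p2"},
    ],
    "S2": None,
    "S3": [{"model": "m1", "pathway_id": "p1"}],
}
flat = unnest(source_table)
```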