napistu.source

The Source class for tracking the model(s) an entity (i.e., a compartment, species, reaction) came from.

Classes

Source

A class for tracking the model(s) an entity (i.e., a compartment, species, reaction) came from.

Functions

create_source_table(lookup_table, ...)

Create Source Table

merge_sources(source_list)

Merge Sources

source_set_coverage(select_sources_df[, ...])

Greedy Set Coverage of Sources

unnest_sources(source_table)

Unnest Sources - Optimized Version

Classes

Source(source_df[, pw_index])

An Entity's Source

class napistu.source.Source(source_df: DataFrame, pw_index: PWIndex | None = None)

Bases: object

An Entity’s Source

source

A dataframe containing the model source and other optional variables

Type:

pd.DataFrame

empty() : classmethod

Create an empty Source object

single_entry(model, pathway_id, \*\*kwargs) : classmethod

Create a Source object with a single entry

validate_single_source() : bool

Check whether the Source object contains exactly 1 entry

_validate_source_df(source_df) : None

Validate that source_df is a pandas DataFrame with required columns

_validate_pathway_index(pw_index, source_df) : None

Validate pathway index and check for missing pathways

_process_source_df(source_df, pw_index) : pd.DataFrame

Process source DataFrame by merging with pathway index if provided

_validate_final_df(df, required_columns) : None

Validate that the final DataFrame has all required columns

classmethod empty() Source

Create an empty Source object.

This is typically used when creating an SBML_dfs object from a single source.

Returns:

An empty Source instance with source attribute set to None

Return type:

Source

classmethod single_entry(model: str, pathway_id: str | None = None, file: str | None = None, data_source: str | None = None, organismal_species: str | None = None, name: str | None = None, date: str | None = None) Source

Create a Source object with a single entry.

Convenience method for creating a Source with one row containing the core attributes from the pathway index schema.

Parameters:
  • model (str) – The model identifier (required)

  • pathway_id (str, optional) – The pathway identifier. Defaults to the value of model if not provided

  • file (str, optional) – Source file path or identifier

  • data_source (str, optional) – Source database or origin (e.g., ‘Reactome’, ‘KEGG’)

  • organismal_species (str, optional) – Species the pathway is from

  • name (str, optional) – Human-readable pathway/model name

  • date (str, optional) – Date of pathway/model creation or last update

Returns:

A Source instance with a single-row DataFrame

Return type:

Source

Examples

>>> source = Source.single_entry(
...     model="R-HSA-123",
...     name="Glycolysis",
...     data_source="Reactome",
...     organismal_species="Homo sapiens"
... )

__init__(source_df: DataFrame, pw_index: PWIndex | None = None) None

Tracks the model(s) an entity (i.e., a compartment, species, reaction) came from.

By convention sources exist only for the models that an entity came from rather than the current model they are part of. For example, when combining Reactome models into a consensus, a molecule which existed in multiple models would have a source entry for each, but it would not have a source entry for the consensus model itself.

Parameters:
  • source_df (pd.DataFrame) – A dataframe containing the model source and other optional variables

  • pw_index (PWIndex, optional) – A pathway index object containing the pathway_id and other metadata

Return type:

None

Raises:
  • ValueError – If pw_index is not a PWIndex

  • ValueError – If required columns are not present in source_df

  • TypeError – If source_df is not a pd.DataFrame

_process_source_df(source_df: DataFrame, pw_index: PWIndex | None) DataFrame

Process source DataFrame by merging with pathway index if provided.

_validate_final_df(df: DataFrame, required_columns: list[str]) None

Validate that the final DataFrame has all required columns.

_validate_pathway_index(pw_index: PWIndex, source_df: DataFrame) None

Validate pathway index and check for missing pathways.

_validate_source_df(source_df: DataFrame) None

Validate that source_df is a pandas DataFrame with required columns.

validate_single_source() bool

Check whether the Source object contains exactly 1 entry.

Returns:

True if the Source contains exactly one row, False otherwise

Return type:

bool

Raises:

ValueError – If the Source object is empty or contains more than one row

napistu.source._collapse_by_membership_string(membership_string: str, membership_categories: DataFrame, table_schema: dict) DataFrame

Assign each member of a membership-string to a set of pathways.

napistu.source._collapse_source_df(source_df: DataFrame | Series) Series

Collapse a source_df table into a single entry.

Combines multiple source entries into a single entry by joining values with “ OR ” separators. Handles None values by filtering them out before joining.

Parameters:

source_df (pd.DataFrame or pd.Series) – Source data to collapse. Must contain required columns MODEL and PATHWAY_ID.

Returns:

Collapsed source entry with joined values and count of collapsed pathways.

Return type:

pd.Series

Raises:
  • TypeError – If source_df is not a DataFrame or Series.

  • ValueError – If required columns MODEL or PATHWAY_ID are missing.

Notes

  • None values are filtered out before joining

  • For DataFrame input, unique values are used for DATA_SOURCE and ORGANISMAL_SPECIES

  • The N_COLLAPSED_PATHWAYS field tracks how many entries were collapsed
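The collapsing rule described above can be sketched without pandas. This is an illustrative stand-in, not napistu's implementation: the function name `collapse_entries`, the lowercase dictionary keys (including `n_collapsed_pathways`), and the use of plain dictionaries in place of DataFrame rows are all assumptions made for the example.

```python
from typing import Optional

SEPARATOR = " OR "
# Fields reduced to unique values before joining (assumed lowercase
# spellings of DATA_SOURCE and ORGANISMAL_SPECIES).
UNIQUE_FIELDS = {"data_source", "organismal_species"}


def collapse_entries(entries: list[dict[str, Optional[str]]]) -> dict[str, object]:
    """Collapse several source entries into one by joining values with ' OR '."""
    if not entries:
        raise ValueError("at least one entry is required")
    for required in ("model", "pathway_id"):
        if any(required not in entry for entry in entries):
            raise ValueError(f"missing required field: {required}")

    collapsed: dict[str, object] = {}
    # Collect every field name once, in first-seen order.
    fields = list(dict.fromkeys(key for entry in entries for key in entry))
    for field in fields:
        # None values are dropped before joining.
        values = [entry[field] for entry in entries if entry.get(field) is not None]
        if field in UNIQUE_FIELDS:
            values = list(dict.fromkeys(values))  # unique, order-preserving
        collapsed[field] = SEPARATOR.join(values) if values else None
    # Track how many entries were folded into this one.
    collapsed["n_collapsed_pathways"] = len(entries)
    return collapsed


entries = [
    {"model": "R-HSA-1", "pathway_id": "R-HSA-1", "data_source": "Reactome", "name": None},
    {"model": "R-HSA-2", "pathway_id": "R-HSA-2", "data_source": "Reactome", "name": "Glycolysis"},
]
collapsed = collapse_entries(entries)
```

In this sketch the two `model` values are joined with the separator, the repeated `data_source` appears only once, and the `None` name is ignored.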

napistu.source._deduplicate_source_df(source_df: DataFrame) DataFrame

Combine entries in a source table when multiple models have the same members.
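One way to picture this kind of deduplication is to group pathways by the exact set of members they contain. The sketch below is only an assumption about the general idea, written with plain dictionaries; the helper name `group_pathways_by_members` and the example identifiers are hypothetical and are not part of napistu.

```python
from collections import defaultdict


def group_pathways_by_members(memberships: dict[str, set[str]]) -> dict[frozenset, list[str]]:
    """Group pathway ids that share an identical set of members."""
    groups: dict[frozenset, list[str]] = defaultdict(list)
    for pathway_id, members in memberships.items():
        # frozenset makes the member set hashable so it can serve as a key.
        groups[frozenset(members)].append(pathway_id)
    return dict(groups)


memberships = {
    "pw_A": {"s1", "s2"},
    "pw_B": {"s1", "s2"},  # same members as pw_A, so the two are combined
    "pw_C": {"s3"},
}
groups = group_pathways_by_members(memberships)
# One label per distinct member set, joined in the " OR " style.
merged_labels = sorted(" OR ".join(sorted(ids)) for ids in groups.values())
```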

napistu.source._ensure_source_total_counts(source_total_counts: Series | DataFrame | None, verbose: bool = False) Series | None

napistu.source._safe_source_merge(member_Sources: Source | list) Source

Combine either a single Source or a collection of Sources (a list or pd.Series) into a single Source object.

napistu.source._select_top_pathway_by_enrichment(unaccounted_for_members: DataFrame, source_total_counts: Series, n_total_entities: int, table_pk: str, min_pw_size: int = 3) str

napistu.source._select_top_pathway_by_size(unaccounted_for_members: DataFrame, min_pw_size: int = 3) str

napistu.source._update_unaccounted_for_members(top_pathway, unaccounted_for_members) DataFrame

Update the unaccounted for members dataframe by removing the members associated with the top pathway.

Parameters:
  • top_pathway (str) – the pathway to remove from the unaccounted for members

  • unaccounted_for_members (pd.DataFrame) – the dataframe of unaccounted for members

Returns:

unaccounted_for_members – the dataframe of unaccounted for members with the top pathway removed

Return type:

pd.DataFrame

napistu.source.create_source_table(lookup_table: Series, table_schema: dict, pw_index: PWIndex | None) DataFrame

Create Source Table

Create a table with one row per “new_id” and a Source object created from the union of “old_id” Source objects

Parameters:
  • lookup_table (pd.Series) – a pd.Series containing the index of the table to create a source table for

  • table_schema (dict) – a dictionary containing the schema of the table to create a source table for

  • pw_index (PWIndex | None) – a pathway index object containing the pathway_id and other metadata

Returns:

source_table – a pd.DataFrame containing the index of the table to create a source table for with one row per “new_id” and a Source object created from the union of “old_id” Source objects

Return type:

pd.DataFrame

Raises:

ValueError – if SOURCE_SPEC.DATA_SOURCE is not present in table_schema
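The "one row per new_id, built from the union of old_id sources" idea can be illustrated with plain dictionaries. This is a simplified sketch under stated assumptions: sources are reduced to sets of pathway ids, and the helper name `union_sources_by_new_id` and all identifiers are hypothetical rather than taken from napistu.

```python
from collections import defaultdict


def union_sources_by_new_id(
    lookup: dict[str, str], old_sources: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Map each new id to the union of the pathway ids of its old ids."""
    new_sources: dict[str, set[str]] = defaultdict(set)
    for old_id, new_id in lookup.items():
        # Old ids without a recorded source contribute nothing.
        new_sources[new_id] |= old_sources.get(old_id, set())
    return {new_id: sorted(pathways) for new_id, pathways in new_sources.items()}


lookup = {"old_1": "new_A", "old_2": "new_A", "old_3": "new_B"}
old_sources = {"old_1": {"pw_1"}, "old_2": {"pw_1", "pw_2"}, "old_3": {"pw_3"}}
source_table = union_sources_by_new_id(lookup, old_sources)
```

Here `new_A` inherits the pathways of both `old_1` and `old_2`, with the shared `pw_1` counted once.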

napistu.source.merge_sources(source_list: list | Series) Source

Merge Sources

Merge a list of Source objects into a single Source object

Parameters:

source_list (list | pd.Series) – a list of Source objects or a pd.Series of Source objects

Returns:

source – a Source object created from the union of the Source objects in source_list

Return type:

Source

Raises:

TypeError – if source_list is not a list or pd.Series
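A minimal sketch of the merge semantics, assuming each Source can be stood in for by a list of row dictionaries (or None for an empty Source). The helper name `merge_rows`, the row keys, and the choice to drop duplicate (model, pathway_id) pairs are assumptions for illustration, not napistu's implementation.

```python
from typing import Optional

Row = dict[str, str]


def merge_rows(source_list: list[Optional[list[Row]]]) -> list[Row]:
    """Union the rows of several stand-in sources, keeping first occurrences."""
    if not isinstance(source_list, list):
        raise TypeError("source_list must be a list")
    merged: list[Row] = []
    seen: set[tuple[str, str]] = set()
    for rows in source_list:
        if rows is None:  # stand-in for an empty Source
            continue
        for row in rows:
            key = (row["model"], row["pathway_id"])
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


source_a = [{"model": "R-HSA-1", "pathway_id": "R-HSA-1"}]
source_b = [
    {"model": "R-HSA-1", "pathway_id": "R-HSA-1"},
    {"model": "R-HSA-2", "pathway_id": "R-HSA-2"},
]
merged = merge_rows([source_a, None, source_b])
```

The result holds one row per originating model, which matches the convention that an entity carries a source entry for each model it came from.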

napistu.source.source_set_coverage(select_sources_df: pd.DataFrame, source_total_counts: pd.Series | pd.DataFrame | None = None, sbml_dfs: SBML_dfs | None = None, min_pw_size: int = 3, verbose: bool = False) pd.DataFrame

Greedy Set Coverage of Sources

Find the set of pathways covering select_sources_df. If source_total_counts is provided, pathways will be selected iteratively based on statistical enrichment. If source_total_counts is not provided, the largest pathways will be chosen iteratively.

Parameters:
  • select_sources_df (pd.DataFrame) – pd.DataFrame containing the index of source_table but expanded to include one row per source, as produced by source.unnest_sources()

  • source_total_counts (pd.Series | pd.DataFrame) – pd.Series containing the total counts of each source, as produced by source.get_source_total_counts(), or a pd.DataFrame with two columns: pathway_id and total_counts.

  • sbml_dfs (SBML_dfs) – if source_total_counts is provided, then sbml_dfs must also be provided to calculate the total number of entities in the table.

  • min_pw_size (int) – the minimum size of a pathway to be considered

  • verbose (bool) – Whether to print verbose output

Returns:

minimal_sources – A list of pathway_ids of the minimal source set

Return type:

[str]
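The size-based branch of the greedy procedure can be sketched in plain Python. This is not napistu's implementation: the helper name `greedy_cover_by_size`, the dictionary input, the alphabetical tie-break, and the assumption that min_pw_size is applied to the members still unaccounted for are all choices made for the example. The enrichment-based branch is not shown.

```python
def greedy_cover_by_size(
    pathway_members: dict[str, set[str]], min_pw_size: int = 3
) -> list[str]:
    """Repeatedly pick the pathway covering the most unaccounted-for members."""
    unaccounted = {pw: set(members) for pw, members in pathway_members.items()}
    selected: list[str] = []
    while True:
        # Only pathways with enough remaining members are eligible.
        candidates = {pw: m for pw, m in unaccounted.items() if len(m) >= min_pw_size}
        if not candidates:
            break
        # Largest remaining pathway; ties go to the alphabetically first id.
        top = max(sorted(candidates), key=lambda pw: len(candidates[pw]))
        selected.append(top)
        covered = candidates[top]
        # Remove the covered members from every other pathway.
        unaccounted = {
            pw: members - covered for pw, members in unaccounted.items() if pw != top
        }
    return selected


pathway_members = {
    "pw_large": {"a", "b", "c", "d"},
    "pw_overlap": {"c", "d", "e"},
    "pw_small": {"f", "g", "h"},
}
selected = greedy_cover_by_size(pathway_members, min_pw_size=3)
```

After `pw_large` is chosen, `pw_overlap` has only one member left and falls below the size threshold, so `pw_small` is selected next and the loop stops.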

napistu.source.unnest_sources(source_table: DataFrame) DataFrame

Unnest Sources - Optimized Version

Take a pd.DataFrame containing an array of Sources and return one row per source.

Parameters:

source_table (pd.DataFrame) – a table containing an array of Sources

Returns:

pd.DataFrame containing the index of source_table but expanded to include one row per source
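Unnesting can be pictured as flattening a mapping from entity ids to lists of source rows. The sketch below uses plain dictionaries rather than a DataFrame of Source objects; the helper name `unnest`, the `entity_id` key, and the example identifiers are hypothetical.

```python
def unnest(source_table: dict[str, list[dict[str, str]]]) -> list[dict[str, str]]:
    """Expand each entity's nested source rows into one flat row per source."""
    rows: list[dict[str, str]] = []
    for entity_id, sources in source_table.items():
        for source in sources:
            # Carry the entity id alongside each of its source rows.
            rows.append({"entity_id": entity_id, **source})
    return rows


source_table = {
    "S001": [{"pathway_id": "pw_1"}, {"pathway_id": "pw_2"}],
    "S002": [{"pathway_id": "pw_1"}],
}
rows = unnest(source_table)
```

An entity with two sources yields two rows, so the output has one row per (entity, source) pair.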