napistu.genomics.gsea

Functions for organizing gene sets and for applying gene set enrichment analysis (GSEA) to vertices or edges.

Classes

GenesetCollection:: A collection of gene sets for a given organismal species.

Public Functions

edgelist_ora:: Test an edgelist for enrichment between source pathways and target pathways.
get_default_collection_config:: Get the default collection configuration for a given organismal species.
vertex_ora:: Test an unranked gene list for over-representation of the gene sets in a geneset collection.

Functions

`edgelist_ora`(edgelist, genesets, graph[, ...])	Test pathway edge enrichment using NEAT degree-corrected method.
`get_default_collection_config`(organismal_species)	Get the default collection configuration for a given organismal species.
`vertex_ora`(gene_list, genesets[, universe, ...])	Over-representation analysis (ORA) for an unranked gene list.

Classes

`GenesetCollection`(organismal_species)	A collection of gene sets for a given organismal species
`GmtsConfig`(*, engine, categories[, dbver])	Pydantic model for GMT (Gene Matrix Transposed) configuration.

class napistu.genomics.gsea.GenesetCollection(organismal_species: str | OrganismalSpeciesValidator)

Bases: object

A collection of gene sets for a given organismal species

Parameters:: organismal_species (Union[str, OrganismalSpeciesValidator]) – The organismal species to create a gene set collection for.

organismal_species

The organismal species to create a gene set collection for.

Type:: OrganismalSpeciesValidator

gmt

A dictionary of gene set categories to their gene sets.

Type:: Dict[str, List[str]]

gmts

A nested dictionary of gene set categories to their gene sets for each ontology.

Type:: Dict[str, Dict[str, List[str]]]

Public Methods

add_gmts: Add gene sets to the gene set collection.

get_gmt_as_df: Convert the GMT dictionary to a DataFrame format suitable for matching.

Examples

>>> geneset_collection = GenesetCollection(organismal_species="Homo sapiens")
>>> # Add the default gene set collection
>>> geneset_collection.add_gmts()
>>> # Add a custom gene set collection using string engine name
>>> geneset_collection.add_gmts(gmts_config=GmtsConfig(engine="msigdb", categories=["c5.go.bp", "c5.go.cc", "c5.go.mf"], dbver="2023.2.Hs"))
>>> # Or using a dict with string engine name (dbver is optional)
>>> geneset_collection.add_gmts(gmts_config={"engine": "msigdb", "categories": ["c5.go.bp"]})

__init__(organismal_species: str | OrganismalSpeciesValidator)

_create_deep_to_shallow_lookup()

Create a lookup from deep gene set categories to shallow gene set categories.

If there is only one ontology, the lookup is simply the gene set names. If there are multiple ontologies, the lookup is a concatenation of the ontology name and the gene set name.

Returns:: A DataFrame with the deep gene set names and the shallow gene set names.
Return type:: pd.DataFrame

_create_gmt()

Create a GMT dictionary from the gmts dictionary.

Returns:: A dictionary of shallow gene set names to their gene sets.
Return type:: Dict[str, List[str]]
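
The naming rule described for the deep-to-shallow lookup and the flattening done by _create_gmt can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the function name flatten_gmts, the "_" separator, and the toy identifiers are assumptions.

```python
from typing import Dict, List


def flatten_gmts(
    gmts: Dict[str, Dict[str, List[str]]], sep: str = "_"
) -> Dict[str, List[str]]:
    """Collapse {ontology: {gene_set: genes}} into {shallow_name: genes}.

    Hypothetical helper: the separator and prefix order are assumptions.
    """
    single_ontology = len(gmts) == 1
    gmt: Dict[str, List[str]] = {}
    for ontology, gene_sets in gmts.items():
        for name, genes in gene_sets.items():
            # With one ontology the gene set name is used as-is;
            # with several, the ontology name is prepended.
            shallow = name if single_ontology else f"{ontology}{sep}{name}"
            gmt[shallow] = list(genes)
    return gmt


nested = {
    "h.all": {"HALLMARK_HYPOXIA": ["3091", "7422"]},
    "c2.cp.kegg": {"KEGG_GLYCOLYSIS": ["2597", "5230"]},
}
flat = flatten_gmts(nested)
```

With a single ontology the keys of the result are the original gene set names; with two or more, each key carries its ontology as a prefix.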

_format_gmts_config(gmts_config: Dict[str, Any] | GmtsConfig | str | None = None) → GmtsConfig

Format a gmts config into a GmtsConfig object.

Parameters:: gmts_config (Optional[Union[Dict[str, Any], GmtsConfig, str]]) – The gmts config to format. Can be:
- A string name from GENESET_DEFAULT_CONFIG_NAMES (e.g., “hallmarks”, “bp_kegg_hallmarks”, “wikipathways”)
- A dict with engine, categories, and optionally dbver
- A GmtsConfig object
- None: uses the default collection config for the organismal species from GENESET_DEFAULT_BY_SPECIES
If a dict is provided, the engine can be specified as a string (e.g., “msigdb”) or as a callable class.
Returns:: The formatted gmts config.
Return type:: GmtsConfig

add_gmts(gmts_config: Dict[str, Any] | GmtsConfig | str | None = None, entrez: bool = True)

Add gene sets to the gene set collection.

Parameters:

gmts_config (Union[Dict[str, Any], GmtsConfig, str, None]) – The configuration for the gene set collection. Can be:
- A string name from GENESET_DEFAULT_CONFIG_NAMES (e.g., “hallmarks”, “bp_kegg_hallmarks”, “wikipathways”)
- A dict with engine, categories, and optionally dbver
- A GmtsConfig object
- None: uses the default collection config for the organismal species from GENESET_DEFAULT_BY_SPECIES
The engine can be specified as a string (e.g., “msigdb”) or as a callable class.
entrez (bool) – Whether to use Entrez gene IDs (True) or gene symbols (False).

get_gmt_as_df() → DataFrame

Convert the GMT dictionary to a DataFrame format suitable for matching.

Returns:: A DataFrame with two columns:
- “gene_set”: The gene set name
- “identifier”: The identifier (e.g., Entrez ID) for each gene in the set
Return type:: pd.DataFrame

Examples

>>> collection = GenesetCollection(organismal_species="Homo sapiens")
>>> collection.add_gmts()
>>> gmt_df = collection.get_gmt_as_df()
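
The long format returned here can be pictured as one row per (gene set, identifier) pair. The sketch below builds those rows from a toy dictionary using only the standard library; gmt_to_records and the toy data are hypothetical and are not part of napistu.

```python
from typing import Dict, List


def gmt_to_records(gmt: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Expand {gene_set: [identifiers]} into one record per membership."""
    return [
        {"gene_set": gene_set, "identifier": identifier}
        for gene_set, identifiers in gmt.items()
        for identifier in identifiers
    ]


toy_gmt = {"SET_A": ["1", "2"], "SET_B": ["2"]}
records = gmt_to_records(toy_gmt)
# Passing `records` to pandas.DataFrame would give the two-column table.
```

A gene that belongs to several gene sets appears once per set, which is what makes the table convenient for joins against other identifier tables.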

get_gmt_w_napistu_ids(species_identifiers: DataFrame, id_type: str = 's_id', bqb_terms: str | List[str] = ['BQB_IS', 'BQB_IS_HOMOLOG_TO', 'BQB_IS_ENCODED_BY', 'BQB_ENCODES']) → DataFrame

Get the gene set collection with Napistu molecular species IDs.

Parameters:

species_identifiers (pd.DataFrame) – A DataFrame with the species identifiers. Either obtained from sbml_dfs.get_characteristic_species_ids() or loaded from a TSV distributed in a Napistu GCS tarball. To map to compartmentalized species IDs, use identifiers.construct_cspecies_identifiers() to add the sc_id column.
id_type (str) – The type of identifier to use. Must be one of {SBML_DFS.S_ID, SBML_DFS.SC_ID}. If using sc_id, the species_identifiers table must first be updated to include the sc_id column.
bqb_terms (Union[str, List[str]]) – The BQB terms to use to filter the species identifiers. Defaults to BQB_DEFINING_ATTRS_LOOSE (BQB.IS, BQB.IS_HOMOLOG_TO, BQB.IS_ENCODED_BY, BQB.ENCODES)

Returns:

A dictionary of gene set names to their Napistu molecular species IDs.

Return type:

Dict[str, List[str]]

class napistu.genomics.gsea.GmtsConfig(*, engine: str | Callable, categories: List[str], dbver: str | None = None)

Bases: BaseModel

Pydantic model for GMT (Gene Matrix Transposed) configuration.

This class validates the configuration used for gene set collections, including the engine, categories, and database version.

Parameters:

engine (Union[str, Callable]) – The gene set engine class (e.g., MsigDB from gseapy) or a string name (e.g., “msigdb”). Supported string names: “msigdb”.
categories (List[str]) – List of gene set categories to use (e.g., [“h.all”, “c2.cp.kegg”]).
dbver (Optional[str]) – Database version string (e.g., “2023.2.Hs”). If None, the engine’s default version will be used.

Examples

>>> # Using string engine name (recommended)
>>> config = GmtsConfig(
...     engine="msigdb",
...     categories=["h.all", "c2.cp.kegg", "c5.go.bp"],
...     dbver="2023.2.Hs"
... )
>>> # Using callable engine class (also supported)
>>> config = GmtsConfig(
...     engine=gp.msigdb.Msigdb,
...     categories=["h.all", "c2.cp.kegg", "c5.go.bp"],
...     dbver="2023.2.Hs"
... )
>>> # dbver is optional
>>> config = GmtsConfig(
...     engine="msigdb",
...     categories=["h.all", "c2.cp.kegg", "c5.go.bp"]
... )

categories: List[str]

dbver: str | None

engine: str | Callable

model_config = {}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

napistu.genomics.gsea._calculate_enrichment_statistics(enrichment_test: str, edge_counts_df: DataFrame, edgelist_size: int, universe_size: int) → tuple[ndarray, ndarray]

Calculate enrichment statistics using the specified test method.

Parameters:

enrichment_test (str) – Test method: “proportion”, “fisher_exact”, or “binomial”
edge_counts_df (pd.DataFrame) – DataFrame with observed_edges and universe_edges columns
edgelist_size (int) – Total number of edges in the observed edgelist
universe_size (int) – Total number of edges in the universe

Returns:

Tuple of (odds_ratios, p_values) arrays

Return type:

tuple[np.ndarray, np.ndarray]
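
For a single geneset pair, the “fisher_exact” option can be illustrated with a one-tailed hypergeometric calculation. This is a minimal standard-library sketch under two assumptions that the text above does not confirm: the observed edgelist is a subset of the universe, and the 2x2 table is laid out as shown in the comments. The module's own implementation is vectorized and may differ in detail; fisher_upper_tail is a hypothetical name.

```python
from math import comb


def fisher_upper_tail(
    observed_edges: int, universe_edges: int, edgelist_size: int, universe_size: int
) -> tuple:
    """Odds ratio and upper-tail p-value for one geneset pair.

    Assumed 2x2 table (rows: in edgelist / not, cols: between pair / not):
        a = observed_edges          b = edgelist_size - a
        c = universe_edges - a      d = universe_size - edgelist_size - c
    """
    a = observed_edges
    b = edgelist_size - a
    c = universe_edges - a
    d = universe_size - edgelist_size - c
    if min(a, b, c, d) < 0:
        raise ValueError("counts are inconsistent with a nested edgelist/universe")

    # Conventions for zero cells vary; infinity is used here for simplicity.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")

    # P(X >= a) for X ~ Hypergeometric(universe_size, universe_edges, edgelist_size)
    upper = min(universe_edges, edgelist_size)
    numerator = sum(
        comb(universe_edges, k)
        * comb(universe_size - universe_edges, edgelist_size - k)
        for k in range(a, upper + 1)
    )
    p_value = numerator / comb(universe_size, edgelist_size)
    return odds_ratio, p_value


odds_ratio, p_value = fisher_upper_tail(
    observed_edges=2, universe_edges=4, edgelist_size=3, universe_size=10
)
```

In this toy case 2 of the 3 observed edges fall between the pair while only 4 of the 10 possible edges do, so the upper tail sums the probabilities of seeing 2 or 3 such edges.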

napistu.genomics.gsea._calculate_geneset_edge_counts(edgelist: DataFrame, genesets: Dict[str, List[str]], universe: Graph, min_set_size: int = 5, max_set_size: int | None = None, chunk_size: int = 10) → DataFrame

Calculate edge counts between all geneset pairs in both observed edgelist and universe.

Parameters:

edgelist (pd.DataFrame) – Edgelist with ‘from’ and ‘to’ columns containing vertex names. These are the actual edges to count.
genesets (Dict[str, List[str]]) – Dictionary mapping geneset names to lists of vertex names
universe (igraph.Graph) – Universe graph defining the possible edges (used for filtering genesets to valid vertices)
min_set_size (int) – Minimum number of genes in universe for a geneset to be included
max_set_size (int, optional) – Maximum number of genes in universe for a geneset to be included
chunk_size (int) – Number of target genesets to process at once. Set to np.inf to process all at once.

Returns:

Columns: source_geneset, target_geneset, observed_edges, universe_edges, n_genes_source, n_genes_target. One row per geneset pair (upper triangle only if undirected).

Return type:

pd.DataFrame

napistu.genomics.gsea._count_edges_by_geneset_pair_chunked(edgelist_df: DataFrame, geneset_df: DataFrame, count_column_name: str, source_col: str, target_col: str, chunk_size: int = 10) → DataFrame

Count edges between geneset pairs, processing target genesets in chunks to limit memory.

Parameters:

edgelist_df (pd.DataFrame) – Edgelist with source and target columns
geneset_df (pd.DataFrame) – Long format with columns: geneset, vertex_name
count_column_name (str) – Name for the count column in output
source_col (str) – Name of the source column in edgelist_df
target_col (str) – Name of the target column in edgelist_df
chunk_size (int) – Number of target genesets to process at once

Returns:

Columns: source_geneset, target_geneset, {count_column_name}

Return type:

pd.DataFrame
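
The chunking idea can be sketched without pandas: target genesets are processed a few at a time so that only the pair counts for the current chunk are expanded at once. The function below is a hypothetical illustration that treats edges as directed (source, target) tuples; it is not the library's implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def count_edges_by_geneset_pair(
    edges: Iterable[Tuple[str, str]],
    genesets: Dict[str, List[str]],
    chunk_size: int = 10,
) -> Counter:
    """Count edges for every (source_geneset, target_geneset) pair."""
    edges = list(edges)

    # Map each vertex to the genesets that contain it.
    membership = defaultdict(set)
    for name, vertices in genesets.items():
        for vertex in vertices:
            membership[vertex].add(name)

    counts: Counter = Counter()
    names = sorted(genesets)
    for start in range(0, len(names), chunk_size):
        # Only target genesets in this chunk are considered in this pass.
        chunk = set(names[start:start + chunk_size])
        for source, target in edges:
            targets_in_chunk = membership.get(target, set()) & chunk
            if not targets_in_chunk:
                continue
            for source_set in membership.get(source, ()):
                for target_set in targets_in_chunk:
                    counts[(source_set, target_set)] += 1
    return counts


toy_genesets = {"P1": ["A", "B"], "P2": ["C", "D"]}
toy_edges = [("A", "C"), ("B", "D"), ("A", "B")]
pair_counts = count_edges_by_geneset_pair(toy_edges, toy_genesets, chunk_size=1)
```

The chunk size changes how much work is done per pass, not the result: every target geneset is visited exactly once across the chunks.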

napistu.genomics.gsea._filter_genesets_to_universe(universe: Graph, genesets: Dict[str, List[str]], min_set_size: int = 5, max_set_size: int | None = None) → Tuple[Dict[str, List[str]], DataFrame]

Filter genesets to universe vertices and create membership dataframe.

Parameters:

universe (igraph.Graph) – Universe graph with ‘name’ attribute on vertices
genesets (Dict[str, List[str]]) – Dictionary mapping geneset names to lists of vertex names
min_set_size (int) – Minimum number of genes in universe for inclusion
max_set_size (int, optional) – Maximum number of genes in universe for inclusion

Returns:

filtered_genesets (Dict[str, List[str]]) – Geneset name -> list of vertex names in universe
geneset_df (pd.DataFrame) – Long format with columns: geneset, vertex_name. Each row is one gene in one geneset.
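
The filtering step can be sketched as follows. To keep the sketch free of dependencies it takes a plain list of vertex names instead of an igraph.Graph (with igraph, graph.vs["name"] would supply that list) and returns membership rows as tuples rather than a DataFrame; both choices, and the function name, are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def filter_genesets_to_universe(
    universe_names: Iterable[str],
    genesets: Dict[str, List[str]],
    min_set_size: int = 5,
    max_set_size: Optional[int] = None,
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str]]]:
    """Keep only members present in the universe, then apply size limits."""
    universe = set(universe_names)
    filtered: Dict[str, List[str]] = {}
    for name, members in genesets.items():
        # dict.fromkeys removes duplicate members while preserving order.
        kept = [m for m in dict.fromkeys(members) if m in universe]
        if len(kept) < min_set_size:
            continue
        if max_set_size is not None and len(kept) > max_set_size:
            continue
        filtered[name] = kept

    # Long format: one (geneset, vertex_name) row per membership.
    rows = [(name, vertex) for name, kept in filtered.items() for vertex in kept]
    return filtered, rows


filtered_sets, membership_rows = filter_genesets_to_universe(
    ["A", "B", "C"],
    {"S1": ["A", "B", "Z"], "S2": ["C", "Y"]},
    min_set_size=2,
)
```

Sizes are measured after intersecting with the universe, so a geneset that is large on paper can still be dropped if few of its members are vertices in the graph.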

napistu.genomics.gsea._get_engine_from_string(engine_name: str) → Any

Convert a string engine name to the corresponding gseapy engine class.

Parameters:: engine_name (str) – The engine name (e.g., “msigdb”).
Returns:: The engine class (e.g., gp.msigdb.Msigdb).
Return type:: Any
Raises:: ValueError – If the engine name is not recognized.

Examples

>>> engine = _get_engine_from_string("msigdb")
>>> engine
<class 'gseapy.msigdb.Msigdb'>

napistu.genomics.gsea._log_edgelist_ora_input(verbose: bool, graph: Graph, edgelist: Edgelist, genesets_dict: Dict[str, List[str]])

napistu.genomics.gsea._log_edgelist_ora_paired_counts(verbose: bool, edge_counts_df: DataFrame, min_set_size: int, max_set_size: int | None)

napistu.genomics.gsea._log_edgelist_ora_paired_results(verbose: bool, results_df: DataFrame)

napistu.genomics.gsea._log_edgelist_ora_universe(verbose: bool, universe: Graph)

napistu.genomics.gsea._resolve_edgelist(graph: Graph, edgelist: Edgelist) → Edgelist

napistu.genomics.gsea._validate_edgelist_universe(edgelist, universe)

napistu.genomics.gsea.edgelist_ora(edgelist: DataFrame | Edgelist, genesets: GenesetCollection | Dict[str, List[str]], graph: Graph, enrichment_test: str = 'fisher_exact', universe_vertex_names: List[str] | Series | None = None, universe_edgelist: DataFrame | None = None, universe_observed_only: bool = False, universe_edge_filter_logic: str = 'and', include_self_edges: bool = False, min_set_size: int = 5, max_set_size: int | None = None, min_x_geneset_edges_possible: int = 5, chunk_size: int = 10, verbose: bool = True) → DataFrame

Test pathway edge enrichment using NEAT degree-corrected method.

Performs gene set edge enrichment analysis to identify pairs of pathways with more edges between them than expected by chance. The default test is Fisher’s exact test; see enrichment_test for the alternatives.

Parameters:

edgelist (Union[pd.DataFrame, Edgelist]) – Edgelist with ‘from’ and ‘to’ columns containing vertex names. These are the edges to test for enrichment.
genesets (GenesetCollection or Dict[str, List[str]]) – Gene sets to test. Either a GenesetCollection object or a dictionary mapping geneset names to lists of gene names.
graph (ig.Graph) – Source network graph
enrichment_test (str) – The enrichment test to use. Must be one of “fisher_exact”, “proportion”, or “binomial”. Default is “fisher_exact”.
- “fisher_exact”: Uses a Fisher’s exact test to test for enrichment.
- “proportion”: Uses a proportion test to test for enrichment.
- “binomial”: Uses a binomial test to test for enrichment.
universe_vertex_names (list of str or pd.Series, optional) – Vertex names to include in universe. If None, filter to the vertices present in at least one geneset.
universe_edgelist (pd.DataFrame, optional) – Edgelist defining possible edges in universe. If None and universe_observed_only=False, creates complete graph.
universe_observed_only (bool) – If True, universe includes only observed edges from graph.
universe_edge_filter_logic (str) – How to combine universe_edgelist and universe_observed_only: ‘and’ or ‘or’
include_self_edges (bool) – Whether to include self-edges in universe
min_set_size (int) – Minimum geneset size (after filtering to universe)
max_set_size (int, optional) – Maximum geneset size (after filtering to universe)
min_x_geneset_edges_possible (int) – Minimum number of possible edges in universe between geneset pairs to include in results. If there are only a small number of possible edges, seeing 1 would be statistically surprising but not meaningful. Default is 5.
chunk_size (int) – Number of target genesets to process at once. Only used when counting edges between genesets in the universe.
verbose (bool) – If True, print progress information

Returns:

Enrichment results with columns:
- source_geneset, target_geneset: Pathway names
- n_genes_source, n_genes_target: Pathway sizes in universe
- observed_edges: Number of observed edges between pathways
- p_value: One-tailed p-value (upper tail)
- q_value: FDR-corrected p-value (Benjamini-Hochberg)

Return type:

pd.DataFrame
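
The q_value column is described as a Benjamini-Hochberg correction of p_value. A minimal standard-library sketch of that procedure is shown below for orientation; the function name is hypothetical, and the module may rely on a library routine instead.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone in the original p-values.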

Examples

>>> # Test enrichment in full network
>>> observed = pd.DataFrame({
...     'from': ['A', 'B', 'C'],
...     'to': ['B', 'C', 'D']
... })
>>> results = edgelist_ora(
...     observed, genesets, graph
... )

>>> # Test with gene-only universe
>>> gene_names = [v['name'] for v in graph.vs if v.attributes().get('biotype') == 'gene']
>>> results = edgelist_ora(
...     observed, genesets, graph,
...     universe_vertex_names=gene_names
... )

>>> # Test with observed edges only in universe
>>> results = edgelist_ora(
...     observed, genesets, graph,
...     universe_observed_only=True
... )

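napistu.genomics.gsea.get_default_collection_config(organismal_species: str | OrganismalSpeciesValidator) → GmtsConfig

Get the default collection configuration for a given organismal species.

napistu.genomics.gsea.vertex_ora(gene_list, genesets[, universe, ...])

Over-representation analysis (ORA) for an unranked gene list.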

Tests whether each gene set is over-represented in the query gene list relative to a background universe, using the same vectorized Fisher exact test as edgelist_ora. Appropriate for labeling clusters or any unranked membership question; for ranked lists (e.g. DE results) use a rank-based method instead.

Parameters:

gene_list (list or pd.Series) – Genes of interest (e.g. members of one cluster). Must be in the same ID space as the gene sets (Entrez IDs by default, or Napistu s_ids if using GenesetCollection.get_gmt_w_napistu_ids()).
genesets (GenesetCollection or Dict[str, List[str]]) – Gene sets to test. Either a GenesetCollection object (uses .gmt) or a dict mapping gene set names to lists of member IDs.
universe (list or pd.Series, optional) – All testable genes (e.g. all vertices in your graph). The query gene_list is intersected with this universe before testing. If None, defaults to the union of all gene set members in genesets.
min_set_size (int) – Minimum gene set size after intersection with universe. Default 20.
max_set_size (int, optional) – Maximum gene set size after intersection with universe. Default None.

Returns:

One row per gene set passing size filters, sorted by p_value ascending. Columns:
- term: gene set name
- overlap: “k/M” string (overlap count / set size in universe)
- odds_ratio: odds ratio
- p_value: one-tailed Fisher exact p-value (upper tail, enrichment)
- q_value: BH-corrected FDR

Return type:

pd.DataFrame

Examples

>>> gc = GenesetCollection("Homo sapiens")
>>> gc.add_gmts("bp_kegg_hallmarks")
>>> universe = vertex_df["entrez_id"].astype(str).tolist()
>>> results = vertex_ora(
...     gene_list=cluster_df["entrez_id"].astype(str).tolist(),
...     genesets=gc,
...     universe=universe,
... )

>>> # Using Napistu s_ids instead of Entrez
>>> gmt_sids = gc.get_gmt_w_napistu_ids(species_identifiers)
>>> results = vertex_ora(
...     gene_list=cluster_df["s_id"].tolist(),
...     genesets=gmt_sids,
...     universe=vertex_df["s_id"].tolist(),
... )
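
The universe handling and the “k/M” overlap string described above can be sketched as follows. This illustrates only the bookkeeping (default universe, intersection, size filters), not the statistical test; ora_overlaps is a hypothetical helper and the toy identifiers are made up.

```python
from typing import Dict, Iterable, List, Optional


def ora_overlaps(
    gene_list: Iterable[str],
    genesets: Dict[str, List[str]],
    universe: Optional[Iterable[str]] = None,
    min_set_size: int = 20,
    max_set_size: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Compute the overlap string for each gene set passing the size filters."""
    if universe is None:
        # Default universe: union of all gene set members.
        universe_set = set().union(*genesets.values()) if genesets else set()
    else:
        universe_set = set(universe)

    # The query is restricted to the universe before any counting.
    query = set(gene_list) & universe_set

    rows = []
    for term, members in genesets.items():
        in_universe = set(members) & universe_set
        size = len(in_universe)
        if size < min_set_size:
            continue
        if max_set_size is not None and size > max_set_size:
            continue
        k = len(query & in_universe)
        rows.append({"term": term, "overlap": f"{k}/{size}"})
    return rows


toy_rows = ora_overlaps(
    gene_list=["1", "3", "9"],
    genesets={"T1": ["1", "2", "3"], "T2": ["3", "4"]},
    min_set_size=2,
)
```

Gene "9" is absent from every gene set, so with the default universe it is dropped from the query before overlaps are counted.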