napistu.ontologies.mygene

Functions

create_python_mapping_tables(mappings[, ...])

Create genome-wide mapping tables between Entrez and other gene identifiers.

unnest_mygene_ontology(df, field)

Unnest a column containing a list of dicts in MyGene.info results.

napistu.ontologies.mygene._create_mygene_mapping_tables(mygene_results_df: DataFrame, mygene_fields: Set[str]) → Dict[str, DataFrame]

Create mapping tables from MyGene.info query results.

Parameters:
  • mygene_results_df (pd.DataFrame) – DataFrame containing MyGene.info query results

  • mygene_fields (Set[str]) – Set of MyGene.info fields that were queried

Returns:

Dictionary mapping ontology names to DataFrames containing identifier mappings

Return type:

Dict[str, pd.DataFrame]

napistu.ontologies.mygene._fetch_mygene_data(mg: MyGeneInfo, query: str, taxa_id: int, fields: List[str], test_mode: bool = False) → DataFrame

Fetch gene data from MyGene.info for a single query.

Parameters:
  • mg (mygene.MyGeneInfo) – Initialized MyGene.info client

  • query (str) – Query string to search for genes

  • taxa_id (int) – NCBI taxonomy ID for the species

  • fields (List[str]) – List of MyGene.info fields to retrieve

  • test_mode (bool, default False) – If True, fetch only the first 1000 genes

Returns:

DataFrame containing gene data from the query

Return type:

pd.DataFrame

Raises:

ValueError – If query results are not in expected format
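A single-query fetch of this kind could be sketched as follows. This is an illustration under stated assumptions, not napistu's actual implementation: `fetch_records_sketch` is a hypothetical name, it returns plain records rather than a DataFrame, and the only client behavior it relies on is the mygene package's `query` method with its `species`, `fields`, `size`, and `fetch_all` arguments.

```python
from typing import Any, Dict, List


def fetch_records_sketch(
    mg: Any,
    query: str,
    taxa_id: int,
    fields: List[str],
    test_mode: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: run one query and return the hits as a list of dicts."""
    field_str = ",".join(fields)
    if test_mode:
        # Assumption: test mode is a single page capped at 1000 hits.
        result = mg.query(query, species=taxa_id, fields=field_str, size=1000)
        if not isinstance(result, dict) or "hits" not in result:
            raise ValueError("Query results are not in the expected format")
        return list(result["hits"])
    # With fetch_all=True the mygene client yields every hit through an iterator.
    return list(mg.query(query, species=taxa_id, fields=field_str, fetch_all=True))
```

With a real client, `mg` would be `mygene.MyGeneInfo()`; any object exposing a compatible `query` method works for offline testing.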

napistu.ontologies.mygene._fetch_mygene_data_all_queries(mg: MyGeneInfo, taxa_id: int, fields: List[str], query_strategies: List[str] = ['type_of_gene:protein-coding', 'type_of_gene:ncrna'], test_mode: bool = False) → DataFrame

Fetch comprehensive gene data from MyGene using multiple query strategies.

Parameters:
  • mg (mygene.MyGeneInfo) – Initialized MyGene.info client

  • taxa_id (int) – NCBI taxonomy ID for the species

  • fields (List[str]) – List of MyGene.info fields to retrieve

  • query_strategies (List[str], default MYGENE_DEFAULT_QUERIES) – List of query strategies to use from MYGENE_QUERY_DEFS_LIST

  • test_mode (bool, default False) – If True, fetch only the first 1000 genes

Returns:

Combined DataFrame with gene data from all queries

Return type:

pd.DataFrame

Raises:

ValueError – If any query strategies are invalid
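The validate-then-combine pattern described above might look like the sketch below. The names `EXAMPLE_VALID_QUERIES` and `fetch_all_queries_sketch` are hypothetical, the valid list contains only the two defaults shown in the signature, and dropping repeated Entrez IDs is an assumption about how results are combined rather than documented behavior.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical stand-in for MYGENE_QUERY_DEFS_LIST; only the two default
# strategies shown in the signature are listed.
EXAMPLE_VALID_QUERIES = ["type_of_gene:protein-coding", "type_of_gene:ncrna"]


def fetch_all_queries_sketch(
    fetch_one: Callable[[str], Iterable[Dict[str, Any]]],
    query_strategies: List[str],
) -> List[Dict[str, Any]]:
    """Hypothetical sketch: validate strategies, run each, and merge the records."""
    invalid = [q for q in query_strategies if q not in EXAMPLE_VALID_QUERIES]
    if invalid:
        raise ValueError(f"Invalid query strategies: {invalid}")

    combined: Dict[str, Dict[str, Any]] = {}
    for query in query_strategies:
        for record in fetch_one(query):
            # Assumption: the first record seen for an Entrez ID is kept.
            combined.setdefault(record["entrezgene"], record)
    return list(combined.values())
```

Passing the per-query fetcher as a callable keeps the sketch independent of any network access.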

napistu.ontologies.mygene._format_mygene_fields(mappings: Set[str]) → Set[str]

Format and validate ontology mappings for MyGene.info queries.

Parameters:

mappings (Set[str]) – Set of ontologies to validate and convert to MyGene.info field names

Returns:

Set of valid MyGene.info field names including NCBI_ENTREZ_GENE

Return type:

Set[str]

Raises:

ValueError – If any mappings are invalid
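As a rough illustration of this validate-and-convert step, the sketch below translates ontology names into MyGene.info field names and always includes the Entrez field. The lookup table `EXAMPLE_ONTOLOGY_TO_FIELD` and the function name are hypothetical; the actual ontology-to-field translation used by napistu is not shown in this reference and may differ.

```python
from typing import Dict, Set

# Assumed translation from ontology names to MyGene.info field names.
EXAMPLE_ONTOLOGY_TO_FIELD: Dict[str, str] = {
    "ensembl_gene": "ensembl.gene",
    "symbol": "symbol",
    "uniprot": "uniprot.Swiss-Prot",
}
# MyGene.info field that carries the Entrez gene ID.
ENTREZ_FIELD = "entrezgene"


def format_fields_sketch(mappings: Set[str]) -> Set[str]:
    """Hypothetical sketch: validate ontologies and return MyGene.info fields."""
    invalid = set(mappings) - set(EXAMPLE_ONTOLOGY_TO_FIELD)
    if invalid:
        raise ValueError(f"Invalid mappings: {sorted(invalid)}")
    fields = {EXAMPLE_ONTOLOGY_TO_FIELD[name] for name in mappings}
    # The Entrez field is always requested because every table is keyed on it.
    fields.add(ENTREZ_FIELD)
    return fields
```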

napistu.ontologies.mygene._format_mygene_species(species: str | int) → int

Convert species name or taxonomy ID to NCBI taxonomy ID.

Parameters:

species (Union[str, int]) – Species name (e.g. “Homo sapiens”) or NCBI taxonomy ID

Returns:

NCBI taxonomy ID

Return type:

int

Raises:

ValueError – If species name is not recognized
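The conversion can be pictured with a minimal sketch such as the one below. `EXAMPLE_SPECIES_TO_TAXID` is a hypothetical stand-in for the SPECIES_TO_TAXID lookup and lists only two species; passing integers through unchanged is an assumption based on the parameter description.

```python
from typing import Dict, Union

# Hypothetical stand-in for SPECIES_TO_TAXID, with two well-known
# NCBI taxonomy IDs for illustration.
EXAMPLE_SPECIES_TO_TAXID: Dict[str, int] = {
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
}


def format_species_sketch(species: Union[str, int]) -> int:
    """Hypothetical sketch: return an NCBI taxonomy ID for a name or an ID."""
    if isinstance(species, int):
        # Assumption: an integer is already a taxonomy ID and is returned as-is.
        return species
    if species in EXAMPLE_SPECIES_TO_TAXID:
        return EXAMPLE_SPECIES_TO_TAXID[species]
    raise ValueError(f"Unrecognized species name: {species!r}")
```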

napistu.ontologies.mygene.create_python_mapping_tables(mappings: Set[str], species: str = 'Homo sapiens', test_mode: bool = False, query_strategies: List[str] = ['type_of_gene:protein-coding', 'type_of_gene:ncrna']) → Dict[str, DataFrame]

Create genome-wide mapping tables between Entrez and other gene identifiers.

Python equivalent of create_bioconductor_mapping_tables, using the MyGene.info API.

Parameters:
  • mappings (Set[str]) – Set of ontologies to create mappings for. Must be valid ontologies from INTERCONVERTIBLE_GENIC_ONTOLOGIES.

  • species (str, default "Homo sapiens") – Species name (e.g., “Homo sapiens”, “Mus musculus”). Must be a key in SPECIES_TO_TAXID or a valid NCBI taxonomy ID.

  • test_mode (bool, default False) – If True, only fetch the first 1000 genes for testing purposes.

  • query_strategies (list of str, default MYGENE_DEFAULT_QUERIES) – MyGene.info query strings to run (see MYGENE_QUERY_DEFS_LIST).

Returns:

Dictionary with ontology names as keys and DataFrames as values. Each DataFrame has Entrez gene IDs as index and mapped identifiers as values.

Return type:

Dict[str, pd.DataFrame]

Raises:
  • ValueError – If any requested mappings are invalid or species is not recognized.

  • ImportError – If mygene package is not available.

Notes

The function uses the MyGene.info API to fetch gene annotations and builds mapping tables between gene identifier systems. Supported ontologies include Ensembl genes, transcripts, and proteins; UniProt; and gene symbols.

Examples

>>> mappings = {'ensembl_gene', 'symbol', 'uniprot'}
>>> tables = create_python_mapping_tables(mappings, 'Homo sapiens')
>>> print(tables['symbol'].head())

napistu.ontologies.mygene.unnest_mygene_ontology(df: DataFrame, field: str) → DataFrame

Unnest a column containing a list of dicts in MyGene.info results.

Parameters:
  • df (pd.DataFrame) – DataFrame containing MyGene.info results

  • field (str) – Field name to unnest, must contain a period to indicate nesting

Returns:

DataFrame with unnested values, containing columns for entrez ID and the unnested field value

Return type:

pd.DataFrame

Raises:

ValueError – If field format is invalid or data structure is unexpected
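To make the unnesting concrete, here is a minimal sketch that works on plain records (one dict per gene) instead of a DataFrame. `unnest_sketch` is a hypothetical name, the `entrezgene` key is assumed to hold the Entrez ID, and the handling of missing or scalar values is an assumption rather than the documented behavior of `unnest_mygene_ontology`.

```python
from typing import Any, Dict, List


def unnest_sketch(records: List[Dict[str, Any]], field: str) -> List[Dict[str, Any]]:
    """Hypothetical sketch: flatten a nested field such as 'ensembl.gene'."""
    if "." not in field:
        raise ValueError(f"Field {field!r} must contain a period to indicate nesting")
    parent, child = field.split(".", 1)

    rows: List[Dict[str, Any]] = []
    for record in records:
        nested = record.get(parent)
        if nested is None:
            continue  # this gene has no annotation for the parent field
        if isinstance(nested, dict):
            nested = [nested]  # a single annotation is treated as a one-item list
        if not isinstance(nested, list):
            raise ValueError(f"Unexpected data structure for {parent!r}")
        for item in nested:
            if not isinstance(item, dict):
                raise ValueError(f"Unexpected entry in {parent!r}: {item!r}")
            values = item.get(child)
            if values is None:
                continue
            if not isinstance(values, list):
                values = [values]
            for value in values:
                rows.append({"entrezgene": record["entrezgene"], field: value})
    return rows
```

Each output row pairs one Entrez ID with one unnested value, so a gene with several annotations contributes several rows.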