napistu.ontologies.mygene

Functions

`create_python_mapping_tables`(mappings[, ...])	Create genome-wide mapping tables between Entrez and other gene identifiers.
`unnest_mygene_ontology`(df, field)	Unnest a column containing a list of dicts in MyGene.info results.

napistu.ontologies.mygene._create_mygene_mapping_tables(mygene_results_df: DataFrame, mygene_fields: Set[str]) → Dict[str, DataFrame]

Create mapping tables from MyGene.info query results.

Parameters:

mygene_results_df (pd.DataFrame) – DataFrame containing MyGene.info query results
mygene_fields (Set[str]) – Set of MyGene.info fields that were queried

Returns:

Dictionary mapping ontology names to DataFrames containing identifier mappings

Return type:

Dict[str, pd.DataFrame]
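
As a rough illustration of the input and output shapes described above, the following sketch splits a flat results table into one Entrez-indexed table per field. It is not the library's implementation: the function name `mapping_tables_sketch`, the `entrezgene` column name, and the toy rows are assumptions made for this example, and real MyGene.info results may be nested rather than flat.

```python
import pandas as pd


def mapping_tables_sketch(results: pd.DataFrame, fields: set, entrez_col: str = "entrezgene") -> dict:
    """Hypothetical helper: build one Entrez-indexed table per queried field."""
    tables = {}
    for field in sorted(fields - {entrez_col}):
        # Keep only genes that actually have a value for this field.
        table = results[[entrez_col, field]].dropna(subset=[field])
        tables[field] = table.set_index(entrez_col)
    return tables


# Toy input written by hand for illustration; not fetched from MyGene.info.
results = pd.DataFrame(
    {
        "entrezgene": ["7157", "672", "1017"],
        "symbol": ["TP53", "BRCA1", "CDK2"],
        "ensembl.gene": ["ENSG00000141510", None, "ENSG00000123374"],
    }
)

tables = mapping_tables_sketch(results, {"entrezgene", "symbol", "ensembl.gene"})
print(tables["symbol"])
```

Dropping rows with missing values is a design choice of this sketch, so each table lists only genes that have the corresponding identifier.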

napistu.ontologies.mygene._fetch_mygene_data(mg: MyGeneInfo, query: str, taxa_id: int, fields: List[str], test_mode: bool = False) → DataFrame

Fetch gene data from MyGene.info for a single query.

Parameters:

mg (mygene.MyGeneInfo) – Initialized MyGene.info client
query (str) – Query string to search for genes
taxa_id (int) – NCBI taxonomy ID for the species
fields (List[str]) – List of MyGene.info fields to retrieve
test_mode (bool, default False) – If True, fetch only the first 1000 genes

Returns:

DataFrame containing gene data from the query

Return type:

pd.DataFrame

Raises:

ValueError – If query results are not in expected format

napistu.ontologies.mygene._fetch_mygene_data_all_queries(mg: MyGeneInfo, taxa_id: int, fields: List[str], query_strategies: List[str] = ['type_of_gene:protein-coding', 'type_of_gene:ncrna'], test_mode: bool = False) → DataFrame

Fetch comprehensive gene data from MyGene using multiple query strategies.

Parameters:

mg (mygene.MyGeneInfo) – Initialized MyGene.info client
taxa_id (int) – NCBI taxonomy ID for the species
fields (List[str]) – List of MyGene.info fields to retrieve
query_strategies (List[str], default MYGENE_DEFAULT_QUERIES) – List of query strategies to use from MYGENE_QUERY_DEFS_LIST
test_mode (bool, default False) – If True, fetch only the first 1000 genes

Returns:

Combined DataFrame with gene data from all queries

Return type:

pd.DataFrame

Raises:

ValueError – If any query strategies are invalid
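
The validate-then-combine pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `EXAMPLE_QUERY_DEFS` stands in for `MYGENE_QUERY_DEFS_LIST` (whose full contents are not shown here), `fetch_one` is a hypothetical callable replacing the per-query fetch, and `fake_fetch` returns made-up rows so the example runs without network access.

```python
from typing import Callable, List

import pandas as pd

# Assumption: stand-in for MYGENE_QUERY_DEFS_LIST, seeded with the two defaults
# that appear in the function signature.
EXAMPLE_QUERY_DEFS = ["type_of_gene:protein-coding", "type_of_gene:ncrna"]


def fetch_all_queries_sketch(
    fetch_one: Callable[[str], pd.DataFrame], query_strategies: List[str]
) -> pd.DataFrame:
    """Hypothetical helper: validate strategies, run each, and stack the results."""
    invalid = [q for q in query_strategies if q not in EXAMPLE_QUERY_DEFS]
    if invalid:
        raise ValueError(f"Invalid query strategies: {invalid}")
    frames = [fetch_one(q) for q in query_strategies]
    return pd.concat(frames, ignore_index=True)


def fake_fetch(query: str) -> pd.DataFrame:
    """Offline stub that fabricates two placeholder rows per query."""
    label = query.split(":", 1)[1]
    return pd.DataFrame(
        {"entrezgene": [f"{label}-1", f"{label}-2"], "type_of_gene": [label, label]}
    )


combined = fetch_all_queries_sketch(fake_fetch, EXAMPLE_QUERY_DEFS)
print(combined)
```

Validating every strategy before the first fetch means an invalid entry fails fast instead of after a long download.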

napistu.ontologies.mygene._format_mygene_fields(mappings: Set[str]) → Set[str]

Format and validate ontology mappings for MyGene.info queries.

Parameters:

mappings (Set[str]) – Set of ontologies to validate and convert to MyGene.info field names

Returns:

Set of valid MyGene.info field names including NCBI_ENTREZ_GENE

Return type:

Set[str]

Raises:

ValueError – If any mappings are invalid
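
A minimal sketch of this validate-and-translate step is shown below. The lookup table `EXAMPLE_ONTOLOGY_TO_FIELD`, the key `"ncbi_entrez_gene"`, and the function name are assumptions made for illustration; the actual ontology names and the value of NCBI_ENTREZ_GENE are defined by the library and are not reproduced here.

```python
from typing import Iterable, Set

# Assumptions: a hypothetical ontology-name -> MyGene.info field lookup,
# and a hypothetical key for the Entrez ontology.
EXAMPLE_ENTREZ = "ncbi_entrez_gene"
EXAMPLE_ONTOLOGY_TO_FIELD = {
    EXAMPLE_ENTREZ: "entrezgene",
    "ensembl_gene": "ensembl.gene",
    "symbol": "symbol",
}


def format_fields_sketch(mappings: Iterable[str]) -> Set[str]:
    """Hypothetical helper: reject unknown ontologies, then translate to field names."""
    requested = set(mappings)
    invalid = requested - set(EXAMPLE_ONTOLOGY_TO_FIELD)
    if invalid:
        raise ValueError(f"Invalid mappings: {sorted(invalid)}")
    # Entrez is always included because it keys every mapping table.
    requested.add(EXAMPLE_ENTREZ)
    return {EXAMPLE_ONTOLOGY_TO_FIELD[name] for name in requested}


fields = format_fields_sketch({"symbol", "ensembl_gene"})
print(sorted(fields))
```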

napistu.ontologies.mygene._format_mygene_species(species: str | int) → int

Convert species name or taxonomy ID to NCBI taxonomy ID.

Parameters:

species (Union[str, int]) – Species name (e.g. “Homo sapiens”) or NCBI taxonomy ID

Returns:

NCBI taxonomy ID

Return type:

int

Raises:

ValueError – If species name is not recognized
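
The name-or-ID normalization described above might look roughly like this. The dictionary `EXAMPLE_SPECIES_TO_TAXID` is a two-entry stand-in for the library's SPECIES_TO_TAXID, and `format_species_sketch` is a hypothetical name; only the behavior stated in this reference (integers are treated as taxonomy IDs, unknown names raise ValueError) is modeled.

```python
from typing import Union

# Assumption: a small stand-in for SPECIES_TO_TAXID.
# 9606 and 10090 are the NCBI taxonomy IDs for human and mouse.
EXAMPLE_SPECIES_TO_TAXID = {
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
}


def format_species_sketch(species: Union[str, int]) -> int:
    """Hypothetical helper: return an NCBI taxonomy ID for a name or an ID."""
    if isinstance(species, int):
        # Already a taxonomy ID; pass it through unchanged.
        return species
    if species in EXAMPLE_SPECIES_TO_TAXID:
        return EXAMPLE_SPECIES_TO_TAXID[species]
    raise ValueError(f"Unrecognized species: {species!r}")


print(format_species_sketch("Homo sapiens"))
print(format_species_sketch(10090))
```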

napistu.ontologies.mygene.create_python_mapping_tables(mappings: Set[str], species: str = 'Homo sapiens', test_mode: bool = False, query_strategies: List[str] = ['type_of_gene:protein-coding', 'type_of_gene:ncrna']) → Dict[str, DataFrame]

Create genome-wide mapping tables between Entrez and other gene identifiers.

A Python equivalent of create_bioconductor_mapping_tables that uses the MyGene.info API.

Parameters:

mappings (Set[str]) – Set of ontologies to create mappings for. Must be valid ontologies from INTERCONVERTIBLE_GENIC_ONTOLOGIES.
species (str, default "Homo sapiens") – Species name (e.g., “Homo sapiens”, “Mus musculus”) that is a key in SPECIES_TO_TAXID, or a valid NCBI taxonomy ID.
test_mode (bool, default False) – If True, only fetch the first 1000 genes for testing purposes.
query_strategies (list of str, default MYGENE_DEFAULT_QUERIES) – MyGene.info query strings to run (see MYGENE_QUERY_DEFS_LIST).

Returns:

Dictionary with ontology names as keys and DataFrames as values. Each DataFrame has Entrez gene IDs as index and mapped identifiers as values.

Return type:

Dict[str, pd.DataFrame]

Raises:

ValueError – If any requested mappings are invalid or species is not recognized.
ImportError – If mygene package is not available.

Notes

The function uses the MyGene.info API to fetch gene annotations and builds mapping tables between gene identifier systems. Supported ontologies include Ensembl genes, transcripts, and proteins, UniProt, and gene symbols.

Examples

>>> mappings = {'ensembl_gene', 'symbol', 'uniprot'}
>>> tables = create_python_mapping_tables(mappings, 'Homo sapiens')
>>> print(tables['symbol'].head())

napistu.ontologies.mygene.unnest_mygene_ontology(df: DataFrame, field: str) → DataFrame

Unnest a column containing a list of dicts in MyGene.info results.

Parameters:

df (pd.DataFrame) – DataFrame containing MyGene.info results
field (str) – Field name to unnest; it must contain a period to indicate nesting

Returns:

DataFrame of unnested values, with one column for the Entrez ID and one for the unnested field value

Return type:

pd.DataFrame

Raises:

ValueError – If field format is invalid or data structure is unexpected
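
To make the unnesting idea concrete, here is a small sketch that expands a column of dicts (or lists of dicts) into one row per value. It is an illustration under assumptions, not the library's implementation: the function name `unnest_sketch`, the `entrezgene` column name, and the placeholder values `G1` to `G3` are invented for the example, and cases such as list-valued inner keys are not handled.

```python
import pandas as pd


def unnest_sketch(df: pd.DataFrame, field: str, entrez_col: str = "entrezgene") -> pd.DataFrame:
    """Hypothetical helper: expand a nested 'column.key' field into long format."""
    if "." not in field:
        raise ValueError(f"Field must contain a period to indicate nesting: {field!r}")
    column, key = field.split(".", 1)

    rows = []
    for entrez_id, value in zip(df[entrez_col], df[column]):
        if isinstance(value, dict):
            # Treat a single dict like a one-element list.
            value = [value]
        if not isinstance(value, list):
            # Missing annotation (e.g. None or NaN); nothing to unnest.
            continue
        for item in value:
            if isinstance(item, dict) and key in item:
                rows.append({entrez_col: entrez_id, field: item[key]})
    return pd.DataFrame(rows, columns=[entrez_col, field])


# Toy input with placeholder identifiers; not real MyGene.info output.
df = pd.DataFrame(
    {
        "entrezgene": ["1", "2", "3"],
        "ensembl": [[{"gene": "G1"}, {"gene": "G2"}], {"gene": "G3"}, None],
    }
)

out = unnest_sketch(df, "ensembl.gene")
print(out)
```

The result is in long format, so a gene with several nested entries contributes several rows that share the same Entrez ID.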