napistu.ontologies.mygene
Functions
| create_python_mapping_tables | Create genome-wide mapping tables between Entrez and other gene identifiers. |
| unnest_mygene_ontology | Unnest a column containing a list of dicts in MyGene.info results. |
- napistu.ontologies.mygene._create_mygene_mapping_tables(mygene_results_df: DataFrame, mygene_fields: Set[str]) Dict[str, DataFrame]
Create mapping tables from MyGene.info query results.
- Parameters:
mygene_results_df (pd.DataFrame) – DataFrame containing MyGene.info query results
mygene_fields (Set[str]) – Set of MyGene.info fields that were queried
- Returns:
Dictionary mapping ontology names to DataFrames containing identifier mappings
- Return type:
Dict[str, pd.DataFrame]
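The step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not napistu's implementation: the helper name `build_tables`, the `entrezgene` key column, and the made-up records are all hypothetical, and only flat (non-nested) columns are handled.

```python
import pandas as pd


def build_tables(results: pd.DataFrame, columns: list) -> dict:
    """Split a results table into one Entrez-indexed table per column."""
    tables = {}
    for col in columns:
        # Keep only rows that carry a value, drop repeated pairs,
        # and key the table on the Entrez gene ID.
        tables[col] = (
            results[["entrezgene", col]]
            .dropna()
            .drop_duplicates()
            .set_index("entrezgene")
        )
    return tables


# Made-up records standing in for MyGene.info query results.
results = pd.DataFrame(
    {
        "entrezgene": ["1", "2", "2"],
        "symbol": ["GENE_A", "GENE_B", "GENE_B"],
        "name": ["made-up gene A", None, None],
    }
)
tables = build_tables(results, ["symbol", "name"])
```

Each value in `tables` is a one-column DataFrame indexed by Entrez ID, which matches the return shape documented for `create_python_mapping_tables`.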
- napistu.ontologies.mygene._fetch_mygene_data(mg: MyGeneInfo, query: str, taxa_id: int, fields: List[str], test_mode: bool = False) DataFrame
Fetch gene data from MyGene.info for a single query.
- Parameters:
mg (mygene.MyGeneInfo) – Initialized MyGene.info client
query (str) – Query string to search for genes
taxa_id (int) – NCBI taxonomy ID for the species
fields (List[str]) – List of MyGene.info fields to retrieve
test_mode (bool, default False) – If True, only fetch the first 1000 genes
- Returns:
DataFrame containing gene data from the query
- Return type:
pd.DataFrame
- Raises:
ValueError – If query results are not in expected format
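A hedged sketch of what a single-query fetch might look like. It assumes the client exposes `query(q, species=..., fields=..., size=..., fetch_all=...)` as the `mygene` package's `MyGeneInfo` does; the function name `fetch_hits` and the `FakeClient` class are hypothetical, and the fake client exists only so the example runs offline. The sketch returns a list of dicts rather than a DataFrame.

```python
from typing import Any, Dict, List


def fetch_hits(mg: Any, query: str, taxa_id: int, fields: List[str],
               test_mode: bool = False) -> List[Dict[str, Any]]:
    """Run one query against a MyGene-style client and return its hits."""
    common = {"species": taxa_id, "fields": ",".join(fields)}
    if test_mode:
        # One page only, capped at 1000 hits.
        result = mg.query(query, size=1000, **common)
        if not isinstance(result, dict) or "hits" not in result:
            raise ValueError("Query result is not in the expected format")
        hits = result["hits"]
    else:
        # fetch_all=True asks the client for an iterator over every hit.
        hits = list(mg.query(query, fetch_all=True, **common))
    if not all(isinstance(hit, dict) for hit in hits):
        raise ValueError("Query hits are not in the expected format")
    return hits


class FakeClient:
    """Offline stand-in for mygene.MyGeneInfo (illustration only)."""

    def query(self, q, fetch_all=False, **kwargs):
        hits = [{"entrezgene": "7157", "symbol": "TP53"}]
        return iter(hits) if fetch_all else {"hits": hits}


hits = fetch_hits(FakeClient(), "type_of_gene:protein-coding", 9606, ["symbol"])
```

With a real client, `mg` would be `mygene.MyGeneInfo()`, and the returned list could be passed to `pandas.DataFrame` to get a tabular result.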
- napistu.ontologies.mygene._fetch_mygene_data_all_queries(mg: MyGeneInfo, taxa_id: int, fields: List[str], query_strategies: List[str] = ['type_of_gene:protein-coding', 'type_of_gene:ncrna'], test_mode: bool = False) DataFrame
Fetch comprehensive gene data from MyGene using multiple query strategies.
- Parameters:
mg (mygene.MyGeneInfo) – Initialized MyGene.info client
taxa_id (int) – NCBI taxonomy ID for the species
fields (List[str]) – List of MyGene.info fields to retrieve
query_strategies (List[str], default MYGENE_DEFAULT_QUERIES) – List of query strategies to use from MYGENE_QUERY_DEFS_LIST
test_mode (bool, default False) – If True, only fetch the first 1000 genes
- Returns:
Combined DataFrame with gene data from all queries
- Return type:
pd.DataFrame
- Raises:
ValueError – If any query strategies are invalid
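The validate-then-combine pattern described above might look like the sketch below. The names `EXAMPLE_QUERY_DEFS` and `fetch_all_queries` are hypothetical, the records are made up, and de-duplicating on the Entrez ID is an assumption of this sketch rather than documented behavior.

```python
from typing import Callable, Dict, Iterable, List

Hit = Dict[str, str]

# Hypothetical list of permitted query strings; these two are the
# defaults shown in the signature above.
EXAMPLE_QUERY_DEFS = ["type_of_gene:protein-coding", "type_of_gene:ncrna"]


def fetch_all_queries(fetch_one: Callable[[str], Iterable[Hit]],
                      strategies: List[str]) -> List[Hit]:
    """Validate the strategies, run each one, and merge the hits."""
    invalid = [s for s in strategies if s not in EXAMPLE_QUERY_DEFS]
    if invalid:
        raise ValueError(f"Invalid query strategies: {invalid}")
    combined: Dict[str, Hit] = {}
    for strategy in strategies:
        for hit in fetch_one(strategy):
            # Assumption: the first record seen for an Entrez ID wins.
            combined.setdefault(hit["entrezgene"], hit)
    return list(combined.values())


# Made-up results keyed by query string, standing in for live queries.
FAKE_RESULTS = {
    "type_of_gene:protein-coding": [{"entrezgene": "1", "symbol": "GENE_A"}],
    "type_of_gene:ncrna": [
        {"entrezgene": "1", "symbol": "GENE_A"},
        {"entrezgene": "2", "symbol": "GENE_B"},
    ],
}

combined_hits = fetch_all_queries(lambda s: FAKE_RESULTS[s], EXAMPLE_QUERY_DEFS)
```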
- napistu.ontologies.mygene._format_mygene_fields(mappings: Set[str]) Set[str]
Format and validate ontology mappings for MyGene.info queries.
- Parameters:
mappings (Set[str]) – Set of ontologies to validate and convert to MyGene.info field names
- Returns:
Set of valid MyGene.info field names including NCBI_ENTREZ_GENE
- Return type:
Set[str]
- Raises:
ValueError – If any mappings are invalid
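As an illustration of the validation and conversion described above, here is a minimal sketch. The lookup table `EXAMPLE_ONTOLOGY_TO_FIELD`, its keys, and the function `format_fields` are hypothetical; the right-hand values are MyGene.info field names, but the actual ontology-to-field mapping used by napistu may differ.

```python
from typing import Dict, Set

# Hypothetical ontology name -> MyGene.info field lookup (illustrative only).
EXAMPLE_ONTOLOGY_TO_FIELD: Dict[str, str] = {
    "ncbi_entrez_gene": "entrezgene",
    "ensembl_gene": "ensembl.gene",
    "symbol": "symbol",
    "uniprot": "uniprot.Swiss-Prot",
}
ENTREZ = "ncbi_entrez_gene"


def format_fields(mappings: Set[str]) -> Set[str]:
    """Validate ontology names and convert them to query field names."""
    invalid = mappings - set(EXAMPLE_ONTOLOGY_TO_FIELD)
    if invalid:
        raise ValueError(f"Invalid mappings: {sorted(invalid)}")
    # Always include the Entrez field so every table can be keyed on it.
    return {EXAMPLE_ONTOLOGY_TO_FIELD[m] for m in mappings | {ENTREZ}}


fields = format_fields({"symbol", "ensembl_gene"})
```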
- napistu.ontologies.mygene._format_mygene_species(species: str | int) int
Convert species name or taxonomy ID to NCBI taxonomy ID.
- Parameters:
species (Union[str, int]) – Species name (e.g. “Homo sapiens”) or NCBI taxonomy ID
- Returns:
NCBI taxonomy ID
- Return type:
int
- Raises:
ValueError – If species name is not recognized
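A minimal sketch of this conversion, assuming a simple dictionary lookup. `EXAMPLE_SPECIES_TO_TAXID` and `to_taxid` are hypothetical stand-ins for illustration, not napistu's `SPECIES_TO_TAXID` or its implementation; the two IDs shown are the NCBI taxonomy IDs for human and mouse.

```python
from typing import Union

# Hypothetical subset of a species-name -> NCBI taxonomy ID lookup.
EXAMPLE_SPECIES_TO_TAXID = {
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
}


def to_taxid(species: Union[str, int]) -> int:
    """Return an NCBI taxonomy ID for a species name, or pass an ID through."""
    if isinstance(species, int):
        # Assumption: an integer is already a taxonomy ID.
        return species
    if species not in EXAMPLE_SPECIES_TO_TAXID:
        raise ValueError(f"Unrecognized species name: {species!r}")
    return EXAMPLE_SPECIES_TO_TAXID[species]


human_taxid = to_taxid("Homo sapiens")
```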
- napistu.ontologies.mygene.create_python_mapping_tables(mappings: Set[str], species: str = 'Homo sapiens', test_mode: bool = False, query_strategies: List[str] = ['type_of_gene:protein-coding', 'type_of_gene:ncrna']) Dict[str, DataFrame]
Create genome-wide mapping tables between Entrez and other gene identifiers.
Python equivalent of create_bioconductor_mapping_tables, using the MyGene.info API.
- Parameters:
mappings (Set[str]) – Set of ontologies to create mappings for. Must be valid ontologies from INTERCONVERTIBLE_GENIC_ONTOLOGIES.
species (str, default "Homo sapiens") – Species name (e.g., “Homo sapiens”, “Mus musculus”). Must be a key in SPECIES_TO_TAXID or a valid NCBI taxonomy ID.
test_mode (bool, default False) – If True, only fetch the first 1000 genes for testing purposes.
query_strategies (list of str, default MYGENE_DEFAULT_QUERIES) – MyGene.info query strings to run (see MYGENE_QUERY_DEFS_LIST).
- Returns:
Dictionary with ontology names as keys and DataFrames as values. Each DataFrame has Entrez gene IDs as index and mapped identifiers as values.
- Return type:
Dict[str, pd.DataFrame]
- Raises:
ValueError – If any requested mappings are invalid or species is not recognized.
ImportError – If mygene package is not available.
Notes
The function uses the MyGene.info API to fetch gene annotations and creates mapping tables between gene identifier systems. Supported ontologies include Ensembl genes, transcripts, and proteins, UniProt, and gene symbols.
Examples
>>> from napistu.ontologies.mygene import create_python_mapping_tables
>>> mappings = {'ensembl_gene', 'symbol', 'uniprot'}
>>> tables = create_python_mapping_tables(mappings, 'Homo sapiens')
>>> print(tables['symbol'].head())
- napistu.ontologies.mygene.unnest_mygene_ontology(df: DataFrame, field: str) DataFrame
Unnest a column containing a list of dicts in MyGene.info results.
- Parameters:
df (pd.DataFrame) – DataFrame containing MyGene.info results
field (str) – Field name to unnest; it must contain a period to indicate nesting
- Returns:
DataFrame with unnested values, containing columns for the Entrez ID and the unnested field value
- Return type:
pd.DataFrame
- Raises:
ValueError – If field format is invalid or data structure is unexpected
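The unnesting described above can be sketched as follows. This is not the library's implementation: the function name `unnest_field`, the `entrezgene` ID column, and the made-up identifiers are assumptions for illustration. It treats a dotted field such as `ensembl.gene` as "key `gene` inside each dict of column `ensembl`".

```python
import pandas as pd


def unnest_field(df: pd.DataFrame, field: str,
                 id_col: str = "entrezgene") -> pd.DataFrame:
    """Expand a column of dicts (or lists of dicts) into one row per value."""
    if "." not in field:
        raise ValueError(f"Field must contain a period: {field!r}")
    parent, child = field.split(".", 1)
    rows = []
    for entrez_id, value in zip(df[id_col], df[parent]):
        if isinstance(value, dict):
            value = [value]  # a single annotation is handled as a 1-item list
        if not isinstance(value, list):
            continue  # missing annotation for this gene
        for item in value:
            if not isinstance(item, dict):
                raise ValueError(f"Unexpected entry in {parent!r}: {item!r}")
            nested = item.get(child)
            if nested is None:
                continue
            # The nested value may itself be a list of identifiers.
            for v in nested if isinstance(nested, list) else [nested]:
                rows.append({id_col: entrez_id, field: v})
    return pd.DataFrame(rows, columns=[id_col, field])


# Made-up records: one dict, one list of dicts, one missing value.
results = pd.DataFrame(
    {
        "entrezgene": ["1", "2", "3"],
        "ensembl": [
            {"gene": "ENSG_A"},
            [{"gene": "ENSG_B1"}, {"gene": "ENSG_B2"}],
            None,
        ],
    }
)
unnested = unnest_field(results, "ensembl.gene")
```

Genes with several nested entries produce several rows, so the Entrez ID column of the result is not unique.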