napistu.ontologies.standardization

Standardization of ontologies for creating Identifiers and URLs

Public Functions

check_reactome_identifier_compatibility

Check whether two sets of Reactome identifiers are from the same species.

create_uri_url

Convert from an identifier and ontology to a URL reference for the identifier.

ensembl_id_to_url_regex

Map an ensembl ID to a validation regex and its canonical url on ensembl.

format_uri

Convert a RDF URI into an identifier list.

format_uri_url

Convert a URI into an identifier dictionary.

format_uri_url_identifiers_dot_org

Parse identifiers.org identifiers from a split URL path.

is_known_unsupported_uri

Check if a URI is known to be unsupported/pathological.

parse_ensembl_id

Extract the molecule type and species name from an ensembl identifier.

Functions

check_reactome_identifier_compatibility(...)

Check Reactome Identifier Compatibility

create_uri_url(ontology, identifier[, strict])

Create URI URL

ensembl_id_to_url_regex(identifier, ontology)

Ensembl ID to URL and Regex

format_uri(uri, bqb[, strict])

Convert a RDF URI into an identifier list

format_uri_url(uri[, strict])

Convert a URI into an identifier dictionary

format_uri_url_identifiers_dot_org(split_path)

Parse identifiers.org identifiers

is_known_unsupported_uri(uri)

Check if a URI is known to be unsupported/pathological.

parse_ensembl_id(input_str)

Parse Ensembl ID

napistu.ontologies.standardization._count_reactome_species(reactome_series: Series) -> Series

Count the number of species tags in a set of reactome IDs

napistu.ontologies.standardization._format_Identifiers_pubmed(pubmed_id: str) -> Identifiers

Format Identifiers for a single PubMed ID.

These will generally be used in an r_Identifiers field.

napistu.ontologies.standardization._infer_primary_reactome_species(reactome_series: Series) -> tuple[str, int]

Infer the best supported species based on a set of Reactome identifiers

napistu.ontologies.standardization._netloc_to_identifiers_matrixdb_adaptor(uri, class_regex, id_regex)
napistu.ontologies.standardization._netloc_to_identifiers_mirbase_adaptor(split_path, result)
napistu.ontologies.standardization._netloc_to_identifiers_pubchem_adaptor(split_path, result)
napistu.ontologies.standardization._netloc_w_url_prefix_to_identifiers_ncbi_adaptor(result, uri)
napistu.ontologies.standardization._netloc_w_url_suffix_to_identifiers_ensembl_adaptor(result, ontology: str)
napistu.ontologies.standardization._netloc_w_url_suffix_to_identifiers_phosphosite_adaptor(split_path, result)
napistu.ontologies.standardization._reactome_id_species(reactome_id: str) -> str

Extract the species code from a Reactome ID

napistu.ontologies.standardization._split_one_to_identifiers_chebi_adaptor(split_path, result)
napistu.ontologies.standardization._validate_bqb(bqb: str) -> None

Validate a BQB code

Parameters:

bqb (str) – The BQB code to validate

Return type:

None

Raises:
  • TypeError – If the BQB code is not a string

  • ValueError – If the BQB code does not start with ‘BQB’
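The two checks documented above can be sketched as follows. This is a minimal illustration based only on the documented Raises entries, not the library's actual implementation; the function name `sketch_validate_bqb` and the example value "BQB_IS" are assumptions made for the example.

```python
def sketch_validate_bqb(bqb) -> None:
    """Illustrative stand-in for the documented BQB validation (hypothetical name)."""
    # Documented behavior: a non-string BQB code raises TypeError.
    if not isinstance(bqb, str):
        raise TypeError(f"bqb must be a str, got {type(bqb).__name__}")
    # Documented behavior: a code that does not start with 'BQB' raises ValueError.
    if not bqb.startswith("BQB"):
        raise ValueError(f"bqb must start with 'BQB', got {bqb!r}")


# "BQB_IS" is an assumed example value; any string starting with 'BQB' passes.
sketch_validate_bqb("BQB_IS")
```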

napistu.ontologies.standardization.check_reactome_identifier_compatibility(reactome_series_a: Series, reactome_series_b: Series) -> None

Check Reactome Identifier Compatibility

Determine whether two sets of Reactome identifiers are from the same organismal species.

Parameters:
  • reactome_series_a (pd.Series) – a Series containing Reactome identifiers

  • reactome_series_b (pd.Series) – a Series containing Reactome identifiers

Return type:

None
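To illustrate the idea of comparing species across two sets of Reactome identifiers, here is a dependency-free sketch. It assumes Reactome stable IDs of the form R-<species code>-<number> (for example R-HSA-109581) and uses plain lists rather than the pd.Series the real function accepts. All function names are hypothetical, and because this page does not say how an incompatibility is reported, the sketch simply returns a boolean.

```python
from collections import Counter


def sketch_reactome_id_species(reactome_id: str) -> str:
    """Return the species code of a Reactome stable ID, e.g. 'HSA' for 'R-HSA-109581'."""
    parts = reactome_id.split("-")
    if len(parts) < 3 or parts[0] != "R":
        raise ValueError(f"{reactome_id!r} does not look like a Reactome stable ID")
    return parts[1]


def sketch_infer_primary_species(reactome_ids) -> tuple:
    """Return (species code, count) for the most frequent species code."""
    counts = Counter(sketch_reactome_id_species(rid) for rid in reactome_ids)
    return counts.most_common(1)[0]


def sketch_same_primary_species(ids_a, ids_b) -> bool:
    """True when both collections share the same most frequent species code."""
    return sketch_infer_primary_species(ids_a)[0] == sketch_infer_primary_species(ids_b)[0]


human_ids = ["R-HSA-109581", "R-HSA-1640170", "R-MMU-109581"]
mouse_ids = ["R-MMU-109581", "R-MMU-1640170"]
print(sketch_infer_primary_species(human_ids))
print(sketch_same_primary_species(human_ids, mouse_ids))
```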

napistu.ontologies.standardization.create_uri_url(ontology: str, identifier: str, strict: bool = True) -> str

Create URI URL

Convert from an identifier and ontology to a URL reference for the identifier

Parameters:
  • ontology (str) – An ontology for organizing genes, metabolites, etc.

  • identifier (str) – A systematic identifier from the “ontology” ontology.

  • strict (bool) – If True, raise an error for an invalid ID; if False, return None instead.

Returns:

url – A url representing a unique identifier

Return type:

str
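The strict / non-strict behavior described above can be sketched with a small lookup table. This is an illustration only: the table `_EXAMPLE_RULES`, its two entries, the regexes, the identifiers.org URL templates, and the function name `sketch_create_url` are all assumptions for the example and are not taken from napistu.

```python
import re
from typing import Optional

# Hypothetical table: ontology -> (validation regex, URL template).
_EXAMPLE_RULES = {
    "pubmed": (r"\d+", "https://identifiers.org/pubmed:{identifier}"),
    "chebi": (r"\d+", "https://identifiers.org/CHEBI:{identifier}"),
}


def sketch_create_url(ontology: str, identifier: str, strict: bool = True) -> Optional[str]:
    """Build a URL for an identifier, validating it against the ontology's regex."""
    if ontology not in _EXAMPLE_RULES:
        raise NotImplementedError(f"no URL rule for ontology {ontology!r}")
    id_regex, template = _EXAMPLE_RULES[ontology]
    if re.fullmatch(id_regex, identifier) is None:
        # strict=True raises; strict=False signals failure by returning None.
        if strict:
            raise ValueError(f"{identifier!r} is not a valid {ontology} identifier")
        return None
    return template.format(identifier=identifier)


print(sketch_create_url("pubmed", "16333295"))
print(sketch_create_url("chebi", "not-a-number", strict=False))
```

Returning None in non-strict mode lets a caller process a batch of identifiers and filter out failures afterwards instead of wrapping every call in a try block.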

napistu.ontologies.standardization.ensembl_id_to_url_regex(identifier: str, ontology: str) -> tuple[str, str]

Ensembl ID to URL and Regex

Map an ensembl ID to a validation regex and its canonical url on ensembl

Parameters:
  • identifier (str) – A standard identifier from ensembl genes, transcripts, or proteins

  • ontology (str) – The standard ontology (ensembl_gene, ensembl_transcript, or ensembl_protein)

Returns:

  • id_regex – a regex which should match a valid entry in this ontology

  • url – the id’s url on ensembl

Return type:

tuple[str, str]

napistu.ontologies.standardization.format_uri(uri: str, bqb: str, strict: bool = True) -> list[dict] | None

Convert a RDF URI into an identifier list

Parameters:
  • uri (str) – The RDF URI to convert

  • bqb (str) – The BQB to add to the identifier

  • strict (bool) – Whether to raise an error if the URI is not valid

Returns:

The identifier list or None if the URI is not valid

Return type:

Optional[list[dict]]

napistu.ontologies.standardization.format_uri_url(uri: str, strict: bool = True) -> dict

Convert a URI into an identifier dictionary

Parameters:
  • uri (str) – The URI to convert

  • strict (bool) – Whether to raise an error if the URI is not valid

Returns:

The identifier dictionary

Return type:

dict

Raises:
  • NotImplementedError – If a parsing procedure has not been implemented for the netloc

  • TypeError – If the URI is not valid

  • ValueError – If there is a pathological identifier within ontology-specific parsing

napistu.ontologies.standardization.format_uri_url_identifiers_dot_org(split_path: list[str])

Parse identifiers.org identifiers

identifiers.org identifiers come in two formats:

  1. http://identifiers.org/<ontology>/<id>

  2. http://identifiers.org/<ontology>:<id>

The newer format (2) is currently detected by looking for a “:” in the second element of the split path.

The ontology is also converted to lowercase.

Parameters:

split_path (list[str]) – split url path

Returns:

ontology, identifier

Return type:

tuple[str, str]
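The two identifiers.org formats described above can be told apart as in the sketch below. It assumes that split_path is the URL path split on "/", so that its first element is an empty string and the "second element" is index 1. The function name `sketch_parse_identifiers_dot_org` and the helper `to_split_path` are hypothetical.

```python
from urllib.parse import urlsplit


def to_split_path(url: str) -> list:
    """Split the path of a URL on '/', keeping the leading empty element."""
    return urlsplit(url).path.split("/")


def sketch_parse_identifiers_dot_org(split_path: list) -> tuple:
    """Return (ontology, identifier) from a split identifiers.org path."""
    second = split_path[1]
    if ":" in second:
        # Newer format: /<ontology>:<id>
        ontology, identifier = second.split(":", 1)
    else:
        # Older format: /<ontology>/<id>
        ontology, identifier = second, split_path[2]
    # The ontology is normalized to lowercase, as documented.
    return ontology.lower(), identifier


print(sketch_parse_identifiers_dot_org(to_split_path("http://identifiers.org/uniprot/P04637")))
print(sketch_parse_identifiers_dot_org(to_split_path("http://identifiers.org/CHEBI:36927")))
```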

napistu.ontologies.standardization.is_known_unsupported_uri(uri: str) -> bool

Check if a URI is known to be unsupported/pathological.

This prevents throwing exceptions for URIs we know we can’t parse, allowing for cleaner logging and batch processing.

Parameters:

uri (str) – The URI to check

Returns:

True if the URI is known to be unsupported

Return type:

bool

napistu.ontologies.standardization.parse_ensembl_id(input_str: str) -> tuple[str, str, str]

Parse Ensembl ID

Extract the molecule type and species name from a string containing an ensembl identifier.

Parameters:

input_str (str) – A string containing an ensembl gene, transcript, or protein identifier

Returns:

identifier (str):

The substring matching the full identifier

molecule_type (str):
The ontology the identifier belongs to:
  • G -> ensembl_gene

  • T -> ensembl_transcript

  • P -> ensembl_protein

organismal_species (str):

The species name the identifier belongs to

Return type:

tuple[str, str, str]
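A rough sketch of this kind of parsing is shown below. It assumes the common Ensembl stable ID layout of "ENS", an optional three-letter species prefix, a feature letter (G, T, or P), and eleven digits. The species table is deliberately tiny and illustrative, and the function name, regex, and species name format are assumptions rather than napistu's actual implementation.

```python
import re

# Mapping documented above: feature letter -> ontology.
_MOLECULE_TYPES = {
    "G": "ensembl_gene",
    "T": "ensembl_transcript",
    "P": "ensembl_protein",
}

# Illustrative, incomplete table: species prefix -> species name.
# Human identifiers carry no species prefix.
_EXAMPLE_SPECIES = {"": "Homo sapiens", "MUS": "Mus musculus"}

_ENSEMBL_PATTERN = re.compile(r"ENS([A-Z]{3})?([GTP])\d{11}")


def sketch_parse_ensembl_id(input_str: str) -> tuple:
    """Return (identifier, molecule_type, organismal_species) found in input_str."""
    match = _ENSEMBL_PATTERN.search(input_str)
    if match is None:
        raise ValueError(f"no Ensembl identifier found in {input_str!r}")
    prefix = match.group(1) or ""
    if prefix not in _EXAMPLE_SPECIES:
        raise ValueError(f"species prefix {prefix!r} is not in the example table")
    return match.group(0), _MOLECULE_TYPES[match.group(2)], _EXAMPLE_SPECIES[prefix]


print(sketch_parse_ensembl_id("ENSG00000139618"))
print(sketch_parse_ensembl_id("see ENSMUSP00000017167 for details"))
```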