napistu.ontologies.standardization

Standardization of ontologies for creating Identifiers and URLs

Public Functions

check_reactome_identifier_compatibility: Check whether two sets of Reactome identifiers are from the same species.
create_uri_url: Convert from an identifier and ontology to a URL reference for the identifier.
ensembl_id_to_url_regex: Map an ensembl ID to a validation regex and its canonical url on ensembl.
format_uri: Convert an RDF URI into an identifier list.
format_uri_url: Convert a URI into an identifier dictionary.
format_uri_url_identifiers_dot_org: Parse identifiers.org identifiers from a split URL path.
is_known_unsupported_uri: Check if a URI is known to be unsupported/pathological.
parse_ensembl_id: Extract the molecule type and species name from an ensembl identifier.

Functions

`check_reactome_identifier_compatibility`(...)	Check Reactome Identifier Compatibility
`create_uri_url`(ontology, identifier[, strict])	Create URI URL
`ensembl_id_to_url_regex`(identifier, ontology)	Ensembl ID to URL and Regex
`format_uri`(uri, bqb[, strict])	Convert an RDF URI into an identifier list
`format_uri_url`(uri[, strict])	Convert a URI into an identifier dictionary
`format_uri_url_identifiers_dot_org`(split_path)	Parse identifiers.org identifiers
`is_known_unsupported_uri`(uri)	Check if a URI is known to be unsupported/pathological.
`parse_ensembl_id`(input_str)	Parse Ensembl ID

napistu.ontologies.standardization._count_reactome_species(reactome_series: Series) → Series

Count the number of species tags in a set of Reactome IDs.

napistu.ontologies.standardization._format_Identifiers_pubmed(pubmed_id: str) → Identifiers

Format Identifiers for a single PubMed ID.

These will generally be used in an r_Identifiers field.

napistu.ontologies.standardization._infer_primary_reactome_species(reactome_series: Series) → tuple[str, int]

Infer the best-supported species from a set of Reactome identifiers.

napistu.ontologies.standardization._netloc_to_identifiers_matrixdb_adaptor(uri, class_regex, id_regex)

napistu.ontologies.standardization._netloc_to_identifiers_mirbase_adaptor(split_path, result)

napistu.ontologies.standardization._netloc_to_identifiers_pubchem_adaptor(split_path, result)

napistu.ontologies.standardization._netloc_w_url_prefix_to_identifiers_ncbi_adaptor(result, uri)

napistu.ontologies.standardization._netloc_w_url_suffix_to_identifiers_ensembl_adaptor(result, ontology: str)

napistu.ontologies.standardization._netloc_w_url_suffix_to_identifiers_phosphosite_adaptor(split_path, result)

napistu.ontologies.standardization._reactome_id_species(reactome_id: str) → str

Extract the species code from a Reactome ID.

napistu.ontologies.standardization._split_one_to_identifiers_chebi_adaptor(split_path, result)

napistu.ontologies.standardization._validate_bqb(bqb: str) → None

Validate a BQB code

Parameters:

bqb (str) – The BQB code to validate

Return type:

None

Raises:

TypeError – If the BQB code is not a string
ValueError – If the BQB code does not start with ‘BQB’
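
The two documented checks can be pictured with a minimal, standalone sketch. The names `validate_bqb_sketch` and `is_valid_bqb` are hypothetical, and the body is an assumption based only on the exceptions listed above, not on the library's source.

```python
def validate_bqb_sketch(bqb):
    """Hypothetical stand-in that mirrors the documented checks."""
    # Documented: a BQB code that is not a string raises TypeError
    if not isinstance(bqb, str):
        raise TypeError(f"bqb must be a str, got {type(bqb).__name__}")
    # Documented: the code must start with 'BQB'
    if not bqb.startswith("BQB"):
        raise ValueError(f"bqb must start with 'BQB', got {bqb!r}")


def is_valid_bqb(bqb):
    """Convenience wrapper (assumption): True if validation passes."""
    try:
        validate_bqb_sketch(bqb)
    except (TypeError, ValueError):
        return False
    return True
```

The wrapper is only there to make the behavior easy to probe in batch code; the documented function itself returns None and signals problems through exceptions.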

napistu.ontologies.standardization.check_reactome_identifier_compatibility(reactome_series_a: Series, reactome_series_b: Series) → None

Check Reactome Identifier Compatibility

Determine whether two sets of Reactome identifiers are from the same organismal species.

Parameters:

reactome_series_a (pd.Series) – a Series containing Reactome identifiers
reactome_series_b (pd.Series) – a Series containing Reactome identifiers

Return type:

None
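
As a rough illustration of the idea, the sketch below infers the most common species code in each set and compares the two. It assumes Reactome stable IDs shaped like `R-HSA-109581`, uses plain lists instead of a pd.Series to stay dependency-free, and every function name is hypothetical. How the real function reports a mismatch is not documented here, so the sketch simply returns a boolean.

```python
from collections import Counter


def reactome_species_code(reactome_id):
    """Return the species token of an ID shaped like 'R-HSA-109581'."""
    parts = reactome_id.split("-")
    if len(parts) < 3 or parts[0] != "R":
        raise ValueError(f"{reactome_id!r} does not look like a Reactome stable ID")
    return parts[1]


def primary_species(reactome_ids):
    """Return (most common species code, number of IDs carrying it)."""
    counts = Counter(reactome_species_code(rid) for rid in reactome_ids)
    return counts.most_common(1)[0]


def same_primary_species(ids_a, ids_b):
    """True if both ID sets share the same most common species code."""
    return primary_species(ids_a)[0] == primary_species(ids_b)[0]
```

Comparing the dominant species rather than requiring every ID to match is a design assumption of this sketch; it tolerates a few cross-species identifiers in an otherwise single-species set.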

napistu.ontologies.standardization.create_uri_url(ontology: str, identifier: str, strict: bool = True) → str

Create URI URL

Convert from an identifier and ontology to a URL reference for the identifier

Parameters:

ontology (str) – An ontology for organizing genes, metabolites, etc.
identifier (str) – A systematic identifier belonging to the given ontology.
strict (bool) – If True, raise an error for invalid IDs; otherwise return None.

Returns:

url – A url representing a unique identifier

Return type:

str
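
The effect of `strict` can be sketched with a toy lookup table. Everything in this block is an assumption made for illustration: the table `EXAMPLE_ONTOLOGIES`, the ontology name, the regex, the example.org URL template, and the choice of ValueError are placeholders and do not reflect the patterns or exceptions the library actually uses.

```python
import re

# Hypothetical table: ontology -> (validation regex, URL template).
# Real ontologies have their own ID patterns and resolver URLs.
EXAMPLE_ONTOLOGIES = {
    "example_ontology": (r"^EX[0-9]{4}$", "https://example.org/entry/{identifier}"),
}


def create_url_sketch(ontology, identifier, strict=True):
    """Build a URL for a valid identifier; raise or return None otherwise."""
    entry = EXAMPLE_ONTOLOGIES.get(ontology)
    if entry is not None:
        id_regex, template = entry
        if re.match(id_regex, identifier):
            return template.format(identifier=identifier)
    # Unknown ontology or malformed identifier
    if strict:
        raise ValueError(f"{identifier!r} is not a valid {ontology!r} identifier")
    return None
```

With `strict=False`, a batch job can collect the None results and log them afterwards instead of stopping at the first malformed identifier.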

napistu.ontologies.standardization.ensembl_id_to_url_regex(identifier: str, ontology: str) → tuple[str, str]

Ensembl ID to URL and Regex

Map an ensembl ID to a validation regex and its canonical url on ensembl

Parameters:

identifier (str) – A standard identifier from ensembl genes, transcripts, or proteins
ontology (str) – The standard ontology (ensembl_gene, ensembl_transcript, or ensembl_protein)

Returns:

id_regex : a regex which should match a valid entry in this ontology

url : the id’s url on ensembl

Return type:

tuple[str, str]

napistu.ontologies.standardization.format_uri(uri: str, bqb: str, strict: bool = True) → list[dict] | None

Convert an RDF URI into an identifier list

Parameters:

uri (str) – The RDF URI to convert
bqb (str) – The BQB to add to the identifier
strict (bool) – Whether to raise an error if the URI is not valid

Returns:

The identifier list or None if the URI is not valid

Return type:

Optional[list[dict]]

napistu.ontologies.standardization.format_uri_url(uri: str, strict: bool = True) → dict

Convert a URI into an identifier dictionary

Parameters:

uri (str) – The URI to convert
strict (bool) – Whether to raise an error if the URI is not valid

Returns:

The identifier dictionary

Return type:

dict

Raises:

NotImplementedError – If a parsing procedure has not been implemented for the netloc
TypeError – If the URI is not valid
ValueError – If there is a pathological identifier within ontology-specific parsing
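
The netloc-based dispatch implied by these exceptions might look roughly like the sketch below. It is a simplified illustration, not the library's code: the dispatch table, the helper names, the keys of the returned dictionary, and the choice to return None when `strict` is False are all assumptions.

```python
from urllib.parse import urlsplit


def _parse_identifiers_dot_org(split_path):
    # Handles only the http://identifiers.org/<ontology>/<id> layout
    if len(split_path) < 3:
        raise ValueError(f"unexpected path elements: {split_path!r}")
    return split_path[1].lower(), split_path[2]


# Hypothetical dispatch table keyed on the URL's network location
NETLOC_PARSERS = {"identifiers.org": _parse_identifiers_dot_org}


def format_uri_url_sketch(uri, strict=True):
    """Turn a URI into an identifier dictionary (assumed keys)."""
    if not isinstance(uri, str):
        raise TypeError("uri must be a str")
    split = urlsplit(uri)
    parser = NETLOC_PARSERS.get(split.netloc)
    if parser is None:
        if strict:
            raise NotImplementedError(f"no parser for netloc {split.netloc!r}")
        return None
    ontology, identifier = parser(split.path.split("/"))
    return {"ontology": ontology, "identifier": identifier, "url": uri}
```

Keying the parsers on `netloc` keeps each source's quirks in its own small function, which matches the per-source adaptor helpers listed earlier in this module.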

napistu.ontologies.standardization.format_uri_url_identifiers_dot_org(split_path: list[str])

Parse identifiers.org identifiers

identifiers.org identifiers come in two formats:

1. http://identifiers.org/<ontology>/<id>
2. http://identifiers.org/<ontology>:<id>

The newer format (2) is currently detected by looking for a ":" in the second element of the split path.

Also the ontology is converted to lower case letters.

Parameters:

split_path (list[str]) – split url path

Returns:

ontology, identifier

Return type:

tuple[str, str]
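
The handling of the two formats can be sketched as follows. This is a minimal illustration under stated assumptions: `parse_identifiers_dot_org_sketch` is a hypothetical name, the split path is assumed to come from splitting the URL path on "/" (so element 0 is the empty string before the leading slash), and stripping the ontology prefix from the identifier in format 2 is a guess about the intended result.

```python
def parse_identifiers_dot_org_sketch(split_path):
    """Return (ontology, identifier) from a split identifiers.org path."""
    second = split_path[1]
    if ":" in second:
        # Format 2: <ontology>:<id> packed into a single path element
        ontology, identifier = second.split(":", 1)
    else:
        # Format 1: <ontology>/<id> as separate path elements
        ontology, identifier = second, split_path[2]
    # The ontology is normalized to lower case in both formats
    return ontology.lower(), identifier
```

For example, `"/uniprot/P12345".split("/")` yields `["", "uniprot", "P12345"]`, which takes the format 1 branch, while `"/UniProt:P12345".split("/")` yields `["", "UniProt:P12345"]` and takes the format 2 branch.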

napistu.ontologies.standardization.is_known_unsupported_uri(uri: str) → bool

Check if a URI is known to be unsupported/pathological.

This prevents throwing exceptions for URIs we know we can’t parse, allowing for cleaner logging and batch processing.

Parameters:

uri (str) – The URI to check

Returns:

True if the URI is known to be unsupported

Return type:

bool

napistu.ontologies.standardization.parse_ensembl_id(input_str: str) → tuple[str, str, str]

Parse Ensembl ID

Extract the molecule type and species name from a string containing an ensembl identifier.

Parameters:

input_str (str) – A string containing an ensembl gene, transcript, or protein identifier

Returns:

identifier (str):

The substring matching the full identifier

molecule_type (str):

The ontology the identifier belongs to:

G -> ensembl_gene
T -> ensembl_transcript
P -> ensembl_protein

organismal_species (str):

The species name the identifier belongs to

Return type:

tuple[str, str, str]
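
A regex-based sketch of this kind of parsing is shown below. It assumes the common Ensembl stable ID layout of `ENS`, an optional species prefix, a feature letter, and 11 digits. The species table is a deliberately tiny illustrative subset, and the function name, the error handling, and the exact form of the returned species name are assumptions rather than the library's behavior.

```python
import re

# ENS + optional species prefix + feature letter (G/T/P) + 11 digits
ENSEMBL_PATTERN = re.compile(r"ENS([A-Z]*?)([GTP])\d{11}")

MOLECULE_TYPES = {
    "G": "ensembl_gene",
    "T": "ensembl_transcript",
    "P": "ensembl_protein",
}

# Illustrative subset only; human IDs carry no species prefix
SPECIES_PREFIXES = {"": "Homo sapiens", "MUS": "Mus musculus"}


def parse_ensembl_id_sketch(input_str):
    """Return (identifier, molecule_type, organismal_species)."""
    match = ENSEMBL_PATTERN.search(input_str)
    if match is None:
        raise ValueError(f"no Ensembl identifier found in {input_str!r}")
    prefix, type_letter = match.groups()
    if prefix not in SPECIES_PREFIXES:
        raise ValueError(f"species prefix {prefix!r} is not in the example table")
    return match.group(0), MOLECULE_TYPES[type_letter], SPECIES_PREFIXES[prefix]
```

Using `search` rather than `fullmatch` lets the identifier be pulled out of a longer string, in line with the description of the input as a string containing an identifier.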