napistu.ontologies.pubchem
Functions
map_pubchem_ids | Map PubChem Compound Identifiers (CIDs) to compound names and SMILES strings.
Exceptions
PubChemConnectivityError | Raised when PubChem API is unreachable due to network/connectivity issues.
- exception napistu.ontologies.pubchem.PubChemConnectivityError
Bases: Exception
Raised when PubChem API is unreachable due to network/connectivity issues.
- napistu.ontologies.pubchem._fetch_batch(batch: List[str], max_retries: int, delay: float) -> Tuple[Dict[str, Dict[str, str]], bool]
Fetch data for a single batch of CIDs with retry logic.
- napistu.ontologies.pubchem._fetch_individual_cids(cids: List[str], max_retries: int, delay: float) -> Tuple[Dict[str, Dict[str, str]], bool]
Fetch CIDs individually when batch fails due to mixed valid/invalid CIDs.
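The batch-then-individual fallback described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the `fetch` callable, the use of `ValueError` to signal a failed request, and the meaning of the returned boolean (assumed here to be "every CID was fetched successfully") are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, Dict[str, str]]


def fetch_with_individual_fallback(
    batch: List[str],
    fetch: Callable[[List[str]], Result],
) -> Tuple[Result, bool]:
    """Try the whole batch first; if it fails, retry one CID at a time."""
    try:
        return fetch(batch), True
    except ValueError:
        # Assumption: a mixed valid/invalid batch fails as a whole.
        pass

    results: Result = {}
    all_ok = True
    for cid in batch:
        try:
            results.update(fetch([cid]))
        except ValueError:
            # Fallback record: name defaults to the CID, SMILES is empty.
            results[cid] = {"name": cid, "smiles": ""}
            all_ok = False
    return results, all_ok


def fake_fetch(cids: List[str]) -> Result:
    """Stand-in for a network call; rejects any non-numeric CID."""
    if any(not cid.isdigit() for cid in cids):
        raise ValueError("invalid CID in request")
    return {cid: {"name": f"compound-{cid}", "smiles": ""} for cid in cids}


results, ok = fetch_with_individual_fallback(["2244", "not-a-cid"], fake_fetch)
```

Injecting the fetch function keeps the sketch runnable without network access; the valid CID is still resolved even though the combined request failed.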
- napistu.ontologies.pubchem._is_immediate_failure(e)
Return True if this error should not be retried.
- napistu.ontologies.pubchem._process_batch_response(data: Dict[str, Any], batch: List[str]) -> Dict[str, Dict[str, str]]
Process PubChem API response for a batch of compound identifiers.
Extracts compound names and SMILES strings from the PubChem REST API response, handling both found and missing compounds in the batch.
- Parameters:
data (Dict[str, Any]) – JSON response from PubChem REST API containing property table with compound information.
batch (List[str]) – List of PubChem CIDs that were requested in this batch.
- Returns:
Dictionary mapping each CID to a nested dictionary containing:
- ‘name’: Compound name (prefers Title over IUPACName, falls back to CID)
- ‘smiles’: Isomeric SMILES string (empty string if not available)
Missing CIDs are included with name=CID and empty SMILES.
- Return type:
Dict[str, Dict[str, str]]
Notes
Handles API inconsistencies between IsomericSMILES and SMILES fields
Compounds not found in API response are included with default values
Prefers compound Title over IUPACName for better readability
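The behavior listed in these notes could look roughly like the sketch below. The function name `process_batch_response_sketch` is hypothetical, and the response layout (a `PropertyTable` object holding a `Properties` list with `CID`, `Title`, `IUPACName`, and `IsomericSMILES` or `SMILES` keys) is an assumption about the PubChem property-table JSON, not something this page specifies.

```python
from typing import Any, Dict, List


def process_batch_response_sketch(
    data: Dict[str, Any], batch: List[str]
) -> Dict[str, Dict[str, str]]:
    """Hypothetical re-creation of the documented name/SMILES extraction."""
    # Assumed layout: {"PropertyTable": {"Properties": [{...}, ...]}}
    properties = data.get("PropertyTable", {}).get("Properties", [])

    results: Dict[str, Dict[str, str]] = {}
    for prop in properties:
        cid = str(prop.get("CID", ""))
        if not cid:
            continue
        # Prefer Title, then IUPACName, then fall back to the CID itself.
        name = prop.get("Title") or prop.get("IUPACName") or cid
        # The structure may appear under either field name.
        smiles = prop.get("IsomericSMILES") or prop.get("SMILES") or ""
        results[cid] = {"name": name, "smiles": smiles}

    # Requested CIDs missing from the response get default values.
    for cid in batch:
        results.setdefault(cid, {"name": cid, "smiles": ""})
    return results


example_response = {
    "PropertyTable": {
        "Properties": [
            {"CID": 2244, "Title": "Aspirin", "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O"},
        ]
    }
}
parsed = process_batch_response_sketch(example_response, ["2244", "999999999"])
```

Here the second CID is absent from the example response, so it comes back with its own identifier as the name and an empty SMILES string.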
- napistu.ontologies.pubchem._validate_params(batch_size: int, max_retries: int, delay: float) -> Tuple[int, int, float]
Validate and correct input parameters.
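One plausible reading of "validate and correct" is clamping out-of-range values rather than raising. The sketch below assumes exactly that; only the 1-500 range for batch_size comes from this page (see map_pubchem_ids), while the floor of 0 for max_retries and the 0.2 second floor for delay (derived from a 5 requests/second limit) are illustrative assumptions.

```python
from typing import Tuple


def validate_params_sketch(
    batch_size: int, max_retries: int, delay: float
) -> Tuple[int, int, float]:
    """Hypothetical clamping of parameters into safe ranges."""
    # Documented range for batch_size is 1-500.
    batch_size = min(max(int(batch_size), 1), 500)
    # Assumption: negative retry counts are treated as zero retries.
    max_retries = max(int(max_retries), 0)
    # Assumption: never go faster than 5 requests/second (0.2 s apart).
    delay = max(float(delay), 0.2)
    return batch_size, max_retries, delay


corrected = validate_params_sketch(batch_size=1000, max_retries=-1, delay=0.0)
```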
- napistu.ontologies.pubchem.map_pubchem_ids(pubchem_cids: List[str], batch_size: int = 100, max_retries: int = 3, delay: float = 0.25, verbose: bool = True) -> Dict[str, Dict[str, str]]
Map PubChem Compound Identifiers (CIDs) to compound names and SMILES strings.
Efficiently processes large datasets using batched API requests with retry logic. Returns both compound names and Isomeric SMILES for each CID.
- Parameters:
pubchem_cids (List[str]) – List of PubChem CIDs as strings (e.g., [“2244”, “5362065”]).
batch_size (int, optional) – CIDs per API request. Default 100. Range: 1-500.
max_retries (int, optional) – Retry attempts per failed batch. Default 3.
delay (float, optional) – Seconds between requests. Default 0.25 (respects 5 req/sec limit).
verbose (bool, optional) – Enable detailed logging. Default True.
- Returns:
Maps CID to {“name”: str, “smiles”: str, “mapped”: bool}.
mapped=False indicates fallback data (API failure or CID not in database).
- Return type:
Dict[str, Dict[str, Any]]
Examples
>>> result = map_pubchem_ids(["2244", "5362065"])
>>> print(result["2244"])
{"name": "aspirin", "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"}
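Because the returned records carry a mapped flag, callers may want to separate real matches from fallback entries. The sketch below works on a hand-written dictionary that mimics the documented return shape, so it runs without network access; `split_by_mapped` is a hypothetical helper and the sample records are illustrative, not actual API output.

```python
from typing import Any, Dict, List, Tuple

Records = Dict[str, Dict[str, Any]]


def split_by_mapped(result: Records) -> Tuple[Records, List[str]]:
    """Separate successfully mapped records from fallback entries."""
    mapped = {cid: rec for cid, rec in result.items() if rec.get("mapped", False)}
    unmapped = [cid for cid, rec in result.items() if not rec.get("mapped", False)]
    return mapped, unmapped


# Illustrative data in the documented shape: {"name", "smiles", "mapped"}.
sample_result: Records = {
    "2244": {"name": "aspirin", "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O", "mapped": True},
    "999999999": {"name": "999999999", "smiles": "", "mapped": False},
}

mapped_records, unmapped_cids = split_by_mapped(sample_result)
```

In real use, `sample_result` would be replaced by the return value of map_pubchem_ids, and `unmapped_cids` could be retried later or reported.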
Notes
Rate limit: Max 5 requests/second (PubChem policy)
Timeout: 30 seconds per request
Uses Isomeric SMILES (includes stereochemistry)
Some compounds may lack SMILES data
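To see how batch_size and delay interact with the rate limit noted above, the sketch below splits a CID list into batches and estimates the minimum time spent waiting between requests. `iter_batches` and `estimated_wait_seconds` are hypothetical helpers, and the estimate assumes one request per batch with no retries.

```python
from typing import Iterator, List

MAX_REQUESTS_PER_SECOND = 5
MIN_DELAY = 1.0 / MAX_REQUESTS_PER_SECOND  # 0.2 s between requests


def iter_batches(cids: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size CIDs."""
    for start in range(0, len(cids), batch_size):
        yield cids[start:start + batch_size]


def estimated_wait_seconds(n_batches: int, delay: float = 0.25) -> float:
    """Total pause time, assuming one delay between consecutive requests."""
    return max(n_batches - 1, 0) * delay


cids = [str(i) for i in range(1, 251)]  # 250 placeholder CIDs
batches = list(iter_batches(cids, batch_size=100))
wait = estimated_wait_seconds(len(batches), delay=0.25)
```

With 250 CIDs and the default batch size of 100, this produces three batches, and the default delay of 0.25 seconds stays above the 0.2 second minimum implied by a 5 requests/second limit.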