napistu.ingestion.hpa

Functions

download_hpa_data(target_uri[, url])

Download protein localization data from the Human Protein Atlas.

load_and_clean_hpa_data(hpa_data_path)

Load and format Human Protein Atlas subcellular localization data.

napistu.ingestion.hpa.download_hpa_data(target_uri: str, url: str = 'https://www.proteinatlas.org/download/tsv/subcellular_location.tsv.zip') → None

Download protein localization data from the Human Protein Atlas.

Parameters:
  • target_uri (str) – The URI where the extracted HPA data should be saved. Must end with .tsv; otherwise a ValueError is raised.

  • url (str, optional) – URL of the zipped Human Protein Atlas subcellular localization TSV to download. Defaults to PROTEINATLAS_SUBCELL_LOC_URL.

Return type:

None

Notes

Downloads the subcellular localization data from the Human Protein Atlas and saves it to the specified target URI. The data is downloaded from the official HPA website as a ZIP file and automatically unzipped to extract the TSV.

Raises:

ValueError – If target_uri does not end with .tsv
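The validate-then-unzip behaviour described above can be sketched with the standard library alone. This is a minimal illustration and not napistu's implementation: the helper name extract_tsv_from_zip is hypothetical, the archive is built in memory instead of being downloaded, and the choice to take the first .tsv member of the archive is an assumption.

```python
import io
import os
import tempfile
import zipfile


def extract_tsv_from_zip(zip_bytes: bytes, target_path: str) -> None:
    """Hypothetical helper: write the first .tsv member of a ZIP to target_path."""
    # Mirror the documented contract: reject targets that do not end with .tsv.
    if not target_path.endswith(".tsv"):
        raise ValueError(f"target path must end with .tsv, got {target_path!r}")

    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: the archive holds one TSV; take the first match.
        tsv_members = [name for name in archive.namelist() if name.endswith(".tsv")]
        if not tsv_members:
            raise ValueError("archive contains no .tsv member")
        data = archive.read(tsv_members[0])

    with open(target_path, "wb") as handle:
        handle.write(data)


# Demo on a toy in-memory archive (no network access needed).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("subcellular_location.tsv", "Gene\tGO id\n")

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "hpa.tsv")
    extract_tsv_from_zip(buffer.getvalue(), target)
    with open(target) as handle:
        header = handle.read()
```

Checking the suffix before any I/O happens means a mistyped target fails fast, before the archive is opened or any file is written.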

napistu.ingestion.hpa.load_and_clean_hpa_data(hpa_data_path: str) → DataFrame

Load and format Human Protein Atlas subcellular localization data.

Parameters:

hpa_data_path (str) – Path to HPA subcellular localization data TSV file

Returns:

DataFrame with genes as rows and GO terms as columns. Each cell is a binary value (0 or 1) indicating whether that gene (row) is found in that compartment (column). Genes with no compartment annotations are filtered out.

Return type:

pd.DataFrame

Notes

This function loads subcellular localization data from the Human Protein Atlas and creates a binary matrix where rows are genes and columns are GO terms, with 1 indicating that a gene is localized to that compartment and 0 indicating it is not.

The function filters out genes that have no compartment annotations and logs information about the number of genes filtered and the final matrix dimensions.

Raises:
  • FileNotFoundError – If the input file does not exist

  • ValueError – If no gene-compartment associations are found in the data
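To make the shape of the returned matrix concrete, the following pandas sketch builds a binary gene-by-compartment table from toy data. It is an illustration under stated assumptions, not the library's code: the function name binary_localization_matrix is hypothetical, the column names "Gene" and "GO id" and the semicolon-separated term format are assumed from the public HPA file layout, and the toy rows are invented.

```python
import io

import pandas as pd

# Invented toy rows; ENSG03 has no annotation and should be dropped.
TOY_TSV = (
    "Gene\tGene name\tGO id\n"
    "ENSG01\tAAA\tNucleoplasm (GO:0005654);Cytosol (GO:0005829)\n"
    "ENSG02\tBBB\tCytosol (GO:0005829)\n"
    "ENSG03\tCCC\t\n"
)


def binary_localization_matrix(tsv_text: str) -> pd.DataFrame:
    """Hypothetical sketch: genes as rows, GO terms as columns, 0/1 cells."""
    table = pd.read_csv(io.StringIO(tsv_text), sep="\t")

    # One row per (gene, GO term) pair; unannotated genes drop out here.
    pairs = (
        table[["Gene", "GO id"]]
        .dropna(subset=["GO id"])
        .assign(go_term=lambda d: d["GO id"].str.split(";"))
        .explode("go_term")
    )
    pairs["go_term"] = pairs["go_term"].str.strip()
    pairs = pairs[pairs["go_term"] != ""]

    if pairs.empty:
        raise ValueError("No gene-compartment associations found")

    # crosstab counts occurrences; clip turns the counts into 0/1 indicators.
    return pd.crosstab(pairs["Gene"], pairs["go_term"]).clip(upper=1).astype(int)


matrix = binary_localization_matrix(TOY_TSV)
```

Here matrix has one row each for ENSG01 and ENSG02 and one column per GO term; ENSG03 is absent because it carries no compartment annotation.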