napistu.ingestion.gtex

Functions

download_gtex_rnaseq(target_uri[, url])

Download GTEx RNA-seq expression data.

load_and_clean_gtex_data(gtex_data_path)

Load and format GTEx tissue-specific expression data.

napistu.ingestion.gtex.download_gtex_rnaseq(target_uri: str, url: str = 'https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz') → None

Download GTEx RNA-seq expression data.

Parameters:
  • target_uri (str) – The URI where the GTEx data should be saved

  • url (str, optional) – URL to download the GTEx RNA-seq expression data from. Defaults to GTEX_RNASEQ_EXPRESSION_URL.

Return type:

None

Notes

Downloads GTEx RNA-seq expression data (median TPM per gene per tissue) from the specified URL and saves it to the target URI. The default URL points to the GTEx Analysis V8 release (dbGaP accession phs000424.v8.p2).
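The download step can be pictured with a minimal standard-library sketch. This is not the napistu implementation: fetch_to_path is a hypothetical helper, and it assumes the target is a local filesystem path, whereas target_uri may accept other kinds of URI.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_path(url: str, target_path: str) -> None:
    """Stream the resource at `url` to a local file (hypothetical helper)."""
    target = Path(target_path)
    # Create the destination directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so the file never has to fit in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (requires network access), using the default URL from the
# signature above:
# fetch_to_path(
#     "https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/"
#     "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz",
#     "data/gtex_median_tpm.gct.gz",
# )
```

The file is saved as downloaded, still gzip-compressed; the local path in the commented call is only an illustration.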

napistu.ingestion.gtex.load_and_clean_gtex_data(gtex_data_path: str) → DataFrame

Load and format GTEx tissue-specific expression data.

This function loads tissue-specific expression data from GTEx (median value per gene per tissue).

Parameters:

gtex_data_path (str) – Path to GTEx tissue-specific expression data (medians)

Returns:

DataFrame containing all the information from the GTEx file, with standardized column names:

  • ensembl_gene_id: Ensembl gene ID without version number

  • ensembl_geneTranscript_id: Original GTEx hybrid gene/transcript ID

  • Description: Gene description/symbol

  • Multiple tissue columns with median TPM values

Return type:

pd.DataFrame

Notes

The function:

  1. Skips the first 2 lines of the GTEx file (header info)

  2. Creates clean Ensembl gene IDs by removing version numbers

  3. Renames columns for clarity

  4. Reorders columns to put ID and description columns first

Raises:

FileNotFoundError – If the input file does not exist
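The four steps listed in the Notes can be approximated with pandas as follows. This is a sketch, not the napistu source: load_gct_sketch is a hypothetical name, and it assumes the file is a tab-separated GCT table whose ID column is called Name, as in the standard GCT layout.

```python
import pandas as pd


def load_gct_sketch(path_or_buffer) -> pd.DataFrame:
    """Approximate the documented cleaning steps (hypothetical helper)."""
    # 1. Skip the two GCT header lines (format version and table dimensions).
    df = pd.read_csv(path_or_buffer, sep="\t", skiprows=2)
    # 3. Rename the ID column (assumed to be "Name") for clarity.
    df = df.rename(columns={"Name": "ensembl_geneTranscript_id"})
    # 2. Drop the version suffix: "ENSG00000223972.5" -> "ENSG00000223972".
    df["ensembl_gene_id"] = (
        df["ensembl_geneTranscript_id"].str.split(".").str[0]
    )
    # 4. Put the ID and description columns first, tissue columns after.
    id_cols = ["ensembl_gene_id", "ensembl_geneTranscript_id", "Description"]
    tissue_cols = [c for c in df.columns if c not in id_cols]
    return df[id_cols + tissue_cols]
```

pandas.read_csv infers gzip compression from a .gz file name, so a downloaded .gct.gz path can be passed directly, and a missing path raises FileNotFoundError, in line with the Raises entry above.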