napistu.context.discretize
Functions
discretize_expression_data | Discretize the GTEx data |
generate_simple_test_data | Generate a simple test dataset for basic validation. |
zfpkm | Transform entire DataFrame using zFPKM. |
Classes
PeakIndices | Container for peak indices classified by importance. |
PeakSelector | Class to handle peak detection and classification in density data. |
- class napistu.context.discretize.PeakIndices(major: float, minor: float | None, other: np.ndarray | None)
Bases: NamedTuple
Container for peak indices classified by importance.
- Parameters:
major (float) – Position of the rightmost/highest peak
minor (Optional[float]) – Position of the second most significant peak, if it exists
other (Optional[np.ndarray]) – Positions of any remaining peaks
- classmethod _make(iterable)
Make a new PeakIndices object from a sequence or iterable
- _asdict()
Return a new dict which maps field names to their values.
- _replace(**kwds)
Return a new PeakIndices object replacing specified fields with new values
- _field_defaults = {}
- _fields = ('major', 'minor', 'other')
- major: float
Alias for field number 0
- minor: float | None
Alias for field number 1
- other: ndarray | None
Alias for field number 2
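Because PeakIndices is a NamedTuple, the helper methods listed above behave like those of any other named tuple. The snippet below is a minimal sketch that uses a hypothetical stand-in class, PeakIndicesSketch, so it runs without napistu installed; the stand-in types `other` as a plain sequence, whereas the real class documents np.ndarray.

```python
from typing import NamedTuple, Optional, Sequence


class PeakIndicesSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented PeakIndices fields."""

    major: float
    minor: Optional[float]
    other: Optional[Sequence[float]]


# Build an instance by keyword; positions are illustrative values only.
peaks = PeakIndicesSketch(major=3.2, minor=-1.5, other=None)

# Field names, in positional order.
print(peaks._fields)  # ('major', 'minor', 'other')

# Convert to a dict keyed by field name.
as_dict = peaks._asdict()

# Named tuples are immutable; _replace returns a modified copy.
without_minor = peaks._replace(minor=None)

# _make builds an instance from any iterable of the right length.
rebuilt = PeakIndicesSketch._make([3.2, -1.5, None])
```

Fields can also be read by position (`peaks[0]`) or unpacked (`major, minor, other = peaks`).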
- class napistu.context.discretize.PeakSelector(min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)
Bases: object
Class to handle peak detection and classification in density data.
- Parameters:
min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02
min_peakdistance (int, optional) – Minimum distance between peaks, by default 1
prominence (float, optional) – Minimum prominence for peak detection, by default 0.05
verbose (bool, optional) – Whether to log detailed information, by default False
- __init__(min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)
- find_peaks(density_y: ndarray, x_eval: ndarray) → PeakIndices
Find and classify peaks in density data.
- Parameters:
density_y (np.ndarray) – Y-values of the density estimation
x_eval (np.ndarray) – X-values corresponding to density_y
- Returns:
Named tuple containing classified peak positions
- Return type:
PeakIndices
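The page does not show how PeakSelector detects or ranks peaks. The sketch below is one plausible reading, not the library's implementation: it assumes SciPy's `scipy.signal.find_peaks` with the three documented thresholds, treats the rightmost peak as major, the tallest remaining peak as minor, and returns the rest as other. The function name classify_peaks_sketch is hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks


def classify_peaks_sketch(density_y, x_eval, min_peakheight=0.02,
                          min_peakdistance=1, prominence=0.05):
    """Return (major, minor, other) peak positions under the stated assumptions."""
    idx, props = find_peaks(
        density_y,
        height=min_peakheight,
        distance=min_peakdistance,
        prominence=prominence,
    )
    if idx.size == 0:
        raise ValueError("No peaks passed the height/prominence filters")

    positions = x_eval[idx]
    heights = props["peak_heights"]

    # Assumption: the major peak is the rightmost one on the x axis.
    major_i = int(np.argmax(positions))
    major = float(positions[major_i])

    # Indices (into the detected peaks) of everything except the major peak.
    rest = np.delete(np.arange(idx.size), major_i)
    if rest.size == 0:
        return major, None, None

    # Assumption: the minor peak is the tallest of the remaining peaks.
    minor_i = rest[np.argmax(heights[rest])]
    minor = float(positions[minor_i])

    other_i = rest[rest != minor_i]
    other = positions[other_i] if other_i.size else None
    return major, minor, other


# Synthetic bimodal curve with modes near -3 and +3.
x_eval = np.linspace(-10, 10, 2001)
density_y = 0.3 * np.exp(-0.5 * (x_eval + 3) ** 2) + 0.2 * np.exp(-0.5 * (x_eval - 3) ** 2)
major, minor, other = classify_peaks_sketch(density_y, x_eval)
```

In this synthetic case the rightmost mode is reported as major even though the left mode is taller, which matches the zFPKM convention of treating the rightmost peak as active expression.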
- napistu.context.discretize._remove_nan_inf_rows(fpkm_df: DataFrame) → DataFrame
Remove rows containing all NaN or infinite values.
- Parameters:
fpkm_df (pd.DataFrame) – Input DataFrame with FPKM values
- Returns:
DataFrame with rows containing all NaN or infinite values removed
- Return type:
pd.DataFrame
Notes
Logs a warning if any rows are filtered out.
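The filtering rule can be read as "drop a row only when every value in it is NaN or infinite". The sketch below implements that reading with pandas and NumPy; both the interpretation and the function name drop_all_invalid_rows are assumptions, and the warning that the real helper logs is omitted.

```python
import numpy as np
import pandas as pd


def drop_all_invalid_rows(fpkm_df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that have at least one finite value (assumed behaviour)."""
    # np.isfinite is False for NaN, +inf and -inf.
    finite = np.isfinite(fpkm_df.to_numpy(dtype=float))
    keep = finite.any(axis=1)
    return fpkm_df.loc[keep]


fpkm_df = pd.DataFrame(
    {"sample_1": [1.0, np.nan, 5.0], "sample_2": [np.nan, np.inf, 2.0]},
    index=["gene_a", "gene_b", "gene_c"],
)
filtered = drop_all_invalid_rows(fpkm_df)
```

Here gene_a survives because one of its values is finite, while gene_b is removed because both of its values are invalid.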
- napistu.context.discretize._zfpkm_calc(fpkm: ndarray | Series, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False) → ndarray
Perform zFPKM transform on a single sample of FPKM data.
The zFPKM algorithm fits a kernel density estimate to the log2(FPKM) distribution of ALL GENES within a single sample. This requires:
- Input: A vector of FPKM values for all genes in ONE sample
- Many genes (typically 1000+) for meaningful density estimation
The algorithm identifies the rightmost peak as “active” gene expression.
- Parameters:
fpkm (Union[np.ndarray, pd.Series]) – Raw FPKM values for all genes in ONE sample (NOT log2 transformed)
min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02
min_peakdistance (int, optional) – Minimum distance between peaks, by default 1
prominence (float, optional) – Minimum prominence for peak detection, by default 0.05
verbose (bool, optional) – Whether to log debug information, by default False
- Returns:
Array of zFPKM values
- Return type:
np.ndarray
- Raises:
ValueError – If no valid FPKM values are found after filtering
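The single-sample transform can be sketched as follows. This is not napistu's code: it assumes the commonly described zFPKM recipe, in which the rightmost density peak gives the mean of the active-gene Gaussian and the standard deviation is estimated from the values above that peak. The name zfpkm_sketch, the 512-point grid, and the use of `scipy.stats.gaussian_kde` are illustrative choices.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde


def zfpkm_sketch(fpkm, min_peakheight=0.02, min_peakdistance=1, prominence=0.05):
    """zFPKM-style transform of one sample of raw (not log2) FPKM values."""
    fpkm = np.asarray(fpkm, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        log2_fpkm = np.log2(fpkm)  # zeros become -inf, negatives become NaN

    valid = np.isfinite(log2_fpkm)
    if not valid.any():
        raise ValueError("No valid FPKM values found after filtering")
    values = log2_fpkm[valid]

    # Kernel density estimate of log2(FPKM) across all genes in the sample.
    x_eval = np.linspace(values.min(), values.max(), 512)
    density_y = gaussian_kde(values)(x_eval)

    idx, _ = find_peaks(density_y, height=min_peakheight,
                        distance=min_peakdistance, prominence=prominence)
    if idx.size == 0:
        idx = np.array([np.argmax(density_y)])  # fall back to the global mode

    # Rightmost peak = centre of the "active" expression distribution.
    mu = x_eval[idx].max()

    # Half-normal estimate of the standard deviation from values above mu.
    upper = values[values > mu]
    if upper.size == 0:
        raise ValueError("No values lie above the selected peak")
    sd = (upper.mean() - mu) * np.sqrt(np.pi / 2)

    z = np.full(fpkm.shape, np.nan)
    z[valid] = (values - mu) / sd
    return z


# Simulated sample: one zero, 1500 inactive genes, 3000 active genes.
rng = np.random.default_rng(0)
log2_inactive = rng.normal(-4.0, 1.0, 1500)
log2_active = rng.normal(4.0, 1.5, 3000)
fpkm = np.concatenate([[0.0], 2.0 ** log2_inactive, 2.0 ** log2_active])
z = zfpkm_sketch(fpkm)
```

In this sketch a gene with FPKM of 0 has no finite log2 value and is returned as NaN; how the library treats such genes is not stated on this page.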
- napistu.context.discretize.discretize_expression_data(expression_data: DataFrame, metadata_attributes: list[str] = None, min_row_sum: int = 50, zfpm_threshold: float = -3, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)
Discretize the GTEx data
- Parameters:
expression_data (pandas DataFrame) – The expression data to discretize
metadata_attributes (list[str], optional) – Non-numeric and other metadata attributes which should be included in the output but ignored when discretizing expression data
min_row_sum (int, optional) – The minimum row sum to use for filtering constitutively unexpressed genes
zfpm_threshold (float, optional) – The zFPKM threshold to use for discretization. Genes with zFPKM values below this threshold are considered unexpressed (0) in that sample/condition.
min_peakheight (float, optional) – The minimum peak height to use for peak detection
min_peakdistance (int, optional) – The minimum peak distance to use for peak detection
prominence (float, optional) – The prominence to use for peak detection
verbose (bool, optional) – Whether to print verbose output
- Returns:
A tuple of two pandas DataFrames. The first DataFrame contains the zFPKM-transformed expression data with the metadata attributes merged on the left. The second DataFrame contains the expression data with binary values (0 for unexpressed, 1 for expressed) merged on the left.
- Return type:
tuple of pandas DataFrames
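The thresholding step that produces the binary DataFrame can be illustrated with plain pandas. The sketch assumes that values equal to the threshold count as expressed, since the parameter description only says that values below it are unexpressed; the data are made up for illustration.

```python
import pandas as pd

# Illustrative zFPKM values: rows are genes, columns are samples.
zfpkm_df = pd.DataFrame(
    {"sample_1": [-4.2, -1.0, 0.5], "sample_2": [-3.0, -3.5, 2.1]},
    index=["gene_a", "gene_b", "gene_c"],
)

zfpm_threshold = -3  # documented default

# 1 where zFPKM is at or above the threshold, 0 where it is below.
is_expressed = (zfpkm_df >= zfpm_threshold).astype(int)
```

With these values gene_a is unexpressed in sample_1 but expressed in sample_2, which shows that the call is made per gene and per sample.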
- napistu.context.discretize.generate_simple_test_data(n_genes: int = 200, n_samples: int = 100) → DataFrame
Generate a simple test dataset for basic validation.
- Parameters:
n_genes (int, optional) – Number of genes to generate, by default 200
n_samples (int, optional) – Number of samples to generate, by default 100
- Returns:
DataFrame with simulated FPKM values. Rows = genes, Columns = samples
- Return type:
pd.DataFrame
- napistu.context.discretize.zfpkm(fpkm_df: DataFrame, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False) → DataFrame
Transform entire DataFrame using zFPKM.
- Parameters:
fpkm_df (pd.DataFrame) – DataFrame containing raw FPKM values. Rows = genes/transcripts, Columns = samples
min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02
min_peakdistance (int, optional) – Minimum distance between peaks, by default 1
prominence (float, optional) – Minimum prominence for peak detection, by default 0.05
verbose (bool, optional) – Whether to log detailed information, by default False
- Returns:
DataFrame with zFPKM transformed values
- Return type:
pd.DataFrame