napistu.context.discretize

Functions

`discretize_expression_data`(expression_data)	Discretize expression data (such as GTEx) into zFPKM values and binary expressed/unexpressed calls.
`generate_simple_test_data`([n_genes, n_samples])	Generate a simple test dataset for basic validation.
`zfpkm`(fpkm_df[, min_peakheight, ...])	Transform entire DataFrame using zFPKM.

Classes

`PeakIndices`(major, minor, other)	Container for peak indices classified by importance.
`PeakSelector`([min_peakheight, ...])	Class to handle peak detection and classification in density data.

class napistu.context.discretize.PeakIndices(major: float, minor: float | None, other: np.ndarray | None)

Bases: NamedTuple

Container for peak indices classified by importance.

Parameters:

major (float) – Position of the rightmost/highest peak
minor (Optional[float]) – Position of the second most significant peak, if it exists
other (Optional[np.ndarray]) – Positions of any remaining peaks

classmethod _make(iterable): Make a new PeakIndices object from a sequence or iterable

_asdict(): Return a new dict which maps field names to their values.

_replace(**kwds): Return a new PeakIndices object replacing specified fields with new values

_field_defaults = {}

_fields = ('major', 'minor', 'other')

major: float: Alias for field number 0

minor: float | None: Alias for field number 1

other: ndarray | None: Alias for field number 2
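Because PeakIndices is a NamedTuple, its fields can be read by name or unpacked by position, and the helper methods listed above behave as they do for any named tuple. The following is a minimal sketch using a stand-in class with the same three fields rather than importing from napistu. The class name `PeakIndicesSketch` and the peak values are made up for illustration, and `other` holds a plain tuple here to stay dependency-free (the documented type is np.ndarray).

```python
from typing import NamedTuple, Optional, Tuple


class PeakIndicesSketch(NamedTuple):
    """Stand-in with the documented field names; not the napistu class."""

    major: float
    minor: Optional[float]
    other: Optional[Tuple[float, ...]]


# Hypothetical peak positions on a log2(FPKM) axis.
peaks = PeakIndicesSketch(major=3.2, minor=-4.1, other=None)

# Fields can be read by name or unpacked by position (major, minor, other).
major, minor, other = peaks

# _asdict() maps field names to values; _replace() returns a modified copy
# and leaves the original tuple unchanged.
as_dict = peaks._asdict()
no_minor = peaks._replace(minor=None)
```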

class napistu.context.discretize.PeakSelector(min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)

Bases: object

Class to handle peak detection and classification in density data.

Parameters:

min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02
min_peakdistance (int, optional) – Minimum distance between peaks, by default 1
prominence (float, optional) – Minimum prominence for peak detection, by default 0.05
verbose (bool, optional) – Whether to log detailed information, by default False

__init__(min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)

find_peaks(density_y: ndarray, x_eval: ndarray) → PeakIndices

Find and classify peaks in density data.

Parameters:

density_y (np.ndarray) – Y-values of the density estimation
x_eval (np.ndarray) – X-values corresponding to density_y

Returns:

Named tuple containing classified peak positions

Return type:

PeakIndices
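To make the roles of the height, distance, and prominence settings concrete, here is a rough sketch of peak detection on a synthetic bimodal density. It assumes the three parameters map onto the `height`, `distance`, and `prominence` arguments of `scipy.signal.find_peaks`, and that the rightmost peak is treated as major and the most prominent remaining peak as minor. Both are assumptions for illustration, not confirmed details of PeakSelector.

```python
import numpy as np
from scipy.signal import find_peaks


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution, used to build a toy curve."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


# Synthetic bimodal density: a low-expression mode near -4, a high one near 3.
x_eval = np.linspace(-10.0, 10.0, 401)
density_y = 0.4 * gaussian_pdf(x_eval, -4.0, 1.0) + 0.6 * gaussian_pdf(x_eval, 3.0, 1.5)

# Detect local maxima that clear the height and prominence thresholds.
peak_idx, properties = find_peaks(
    density_y, height=0.02, distance=1, prominence=0.05
)
peak_x = x_eval[peak_idx]

# Assumed classification rule: rightmost peak is "major"; among the rest,
# the most prominent one is "minor".
order = np.argsort(peak_x)
major = float(peak_x[order[-1]])
rest = order[:-1]
if rest.size:
    best = rest[np.argmax(properties["prominences"][rest])]
    minor = float(peak_x[best])
else:
    minor = None
```

Raising `prominence` above the height difference between a peak and its neighbouring valley would drop that peak, which is why the threshold matters for samples with a weak low-expression mode.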

napistu.context.discretize._remove_nan_inf_rows(fpkm_df: DataFrame) → DataFrame

Remove rows containing all NaN or infinite values.

Parameters:

fpkm_df (pd.DataFrame) – Input DataFrame with FPKM values

Returns:

DataFrame with rows containing all NaN or infinite values removed

Return type:

pd.DataFrame

Notes

Logs a warning if any rows are filtered out.
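A minimal sketch of this kind of row filter is shown below. It reads the description as "drop rows in which no value is finite", which is an interpretation rather than something the source states outright. The toy DataFrame and the logger name are hypothetical.

```python
import logging

import numpy as np
import pandas as pd

logger = logging.getLogger("discretize_sketch")  # hypothetical logger name

# Toy FPKM table: g2 is all NaN, g3 is all infinite, g4 is partly missing.
fpkm_df = pd.DataFrame(
    {
        "sample_1": [1.0, np.nan, np.inf, 2.0],
        "sample_2": [0.5, np.nan, -np.inf, np.nan],
    },
    index=["g1", "g2", "g3", "g4"],
)

# Keep a row if at least one of its values is finite (not NaN and not +/-inf).
has_finite = np.isfinite(fpkm_df.to_numpy(dtype=float)).any(axis=1)
cleaned = fpkm_df.loc[has_finite]

# Mirror the documented behaviour of warning when rows are filtered out.
n_dropped = int((~has_finite).sum())
if n_dropped > 0:
    logger.warning("Dropped %d rows with no finite values", n_dropped)
```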

napistu.context.discretize._zfpkm_calc(fpkm: ndarray | Series, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False) → ndarray

Perform zFPKM transform on a single sample of FPKM data.

The zFPKM algorithm fits a kernel density estimate to the log2(FPKM) distribution of ALL GENES within a single sample. This requires:

- Input: A vector of FPKM values for all genes in ONE sample
- Many genes (typically 1000+) for meaningful density estimation

The algorithm identifies the rightmost peak as “active” gene expression.

Parameters:

fpkm (Union[np.ndarray, pd.Series]) – Raw FPKM values for all genes in ONE sample (NOT log2 transformed)
min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02
min_peakdistance (int, optional) – Minimum distance between peaks, by default 1
prominence (float, optional) – Minimum prominence for peak detection, by default 0.05
verbose (bool, optional) – Whether to log debug information, by default False

Returns:

Array of zFPKM values

Return type:

np.ndarray

Raises:

ValueError – If no valid FPKM values are found after filtering
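The transform can be sketched roughly as follows. This follows the commonly described zFPKM recipe (log2 transform, kernel density estimate, rightmost peak as the mean, standard deviation estimated from values above that mean) and is not necessarily how `_zfpkm_calc` is implemented. The function name `zfpkm_sketch`, the choice to treat non-positive and non-finite values as invalid, the 512-point evaluation grid, and the fallback to the global maximum when no peak passes the thresholds are all assumptions.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde


def zfpkm_sketch(fpkm, min_peakheight=0.02, min_peakdistance=1, prominence=0.05):
    """Hypothetical single-sample zFPKM transform; returns NaN for invalid inputs."""
    fpkm = np.asarray(fpkm, dtype=float)
    valid = np.isfinite(fpkm) & (fpkm > 0)  # assumed filtering rule
    if not valid.any():
        raise ValueError("No valid FPKM values found after filtering")

    log2_fpkm = np.log2(fpkm[valid])

    # Kernel density estimate of log2(FPKM) across all genes in the sample.
    x_eval = np.linspace(log2_fpkm.min(), log2_fpkm.max(), 512)
    density_y = gaussian_kde(log2_fpkm)(x_eval)

    peak_idx, _ = find_peaks(
        density_y,
        height=min_peakheight,
        distance=min_peakdistance,
        prominence=prominence,
    )
    if peak_idx.size == 0:
        peak_idx = np.array([np.argmax(density_y)])  # assumed fallback

    # The rightmost peak is taken as the centre of the "active" distribution.
    mu = x_eval[peak_idx].max()

    # Estimate the spread from the half of the distribution above mu.
    upper_mean = log2_fpkm[log2_fpkm > mu].mean()
    sd = (upper_mean - mu) * np.sqrt(np.pi / 2.0)

    z = np.full(fpkm.shape, np.nan)
    z[valid] = (log2_fpkm - mu) / sd
    return z


# Simulated sample: 400 inactive genes and 1600 active genes (made-up values).
rng = np.random.default_rng(0)
log2_values = np.concatenate([rng.normal(-3.0, 1.0, 400), rng.normal(4.0, 1.5, 1600)])
z_values = zfpkm_sketch(2.0 ** log2_values)
```

In this simulation the active genes end up centred near a zFPKM of 0, while the inactive genes fall well below it.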

napistu.context.discretize.discretize_expression_data(expression_data: DataFrame, metadata_attributes: list[str] = None, min_row_sum: int = 50, zfpm_threshold: float = -3, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)

Discretize expression data (such as GTEx) into zFPKM values and binary expressed/unexpressed calls.

Parameters:

expression_data (pandas DataFrame) – The expression data to discretize
metadata_attributes (list[str], optional) – Non-numeric and other metadata attributes which should be included in the output but ignored when discretizing expression data
min_row_sum (int, optional) – The minimum row sum to use for filtering constitutively unexpressed genes
zfpm_threshold (float, optional) – The zFPKM threshold to use for discretization. Genes with zFPKM values below this threshold are treated as unexpressed (0) in that sample/condition.
min_peakheight (float, optional) – The minimum peak height to use for peak detection
min_peakdistance (int, optional) – The minimum peak distance to use for peak detection
prominence (float, optional) – The prominence to use for peak detection
verbose (bool, optional) – Whether to print verbose output

Returns:

A tuple of two pandas DataFrames. The first DataFrame contains the zFPKM-transformed expression data with the metadata attributes merged on the left. The second DataFrame contains the expression data with binary values (0 for unexpressed, 1 for expressed) merged on the left.

Return type:

tuple of pandas DataFrames
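The filtering, thresholding, and metadata steps described above can be illustrated with a small sketch. It does not call napistu. The per-gene zFPKM values are invented placeholders (a real transform needs many genes per sample), and whether the row-sum filter and threshold comparisons are inclusive is an assumption.

```python
import pandas as pd

# Hypothetical input: one metadata column plus two numeric sample columns.
expression_data = pd.DataFrame(
    {
        "gene_symbol": ["A", "B", "C"],
        "liver": [120.0, 3.0, 40.0],
        "lung": [80.0, 1.0, 35.0],
    }
)
metadata_attributes = ["gene_symbol"]
min_row_sum = 50
zfpm_threshold = -3.0

# Step 1: set metadata aside and drop genes with very low total expression.
numeric = expression_data.drop(columns=metadata_attributes)
kept = numeric[numeric.sum(axis=1) >= min_row_sum]  # gene B (sum 4) is dropped

# Step 2: placeholder zFPKM values for the kept genes (made up for illustration).
zfpkm_values = pd.DataFrame(
    {"liver": [0.4, -3.5], "lung": [-0.2, -1.0]}, index=kept.index
)

# Step 3: values below the threshold become 0 (unexpressed), the rest 1.
binary = (zfpkm_values >= zfpm_threshold).astype(int)

# Step 4: put the metadata columns back on the left of both outputs.
metadata = expression_data.loc[kept.index, metadata_attributes]
zfpkm_out = metadata.join(zfpkm_values)
binary_out = metadata.join(binary)
```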

napistu.context.discretize.generate_simple_test_data(n_genes: int = 200, n_samples: int = 100) → DataFrame

Generate a simple test dataset for basic validation.

Parameters:

n_genes (int, optional) – Number of genes to generate, by default 200
n_samples (int, optional) – Number of samples to generate, by default 100

Returns:

DataFrame with simulated FPKM values. Rows = genes, Columns = samples

Return type:

pd.DataFrame
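The source does not describe how the values are simulated. As a hedged illustration of data with the documented shape (rows = genes, columns = samples), the sketch below draws log2(FPKM) values from two normal distributions, one for inactive and one for active genes. The function name `simulate_fpkm`, the `active_fraction` and `seed` parameters, and the distribution settings are all hypothetical.

```python
import numpy as np
import pandas as pd


def simulate_fpkm(n_genes=200, n_samples=100, active_fraction=0.7, seed=0):
    """Hypothetical generator of bimodal FPKM data; not the napistu function."""
    rng = np.random.default_rng(seed)

    # The first block of genes is "active" (high mean), the rest "inactive".
    n_active = int(n_genes * active_fraction)
    is_active = np.arange(n_genes) < n_active
    gene_means = np.where(is_active, 4.0, -3.0)[:, None]  # log2 scale

    # One row per gene, one column per sample.
    log2_fpkm = rng.normal(loc=gene_means, scale=1.5, size=(n_genes, n_samples))

    return pd.DataFrame(
        2.0 ** log2_fpkm,
        index=[f"gene_{i}" for i in range(n_genes)],
        columns=[f"sample_{j}" for j in range(n_samples)],
    )


test_df = simulate_fpkm()
```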

napistu.context.discretize.zfpkm(fpkm_df: DataFrame, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False) → DataFrame

Transform entire DataFrame using zFPKM.

Parameters:

fpkm_df (pd.DataFrame) – DataFrame containing raw FPKM values. Rows = genes/transcripts, Columns = samples
min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02
min_peakdistance (int, optional) – Minimum distance between peaks, by default 1
prominence (float, optional) – Minimum prominence for peak detection, by default 0.05
verbose (bool, optional) – Whether to log detailed information, by default False

Returns:

DataFrame with zFPKM transformed values

Return type:

pd.DataFrame
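Since the transform is defined per sample, a DataFrame-level version amounts to applying a single-sample function to each column and reassembling the results with the original gene index. The sketch below shows that pattern with a deliberately simple stand-in transform (centering and scaling log2 values). The stand-in is not zFPKM, and `per_sample_transform` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def per_sample_transform(fpkm):
    """Stand-in for a single-sample transform: standardize log2(FPKM)."""
    log2_fpkm = np.log2(fpkm)
    return (log2_fpkm - log2_fpkm.mean()) / log2_fpkm.std()


# Toy input: rows = genes, columns = samples.
fpkm_df = pd.DataFrame(
    {"sample_1": [1.0, 2.0, 4.0, 8.0], "sample_2": [16.0, 4.0, 1.0, 0.25]},
    index=["g1", "g2", "g3", "g4"],
)

# Apply the transform to each sample column independently, keeping gene labels.
transformed = pd.DataFrame(
    {name: per_sample_transform(fpkm_df[name].to_numpy()) for name in fpkm_df.columns},
    index=fpkm_df.index,
)
```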