napistu.context.discretize

Functions

discretize_expression_data(expression_data)

Discretize expression data (e.g., GTEx) into binary expressed/unexpressed calls based on zFPKM.

generate_simple_test_data([n_genes, n_samples])

Generate a simple test dataset for basic validation.

zfpkm(fpkm_df[, min_peakheight, ...])

Transform entire DataFrame using zFPKM.

Classes

PeakIndices(major, minor, other)

Container for peak indices classified by importance.

PeakSelector([min_peakheight, ...])

Class to handle peak detection and classification in density data.

class napistu.context.discretize.PeakIndices(major: float, minor: float | None, other: np.ndarray | None)

Bases: NamedTuple

Container for peak indices classified by importance.

Parameters:
  • major (float) – Position of the rightmost/highest peak

  • minor (Optional[float]) – Position of the second most significant peak, if it exists

  • other (Optional[np.ndarray]) – Positions of any remaining peaks

classmethod _make(iterable)

Make a new PeakIndices object from a sequence or iterable

_asdict()

Return a new dict which maps field names to their values.

_replace(**kwds)

Return a new PeakIndices object replacing specified fields with new values

_field_defaults = {}
_fields = ('major', 'minor', 'other')
major: float

Alias for field number 0

minor: float | None

Alias for field number 1

other: ndarray | None

Alias for field number 2
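Because PeakIndices is a NamedTuple, its fields can be read by name, by position, or by unpacking, and the helper methods listed above behave as they do for any named tuple. The sketch below uses a hypothetical stand-in class, PeakIndicesSketch, with the same field names so that it runs without napistu installed. It holds a plain list where the documented class holds an np.ndarray, and the values are illustrative only.

```python
from typing import NamedTuple, Optional


class PeakIndicesSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented PeakIndices fields."""

    major: float
    minor: Optional[float]
    other: Optional[list]  # the documented class uses np.ndarray here


peaks = PeakIndicesSketch(major=3.2, minor=-1.5, other=None)

# Fields are available by name, by position, and by unpacking.
major, minor, other = peaks
assert peaks.major == peaks[0] == major

# _make builds an instance from any iterable of three values.
from_list = PeakIndicesSketch._make([3.2, -1.5, None])

# _asdict and _replace return new objects; the tuple itself is immutable.
as_dict = peaks._asdict()
no_minor = peaks._replace(minor=None)
```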

class napistu.context.discretize.PeakSelector(min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)

Bases: object

Class to handle peak detection and classification in density data.

Parameters:
  • min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02

  • min_peakdistance (int, optional) – Minimum distance between peaks, by default 1

  • prominence (float, optional) – Minimum prominence for peak detection, by default 0.05

  • verbose (bool, optional) – Whether to log detailed information, by default False

__init__(min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)
find_peaks(density_y: ndarray, x_eval: ndarray) → PeakIndices

Find and classify peaks in density data.

Parameters:
  • density_y (np.ndarray) – Y-values of the density estimation

  • x_eval (np.ndarray) – X-values corresponding to density_y

Returns:

Named tuple containing classified peak positions

Return type:

PeakIndices
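To make the idea of classifying peaks concrete, here is a minimal pure-Python sketch of picking local maxima from a density curve. It is not the library's implementation: find_peaks_sketch is a hypothetical name, the classification rule (rightmost peak is major, tallest remaining peak is minor) is an assumption read from the field descriptions above, and min_peakdistance and prominence are ignored for brevity.

```python
def find_peaks_sketch(density_y, x_eval, min_peakheight=0.02):
    """Return (major, minor, other) x-positions of local maxima.

    Assumed rule, for illustration only: the rightmost qualifying peak is
    "major", the tallest remaining peak is "minor", the rest are "other".
    """
    # Interior points higher than the left neighbour, not lower than the
    # right neighbour, and at least min_peakheight tall.
    idx = [
        i
        for i in range(1, len(density_y) - 1)
        if density_y[i - 1] < density_y[i] >= density_y[i + 1]
        and density_y[i] >= min_peakheight
    ]
    if not idx:
        raise ValueError("no peaks above min_peakheight")

    major_i = idx[-1]  # indices are ascending, so the last one is rightmost
    rest = sorted(idx[:-1], key=lambda i: density_y[i], reverse=True)
    minor = x_eval[rest[0]] if rest else None
    other = [x_eval[i] for i in rest[1:]] or None
    return x_eval[major_i], minor, other


# Toy density with three bumps, at x = -4, 0 and 3.
x_eval = [-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
density_y = [0.00, 0.02, 0.05, 0.02, 0.03, 0.10, 0.20, 0.10, 0.15, 0.25, 0.10, 0.01]
major, minor, other = find_peaks_sketch(density_y, x_eval)
```

In this toy input the bump at x = 3 is the rightmost, so it becomes the major peak; the bump at x = 0 is the taller of the two remaining and becomes the minor peak.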

napistu.context.discretize._remove_nan_inf_rows(fpkm_df: DataFrame) → DataFrame

Remove rows containing all NaN or infinite values.

Parameters:

fpkm_df (pd.DataFrame) – Input DataFrame with FPKM values

Returns:

DataFrame with rows containing all NaN or infinite values removed

Return type:

pd.DataFrame

Notes

Logs a warning if any rows are filtered out.

napistu.context.discretize._zfpkm_calc(fpkm: ndarray | Series, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False) → ndarray

Perform zFPKM transform on a single sample of FPKM data.

The zFPKM algorithm fits a kernel density estimate to the log2(FPKM) distribution of ALL GENES within a single sample. This requires:
  • Input: A vector of FPKM values for all genes in ONE sample

  • Many genes (typically 1000+) for meaningful density estimation

  • The algorithm identifies the rightmost peak as “active” gene expression

Parameters:
  • fpkm (Union[np.ndarray, pd.Series]) – Raw FPKM values for all genes in ONE sample (NOT log2 transformed)

  • min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02

  • min_peakdistance (int, optional) – Minimum distance between peaks, by default 1

  • prominence (float, optional) – Minimum prominence for peak detection, by default 0.05

  • verbose (bool, optional) – Whether to log debug information, by default False

Returns:

Array of zFPKM values

Return type:

np.ndarray

Raises:

ValueError – If no valid FPKM values are found after filtering
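The transform described above can be sketched in plain Python: take log2 of the FPKM values, estimate their density, locate the rightmost peak, and z-score every value around it. This is an illustration under stated assumptions, not this module's code. The names gaussian_kde and zfpkm_sketch, the fixed bandwidth, and the grid size are hypothetical. The scaling step, which estimates the standard deviation from the values to the right of the peak as a half-Gaussian, follows how the zFPKM method is usually described and is assumed here. Zero and non-finite values are simply dropped.

```python
import math
import random


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values) / norm
        for g in grid
    ]


def zfpkm_sketch(fpkm, min_peakheight=0.02, bandwidth=0.5, n_grid=200):
    """Z-score log2(FPKM) around the rightmost density peak (illustrative)."""
    log2_fpkm = [math.log2(v) for v in fpkm if math.isfinite(v) and v > 0]
    if not log2_fpkm:
        raise ValueError("no valid FPKM values after filtering")

    lo, hi = min(log2_fpkm), max(log2_fpkm)
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    density = gaussian_kde(log2_fpkm, grid, bandwidth)

    # Local maxima of the density that clear the height threshold.
    peaks = [
        i
        for i in range(1, n_grid - 1)
        if density[i - 1] < density[i] >= density[i + 1]
        and density[i] >= min_peakheight
    ]
    if not peaks:
        raise ValueError("no density peak above min_peakheight")
    mu = grid[peaks[-1]]  # rightmost peak, taken as the "active" mode

    # Assumed half-Gaussian scaling: sd = (mean of values right of mu - mu) * sqrt(pi / 2).
    right = [v for v in log2_fpkm if v > mu]
    sd = (sum(right) / len(right) - mu) * math.sqrt(math.pi / 2)
    return [(v - mu) / sd for v in log2_fpkm], mu, sd


# Simulated single sample: 600 "inactive" and 1400 "active" genes (log2 scale).
random.seed(0)
log2_values = [random.gauss(-3, 1) for _ in range(600)]
log2_values += [random.gauss(4, 1) for _ in range(1400)]
fpkm = [2.0 ** v for v in log2_values]

z, mu, sd = zfpkm_sketch(fpkm)
```

With this simulated sample the rightmost peak sits near the centre of the active population, so active genes land around zFPKM 0 while the inactive population falls far below it.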

napistu.context.discretize.discretize_expression_data(expression_data: DataFrame, metadata_attributes: list[str] = None, min_row_sum: int = 50, zfpm_threshold: float = -3, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False)

Discretize expression data (e.g., GTEx) into binary expressed/unexpressed calls based on zFPKM.

Parameters:
  • expression_data (pandas DataFrame) – The expression data to discretize

  • metadata_attributes (list[str], optional) – Non-numeric and other metadata attributes which should be included in the output but ignored when discretizing expression data

  • min_row_sum (int, optional) – The minimum row sum to use for filtering constitutively unexpressed genes

  • zfpm_threshold (float, optional) – The zFPKM threshold to use for discretization. A zFPKM value below this threshold marks the gene as unexpressed (0) in that sample/condition.

  • min_peakheight (float, optional) – The minimum peak height to use for peak detection

  • min_peakdistance (int, optional) – The minimum peak distance to use for peak detection

  • prominence (float, optional) – The prominence to use for peak detection

  • verbose (bool, optional) – Whether to print verbose output

Returns:

A tuple of two pandas DataFrames. The first DataFrame contains the zFPKM-transformed expression data with the metadata attributes merged on the left. The second DataFrame contains the expression data with binary values (0 for unexpressed, 1 for expressed) merged on the left.

Return type:

tuple of pandas DataFrames
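The two steps controlled by min_row_sum and zfpm_threshold can be illustrated on plain dictionaries of gene rows. This is a sketch only: filter_low_rows and discretize_sketch are hypothetical helpers, the inclusive comparisons (a row sum equal to min_row_sum is kept, a value equal to the threshold counts as expressed) are assumptions, and the zFPKM values are made up rather than computed. On a pandas DataFrame of zFPKM values the thresholding step would typically be written as `(zfpkm_df >= zfpm_threshold).astype(int)`.

```python
def filter_low_rows(expression_by_gene, min_row_sum=50):
    """Drop genes whose summed expression across samples is below min_row_sum."""
    return {
        gene: values
        for gene, values in expression_by_gene.items()
        if sum(values) >= min_row_sum
    }


def discretize_sketch(zfpkm_by_gene, zfpm_threshold=-3.0):
    """Map each zFPKM value to 1 (expressed) or 0 (unexpressed)."""
    return {
        gene: [1 if z >= zfpm_threshold else 0 for z in values]
        for gene, values in zfpkm_by_gene.items()
    }


# Raw expression for three genes across three samples (illustrative values).
raw = {
    "GENE_A": [120.0, 95.0, 0.5],
    "GENE_B": [0.1, 0.0, 0.2],
    "GENE_C": [30.0, 0.2, 45.0],
}
kept = filter_low_rows(raw)  # GENE_B sums to 0.3 and is dropped

# Made-up zFPKM values for the genes that survived filtering.
zfpkm_values = {
    "GENE_A": [0.4, 0.1, -4.2],
    "GENE_C": [-1.0, -5.0, -0.3],
}
binary = discretize_sketch(zfpkm_values)
```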

napistu.context.discretize.generate_simple_test_data(n_genes: int = 200, n_samples: int = 100) → DataFrame

Generate a simple test dataset for basic validation.

Parameters:
  • n_genes (int, optional) – Number of genes to generate, by default 200

  • n_samples (int, optional) – Number of samples to generate, by default 100

Returns:

DataFrame with simulated FPKM values. Rows = genes, Columns = samples

Return type:

pd.DataFrame

napistu.context.discretize.zfpkm(fpkm_df: DataFrame, min_peakheight: float = 0.02, min_peakdistance: int = 1, prominence: float = 0.05, verbose: bool = False) → DataFrame

Transform entire DataFrame using zFPKM.

Parameters:
  • fpkm_df (pd.DataFrame) – DataFrame containing raw FPKM values. Rows = genes/transcripts, Columns = samples

  • min_peakheight (float, optional) – Minimum height for peak detection, by default 0.02

  • min_peakdistance (int, optional) – Minimum distance between peaks, by default 1

  • prominence (float, optional) – Minimum prominence for peak detection, by default 0.05

  • verbose (bool, optional) – Whether to log detailed information, by default False

Returns:

DataFrame with zFPKM transformed values

Return type:

pd.DataFrame