napistu.statistics.quantiles

Module for comparing observed values to null distributions.

Functions

calculate_quantiles(observed_df, null_df, *, ...)

Calculate quantiles of observed scores relative to null distributions using the standard midrank method for tie handling.

napistu.statistics.quantiles._assert_quantile_inputs(observed_df: DataFrame, null_df: DataFrame) → None
napistu.statistics.quantiles._calculate_quantiles_dense(observed_df: DataFrame, null_df: DataFrame, *, comparison_dtype: Any) → DataFrame

Midrank quantiles: stack all nulls into a dense 3D array and vectorize.

Favors CPU throughput when the problem fits in memory; materializes shape (n_features, n_null_samples, n_attributes) (padded with NaN where per-feature null counts differ).
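To make the padded layout concrete, the following is a minimal sketch of the dense idea, not the module's actual code. The toy frames and all variable names are assumptions for illustration, and the all-identical NaN rule of the public function is deliberately left out.

```python
import numpy as np
import pandas as pd

# Toy inputs (hypothetical): f1 has three null draws, f2 has only two.
observed_df = pd.DataFrame(
    {"attr_a": [0.5, 2.0], "attr_b": [2.0, 0.5]}, index=["f1", "f2"]
)
null_df = pd.DataFrame(
    {"attr_a": [0.3, 0.5, 0.7, 1.0, 3.0], "attr_b": [0.0, 1.0, 2.0, 0.0, 0.0]},
    index=["f1", "f1", "f1", "f2", "f2"],
)

features = observed_df.index
counts = null_df.groupby(level=0).size().reindex(features)
max_null = int(counts.max())

# Dense block of shape (n_features, n_null_samples, n_attributes), NaN-padded.
block = np.full(
    (len(features), max_null, observed_df.shape[1]), np.nan, dtype=np.float32
)
for i, feature in enumerate(features):
    rows = null_df.loc[[feature], observed_df.columns].to_numpy(dtype=np.float32)
    block[i, : rows.shape[0], :] = rows

# Broadcast observed values to (n_features, 1, n_attributes).
obs = observed_df.to_numpy(dtype=np.float32)[:, None, :]

# Comparisons against NaN are False, so padding never adds to a count.
n_less = (block < obs).sum(axis=1)
n_equal = (block == obs).sum(axis=1)
n_valid = (~np.isnan(block)).sum(axis=1)

quantiles = pd.DataFrame(
    (n_less + n_equal / 2) / n_valid, index=features, columns=observed_df.columns
)
print(quantiles)
```

The whole block exists in memory at once, which is what makes the comparisons a single vectorized step and also what drives peak memory.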

napistu.statistics.quantiles._calculate_quantiles_per_feature(observed_df: DataFrame, null_df: DataFrame, *, comparison_dtype: Any) → DataFrame

Midrank quantiles: one (n_null, n_attr) block per feature; no 3D tensor.
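A rough sketch of the per-feature strategy follows. It assumes the same input layout as the public function; `per_feature_midrank` is a hypothetical name, not part of napistu, and the all-identical NaN rule is omitted for brevity.

```python
import numpy as np
import pandas as pd

def per_feature_midrank(observed_df, null_df, dtype=np.float32):
    """Hypothetical sketch: midrank quantiles computed one feature at a time."""
    out = np.empty(observed_df.shape, dtype=np.float64)
    for i, feature in enumerate(observed_df.index):
        # Only this feature's (n_null, n_attr) block is held in memory.
        nulls = null_df.loc[[feature], observed_df.columns].to_numpy(dtype=dtype)
        obs = observed_df.iloc[i].to_numpy(dtype=dtype)
        n_less = (nulls < obs).sum(axis=0)
        n_equal = (nulls == obs).sum(axis=0)
        out[i] = (n_less + n_equal / 2) / nulls.shape[0]
    return pd.DataFrame(out, index=observed_df.index, columns=observed_df.columns)

# Toy inputs (hypothetical): features may have different numbers of null draws.
observed_df = pd.DataFrame({"attr_a": [0.5, 2.0]}, index=["f1", "f2"])
null_df = pd.DataFrame(
    {"attr_a": [0.3, 0.5, 0.7, 1.0, 3.0, 3.0, 4.0]},
    index=["f1", "f1", "f1", "f2", "f2", "f2", "f2"],
)
result = per_feature_midrank(observed_df, null_df)
print(result)
```

Because each block is discarded before the next feature is read, no padding is needed when null counts differ between features.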

napistu.statistics.quantiles._coerce_to_2d_null_block(null_slice: DataFrame | Series, dtype: Any) → ndarray

Return (n_null_samples, n_attributes) array for one feature’s null draws.
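The DataFrame | Series input type reflects how pandas label selection behaves: a label matching several rows yields a DataFrame, while a label matching one row yields a Series. The sketch below shows one way such a coercion could look. `to_2d_null_block` is a hypothetical stand-in, and treating a Series as a single null draw across attributes is an assumption.

```python
import numpy as np
import pandas as pd

def to_2d_null_block(null_slice, dtype=np.float32):
    """Hypothetical stand-in: always return shape (n_null_samples, n_attributes)."""
    if isinstance(null_slice, pd.Series):
        # Assumption: a Series is one null draw, so it becomes a single row.
        return null_slice.to_numpy(dtype=dtype).reshape(1, -1)
    return null_slice.to_numpy(dtype=dtype)

null_df = pd.DataFrame(
    {"attr_a": [0.1, 0.2, 0.3], "attr_b": [1.0, 2.0, 3.0]},
    index=["f1", "f1", "f2"],
)

many = to_2d_null_block(null_df.loc["f1"])  # label with two rows: DataFrame input
one = to_2d_null_block(null_df.loc["f2"])   # label with one row: Series input
print(many.shape, one.shape)
```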

napistu.statistics.quantiles.calculate_quantiles(observed_df: DataFrame, null_df: DataFrame, *, method: str = 'dense', comparison_dtype: Any = <class 'numpy.float32'>) → DataFrame

Calculate quantiles of observed scores relative to null distributions using the standard midrank method for tie handling.

Ties are handled with the midrank convention, the same averaging of tied ranks that R’s rank function applies by default (ties.method = "average"). For an observed value compared against its null values, the quantile is calculated as:

(count_less_than + count_equal_to / 2) / total_count

For example, if an observed value of 0.5 is compared to null values [0.3, 0.5, 0.7], one null is smaller and one is tied, so the result is (1 + 1/2)/3 = 0.5: the observed value falls at the 50th percentile of its null distribution.
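The arithmetic can be checked with a few lines of NumPy. This is a minimal sketch of the formula for a single observed value, not the module's implementation; `midrank_quantile` is a hypothetical helper name, and the NaN rule for all-identical values is not applied here.

```python
import numpy as np

def midrank_quantile(observed, nulls):
    """Midrank quantile of one observed value against a 1D array of nulls."""
    nulls = np.asarray(nulls, dtype=float)
    n_less = np.sum(nulls < observed)
    n_equal = np.sum(nulls == observed)
    return (n_less + n_equal / 2) / nulls.size

# One null below, one tie, three nulls in total: (1 + 1/2) / 3
print(midrank_quantile(0.5, [0.3, 0.5, 0.7]))

# Two nulls below, one tie: (2 + 1/2) / 3
print(midrank_quantile(0.7, [0.3, 0.5, 0.7]))
```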

Parameters:
  • observed_df (pd.DataFrame) – DataFrame with features as index and attributes as columns containing observed scores.

  • null_df (pd.DataFrame) – DataFrame with null scores, features as index (multiple rows per feature) and attributes as columns.

  • method (str) –

    per_feature: process one null block per feature, so peak memory scales with n_null_samples * n_attributes for a single feature (no 3D tensor). Best for large graphs and many nulls when memory is the constraint.

    dense (default): materialize a padded 3D array (n_features, n_null_samples, n_attributes) and use vectorized comparisons. Faster when the full array fits in RAM, but can increase peak memory substantially.

  • comparison_dtype – NumPy float dtype used for the < and == comparisons. Defaults to float32 for parity with napistu.utils.pd_utils.downcast_float_dataframe() in the propagation pipeline. The dense path stores the 3D block in this dtype when it is float16, float32, or float64. Use float64 to reproduce the legacy all-float64 midrank numerics.
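The comparison dtype matters because values that differ in float64 can collapse into ties in float32, which changes the midrank count. The snippet below is an illustrative sketch with made-up values. It applies the midrank formula directly and does not call napistu.

```python
import numpy as np

# Hypothetical values: the second null differs from 1.0 by less than
# float32 precision can resolve, so it becomes a tie after downcasting.
nulls = np.array([1.0, 1.0 + 1e-9, 2.0])
observed = 1.0

results = {}
for dtype in (np.float64, np.float32):
    n = nulls.astype(dtype)
    o = dtype(observed)
    n_less = int((n < o).sum())
    n_equal = int((n == o).sum())
    results[dtype.__name__] = (n_less + n_equal / 2) / n.size

# float64 sees one tie, (0 + 1/2) / 3; float32 sees two, (0 + 2/2) / 3.
print(results)
```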

Returns:

DataFrame with the same structure as observed_df containing quantiles. Each value is the fraction of the feature’s null values that fall below the observed value, plus half the fraction that tie with it (midrank tie handling). Returns NaN when the observed value and all null values are identical (no meaningful quantile can be computed).

Return type:

pd.DataFrame

Notes

The midrank method is the standard statistical approach used in R and other major statistical software packages. When all values (observed + nulls) for a feature-attribute combination are identical, NaN is returned since no meaningful ranking is possible.
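The degenerate case can be sketched as follows. How the module detects it internally is not stated here, so the check below (every null ties with the observed value) is an assumption, and `midrank_or_nan` is a hypothetical name.

```python
import numpy as np

def midrank_or_nan(observed, nulls):
    """Midrank quantile, or NaN when observed and all nulls are identical."""
    nulls = np.asarray(nulls, dtype=np.float64)
    n_less = np.sum(nulls < observed)
    n_equal = np.sum(nulls == observed)
    if n_equal == nulls.size:
        # Every null ties with the observed value: no ranking is possible.
        return np.nan
    return (n_less + n_equal / 2) / nulls.size

degenerate = midrank_or_nan(0.0, [0.0, 0.0, 0.0])
partial_tie = midrank_or_nan(0.0, [0.0, 0.0, 1.0])  # (0 + 2/2) / 3
print(degenerate, partial_tie)
```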

The per_feature method processes one feature at a time; the dense method is the historical vectorized implementation optimized for speed at the cost of a large temporary 3D buffer.