napistu.utils.pd_utils

Utilities for pandas DataFrame operations.

Classes

match_pd_vars: class

Match pandas variables - check if required variables are present in a DataFrame or Series.

Public Functions

check_unique_index(df: pd.DataFrame, label: str = "") -> None:

Validate that each index value only maps to a single row.

downcast_float_dataframe(df: pd.DataFrame, dtype: Any = np.float32, copy: bool = True) -> pd.DataFrame:

Coerce DataFrame values to a narrower float dtype to reduce in-memory size.

drop_extra_cols(df_in: pd.DataFrame, df_out: pd.DataFrame, always_include: Optional[List[str]] = None) -> pd.DataFrame:

Remove columns in df_out that are not in df_in, except those specified in always_include.

ensure_pd_df(pd_df_or_series: pd.DataFrame | pd.Series) -> pd.DataFrame:

Ensure pandas DataFrame by converting a Series to DataFrame if needed.

format_identifiers_as_edgelist(df: pd.DataFrame, defining_vars: list[str], verbose: bool = False) -> pd.DataFrame:

Format identifiers as edgelist by collapsing multiindex and multiple variables.

infer_entity_type(df: pd.DataFrame) -> str:

Infer the entity type of a DataFrame based on its structure and schema.

matrix_to_edgelist(matrix: np.ndarray, row_labels: Optional[List] = None, col_labels: Optional[List] = None) -> pd.DataFrame:

Convert a matrix to an edgelist format.

safe_series_tolist(x: str | pd.Series) -> list:

Convert either a pd.Series or a str to a list.

series_to_none_filled_list(series: pd.Series) -> list:

Convert a pandas Series to a Python list, replacing all NA-like values with None.

style_df(df: pd.DataFrame, headers: Union[str, list[str], None] = "keys", hide_index: bool = False) -> Styler:

Style a pandas DataFrame with simple formatting options.

update_pathological_names(names: pd.Series, prefix: str) -> pd.Series:

Update pathological names in a pandas Series by adding a prefix if all numeric.

validate_merge(left_df: pd.DataFrame, right_df: pd.DataFrame, left_on: Union[str, List[str]], right_on: Union[str, List[str]], relationship: str) -> None:

Validate merge relationship before performing a merge operation.

Functions

check_unique_index(df[, label])

Validate that each index value only maps to a single row.

downcast_float_dataframe(df, dtype, copy)

Coerce values to a narrower float dtype (default float32) to cut RAM usage.

drop_extra_cols(df_in, df_out[, always_include])

Remove columns in df_out that are not in df_in, except those specified in always_include.

ensure_pd_df(pd_df_or_series)

Ensure Pandas DataFrame

format_identifiers_as_edgelist(df, defining_vars)

Format Identifiers as Edgelist

infer_entity_type(df)

Infer the entity type of a DataFrame based on its structure and schema.

matrix_to_edgelist(matrix[, row_labels, ...])

Convert a matrix to an edgelist format.

safe_series_tolist(x)

Convert either a pd.Series or a str to a list.

series_to_none_filled_list(series)

Convert a pandas Series to a Python list, replacing all NA-like values with None.

style_df(df[, headers, hide_index])

Style DataFrame

update_pathological_names(names, prefix)

Update pathological names in a pandas Series.

validate_merge(left_df, right_df, left_on, ...)

Validate merge relationship before performing a merge operation.

Classes

match_pd_vars(df, req_vars[, allow_series])

Match Pandas Variables.

class napistu.utils.pd_utils.match_pd_vars(df: DataFrame | Series, req_vars: set, allow_series: bool = True)

Bases: object

Match Pandas Variables.

req_vars

A set of variables which should exist in df

missing_vars

Required variables which are not present in df

extra_vars

Non-required variables which are present in df

are_present

Returns True if req_vars are present and False otherwise

assert_present()

Raise an exception if req_vars are absent

__init__(df: DataFrame | Series, req_vars: set, allow_series: bool = True) -> None

Match required variables against the variables present in df.

Parameters:
  • df – A pd.DataFrame or pd.Series

  • req_vars – A set of variables which should exist in df

  • allow_series – Can a pd.Series be provided as df?

Return type:

None.

assert_present() -> None

Raise an error if required variables are missing
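
The attributes above amount to set comparisons between req_vars and the variables found in df. The following is a minimal sketch of that logic in plain pandas, not the class's actual implementation. The helper name compare_vars is hypothetical, and treating a DataFrame's columns (or a Series' index labels) as its "variables" is an assumption.

```python
import pandas as pd


def compare_vars(df, req_vars):
    """Hypothetical helper: compare required variables against those in df."""
    # Assumption: a DataFrame's variables are its columns and a Series'
    # variables are its index labels.
    present = set(df.columns) if isinstance(df, pd.DataFrame) else set(df.index)
    missing_vars = set(req_vars) - present  # required but absent
    extra_vars = present - set(req_vars)  # present but not required
    are_present = len(missing_vars) == 0
    return missing_vars, extra_vars, are_present


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
missing, extra, ok = compare_vars(df, {"a", "d"})
```

Here "d" is required but absent, so it lands in the missing set, while "b" and "c" are reported as extras.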

napistu.utils.pd_utils._merge_and_log_overwrites(left_df: DataFrame, right_df: DataFrame, merge_context: str, **merge_kwargs) -> DataFrame

Merge two DataFrames and log any column overwrites.

Parameters:
  • left_df (pd.DataFrame) – Left DataFrame for merge

  • right_df (pd.DataFrame) – Right DataFrame for merge

  • merge_context (str) – Description of the merge operation for logging

  • **merge_kwargs (dict) – Additional keyword arguments passed to pd.merge

Returns:

Merged DataFrame with overwritten columns removed

Return type:

pd.DataFrame

napistu.utils.pd_utils.check_unique_index(df, label='')

Validate that each index value only maps to a single row.
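
A check of this kind can be sketched with pandas' own index helpers. This is an illustration only: the function name assert_unique_index is hypothetical, and the exception type and message used by the real function are not stated here, so ValueError is an assumption.

```python
import pandas as pd


def assert_unique_index(df, label=""):
    """Hypothetical stand-in: raise if any index value maps to more than one row."""
    if not df.index.is_unique:
        # Collect each repeated index value once for the error message
        duplicated = df.index[df.index.duplicated()].unique().tolist()
        raise ValueError(f"{label} index has duplicated values: {duplicated}")


ok_df = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
bad_df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "a", "b"])

assert_unique_index(ok_df, label="ok_df")  # unique index, no error

try:
    assert_unique_index(bad_df, label="bad_df")
    raised = False
except ValueError:
    raised = True
```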

napistu.utils.pd_utils.downcast_float_dataframe(df: DataFrame, dtype: Any = <class 'numpy.float32'>, copy: bool = True) -> DataFrame

Coerce values to a narrower float dtype (default float32) to cut RAM usage.

Typical use is large all-numeric tables (e.g. stacked propagation scores) where float64 is unnecessary. Uses pandas.DataFrame.astype() with the given dtype and copy semantics.

Parameters:
  • df – DataFrame to coerce.

  • dtype – Numpy float dtype, commonly np.float32 to halve size versus float64.

  • copy – If True (default), always returns a new object even when dtype matches.

Returns:

DataFrame with values in dtype.

Return type:

pd.DataFrame
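
To illustrate the memory saving, the sketch below calls pandas.DataFrame.astype() directly on an all-float64 table and compares the bytes used before and after. It does not call the napistu function itself; the table shape and variable names are made up for the example, and the copy argument is left at the pandas default.

```python
import numpy as np
import pandas as pd

# A made-up all-numeric table; NumPy floats default to float64
scores = pd.DataFrame(np.random.default_rng(0).random((1000, 4)))

# Coerce every column to float32, as the documented default dtype would
downcast = scores.astype(np.float32)

# Bytes used by the values alone (index excluded)
bytes_before = scores.memory_usage(index=False).sum()
bytes_after = downcast.memory_usage(index=False).sum()
```

With float64 at 8 bytes per value and float32 at 4, bytes_after is half of bytes_before. The trade-off is precision: float32 carries roughly 7 significant decimal digits.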

napistu.utils.pd_utils.drop_extra_cols(df_in: DataFrame, df_out: DataFrame, always_include: List[str] | None = None) -> DataFrame

Remove columns in df_out that are not in df_in, except those specified in always_include.

Parameters:
  • df_in (pd.DataFrame) – Reference DataFrame whose columns determine what to keep

  • df_out (pd.DataFrame) – DataFrame to filter columns from

  • always_include (Optional[List[str]], optional) – List of column names to always include in output, even if not in df_in

Returns:

DataFrame with columns filtered to match df_in plus any always_include columns. Column order follows df_in, with always_include columns appended at the end.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from napistu.utils.pd_utils import drop_extra_cols
>>> df_in = pd.DataFrame({'a': [1], 'b': [2]})
>>> df_out = pd.DataFrame({'a': [3], 'c': [4], 'd': [5]})
>>> drop_extra_cols(df_in, df_out)
# Returns DataFrame with just column 'a'
>>> drop_extra_cols(df_in, df_out, always_include=['d'])
# Returns DataFrame with columns ['a', 'd']

napistu.utils.pd_utils.ensure_pd_df(pd_df_or_series: DataFrame | Series) -> DataFrame

Ensure Pandas DataFrame

Convert a pd.Series to a DataFrame if needed.

Parameters:

pd_df_or_series (pd.Series | pd.DataFrame) – a pandas df or series

Returns:

pd_df converted to a pd.DataFrame if needed

napistu.utils.pd_utils.format_identifiers_as_edgelist(df: DataFrame, defining_vars: list[str], verbose: bool = False) -> DataFrame

Format Identifiers as Edgelist

Collapse a multiindex to a single index (if needed), and similarly collapse multiple variables to a single entry. The resulting indexed pd.Series of index-to-id pairs can be treated as an edgelist for greedy clustering.

Parameters:
  • df (pd.DataFrame) – Any pd.DataFrame

  • defining_vars (list[str]) – A set of attributes which define a distinct entry in df

  • verbose (bool, default=False) – If True, then include detailed logs.

Returns:

df – A pd.DataFrame with an “ind” and “id” variable added indicating rolled up values of the index and defining_vars

Return type:

pd.DataFrame

napistu.utils.pd_utils.infer_entity_type(df: DataFrame) -> str

Infer the entity type of a DataFrame based on its structure and schema.

Parameters:

df (pd.DataFrame) – The DataFrame to analyze

Returns:

The inferred entity type name

Return type:

str

Raises:

ValueError – If no entity type can be determined

napistu.utils.pd_utils.matrix_to_edgelist(matrix, row_labels=None, col_labels=None)

Convert a matrix to an edgelist format.

napistu.utils.pd_utils.safe_series_tolist(x)

Convert either a pd.Series or a str to a list.
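
The reason to treat a str separately is that list("abc") would split it into characters. A minimal sketch of that behavior follows, assuming the accepted inputs are a str or a pd.Series as the annotated signature in the module summary suggests. The name to_list and the TypeError for other inputs are hypothetical.

```python
import pandas as pd


def to_list(x):
    """Hypothetical stand-in: wrap a str in a list, or convert a Series to a list."""
    if isinstance(x, str):
        return [x]  # keep the string whole rather than splitting into characters
    if isinstance(x, pd.Series):
        return x.tolist()
    raise TypeError(f"Expected a str or pd.Series but received {type(x)}")


from_str = to_list("abc")
from_series = to_list(pd.Series(["a", "b"]))
```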

napistu.utils.pd_utils.series_to_none_filled_list(series: Series) -> list

Convert a pandas Series to a Python list, replacing all NA-like values with None.

Pandas represents missing values as NaN (a float) in numeric Series, but many downstream consumers (e.g. igraph, JSON serialization) expect None for missing values. This function normalizes missing values consistently across all dtypes.

Parameters:

series (pd.Series) – The Series to convert. May be any dtype.

Returns:

Python list with NaN, pd.NA, and other NA-like values replaced by None.

Return type:

list
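
The normalization described above can be sketched with pd.isna, which recognizes NaN, None, and pd.NA. This is an illustration under the assumption that the Series holds scalar values, not the library's implementation. The name none_filled is hypothetical.

```python
import numpy as np
import pandas as pd


def none_filled(series):
    """Hypothetical stand-in: list of values with NA-like entries replaced by None."""
    # pd.isna is applied per element, so this assumes scalar entries
    return [None if pd.isna(value) else value for value in series]


floats = none_filled(pd.Series([1.0, np.nan, 3.0]))
strings = none_filled(pd.Series(["a", None, pd.NA], dtype="object"))
```

Calling tolist() alone would leave nan in the float case, which JSON serializers either reject or write as a non-standard NaN token.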

napistu.utils.pd_utils.style_df(df: DataFrame, headers: str | list[str] | None = 'keys', hide_index: bool = False) -> Styler

Style DataFrame

Provide some simple options for styling a pd.DataFrame

Parameters:
  • df (pd.DataFrame) – A table to style

  • headers (Union[str, list[str], None]) –

    • “keys” to use the current column names

    • None to suppress column names

    • list[str] to overwrite and show column names

  • hide_index (bool) – Whether to hide the index (row labels) in the styled output

Returns:

styled_df – df with styles updated

Return type:

Styler

napistu.utils.pd_utils.update_pathological_names(names: Series, prefix: str) -> Series

Update pathological names in a pandas Series.

Add a prefix to the names if they are all numeric.
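
One way to read this rule is sketched below. How the real function decides that names are "numeric" is not stated here, so the digit-only test is an assumption, as is returning the names unchanged otherwise. The function name prefix_if_all_numeric is hypothetical.

```python
import pandas as pd


def prefix_if_all_numeric(names, prefix):
    """Hypothetical stand-in: prefix every name when all names are digit-only."""
    as_str = names.astype(str)
    # Assumption: "numeric" means every name consists only of digits
    if as_str.str.isdigit().all():
        return prefix + as_str
    return names


numeric = prefix_if_all_numeric(pd.Series(["1", "2", "3"]), "gene_")
mixed = prefix_if_all_numeric(pd.Series(["1", "abc"]), "gene_")
```

Purely numeric names are awkward as column names or identifiers, so the first Series gains the prefix while the mixed one is returned as-is.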

napistu.utils.pd_utils.validate_merge(left_df: DataFrame, right_df: DataFrame, left_on: str | List[str], right_on: str | List[str], relationship: str, _original_relationship: str | None = None) -> None

Validate merge relationship before performing a merge operation.

Parameters:
  • left_df (pd.DataFrame) – Left DataFrame for merge

  • right_df (pd.DataFrame) – Right DataFrame for merge

  • left_on (str or list of str) – Column name(s) in left_df to merge on

  • right_on (str or list of str) – Column name(s) in right_df to merge on

  • relationship (str) – Expected relationship type to validate:

    • '1:1' (one-to-one): both keys are unique and match exactly

    • '1:m' (one-to-many): left keys are unique and can match to 1 or more right keys

    • 'm:1' (many-to-one): right keys are unique and can match to 1 or more left keys

    • 'm:m' (many-to-many): both keys may have duplicates

    • '1:0' (one-to-zero-or-one): left keys are unique and can match to 0 or more right keys

    • '0:1' (zero-or-one-to-one): right keys are unique and can match to 0 or more left keys

Raises:

ValueError – If relationship validation fails
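
The uniqueness half of these rules can be sketched with DataFrame.duplicated. This is an illustration only, not the library's implementation: the name check_key_uniqueness is hypothetical, it covers only '1:1', '1:m', 'm:1', and 'm:m', and it does not check whether keys actually match across the two tables.

```python
import pandas as pd


def check_key_uniqueness(left_df, right_df, left_on, right_on, relationship):
    """Hypothetical stand-in covering only the key-uniqueness part of the check."""
    left_unique = not left_df.duplicated(subset=left_on).any()
    right_unique = not right_df.duplicated(subset=right_on).any()
    # Which side must have unique keys for each relationship string
    requires = {
        "1:1": (True, True),
        "1:m": (True, False),
        "m:1": (False, True),
        "m:m": (False, False),
    }
    need_left, need_right = requires[relationship]
    if need_left and not left_unique:
        raise ValueError(f"left keys are not unique; expected {relationship}")
    if need_right and not right_unique:
        raise ValueError(f"right keys are not unique; expected {relationship}")


genes = pd.DataFrame({"gene_id": ["g1", "g2"], "symbol": ["A", "B"]})
transcripts = pd.DataFrame({"gene_id": ["g1", "g1", "g2"], "tx": ["t1", "t2", "t3"]})

check_key_uniqueness(genes, transcripts, "gene_id", "gene_id", "1:m")  # passes

try:
    check_key_uniqueness(genes, transcripts, "gene_id", "gene_id", "1:1")
    failed = False
except ValueError:
    failed = True
```

pandas offers a related built-in check: pd.merge accepts a validate argument (for example "one_to_one" or "one_to_many") and raises a MergeError when the keys violate it.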