napistu.utils.pd_utils
Utilities for pandas DataFrame operations.
Classes
- match_pd_vars: class
Match pandas variables - check if required variables are present in a DataFrame or Series.
Public Functions
- check_unique_index(df: pd.DataFrame, label: str = "") -> None:
Validate that each index value only maps to a single row.
- downcast_float_dataframe(df: pd.DataFrame, dtype: Any = np.float32, copy: bool = True) -> pd.DataFrame:
Coerce DataFrame values to a narrower float dtype to reduce in-memory size.
- drop_extra_cols(df_in: pd.DataFrame, df_out: pd.DataFrame, always_include: Optional[List[str]] = None) -> pd.DataFrame:
Remove columns in df_out that are not in df_in, except those specified in always_include.
- ensure_pd_df(pd_df_or_series: pd.DataFrame | pd.Series) -> pd.DataFrame:
Ensure pandas DataFrame by converting a Series to DataFrame if needed.
- format_identifiers_as_edgelist(df: pd.DataFrame, defining_vars: list[str], verbose: bool = False) -> pd.DataFrame:
Format identifiers as edgelist by collapsing multiindex and multiple variables.
- infer_entity_type(df: pd.DataFrame) -> str:
Infer the entity type of a DataFrame based on its structure and schema.
- matrix_to_edgelist(matrix: np.ndarray, row_labels: Optional[List] = None, col_labels: Optional[List] = None) -> pd.DataFrame:
Convert a matrix to an edgelist format.
- safe_series_tolist(x: str | pd.Series) -> list:
Convert either a pd.Series or str to a list.
- series_to_none_filled_list(series: pd.Series) -> list:
Convert a pandas Series to a Python list, replacing all NA-like values with None.
- style_df(df: pd.DataFrame, headers: Union[str, list[str], None] = "keys", hide_index: bool = False) -> Styler:
Style a pandas DataFrame with simple formatting options.
- update_pathological_names(names: pd.Series, prefix: str) -> pd.Series:
Update pathological names in a pandas Series by adding a prefix if all numeric.
- validate_merge(left_df: pd.DataFrame, right_df: pd.DataFrame, left_on: Union[str, List[str]], right_on: Union[str, List[str]], relationship: Optional[str] = None) -> None:
Validate merge relationship before performing a merge operation.
Functions
check_unique_index | Validate that each index value only maps to a single row.
downcast_float_dataframe | Coerce values to a narrower float dtype (default float32) to cut RAM usage.
drop_extra_cols | Remove columns in df_out that are not in df_in, except those specified in always_include.
ensure_pd_df | Ensure Pandas DataFrame
format_identifiers_as_edgelist | Format Identifiers as Edgelist
infer_entity_type | Infer the entity type of a DataFrame based on its structure and schema.
matrix_to_edgelist | Convert a matrix to an edgelist format.
safe_series_tolist | Convert either a pd.Series or str to a list.
series_to_none_filled_list | Convert a pandas Series to a Python list, replacing all NA-like values with None.
style_df | Style DataFrame
update_pathological_names | Update pathological names in a pandas Series.
validate_merge | Validate merge relationship before performing a merge operation.
Classes
match_pd_vars | Match Pandas Variables.
- class napistu.utils.pd_utils.match_pd_vars(df: DataFrame | Series, req_vars: set, allow_series: bool = True)
Bases: object
Match Pandas Variables.
- req_vars
A set of variables which should exist in df
- missing_vars
Required variables which are not present in df
- extra_vars
Non-required variables which are present in df
- are_present
Returns True if req_vars are present and False otherwise
- assert_present()
Raise an exception if req_vars are absent
- __init__(df: DataFrame | Series, req_vars: set, allow_series: bool = True) -> None
Compare the variables in df against req_vars
- Parameters:
df – A pd.DataFrame or pd.Series
req_vars – A set of variables which should exist in df
allow_series – Can a pd.Series be provided as df?
- Return type:
None.
- assert_present() -> None
Raise an error if required variables are missing
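The attributes listed above amount to set comparisons between req_vars and the labels found in df. The following is a minimal sketch of that logic for the DataFrame case only, not napistu's implementation; the helper name compare_vars is hypothetical, and it is assumed that "variables" means column names.

```python
import pandas as pd


def compare_vars(df: pd.DataFrame, req_vars: set) -> dict:
    """Hypothetical helper: compare required variables against df's columns."""
    present = set(df.columns)
    missing = req_vars - present
    return {
        "missing_vars": missing,            # required but absent from df
        "extra_vars": present - req_vars,   # present in df but not required
        "are_present": len(missing) == 0,   # True only if nothing is missing
    }


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
result = compare_vars(df, {"a", "d"})
```

Here "d" is required but absent, so are_present is False, while "b" and "c" are reported as extra.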
- napistu.utils.pd_utils._merge_and_log_overwrites(left_df: DataFrame, right_df: DataFrame, merge_context: str, **merge_kwargs) -> DataFrame
Merge two DataFrames and log any column overwrites.
- Parameters:
left_df (pd.DataFrame) – Left DataFrame for merge
right_df (pd.DataFrame) – Right DataFrame for merge
merge_context (str) – Description of the merge operation for logging
**merge_kwargs (dict) – Additional keyword arguments passed to pd.merge
- Returns:
Merged DataFrame with overwritten columns removed
- Return type:
pd.DataFrame
- napistu.utils.pd_utils.check_unique_index(df, label='')
Validate that each index value only maps to a single row.
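One way such a check could be written with plain pandas is sketched below. It is an illustration under assumptions rather than the library's code: the helper name find_duplicated_index_values is hypothetical, and how check_unique_index reports a failure (exception type, use of label) is not documented here.

```python
import pandas as pd


def find_duplicated_index_values(df: pd.DataFrame) -> list:
    """Hypothetical helper: return index values that map to more than one row."""
    # keep=False marks every occurrence of a repeated index value
    dup_mask = df.index.duplicated(keep=False)
    return df.index[dup_mask].unique().tolist()


unique_df = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
dup_df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "a", "b"])

unique_dups = find_duplicated_index_values(unique_df)
dup_dups = find_duplicated_index_values(dup_df)
```

An empty list means every index value maps to exactly one row; a validator could raise whenever the list is non-empty.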
- napistu.utils.pd_utils.downcast_float_dataframe(df: DataFrame, dtype: Any = np.float32, copy: bool = True) -> DataFrame
Coerce values to a narrower float dtype (default float32) to cut RAM usage.
Typical use is large all-numeric tables (e.g. stacked propagation scores) where float64 is unnecessary. Uses pandas.DataFrame.astype() with the given dtype and copy semantics.
- Parameters:
df – DataFrame to coerce.
dtype – Numpy float dtype, commonly np.float32 to halve size versus float64.
copy – If True (default), always returns a new object even when dtype matches.
- Returns:
DataFrame with values in dtype.
- Return type:
pd.DataFrame
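The size claim can be checked directly with pandas.DataFrame.astype(), which the description names as the underlying call. This sketch calls astype() itself rather than downcast_float_dataframe, so the table shape and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# A synthetic all-numeric table; NumPy random floats are float64 by default
scores = pd.DataFrame(np.random.default_rng(0).random((1000, 4)))

# Coerce every column to float32
scores32 = scores.astype(np.float32)

# Compare the size of the underlying value buffers
bytes64 = scores.to_numpy().nbytes
bytes32 = scores32.to_numpy().nbytes
```

Each float32 value takes 4 bytes instead of 8, so the value buffer is half the size. The trade-off is precision: float32 carries roughly 7 significant decimal digits.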
- napistu.utils.pd_utils.drop_extra_cols(df_in: DataFrame, df_out: DataFrame, always_include: List[str] | None = None) -> DataFrame
Remove columns in df_out that are not in df_in, except those specified in always_include.
- Parameters:
df_in (pd.DataFrame) – Reference DataFrame whose columns determine what to keep
df_out (pd.DataFrame) – DataFrame to filter columns from
always_include (Optional[List[str]], optional) – List of column names to always include in output, even if not in df_in
- Returns:
DataFrame with columns filtered to match df_in plus any always_include columns. Column order follows df_in, with always_include columns appended at the end.
- Return type:
pd.DataFrame
Examples
>>> import pandas as pd
>>> from napistu.utils.pd_utils import drop_extra_cols
>>> df_in = pd.DataFrame({'a': [1], 'b': [2]})
>>> df_out = pd.DataFrame({'a': [3], 'c': [4], 'd': [5]})
>>> drop_extra_cols(df_in, df_out)  # Returns DataFrame with just column 'a'
>>> drop_extra_cols(df_in, df_out, always_include=['d'])  # Returns DataFrame with columns ['a', 'd']
- napistu.utils.pd_utils.ensure_pd_df(pd_df_or_series: DataFrame | Series) -> DataFrame
Ensure Pandas DataFrame
Convert a pd.Series to a DataFrame if needed.
- Parameters:
pd_df_or_series (pd.Series | pd.DataFrame) – a pandas df or series
- Returns:
pd_df converted to a pd.DataFrame if needed
- napistu.utils.pd_utils.format_identifiers_as_edgelist(df: DataFrame, defining_vars: list[str], verbose: bool = False) -> DataFrame
Format Identifiers as Edgelist
Collapse a multiindex to a single index (if needed), and similarly collapse multiple variables to a single entry. This indexed pd.Series of index-id pairs can be treated as an edgelist for greedy clustering.
- Parameters:
df (pd.DataFrame) – Any pd.DataFrame
defining_vars (list[str]) – A set of attributes which define a distinct entry in df
verbose (bool, default=False) – If True, then include detailed logs.
- Returns:
df – A pd.DataFrame with an “ind” and “id” variable added indicating rolled up values of the index and defining_vars
- Return type:
pd.DataFrame
- napistu.utils.pd_utils.infer_entity_type(df: DataFrame) -> str
Infer the entity type of a DataFrame based on its structure and schema.
- Parameters:
df (pd.DataFrame) – The DataFrame to analyze
- Returns:
The inferred entity type name
- Return type:
str
- Raises:
ValueError – If no entity type can be determined
- napistu.utils.pd_utils.matrix_to_edgelist(matrix, row_labels=None, col_labels=None)
Convert a matrix to an edgelist format.
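Since this entry carries no parameter details, the general idea of turning a matrix into a long table can be sketched independently. Everything below is an assumption for illustration: the helper name matrix_to_long, the output column names ("row", "column", "value"), falling back to integer positions when labels are omitted, and keeping every cell (including zeros) may all differ from what matrix_to_edgelist does.

```python
import numpy as np
import pandas as pd


def matrix_to_long(matrix: np.ndarray, row_labels=None, col_labels=None) -> pd.DataFrame:
    """Hypothetical sketch: emit one output row per matrix cell."""
    n_rows, n_cols = matrix.shape
    rows = list(range(n_rows)) if row_labels is None else list(row_labels)
    cols = list(range(n_cols)) if col_labels is None else list(col_labels)
    # Row-major flattening: repeat each row label, tile the column labels
    return pd.DataFrame(
        {
            "row": np.repeat(rows, n_cols),
            "column": np.tile(cols, n_rows),
            "value": matrix.ravel(),
        }
    )


edgelist = matrix_to_long(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    row_labels=["r1", "r2"],
    col_labels=["c1", "c2"],
)
```

A 2 x 2 matrix yields four rows, one per (row, column) pair, in row-major order.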
- napistu.utils.pd_utils.safe_series_tolist(x)
Convert either a pd.Series or str to a list.
- napistu.utils.pd_utils.series_to_none_filled_list(series: Series) -> list
Convert a pandas Series to a Python list, replacing all NA-like values with None.
Pandas represents missing values as NaN (a float) in numeric Series, but many downstream consumers (e.g. igraph, JSON serialization) expect None for missing values. This function normalizes missing values consistently across all dtypes.
- Parameters:
series (pd.Series) – The Series to convert. May be any dtype.
- Returns:
Python list with NaN, pd.NA, and other NA-like values replaced by None.
- Return type:
list
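The normalization described above can be approximated with pd.isna(), which recognizes NaN, pd.NA, and None. This is a sketch of the idea, not the function's actual body; the helper name none_filled is hypothetical.

```python
import numpy as np
import pandas as pd


def none_filled(series: pd.Series) -> list:
    """Hypothetical sketch: replace NA-like entries with None."""
    return [None if pd.isna(value) else value for value in series]


# float Series: the missing value is NaN
numeric = none_filled(pd.Series([1.0, np.nan, 3.0]))
# nullable integer Series: the missing value is pd.NA
nullable = none_filled(pd.Series([1, pd.NA, 3], dtype="Int64"))
# object Series: the missing value is already None
strings = none_filled(pd.Series(["a", None, "c"], dtype=object))
```

All three inputs end up with None in the middle position, whereas calling Series.tolist() on the float Series would leave a NaN float there.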
- napistu.utils.pd_utils.style_df(df: DataFrame, headers: str | list[str] | None = 'keys', hide_index: bool = False) -> Styler
Style DataFrame
Provide some simple options for styling a pd.DataFrame
- Parameters:
df (pd.DataFrame) – A table to style
headers (Union[str, list[str], None]) –
"keys" to use the current column names
None to suppress column names
list[str] to overwrite and show column names
hide_index (bool) – If True, hide the row index in the styled output.
- Returns:
styled_df – df with styles updated
- Return type:
Styler
- napistu.utils.pd_utils.update_pathological_names(names: Series, prefix: str) -> Series
Update pathological names in a pandas Series.
Add a prefix to the names if they are all numeric.
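The rule can be illustrated with a short sketch. It assumes "numeric" means every name consists only of digits (tested with Series.str.isdigit()); the real function may use a different test, and the helper name prefix_if_all_numeric is hypothetical.

```python
import pandas as pd


def prefix_if_all_numeric(names: pd.Series, prefix: str) -> pd.Series:
    """Hypothetical sketch: prefix every name only when all names are digits."""
    as_str = names.astype(str)
    if as_str.str.isdigit().all():
        return prefix + as_str
    # At least one name is non-numeric: leave the Series untouched
    return names


numeric_names = prefix_if_all_numeric(pd.Series(["1", "2", "10"]), "sample_")
mixed_names = prefix_if_all_numeric(pd.Series(["1", "geneA"]), "sample_")
```

Purely numeric labels are easy to confuse with positional indices downstream, which is presumably why they are treated as pathological.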
- napistu.utils.pd_utils.validate_merge(left_df: DataFrame, right_df: DataFrame, left_on: str | List[str], right_on: str | List[str], relationship: str, _original_relationship: str | None = None) -> None
Validate merge relationship before performing a merge operation.
- Parameters:
left_df (pd.DataFrame) – Left DataFrame for merge
right_df (pd.DataFrame) – Right DataFrame for merge
left_on (str or list of str) – Column name(s) in left_df to merge on
right_on (str or list of str) – Column name(s) in right_df to merge on
relationship (str) – Expected relationship type to validate:
'1:1' (one-to-one): both keys are unique and match exactly
'1:m' (one-to-many): left keys are unique and can match to 1 or more right keys
'm:1' (many-to-one): right keys are unique and can match to 1 or more left keys
'm:m' (many-to-many): both keys may have duplicates
'1:0' (one-to-zero-or-one): left keys are unique and can match to 0 or more right keys
'0:1' (zero-or-one-to-one): right keys are unique and can match to 0 or more left keys
- Raises:
ValueError – If relationship validation fails
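The uniqueness side of these relationship codes can be sketched with DataFrame.duplicated(). This is an illustration under assumptions, not validate_merge itself: the helper names keys_are_unique and check_relationship are hypothetical, and the sketch checks key uniqueness only, ignoring whether keys actually match across the two tables (so the '1:0' and '0:1' distinctions are not modeled).

```python
import pandas as pd


def keys_are_unique(df: pd.DataFrame, on) -> bool:
    """Hypothetical helper: True if no key combination repeats."""
    cols = [on] if isinstance(on, str) else list(on)
    return not df.duplicated(subset=cols).any()


def check_relationship(left_df, right_df, left_on, right_on, relationship: str) -> None:
    """Hypothetical sketch: enforce uniqueness on whichever side is marked '1'."""
    left_side, right_side = relationship.split(":")
    if left_side == "1" and not keys_are_unique(left_df, left_on):
        raise ValueError(f"left keys are not unique for a {relationship} merge")
    if right_side == "1" and not keys_are_unique(right_df, right_on):
        raise ValueError(f"right keys are not unique for a {relationship} merge")


genes = pd.DataFrame({"gene_id": ["g1", "g2"], "symbol": ["A", "B"]})
transcripts = pd.DataFrame({"gene_id": ["g1", "g1", "g2"], "tx": ["t1", "t2", "t3"]})

# One gene maps to several transcripts, so '1:m' passes silently
check_relationship(genes, transcripts, "gene_id", "gene_id", "1:m")

# The same pair fails a '1:1' check because the right keys repeat
try:
    check_relationship(genes, transcripts, "gene_id", "gene_id", "1:1")
    one_to_one_ok = True
except ValueError:
    one_to_one_ok = False
```

pandas offers a related built-in check through the validate argument of DataFrame.merge (for example validate="1:m"), which raises at merge time rather than beforehand.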