napistu.matching.mount

Functions

bind_dict_of_wide_results(sbml_dfs, ...[, ...])

Bind a dictionary of wide results to an SBML_dfs object.

bind_wide_results(sbml_dfs, results_df, ...)

Binds wide results to an sbml_dfs object.

resolve_matches(matched_data[, ...])

Resolve many-to-1 and 1-to-many matches in matched data.

napistu.matching.mount._aggregate_grouped_columns(df: DataFrame, column_types: dict, numeric_aggregator: callable, boolean_aggregator: callable, feature_id_var: str = 'feature_id', numeric_agg: str = 'weighted_mean') → DataFrame

Aggregate numeric, boolean, and string columns for a grouped DataFrame. Assumes deduplication by feature_id within each s_id has already been performed. Returns the combined DataFrame.

napistu.matching.mount._classify_columns_modify_types(df: DataFrame, always_string=None)

Classify DataFrame columns into modify types for consensus operations.

Parameters:
  • df (pd.DataFrame) – The DataFrame to classify.

  • always_string (list or set, optional) – Columns to always treat as string type (e.g., ['feature_id']).

Returns:

Dictionary with keys 'numeric', 'boolean', and 'string', whose values are the pd.Index of columns in each category.

Return type:

dict
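
As a rough illustration of this kind of classification, the sketch below sorts columns into the three categories with pandas dtype checks. The helper name classify_columns and the example column names are hypothetical, and the dtype rules are an assumption rather than the library's actual logic.

```python
import pandas as pd
from pandas.api import types as pdt


def classify_columns(df, always_string=None):
    """Hypothetical stand-in: sort columns into numeric, boolean, and string."""
    always_string = set(always_string or [])
    groups = {"numeric": [], "boolean": [], "string": []}
    for col in df.columns:
        if col in always_string:
            groups["string"].append(col)
        # Check bool before numeric: pandas treats bool as a numeric dtype.
        elif pdt.is_bool_dtype(df[col]):
            groups["boolean"].append(col)
        elif pdt.is_numeric_dtype(df[col]):
            groups["numeric"].append(col)
        else:
            groups["string"].append(col)
    return {kind: pd.Index(cols) for kind, cols in groups.items()}


example = pd.DataFrame(
    {
        "feature_id": [1, 2],
        "log2fc": [0.5, -1.2],
        "significant": [True, False],
        "label": ["a", "b"],
    }
)
# feature_id is numeric by dtype but forced into the string category.
column_types = classify_columns(example, always_string=["feature_id"])
```

Forcing identifier columns into the string category keeps them from being averaged along with the measurements.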

napistu.matching.mount._get_boolean_aggregator(method: str = 'first', feature_id_var: str = 'feature_id') → callable

Get aggregation function for boolean columns.

Parameters:
  • method (str, default="first") – Aggregation method to use:

    - “first”: first value after sorting by feature_id (default)
    - “weighted_mean”: treat as first (booleans don’t support weighted averaging)
    - “mean”: treat as first (booleans don’t support arithmetic mean)
    - “max”: treat as first (boolean max is ambiguous)

  • feature_id_var (str, default="feature_id") – Name of the column specifying a measured feature - used for sorting

Returns:

Aggregation function to use with groupby for boolean columns

Return type:

callable
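
Since every documented method falls back to “first”, the behavior can be pictured with a single helper. This is a minimal sketch assuming the aggregator receives one group of rows at a time; first_after_sort, the is_hit column, and the example data are hypothetical.

```python
import pandas as pd


def first_after_sort(group, column, feature_id_var="feature_id"):
    """Return the value of `column` from the row with the smallest feature id."""
    return group.sort_values(feature_id_var)[column].iloc[0]


matched = pd.DataFrame(
    {
        "s_id": ["S1", "S1", "S2"],
        "feature_id": [2, 1, 3],
        "is_hit": [True, False, True],
    }
)

# One boolean per s_id, taken from the lowest feature_id in each group.
resolved = {
    s_id: bool(first_after_sort(group, "is_hit"))
    for s_id, group in matched.groupby("s_id")
}
```

Sorting before taking the first value makes the result independent of the row order of the input.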

napistu.matching.mount._get_numeric_aggregator(method: str = 'weighted_mean', feature_id_var: str = 'feature_id') → callable

Get aggregation function for numeric columns with various methods.

Parameters:
  • method (str, default="weighted_mean") – Aggregation method to use:

    - “weighted_mean”: weighted by inverse of feature_id frequency (default)
    - “mean”: simple arithmetic mean
    - “first”: first value after sorting by feature_id_var (requires feature_id_var)
    - “max”: maximum value

  • feature_id_var (str, default="feature_id") – Name of the column specifying a measured feature - used for sorting and weighting

Returns:

Aggregation function to use with groupby

Return type:

callable

Raises:

ValueError – If method is not recognized
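
One plausible reading of “weighted by inverse of feature_id frequency” is that each row is weighted by 1 / (number of rows sharing its feature_id), so a feature matched to many species counts for less. The pure-Python sketch below works through that reading; how the library actually counts frequency is an assumption here, and the data is invented.

```python
from collections import Counter, defaultdict

# (s_id, feature_id, value); feature f2 matches two species.
rows = [
    ("S1", "f1", 10.0),
    ("S1", "f2", 4.0),
    ("S2", "f2", 4.0),
]

# How often each feature_id appears across the whole table.
freq = Counter(feature_id for _, feature_id, _ in rows)

totals = defaultdict(float)
weights = defaultdict(float)
for s_id, feature_id, value in rows:
    weight = 1.0 / freq[feature_id]  # assumed weighting rule
    totals[s_id] += weight * value
    weights[s_id] += weight

weighted_means = {s_id: totals[s_id] / weights[s_id] for s_id in totals}
```

For S1 this gives (1 × 10 + 0.5 × 4) / 1.5 = 8, compared with a simple mean of 7, because the ambiguous feature f2 is down-weighted.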

napistu.matching.mount._get_wide_results_valid_ontologies(results_df: DataFrame, ontologies: str | list | None = None) → list

Get the valid ontologies for a wide results dataframe.

If ontologies is a string, it will be converted to a list. If ontologies is None, the column names of the results dataframe which match ONTOLOGIES_LIST will be used.

Parameters:
  • results_df (pd.DataFrame) – The results dataframe to get the valid ontologies for.

  • ontologies (optional str, list) – The ontology to use for the species identifiers. If not provided, the column names of the results dataframe which match ONTOLOGIES_LIST will be used.

Returns:

The valid ontologies for the results dataframe.

Return type:

list
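
The normalization described above can be sketched in a few lines. EXAMPLE_ONTOLOGIES is an illustrative stand-in for ONTOLOGIES_LIST (the real vocabulary is defined by the library), normalize_ontologies is a hypothetical name, and any further validation the real function performs is not shown.

```python
# Illustrative stand-in for ONTOLOGIES_LIST; not the library's actual list.
EXAMPLE_ONTOLOGIES = ["uniprot", "ensembl_gene", "chebi"]


def normalize_ontologies(columns, ontologies=None, known=EXAMPLE_ONTOLOGIES):
    """Hypothetical sketch of the str / list / None handling."""
    if isinstance(ontologies, str):
        # A single ontology name becomes a one-element list.
        return [ontologies]
    if ontologies is None:
        # Fall back to the columns that are recognized ontology names.
        return [col for col in columns if col in known]
    return list(ontologies)


detected = normalize_ontologies(["uniprot", "log2fc", "pvalue"])
single = normalize_ontologies(["uniprot", "log2fc"], ontologies="chebi")
```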

napistu.matching.mount._split_numeric_non_numeric_columns(df: DataFrame, always_non_numeric=None)

Utility to split DataFrame columns into numeric and non-numeric, always treating specified columns as non-numeric.

Parameters:
  • df (pd.DataFrame) – The DataFrame to split.

  • always_non_numeric (list or set, optional) – Columns to always treat as non-numeric (e.g., ['feature_id']).

Returns:

  • numeric_cols (pd.Index) – Columns considered numeric (int64, float64, and not in always_non_numeric).

  • non_numeric_cols (pd.Index) – Columns considered non-numeric (object, string, etc., plus always_non_numeric).

napistu.matching.mount.bind_dict_of_wide_results(sbml_dfs: SBML_dfs, results_dict: dict, results_name: str, strategy: str = 'concatenate', species_identifiers: DataFrame = None, ontologies: str | list | None = None, dogmatic: bool = False, inplace: bool = True, verbose=True)

Bind a dictionary of wide results to an SBML_dfs object.

This function is used to bind a dictionary of wide results to 1 or more species_data attributes of an SBML_dfs object. The dictionary should have keys which are the modality names and values which are the results dataframes. The “strategy” argument controls how the results are added to the SBML_dfs object.

Parameters:
  • sbml_dfs (SBML_dfs) – The SBML_dfs object to bind the results to.

  • results_dict (dict) – A dictionary of results dataframes with modality names as keys.

  • results_name (str) – The name of the species_data attribute to bind the results to.

  • strategy (str) –

    The strategy to use for binding the results.

    Options are:

    - “concatenate”: concatenate the results dataframes and add them as a single attribute.
    - “multiple_keys”: add each modality’s results as a separate attribute. The attribute name will be f'{results_name}_{modality}'.
    - “stagger”: add each modality’s results as a separate attribute. The attribute name will be f'{attr_name}_{modality}'.

  • species_identifiers (pd.DataFrame) – A dataframe with species identifiers.

  • ontologies (optional str, list) – The ontology to use for the species identifiers. If not provided, the column names of the results dataframes which match ONTOLOGIES_LIST will be used.

  • dogmatic (bool) – Whether to use dogmatic mode. Ignored if species_identifiers is provided.

  • verbose (bool) – Whether to print verbose output.

  • inplace (bool, default=True) – Whether to modify the sbml_dfs object in place. If False, returns a copy.

Returns:

Optional[SBML_dfs] – If inplace=True, returns None. Otherwise returns the modified copy of sbml_dfs.

Return type:

Optional[SBML_dfs]
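
To make the naming rules concrete, the sketch below lists the species_data attribute names implied by the “concatenate” and “multiple_keys” descriptions. planned_attribute_names is a hypothetical helper, not part of the library, and “stagger” is deliberately left out because its naming is not fully specified here.

```python
def planned_attribute_names(results_dict, results_name, strategy="concatenate"):
    """Hypothetical helper: attribute names implied by the documented strategies."""
    if strategy == "concatenate":
        # All modalities end up under one attribute.
        return [results_name]
    if strategy == "multiple_keys":
        # One attribute per modality, named f"{results_name}_{modality}".
        return [f"{results_name}_{modality}" for modality in results_dict]
    raise ValueError(f"strategy {strategy!r} is not covered by this sketch")


# Placeholder values; in practice these would be results dataframes.
results_dict = {"proteomics": None, "transcriptomics": None}

single_name = planned_attribute_names(results_dict, "omics")
per_modality = planned_attribute_names(results_dict, "omics", "multiple_keys")
```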

napistu.matching.mount.bind_wide_results(sbml_dfs: SBML_dfs, results_df: DataFrame, results_name: str, ontologies: Set[str] | Dict[str, str] | None = None, dogmatic: bool = False, species_identifiers: DataFrame | None = None, feature_id_var: str = 'feature_id', numeric_agg: str = 'weighted_mean', keep_id_col: bool = True, verbose: bool = False, inplace: bool = True) → SBML_dfs | None

Binds wide results to an sbml_dfs object.

Take a table of molecular species-level attributes tied to systematic identifiers, match it to an sbml_dfs model, and transfer the attributes to species_data.

Parameters:
  • sbml_dfs (SBML_dfs) – The sbml_dfs object to bind the results to.

  • results_df (pd.DataFrame) – The table containing the results to bind.

  • results_name (str) – The name of the results to bind.

  • ontologies (Optional[Union[Set[str], Dict[str, str]]], default=None) – One of:

    - Set of columns to treat as ontologies (these should be entries in ONTOLOGIES_LIST)
    - Dict mapping wide column names to ontology names in the ONTOLOGIES_LIST controlled vocabulary
    - None to automatically detect valid ontology columns based on ONTOLOGIES_LIST

  • dogmatic (bool) – Whether to respect differences between genes, transcripts, and proteins (True) or ignore them (False).

  • species_identifiers (Optional[pd.DataFrame]) – Systematic identifiers for the molecular species in sbml_dfs. If None, these will be generated on the fly.

  • feature_id_var (str) – The name of the column in the results_df that contains the feature identifiers. If this does not exist it will be created.

  • numeric_agg (str) – The aggregation method to use for resolving degeneracy.

  • keep_id_col (bool) – Whether to keep the identifier column in the results_df.

  • verbose (bool) – Whether to log cases of 1-to-many and many-to-1 mapping and to indicate how degeneracy is resolved.

  • inplace (bool, default=True) – Whether to modify the sbml_dfs object in place. If False, returns a copy.

Returns:

sbml_dfs – The sbml_dfs object with the results bound if inplace=False; None if inplace=True.

Return type:

Optional[SBML_dfs]
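
The matching step can be pictured as a rename, a reshape, and a join. The pandas sketch below is only an approximation of that idea: the species_identifiers columns (ontology, identifier, s_id), the protein_acc column, and all values are assumptions made for illustration, not the library's internals.

```python
import pandas as pd

# Wide results: one identifier column plus measurements (invented data).
results_df = pd.DataFrame({"protein_acc": ["P1", "P2"], "log2fc": [1.5, -0.3]})

# Dict form of `ontologies`: wide column name -> ontology name.
ontologies = {"protein_acc": "uniprot"}
renamed = results_df.rename(columns=ontologies)

# Create the feature id column when it is missing.
renamed["feature_id"] = range(len(renamed))

# Assumed layout of the species identifiers table.
species_identifiers = pd.DataFrame(
    {
        "ontology": ["uniprot", "uniprot"],
        "identifier": ["P1", "P2"],
        "s_id": ["S1", "S2"],
    }
)

# Reshape identifiers to long form, then join them to species ids.
long_ids = renamed.melt(
    id_vars=["feature_id"],
    value_vars=list(ontologies.values()),
    var_name="ontology",
    value_name="identifier",
)
matches = long_ids.merge(species_identifiers, on=["ontology", "identifier"])

# Attach the measurements to each matched species.
measurements = renamed.drop(columns=list(ontologies.values()))
matched_data = matches[["s_id", "feature_id"]].merge(measurements, on="feature_id")
```

A table shaped like matched_data (one row per s_id and feature_id pair) is the kind of input that degeneracy resolution then collapses to one row per species.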

napistu.matching.mount.resolve_matches(matched_data: DataFrame, feature_id_var: str = 'feature_id', index_col: str = 's_id', numeric_agg: str = 'weighted_mean', keep_id_col: bool = True) → DataFrame

Resolve many-to-1 and 1-to-many matches in matched data.

Parameters:
  • matched_data (pd.DataFrame) – DataFrame containing matched data with columns:

    - feature_id_var: identifier column (e.g. feature_id)
    - index_col: index column (e.g. s_id)
    - other columns: data columns to be aggregated

  • feature_id_var (str, default="feature_id") – Name of the identifier column

  • index_col (str, default="s_id") – Name of the column to use as index

  • numeric_agg (str, default="weighted_mean") – Method to aggregate numeric columns:

    - “weighted_mean”: weighted by inverse of feature_id frequency (default)
    - “mean”: simple arithmetic mean
    - “first”: first value after sorting by feature_id_var (requires feature_id_var)
    - “max”: maximum value

  • keep_id_col (bool, default=True) – Whether to keep and rollup the feature_id_var in the output. If False, feature_id_var will be dropped from the output.

Returns:

DataFrame with resolved matches:

  - Many-to-1: numeric columns are aggregated using specified method
  - 1-to-many: adds a count column showing number of matches
  - Index is set to index_col and named accordingly

Return type:

pd.DataFrame

Raises:
  • KeyError – If feature_id_var is not present in the DataFrame

  • TypeError – If DataFrame contains unsupported data types (boolean or datetime)
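
The shape of the output can be sketched with a plain pandas groupby. This example uses the simple “mean” method for brevity; the comma-joined rollup of feature ids and the feature_id_match_count column name are assumptions for illustration and may differ from what the library produces.

```python
import pandas as pd

# Two features map to S1 (many-to-1); feature 2 maps to S1 and S2 (1-to-many).
matched_data = pd.DataFrame(
    {
        "s_id": ["S1", "S1", "S2"],
        "feature_id": [1, 2, 2],
        "log2fc": [1.0, 3.0, 3.0],
    }
)

resolved = matched_data.groupby("s_id").agg(
    # Collapse numeric values to one per species.
    log2fc=("log2fc", "mean"),
    # Roll up contributing feature ids (assumed format).
    feature_id=("feature_id", lambda ids: ",".join(str(i) for i in sorted(ids))),
    # Number of distinct features behind each row (hypothetical column name).
    feature_id_match_count=("feature_id", "nunique"),
)
```

The result has one row per s_id, with the index named after the grouping column.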