napistu.utils

Napistu utilities package.

Submodules provide focused helpers (io_utils, path_utils, pd_utils, etc.). Optional dependencies are imported via napistu.utils.optional, not re-exported here.

class napistu.utils.match_pd_vars(df: DataFrame | Series, req_vars: set, allow_series: bool = True)

Bases: object

Match Pandas Variables.

req_vars

A set of variables which should exist in df

missing_vars

Required variables which are not present in df

extra_vars

Non-required variables which are present in df

are_present

Returns True if req_vars are present and False otherwise

assert_present()

Raise an exception if req_vars are absent

__init__(df: DataFrame | Series, req_vars: set, allow_series: bool = True) None

Compare req_vars against the variables present in df

Parameters:
  • df – A pd.DataFrame or pd.Series

  • req_vars – A set of variables which should exist in df

  • allow_series – Can a pd.Series be provided as df?

Return type:

None.

assert_present() None

Raise an error if required variables are missing
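The attributes above amount to set arithmetic between the required names and the names actually present. The following is a minimal sketch of that comparison using plain Python sets in place of df.columns; compare_vars is a hypothetical helper for illustration, not part of napistu.

```python
def compare_vars(present: set, required: set) -> dict:
    """Hypothetical helper: compare required names against those present."""
    missing = required - present  # required but absent
    extra = present - required    # present but not required
    return {
        "missing_vars": missing,
        "extra_vars": extra,
        "are_present": len(missing) == 0,
    }


# Column names standing in for df.columns
result = compare_vars(present={"a", "b", "c"}, required={"a", "d"})
print(result["missing_vars"])  # {'d'}
print(result["are_present"])   # False
```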

napistu.utils.check_unique_index(df, label='')

Validate that each index value only maps to a single row.
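As a rough illustration of the check described above, the sketch below works on a plain list of index values rather than a DataFrame. The helper names and the choice of ValueError are assumptions; the error that napistu raises is not stated here.

```python
from collections import Counter


def duplicated_index_values(index_values: list) -> list:
    """Hypothetical helper: return index values that occur more than once."""
    counts = Counter(index_values)
    return sorted(value for value, n in counts.items() if n > 1)


def assert_unique_index(index_values: list, label: str = "") -> None:
    """Raise if any index value maps to more than one row (assumed behavior)."""
    duplicates = duplicated_index_values(index_values)
    if duplicates:
        raise ValueError(
            f"{len(duplicates)} index value(s) in {label!r} map to multiple rows: {duplicates}"
        )


assert_unique_index(["r1", "r2", "r3"], label="reactions")  # passes silently
print(duplicated_index_values(["r1", "r2", "r1"]))  # ['r1']
```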

napistu.utils.copy_uri(input_uri: str, output_uri: str, is_file: bool = True) None

Copy a file or folder from one URI to another.

Parameters:
  • input_uri (str) – Input file URI (e.g., ‘gs://bucket/file’, ‘/local/path’, ‘memory://path’).

  • output_uri (str) – Output file URI (e.g., ‘gs://bucket/file’, ‘/local/path’, ‘memory://path’).

  • is_file (bool, default=True) – If True, copy a single file. If False, copy directory recursively.

Examples

>>> copy_uri('/local/source.txt', '/local/dest.txt')
>>> copy_uri('gs://bucket/source/', 'gs://bucket/dest/', is_file=False)
napistu.utils.downcast_float_dataframe(df: DataFrame, dtype: Any = <class 'numpy.float32'>, copy: bool = True) DataFrame

Coerce values to a narrower float dtype (default float32) to cut RAM usage.

Typical use is large all-numeric tables (e.g. stacked propagation scores) where float64 is unnecessary. Uses pandas.DataFrame.astype() with the given dtype and copy semantics.

Parameters:
  • df – DataFrame to coerce.

  • dtype – Numpy float dtype, commonly np.float32 to halve size versus float64.

  • copy – If True (default), always returns a new object even when dtype matches.

Returns:

DataFrame with values in dtype.

Return type:

pd.DataFrame

napistu.utils.download_and_extract(url: str, output_dir_path: str = '.', download_method: str = 'wget', overwrite: bool = False) None

Download archive and extract to directory.

napistu.utils.download_ftp(url: str, path: str) None

Download a file from an FTP server.

Parameters:
  • url (str) – URL of the file to download

  • path (str) – Path to the output file

Return type:

None

napistu.utils.download_wget(url: str, path, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) None

Download a file or archive with wget.

Parameters:
  • url (str) – URL of the file to download

  • path (FilePath | WriteBuffer) – File path or buffer

  • target_filename (str) – Specific file to extract from ZIP if URL is a ZIP file

  • verify (bool) – Whether to verify the server's certificate when making the request

  • timeout (int) – Timeout in seconds for the request

  • max_retries (int) – Number of times to retry the download if it fails

Return type:

None

napistu.utils.drop_extra_cols(df_in: DataFrame, df_out: DataFrame, always_include: List[str] | None = None) DataFrame

Remove columns in df_out that are not in df_in, except those specified in always_include.

Parameters:
  • df_in (pd.DataFrame) – Reference DataFrame whose columns determine what to keep

  • df_out (pd.DataFrame) – DataFrame to filter columns from

  • always_include (Optional[List[str]], optional) – List of column names to always include in output, even if not in df_in

Returns:

DataFrame with columns filtered to match df_in plus any always_include columns. Column order follows df_in, with always_include columns appended at the end.

Return type:

pd.DataFrame

Examples

>>> df_in = pd.DataFrame({'a': [1], 'b': [2]})
>>> df_out = pd.DataFrame({'a': [3], 'c': [4], 'd': [5]})
>>> drop_extra_cols(df_in, df_out)
# Returns DataFrame with just column 'a'
>>> drop_extra_cols(df_in, df_out, always_include=['d'])
# Returns DataFrame with columns ['a', 'd']
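
The column-selection rule described above (keep columns shared with df_in in df_in's order, then append always_include columns) can be sketched on plain lists of column names. select_columns is a hypothetical helper, and skipping always_include names that are absent from df_out is an assumption.

```python
from typing import List, Optional


def select_columns(
    in_cols: List[str],
    out_cols: List[str],
    always_include: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: pick which columns of df_out to keep."""
    always_include = always_include or []
    # Columns shared with the reference, in the reference's order
    kept = [col for col in in_cols if col in out_cols]
    # Append always-included columns that exist and are not already kept
    kept += [col for col in always_include if col in out_cols and col not in kept]
    return kept


print(select_columns(["a", "b"], ["a", "c", "d"]))         # ['a']
print(select_columns(["a", "b"], ["a", "c", "d"], ["d"]))  # ['a', 'd']
```

With pandas, the resulting list could then be used as df_out[kept].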
napistu.utils.ensure_pd_df(pd_df_or_series: DataFrame | Series) DataFrame

Ensure Pandas DataFrame

Convert a pd.Series to a DataFrame if needed.

Parameters:

pd_df_or_series (pd.Series | pd.DataFrame) – a pandas df or series

Returns:

pd_df converted to a pd.DataFrame if needed

napistu.utils.extract(file_uri: str) None

Extract archive at file_uri to same directory.

Supports: .tar.gz, .tgz, .zip, .gz

napistu.utils.extract_regex_match(regex: str, query: str) str

Match an identifier substring, and raise an error if there is no match.

Parameters:
  • regex (str) – regular expression to search

  • query (str) – string to search against

  • index_value (int) – entry in index to return

Returns:

a character string match

Return type:

match (str)
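
A minimal sketch of the match-or-raise pattern described above, built on the standard re module. The helper name, the use of re.search, and the ValueError are assumptions rather than the documented behavior of extract_regex_match.

```python
import re


def first_regex_match(regex: str, query: str) -> str:
    """Hypothetical helper: return the first match of regex in query."""
    match = re.search(regex, query)
    if match is None:
        # Assumed error type; the source only says an error is thrown
        raise ValueError(f"{query!r} does not match the regex {regex!r}")
    return match.group(0)


print(first_regex_match(r"ENSG[0-9]+", "gene:ENSG00000139618.2"))  # ENSG00000139618
```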

napistu.utils.find_weakly_connected_subgraphs(edgelist: DataFrame) DataFrame

Find all cliques of loosely connected components.
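
To illustrate what weakly connected grouping means, here is a small union-find sketch over (source, target) pairs. It ignores edge direction and returns sorted groups of nodes. This is a generic technique shown for illustration; it is not a description of how napistu implements the function, and the input and output shapes differ from the DataFrame interface above.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple


def weakly_connected_components(
    edges: Iterable[Tuple[Hashable, Hashable]],
) -> List[list]:
    """Group nodes that are linked by any chain of edges, ignoring direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_source] = root_target  # merge the two groups

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


print(weakly_connected_components([("a", "b"), ("c", "b"), ("d", "e")]))
# [['a', 'b', 'c'], ['d', 'e']]
```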

napistu.utils.format_identifiers_as_edgelist(df: DataFrame, defining_vars: list[str], verbose: bool = False) DataFrame

Format Identifiers as Edgelist

Collapse a multiindex to an index (if needed), and similarly collapse multiple variables to a single entry. This indexed pd.Series of index - ids can be treated as an edgelist for greedy clustering.

Parameters:
  • df (pd.DataFrame) – Any pd.DataFrame

  • defining_vars (list[str]) – A set of attributes which define a distinct entry in df

  • verbose (bool, default=False) – If True, then include detailed logs.

Returns:

df – A pd.DataFrame with an “ind” and “id” variable added indicating rolled up values of the index and defining_vars

Return type:

pd.DataFrame

napistu.utils.get_extn_from_url(url: str) str

Retrieve file extension from a URL.

Parameters:

url (str) – URL to extract extension from.

Returns:

File extension including the leading dot (e.g., ‘.gz’, ‘.tar.gz’).

Return type:

str

Raises:

ValueError – If no file extension can be identified in the URL.

Examples

>>> get_extn_from_url('https://test/test.gz')
'.gz'
>>> get_extn_from_url('https://test/test.tar.gz')
'.tar.gz'
>>> get_extn_from_url('https://test/test.tar.gz/bla')
Traceback (most recent call last):
...
ValueError: File extension not identifiable: https://test/test.tar.gz/bla
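
The behavior shown in the examples above can be approximated with a single regular expression that prefers '.tar.gz' over a plain trailing extension. This is a sketch under that assumption; extn_from_url is a hypothetical name and the pattern is not taken from the napistu source.

```python
import re


def extn_from_url(url: str) -> str:
    """Hypothetical helper: return the trailing file extension of a URL."""
    # Try the compound '.tar.gz' first, then any alphanumeric extension,
    # anchored at the end of the URL
    match = re.search(r"\.(tar\.gz|[A-Za-z0-9]+)$", url)
    if match is None:
        raise ValueError(f"File extension not identifiable: {url}")
    return match.group(0)


print(extn_from_url("https://test/test.gz"))      # .gz
print(extn_from_url("https://test/test.tar.gz"))  # .tar.gz
```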
napistu.utils.get_source_base_and_path(uri: str) tuple[str, str]

Get the base of a bucket or folder and the path to the file.

For URIs with a scheme (e.g., ‘gs://’), returns the scheme + netloc as base. For local paths, returns the directory as base.

Parameters:

uri (str) – URI or path to parse.

Returns:

A tuple of (base, path) where:

  • base (str) – The base URI or directory (e.g., ‘gs://bucket’ or ‘/local/dir’).

  • path (str) – The relative path to the file (e.g., ‘folder/file’ or ‘file’).

Return type:

tuple[str, str]

Examples

>>> get_source_base_and_path("gs://bucket/folder/file")
('gs://bucket', 'folder/file')
>>> get_source_base_and_path("/bucket/folder/file")
('/bucket/folder', 'file')
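
The two cases described above (URI with a scheme versus local path) can be sketched with urllib.parse.urlsplit and posixpath.split. source_base_and_path is a hypothetical name, and the sketch assumes POSIX-style separators only.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlsplit


def source_base_and_path(uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split a URI into (base, path)."""
    parts = urlsplit(uri)
    if parts.scheme:
        # Scheme + netloc form the base; the rest is the relative path
        return f"{parts.scheme}://{parts.netloc}", parts.path.lstrip("/")
    # Local path: the directory is the base and the filename is the path
    return posixpath.split(uri)


print(source_base_and_path("gs://bucket/folder/file"))  # ('gs://bucket', 'folder/file')
print(source_base_and_path("/bucket/folder/file"))      # ('/bucket/folder', 'file')
```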
napistu.utils.get_target_base_and_path(uri: str) tuple[str, str]

Get the base directory + parent path and the filename.

Splits the URI at the last path separator to extract the filename.

Parameters:

uri (str) – URI or path to parse.

Returns:

A tuple of (base, filename) where:

  • base (str) – The directory path (e.g., ‘gs://bucket/folder’ or ‘/local/folder’).

  • filename (str) – The filename (e.g., ‘file’).

Return type:

tuple[str, str]

Examples

>>> get_target_base_and_path("gs://bucket/folder/file")
('gs://bucket/folder', 'file')
>>> get_target_base_and_path("bucket/folder/file")
('bucket/folder', 'file')
>>> get_target_base_and_path("/bucket/folder/file")
('/bucket/folder', 'file')
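
Splitting at the last path separator, as described above, can be sketched with str.rpartition. target_base_and_filename is a hypothetical name used only for this illustration.

```python
from typing import Tuple


def target_base_and_filename(uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split a URI at its last '/'."""
    base, _, filename = uri.rpartition("/")
    return base, filename


print(target_base_and_filename("gs://bucket/folder/file"))  # ('gs://bucket/folder', 'file')
print(target_base_and_filename("bucket/folder/file"))       # ('bucket/folder', 'file')
```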
napistu.utils.gunzip(gzipped_path: str, outpath: str | None = None) None

Gunzip a file to an output path.

Parameters:
  • gzipped_path (str) – Path or URI to the gzipped file (e.g., ‘/local/file.gz’, ‘gs://bucket/file.gz’).

  • outpath (str | None, optional) – Path or URI to the output file. If None, automatically determined by removing the .gz extension from gzipped_path.

Return type:

None

Raises:

FileNotFoundError – If gzipped_path does not exist.

Examples

>>> gunzip('/tmp/data.txt.gz')  # Creates /tmp/data.txt
>>> gunzip('gs://bucket/data.txt.gz', 'gs://bucket/output.txt')
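
For local files, the behavior described above can be sketched with the standard gzip and shutil modules. gunzip_local is a hypothetical stand-in that handles local paths only (no remote URIs), and the ValueError for inputs without a .gz suffix is an assumption.

```python
import gzip
import os
import shutil
import tempfile
from typing import Optional


def gunzip_local(gzipped_path: str, outpath: Optional[str] = None) -> None:
    """Hypothetical local-only sketch: decompress a .gz file."""
    if not os.path.exists(gzipped_path):
        raise FileNotFoundError(gzipped_path)
    if outpath is None:
        if not gzipped_path.endswith(".gz"):
            raise ValueError(f"Cannot infer an output path from {gzipped_path}")
        outpath = gzipped_path[: -len(".gz")]  # drop the .gz extension
    with gzip.open(gzipped_path, "rb") as src, open(outpath, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream so large files stay off the heap


# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "data.txt.gz")
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b"hello napistu")
    gunzip_local(gz_path)  # creates data.txt next to the archive
    with open(os.path.join(tmp_dir, "data.txt"), "rb") as handle:
        roundtrip = handle.read()

print(roundtrip)  # b'hello napistu'
```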
napistu.utils.infer_entity_type(df: DataFrame) str

Infer the entity type of a DataFrame based on its structure and schema.

Parameters:

df (pd.DataFrame) – The DataFrame to analyze

Returns:

The inferred entity type name

Return type:

str

Raises:

ValueError – If no entity type can be determined

napistu.utils.initialize_dir(output_dir_path: str, overwrite: bool) None

Initialize a filesystem directory.

Creates a new directory or optionally overwrites an existing one. Works with any fsspec-supported filesystem (local, GCS, S3, etc.).

Parameters:
  • output_dir_path (str) – Path or URI to the directory to create (e.g., ‘/local/path’, ‘gs://bucket/path’).

  • overwrite (bool) – If True, delete and recreate the directory if it exists. If False, raise FileExistsError if the directory exists.

Raises:

FileExistsError – If directory exists and overwrite is False.

Examples

>>> initialize_dir('/tmp/newdir', overwrite=False)
>>> initialize_dir('gs://bucket/path', overwrite=True)
napistu.utils.load_json(uri: str) Any

Read JSON from a URI.

Parameters:

uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).

Returns:

The parsed JSON object (dict, list, etc.).

Return type:

Any

Examples

>>> data = load_json('/tmp/config.json')
>>> data = load_json('gs://bucket/config.json')
napistu.utils.load_parquet(uri: str | Path) DataFrame

Read a DataFrame from a Parquet file.

Parameters:

uri (Union[str, Path]) – Path or URI to the Parquet file to load (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’).

Returns:

The DataFrame loaded from the Parquet file.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If the specified file does not exist.

Examples

>>> df = load_parquet('/tmp/data.parquet')
>>> df = load_parquet('gs://bucket/data.parquet')
napistu.utils.load_pickle(path: str) Any

Load a pickle object from a path or URI.

Parameters:

path (str) – Path or URI to the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).

Returns:

The unpickled object.

Return type:

Any

Examples

>>> obj = load_pickle('/tmp/data.pkl')
>>> obj = load_pickle('gs://bucket/data.pkl')
napistu.utils.match_regex_dict(s: str, regex_dict: Dict[str, any]) any | None

Apply each regex in regex_dict to the string s. If a regex matches, return its value. If no regex matches, return None.

Parameters:
  • s (str) – The string to test.

  • regex_dict (dict) – Dictionary where keys are regex patterns (str), and values are the values to return.

Returns:

The value associated with the first matching regex, or None if no match.
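
A minimal sketch of the first-match lookup described above. It assumes re.search semantics and dictionary insertion order; first_matching_value and the example patterns and labels are hypothetical.

```python
import re
from typing import Any, Dict, Optional


def first_matching_value(s: str, regex_dict: Dict[str, Any]) -> Optional[Any]:
    """Hypothetical helper: return the value of the first pattern matching s."""
    for pattern, value in regex_dict.items():
        if re.search(pattern, s):
            return value
    return None


# Illustrative patterns and labels only
patterns = {r"^ENSG[0-9]+$": "gene", r"^ENST[0-9]+$": "transcript"}
print(first_matching_value("ENST00000380152", patterns))  # transcript
print(first_matching_value("P51587", patterns))           # None
```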

napistu.utils.matrix_to_edgelist(matrix, row_labels=None, col_labels=None)
napistu.utils.path_exists(path: str) bool

Check if a path or URI exists.

Works with any fsspec-supported filesystem (local, GCS, S3, memory, etc.).

Parameters:

path (str) – Path or URI to check (e.g., ‘/local/path’, ‘gs://bucket/path’, ‘memory://path’).

Returns:

True if the path exists, False otherwise.

Return type:

bool

Examples

>>> path_exists('/tmp/myfile.txt')
False
>>> path_exists('gs://bucket/existing_file.txt')
True
>>> path_exists('.')
True
napistu.utils.pickle_cache(path: str, overwrite: bool = False) Callable

A decorator to cache a function call result to pickle

Attention: this does not care about the function arguments. All function calls will be served by the same pickle file.

Parameters:
  • path (str) – Path to the cache pickle file

  • overwrite (bool) – Should an existing cache be overwritten?

Returns:

A function whose output will be cached to pickle.

Return type:

Callable
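
The caveat above (arguments are ignored, so every call is served by the same file) is easiest to see in a small example. The decorator below is a local-filesystem sketch written for illustration; pickle_cache_sketch is a hypothetical name, and treating overwrite=True as "always recompute" is an assumption.

```python
import functools
import os
import pickle
import tempfile
from typing import Callable


def pickle_cache_sketch(path: str, overwrite: bool = False) -> Callable:
    """Hypothetical sketch: cache a function's result in one pickle file."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(path) and not overwrite:
                with open(path, "rb") as handle:
                    return pickle.load(handle)  # arguments are not consulted
            result = func(*args, **kwargs)
            with open(path, "wb") as handle:
                pickle.dump(result, handle)
            return result

        return wrapper

    return decorator


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "square.pkl")

    @pickle_cache_sketch(cache_path)
    def square(x):
        return x * x

    first = square(3)   # computed, then written to the cache
    second = square(4)  # served from the cache; the new argument is ignored

print(first, second)  # 9 9
```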

napistu.utils.requests_retry_session(retries=5, backoff_factor=0.3, status_forcelist=(500, 502, 503, 504), session: Session | None = None, **kwargs) Session

Requests session with retry logic

This should help to combat flaky APIs, e.g., BRENDA. From: https://stackoverflow.com/a/58687549

Parameters:
  • retries (int) – Number of retries. Defaults to 5.

  • backoff_factor (float) – Backoff factor. Defaults to 0.3.

  • status_forcelist (tuple) – Errors to retry. Defaults to (500, 502, 503, 504).

  • session (requests.Session | None) – Existing session. Defaults to None.

Return type:

requests.Session

napistu.utils.safe_capitalize(text: str) str

Capitalize first letter only, preserve case of rest.
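
The difference from the built-in str.capitalize(), which lowercases the rest of the string, can be shown in a few lines. capitalize_first is a hypothetical stand-in for the behavior described above.

```python
def capitalize_first(text: str) -> str:
    """Hypothetical helper: uppercase the first character, keep the rest."""
    return text[:1].upper() + text[1:]


print(capitalize_first("tRNA synthetase"))  # TRNA synthetase
print("tRNA synthetase".capitalize())       # Trna synthetase
```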

napistu.utils.safe_fill(x: str, fill_width: int = 15) str

Safely wrap a string to a specified width.

Parameters:
  • x (str) – The string to wrap.

  • fill_width (int, optional) – The width to wrap the string to. Default is 15.

Returns:

The wrapped string.

Return type:

str

napistu.utils.safe_join_set(values: Any) str | None

Safely join values, filtering out None values.

Converts input to a set (ensuring uniqueness), removes None values, and joins remaining values with “ OR “ separator in sorted order.

Parameters:

values (Any) – Values to join. Can be list, tuple, set, pandas Series, string, or other iterable. Strings are treated as single values, not character sequences.

Returns:

Joined string with “ OR “ separator in alphabetical order, or None if no valid values remain after filtering.

Return type:

str or None

Examples

>>> safe_join_set([1, 2, 3])
'1 OR 2 OR 3'
>>> safe_join_set([3, 1, 2, 1])  # Removes duplicates and sorts
'1 OR 2 OR 3'
>>> safe_join_set([1, None, 3])
'1 OR 3'
>>> safe_join_set([None, None])
None
>>> safe_join_set("hello")  # String treated as single value
'hello'
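
The steps described above (treat a string as one value, deduplicate, drop None, sort, join) can be sketched as follows. join_unique is a hypothetical helper; sorting on the string form of each value is an assumption, so multi-digit numbers would sort lexicographically in this sketch.

```python
from typing import Any, Optional


def join_unique(values: Any) -> Optional[str]:
    """Hypothetical helper: join unique non-None values with ' OR '."""
    if isinstance(values, str):
        values = [values]  # a string is one value, not a sequence of characters
    unique = {value for value in values if value is not None}
    if not unique:
        return None
    return " OR ".join(sorted(str(value) for value in unique))


print(join_unique([3, 1, 2, 1]))  # 1 OR 2 OR 3
print(join_unique([None, None]))  # None
print(join_unique("hello"))       # hello
```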
napistu.utils.save_json(uri: str, obj: Any) None

Write object to JSON file at URI.

Parameters:
  • uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).

  • obj (Any) – Object to serialize to JSON.

Return type:

None

Examples

>>> save_json('/tmp/config.json', {'key': 'value'})
>>> save_json('gs://bucket/config.json', {'key': 'value'})
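
For local paths, a save and load round trip can be sketched with the standard json module. save_json_local and load_json_local are hypothetical stand-ins that mirror the argument order of save_json and load_json; they do not handle remote URIs.

```python
import json
import os
import tempfile
from typing import Any


def save_json_local(uri: str, obj: Any) -> None:
    """Hypothetical local-only sketch: write obj as JSON."""
    with open(uri, "w", encoding="utf-8") as handle:
        json.dump(obj, handle)


def load_json_local(uri: str) -> Any:
    """Hypothetical local-only sketch: read JSON back."""
    with open(uri, "r", encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    save_json_local(config_path, {"key": "value", "n": [1, 2, 3]})
    loaded = load_json_local(config_path)

print(loaded)  # {'key': 'value', 'n': [1, 2, 3]}
```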
napistu.utils.save_parquet(df: DataFrame, uri: str | Path, compression: str = 'snappy') None

Write a DataFrame to a single Parquet file.

Parameters:
  • df (pd.DataFrame) – The DataFrame to save.

  • uri (Union[str, Path]) – Path or URI where to save the Parquet file (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’). Recommended extensions: .parquet or .pq

  • compression (str, default='snappy') – Compression algorithm. Options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.

Raises:

OSError – If the file cannot be written to (permission issues, etc.).

Examples

>>> save_parquet(df, '/tmp/data.parquet')
>>> save_parquet(df, 'gs://bucket/data.parquet', compression='gzip')
napistu.utils.save_pickle(path: str, dat: Any) None

Save object to path as pickle.

Parameters:
  • path (str) – Path or URI where to save the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).

  • dat (Any) – Object to pickle.

Return type:

None

Examples

>>> save_pickle('/tmp/data.pkl', my_object)
>>> save_pickle('gs://bucket/data.pkl', my_object)
napistu.utils.score_nameness(string: str)

Score Nameness

This utility assigns a numeric score to a string reflecting how likely it is to be a human readable name. This will help to prioritize readable entries when we are trying to pick out a single name to display from a set of values which may also include entries like systematic ids.

Parameters:

string (str) – An alphanumeric string

Returns:

An integer score indicating how name-like the string is (low is more name-like)

Return type:

score (int)

napistu.utils.show(obj, method='auto', headers='keys', hide_index=False, left_align_strings=True, max_rows=20)

Show a table using the appropriate method for the environment.

Parameters:
  • obj (pd.DataFrame or any other object) – The object to show

  • method (str) – The method to use to show the object:

    • “string” to show the object as a string

    • “jupyter” to show the object in a Jupyter notebook

    • “auto” to show the object in a Jupyter notebook if available, otherwise as a string

  • headers (str, list, or None) – The headers to use for the object

  • left_align_strings (bool) – Should strings be left aligned?

  • max_rows (int) – The maximum number of rows to show

Return type:

None

Examples

>>> show(pd.DataFrame({"a": [1, 2, 3]}), headers="keys", hide_index=True)
napistu.utils.style_df(df: DataFrame, headers: str | list[str] | None = 'keys', hide_index: bool = False) Styler

Style DataFrame

Provide some simple options for styling a pd.DataFrame

Parameters:
  • df (pd.DataFrame) – A table to style

  • headers (Union[str, list[str], None]) –

    • “keys” to use the current column names

    • None to suppress column names

    • list[str] to overwrite and show column names

  • hide_index (bool) – Should the row index be hidden?

Returns:

styled_df – df with styles updated

Return type:

Styler

napistu.utils.update_pathological_names(names: Series, prefix: str) Series

Update pathological names in a pandas Series.

Add a prefix to the names if they are all numeric.
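
The rule above can be sketched on a plain list of strings. prefix_if_all_numeric is a hypothetical helper; testing with str.isdigit() and joining the prefix directly onto each name are assumptions about details the description leaves open.

```python
from typing import List


def prefix_if_all_numeric(names: List[str], prefix: str) -> List[str]:
    """Hypothetical helper: prefix names only when every name is numeric."""
    if names and all(str(name).isdigit() for name in names):
        return [f"{prefix}{name}" for name in names]
    return list(names)


print(prefix_if_all_numeric(["1", "2", "10"], "node_"))  # ['node_1', 'node_2', 'node_10']
print(prefix_if_all_numeric(["1", "abc"], "node_"))      # ['1', 'abc']
```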

napistu.utils.write_file_contents_to_path(path: str, contents: bytes) None

Write file contents to a path or URI.

Handles both file-like objects with write() method and string paths/URIs.

Parameters:
  • path (str) – Destination path or URI, or a file-like object with write() method.

  • contents (bytes) – File contents to write.

Return type:

None

Examples

>>> write_file_contents_to_path('/tmp/file.txt', b'Hello')
>>> write_file_contents_to_path('gs://bucket/file.txt', b'Hello')

Modules

constants

Constants for the utils module.

display_utils

docker_utils

Base Docker image management for Napistu workflows.

ig_utils

Utilities for igraph operations.

io_utils

Utilities for input and output operations.

optional

Utilities for handling optional dependencies in Napistu.

path_utils

Utilities for path and URI operations.

pd_utils

Utilities for pandas DataFrame operations.

string_utils

Utilities for string operations and text processing.