napistu.utils

Napistu utilities package.

Submodules provide focused helpers (io_utils, path_utils, pd_utils, etc.). Optional dependencies are imported via napistu.utils.optional, not re-exported here.

class napistu.utils.match_pd_vars(df: DataFrame | Series, req_vars: set, allow_series: bool = True)

Bases: object

Match Pandas Variables.

req_vars: A set of variables which should exist in df

missing_vars: Required variables which are not present in df

extra_vars: Non-required variables which are present in df

are_present: Returns True if req_vars are present and False otherwise

assert_present(): Raise an exception of req_vars are absent

__init__(df: DataFrame | Series, req_vars: set, allow_series: bool = True) → None

Compare req_vars against the variables present in df.

Parameters:

df – A pd.DataFrame or pd.Series
req_vars – A set of variables which should exist in df
allow_series – Can a pd.Series be provided as df?

Return type:

None.

assert_present() → None: Raise an error if required variables are missing
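
The attributes listed above reduce to set arithmetic on variable names. A minimal sketch in plain Python, assuming the names found in df are already available as a set; `compare_vars` and `present_vars` are hypothetical names used for illustration and are not part of napistu:

```python
def compare_vars(present_vars: set, req_vars: set) -> dict:
    """Hypothetical helper mirroring the attributes of match_pd_vars."""
    missing_vars = req_vars - present_vars  # required but absent
    extra_vars = present_vars - req_vars    # present but not required
    return {
        "missing_vars": missing_vars,
        "extra_vars": extra_vars,
        "are_present": not missing_vars,    # True only when nothing is missing
    }


result = compare_vars(present_vars={"a", "b", "c"}, req_vars={"a", "d"})
```

Here `result["missing_vars"]` is `{"d"}`, so an assert_present()-style check would raise.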

napistu.utils.check_unique_index(df, label=''): Validate that each index value only maps to a single row.

napistu.utils.copy_uri(input_uri: str, output_uri: str, is_file: bool = True) → None

Copy a file or folder from one URI to another.

Parameters:

input_uri (str) – Input file URI (e.g., ‘gs://bucket/file’, ‘/local/path’, ‘memory://path’).
output_uri (str) – Output file URI (e.g., ‘gs://bucket/file’, ‘/local/path’, ‘memory://path’).
is_file (bool, default=True) – If True, copy a single file. If False, copy directory recursively.

Examples

>>> copy_uri('/local/source.txt', '/local/dest.txt')
>>> copy_uri('gs://bucket/source/', 'gs://bucket/dest/', is_file=False)

napistu.utils.downcast_float_dataframe(df: DataFrame, dtype: Any = <class 'numpy.float32'>, copy: bool = True) → DataFrame

Coerce values to a narrower float dtype (default float32) to cut RAM usage.

Typical use is large all-numeric tables (e.g. stacked propagation scores) where float64 is unnecessary. Uses pandas.DataFrame.astype() with the given dtype and copy semantics.

Parameters:

df – DataFrame to coerce.
dtype – Numpy float dtype, commonly np.float32 to halve size versus float64.
copy – If True (default), always returns a new object even when dtype matches.

Returns:

DataFrame with values in dtype.

Return type:

pd.DataFrame

napistu.utils.download_and_extract(url: str, output_dir_path: str = '.', download_method: str = 'wget', overwrite: bool = False) → None: Download archive and extract to directory.

napistu.utils.download_ftp(url: str, path: str) → None

Download a file from an FTP server.

Parameters:

url (str) – URL of the file to download
path (str) – Path to the output file

Return type:

None

napistu.utils.download_wget(url: str, path, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) → None

Downloads file / archive with wget

Parameters:

url (str) – URL of the file to download
path (FilePath | WriteBuffer) – File path or buffer
target_filename (str) – Specific file to extract from ZIP if URL is a ZIP file
verify (bool) – Whether to verify the server’s SSL certificate
timeout (int) – Timeout in seconds for the request
max_retries (int) – Number of times to retry the download if it fails

Return type:

None

napistu.utils.drop_extra_cols(df_in: DataFrame, df_out: DataFrame, always_include: List[str] | None = None) → DataFrame

Remove columns in df_out that are not in df_in, except those specified in always_include.

Parameters:

df_in (pd.DataFrame) – Reference DataFrame whose columns determine what to keep
df_out (pd.DataFrame) – DataFrame to filter columns from
always_include (Optional[List[str]], optional) – List of column names to always include in output, even if not in df_in

Returns:

DataFrame with columns filtered to match df_in plus any always_include columns. Column order follows df_in, with always_include columns appended at the end.

Return type:

pd.DataFrame

Examples

>>> df_in = pd.DataFrame({'a': [1], 'b': [2]})
>>> df_out = pd.DataFrame({'a': [3], 'c': [4], 'd': [5]})
>>> drop_extra_cols(df_in, df_out)
# Returns DataFrame with just column 'a'

>>> drop_extra_cols(df_in, df_out, always_include=['d'])
# Returns DataFrame with columns ['a', 'd']
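
The column-selection rule described above can be sketched on plain lists of column names. This is an illustration of the documented ordering only, not napistu's implementation; `select_columns` is a hypothetical helper:

```python
from typing import List, Optional


def select_columns(
    in_cols: List[str],
    out_cols: List[str],
    always_include: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: which columns of df_out survive, and in what order."""
    always_include = always_include or []
    # Keep columns shared with df_in, following df_in's order
    keep = [col for col in in_cols if col in out_cols]
    # Append always_include columns that exist in df_out and are not kept yet
    keep += [col for col in always_include if col in out_cols and col not in keep]
    return keep


shared_only = select_columns(["a", "b"], ["a", "c", "d"])
with_extra = select_columns(["a", "b"], ["a", "c", "d"], always_include=["d"])
```

With pandas, the resulting list could then be used to subset df_out, e.g. `df_out[keep]`.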

napistu.utils.ensure_pd_df(pd_df_or_series: DataFrame | Series) → DataFrame

Ensure Pandas DataFrame

Convert a pd.Series to a DataFrame if needed.

Parameters: pd_df_or_series (pd.Series | pd.DataFrame) – A pandas DataFrame or Series
Returns: pd_df_or_series converted to a pd.DataFrame if needed

napistu.utils.extract(file_uri: str) → None

Extract archive at file_uri to same directory.

Supports: .tar.gz, .tgz, .zip, .gz

napistu.utils.extract_regex_match(regex: str, query: str) → str

Parameters:

regex (str) – regular expression to search
query (str) – string to search against

Returns:

a character string match

Return type:

match (str)

napistu.utils.extract_regex_search(regex: str, query: str, index_value: int = 0) → str

Match an identifier substring and otherwise throw an error

Parameters:

regex (str) – regular expression to search
query (str) – string to search against
index_value (int) – entry in index to return

Returns:

a character string match

Return type:

match (str)
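
A minimal sketch of the search-or-raise behavior described above, assuming that index_value selects a group of the match object (0 being the whole match) and that a failed search raises ValueError. `regex_search_or_raise` is a hypothetical stand-in, not the napistu function:

```python
import re


def regex_search_or_raise(regex: str, query: str, index_value: int = 0) -> str:
    """Hypothetical stand-in: return a matched substring or raise."""
    match = re.search(regex, query)
    if match is None:
        raise ValueError(f"{query} does not match the regex: {regex}")
    # Index 0 is the full match; 1 and up are capture groups
    return match[index_value]


whole = regex_search_or_raise(r"ENSG[0-9]+", "gene ENSG00000139618 v1")
group = regex_search_or_raise(r"id=([0-9]+)", "id=42", index_value=1)
```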

napistu.utils.find_weakly_connected_subgraphs(edgelist: DataFrame) → DataFrame: Find all cliques of loosely connected components.
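
The column names and output layout of this function are not documented here, so the following is only a generic sketch of finding weakly connected components from an edgelist, using union-find on plain tuples. `weakly_connected_components` is a hypothetical name, and napistu may use a different algorithm:

```python
def weakly_connected_components(edges):
    """Group nodes that are linked by any chain of edges, ignoring direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_source] = root_target  # merge the two groups

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    # Sort for a deterministic result
    return sorted(components.values(), key=lambda members: sorted(members))


groups = weakly_connected_components([("a", "b"), ("b", "c"), ("x", "y")])
```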

napistu.utils.format_identifiers_as_edgelist(df: DataFrame, defining_vars: list[str], verbose: bool = False) → DataFrame

Format Identifiers as Edgelist

Collapse a multiindex to an index (if needed), and similarly collapse multiple variables to a single entry. This indexed pd.Series of index - ids can be treated as an edgelist for greedy clustering.

Parameters:

df (pd.DataFrame) – Any pd.DataFrame
defining_vars (list[str]) – A set of attributes which define a distinct entry in df
verbose (bool, default=False) – If True, then include detailed logs.

Returns:

df – A pd.DataFrame with an “ind” and “id” variable added indicating rolled up values of the index and defining_vars

Return type:

pd.DataFrame

napistu.utils.get_extn_from_url(url: str) → str

Retrieve file extension from a URL.

Parameters: url (str) – URL to extract extension from.
Returns: File extension including the leading dot (e.g., ‘.gz’, ‘.tar.gz’).
Return type: str
Raises: ValueError – If no file extension can be identified in the URL.

Examples

>>> get_extn_from_url('https://test/test.gz')
'.gz'
>>> get_extn_from_url('https://test/test.tar.gz')
'.tar.gz'
>>> get_extn_from_url('https://test/test.tar.gz/bla')
Traceback (most recent call last):
...
ValueError: File extension not identifiable: https://test/test.tar.gz/bla
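
One way to reproduce the three examples above is a regular expression anchored at the end of the URL. This is a sketch under that assumption, not the library's code; `guess_extension` is a hypothetical name and the pattern only special-cases ‘.tar.gz’:

```python
import re


def guess_extension(url: str) -> str:
    """Hypothetical sketch: extension at the very end of a URL, dot included."""
    # Try the compound '.tar.gz' first, then any single '.ext' suffix
    match = re.search(r"(\.tar\.gz|\.[A-Za-z0-9]+)$", url)
    if match is None:
        raise ValueError(f"File extension not identifiable: {url}")
    return match.group(1)


single = guess_extension("https://test/test.gz")
compound = guess_extension("https://test/test.tar.gz")
```

Note that this simple pattern would also treat the ‘.com’ of a bare host name as an extension, so a real implementation probably needs to look at the URL path only.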

napistu.utils.get_source_base_and_path(uri: str) → tuple[str, str]

Get the base of a bucket or folder and the path to the file.

For URIs with a scheme (e.g., ‘gs://’), returns the scheme + netloc as base. For local paths, returns the directory as base.

Parameters:

uri (str) – URI or path to parse.

Returns:

A tuple of (base, path) where:

base (str) – The base URI or directory (e.g., ‘gs://bucket’ or ‘/local/dir’).
path (str) – The relative path to the file (e.g., ‘folder/file’ or ‘file’).

Return type:

tuple[str, str]

Examples

>>> get_source_base_and_path("gs://bucket/folder/file")
('gs://bucket', 'folder/file')
>>> get_source_base_and_path("/bucket/folder/file")
('/bucket/folder', 'file')
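
The two cases described above (URI with a scheme versus local path) can be sketched with the standard library. `split_source` is a hypothetical name, and the sketch assumes POSIX-style separators:

```python
import posixpath
from urllib.parse import urlsplit


def split_source(uri: str) -> tuple:
    """Hypothetical sketch of the (base, path) split for source URIs."""
    parts = urlsplit(uri)
    if parts.scheme:
        # Scheme + netloc form the base; the rest is the relative path
        return f"{parts.scheme}://{parts.netloc}", parts.path.lstrip("/")
    # Local path: the directory is the base, the file name is the path
    return posixpath.dirname(uri), posixpath.basename(uri)


remote = split_source("gs://bucket/folder/file")
local = split_source("/bucket/folder/file")
```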

napistu.utils.get_target_base_and_path(uri: str) → tuple[str, str]

Get the base directory + parent path and the filename.

Splits the URI at the last path separator to extract the filename.

Parameters:

uri (str) – URI or path to parse.

Returns:

A tuple of (base, filename) where:

base (str) – The directory path (e.g., ‘gs://bucket/folder’ or ‘/local/folder’).
filename (str) – The filename (e.g., ‘file’).

Return type:

tuple[str, str]

Examples

>>> get_target_base_and_path("gs://bucket/folder/file")
('gs://bucket/folder', 'file')
>>> get_target_base_and_path("bucket/folder/file")
('bucket/folder', 'file')
>>> get_target_base_and_path("/bucket/folder/file")
('/bucket/folder', 'file')
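
Splitting at the last path separator, as described above, amounts to a single `str.rsplit` call. A minimal sketch, assuming "/" is the separator and the URI contains at least one; `split_target` is a hypothetical name:

```python
def split_target(uri: str) -> tuple:
    """Hypothetical sketch: split a URI into (base, filename) at the last '/'."""
    base, filename = uri.rsplit("/", 1)
    return base, filename


remote = split_target("gs://bucket/folder/file")
relative = split_target("bucket/folder/file")
absolute = split_target("/bucket/folder/file")
```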

napistu.utils.gunzip(gzipped_path: str, outpath: str | None = None) → None

Gunzip a file to an output path.

Parameters:

gzipped_path (str) – Path or URI to the gzipped file (e.g., ‘/local/file.gz’, ‘gs://bucket/file.gz’).
outpath (str | None, optional) – Path or URI to the output file. If None, automatically determined by removing the .gz extension from gzipped_path.

Return type:

None

Raises:

FileNotFoundError – If gzipped_path does not exist.

Examples

>>> gunzip('/tmp/data.txt.gz')  # Creates /tmp/data.txt
>>> gunzip('gs://bucket/data.txt.gz', 'gs://bucket/output.txt')
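
For local files only, the behavior described above can be approximated with the standard library. This sketch does not handle remote URIs such as ‘gs://’, and `gunzip_local` is a hypothetical name rather than the napistu function:

```python
import gzip
import os
import shutil
import tempfile


def gunzip_local(gzipped_path: str, outpath: str = None) -> str:
    """Hypothetical local-only sketch: decompress a .gz file."""
    if not os.path.exists(gzipped_path):
        raise FileNotFoundError(gzipped_path)
    if outpath is None:
        if not gzipped_path.endswith(".gz"):
            raise ValueError(f"Cannot infer output path from {gzipped_path}")
        outpath = gzipped_path[: -len(".gz")]  # drop the .gz extension
    with gzip.open(gzipped_path, "rb") as src, open(outpath, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream instead of loading into memory
    return outpath


# Round trip in a temporary directory
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "data.txt.gz")
with gzip.open(gz_path, "wb") as fh:
    fh.write(b"hello")
out_path = gunzip_local(gz_path)
```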

napistu.utils.infer_entity_type(df: DataFrame) → str

Infer the entity type of a DataFrame based on its structure and schema.

Parameters: df (pd.DataFrame) – The DataFrame to analyze
Returns: The inferred entity type name
Return type: str
Raises: ValueError – If no entity type can be determined

napistu.utils.initialize_dir(output_dir_path: str, overwrite: bool) → None

Initialize a filesystem directory.

Creates a new directory or optionally overwrites an existing one. Works with any fsspec-supported filesystem (local, GCS, S3, etc.).

Parameters:

output_dir_path (str) – Path or URI to the directory to create (e.g., ‘/local/path’, ‘gs://bucket/path’).
overwrite (bool) – If True, delete and recreate the directory if it exists. If False, raise FileExistsError if the directory exists.

Raises:

FileExistsError – If directory exists and overwrite is False.

Examples

>>> initialize_dir('/tmp/newdir', overwrite=False)
>>> initialize_dir('gs://bucket/path', overwrite=True)
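
The create-or-overwrite semantics above can be sketched for the local filesystem with `os` and `shutil`. The real function is documented as working on any fsspec-supported filesystem; `initialize_local_dir` is a hypothetical, local-only illustration:

```python
import os
import shutil
import tempfile


def initialize_local_dir(output_dir_path: str, overwrite: bool) -> None:
    """Hypothetical local-only sketch of the documented semantics."""
    if os.path.isdir(output_dir_path):
        if not overwrite:
            raise FileExistsError(f"{output_dir_path} already exists")
        shutil.rmtree(output_dir_path)  # delete before recreating
    os.makedirs(output_dir_path)


target = os.path.join(tempfile.mkdtemp(), "newdir")
initialize_local_dir(target, overwrite=False)  # creates the directory
with open(os.path.join(target, "stale.txt"), "w") as fh:
    fh.write("old contents")
initialize_local_dir(target, overwrite=True)   # recreates it empty
```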

napistu.utils.load_json(uri: str) → Any

Read JSON from a URI.

Parameters: uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).
Returns: The parsed JSON object (dict, list, etc.).
Return type: Any

Examples

>>> data = load_json('/tmp/config.json')
>>> data = load_json('gs://bucket/config.json')

napistu.utils.load_parquet(uri: str | Path) → DataFrame

Read a DataFrame from a Parquet file.

Parameters: uri (Union[str, Path]) – Path or URI to the Parquet file to load (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’).
Returns: The DataFrame loaded from the Parquet file.
Return type: pd.DataFrame
Raises: FileNotFoundError – If the specified file does not exist.

Examples

>>> df = load_parquet('/tmp/data.parquet')
>>> df = load_parquet('gs://bucket/data.parquet')

napistu.utils.load_pickle(path: str) → Any

Load a pickle object from a path or URI.

Parameters: path (str) – Path or URI to the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).
Returns: The unpickled object.
Return type: Any

Examples

>>> obj = load_pickle('/tmp/data.pkl')
>>> obj = load_pickle('gs://bucket/data.pkl')

napistu.utils.match_regex_dict(s: str, regex_dict: Dict[str, any]) → any | None

Apply each regex in regex_dict to the string s. If a regex matches, return its value. If no regex matches, return None.

Parameters:

s (str) – The string to test.
regex_dict (dict) – Dictionary where keys are regex patterns (str), and values are the values to return.

Returns:

The value associated with the first matching regex, or None if no match.
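
A minimal sketch of the lookup described above, assuming patterns are tried in dictionary order with `re.search` (the documentation does not say whether `re.search` or `re.match` is used). `first_regex_value` is a hypothetical name:

```python
import re
from typing import Any, Dict, Optional


def first_regex_value(s: str, regex_dict: Dict[str, Any]) -> Optional[Any]:
    """Hypothetical sketch: value of the first pattern that matches s."""
    for pattern, value in regex_dict.items():
        if re.search(pattern, s):
            return value
    return None


# Example patterns are illustrative only
ontology_patterns = {r"^ENSG[0-9]+$": "ensembl_gene", r"^[0-9]+$": "numeric_id"}
hit = first_regex_value("ENSG00000139618", ontology_patterns)
miss = first_regex_value("not-an-id", ontology_patterns)
```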

napistu.utils.matrix_to_edgelist(matrix, row_labels=None, col_labels=None)

napistu.utils.path_exists(path: str) → bool

Check if a path or URI exists.

Works with any fsspec-supported filesystem (local, GCS, S3, memory, etc.).

Parameters: path (str) – Path or URI to check (e.g., ‘/local/path’, ‘gs://bucket/path’, ‘memory://path’).
Returns: True if the path exists, False otherwise.
Return type: bool

Examples

>>> path_exists('/tmp/myfile.txt')
False
>>> path_exists('gs://bucket/existing_file.txt')
True
>>> path_exists('.')
True

napistu.utils.pickle_cache(path: str, overwrite: bool = False) → Callable

A decorator to cache a function call result to pickle

Attention: the cache ignores the function arguments. All function calls will be served by the same pickle file.

Parameters:

path (str) – Path to the cache pickle file
overwrite (bool) – Should an existing cache be overwritten?

Returns:

A function whose output will be cached to pickle.

Return type:

Callable
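
The warning about ignored arguments is easiest to see in a small stand-in. The decorator below is a sketch of the documented behavior for local paths only, not the napistu implementation; `pickle_cache_sketch` and `square` are hypothetical names:

```python
import functools
import os
import pickle
import tempfile


def pickle_cache_sketch(path: str, overwrite: bool = False):
    """Hypothetical stand-in: cache a function's result in one pickle file."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if overwrite or not os.path.exists(path):
                result = func(*args, **kwargs)
                with open(path, "wb") as fh:
                    pickle.dump(result, fh)
                return result
            # The cache exists: the arguments are never inspected
            with open(path, "rb") as fh:
                return pickle.load(fh)

        return wrapper

    return decorator


cache_path = os.path.join(tempfile.mkdtemp(), "cache.pkl")


@pickle_cache_sketch(cache_path)
def square(x):
    return x * x


first = square(3)   # computed, then written to the cache
second = square(4)  # served from the cache, so still the result for 3
```

Because both calls share one file, `second` equals `first`. A cache keyed on arguments would need a different design.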

napistu.utils.requests_retry_session(retries=5, backoff_factor=0.3, status_forcelist=(500, 502, 503, 504), session: Session | None = None, **kwargs) → Session

Requests session with retry logic

This should help to combat flaky APIs, e.g., BRENDA. From: https://stackoverflow.com/a/58687549

Parameters:

retries (int) – Number of retries. Defaults to 5.
backoff_factor (float) – Backoff factor. Defaults to 0.3.
status_forcelist (tuple) – Errors to retry. Defaults to (500, 502, 503, 504).
session (requests.Session | None) – Existing session. Defaults to None.

Return type:

requests.Session

napistu.utils.safe_capitalize(text: str) → str: Capitalize first letter only, preserve case of rest.
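
The difference from the built-in `str.capitalize`, which lowercases the rest of the string, can be shown with a one-line sketch. `capitalize_first` is a hypothetical name used for illustration:

```python
def capitalize_first(text: str) -> str:
    """Hypothetical sketch: uppercase the first character, keep the rest."""
    # Slicing keeps this safe for the empty string
    return text[:1].upper() + text[1:]


preserved = capitalize_first("mRNA binding")
builtin = "mRNA binding".capitalize()
```

`preserved` keeps the acronym intact ("MRNA binding"), while `builtin` does not ("Mrna binding").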

napistu.utils.safe_fill(x: str, fill_width: int = 15) → str

Safely wrap a string to a specified width.

Parameters:

x (str) – The string to wrap.
fill_width (int, optional) – The width to wrap the string to. Default is 15.

Returns:

The wrapped string.

Return type:

str
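
A sketch of width-limited wrapping with the standard library's `textwrap.fill`, assuming that "safely" means empty input is returned unchanged (the documentation does not spell this out). `fill_text` is a hypothetical name:

```python
import textwrap


def fill_text(x: str, fill_width: int = 15) -> str:
    """Hypothetical sketch: wrap x so that no line exceeds fill_width."""
    if not x:
        return x  # nothing to wrap
    return textwrap.fill(x, fill_width)


label = "a long species name here"
wrapped = fill_text(label)
```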

napistu.utils.safe_join_set(values: Any) → str | None

Safely join values, filtering out None values.

Converts input to a set (ensuring uniqueness), removes None values, and joins remaining values with “ OR “ separator in sorted order.

Parameters: values (Any) – Values to join. Can be list, tuple, set, pandas Series, string, or other iterable. Strings are treated as single values, not character sequences.
Returns: Joined string with “ OR “ separator in alphabetical order, or None if no valid values remain after filtering.
Return type: str or None

Examples

>>> safe_join_set([1, 2, 3])
'1 OR 2 OR 3'
>>> safe_join_set([3, 1, 2, 1])  # Removes duplicates and sorts
'1 OR 2 OR 3'
>>> safe_join_set([1, None, 3])
'1 OR 3'
>>> safe_join_set([None, None])
None
>>> safe_join_set("hello")  # String treated as single value
'hello'
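
The examples above can be reproduced with a short sketch. It assumes values are sorted before being converted to strings, which may differ from the library's actual ordering; `join_unique` is a hypothetical name:

```python
from typing import Any, Optional


def join_unique(values: Any) -> Optional[str]:
    """Hypothetical sketch: deduplicate, drop None, sort, join with ' OR '."""
    if isinstance(values, str):
        values = [values]  # a string is one value, not a sequence of characters
    unique = {value for value in values if value is not None}
    if not unique:
        return None
    return " OR ".join(str(value) for value in sorted(unique))


joined = join_unique([3, 1, 2, 1])
with_none = join_unique([1, None, 3])
```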

napistu.utils.save_json(uri: str, obj: Any) → None

Write object to JSON file at URI.

Parameters:

uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).
obj (Any) – Object to serialize to JSON.

Return type:

None

Examples

>>> save_json('/tmp/config.json', {'key': 'value'})
>>> save_json('gs://bucket/config.json', {'key': 'value'})

napistu.utils.save_parquet(df: DataFrame, uri: str | Path, compression: str = 'snappy') → None

Write a DataFrame to a single Parquet file.

Parameters:

df (pd.DataFrame) – The DataFrame to save.
uri (Union[str, Path]) – Path or URI where to save the Parquet file (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’). Recommended extensions: .parquet or .pq
compression (str, default='snappy') – Compression algorithm. Options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.

Raises:

OSError – If the file cannot be written to (permission issues, etc.).

Examples

>>> save_parquet(df, '/tmp/data.parquet')
>>> save_parquet(df, 'gs://bucket/data.parquet', compression='gzip')

napistu.utils.save_pickle(path: str, dat: Any) → None

Save object to path as pickle.

Parameters:

path (str) – Path or URI where to save the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).
dat (Any) – Object to pickle.

Return type:

None

Examples

>>> save_pickle('/tmp/data.pkl', my_object)
>>> save_pickle('gs://bucket/data.pkl', my_object)

napistu.utils.score_nameness(string: str)

Score Nameness

This utility assigns a numeric score to a string reflecting how likely it is to be a human readable name. This will help to prioritize readable entries when we are trying to pick out a single name to display from a set of values which may also include entries like systematic ids.

Parameters: string (str) – An alphanumeric string
Returns: An integer score indicating how name-like the string is (low is more name-like)
Return type: int

napistu.utils.show(obj, method='auto', headers='keys', hide_index=False, left_align_strings=True, max_rows=20)

Show a table using the appropriate method for the environment.

Parameters:

obj (pd.DataFrame or any other object) – The object to show
method (str) – The method to use to show the object:
- “string” to show the object as a string
- “jupyter” to show the object in a Jupyter notebook
- “auto” to show the object in a Jupyter notebook if available, otherwise as a string
headers (str, list, or None) – The headers to use for the object
hide_index (bool) – Should the index be hidden?
left_align_strings (bool) – Should strings be left aligned?
max_rows (int) – The maximum number of rows to show

Return type:

None

Examples

>>> show(pd.DataFrame({"a": [1, 2, 3]}), headers="keys", hide_index=True)

napistu.utils.style_df(df: DataFrame, headers: str | list[str] | None = 'keys', hide_index: bool = False) → Styler

Style DataFrame

Provide some simple options for styling a pd.DataFrame

Parameters:

df (pd.DataFrame) – A table to style
headers (Union[str, list[str], None]) –
- “keys” to use the current column names
- None to suppress column names
- list[str] to overwrite and show column names
hide_index (bool) – Should the index be hidden?

Returns:

styled_df – df with styles updated

Return type:

Styler

napistu.utils.update_pathological_names(names: Series, prefix: str) → Series

Update pathological names in a pandas Series.

Add a prefix to the names if they are all numeric.
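
A sketch of the rule above on a plain list of strings instead of a pandas Series. It assumes "all numeric" means every name consists only of digits and that the prefix is prepended directly; `prefix_if_all_numeric` is a hypothetical name:

```python
from typing import List


def prefix_if_all_numeric(names: List[str], prefix: str) -> List[str]:
    """Hypothetical sketch: prefix every name, but only if all are numeric."""
    if names and all(str(name).isdigit() for name in names):
        return [f"{prefix}{name}" for name in names]
    return list(names)  # mixed or non-numeric names are left untouched


renamed = prefix_if_all_numeric(["1", "2"], "gene_")
unchanged = prefix_if_all_numeric(["1", "abc"], "gene_")
```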

napistu.utils.write_file_contents_to_path(path: str, contents: bytes) → None

Write file contents to a path or URI.

Handles both file-like objects with write() method and string paths/URIs.

Parameters:

path (str) – Destination path or URI, or a file-like object with write() method.
contents (bytes) – File contents to write.

Return type:

None

Examples

>>> write_file_contents_to_path('/tmp/file.txt', b'Hello')
>>> write_file_contents_to_path('gs://bucket/file.txt', b'Hello')

Modules

`constants`	Constants for the utils module.
`display_utils`
`docker_utils`	Base Docker image management for Napistu workflows.
`ig_utils`	Utilities for igraph operations.
`io_utils`	Utilities for input and output operations.
`optional`	Utilities for handling optional dependencies in Napistu.
`path_utils`	Utilities for path and URI operations.
`pd_utils`	Utilities for pandas DataFrame operations.
`string_utils`	Utilities for string operations and text processing.