napistu.utils
Napistu utilities package.
Submodules provide focused helpers (io_utils, path_utils, pd_utils, etc.).
Optional dependencies are imported via napistu.utils.optional, not re-exported here.
- class napistu.utils.match_pd_vars(df: DataFrame | Series, req_vars: set, allow_series: bool = True)
Bases: object
Match Pandas Variables.
- req_vars
A set of variables which should exist in df
- missing_vars
Required variables which are not present in df
- extra_vars
Non-required variables which are present in df
- are_present
Returns True if req_vars are present and False otherwise
- assert_present()
Raise an exception if req_vars are absent
- __init__(df: DataFrame | Series, req_vars: set, allow_series: bool = True) None
Compare the variables present in df against req_vars.
- Parameters:
df – A pd.DataFrame or pd.Series
req_vars – A set of variables which should exist in df
allow_series – Can a pd.Series be provided as df?
- Return type:
None.
- assert_present() None
Raise an error if required variables are missing
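The attributes listed above amount to set differences between the required names and the names actually present. A minimal sketch of that arithmetic, using plain Python sets as a stand-in for the variable names found in df; `compare_vars` is a hypothetical helper for illustration, not part of napistu.

```python
def compare_vars(present_vars: set, req_vars: set) -> dict:
    """Summarize how required names relate to the names actually present."""
    missing = req_vars - present_vars  # required but absent
    extra = present_vars - req_vars    # present but not required
    return {
        "missing_vars": missing,
        "extra_vars": extra,
        "are_present": not missing,
    }


# Stand-in for the variable names of a DataFrame (illustrative values).
present = {"s_id", "s_name", "notes"}
result = compare_vars(present, {"s_id", "s_name", "s_source"})

print(result["missing_vars"])  # {'s_source'}
print(result["extra_vars"])    # {'notes'}
print(result["are_present"])   # False
```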
- napistu.utils.check_unique_index(df, label='')
Validate that each index value only maps to a single row.
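The check described here boils down to counting how often each index value occurs. A rough sketch on a plain list of index values; `find_duplicated_index_values` is a hypothetical name, and how napistu reports a violation is not shown in this reference.

```python
from collections import Counter


def find_duplicated_index_values(index_values):
    """Return the index values that map to more than one row."""
    counts = Counter(index_values)
    return sorted(value for value, n in counts.items() if n > 1)


dupes = find_duplicated_index_values(["r1", "r2", "r2", "r3"])
print(dupes)  # ['r2']
```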
- napistu.utils.copy_uri(input_uri: str, output_uri: str, is_file: bool = True) None
Copy a file or folder from one URI to another.
- Parameters:
input_uri (str) – Input file URI (e.g., ‘gs://bucket/file’, ‘/local/path’, ‘memory://path’).
output_uri (str) – Output file URI (e.g., ‘gs://bucket/file’, ‘/local/path’, ‘memory://path’).
is_file (bool, default=True) – If True, copy a single file. If False, copy directory recursively.
Examples
>>> copy_uri('/local/source.txt', '/local/dest.txt')
>>> copy_uri('gs://bucket/source/', 'gs://bucket/dest/', is_file=False)
- napistu.utils.downcast_float_dataframe(df: DataFrame, dtype: Any = <class 'numpy.float32'>, copy: bool = True) DataFrame
Coerce values to a narrower float dtype (default float32) to cut RAM usage.
Typical use is large all-numeric tables (e.g. stacked propagation scores) where float64 is unnecessary. Uses pandas.DataFrame.astype() with the given dtype and copy semantics.
- Parameters:
df – DataFrame to coerce.
dtype – Numpy float dtype, commonly np.float32 to halve size versus float64.
copy – If True (default), always returns a new object even when dtype matches.
- Returns:
DataFrame with values in dtype.
- Return type:
pd.DataFrame
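The size-versus-precision trade-off behind the float32 default can be checked with the standard library alone. This sketch packs a single value both ways; it does not call napistu or pandas, and the value 0.1 is only an illustration.

```python
import struct

value = 0.1
as_float64 = struct.pack("<d", value)  # IEEE 754 double precision
as_float32 = struct.pack("<f", value)  # IEEE 754 single precision
print(len(as_float64), len(as_float32))  # 8 4

# Round-tripping through float32 loses a little precision.
round_tripped = struct.unpack("<f", as_float32)[0]
error = abs(round_tripped - value)
print(round_tripped != value)  # True
print(error < 1e-7)            # True
```

Whether that rounding error is acceptable depends on the table; scores compared at a few significant digits are usually unaffected.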
- napistu.utils.download_and_extract(url: str, output_dir_path: str = '.', download_method: str = 'wget', overwrite: bool = False) None
Download archive and extract to directory.
- napistu.utils.download_ftp(url: str, path: str) None
Download a file from an FTP server.
- Parameters:
url (str) – URL of the file to download
path (str) – Path to the output file
- Return type:
None
- napistu.utils.download_wget(url: str, path, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) None
Downloads file / archive with wget
- Parameters:
url (str) – URL of the file to download
path (FilePath | WriteBuffer) – File path or buffer
target_filename (str) – Specific file to extract from ZIP if URL is a ZIP file
verify (bool) – Whether to verify the server's SSL certificate
timeout (int) – Timeout in seconds for the request
max_retries (int) – Number of times to retry the download if it fails
- Return type:
None
- napistu.utils.drop_extra_cols(df_in: DataFrame, df_out: DataFrame, always_include: List[str] | None = None) DataFrame
Remove columns in df_out that are not in df_in, except those specified in always_include.
- Parameters:
df_in (pd.DataFrame) – Reference DataFrame whose columns determine what to keep
df_out (pd.DataFrame) – DataFrame to filter columns from
always_include (Optional[List[str]], optional) – List of column names to always include in output, even if not in df_in
- Returns:
DataFrame with columns filtered to match df_in plus any always_include columns. Column order follows df_in, with always_include columns appended at the end.
- Return type:
pd.DataFrame
Examples
>>> df_in = pd.DataFrame({'a': [1], 'b': [2]})
>>> df_out = pd.DataFrame({'a': [3], 'c': [4], 'd': [5]})
>>> drop_extra_cols(df_in, df_out)  # Returns DataFrame with just column 'a'
>>> drop_extra_cols(df_in, df_out, always_include=['d'])  # Returns DataFrame with columns ['a', 'd']
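The column-selection and ordering rule described under Returns can be sketched on plain lists of column names. `select_columns` is a hypothetical helper written for illustration; it mirrors the documented behavior but is not napistu's code.

```python
def select_columns(in_cols, out_cols, always_include=None):
    """Pick the columns of df_out to keep, in the documented order."""
    always_include = always_include or []
    # Columns shared with df_in come first, in df_in's order.
    keep = [col for col in in_cols if col in out_cols]
    # always_include columns are appended at the end, without duplicates.
    keep += [col for col in always_include if col in out_cols and col not in keep]
    return keep


default_cols = select_columns(["a", "b"], ["a", "c", "d"])
with_extra = select_columns(["a", "b"], ["a", "c", "d"], always_include=["d"])
print(default_cols)  # ['a']
print(with_extra)    # ['a', 'd']
```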
- napistu.utils.ensure_pd_df(pd_df_or_series: DataFrame | Series) DataFrame
Ensure Pandas DataFrame
Convert a pd.Series to a DataFrame if needed.
- Parameters:
pd_df_or_series (pd.Series | pd.DataFrame) – a pandas df or series
- Returns:
pd_df converted to a pd.DataFrame if needed
- napistu.utils.extract(file_uri: str) None
Extract archive at file_uri to same directory.
Supports: .tar.gz, .tgz, .zip, .gz
- napistu.utils.extract_regex_match(regex: str, query: str) str
- Parameters:
regex (str) – regular expression to search
query (str) – string to search against
- Returns:
a character string match
- Return type:
match (str)
- napistu.utils.extract_regex_search(regex: str, query: str, index_value: int = 0) str
Match an identifier substring and otherwise throw an error
- Parameters:
regex (str) – regular expression to search
query (str) – string to search against
index_value (int) – entry in index to return
- Returns:
a character string match
- Return type:
match (str)
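A search-or-raise helper of this kind can be sketched with the standard `re` module. The name `search_or_raise`, the choice of `ValueError`, and the use of `index_value` as a match-group index are assumptions made for illustration.

```python
import re


def search_or_raise(regex: str, query: str, index_value: int = 0) -> str:
    """Return the requested group of the first match, or raise if none."""
    match = re.search(regex, query)
    if match is None:
        raise ValueError(f"{query!r} does not match the regex {regex!r}")
    return match[index_value]  # group 0 is the whole match


found = search_or_raise(r"ENSG[0-9]+", "gene:ENSG00000141510.v2")
print(found)  # ENSG00000141510
```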
- napistu.utils.find_weakly_connected_subgraphs(edgelist: DataFrame) DataFrame
Find all cliques of loosely connected components.
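Weakly connected components ignore edge direction, so one common way to compute them is union-find over the edge list. The sketch below works on plain tuples rather than a DataFrame; the function name and output shape are illustrative assumptions, and napistu may use a different method.

```python
def weakly_connected_components(edges):
    """Group nodes that are linked by any chain of edges, ignoring direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_s] = root_t  # merge the two components

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


components = weakly_connected_components([("a", "b"), ("c", "b"), ("d", "e")])
print(components)  # [['a', 'b', 'c'], ['d', 'e']]
```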
- napistu.utils.format_identifiers_as_edgelist(df: DataFrame, defining_vars: list[str], verbose: bool = False) DataFrame
Format Identifiers as Edgelist
Collapse a multiindex to an index (if needed), and similarly collapse multiple variables to a single entry. This indexed pd.Series of index - ids can be treated as an edgelist for greedy clustering.
- Parameters:
df (pd.DataFrame) – Any pd.DataFrame
defining_vars (list[str]) – A set of attributes which define a distinct entry in df
verbose (bool, default=False) – If True, then include detailed logs.
- Returns:
df – A pd.DataFrame with an “ind” and “id” variable added indicating rolled up values of the index and defining_vars
- Return type:
pd.DataFrame
- napistu.utils.get_extn_from_url(url: str) str
Retrieve file extension from a URL.
- Parameters:
url (str) – URL to extract extension from.
- Returns:
File extension including the leading dot (e.g., ‘.gz’, ‘.tar.gz’).
- Return type:
str
- Raises:
ValueError – If no file extension can be identified in the URL.
Examples
>>> get_extn_from_url('https://test/test.gz')
'.gz'
>>> get_extn_from_url('https://test/test.tar.gz')
'.tar.gz'
>>> get_extn_from_url('https://test/test.tar.gz/bla')
Traceback (most recent call last):
...
ValueError: File extension not identifiable: https://test/test.tar.gz/bla
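One way to reproduce the behavior in the examples above is to look only at the last path segment and treat `.tar` as part of a compound extension. This is a sketch under those assumptions; `extension_from_url` is a hypothetical name and the regex is not taken from napistu.

```python
import re
from urllib.parse import urlparse


def extension_from_url(url: str) -> str:
    """Return the extension of the final path segment, keeping '.tar' prefixes."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    match = re.search(r"(\.tar)?\.[A-Za-z0-9]+$", filename)
    if match is None:
        raise ValueError(f"File extension not identifiable: {url}")
    return match.group(0)


print(extension_from_url("https://test/test.gz"))      # .gz
print(extension_from_url("https://test/test.tar.gz"))  # .tar.gz
```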
- napistu.utils.get_source_base_and_path(uri: str) tuple[str, str]
Get the base of a bucket or folder and the path to the file.
For URIs with a scheme (e.g., ‘gs://’), returns the scheme + netloc as base. For local paths, returns the directory as base.
- Parameters:
uri (str) – URI or path to parse.
- Returns:
A tuple of (base, path) where:
- base (str) – The base URI or directory (e.g., ‘gs://bucket’ or ‘/local/dir’).
- path (str) – The relative path to the file (e.g., ‘folder/file’ or ‘file’).
- Return type:
tuple[str, str]
Examples
>>> get_source_base_and_path("gs://bucket/folder/file")
('gs://bucket', 'folder/file')
>>> get_source_base_and_path("/bucket/folder/file")
('/bucket/folder', 'file')
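The two cases described above (URI with a scheme versus local path) can be sketched with the standard library. `split_source` is a hypothetical stand-in that reproduces the documented examples; it is not napistu's implementation.

```python
import posixpath
from urllib.parse import urlparse


def split_source(uri: str) -> tuple:
    """Split a URI into (base, path) following the documented rule."""
    parsed = urlparse(uri)
    if parsed.scheme:
        # Scheme + netloc form the base; the rest is the relative path.
        return f"{parsed.scheme}://{parsed.netloc}", parsed.path.lstrip("/")
    # Local path: the containing directory is the base.
    return posixpath.split(uri)


print(split_source("gs://bucket/folder/file"))  # ('gs://bucket', 'folder/file')
print(split_source("/bucket/folder/file"))      # ('/bucket/folder', 'file')
```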
- napistu.utils.get_target_base_and_path(uri: str) tuple[str, str]
Get the base directory + parent path and the filename.
Splits the URI at the last path separator to extract the filename.
- Parameters:
uri (str) – URI or path to parse.
- Returns:
A tuple of (base, filename) where:
- base (str) – The directory path (e.g., ‘gs://bucket/folder’ or ‘/local/folder’).
- filename (str) – The filename (e.g., ‘file’).
- Return type:
tuple[str, str]
Examples
>>> get_target_base_and_path("gs://bucket/folder/file")
('gs://bucket/folder', 'file')
>>> get_target_base_and_path("bucket/folder/file")
('bucket/folder', 'file')
>>> get_target_base_and_path("/bucket/folder/file")
('/bucket/folder', 'file')
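Splitting at the last path separator, as described above, is a one-liner with `str.rpartition`. The helper name `split_target` is hypothetical and the sketch only covers "/" as the separator.

```python
def split_target(uri: str) -> tuple:
    """Split a URI at its last '/' into (base, filename)."""
    base, _, filename = uri.rpartition("/")
    return base, filename


print(split_target("gs://bucket/folder/file"))  # ('gs://bucket/folder', 'file')
print(split_target("bucket/folder/file"))       # ('bucket/folder', 'file')
```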
- napistu.utils.gunzip(gzipped_path: str, outpath: str | None = None) None
Gunzip a file to an output path.
- Parameters:
gzipped_path (str) – Path or URI to the gzipped file (e.g., ‘/local/file.gz’, ‘gs://bucket/file.gz’).
outpath (str | None, optional) – Path or URI to the output file. If None, automatically determined by removing the .gz extension from gzipped_path.
- Return type:
None
- Raises:
FileNotFoundError – If gzipped_path does not exist.
Examples
>>> gunzip('/tmp/data.txt.gz')  # Creates /tmp/data.txt
>>> gunzip('gs://bucket/data.txt.gz', 'gs://bucket/output.txt')
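For local files, the behavior described above (default output path derived by dropping `.gz`, `FileNotFoundError` for a missing input) can be sketched with the standard library. `gunzip_local` is a hypothetical helper; it does not handle remote URIs such as `gs://`, which the documented function does.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_local(gzipped_path, outpath=None):
    """Decompress a local .gz file and return the output path."""
    if not os.path.exists(gzipped_path):
        raise FileNotFoundError(gzipped_path)
    if outpath is None:
        if not gzipped_path.endswith(".gz"):
            raise ValueError("Cannot infer outpath without a .gz extension")
        outpath = gzipped_path[: -len(".gz")]
    with gzip.open(gzipped_path, "rb") as src, open(outpath, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files stay off the heap
    return outpath


with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "data.txt.gz")
    with gzip.open(gz_path, "wb") as fh:
        fh.write(b"hello\n")
    out_name = os.path.basename(gunzip_local(gz_path))
    with open(os.path.join(tmp, out_name), "rb") as fh:
        restored = fh.read()

print(out_name)  # data.txt
```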
- napistu.utils.infer_entity_type(df: DataFrame) str
Infer the entity type of a DataFrame based on its structure and schema.
- Parameters:
df (pd.DataFrame) – The DataFrame to analyze
- Returns:
The inferred entity type name
- Return type:
str
- Raises:
ValueError – If no entity type can be determined
- napistu.utils.initialize_dir(output_dir_path: str, overwrite: bool) None
Initialize a filesystem directory.
Creates a new directory or optionally overwrites an existing one. Works with any fsspec-supported filesystem (local, GCS, S3, etc.).
- Parameters:
output_dir_path (str) – Path or URI to the directory to create (e.g., ‘/local/path’, ‘gs://bucket/path’).
overwrite (bool) – If True, delete and recreate the directory if it exists. If False, raise FileExistsError if the directory exists.
- Raises:
FileExistsError – If directory exists and overwrite is False.
Examples
>>> initialize_dir('/tmp/newdir', overwrite=False)
>>> initialize_dir('gs://bucket/path', overwrite=True)
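The overwrite semantics described above can be sketched for the local filesystem only. `initialize_local_dir` is a hypothetical helper built on `os` and `shutil`; the documented function works through fsspec and also covers remote filesystems.

```python
import os
import shutil
import tempfile


def initialize_local_dir(path: str, overwrite: bool) -> None:
    """Create an empty directory, optionally replacing an existing one."""
    if os.path.isdir(path):
        if not overwrite:
            raise FileExistsError(f"{path} already exists")
        shutil.rmtree(path)  # discard the old contents
    os.makedirs(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "newdir")
    initialize_local_dir(target, overwrite=False)
    open(os.path.join(target, "stale.txt"), "w").close()

    try:
        initialize_local_dir(target, overwrite=False)
        refused = False
    except FileExistsError:
        refused = True

    initialize_local_dir(target, overwrite=True)
    contents_after_overwrite = os.listdir(target)

print(refused)                   # True
print(contents_after_overwrite)  # []
```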
- napistu.utils.load_json(uri: str) Any
Read JSON from a URI.
- Parameters:
uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).
- Returns:
The parsed JSON object (dict, list, etc.).
- Return type:
Any
Examples
>>> data = load_json('/tmp/config.json')
>>> data = load_json('gs://bucket/config.json')
- napistu.utils.load_parquet(uri: str | Path) DataFrame
Read a DataFrame from a Parquet file.
- Parameters:
uri (Union[str, Path]) – Path or URI to the Parquet file to load (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’).
- Returns:
The DataFrame loaded from the Parquet file.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the specified file does not exist.
Examples
>>> df = load_parquet('/tmp/data.parquet')
>>> df = load_parquet('gs://bucket/data.parquet')
- napistu.utils.load_pickle(path: str) Any
Load a pickle object from a path or URI.
- Parameters:
path (str) – Path or URI to the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).
- Returns:
The unpickled object.
- Return type:
Any
Examples
>>> obj = load_pickle('/tmp/data.pkl')
>>> obj = load_pickle('gs://bucket/data.pkl')
- napistu.utils.match_regex_dict(s: str, regex_dict: Dict[str, any]) any | None
Apply each regex in regex_dict to the string s. If a regex matches, return its value. If no regex matches, return None.
- Parameters:
s (str) – The string to test.
regex_dict (dict) – Dictionary where keys are regex patterns (str), and values are the values to return.
- Return type:
The value associated with the first matching regex, or None if no match.
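The first-match lookup described above can be sketched as a loop over the dictionary in insertion order. `first_regex_value` is a hypothetical name, and using `re.search` (match anywhere in the string) rather than `re.match` is an assumption.

```python
import re


def first_regex_value(s: str, regex_dict: dict):
    """Return the value of the first pattern that matches s, else None."""
    for pattern, value in regex_dict.items():
        if re.search(pattern, s):
            return value
    return None


# Illustrative patterns; not taken from napistu.
id_types = {r"^ENSG[0-9]+$": "ensembl_gene", r"^[0-9]+$": "numeric_id"}
print(first_regex_value("ENSG00000141510", id_types))  # ensembl_gene
print(first_regex_value("7157", id_types))             # numeric_id
print(first_regex_value("TP53", id_types))             # None
```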
- napistu.utils.matrix_to_edgelist(matrix, row_labels=None, col_labels=None)
- napistu.utils.path_exists(path: str) bool
Check if a path or URI exists.
Works with any fsspec-supported filesystem (local, GCS, S3, memory, etc.).
- Parameters:
path (str) – Path or URI to check (e.g., ‘/local/path’, ‘gs://bucket/path’, ‘memory://path’).
- Returns:
True if the path exists, False otherwise.
- Return type:
bool
Examples
>>> path_exists('/tmp/myfile.txt')
False
>>> path_exists('gs://bucket/existing_file.txt')
True
>>> path_exists('.')
True
- napistu.utils.pickle_cache(path: str, overwrite: bool = False) Callable
A decorator to cache a function call result to pickle
Attention: the cache ignores function arguments. All function calls will be served by the same pickle file.
- Parameters:
path (str) – Path to the cache pickle file
overwrite (bool) – Should an existing cache be overwritten even if it exists?
- Returns:
A function whose output will be cached to pickle.
- Return type:
Callable
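The caveat about ignored arguments is easiest to see in a small stand-in decorator. `cache_to_pickle` below is a hypothetical local-file sketch of the documented behavior, not napistu's implementation; in particular, treating `overwrite=True` as "always recompute and rewrite" is an assumption.

```python
import functools
import os
import pickle
import tempfile


def cache_to_pickle(path: str, overwrite: bool = False):
    """Cache a function's result in a single pickle file at `path`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(path) and not overwrite:
                with open(path, "rb") as fh:
                    return pickle.load(fh)  # arguments are not consulted
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


with tempfile.TemporaryDirectory() as tmp:

    @cache_to_pickle(os.path.join(tmp, "square.pkl"))
    def square(x):
        return x * x

    first = square(3)   # computed, then written to the cache
    second = square(4)  # served from the cache despite the new argument

print(first, second)  # 9 9
```

Because the cache key is the file path alone, a decorator like this suits expensive zero-argument loaders better than parameterized functions.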
- napistu.utils.requests_retry_session(retries=5, backoff_factor=0.3, status_forcelist=(500, 502, 503, 504), session: Session | None = None, **kwargs) Session
Requests session with retry logic
This should help to combat flaky APIs, e.g., Brenda. From: https://stackoverflow.com/a/58687549
- Parameters:
retries (int) – Number of retries. Defaults to 5.
backoff_factor (float) – Backoff factor. Defaults to 0.3.
status_forcelist (tuple) – Errors to retry. Defaults to (500, 502, 503, 504).
session (requests.Session | None) – Existing session. Defaults to None.
- Return type:
requests.Session
- napistu.utils.safe_capitalize(text: str) str
Capitalize first letter only, preserve case of rest.
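The difference from the built-in `str.capitalize()`, which lowercases everything after the first character, can be shown in a couple of lines. `capitalize_first` is a hypothetical stand-in for the documented behavior.

```python
def capitalize_first(text: str) -> str:
    """Uppercase the first character and leave the rest untouched."""
    return text[:1].upper() + text[1:]


preserved = capitalize_first("mRNA export")
builtin = "mRNA export".capitalize()
print(preserved)  # MRNA export
print(builtin)    # Mrna export
```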
- napistu.utils.safe_fill(x: str, fill_width: int = 15) str
Safely wrap a string to a specified width.
- Parameters:
x (str) – The string to wrap.
fill_width (int, optional) – The width to wrap the string to. Default is 15.
- Returns:
The wrapped string.
- Return type:
str
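Wrapping to a fixed width is what the standard library's `textwrap.fill` does, so a plausible sketch of the core step looks like the following. Whether napistu uses `textwrap`, and what extra inputs the "safe" variant tolerates, is not stated in this reference; the sample label is illustrative.

```python
import textwrap

label = "a long protein name here"
wrapped = textwrap.fill(label, width=15)  # 15 matches the documented default
print(wrapped)
# a long protein
# name here
```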
- napistu.utils.safe_join_set(values: Any) str | None
Safely join values, filtering out None values.
Converts input to a set (ensuring uniqueness), removes None values, and joins remaining values with “ OR “ separator in sorted order.
- Parameters:
values (Any) – Values to join. Can be list, tuple, set, pandas Series, string, or other iterable. Strings are treated as single values, not character sequences.
- Returns:
Joined string with “ OR “ separator in alphabetical order, or None if no valid values remain after filtering.
- Return type:
str or None
Examples
>>> safe_join_set([1, 2, 3])
'1 OR 2 OR 3'
>>> safe_join_set([3, 1, 2, 1])  # Removes duplicates and sorts
'1 OR 2 OR 3'
>>> safe_join_set([1, None, 3])
'1 OR 3'
>>> safe_join_set([None, None])
None
>>> safe_join_set("hello")  # String treated as single value
'hello'
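The steps described above (treat strings as single values, deduplicate, drop None, sort, join) can be sketched as follows. `join_unique` is a hypothetical stand-in; sorting on the string form of each value is an assumption that happens to reproduce the documented examples.

```python
def join_unique(values):
    """Join the unique non-None values with ' OR ', or return None."""
    if isinstance(values, str):
        values = [values]  # a string is one value, not a sequence of characters
    unique = {v for v in values if v is not None}
    if not unique:
        return None
    return " OR ".join(sorted(str(v) for v in unique))


print(join_unique([3, 1, 2, 1]))   # 1 OR 2 OR 3
print(join_unique([1, None, 3]))   # 1 OR 3
print(join_unique([None, None]))   # None
print(join_unique("hello"))        # hello
```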
- napistu.utils.save_json(uri: str, obj: Any) None
Write object to JSON file at URI.
- Parameters:
uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).
obj (Any) – Object to serialize to JSON.
- Return type:
None
Examples
>>> save_json('/tmp/config.json', {'key': 'value'})
>>> save_json('gs://bucket/config.json', {'key': 'value'})
- napistu.utils.save_parquet(df: DataFrame, uri: str | Path, compression: str = 'snappy') None
Write a DataFrame to a single Parquet file.
- Parameters:
df (pd.DataFrame) – The DataFrame to save.
uri (Union[str, Path]) – Path or URI where to save the Parquet file (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’). Recommended extensions: .parquet or .pq
compression (str, default='snappy') – Compression algorithm. Options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.
- Raises:
OSError – If the file cannot be written to (permission issues, etc.).
Examples
>>> save_parquet(df, '/tmp/data.parquet')
>>> save_parquet(df, 'gs://bucket/data.parquet', compression='gzip')
- napistu.utils.save_pickle(path: str, dat: Any) None
Save object to path as pickle.
- Parameters:
path (str) – Path or URI where to save the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).
dat (Any) – Object to pickle.
- Return type:
None
Examples
>>> save_pickle('/tmp/data.pkl', my_object)
>>> save_pickle('gs://bucket/data.pkl', my_object)
- napistu.utils.score_nameness(string: str)
Score Nameness
This utility assigns a numeric score to a string reflecting how likely it is to be a human readable name. This will help to prioritize readable entries when we are trying to pick out a single name to display from a set of values which may also include entries like systematic ids.
- Parameters:
string (str) – An alphanumeric string
- Returns:
An integer score indicating how name-like the string is (low is more name-like)
- Return type:
score (int)
- napistu.utils.show(obj, method='auto', headers='keys', hide_index=False, left_align_strings=True, max_rows=20)
Show a table using the appropriate method for the environment.
- Parameters:
obj (pd.DataFrame or any other object) – The object to show
method (str) – The method to use to show the object:
- “string” : show the object as a string
- “jupyter” : show the object in a Jupyter notebook
- “auto” : show the object in a Jupyter notebook if available, otherwise show as a string
headers (str, list, or None) – The headers to use for the object
left_align_strings (bool) – Should strings be left aligned?
max_rows (int) – The maximum number of rows to show
- Return type:
None
Examples
>>> show(pd.DataFrame({"a": [1, 2, 3]}), headers="keys", hide_index=True)
- napistu.utils.style_df(df: DataFrame, headers: str | list[str] | None = 'keys', hide_index: bool = False) Styler
Style DataFrame
Provide some simple options for styling a pd.DataFrame
- Parameters:
df (pd.DataFrame) – A table to style
headers (Union[str, list[str], None]) –
“keys” to use the current column names
None to suppress column names
list[str] to overwrite and show column names
hide_index (bool) – Should the row index be hidden?
- Returns:
styled_df – df with styles updated
- Return type:
Styler
- napistu.utils.update_pathological_names(names: Series, prefix: str) Series
Update pathological names in a pandas Series.
Add a prefix to the names if they are all numeric.
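The rule described above can be sketched on a plain list of names. `prefix_if_all_numeric` is a hypothetical helper; reading "numeric" as "consists only of digits" and requiring every name to qualify before any is prefixed are both assumptions.

```python
def prefix_if_all_numeric(names, prefix):
    """Prefix every name, but only when all of them are digit strings."""
    if names and all(str(name).isdigit() for name in names):
        return [f"{prefix}{name}" for name in names]
    return list(names)


print(prefix_if_all_numeric(["1", "2", "10"], "gene_"))  # ['gene_1', 'gene_2', 'gene_10']
print(prefix_if_all_numeric(["1", "TP53"], "gene_"))     # ['1', 'TP53']
```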
- napistu.utils.write_file_contents_to_path(path: str, contents: bytes) None
Write file contents to a path or URI.
Handles both file-like objects with write() method and string paths/URIs.
- Parameters:
path (str) – Destination path or URI, or a file-like object with write() method.
contents (bytes) – File contents to write.
- Return type:
None
Examples
>>> write_file_contents_to_path('/tmp/file.txt', b'Hello')
>>> write_file_contents_to_path('gs://bucket/file.txt', b'Hello')
Modules
- Constants for the utils module.
- Base Docker image management for Napistu workflows.
- Utilities for igraph operations.
- Utilities for input and output operations.
- Utilities for handling optional dependencies in Napistu.
- Utilities for path and URI operations.
- Utilities for pandas DataFrame operations.
- Utilities for string operations and text processing.