napistu.utils.io_utils

Utilities for input and output operations.

Public Functions

download_and_extract(url: str, output_dir_path: str = ".", download_method: str = DOWNLOAD_METHODS.WGET, overwrite: bool = False) -> None:

Download an archive and then extract to a new folder.

download_ftp(url: str, path: str) -> None:

Download a file from an FTP server.

download_wget(url: str, path: str, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) -> None:

Download a file / archive with wget.

extract(file_uri: str) -> None:

Untar, unzip, or gunzip a compressed file.

gunzip(gzipped_path: str, outpath: str | None = None) -> None:

Gunzip a file to an output path.

load_json(uri: str) -> Any:

Read JSON from URI.

load_parquet(uri: Union[str, Path]) -> pd.DataFrame:

Read a DataFrame from a Parquet file.

load_pickle(path: str) -> Any:

Load pickle object from path.

pickle_cache(path: str, overwrite: bool = False) -> Callable:

Decorator to cache a function call result to pickle.

requests_retry_session(retries: int = 5, backoff_factor: float = 0.3, status_forcelist: tuple = (500, 502, 503, 504), session: requests.Session | None = None, **kwargs) -> requests.Session:

Create a requests session with retry logic.

save_json(uri: str, obj: Any) -> None:

Write object to JSON file at URI.

save_parquet(df: pd.DataFrame, uri: Union[str, Path], compression: str = "snappy") -> None:

Write a DataFrame to a single Parquet file.

save_pickle(path: str, dat: Any) -> None:

Save object to path as pickle.

write_file_contents_to_path(path: str, contents: bytes) -> None:

Helper function to write file contents to a path.

Functions

download_and_extract(url[, output_dir_path, ...])

Download archive and extract to directory.

download_ftp(url, path)

Download a file from an FTP server.

download_wget(url, path[, target_filename, ...])

Download a file or archive with wget.

extract(file_uri)

Extract archive at file_uri to same directory.

gunzip(gzipped_path[, outpath])

Gunzip a file to an output path.

load_json(uri)

Read JSON from a URI.

load_parquet(uri)

Read a DataFrame from a Parquet file.

load_pickle(path)

Load a pickle object from a path or URI.

pickle_cache(path[, overwrite])

A decorator to cache a function call result to pickle.

requests_retry_session([retries, ...])

Create a requests session with retry logic.

save_json(uri, obj)

Write object to JSON file at URI.

save_parquet(df, uri[, compression])

Write a DataFrame to a single Parquet file.

save_pickle(path, dat)

Save object to path as pickle.

write_file_contents_to_path(path, contents)

Write file contents to a path or URI.

napistu.utils.io_utils._copy_tree(source_dir: str, dest_uri: str) -> None

Copy directory tree to any fsspec destination.

napistu.utils.io_utils._extract_tarball(tar_path: str, output_uri: str) -> None

Extract tarball using standard library.

napistu.utils.io_utils._extract_zip(zip_path: str, output_uri: str) -> None

Extract zip using standard library.

napistu.utils.io_utils.download_and_extract(url: str, output_dir_path: str = '.', download_method: str = 'wget', overwrite: bool = False) -> None

Download archive and extract to directory.

napistu.utils.io_utils.download_ftp(url: str, path: str) -> None

Download a file from an FTP server.

Parameters:
  • url (str) – URL of the file to download

  • path (str) – Path to the output file

Return type:

None

napistu.utils.io_utils.download_wget(url: str, path, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) -> None

Download a file or archive with wget.

Parameters:
  • url (str) – URL of the file to download

  • path (FilePath | WriteBuffer) – File path or buffer

  • target_filename (str) – Specific file to extract from ZIP if URL is a ZIP file

  • verify (bool) – Whether to verify the server's SSL/TLS certificate when downloading

  • timeout (int) – Timeout in seconds for the request

  • max_retries (int) – Number of times to retry the download if it fails

Return type:

None

napistu.utils.io_utils.extract(file_uri: str) -> None

Extract archive at file_uri to same directory.

Supports: .tar.gz, .tgz, .zip, .gz
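
The suffix-based dispatch listed above can be sketched with the standard library alone. This is a minimal illustration, not napistu's implementation: `extract_sketch` is a hypothetical name, it handles local paths only (no remote URIs), and raising `ValueError` for an unsupported suffix is an assumption.

```python
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path


def extract_sketch(file_path: str) -> None:
    """Extract a local archive into its own directory, dispatching on suffix."""
    path = Path(file_path)
    out_dir = path.parent
    name = path.name.lower()

    # Check the tarball suffixes first, because ".tar.gz" also ends in ".gz".
    if name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(out_dir)
    elif name.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(out_dir)
    elif name.endswith(".gz"):
        # Plain gzip: drop the trailing ".gz" to name the output file.
        with gzip.open(path, "rb") as src, open(path.with_suffix(""), "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        # Assumption: unsupported suffixes are rejected rather than ignored.
        raise ValueError(f"Unsupported archive type: {file_path}")
```

The order of the checks matters: testing `.gz` before `.tar.gz` would treat a tarball as a plain gzip file and leave an unextracted `.tar` behind.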

napistu.utils.io_utils.gunzip(gzipped_path: str, outpath: str | None = None) -> None

Gunzip a file to an output path.

Parameters:
  • gzipped_path (str) – Path or URI to the gzipped file (e.g., ‘/local/file.gz’, ‘gs://bucket/file.gz’).

  • outpath (str | None, optional) – Path or URI to the output file. If None, automatically determined by removing the .gz extension from gzipped_path.

Return type:

None

Raises:

FileNotFoundError – If gzipped_path does not exist.

Examples

>>> gunzip('/tmp/data.txt.gz')  # Creates /tmp/data.txt
>>> gunzip('gs://bucket/data.txt.gz', 'gs://bucket/output.txt')
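
How the default output path is derived can be illustrated with a local-only stand-in. `gunzip_sketch` is a hypothetical name, not napistu's code; returning the output path (the documented function returns None) and rejecting inputs without a `.gz` suffix are assumptions made for the illustration.

```python
import gzip
import os
import shutil
from typing import Optional


def gunzip_sketch(gzipped_path: str, outpath: Optional[str] = None) -> str:
    """Decompress a local .gz file and return the path that was written."""
    if not os.path.exists(gzipped_path):
        raise FileNotFoundError(gzipped_path)

    if outpath is None:
        # Assumption: a missing ".gz" suffix is an error in this sketch.
        if not gzipped_path.endswith(".gz"):
            raise ValueError("cannot infer outpath without a .gz suffix")
        # Default output: the input path with the trailing ".gz" removed.
        outpath = gzipped_path[: -len(".gz")]

    # Stream the decompressed bytes so large files are not held in memory.
    with gzip.open(gzipped_path, "rb") as src, open(outpath, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return outpath
```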

napistu.utils.io_utils.load_json(uri: str) -> Any

Read JSON from a URI.

Parameters:

uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).

Returns:

The parsed JSON object (dict, list, etc.).

Return type:

Any

Examples

>>> data = load_json('/tmp/config.json')
>>> data = load_json('gs://bucket/config.json')

napistu.utils.io_utils.load_parquet(uri: str | Path) -> DataFrame

Read a DataFrame from a Parquet file.

Parameters:

uri (Union[str, Path]) – Path or URI to the Parquet file to load (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’).

Returns:

The DataFrame loaded from the Parquet file.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If the specified file does not exist.

Examples

>>> df = load_parquet('/tmp/data.parquet')
>>> df = load_parquet('gs://bucket/data.parquet')

napistu.utils.io_utils.load_pickle(path: str) -> Any

Load a pickle object from a path or URI.

Parameters:

path (str) – Path or URI to the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).

Returns:

The unpickled object.

Return type:

Any

Examples

>>> obj = load_pickle('/tmp/data.pkl')
>>> obj = load_pickle('gs://bucket/data.pkl')

napistu.utils.io_utils.pickle_cache(path: str, overwrite: bool = False) -> Callable

A decorator to cache a function call result to pickle.

Attention: the cache ignores the function arguments. All function calls are served from the same pickle file.

Parameters:
  • path (str) – Path to the cache pickle file

  • overwrite (bool) – Whether to overwrite an existing cache file.

Returns:

A function whose output will be cached to pickle.

Return type:

Callable
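
The caveat about ignored arguments is easiest to see in a small stand-in. The decorator below is a sketch of the documented behavior written from scratch for local paths; `pickle_cache_sketch` and the `square` function are hypothetical names, not napistu's code.

```python
import functools
import os
import pickle
import tempfile
from typing import Any, Callable


def pickle_cache_sketch(path: str, overwrite: bool = False) -> Callable:
    """Cache a function's result in one pickle file, regardless of arguments."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Serve from the cache whenever the file exists and overwrite is off.
            if os.path.exists(path) and not overwrite:
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


cache_file = os.path.join(tempfile.mkdtemp(), "square.pkl")


@pickle_cache_sketch(cache_file)
def square(x: int) -> int:
    return x * x


first = square(3)   # computed, then written to cache_file
second = square(4)  # read back from cache_file; the new argument is ignored
```

Because the cache key is the file path alone, a decorator like this suits functions that are effectively called with one fixed set of arguments, such as an expensive download or parse step.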

napistu.utils.io_utils.requests_retry_session(retries=5, backoff_factor=0.3, status_forcelist=(500, 502, 503, 504), session: Session | None = None, **kwargs) -> Session

Create a requests session with retry logic.

This helps to cope with flaky APIs, e.g., BRENDA. From: https://stackoverflow.com/a/58687549

Parameters:
  • retries (int) – Number of retries. Defaults to 5.

  • backoff_factor (float) – Backoff factor. Defaults to 0.3.

  • status_forcelist (tuple) – Errors to retry. Defaults to (500, 502, 503, 504).

  • session (requests.Session | None) – Existing session. Defaults to None.

Return type:

requests.Session
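
A session of this kind is commonly built by mounting an `HTTPAdapter` configured with a urllib3 `Retry` policy, which is the pattern in the Stack Overflow answer cited above. The sketch below follows that pattern with the documented defaults; `retry_session_sketch` is a hypothetical name, and how napistu handles `**kwargs` is not shown because the source does not describe it.

```python
from typing import Optional, Tuple

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def retry_session_sketch(
    retries: int = 5,
    backoff_factor: float = 0.3,
    status_forcelist: Tuple[int, ...] = (500, 502, 503, 504),
    session: Optional[requests.Session] = None,
) -> requests.Session:
    """Return a session whose HTTP and HTTPS requests are retried on failure."""
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    # Mount the adapter for both schemes so every request uses the policy.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A typical call would be `retry_session_sketch().get(url, timeout=30)`, where `url` is whatever endpoint is being queried; the retry policy applies transparently to each request made through the session.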

napistu.utils.io_utils.save_json(uri: str, obj: Any) -> None

Write object to JSON file at URI.

Parameters:
  • uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).

  • obj (Any) – Object to serialize to JSON.

Return type:

None

Examples

>>> save_json('/tmp/config.json', {'key': 'value'})
>>> save_json('gs://bucket/config.json', {'key': 'value'})
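
A save and load round trip only preserves JSON-compatible types. The sketch below shows this with local files, assuming the serialization behaves like the standard `json` module (the source does not say which serializer is used); `save_json_sketch` and `load_json_sketch` are hypothetical stand-ins.

```python
import json
import os
import tempfile
from typing import Any


def save_json_sketch(path: str, obj: Any) -> None:
    """Serialize obj to a local JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(obj, fh)


def load_json_sketch(path: str) -> Any:
    """Parse a local JSON file into Python objects."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


path = os.path.join(tempfile.mkdtemp(), "config.json")
original = {"key": "value", "ids": (1, 2, 3)}
save_json_sketch(path, original)
restored = load_json_sketch(path)
# JSON has no tuple type, so restored["ids"] comes back as the list [1, 2, 3].
```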

napistu.utils.io_utils.save_parquet(df: DataFrame, uri: str | Path, compression: str = 'snappy') -> None

Write a DataFrame to a single Parquet file.

Parameters:
  • df (pd.DataFrame) – The DataFrame to save.

  • uri (Union[str, Path]) – Path or URI where to save the Parquet file (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’). Recommended extensions: .parquet or .pq

  • compression (str, default='snappy') – Compression algorithm. Options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.

Raises:

OSError – If the file cannot be written to (permission issues, etc.).

Examples

>>> save_parquet(df, '/tmp/data.parquet')
>>> save_parquet(df, 'gs://bucket/data.parquet', compression='gzip')

napistu.utils.io_utils.save_pickle(path: str, dat: Any) -> None

Save object to path as pickle.

Parameters:
  • path (str) – Path or URI where to save the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).

  • dat (Any) – Object to pickle.

Return type:

None

Examples

>>> save_pickle('/tmp/data.pkl', my_object)
>>> save_pickle('gs://bucket/data.pkl', my_object)
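
Unlike JSON, a pickle round trip keeps Python-specific types such as tuples and sets. The sketch below uses the standard `pickle` module on a local file; `save_pickle_sketch` and `load_pickle_sketch` are hypothetical stand-ins, not napistu's code.

```python
import os
import pickle
import tempfile
from typing import Any


def save_pickle_sketch(path: str, dat: Any) -> None:
    """Pickle dat to a local file."""
    with open(path, "wb") as fh:
        pickle.dump(dat, fh)


def load_pickle_sketch(path: str) -> Any:
    """Unpickle an object from a local file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


path = os.path.join(tempfile.mkdtemp(), "data.pkl")
original = {"ids": (1, 2, 3), "tags": {"a", "b"}}
save_pickle_sketch(path, original)
restored = load_pickle_sketch(path)  # tuple and set types are preserved
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.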

napistu.utils.io_utils.write_file_contents_to_path(path: str, contents: bytes) -> None

Write file contents to a path or URI.

Handles both file-like objects with write() method and string paths/URIs.

Parameters:
  • path (str) – Destination path or URI, or a file-like object with write() method.

  • contents (bytes) – File contents to write.

Return type:

None

Examples

>>> write_file_contents_to_path('/tmp/file.txt', b'Hello')
>>> write_file_contents_to_path('gs://bucket/file.txt', b'Hello')
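
The two accepted kinds of destination, a file-like object or a string path, can be distinguished by checking for a `write()` method. The sketch below illustrates that dispatch for local paths and in-memory buffers; `write_contents_sketch` is a hypothetical name, and the `hasattr` check is an assumption about how such a helper might be written.

```python
import io
import os
import tempfile
from typing import BinaryIO, Union


def write_contents_sketch(path: Union[str, BinaryIO], contents: bytes) -> None:
    """Write bytes to a file-like object or to a local path."""
    if hasattr(path, "write"):
        # File-like destination: write directly and leave it open for the caller.
        path.write(contents)
    else:
        # String destination: open the file in binary mode and write the bytes.
        with open(path, "wb") as fh:
            fh.write(contents)


buffer = io.BytesIO()
write_contents_sketch(buffer, b"Hello")

target = os.path.join(tempfile.mkdtemp(), "file.txt")
write_contents_sketch(target, b"Hello")
```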