napistu.utils.io_utils
Utilities for input and output operations.
Public Functions
- download_and_extract(url: str, output_dir_path: str = ".", download_method: str = DOWNLOAD_METHODS.WGET, overwrite: bool = False) -> None:
Download an archive and then extract to a new folder.
- download_ftp(url: str, path: str) -> None:
Download a file from an FTP server.
- download_wget(url: str, path: str, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) -> None:
Download a file / archive with wget.
- extract(file_uri: str) -> None:
Untar, unzip and ungzip compressed files.
- gunzip(gzipped_path: str, outpath: str | None = None) -> None:
Gunzip a file to an output path.
- load_json(uri: str) -> Any:
Read JSON from URI.
- load_parquet(uri: Union[str, Path]) -> pd.DataFrame:
Read a DataFrame from a Parquet file.
- load_pickle(path: str) -> Any:
Load pickle object from path.
- pickle_cache(path: str, overwrite: bool = False) -> Callable:
Decorator to cache a function call result to pickle.
- requests_retry_session(retries: int = 5, backoff_factor: float = 0.3, status_forcelist: tuple = (500, 502, 503, 504), session: requests.Session | None = None, **kwargs) -> requests.Session:
Create a requests session with retry logic.
- save_json(uri: str, obj: Any) -> None:
Write object to JSON file at URI.
- save_parquet(df: pd.DataFrame, uri: Union[str, Path], compression: str = "snappy") -> None:
Write a DataFrame to a single Parquet file.
- save_pickle(path: str, dat: object) -> None:
Save object to path as pickle.
- write_file_contents_to_path(path: str, contents: bytes) -> None:
Helper function to write file contents to a path.
Functions
download_and_extract | Download archive and extract to directory.
download_ftp | Download a file from an FTP server.
download_wget | Download a file or archive with wget.
extract | Extract archive at file_uri to same directory.
gunzip | Gunzip a file to an output path.
load_json | Read JSON from a URI.
load_parquet | Read a DataFrame from a Parquet file.
load_pickle | Load a pickle object from a path or URI.
pickle_cache | A decorator to cache a function call result to pickle.
requests_retry_session | Requests session with retry logic.
save_json | Write object to JSON file at URI.
save_parquet | Write a DataFrame to a single Parquet file.
save_pickle | Save object to path as pickle.
write_file_contents_to_path | Write file contents to a path or URI.
- napistu.utils.io_utils._copy_tree(source_dir: str, dest_uri: str) -> None
Copy directory tree to any fsspec destination.
- napistu.utils.io_utils._extract_tarball(tar_path: str, output_uri: str) -> None
Extract tarball using standard library.
- napistu.utils.io_utils._extract_zip(zip_path: str, output_uri: str) -> None
Extract zip using standard library.
- napistu.utils.io_utils.download_and_extract(url: str, output_dir_path: str = '.', download_method: str = 'wget', overwrite: bool = False) -> None
Download archive and extract to directory.
- napistu.utils.io_utils.download_ftp(url: str, path: str) -> None
Download a file from an FTP server.
- Parameters:
url (str) – URL of the file to download
path (str) – Path to the output file
- Return type:
None
- napistu.utils.io_utils.download_wget(url: str, path, target_filename: str = None, verify: bool = True, timeout: int = 30, max_retries: int = 3) -> None
Download a file or archive with wget.
- Parameters:
url (str) – URL of the file to download
path (FilePath | WriteBuffer) – File path or buffer
target_filename (str) – Specific file to extract from ZIP if URL is a ZIP file
verify (bool) – Whether to verify the server's SSL certificate
timeout (int) – Timeout in seconds for the request
max_retries (int) – Number of times to retry the download if it fails
- Return type:
None
- napistu.utils.io_utils.extract(file_uri: str) -> None
Extract archive at file_uri to same directory.
Supports: .tar.gz, .tgz, .zip, .gz
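The dispatch on file extension listed above can be sketched with the standard library alone. This is a minimal, local-only illustration and not napistu's implementation: `extract_local` is a hypothetical name, and the documented function is described as taking a URI rather than only a local path.

```python
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path


def extract_local(file_path: str) -> None:
    """Extract a local archive into the directory that contains it (hypothetical helper)."""
    path = Path(file_path)
    out_dir = path.parent
    name = path.name

    if name.endswith((".tar.gz", ".tgz")):
        # Check tarballs first, since ".tar.gz" also ends with ".gz".
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(out_dir)
    elif name.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            zf.extractall(out_dir)
    elif name.endswith(".gz"):
        # A plain gzip file decompresses to the same name minus ".gz".
        with gzip.open(path, "rb") as src, open(out_dir / name[:-3], "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        raise ValueError(f"Unsupported archive type: {name}")
```

Testing the tarball suffixes before `.gz` matters here, because a `.tar.gz` file would otherwise be treated as a single gzipped file.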
- napistu.utils.io_utils.gunzip(gzipped_path: str, outpath: str | None = None) -> None
Gunzip a file to an output path.
- Parameters:
gzipped_path (str) – Path or URI to the gzipped file (e.g., ‘/local/file.gz’, ‘gs://bucket/file.gz’).
outpath (str | None, optional) – Path or URI to the output file. If None, automatically determined by removing the .gz extension from gzipped_path.
- Return type:
None
- Raises:
FileNotFoundError – If gzipped_path does not exist.
Examples
>>> gunzip('/tmp/data.txt.gz')  # Creates /tmp/data.txt
>>> gunzip('gs://bucket/data.txt.gz', 'gs://bucket/output.txt')
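The default-output rule (drop the .gz extension when outpath is None) can be illustrated with a local-only sketch. `gunzip_local` is a hypothetical stand-in, not the library function, and raising ValueError for inputs that do not end in .gz is an assumption made for this example.

```python
import gzip
import os
import shutil
from typing import Optional


def gunzip_local(gzipped_path: str, outpath: Optional[str] = None) -> None:
    """Decompress a local gzip file (hypothetical helper, local paths only)."""
    if not os.path.exists(gzipped_path):
        raise FileNotFoundError(gzipped_path)
    if outpath is None:
        # Assumption for this sketch: refuse to guess when there is no ".gz" suffix.
        if not gzipped_path.endswith(".gz"):
            raise ValueError("Cannot infer outpath: input does not end in .gz")
        outpath = gzipped_path[: -len(".gz")]
    # Stream the decompressed bytes so large files are not held in memory.
    with gzip.open(gzipped_path, "rb") as src, open(outpath, "wb") as dst:
        shutil.copyfileobj(src, dst)
```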
- napistu.utils.io_utils.load_json(uri: str) -> Any
Read JSON from a URI.
- Parameters:
uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).
- Returns:
The parsed JSON object (dict, list, etc.).
- Return type:
Any
Examples
>>> data = load_json('/tmp/config.json')
>>> data = load_json('gs://bucket/config.json')
- napistu.utils.io_utils.load_parquet(uri: str | Path) -> DataFrame
Read a DataFrame from a Parquet file.
- Parameters:
uri (Union[str, Path]) – Path or URI to the Parquet file to load (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’).
- Returns:
The DataFrame loaded from the Parquet file.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the specified file does not exist.
Examples
>>> df = load_parquet('/tmp/data.parquet')
>>> df = load_parquet('gs://bucket/data.parquet')
- napistu.utils.io_utils.load_pickle(path: str) -> Any
Load a pickle object from a path or URI.
- Parameters:
path (str) – Path or URI to the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).
- Returns:
The unpickled object.
- Return type:
Any
Examples
>>> obj = load_pickle('/tmp/data.pkl')
>>> obj = load_pickle('gs://bucket/data.pkl')
- napistu.utils.io_utils.pickle_cache(path: str, overwrite: bool = False) -> Callable
A decorator to cache a function call result to pickle.
Attention: the cache ignores the function arguments. All calls to the decorated function are served from the same pickle file.
- Parameters:
path (str) – Path to the cache pickle file
overwrite (bool) – Whether to overwrite an existing cache file
- Returns:
A function whose output will be cached to pickle.
- Return type:
Callable
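The warning about ignored arguments is easiest to see in a small stand-in. The sketch below is not napistu's code: `pickle_cache_sketch` is a hypothetical local-only decorator, and treating `overwrite=True` as "always recompute and rewrite the file" is an assumption.

```python
import functools
import os
import pickle
import tempfile
from typing import Any, Callable


def pickle_cache_sketch(path: str, overwrite: bool = False) -> Callable:
    """Cache a function's result in one pickle file, whatever the arguments."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # The cache key is only the file path, so arguments are never compared.
            if os.path.exists(path) and not overwrite:
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


cache_file = os.path.join(tempfile.mkdtemp(), "square.pkl")


@pickle_cache_sketch(cache_file)
def square(x: int) -> int:
    return x * x


first = square(3)   # computes 9 and writes the cache file
second = square(4)  # served from the cache file, so this is also 9
```

Because a different argument still returns the cached value, a decorator of this kind suits expensive zero-argument loaders better than general-purpose functions.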
- napistu.utils.io_utils.requests_retry_session(retries=5, backoff_factor=0.3, status_forcelist=(500, 502, 503, 504), session: Session | None = None, **kwargs) -> Session
Requests session with retry logic.
This should help to combat flaky APIs, e.g., Brenda. From: https://stackoverflow.com/a/58687549
- Parameters:
retries (int) – Number of retries. Defaults to 5.
backoff_factor (float) – Backoff factor. Defaults to 0.3.
status_forcelist (tuple) – Errors to retry. Defaults to (500, 502, 503, 504).
session (requests.Session | None) – Existing session. Defaults to None.
- Return type:
requests.Session
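A common way to build such a session is to mount an `HTTPAdapter` configured with a urllib3 `Retry` policy. The sketch below shows that general pattern with the parameters documented above; it is an assumption that napistu's function works the same way, `retry_session_sketch` is a hypothetical name, and the extra `**kwargs` are left out because their use is not documented here.

```python
from typing import Optional, Tuple

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def retry_session_sketch(
    retries: int = 5,
    backoff_factor: float = 0.3,
    status_forcelist: Tuple[int, ...] = (500, 502, 503, 504),
    session: Optional[requests.Session] = None,
) -> requests.Session:
    """Return a session that retries failed requests with backoff (sketch)."""
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    # Apply the same retry policy to both URL schemes.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A caller would then use the returned object like any other session, for example `retry_session_sketch(retries=3).get(url, timeout=30)`.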
- napistu.utils.io_utils.save_json(uri: str, obj: Any) -> None
Write object to JSON file at URI.
- Parameters:
uri (str) – Path or URI to the JSON file (e.g., ‘/local/path.json’, ‘gs://bucket/file.json’).
obj (Any) – Object to serialize to JSON.
- Return type:
None
Examples
>>> save_json('/tmp/config.json', {'key': 'value'})
>>> save_json('gs://bucket/config.json', {'key': 'value'})
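A save/load round trip shows what survives JSON serialization. The sketch uses only the standard library and local paths; `save_json_local` and `load_json_local` are hypothetical stand-ins for the documented functions, which also accept remote URIs.

```python
import json
import os
import tempfile
from typing import Any


def save_json_local(path: str, obj: Any) -> None:
    """Serialize obj to a local JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(obj, fh)


def load_json_local(path: str) -> Any:
    """Parse a local JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


config_path = os.path.join(tempfile.mkdtemp(), "config.json")
save_json_local(config_path, {"key": "value", "ids": (1, 2)})
restored = load_json_local(config_path)
# Tuples come back as lists because JSON has no tuple type.
```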
- napistu.utils.io_utils.save_parquet(df: DataFrame, uri: str | Path, compression: str = 'snappy') -> None
Write a DataFrame to a single Parquet file.
- Parameters:
df (pd.DataFrame) – The DataFrame to save.
uri (Union[str, Path]) – Path or URI where to save the Parquet file (e.g., ‘/local/data.parquet’, ‘gs://bucket/data.parquet’). Recommended extensions: .parquet or .pq
compression (str, default='snappy') – Compression algorithm. Options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.
- Raises:
OSError – If the file cannot be written to (permission issues, etc.).
Examples
>>> save_parquet(df, '/tmp/data.parquet')
>>> save_parquet(df, 'gs://bucket/data.parquet', compression='gzip')
- napistu.utils.io_utils.save_pickle(path: str, dat: Any) -> None
Save object to path as pickle.
- Parameters:
path (str) – Path or URI where to save the pickle file (e.g., ‘/local/file.pkl’, ‘gs://bucket/file.pkl’).
dat (Any) – Object to pickle.
- Return type:
None
Examples
>>> save_pickle('/tmp/data.pkl', my_object)
>>> save_pickle('gs://bucket/data.pkl', my_object)
- napistu.utils.io_utils.write_file_contents_to_path(path: str, contents: bytes) -> None
Write file contents to a path or URI.
Handles both file-like objects with write() method and string paths/URIs.
- Parameters:
path (str) – Destination path or URI, or a file-like object with write() method.
contents (bytes) – File contents to write.
- Return type:
None
Examples
>>> write_file_contents_to_path('/tmp/file.txt', b'Hello')
>>> write_file_contents_to_path('gs://bucket/file.txt', b'Hello')
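The two accepted kinds of destination (a file-like object with write(), or a string path) can be told apart with a simple attribute check. The following is a local-only sketch under that assumption; `write_contents_sketch` is a hypothetical name and remote URIs are not handled.

```python
import io
import os
import tempfile
from typing import BinaryIO, Union


def write_contents_sketch(path: Union[str, BinaryIO], contents: bytes) -> None:
    """Write bytes to a file-like object or to a local path (hypothetical helper)."""
    if hasattr(path, "write"):
        # File-like destination: the caller owns the object and closes it.
        path.write(contents)
    else:
        with open(path, "wb") as fh:
            fh.write(contents)


buffer = io.BytesIO()
write_contents_sketch(buffer, b"Hello")

target = os.path.join(tempfile.mkdtemp(), "file.txt")
write_contents_sketch(target, b"Hello")
```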