napistu.utils.string_utils

Utilities for string operations and text processing.

Public Functions

extract_regex_match(regex: str, query: str) -> str:

Extract a matched substring using regex match on the full string.

extract_regex_search(regex: str, query: str, index_value: int = 0) -> str:

Extract a matched substring using regex search.

match_regex_dict(s: str, regex_dict: Dict[str, any]) -> Optional[any]:

Apply each regex in regex_dict to the string s and return the first match value.

safe_capitalize(text: str) -> str:

Capitalize first letter only, preserving case of rest.

safe_fill(x: str, fill_width: int = 15) -> str:

Safely wrap a string to a specified width.

safe_join_set(values: Any) -> str | None:

Safely join values with an " OR " separator, filtering out None values.

safe_series_tolist(x: str | pd.Series) -> list:

Convert either a pandas Series or a str to a list.

score_nameness(string: str) -> int:

Score how name-like a string is (lower score is more name-like).

Functions

extract_regex_match(regex, query)

extract_regex_search(regex, query[, index_value])

Match an identifier substring and otherwise throw an error

match_regex_dict(s, regex_dict)

Apply each regex in regex_dict to the string s.

safe_capitalize(text)

Capitalize first letter only, preserve case of rest.

safe_fill(x[, fill_width])

Safely wrap a string to a specified width.

safe_join_set(values)

Safely join values, filtering out None values.

score_nameness(string)

Score Nameness

napistu.utils.string_utils._add_nameness_score(df, name_var)

Add a nameness_score variable which reflects how name-like each entry is.

napistu.utils.string_utils._add_nameness_score_wrapper(df, name_var, table_schema)

Call _add_nameness_score with default value.

napistu.utils.string_utils.extract_regex_match(regex: str, query: str) str

Extract a matched substring using regex match on the full string.

Parameters:
  • regex (str) – regular expression to search

  • query (str) – string to search against

Returns:

match – a character string match

Return type:

str

napistu.utils.string_utils.extract_regex_search(regex: str, query: str, index_value: int = 0) str

Match an identifier substring and otherwise throw an error.

Parameters:
  • regex (str) – regular expression to search

  • query (str) – string to search against

  • index_value (int) – entry in index to return

Returns:

match – a character string match

Return type:

str
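The difference between matching and searching can be illustrated with the standard library alone. This is a minimal sketch, not napistu's implementation: sketch_match and sketch_search are hypothetical names, and the use of re.match and re.search, the ValueError, and the reading of index_value as a group index are all assumptions based on the descriptions above.

```python
import re


def sketch_search(regex: str, query: str, index_value: int = 0) -> str:
    """Return part of the first match found anywhere in `query`."""
    found = re.search(regex, query)
    if found is None:
        # Assumption: a failed lookup raises ValueError.
        raise ValueError(f"{query!r} does not contain a match for {regex!r}")
    # Assumption: index_value selects a group; 0 is the whole match.
    return found[index_value]


def sketch_match(regex: str, query: str) -> str:
    """Return the match only if it begins at the start of `query`."""
    found = re.match(regex, query)
    if found is None:
        raise ValueError(f"{query!r} does not start with a match for {regex!r}")
    return found[0]


# search finds the identifier even though it is preceded by a prefix
print(sketch_search(r"ENSG[0-9]+", "gene:ENSG00000139618"))
# match succeeds only when the identifier starts the string
print(sketch_match(r"ENSG[0-9]+", "ENSG00000139618.5"))
```

Under these assumptions, calling sketch_match on "gene:ENSG00000139618" raises ValueError because the pattern does not match at the start of the string.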

napistu.utils.string_utils.match_regex_dict(s: str, regex_dict: Dict[str, any]) any | None

Apply each regex in regex_dict to the string s. If a regex matches, return its value. If no regex matches, return None.

Parameters:
  • s (str) – The string to test.

  • regex_dict (dict) – Dictionary where keys are regex patterns (str), and values are the values to return.

Returns:

The value associated with the first matching regex, or None if no match.
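The lookup described above can be sketched as a loop over the dictionary in insertion order. The name sketch_match_regex_dict is hypothetical, the use of re.search (rather than re.match) is an assumption, and the patterns and labels in the example dictionary are illustrative only.

```python
import re
from typing import Any, Dict, Optional


def sketch_match_regex_dict(s: str, regex_dict: Dict[str, Any]) -> Optional[Any]:
    """Return the value of the first pattern that matches `s`, else None."""
    for pattern, value in regex_dict.items():
        # Assumption: a pattern "matches" if re.search finds it anywhere in s.
        if re.search(pattern, s):
            return value
    return None


# Illustrative patterns and labels, not taken from napistu.
id_types = {
    r"^ENSG[0-9]+$": "ensembl_gene",
    r"^[0-9]+$": "numeric_id",
}

print(sketch_match_regex_dict("ENSG00000139618", id_types))
print(sketch_match_regex_dict("not an id", id_types))
```

Because dictionaries preserve insertion order, the order of the keys decides which value wins when several patterns match.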

napistu.utils.string_utils.safe_capitalize(text: str) str

Capitalize first letter only, preserve case of rest.
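The contrast with the built-in str.capitalize, which lowercases everything after the first character, can be shown with a short sketch. sketch_safe_capitalize is a hypothetical name and the slicing approach is an assumption about how such a helper could be written.

```python
def sketch_safe_capitalize(text: str) -> str:
    """Uppercase the first character and leave the rest untouched."""
    # Slicing (rather than indexing) keeps the empty string safe.
    return text[:1].upper() + text[1:]


print("ATP synthase".capitalize())            # Atp synthase
print(sketch_safe_capitalize("ATP synthase"))  # ATP synthase
print(sketch_safe_capitalize("nuclear DNA"))   # Nuclear DNA
```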

napistu.utils.string_utils.safe_fill(x: str, fill_width: int = 15) str

Safely wrap a string to a specified width.

Parameters:
  • x (str) – The string to wrap.

  • fill_width (int, optional) – The width to wrap the string to. Default is 15.

Returns:

The wrapped string.

Return type:

str
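Wrapping to a fixed width can be sketched with textwrap.fill from the standard library. Whether safe_fill uses textwrap, and what cases "safely" covers, is not stated here; sketch_fill is a hypothetical name and only the wrapping step is shown.

```python
import textwrap


def sketch_fill(x: str, fill_width: int = 15) -> str:
    """Wrap `x` so that lines break at word boundaries near `fill_width`."""
    # Assumption: wrapping is delegated to textwrap.fill.
    return textwrap.fill(x, width=fill_width)


wrapped = sketch_fill("pyruvate dehydrogenase complex")
print(wrapped)
# pyruvate
# dehydrogenase
# complex
```

A narrow default such as 15 characters suits short labels, for example node names in a plot.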

napistu.utils.string_utils.safe_join_set(values: Any) str | None

Safely join values, filtering out None values.

Converts input to a set (ensuring uniqueness), removes None values, and joins remaining values with " OR " separator in sorted order.

Parameters:

values (Any) – Values to join. Can be list, tuple, set, pandas Series, string, or other iterable. Strings are treated as single values, not character sequences.

Returns:

Joined string with " OR " separator in alphabetical order, or None if no valid values remain after filtering.

Return type:

str or None

Examples

>>> safe_join_set([1, 2, 3])
'1 OR 2 OR 3'
>>> safe_join_set([3, 1, 2, 1])  # Removes duplicates and sorts
'1 OR 2 OR 3'
>>> safe_join_set([1, None, 3])
'1 OR 3'
>>> safe_join_set([None, None]) is None
True
>>> safe_join_set("hello")  # String treated as single value
'hello'
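
The steps described for this function (deduplicate, drop None, sort, join) can be sketched as follows. sketch_join_set is a hypothetical name; sorting the string forms of the values is an assumption, and pandas-specific handling is left out.

```python
from typing import Any, Optional


def sketch_join_set(values: Any) -> Optional[str]:
    """Join unique, non-None values with ' OR ' in sorted order."""
    # A string is one value, not a sequence of characters.
    if isinstance(values, str):
        return values
    unique = {v for v in values if v is not None}
    if not unique:
        return None
    # Assumption: values are sorted by their string form ("alphabetical").
    return " OR ".join(sorted(str(v) for v in unique))


print(sketch_join_set([3, 1, 2, 1]))
print(sketch_join_set(["b", None, "a"]))
```

Under this assumption numbers sort as text, so 10 would be placed before 2.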

napistu.utils.string_utils.score_nameness(string: str)

Score Nameness

This utility assigns a numeric score to a string reflecting how likely it is to be a human-readable name. The score helps prioritize readable entries when picking a single name to display from a set of values that may also include entries such as systematic IDs.

Parameters:

string (str) – An alphanumeric string

Returns:

score – An integer score indicating how name-like the string is (low is more name-like)

Return type:

int
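
The intended use, picking the most name-like entry from a set of candidates, can be sketched with a stand-in scorer. toy_nameness is a hypothetical function whose rule (count digits and symbols) is invented for illustration; it is not the scoring rule used by score_nameness. The candidate strings are illustrative too.

```python
def toy_nameness(string: str) -> int:
    """Hypothetical stand-in scorer; lower means more name-like."""
    digits = sum(ch.isdigit() for ch in string)
    symbols = sum(not (ch.isalnum() or ch.isspace()) for ch in string)
    return digits + symbols


candidates = ["YFL039C", "ACT1", "actin"]

# Because low scores are more name-like, min() selects the display name.
display_name = min(candidates, key=toy_nameness)
print(display_name)  # actin
```

Any scorer that follows the same convention, with lower meaning more name-like, can be passed as the key in the same way.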