napistu.utils.string_utils

Utilities for string operations and text processing.

Public Functions

extract_regex_match(regex: str, query: str) -> str: Extract a matched substring using regex match on the full string.
extract_regex_search(regex: str, query: str, index_value: int = 0) -> str: Extract a matched substring using regex search.
match_regex_dict(s: str, regex_dict: Dict[str, any]) -> Optional[any]: Apply each regex in regex_dict to the string s and return the value associated with the first matching regex.
safe_capitalize(text: str) -> str: Capitalize the first letter only, preserving the case of the rest.
safe_fill(x: str, fill_width: int = 15) -> str: Safely wrap a string to a specified width.
safe_join_set(values: Any) -> str | None: Safely join values with an " OR " separator, filtering out None values.
safe_series_tolist(x: str | pd.Series) -> list: Convert either a pandas Series or a str to a list.
score_nameness(string: str) -> int: Score how name-like a string is (lower score is more name-like).

Functions

`extract_regex_match`(regex, query)	Extract a matched substring using regex match on the full string.
`extract_regex_search`(regex, query[, index_value])	Extract a matched substring using regex search; raise an error if there is no match.
`match_regex_dict`(s, regex_dict)	Apply each regex in regex_dict to the string s.
`safe_capitalize`(text)	Capitalize first letter only, preserve case of rest.
`safe_fill`(x[, fill_width])	Safely wrap a string to a specified width.
`safe_join_set`(values)	Safely join values, filtering out None values.
`score_nameness`(string)	Score how name-like a string is (lower score is more name-like).

napistu.utils.string_utils._add_nameness_score(df, name_var): Add a nameness_score variable which reflects how name-like each entry is.

napistu.utils.string_utils._add_nameness_score_wrapper(df, name_var, table_schema): Call _add_nameness_score with default value.

napistu.utils.string_utils.extract_regex_match(regex: str, query: str) → str

Extract a matched substring using regex match on the full string.

Parameters:

regex (str) – Regular expression to match against the query.
query (str) – String to match against.

Returns:

The matched substring.

Return type:

str
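
For illustration, extracting a substring with a regex match can be sketched with the standard library `re` module. This is a minimal sketch, not napistu's implementation: the helper name `sketch_regex_match`, the choice to return the first capture group, and the `ValueError` raised on failure are all assumptions.

```python
import re


def sketch_regex_match(regex: str, query: str) -> str:
    """Hypothetical stand-in for extract_regex_match (not the napistu source)."""
    # re.match anchors the pattern at the start of the query.
    m = re.match(regex, query)
    if m is None:
        raise ValueError(f"{query!r} does not match {regex!r}")
    # Assumption: prefer the first capture group, else the whole match.
    return m.group(1) if m.groups() else m.group(0)


# Strip the version suffix from a versioned Ensembl gene identifier.
gene_id = sketch_regex_match(r"(ENSG\d+)\.\d+$", "ENSG00000139618.12")
print(gene_id)  # ENSG00000139618
```

Because `re.match` only succeeds when the pattern matches from the first character, a query with leading text would fail here even if the identifier appears later in the string.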

napistu.utils.string_utils.extract_regex_search(regex: str, query: str, index_value: int = 0) → str

Extract a matched substring using regex search. An error is raised if the pattern is not found.

Parameters:

regex (str) – Regular expression to search for.
query (str) – String to search against.
index_value (int) – Index of the entry in the match to return. Default is 0.

Returns:

The matched substring.

Return type:

str
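
As a rough illustration of a search-based extraction, the sketch below uses `re.search` and indexes into the resulting match object. The helper name `sketch_regex_search`, the interpretation of `index_value` as an index into the match (0 for the full match, 1 for the first capture group), and the `ValueError` are assumptions, not confirmed details of the library.

```python
import re


def sketch_regex_search(regex: str, query: str, index_value: int = 0) -> str:
    """Hypothetical stand-in for extract_regex_search (not the napistu source)."""
    # re.search scans the whole query rather than anchoring at the start.
    m = re.search(regex, query)
    if m is None:
        raise ValueError(f"{regex!r} was not found in {query!r}")
    # Assumption: index 0 is the full match, 1 is the first capture group.
    return m[index_value]


text = "reactome:R-HSA-70171 (glycolysis)"
full_id = sketch_regex_search(r"R-HSA-(\d+)", text)          # "R-HSA-70171"
numeric_part = sketch_regex_search(r"R-HSA-(\d+)", text, 1)  # "70171"
```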

napistu.utils.string_utils.match_regex_dict(s: str, regex_dict: Dict[str, any]) → any | None

Apply each regex in regex_dict to the string s. If a regex matches, return its value. If no regex matches, return None.

Parameters:

s (str) – The string to test.
regex_dict (dict) – Dictionary where keys are regex patterns (str), and values are the values to return.

Returns:

The value associated with the first matching regex, or None if no match.
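
The lookup described here can be sketched as a loop over the dictionary. This is an illustrative sketch only: the helper name `sketch_match_regex_dict`, the use of `re.search` (rather than `re.match` or `re.fullmatch`), and the example patterns and labels are assumptions.

```python
import re
from typing import Any, Dict, Optional


def sketch_match_regex_dict(s: str, regex_dict: Dict[str, Any]) -> Optional[Any]:
    """Hypothetical stand-in for match_regex_dict (not the napistu source)."""
    # Dicts preserve insertion order, so "first match" means first key inserted.
    for pattern, value in regex_dict.items():
        # Assumption: re.search is used; the library may anchor differently.
        if re.search(pattern, s):
            return value
    return None


# Hypothetical patterns mapping identifier formats to labels.
id_patterns = {r"^ENSG\d+$": "ensembl_gene", r"^\d+$": "ncbi_entrez_gene"}

print(sketch_match_regex_dict("ENSG00000139618", id_patterns))  # ensembl_gene
print(sketch_match_regex_dict("675", id_patterns))              # ncbi_entrez_gene
print(sketch_match_regex_dict("BRCA2", id_patterns))            # None
```

When several patterns could match the same string, the order of keys in the dictionary decides which value is returned, so more specific patterns should come first.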

napistu.utils.string_utils.safe_capitalize(text: str) → str: Capitalize first letter only, preserve case of rest.
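
The difference from the built-in `str.capitalize`, which lowercases everything after the first character, can be seen in a short sketch. The helper `sketch_safe_capitalize` below is a hypothetical stand-in written for illustration, not the library's code.

```python
def sketch_safe_capitalize(text: str) -> str:
    """Hypothetical stand-in for safe_capitalize (not the napistu source)."""
    # Slicing (rather than text[0]) avoids an IndexError on the empty string.
    return text[:1].upper() + text[1:]


label = "glucose transporter GLUT1"
print(label.capitalize())             # Glucose transporter glut1
print(sketch_safe_capitalize(label))  # Glucose transporter GLUT1
```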

napistu.utils.string_utils.safe_fill(x: str, fill_width: int = 15) → str

Safely wrap a string to a specified width.

Parameters:

x (str) – The string to wrap.
fill_width (int, optional) – The width to wrap the string to. Default is 15.

Returns:

The wrapped string.

Return type:

str
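
A wrapping helper of this kind can be sketched with `textwrap.fill` from the standard library. Whether napistu uses `textwrap`, and what exactly "safely" guards against, is not stated here; the empty-string check and the helper name `sketch_safe_fill` are assumptions.

```python
import textwrap


def sketch_safe_fill(x: str, fill_width: int = 15) -> str:
    """Hypothetical stand-in for safe_fill (not the napistu source)."""
    # Assumption: "safe" means empty input is returned unchanged.
    if x == "":
        return ""
    # textwrap.fill breaks on whitespace and joins the lines with newlines.
    return textwrap.fill(x, fill_width)


wrapped = sketch_safe_fill("pyruvate dehydrogenase complex")
print(wrapped)
# pyruvate
# dehydrogenase
# complex
```

Wrapped labels like this are convenient for node labels in network plots, where a single long line would overflow the node.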

napistu.utils.string_utils.safe_join_set(values: Any) → str | None

Safely join values, filtering out None values.

Converts input to a set (ensuring uniqueness), removes None values, and joins remaining values with " OR " separator in sorted order.

Parameters: values (Any) – Values to join. Can be list, tuple, set, pandas Series, string, or other iterable. Strings are treated as single values, not character sequences.
Returns: Joined string with " OR " separator in alphabetical order, or None if no valid values remain after filtering.
Return type: str or None

Examples

>>> safe_join_set([1, 2, 3])
'1 OR 2 OR 3'
>>> safe_join_set([3, 1, 2, 1])  # Removes duplicates and sorts
'1 OR 2 OR 3'
>>> safe_join_set([1, None, 3])
'1 OR 3'
>>> safe_join_set([None, None]) is None
True
>>> safe_join_set("hello")  # String treated as single value
'hello'
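
The steps described for this function (wrap strings, deduplicate, drop None, sort, join) can be sketched in a few lines. This is a sketch under stated assumptions rather than the library's code: the helper name `sketch_join_set` is hypothetical, and sorting is done on the string form of each value, which means `10` would sort before `2`.

```python
from typing import Any, Optional


def sketch_join_set(values: Any) -> Optional[str]:
    """Hypothetical stand-in for safe_join_set (not the napistu source)."""
    # A bare string is one value, not a sequence of characters.
    if isinstance(values, str):
        values = [values]
    # The set comprehension removes duplicates and drops None in one pass.
    unique = {v for v in values if v is not None}
    if not unique:
        return None
    # Assumption: values are sorted by their string form before joining.
    return " OR ".join(sorted(str(v) for v in unique))


print(sketch_join_set([3, 1, 2, 1]))     # 1 OR 2 OR 3
print(sketch_join_set(["b", None, "a"]))  # a OR b
```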

napistu.utils.string_utils.score_nameness(string: str)

Score how name-like a string is (lower score is more name-like).

This utility assigns a numeric score to a string reflecting how likely it is to be a human-readable name. The score helps prioritize readable entries when picking a single display name from a set of values that may also include entries such as systematic IDs.

Parameters: string (str) – An alphanumeric string
Returns: An integer score indicating how name-like the string is (low is more name-like)
Return type: int
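
To show how a lower-is-better score can be used to pick a display name, the sketch below uses a toy heuristic. The scoring rule in `toy_nameness` (string length plus a penalty of 5 per digit) is invented for illustration and should not be read as the rule `score_nameness` applies; only the "pick the lowest score" pattern is the point.

```python
def toy_nameness(string: str) -> int:
    """Toy heuristic (an assumption, not napistu's scoring rule)."""
    # Longer strings and digit-heavy strings look less like names.
    digit_penalty = 5 * sum(c.isdigit() for c in string)
    return len(string) + digit_penalty


# Candidate labels for one gene: a systematic ID, a symbol, a numeric ID.
candidates = ["ENSG00000139618", "BRCA2", "675"]

# Lower score means more name-like, so take the minimum.
best_name = min(candidates, key=toy_nameness)
print(best_name)  # BRCA2
```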