zvec_db.rerankers.utils.normalize

Normalization utilities for post-processing raw document relevance scores.

This module defines a lightweight helper that is used by the retrieval layer in the surrounding application. Scoring routines such as those implemented in zvec_db.rerankers return raw floats that can vary wildly in magnitude between queries and algorithms. Feeding these unbounded values directly into other parts of the stack (reranking, ensembles, thresholding) results in unexpected behaviour and makes tuning difficult.

The Normalize class maps the scores in a hunter-style list of (uid, score) tuples into the unit interval [0.0, 1.0]. Several normalisation strategies are supported:

standard (default) - index-aware scaling that divides by an estimated maximum score and clips values to the unit interval.
Bayesian / BB25 - sigmoid calibration particularly useful when only the relative ordering of positive scores matters. Robust to outliers.
minmax - simple (x - min) / (max - min) scaling. Preserves relative distances.
percentile - rank-based normalization. Very robust to outliers.
cosine - no-op (identity). COSINE scores are already in [0, 1] after conversion.
atan - arctan-based normalization for unbounded scores.
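
As an illustration of the two simplest strategies in the list above, the following is a minimal pure-Python sketch of minmax and percentile scaling. The function names, the ScoreList alias and the handling of edge cases (empty, single-element or constant lists) are assumptions made for this example; the module's own implementation may differ.

```python
from typing import List, Tuple

ScoreList = List[Tuple[str, float]]


def minmax_scale(scores: ScoreList) -> ScoreList:
    """Map scores linearly so the minimum becomes 0.0 and the maximum 1.0."""
    if not scores:
        return []
    values = [s for _, s in scores]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant list maps to 1.0 to avoid dividing by zero.
        return [(uid, 1.0) for uid, _ in scores]
    return [(uid, (s - lo) / (hi - lo)) for uid, s in scores]


def percentile_scale(scores: ScoreList) -> ScoreList:
    """Replace each score by its rank position, spread evenly over [0, 1]."""
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [(scores[0][0], 1.0)]
    # Rank 0 is the lowest score, rank n - 1 the highest (ties keep input order).
    order = sorted(range(n), key=lambda i: scores[i][1])
    rank = {idx: r for r, idx in enumerate(order)}
    return [(uid, rank[i] / (n - 1)) for i, (uid, _) in enumerate(scores)]


results = [("doc1", 100.0), ("doc2", 2.0), ("doc3", 1.0)]
print(minmax_scale(results))      # doc2 ends up close to 0.0 because of the outlier
print(percentile_scale(results))  # doc2 ends up at 0.5 regardless of the outlier
```

The outlier score of 100.0 compresses the minmax output for the remaining documents, whereas the rank-based variant depends only on ordering, which is why it is described as very robust to outliers.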

Configuration may be supplied as a simple string (e.g. "bayes") or as a more detailed dictionary containing method, alpha and beta keys. Default values are chosen to mirror those described in the reference implementation.

Constants

SIGMOID_CLIP_MIN, SIGMOID_CLIP_MAX (float): Bounds for clipping logits before sigmoid computation. The value ±500 prevents overflow in np.exp() while being large enough to not affect practical score ranges (exp(-500) ≈ 10^-217, effectively zero).
DEFAULT_ALPHA (float): Default scale parameter for Bayesian normalization.
DEFAULT_BETA (None): Default center parameter (None triggers median-based automatic selection).
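
The effect of the clip bounds can be sketched as follows. This example uses math.exp from the standard library rather than np.exp so that it runs without NumPy; the function name clipped_sigmoid is hypothetical and the constants simply mirror the ±500 values described above.

```python
import math

# Values mirror the constants described above; the function name is hypothetical.
SIGMOID_CLIP_MIN = -500.0
SIGMOID_CLIP_MAX = 500.0


def clipped_sigmoid(logit: float) -> float:
    """Logistic function with the logit clamped to the clip bounds first."""
    clamped = max(SIGMOID_CLIP_MIN, min(SIGMOID_CLIP_MAX, logit))
    return 1.0 / (1.0 + math.exp(-clamped))


print(clipped_sigmoid(0.0))   # midpoint of the curve
print(clipped_sigmoid(1e6))   # saturates at the upper end
print(clipped_sigmoid(-1e6))  # tiny positive value, no overflow
```

Without the clamp, math.exp(1e6) raises OverflowError; with it, the largest exponent evaluated is exp(500), which still fits in a double-precision float.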

Example usage:

normaliser = Normalize({'method': 'bayes', 'alpha': 2.0})
results = [("doc1", 3.2), ("doc2", 0.5), ("doc3", -1.0)]
calibrated = normaliser(results)
# calibrated -> [("doc1", 1.0), ("doc2", 0.646...), ("doc3", 0.0)]

The module has no dependencies outside of NumPy, which is already required by other parts of the project.

Classes

Normalize([config])

Callable normaliser for lists of (uid, score) pairs.

class zvec_db.rerankers.utils.normalize.Normalize(config=None)

Callable normaliser for lists of (uid, score) pairs.

Instances behave like functions: call them with a score list and an optional avgscore and the result will be a new list with all scores mapped into the closed unit interval. The precise transformation is determined by the configuration supplied at construction time.

Parameters: config (Union[bool, str, Dict[str, Any], None])

method

Lowercase string naming the chosen normalisation algorithm.

Type: str

alpha

Scale parameter used in Bayesian modes.

Type: float

beta

Centre parameter used in Bayesian modes; None triggers median-based automatic selection.

Type: Optional[float]
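
To illustrate how alpha and beta could shape the calibration curve, here is a minimal sketch. It assumes the transformation is sigmoid(alpha * (score - beta)), with beta taken as the median of the positive scores when it is None and non-positive scores mapped to 0.0. The exact formula is not stated on this page, so the function name bayes_sketch and the formula itself are assumptions, and the output is not expected to reproduce the library's values.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

ScoreList = List[Tuple[str, float]]


def bayes_sketch(scores: ScoreList, alpha: float = 1.0,
                 beta: Optional[float] = None) -> ScoreList:
    """Hypothetical sigmoid calibration: sigmoid(alpha * (score - beta))."""
    positives = [s for _, s in scores if s > 0]
    if not positives:
        return [(uid, 0.0) for uid, _ in scores]
    if beta is None:
        # Assumed auto-selection: centre the curve on the median positive score.
        beta = median(positives)
    out = []
    for uid, s in scores:
        if s <= 0:
            out.append((uid, 0.0))  # non-positive scores map to 0.0
            continue
        # Clamp the logit so math.exp cannot overflow.
        logit = max(-500.0, min(500.0, alpha * (s - beta)))
        out.append((uid, 1.0 / (1.0 + math.exp(-logit))))
    return out


sample = [("a", 4.0), ("b", 2.0), ("c", -1.0)]
print(bayes_sketch(sample))             # gentle curve centred on the median 3.0
print(bayes_sketch(sample, alpha=5.0))  # steeper curve, values pushed towards 0 and 1
```

Under these assumptions, a larger alpha makes the curve steeper around beta, while beta moves the score that maps to 0.5.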

__init__(config=None)

Initialise a Normalize instance.

Parameters:

config (bool, str, dict or None, optional) –

Configuration object that selects the normalisation strategy. The following forms are interpreted:

None or False : equivalent to "default" - standard index-aware scaling.
truthy non-dict, non-string value (e.g. True) : also selects the default behaviour.
str : the string value is converted to lower case and used as the method name. Supported methods:
- "bayes", "bayesian", "bb25" : Bayesian sigmoid calibration
- "minmax" : (x - min) / (max - min)
- "percentile" (alias: "rank") : rank-based normalization
- "default" : standard index-aware scaling
dict : a copy of the dictionary is stored, and may contain the keys method (string), alpha (float) and beta (float or None). Any missing keys will be filled with defaults (alpha defaults to 1.0; beta to None).

Notes

The configuration is shallow-copied to prevent external modification from affecting the normaliser’s internal state.
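
The configuration rules above can be sketched as a small standalone helper. The name resolve_config is hypothetical, as are two details not stated above: that a dictionary without a method key falls back to "default", and that a method given inside a dictionary is lower-cased in the same way as a plain string.

```python
from typing import Any, Dict, Union

Config = Union[bool, str, Dict[str, Any], None]


def resolve_config(config: Config = None) -> Dict[str, Any]:
    """Hypothetical helper that turns any accepted config form into a dict."""
    if isinstance(config, dict):
        resolved = dict(config)  # shallow copy; the caller's dict is untouched
    elif isinstance(config, str):
        resolved = {"method": config}
    else:
        # None, False and other non-dict, non-string values select the default.
        resolved = {}
    # Assumption: a missing method means "default"; names are lower-cased.
    resolved["method"] = str(resolved.get("method", "default")).lower()
    resolved.setdefault("alpha", 1.0)
    resolved.setdefault("beta", None)
    return resolved


user_config = {"method": "bayes", "alpha": 2.0}
print(resolve_config(user_config))  # beta is filled in with None
print(user_config)                  # original dictionary is unchanged
print(resolve_config("BB25"))       # string form, lower-cased
```

Because the dictionary is copied before defaults are filled in, later changes to user_config do not leak into the resolved configuration, which is the behaviour the note above describes.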

__call__(scores, avgscore=0.0)

Normalise a list of document scores.

Parameters:

scores (ScoreList) – Sequence of (uid, score) pairs, typically produced by a retrieval algorithm. It is assumed that the list is sorted in descending order of score; the method will use the first entry to compute the maximum when performing default scaling.
avgscore (float, optional) – Average score computed over the entire corpus. This is only used by the default normalisation strategy. In Bayesian modes the value is ignored entirely.

Returns:

New list where each score has been replaced with a value in [0.0, 1.0] according to the chosen transformation.

Return type:

ScoreList
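
The call contract (a list sorted in descending order goes in, a new list with scores in [0.0, 1.0] comes out) can be illustrated with a simplified stand-in for the default strategy. This sketch assumes the maximum is taken directly from the first entry; how avgscore contributes to the estimated maximum is not described here, so it is left out, and the name default_sketch is hypothetical.

```python
from typing import List, Tuple

ScoreList = List[Tuple[str, float]]


def default_sketch(scores: ScoreList) -> ScoreList:
    """Hypothetical default scaling: divide by the top score, then clip."""
    if not scores:
        return []
    # Input is assumed sorted in descending order, so the first entry is the maximum.
    max_score = scores[0][1]
    if max_score <= 0:
        # Assumption: with no positive maximum there is nothing to scale against.
        return [(uid, 0.0) for uid, _ in scores]
    return [(uid, min(1.0, max(0.0, s / max_score))) for uid, s in scores]


ranked = [("doc1", 8.0), ("doc2", 2.0), ("doc3", -1.0)]
scaled = default_sketch(ranked)
print(scaled)  # a new list; `ranked` itself is left unmodified
```

Clipping keeps negative inputs at 0.0 and preserves the relative ordering of the remaining scores.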

Notes

Multiple normalisation methods are supported:

default – scales scores relative to an estimated maximum and clips values. This keeps the relative ordering intact but bounds the range.
bayesian – applies a sigmoid function calibrated using the positive scores only. Negative or zero input scores are mapped to 0.0 unconditionally. Robust to outliers.
minmax – (x - min) / (max - min). Preserves relative distances.
percentile – rank-based normalization. Very robust to outliers.
cosine – no-op (identity). COSINE conversion (2-score)/2 already produces scores in [0, 1], so no additional normalization is needed.
atan – arctan-based normalization: 1 - 2*atan(s)/pi for L2, 0.5 + atan(s)/pi for IP. Maps unbounded scores to [0, 1].
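
The closed-form expressions quoted for the cosine and atan entries can be written out directly. The function names below are hypothetical, and the sketch assumes that the raw cosine value is a distance in [0, 2] and that L2 scores are non-negative distances.

```python
import math


def cosine_to_unit(score: float) -> float:
    """COSINE conversion quoted above: (2 - score) / 2."""
    return (2.0 - score) / 2.0


def atan_l2(score: float) -> float:
    """L2 form: 1 - 2*atan(s)/pi; smaller distances give higher values."""
    return 1.0 - 2.0 * math.atan(score) / math.pi


def atan_ip(score: float) -> float:
    """IP form: 0.5 + atan(s)/pi; larger inner products give higher values."""
    return 0.5 + math.atan(score) / math.pi


for s in (0.0, 1.0, 10.0):
    print(s, atan_l2(s), atan_ip(s))
```

Since atan(s) approaches pi/2 as s grows, both atan forms stay inside the unit interval however large the raw score becomes.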