zvec_db.rerankers.utils.normalize
Normalization utilities for post-processing raw document relevance scores.
This module defines a lightweight helper that is used by the retrieval layer
in the surrounding application. Scoring routines such as those implemented in
zvec_db.rerankers return raw floats that can vary wildly in magnitude
between queries and algorithms. Feeding these unbounded values directly into
other parts of the stack (reranking, ensembles, thresholding) results in
unexpected behaviour and makes tuning difficult.
The Normalize class converts a hunter-style list of (uid, score)
tuples into the unit interval [0.0, 1.0]. Multiple normalisation strategies
are supported:
standard (default) - index-aware scaling that divides by an estimated maximum score and clips values to the unit interval.
Bayesian / BB25 - sigmoid calibration particularly useful when only the relative ordering of positive scores matters. Robust to outliers.
minmax - simple (x - min) / (max - min) scaling. Preserves relative distances.
percentile - rank-based normalization. Very robust to outliers.
cosine - no-op (identity). COSINE scores are already in [0, 1] after conversion.
atan - arctan-based normalization for unbounded scores.
Configuration may be supplied as a simple string (e.g. "bayes") or as a
more detailed dictionary containing method, alpha and beta keys.
Default values are chosen to mirror those described in the reference
implementation.
Constants
- SIGMOID_CLIP_MIN, SIGMOID_CLIP_MAX : float
Bounds for clipping logits before sigmoid computation. The value ±500 prevents overflow in np.exp() while being large enough not to affect practical score ranges (exp(-500) ≈ 10^-217, effectively zero).
- DEFAULT_ALPHA : float
Default scale parameter for Bayesian normalization.
- DEFAULT_BETA : None
Default center parameter (None triggers median-based automatic selection).
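The way these constants interact with the Bayesian mode can be illustrated with a minimal sketch. The helper name bayes_sketch is hypothetical, and computing the median over the positive scores only is an assumed reading of "median-based automatic selection"; the sketch is not guaranteed to reproduce the library's exact outputs.

```python
import numpy as np

# Values taken from the descriptions in this module's documentation.
SIGMOID_CLIP_MIN = -500.0
SIGMOID_CLIP_MAX = 500.0
DEFAULT_ALPHA = 1.0
DEFAULT_BETA = None


def bayes_sketch(scores, alpha=DEFAULT_ALPHA, beta=DEFAULT_BETA):
    """Hypothetical helper: sigmoid-calibrate a 1-D sequence of raw scores."""
    scores = np.asarray(scores, dtype=float)
    positive = scores[scores > 0]
    if positive.size == 0:
        # Nothing to calibrate against: every score maps to 0.0.
        return np.zeros_like(scores)
    if beta is None:
        # Assumption: the centre is the median of the positive scores.
        beta = float(np.median(positive))
    # Clip the logits so np.exp() cannot overflow for extreme inputs.
    logits = np.clip(alpha * (scores - beta), SIGMOID_CLIP_MIN, SIGMOID_CLIP_MAX)
    calibrated = 1.0 / (1.0 + np.exp(-logits))
    # Zero and negative inputs are mapped to 0.0 unconditionally.
    return np.where(scores > 0, calibrated, 0.0)


print(bayes_sketch([3.2, 0.5, -1.0]))
```

With the clip in place, even scores on the order of ±1e6 produce finite outputs, because the exponent passed to np.exp() never leaves the range [-500, 500].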
Example usage:
normaliser = Normalize({'method': 'bayes', 'alpha': 2.0})
results = [("doc1", 3.2), ("doc2", 0.5), ("doc3", -1.0)]
calibrated = normaliser(results)
# calibrated -> [("doc1", 1.0), ("doc2", 0.646...), ("doc3", 0.0)]
The module has no dependencies outside of NumPy, which is already required by other parts of the project.
Classes
Normalize - Callable normaliser for lists of (uid, score) pairs.
- class zvec_db.rerankers.utils.normalize.Normalize(config=None)
Callable normaliser for lists of (uid, score) pairs.
Instances behave like functions: call them with a score list and an optional avgscore, and the result will be a new list with all scores mapped into the closed unit interval. The precise transformation is determined by the configuration supplied at construction time.
- beta
Centre parameter used in Bayesian modes; None triggers median-based automatic selection.
- Type:
Optional[float]
- __init__(config=None)
Initialise a Normalize instance.
- Parameters:
config (bool, str, dict or None, optional) –
Configuration object that selects the normalisation strategy. The following forms are interpreted:
None or False: equivalent to "default", the standard index-aware scaling.
any other truthy value that is neither a str nor a dict (such as True): also selects the default behaviour.
str: the string value is converted to lower case and used as the method name. Supported methods:
- "bayes", "bayesian", "bb25": Bayesian sigmoid calibration
- "minmax": (x - min) / (max - min)
- "percentile" (alias: "rank"): rank-based normalization
- "default": standard index-aware scaling
dict: a copy of the dictionary is stored, and may contain the keys method (string), alpha (float) and beta (float or None). Any missing keys will be filled with defaults (alpha defaults to 1.0; beta to None).
Notes
The configuration is shallow-copied to prevent external modification from affecting the normaliser’s internal state.
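The interpretation rules above can be sketched as a small standalone function. The name resolve_config is hypothetical and is not part of the module; falling back to "default" when a dict omits the method key is also an assumption, since only the alpha and beta defaults are stated explicitly.

```python
DEFAULT_ALPHA = 1.0
DEFAULT_BETA = None


def resolve_config(config=None):
    """Hypothetical helper mirroring the documented handling of config."""
    if isinstance(config, dict):
        # Shallow copy, so later edits to the caller's dict are not seen here.
        resolved = dict(config)
    elif isinstance(config, str):
        resolved = {"method": config.lower()}
    else:
        # None, False and other non-str, non-dict values select the default.
        resolved = {"method": "default"}
    resolved.setdefault("method", "default")  # assumed fallback
    resolved.setdefault("alpha", DEFAULT_ALPHA)
    resolved.setdefault("beta", DEFAULT_BETA)
    return resolved


user_config = {"method": "bayes", "alpha": 2.0}
resolved = resolve_config(user_config)
user_config["alpha"] = 9.0  # does not affect the resolved copy
print(resolved)
```

Because the copy is shallow, only top-level keys are isolated; that is sufficient here, as the documented values (a string and two numbers or None) are immutable.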
- __call__(scores, avgscore=0.0)
Normalise a list of document scores.
- Parameters:
scores (ScoreList) – Sequence of (uid, score) pairs, typically produced by a retrieval algorithm. It is assumed that the list is sorted in descending order of score; the method will use the first entry to compute the maximum when performing default scaling.
avgscore (float, optional) – Average score computed over the entire corpus. This is only used by the default normalisation strategy. In Bayesian modes the value is ignored entirely.
- Returns:
New list where each score has been replaced with a value in [0.0, 1.0] according to the chosen transformation.
- Return type:
ScoreList
Notes
Multiple normalisation methods are supported:
default – scales scores relative to an estimated maximum and clips values. This keeps the relative ordering intact but bounds the range.
bayesian – applies a sigmoid function calibrated using the positive scores only. Negative or zero input scores are mapped to 0.0 unconditionally. Robust to outliers.
minmax – (x - min) / (max - min). Preserves relative distances.
percentile – rank-based normalization. Very robust to outliers.
cosine – no-op (identity). COSINE conversion (2-score)/2 already produces scores in [0, 1], so no additional normalization is needed.
atan – arctan-based normalization: 1 - 2*atan(s)/pi for L2, 0.5 + atan(s)/pi for IP. Maps unbounded scores to [0, 1].
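The closed-form methods can be illustrated with a short sketch. All three function names and the metric parameter are hypothetical. The minmax and atan formulas are the ones quoted above. The percentile formula (rank divided by n - 1) and the handling of degenerate inputs are assumptions, since the text only describes the method as rank-based.

```python
import math


def minmax_sketch(scores):
    """(x - min) / (max - min) over a list of (uid, score) pairs."""
    if not scores:
        return []
    values = [s for _, s in scores]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: identical scores all map to 1.0 to avoid dividing by zero.
        return [(uid, 1.0) for uid, _ in scores]
    return [(uid, (s - lo) / (hi - lo)) for uid, s in scores]


def percentile_sketch(scores):
    """Assumed rank-based form: lowest score -> 0.0, highest -> 1.0."""
    n = len(scores)
    if n <= 1:
        return [(uid, 1.0) for uid, _ in scores]
    order = sorted(range(n), key=lambda i: scores[i][1])  # ascending by score
    ranks = [0] * n
    for rank, i in enumerate(order):
        ranks[i] = rank
    return [(uid, ranks[i] / (n - 1)) for i, (uid, _) in enumerate(scores)]


def atan_sketch(scores, metric="l2"):
    """Arctan mapping; metric is a hypothetical switch between L2 and IP."""
    if metric == "l2":
        return [(uid, 1.0 - 2.0 * math.atan(s) / math.pi) for uid, s in scores]
    return [(uid, 0.5 + math.atan(s) / math.pi) for uid, s in scores]


results = [("doc1", 3.2), ("doc2", 0.5), ("doc3", -1.0)]
print(minmax_sketch(results))
print(percentile_sketch(results))
print(atan_sketch([("a", 0.0), ("b", 1.0)], metric="l2"))
```

Note the direction of the two atan forms: the L2 variant treats a distance of 0 as the best possible match (1.0) and decreases as the distance grows, whereas the IP variant increases with the score and maps 0 to 0.5.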