zvec_db.embedders.sparse.bm25plus

BM25+ sparse embedding with smoothing to prevent zero scores.

This module implements BM25+, an extension of BM25 that adds a smoothing parameter (delta) to prevent documents with zero term frequency from having a zero score. This is particularly useful for corpora with many rare terms or when combining scores from multiple sources.

Classes

BM25PlusTransformer

Scikit-learn transformer implementing BM25+ scoring.

BM25PlusEmbedder

High-level embedder wrapping BM25PlusTransformer with zvec-db compatibility.

Example Usage

from zvec_db.embedders import BM25PlusEmbedder

embedder = BM25PlusEmbedder(
    k1=1.2,
    b=0.75,
    delta=0.5,
    max_features=4096
)
embedder.fit(documents)
vector = embedder.embed("search query")

Classes

BM25PlusEmbedder([tokenizer, ...])

Sparse embedder implementing the BM25+ scoring formula.

BM25PlusTransformer([k1, b, delta])

Transformer implementing the BM25+ scoring formula.

class zvec_db.embedders.sparse.bm25plus.BM25PlusTransformer(k1=1.2, b=0.75, delta=0.5)[source]

Transformer implementing the BM25+ scoring formula.

BM25+ is an extension of BM25 that adds a smoothing parameter (delta) to prevent documents with zero term frequency from having a zero score. This is particularly useful for corpora with many rare terms or when combining scores from multiple sources.

The BM25+ score for a term \(t\) in document \(d\) is computed as:

\[\text{BM25+}(t, d) = \text{IDF}(t) \times \left(\delta + \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)}\right)\]
where:
  • \(f(t, d)\) is the term frequency of \(t\) in document \(d\)

  • \(|d|\) is the document length

  • \(\text{avgdl}\) is the average document length in the corpus

  • \(\text{IDF}(t)\) is the inverse document frequency of term \(t\)

  • \(\delta\) is the smoothing parameter (default: 0.5)

Key difference from BM25:

BM25:  IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d|/avgdl))
BM25+: IDF × (δ + (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d|/avgdl)))

The delta parameter ensures that even terms with TF=0 contribute a small score, which can improve retrieval performance in certain scenarios.
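The effect of delta can be made concrete with a small, self-contained sketch of the per-term weight (illustrative only; the actual transformer operates on sparse count matrices rather than scalars):

```python
import math

def bm25plus_weight(tf, idf, doc_len, avgdl, k1=1.2, b=0.75, delta=0.5):
    """BM25+ weight for a single term in a single document."""
    norm = 1 - b + b * (doc_len / avgdl)                # length normalization
    saturation = (tf * (k1 + 1)) / (tf + k1 * norm)     # classic BM25 TF component
    return idf * (delta + saturation)                   # delta lifts the floor above zero

# With tf=0, plain BM25 would score 0; BM25+ keeps a floor of idf * delta.
idf = math.log(10)
print(bm25plus_weight(0, idf, doc_len=100, avgdl=120))  # idf * 0.5 ≈ 1.151
```

Setting delta=0 recovers ordinary BM25 behavior, which is a quick way to sanity-check an implementation.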

Parameters:
  • k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.

  • b (float) – Length normalization parameter. Controls the influence of document length. b=1.0 means full length normalization, b=0.0 disables it. Defaults to 0.75.

  • delta (float) – Smoothing parameter. Adds a constant to prevent zero scores. Typical range: 0.4 to 1.0. Defaults to 0.5.

Example

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("bm25plus", BM25PlusTransformer(k1=1.5, b=0.8, delta=0.6))
... ])
>>> pipeline.fit(documents)
__init__(k1=1.2, b=0.75, delta=0.5)[source]

Initialize the BM25+ transformer.

Parameters:
  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

class zvec_db.embedders.sparse.bm25plus.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25+ scoring formula.

BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.

This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenization behavior is controlled by two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and skips preprocessing altogether.

  • tokenizer allows the client to supply a callable that is executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.
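A minimal sketch of how a vectorizer can be made to act as an identity transformer over token lists (this assumes the standard scikit-learn CountVectorizer callable-analyzer mechanism; the exact wiring inside the embedder may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

# When documents are already token lists, a callable analyzer that returns
# its input unchanged makes CountVectorizer skip preprocessing and tokenization.
identity_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)

docs = [["sparse", "retrieval"], ["sparse", "vectors", "retrieval"]]
counts = identity_vectorizer.fit_transform(docs)
print(sorted(identity_vectorizer.vocabulary_))  # ['retrieval', 'sparse', 'vectors']
```

The same trick serves both modes: pre-tokenized input and custom-tokenizer input both reach the vectorizer as token lists.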

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the vocabulary. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • delta (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25+ pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.

The corpus is preprocessed according to the embedder's configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.
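The weighting stage of such a pipeline can be approximated with a minimal transformer sketch (assumptions: counts arrive as a scipy CSR matrix from CountVectorizer, and IDF uses the smoothed form log((1 + n) / (1 + df)) + 1; the shipped BM25PlusTransformer may differ in these details):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin

class MiniBM25Plus(BaseEstimator, TransformerMixin):
    """Toy BM25+ weighting over a term-count matrix (illustration only)."""

    def __init__(self, k1=1.2, b=0.75, delta=0.5):
        self.k1, self.b, self.delta = k1, b, delta

    def fit(self, X, y=None):
        X = sp.csr_matrix(X)
        n_docs = X.shape[0]
        df = np.bincount(X.indices, minlength=X.shape[1])  # document frequency
        self.idf_ = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed IDF
        self.avgdl_ = X.sum(axis=1).mean()                 # average document length
        return self

    def transform(self, X):
        # Only stored (nonzero) entries are reweighted, keeping the matrix sparse.
        X = sp.csr_matrix(X, copy=True)
        doc_len = np.asarray(X.sum(axis=1)).ravel()
        rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
        norm = 1 - self.b + self.b * doc_len[rows] / self.avgdl_
        tf = X.data
        X.data = self.idf_[X.indices] * (self.delta + tf * (self.k1 + 1) / (tf + self.k1 * norm))
        return X

# Two documents over a three-term vocabulary.
X = sp.csr_matrix([[2, 0, 1], [0, 1, 1]])
weighted = MiniBM25Plus().fit(X).transform(X)
```

Note that because the output stays sparse, entries absent from a document remain zero; delta raises the floor only for terms actually stored in the matrix.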

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object