zvec_db.embedders.sparse.bm25plus

BM25+ sparse embedding with smoothing to prevent zero scores.

This module implements BM25+, an extension of BM25 that adds a smoothing parameter (delta) to prevent documents with zero term frequency from having a zero score. This is particularly useful for corpora with many rare terms or when combining scores from multiple sources.

Classes

BM25PlusTransformer

Scikit-learn transformer implementing BM25+ scoring.

BM25PlusEmbedder

High-level embedder wrapping BM25PlusTransformer with zvec-db compatibility.

Example Usage

from zvec_db.embedders import BM25PlusEmbedder

embedder = BM25PlusEmbedder(
    k1=1.2,
    b=0.75,
    delta=0.5,
    max_features=4096
)
embedder.fit(documents)
vector = embedder.embed("search query")

Classes

BM25PlusEmbedder([tokenizer, ...])

Sparse embedder implementing the BM25+ scoring formula.

BM25PlusTransformer([k1, b, delta])

Transformer implementing the BM25+ scoring formula.

class zvec_db.embedders.sparse.bm25plus.BM25PlusTransformer(k1=1.2, b=0.75, delta=0.5)[source]

Transformer implementing the BM25+ scoring formula.

BM25+ is an extension of BM25 that adds a smoothing parameter (delta) to prevent documents with zero term frequency from having a zero score. This is particularly useful for corpora with many rare terms or when combining scores from multiple sources.

The BM25+ score for a term \(t\) in document \(d\) is computed as:

\[\text{BM25+}(t, d) = \text{IDF}(t) \times \left(\delta + \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)}\right)\]
where:
  • \(f(t, d)\) is the term frequency of \(t\) in document \(d\)

  • \(|d|\) is the document length

  • \(\text{avgdl}\) is the average document length in the corpus

  • \(\text{IDF}(t)\) is the inverse document frequency of term \(t\)

  • \(\delta\) is the smoothing parameter (default: 0.5)

Key difference from BM25:

BM25:  IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d|/avgdl))
BM25+: IDF × (δ + (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d|/avgdl)))

The delta parameter ensures that even terms with TF=0 contribute a small score, which can improve retrieval performance in certain scenarios.
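The effect of delta can be made concrete with a small, self-contained sketch of the per-term weight (illustrative only; the actual transformer operates on sparse count matrices rather than scalars):

```python
import math

def bm25plus_weight(tf, idf, doc_len, avgdl, k1=1.2, b=0.75, delta=0.5):
    """BM25+ weight for a single term in a single document."""
    norm = 1 - b + b * (doc_len / avgdl)                # length normalization
    saturation = (tf * (k1 + 1)) / (tf + k1 * norm)     # classic BM25 TF component
    return idf * (delta + saturation)                   # delta lifts the floor above zero

# With tf=0, plain BM25 would score 0; BM25+ keeps a floor of idf * delta.
idf = math.log(10)
print(bm25plus_weight(0, idf, doc_len=100, avgdl=120))  # idf * 0.5 ≈ 1.151
```

Setting delta=0 recovers ordinary BM25 behavior, which is a quick way to sanity-check an implementation.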

Parameters:
  • k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.

  • b (float) – Length normalization parameter. Controls the influence of document length. b=1.0 means full length normalization, b=0.0 disables it. Defaults to 0.75.

  • delta (float) – Smoothing parameter. Adds a constant to prevent zero scores. Typical range: 0.4 to 1.0. Defaults to 0.5.

Example

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("bm25plus", BM25PlusTransformer(k1=1.5, b=0.8, delta=0.6))
... ])
>>> pipeline.fit(documents)
__init__(k1=1.2, b=0.75, delta=0.5)[source]

Initialize the BM25+ transformer.

Parameters:
  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

class zvec_db.embedders.sparse.bm25plus.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25+ scoring formula.

BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.

This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenization behavior is controlled by two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and skips preprocessing altogether.

  • tokenizer allows the client to supply a callable that is executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.
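A minimal sketch of how a vectorizer can be made to act as an identity transformer over token lists (this assumes the standard scikit-learn CountVectorizer callable-analyzer mechanism; the exact wiring inside the embedder may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

# When documents are already token lists, a callable analyzer that returns
# its input unchanged makes CountVectorizer skip preprocessing and tokenization.
identity_vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)

docs = [["sparse", "retrieval"], ["sparse", "vectors", "retrieval"]]
counts = identity_vectorizer.fit_transform(docs)
print(sorted(identity_vectorizer.vocabulary_))  # ['retrieval', 'sparse', 'vectors']
```

The same trick serves both modes: pre-tokenized input and custom-tokenizer input both reach the vectorizer as token lists.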

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the vocabulary. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • delta (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25+ pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.

The corpus is preprocessed according to the embedder's configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.
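The weighting stage of such a pipeline can be approximated with a minimal transformer sketch (assumptions: counts arrive as a scipy CSR matrix from CountVectorizer, and IDF uses the smoothed form log((1 + n) / (1 + df)) + 1; the shipped BM25PlusTransformer may differ in these details):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin

class MiniBM25Plus(BaseEstimator, TransformerMixin):
    """Toy BM25+ weighting over a term-count matrix (illustration only)."""

    def __init__(self, k1=1.2, b=0.75, delta=0.5):
        self.k1, self.b, self.delta = k1, b, delta

    def fit(self, X, y=None):
        X = sp.csr_matrix(X)
        n_docs = X.shape[0]
        df = np.bincount(X.indices, minlength=X.shape[1])  # document frequency
        self.idf_ = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed IDF
        self.avgdl_ = X.sum(axis=1).mean()                 # average document length
        return self

    def transform(self, X):
        # Only stored (nonzero) entries are reweighted, keeping the matrix sparse.
        X = sp.csr_matrix(X, copy=True)
        doc_len = np.asarray(X.sum(axis=1)).ravel()
        rows = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
        norm = 1 - self.b + self.b * doc_len[rows] / self.avgdl_
        tf = X.data
        X.data = self.idf_[X.indices] * (self.delta + tf * (self.k1 + 1) / (tf + self.k1 * norm))
        return X

# Two documents over a three-term vocabulary.
X = sp.csr_matrix([[2, 0, 1], [0, 1, 1]])
weighted = MiniBM25Plus().fit(X).transform(X)
```

Note that because the output stays sparse, entries absent from a document remain zero; delta raises the floor only for terms actually stored in the matrix.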

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object