zvec_db.embedders.sparse.bm25

BM25 sparse embedding using scikit-learn pipelines.

This module implements the BM25 (Best Matching 25) scoring formula, a probabilistic ranking function widely used in information retrieval. BM25 improves upon simple term frequency by accounting for document length normalization and term saturation.

Classes

BM25Transformer

Scikit-learn transformer implementing BM25 scoring.

BM25Embedder

High-level embedder wrapping BM25Transformer with zvec-db compatibility.

Example Usage

from zvec_db.embedders import BM25Embedder

embedder = BM25Embedder(
    k1=1.2,
    b=0.75,
    max_features=4096
)
embedder.fit(documents)
vector = embedder.embed("search query")


class zvec_db.embedders.sparse.bm25.BM25Transformer(k1=1.2, b=0.75)[source]

Transformer implementing the BM25 scoring formula.

BM25 (Best Matching 25) is a probabilistic ranking function widely used in information retrieval. It improves upon simple term frequency by accounting for document length normalization and term saturation.

The BM25 score for a term \(t\) in document \(d\) is computed as:

\[\text{BM25}(t, d) = \text{IDF}(t) \times \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]
where:
  • \(f(t, d)\) is the term frequency of \(t\) in document \(d\)

  • \(|d|\) is the document length

  • \(\text{avgdl}\) is the average document length in the corpus

  • \(\text{IDF}(t)\) is the inverse document frequency of term \(t\)
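
Plugging representative numbers into the formula above makes the pieces concrete (all values here are invented for illustration):

```python
# Worked example of the BM25 formula with illustrative numbers:
# f(t, d) = 3, |d| = 100, avgdl = 120, IDF(t) = 2.0, k1 = 1.2, b = 0.75.
f, dl, avgdl, idf = 3, 100, 120, 2.0
k1, b = 1.2, 0.75

numerator = f * (k1 + 1)                          # 3 * 2.2 = 6.6
denominator = f + k1 * (1 - b + b * dl / avgdl)   # 3 + 1.2 * 0.875 = 4.05
score = idf * numerator / denominator
print(round(score, 4))  # → 3.2593
```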

Parameters:
  • k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.

  • b (float) – Length normalization parameter. Controls the influence of document length. b=1.0 means full length normalization, b=0.0 disables it. Defaults to 0.75.
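
To see why k1 is called a saturation parameter, consider the per-term weight in isolation (a small sketch with b=0 so length normalization drops out; the idf value is arbitrary):

```python
def bm25_weight(f, k1=1.2, b=0.0, dl=100, avgdl=100, idf=1.0):
    """Per-term BM25 weight; with b=0 length normalization is disabled."""
    return idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))

# The weight grows with term frequency but is capped at idf * (k1 + 1) = 2.2:
weights = [round(bm25_weight(f), 3) for f in (1, 5, 50, 500)]
print(weights)  # → [1.0, 1.774, 2.148, 2.195]
```

Raising k1 lifts that ceiling and slows the approach to it, which is the "slower saturation" described above.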

Example

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from zvec_db.embedders.sparse.bm25 import BM25Transformer
>>> documents = ["the quick brown fox", "the lazy dog sleeps"]
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("bm25", BM25Transformer(k1=1.5, b=0.8))
... ])
>>> pipeline.fit(documents)
__init__(k1=1.2, b=0.75)[source]

Initialize the BM25 transformer.

Parameters:
  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.
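
For intuition about what the transformer computes, here is a minimal pure-Python sketch of BM25 scoring over pre-tokenized documents. The class name and the smoothed-IDF variant are this sketch's own choices, not necessarily what BM25Transformer does internally:

```python
import math

class SimpleBM25:
    """Minimal BM25 scorer over pre-tokenized documents (illustration only)."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b

    def fit(self, docs):
        n = len(docs)
        self.avgdl = sum(len(d) for d in docs) / n
        df = {}
        for d in docs:
            for t in set(d):
                df[t] = df.get(t, 0) + 1
        # Smoothed IDF (one of several common variants).
        self.idf = {t: math.log((n + 1) / (c + 1)) + 1 for t, c in df.items()}
        return self

    def score(self, query_tokens, doc):
        dl = len(doc)
        tf = {}
        for t in doc:
            tf[t] = tf.get(t, 0) + 1
        total = 0.0
        for t in query_tokens:
            f = tf.get(t, 0)
            if f == 0:
                continue
            denom = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf.get(t, 0.0) * f * (self.k1 + 1) / denom
        return total

docs = [["a", "b", "a"], ["b", "c"]]
model = SimpleBM25().fit(docs)
print(round(model.score(["a"], docs[0]), 4))  # → 1.8296
```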

class zvec_db.embedders.sparse.bm25.BM25Embedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25 scoring formula.

This class wires together a CountVectorizer with a lightweight BM25Transformer. Tokenization behavior is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and skips preprocessing altogether.

  • tokenizer lets the caller supply a callable that is executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline also consists of token lists; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and are validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function.

  • is_pretokenized (bool) – If True, input documents must be lists of tokens.

  • max_features (Optional[int]) – Maximum number of features to retain.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2.

  • b (float) – Length normalization parameter. Defaults to 0.75.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional parameters for CountVectorizer.

__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25 pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer – tokenizes documents and builds term counts.

  2. BM25Transformer – applies BM25 weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object