zvec_db.embedders.sparse.bm25l

BM25L sparse embedding with linear length normalization.

This module implements BM25L, a variant of BM25 that uses linear length normalization instead of the standard BM25 convex combination. BM25L is particularly suitable for corpora with highly variable document lengths.

Classes

BM25LTransformer

Scikit-learn transformer implementing BM25L scoring.

BM25LEmbedder

High-level embedder wrapping BM25LTransformer with zvec-db compatibility.

Example Usage

from zvec_db.embedders import BM25LEmbedder

embedder = BM25LEmbedder(
    k1=1.2,
    max_features=4096
)
embedder.fit(documents)
vector = embedder.embed("search query")

Classes

BM25LEmbedder([tokenizer, is_pretokenized, ...])

Sparse embedder implementing the BM25L scoring formula.

BM25LTransformer([k1])

Transformer implementing the BM25L scoring formula.

class zvec_db.embedders.sparse.bm25l.BM25LTransformer(k1=1.2)[source]

Transformer implementing the BM25L scoring formula.

BM25L is a variant of BM25 that uses linear length normalization instead of the standard BM25 convex combination. This makes it more suitable for corpora with highly variable document lengths.

The BM25L score for a term \(t\) in document \(d\) is computed as:

\[\text{BM25L}(t, d) = \text{IDF}(t) \times \frac{f(t, d) \times (k_1 + 1)} {f(t, d) + k_1 \times \frac{|d|}{\text{avgdl}}}\]
where:
  • \(f(t, d)\) is the term frequency of \(t\) in document \(d\)

  • \(|d|\) is the document length

  • \(\text{avgdl}\) is the average document length in the corpus

  • \(\text{IDF}(t)\) is the inverse document frequency of term \(t\)
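As a quick sanity check, the formula above can be evaluated directly. The helper below and its input numbers are illustrative only, not part of the library:

```python
def bm25l_score(tf, doc_len, avgdl, idf, k1=1.2):
    """Evaluate the BM25L formula above: IDF times saturated TF
    with pure linear length normalization (|d| / avgdl)."""
    return idf * (tf * (k1 + 1)) / (tf + k1 * (doc_len / avgdl))

# Hypothetical term: frequency 3 in a 120-token document,
# corpus average length 100 tokens, IDF 2.0, default k1 = 1.2.
score = bm25l_score(tf=3, doc_len=120, avgdl=100, idf=2.0)
print(round(score, 3))  # → 2.973
```

Note that a document longer than average (120 > 100) inflates the denominator, pulling the score down relative to an average-length document with the same term frequency.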

Key difference from BM25:

BM25 uses: 1 - b + b × (|d| / avgdl), with b ∈ [0, 1].
BM25L uses: |d| / avgdl directly (pure linear normalization).

This makes BM25L more aggressive in penalizing long documents, which can be beneficial when document lengths vary significantly in the corpus.
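To see how much more aggressive the linear normalization is, compare the factor that multiplies k1 in each scheme's denominator. This is a sketch assuming BM25's common default b = 0.75; the function names are illustrative:

```python
def bm25_length_factor(doc_len, avgdl, b=0.75):
    # Standard BM25 convex combination: 1 - b + b * (|d| / avgdl)
    return 1 - b + b * (doc_len / avgdl)

def bm25l_length_factor(doc_len, avgdl):
    # BM25L: pure linear normalization, |d| / avgdl
    return doc_len / avgdl

avgdl = 100
for doc_len in (50, 100, 400):
    print(doc_len, bm25_length_factor(doc_len, avgdl),
          bm25l_length_factor(doc_len, avgdl))
# A 400-token document is penalized by a factor of 4.0 under BM25L,
# but only 3.25 under BM25 with b = 0.75.
```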

Parameters:

k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.

Example

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("bm25l", BM25LTransformer(k1=1.5))
... ])
>>> pipeline.fit(documents)
__init__(k1=1.2)[source]

Initialize the BM25L transformer.

Parameters:

k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

class zvec_db.embedders.sparse.bm25l.BM25LEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25L scoring formula.

BM25L is a variant of BM25 that uses linear length normalization, making it more suitable for corpora with highly variable document lengths.

This class wires together a CountVectorizer with a BM25LTransformer. Tokenization behavior is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline also consists of token lists; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25LEmbedder(k1=1.5, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25L pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25LTransformer: Applies BM25L weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object