zvec_db.embedders.sparse.bm25plus
BM25+ sparse embedding with smoothing to prevent zero scores.
This module implements BM25+, an extension of BM25 that adds a smoothing parameter (delta) to prevent documents with zero term frequency from having a zero score. This is particularly useful for corpora with many rare terms or when combining scores from multiple sources.
Classes
- BM25PlusTransformer
Scikit-learn transformer implementing BM25+ scoring.
- BM25PlusEmbedder
High-level embedder wrapping BM25PlusTransformer with zvec-db compatibility.
Example Usage
from zvec_db.embedders import BM25PlusEmbedder

embedder = BM25PlusEmbedder(
    k1=1.2,
    b=0.75,
    delta=0.5,
    max_features=4096,
)
embedder.fit(documents)
vector = embedder.embed("search query")
Classes

BM25PlusEmbedder
    Sparse embedder implementing the BM25+ scoring formula.
BM25PlusTransformer
    Transformer implementing the BM25+ scoring formula.
- class zvec_db.embedders.sparse.bm25plus.BM25PlusTransformer(k1=1.2, b=0.75, delta=0.5)[source]
Transformer implementing the BM25+ scoring formula.
BM25+ is an extension of BM25 that adds a smoothing parameter (delta) to prevent documents with zero term frequency from having a zero score. This is particularly useful for corpora with many rare terms or when combining scores from multiple sources.
The BM25+ score for a term \(t\) in document \(d\) is computed as:

\[\text{BM25+}(t, d) = \text{IDF}(t) \times \left(\delta + \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times \left(1 - b + b \times \frac{|d|}{\text{avgdl}}\right)}\right)\]

where:
\(f(t, d)\) is the term frequency of \(t\) in document \(d\)
\(|d|\) is the document length
\(\text{avgdl}\) is the average document length in the corpus
\(\text{IDF}(t)\) is the inverse document frequency of term \(t\)
\(\delta\) is the smoothing parameter (default: 0.5)
- Key difference from BM25:

BM25:   IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d|/avgdl))
BM25+:  IDF × (δ + (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d|/avgdl)))
The delta parameter ensures that even terms with TF=0 contribute a small score, which can improve retrieval performance in certain scenarios.
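The effect of the delta floor can be seen by evaluating both formulas directly. The following stdlib-only sketch transcribes the two scoring expressions above (it is illustrative, not the transformer's actual implementation; the IDF value of 1.0 is a placeholder):

```python
def bm25_term_score(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    # Classic BM25: a term with zero frequency contributes a zero score.
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

def bm25plus_term_score(tf, doc_len, avgdl, idf, k1=1.2, b=0.75, delta=0.5):
    # BM25+: delta adds a constant floor, so the score never drops to zero.
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (delta + (tf * (k1 + 1)) / (tf + norm))

# A term absent from the document (TF = 0), with a placeholder IDF of 1.0:
print(bm25_term_score(0, doc_len=100, avgdl=120, idf=1.0))      # → 0.0
print(bm25plus_term_score(0, doc_len=100, avgdl=120, idf=1.0))  # → 0.5 (= delta × IDF)
```

Note that for any term frequency, the BM25+ score exceeds the BM25 score by exactly δ × IDF, so rankings within a fixed query shift only when scores are combined with other sources.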
- Parameters:
k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.
b (float) – Length normalization parameter. Controls the influence of document length. b=1.0 means full length normalization, b=0.0 disables it. Defaults to 0.75.
delta (float) – Smoothing parameter. Adds a constant to prevent zero scores. Typical range: 0.4 to 1.0. Defaults to 0.5.
Example
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("bm25plus", BM25PlusTransformer(k1=1.5, b=0.8, delta=0.6))
... ])
>>> pipeline.fit(documents)
- __init__(k1=1.2, b=0.75, delta=0.5)[source]
Initialize the BM25+ transformer.
- Parameters:
k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.
b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.
delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.
- class zvec_db.embedders.sparse.bm25plus.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Sparse embedder implementing the BM25+ scoring formula.
BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.
This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

- is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.
- tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
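The mutual-exclusion check described above presumably amounts to something like the following sketch (hypothetical, stdlib-only; the real validation lives in BaseSparseEmbedder and is not shown in this page):

```python
def validate_tokenization_options(tokenizer, is_pretokenized):
    # Hypothetical re-creation of the base-class check: a custom tokenizer
    # and pre-tokenized input cannot both be requested, because pre-tokenized
    # documents are never passed through a tokenizer.
    if tokenizer is not None and is_pretokenized:
        raise ValueError(
            "tokenizer and is_pretokenized are mutually exclusive: "
            "pre-tokenized input bypasses tokenization entirely."
        )

validate_tokenization_options(tokenizer=None, is_pretokenized=True)   # ok
validate_tokenization_options(tokenizer=str.split, is_pretokenized=False)  # ok
```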
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.
b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.
delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).
Example
>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the BM25+ pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:

1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
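Conceptually, fitting boils down to collecting the corpus statistics the BM25+ formula needs: the average document length and per-term inverse document frequencies. The stdlib-only sketch below is illustrative, not the actual pipeline; in particular, the exact IDF variant used by BM25PlusTransformer is not documented here, so a common smoothed form is assumed:

```python
import math
from collections import Counter

def fit_bm25_stats(tokenized_corpus):
    """Collect avgdl and per-term IDF from a pre-tokenized corpus.

    Illustrative only: the real embedder delegates counting to
    CountVectorizer and weighting to BM25PlusTransformer.
    """
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
    df = Counter()
    for doc in tokenized_corpus:
        df.update(set(doc))  # document frequency: count each term once per doc
    # Assumed smoothed IDF variant; rarer terms get larger weights.
    idf = {t: math.log((n_docs + 1) / (freq + 0.5)) for t, freq in df.items()}
    return avgdl, idf

avgdl, idf = fit_bm25_stats([["a", "b"], ["b", "c", "d"]])
# avgdl is 2.5; "a" (in one doc) gets a larger IDF than "b" (in both docs)
```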
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
BM25PlusEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
BM25PlusEmbedder