zvec_db.embedders.sparse.bm25
BM25 sparse embedding using scikit-learn pipelines.
This module implements the BM25 (Best Matching 25) scoring formula, a probabilistic ranking function widely used in information retrieval. BM25 improves upon simple term frequency by accounting for document length normalization and term saturation.
Classes
- BM25Transformer
Scikit-learn transformer implementing BM25 scoring.
- BM25Embedder
High-level embedder wrapping BM25Transformer with zvec-db compatibility.
Example Usage
from zvec_db.embedders import BM25Embedder

embedder = BM25Embedder(
    k1=1.2,
    b=0.75,
    max_features=4096,
)
embedder.fit(documents)
vector = embedder.embed("search query")
Classes
- class zvec_db.embedders.sparse.bm25.BM25Transformer(k1=1.2, b=0.75)[source]
Transformer implementing the BM25 scoring formula.
BM25 (Best Matching 25) is a probabilistic ranking function widely used in information retrieval. It improves upon simple term frequency by accounting for document length normalization and term saturation.
The BM25 score for a term \(t\) in document \(d\) is computed as:
\[\text{BM25}(t, d) = \text{IDF}(t) \times \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{\text{avgdl}})}\]
where:
\(f(t, d)\) is the term frequency of \(t\) in document \(d\)
\(|d|\) is the document length
\(\text{avgdl}\) is the average document length in the corpus
\(\text{IDF}(t)\) is the inverse document frequency of term \(t\)
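Plugging concrete numbers into the formula makes the saturation behaviour visible. Below is a minimal pure-Python sketch of the scoring equation above; `bm25_term_score` is an illustrative helper, not part of the library, and the IDF value is taken as a given input since the library's IDF variant is not shown here.

```python
def bm25_term_score(f, dl, avgdl, idf, k1=1.2, b=0.75):
    """Score a single term via the BM25 formula above.

    f: term frequency of the term in the document
    dl: document length, avgdl: average document length
    idf: precomputed inverse document frequency of the term
    """
    norm = k1 * (1 - b + b * dl / avgdl)
    return idf * (f * (k1 + 1)) / (f + norm)

# With dl == avgdl the length penalty vanishes: for f=2, idf=1.0,
# the score is (2 * 2.2) / (2 + 1.2) = 4.4 / 3.2 = 1.375
score = bm25_term_score(f=2, dl=100, avgdl=100, idf=1.0)

# Saturation: the score is bounded by idf * (k1 + 1) = 2.2,
# no matter how often the term repeats in the document.
capped = bm25_term_score(f=10_000, dl=100, avgdl=100, idf=1.0)
```

The bound `idf * (k1 + 1)` is what distinguishes BM25 from raw term frequency: repeating a term eventually stops increasing its contribution.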
- Parameters:
k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.
b (float) – Length normalization parameter. Controls the influence of document length. b=1.0 means full length normalization, b=0.0 disables it. Defaults to 0.75.
Example
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("bm25", BM25Transformer(k1=1.5, b=0.8))
... ])
>>> pipeline.fit(documents)
- class zvec_db.embedders.sparse.bm25.BM25Embedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
Sparse embedder implementing the BM25 scoring formula.
This class wires together a CountVectorizer with a lightweight BM25Transformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder: is_pretokenized tells the embedder to expect lists of tokens as input and skips preprocessing altogether, while tokenizer lets the client supply a callable that is executed on every raw text document before vectorisation. When a tokenizer is used, the data passed to the scikit-learn pipeline likewise consists of token lists; the vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
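The dispatch described above can be sketched in a few lines of plain Python. The function name and structure here are illustrative only, not the library's actual internals:

```python
def resolve_documents(docs, tokenizer=None, is_pretokenized=False):
    # Mirrors the mutual-exclusion check performed by the base class
    if tokenizer is not None and is_pretokenized:
        raise ValueError("tokenizer and is_pretokenized are mutually exclusive")
    if is_pretokenized:
        return docs  # already lists of tokens; no preprocessing at all
    if tokenizer is not None:
        # token lists: the downstream vectorizer acts as an identity transformer
        return [tokenizer(d) for d in docs]
    return docs  # raw strings: CountVectorizer tokenizes as usual

tokens = resolve_documents(["a b", "c d"], tokenizer=str.split)
```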
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function.
is_pretokenized (bool) – If True, input documents must be lists of tokens.
max_features (Optional[int]) – Maximum number of features to retain.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2.
b (float) – Length normalization parameter. Defaults to 0.75.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional parameters for CountVectorizer.
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the BM25 pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: tokenizes documents and builds term counts.
2. BM25Transformer: applies BM25 weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
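Under the hood, fitting mainly amounts to collecting the corpus statistics the BM25 formula needs: the average document length and per-term document frequencies. A pure-Python sketch follows; `fit_bm25_stats` is a hypothetical helper, and the Okapi-style smoothed IDF shown is an assumption that may differ from the library's actual variant.

```python
import math

def fit_bm25_stats(corpus_tokens):
    """Collect avgdl and per-term IDF from a tokenized corpus."""
    n = len(corpus_tokens)
    # avgdl: average document length over the training corpus
    avgdl = sum(len(doc) for doc in corpus_tokens) / n
    # df: number of documents containing each term
    df = {}
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Okapi-style smoothed IDF (an assumed variant)
    idf = {t: math.log((n - d + 0.5) / (d + 0.5) + 1.0) for t, d in df.items()}
    return avgdl, idf

avgdl, idf = fit_bm25_stats([["a", "b"], ["a", "c", "c"]])
```

Note that a term appearing in every document ("a" above) ends up with a lower IDF than a rarer one ("b"), which is exactly the weighting behaviour IDF is meant to provide.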
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the corpus parameter in fit.
self (BM25Embedder)
- Returns:
self – The updated object.
- Return type:
BM25Embedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the input_text parameter in transform.
self (BM25Embedder)
- Returns:
self – The updated object.
- Return type:
BM25Embedder