zvec_db.embedders.sparse
Sparse embedders for lexical search.
- class zvec_db.embedders.sparse.BM25Embedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
Sparse embedder implementing the BM25 scoring formula.
This class wires together a CountVectorizer with a lightweight BM25Transformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder: is_pretokenized tells the embedder to expect lists of tokens as input and skips any preprocessing altogether, while tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
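For reference, the weight assigned to term \(t\) in document \(d\) follows the standard BM25 formulation (the transformer's exact IDF smoothing may differ):
\[\text{BM25}(t, d) = \text{IDF}(t) \cdot \frac{tf_{td} \cdot (k_1 + 1)}{tf_{td} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]
where \(tf_{td}\) is the term frequency of \(t\) in \(d\), \(|d|\) is the document length, and \(\text{avgdl}\) is the average document length in the corpus.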
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function.
is_pretokenized (bool) – If True, input documents must be lists of tokens.
max_features (Optional[int]) – Maximum number of features (vocabulary size) to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2.
b (float) – Length normalization parameter. Defaults to 0.75.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional parameters for CountVectorizer.
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the BM25 pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25Transformer: Applies BM25 weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
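As an illustration of what the fitted pipeline computes, the following standalone sketch reproduces BM25 weighting from raw term counts in plain Python; it is a simplified model, not the library's implementation (the actual BM25Transformer may use a different IDF smoothing).

```python
import math

def bm25_weights(corpus_tokens, k1=1.2, b=0.75):
    """Compute BM25 weights for each (document, term) from tokenized docs."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency of each term.
    df = {}
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus_tokens:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # Length-normalised saturation denominator.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        weights.append({
            term: (math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
                   * count * (k1 + 1) / (count + norm))
            for term, count in tf.items()
        })
    return weights

docs = [["sparse", "search", "index"], ["dense", "search"], ["sparse", "vector"]]
w = bm25_weights(docs)
```

Note how "index" (appearing in one document) outweighs "search" (appearing in two) within the first document, reflecting the IDF component.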
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
BM25Embedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
BM25Embedder
- class zvec_db.embedders.sparse.BM25LEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]
Sparse embedder implementing the BM25L scoring formula.
BM25L is a variant of BM25 that uses linear length normalization, making it more suitable for corpora with highly variable document lengths.
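In the standard BM25L formulation (Lv and Zhai), the raw term frequency is first normalised linearly by document length, then shifted by a constant \(\delta\) before saturation (this embedder exposes only k1, so the normalisation and shift parameters may be fixed internally):
\[c'_{td} = \frac{tf_{td}}{1 - b + b \cdot \frac{|d|}{\text{avgdl}}}, \qquad \text{BM25L}(t, d) = \text{IDF}(t) \cdot \frac{(k_1 + 1)(c'_{td} + \delta)}{k_1 + c'_{td} + \delta}\]
The shift \(\delta\) (typically 0.5) prevents very long documents from being over-penalised.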
This class wires together a CountVectorizer with a BM25LTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder: is_pretokenized tells the embedder to expect lists of tokens as input and skips any preprocessing altogether, while tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features (vocabulary size) to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).
Example
>>> embedder = BM25LEmbedder(k1=1.5, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the BM25L pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25LTransformer: Applies BM25L weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
BM25LEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
BM25LEmbedder
- class zvec_db.embedders.sparse.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Sparse embedder implementing the BM25+ scoring formula.
BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.
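In the standard BM25+ formulation (Lv and Zhai), delta is added as a constant floor to the saturated term-frequency component:
\[\text{BM25+}(t, d) = \text{IDF}(t) \cdot \left( \frac{tf_{td} \cdot (k_1 + 1)}{tf_{td} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} + \delta \right)\]
so any document containing the term receives at least \(\text{IDF}(t) \cdot \delta\), regardless of length.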
This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder: is_pretokenized tells the embedder to expect lists of tokens as input and skips any preprocessing altogether, while tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features (vocabulary size) to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.
b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.
delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).
Example
>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the BM25+ pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
BM25PlusEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
BM25PlusEmbedder
- class zvec_db.embedders.sparse.TfidfEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed as the product of:
Term Frequency (TF): How often a term appears in a document.
Inverse Document Frequency (IDF): A penalty factor for terms that appear in many documents.
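As a minimal, self-contained sketch of this product, the following plain-Python function uses scikit-learn's default smoothed IDF, \(\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1\), followed by L2 normalisation; the embedder's actual TfidfVectorizer settings may differ.

```python
import math

def tfidf(corpus_tokens):
    """Smoothed TF-IDF, sklearn-style: idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(corpus_tokens)
    # Document frequency of each term.
    df = {}
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in corpus_tokens:
        tf = {t: doc.count(t) for t in set(doc)}
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        # L2-normalise each document vector, as TfidfVectorizer does by default.
        norm = math.sqrt(sum(v * v for v in vec.values()))
        out.append({t: v / norm for t, v in vec.items()})
    return out

vecs = tfidf([["a", "a", "b"], ["a", "c"]])
```

In the second document, "c" (appearing in one document) outweighs "a" (appearing in both), illustrating the IDF penalty.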
This embedder supports custom tokenization and pre-tokenized inputs. All additional keyword arguments are passed through to the underlying TfidfVectorizer (e.g., min_df, max_df, ngram_range, sublinear_tf).
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features (vocabulary size) to retain. Defaults to 8192.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**tfidf_params – Additional keyword arguments passed to TfidfVectorizer.
Example
>>> embedder = TfidfEmbedder(min_df=2, sublinear_tf=True)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
- fit(corpus, y=None)[source]
Fit the TF-IDF vectorizer on a corpus of documents.
The corpus is pre-processed according to the embedder’s configuration:
Custom tokenizer: Each document is tokenized before vectorization.
Pre-tokenized mode: Documents are expected to be lists of tokens.
Default: Raw strings are passed directly to TfidfVectorizer.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
TfidfEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
TfidfEmbedder
- class zvec_db.embedders.sparse.CountEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
Count-based sparse embedder wrapping scikit-learn’s CountVectorizer.
This embedder converts text documents into sparse vectors based on term frequencies (raw counts of each token). It is the simplest sparse embedding method and serves as a foundation for more advanced techniques like BM25 and TF-IDF.
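The counting step itself can be sketched in a few lines of plain Python (a simplified model: the real CountVectorizer additionally applies its token pattern and caps the vocabulary via max_features):

```python
from collections import Counter

def count_vectors(corpus_tokens, vocabulary=None):
    """Map each tokenized document to a {term_index: count} sparse vector."""
    if vocabulary is None:
        # Build a stable term -> column-index mapping from the corpus.
        terms = sorted({t for doc in corpus_tokens for t in doc})
        vocabulary = {t: i for i, t in enumerate(terms)}
    return [
        {vocabulary[t]: c for t, c in Counter(doc).items() if t in vocabulary}
        for doc in corpus_tokens
    ]

vecs = count_vectors([["sparse", "search", "sparse"], ["dense", "search"]])
```

With the sorted vocabulary {dense: 0, search: 1, sparse: 2}, the first document maps to {2: 2, 1: 1} and the second to {0: 1, 1: 1}.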
The embedder accepts raw strings or pre-tokenized input. Any keyword arguments are forwarded to the underlying CountVectorizer after being normalized by BaseSparseEmbedder._prepare_vectorizer_params().
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features (vocabulary size) to retain. Defaults to 8192.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).
Example
>>> embedder = CountEmbedder(min_df=2, ngram_range=(1, 2))
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the embedder on a corpus of documents.
The supplied corpus is normalised according to the instance configuration:
is_pretokenized=True - the caller must provide lists of tokens.
tokenizer=... - each string in the corpus will be passed through the tokenizer before vectorisation.
neither set - raw strings are passed to CountVectorizer directly.
_prepare_corpus handles the validation and transformation logic.
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
CountEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
CountEmbedder
- class zvec_db.embedders.sparse.DisMaxEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]
Sparse embedder implementing the DisMax scoring formula.
DisMax (Disjunctive Maximum) takes the maximum score across multiple terms or fields, rather than summing them. This is useful when you want documents that match at least one term well, rather than documents that match all terms moderately.
The DisMax score formula is:
\[\text{DisMax}(d) = \max_{t \in T} \text{score}_t(d) + \theta \cdot \sum_{t \in T \setminus \{t^*\}} \text{score}_t(d)\]where \(\theta\) is the tie_breaker parameter and \(t^*\) is the term attaining the maximum score.
This embedder is particularly useful for:
Multi-field search (title, content, tags) where matching any field well should rank highly.
Disjunctive queries where documents matching any query term should be retrieved.
Avoiding score inflation from documents matching many terms weakly.
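The combination step can be sketched directly from a list of per-term scores (hypothetical values here; the embedder derives the per-term scores internally):

```python
def dismax(term_scores, tie_breaker=0.0):
    """Disjunctive maximum: best term score plus tie_breaker times the rest."""
    if not term_scores:
        return 0.0
    best = max(term_scores)
    # With tie_breaker=0.0 this is a pure max; with 1.0 it degenerates to a sum.
    return best + tie_breaker * (sum(term_scores) - best)

scores = [2.0, 1.0, 0.5]
```

With tie_breaker=0.0 the result is 2.0 (the best match alone); with 0.1 the remaining matches contribute a small bonus (2.15); with 1.0 all scores are summed (3.5).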
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features (vocabulary size) to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.
b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.
tie_breaker (float) – Tie breaker parameter. Defaults to 0.0. 0.0 = pure maximum, 1.0 = sum all scores.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).
Example
>>> embedder = DisMaxEmbedder(k1=1.5, tie_breaker=0.1, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the DisMax pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. DisMaxTransformer: Applies DisMax weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
- embed(input_text)[source]
Embed text into sparse vectors with DisMax scores.
Unlike other embedders that return a vector with multiple non-zero entries, DisMaxEmbedder returns a single score per document (the maximum term score).
- Parameters:
input_text (str | List[str] | List[List[str]]) – Single document or batch of documents.
- Returns:
For each document, returns a dictionary with a single entry {0: dismax_score} representing the maximum term score.
- Return type:
Union[SparseVector, List[SparseVector]]
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
DisMaxEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
DisMaxEmbedder
Modules
Base classes for sparse embedding transformers.
BM25 sparse embedding using scikit-learn pipelines.
BM25L sparse embedding with linear length normalization.
BM25+ sparse embedding with smoothing to prevent zero scores.
Count-based sparse embedding using term frequencies.
DisMax (Disjunctive Maximum) sparse embedding for multi-field search.
TF-IDF (Term Frequency-Inverse Document Frequency) sparse embedding.