zvec_db.embedders.sparse

Sparse embedders for lexical search.

class zvec_db.embedders.sparse.BM25Embedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25 scoring formula.
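
For reference, the standard BM25 weight for a term \(t\) in a document \(d\) is shown below (the exact IDF variant used by BM25Transformer is not specified on this page, so this is the textbook form):

\[\text{BM25}(t, d) = \text{idf}(t) \cdot \frac{f_{t,d} \, (k_1 + 1)}{f_{t,d} + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]

where \(f_{t,d}\) is the raw term frequency, \(|d|\) the document length, and \(\text{avgdl}\) the average document length in the fitted corpus.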

This class wires together a CountVectorizer with a lightweight BM25Transformer. Tokenization behavior is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and skips all preprocessing.

  • tokenizer lets the client supply a callable that is executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.
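
A brief sketch of the two modes (toy corpus; per the note above, the base class rejects configurations that combine both options):

>>> # custom tokenizer: raw strings in, the callable is applied per document
>>> embedder = BM25Embedder(tokenizer=str.split)
>>> embedder.fit(["hello world", "hello there"])
>>> # pre-tokenized: the caller supplies token lists directly
>>> embedder = BM25Embedder(is_pretokenized=True)
>>> embedder.fit([["hello", "world"], ["hello", "there"]])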

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function.

  • is_pretokenized (bool) – If True, input documents must be lists of tokens.

  • max_features (Optional[int]) – Maximum number of features to retain in the fitted vocabulary. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2.

  • b (float) – Length normalization parameter. Defaults to 0.75.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional parameters for CountVectorizer.
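
Example

A minimal usage sketch, mirroring the sibling embedders (documents is assumed to be a list of raw strings):

>>> embedder = BM25Embedder(k1=1.5, b=0.8, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])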

__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25 pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25Transformer: Applies BM25 weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object
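
A minimal sketch of using this request mechanism (assumes scikit-learn >= 1.3 with metadata routing enabled):

>>> import sklearn
>>> sklearn.set_config(enable_metadata_routing=True)
>>> embedder = BM25Embedder().set_fit_request(corpus=True)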

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object

class zvec_db.embedders.sparse.BM25LEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25L scoring formula.

BM25L is a variant of BM25 that uses linear length normalization, making it more suitable for corpora with highly variable document lengths.
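
For reference, the standard Lv & Zhai BM25L form is shown below; this embedder exposes only k1, so the b and δ values used internally by BM25LTransformer are an assumption, not documented here:

\[\text{BM25L}(t, d) = \text{idf}(t) \cdot \frac{(k_1 + 1)\,(c'_{t,d} + \delta)}{k_1 + c'_{t,d} + \delta}, \qquad c'_{t,d} = \frac{f_{t,d}}{1 - b + b \cdot \frac{|d|}{\text{avgdl}}}\]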

This class wires together a CountVectorizer with a BM25LTransformer. Tokenization behavior is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and skips all preprocessing.

  • tokenizer lets the client supply a callable that is executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the fitted vocabulary. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25LEmbedder(k1=1.5, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25L pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25LTransformer: Applies BM25L weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

class zvec_db.embedders.sparse.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25+ scoring formula.

BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.
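
For reference, the standard Lv & Zhai BM25+ form (assumed here to match what BM25PlusTransformer computes):

\[\text{BM25+}(t, d) = \text{idf}(t) \cdot \left( \frac{f_{t,d} \, (k_1 + 1)}{f_{t,d} + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} + \delta \right)\]

so every matching term contributes at least \(\text{idf}(t) \cdot \delta\), regardless of document length.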

This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenization behavior is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and skips all preprocessing.

  • tokenizer lets the client supply a callable that is executed on every raw text document before vectorization. When a tokenizer is used, the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the fitted vocabulary. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • delta (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25+ pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

class zvec_db.embedders.sparse.TfidfEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]

Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed as the product of:

  • Term Frequency (TF): How often a term appears in a document.

  • Inverse Document Frequency (IDF): A penalty factor for terms that appear in many documents.
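
Concretely, with TfidfVectorizer's defaults (smooth_idf=True, followed by l2 row normalization; whether this embedder overrides those defaults is not stated here):

\[\text{tfidf}(t, d) = f_{t,d} \cdot \left( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \right)\]

where \(n\) is the number of documents and \(\text{df}(t)\) the number of documents containing \(t\).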

This embedder supports custom tokenization and pre-tokenized inputs. All additional keyword arguments are passed through to the underlying TfidfVectorizer (e.g., min_df, max_df, ngram_range, sublinear_tf).

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the fitted vocabulary. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **tfidf_params – Additional keyword arguments passed to TfidfVectorizer.

Example

>>> embedder = TfidfEmbedder(min_df=2, sublinear_tf=True)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Fit the TF-IDF vectorizer on a corpus of documents.

The corpus is pre-processed according to the embedder’s configuration:

  • Custom tokenizer: Each document is tokenized before vectorization.

  • Pre-tokenized mode: Documents are expected to be lists of tokens.

  • Default: Raw strings are passed directly to TfidfVectorizer.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

class zvec_db.embedders.sparse.CountEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]

Count-based sparse embedder wrapping scikit-learn’s CountVectorizer.

This embedder converts text documents into sparse vectors based on term frequencies (raw counts of each token). It is the simplest sparse embedding method and serves as a foundation for more advanced techniques like BM25 and TF-IDF.

The embedder accepts raw strings or pre-tokenized input. Any keyword arguments are forwarded to the underlying CountVectorizer after being normalized by BaseSparseEmbedder._prepare_vectorizer_params().
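
A sketch of the expected output shape (indices and counts are illustrative; actual indices depend on the fitted vocabulary):

>>> embedder = CountEmbedder()
>>> embedder.fit(["apple apple banana", "banana cherry"])
>>> embedder.embed("apple banana banana")  # doctest: +SKIP
{0: 1, 1: 2}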

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the fitted vocabulary. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = CountEmbedder(min_df=2, ngram_range=(1, 2))
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the embedder on a corpus of documents.

The supplied corpus is normalized according to the instance configuration:

  • is_pretokenized=True – the caller must provide lists of tokens.

  • tokenizer=... – each string in the corpus is passed through the tokenizer before vectorization.

  • neither set – raw strings are passed to CountVectorizer directly.

_prepare_corpus handles the validation and transformation logic.

Parameters:
  • corpus (list[str] | list[list[str]]) – Sequence of documents (strings or token lists, depending on configuration).

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder (self), to allow chaining.

Return type:

self

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object

class zvec_db.embedders.sparse.DisMaxEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the DisMax scoring formula.

DisMax (Disjunctive Maximum) takes the maximum score across multiple terms or fields, rather than summing them. This is useful when you want documents that match at least one term well, rather than documents that match all terms moderately.

The DisMax score formula is:

\[\text{DisMax}(d) = \max_{t \in T} \text{score}_t(d) + \lambda \sum_{t \in T \setminus \{t^{*}\}} \text{score}_t(d)\]

where \(\lambda\) is the tie_breaker parameter and \(t^{*} = \operatorname{argmax}_{t \in T} \text{score}_t(d)\) is the best-matching term.

This embedder is particularly useful for:

  • Multi-field search (title, content, tags) where matching any field well should rank highly.

  • Disjunctive queries where documents matching any query term should be retrieved.

  • Avoiding score inflation from documents matching many terms weakly.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain in the fitted vocabulary. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • tie_breaker (float) – Tie breaker parameter. Defaults to 0.0. 0.0 = pure maximum, 1.0 = sum all scores.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = DisMaxEmbedder(k1=1.5, tie_breaker=0.1, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • tie_breaker (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the DisMax pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. DisMaxTransformer: Applies DisMax weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

embed(input_text)[source]

Embed text into sparse vectors with DisMax scores.

Unlike other embedders that return a vector with multiple non-zero entries, DisMaxEmbedder returns a single score per document (the maximum term score).

Parameters:

input_text (str | List[str] | List[List[str]]) – Single document or batch of documents.

Returns:

For each document, returns a dictionary with a single entry {0: dismax_score} representing the maximum term score.

Return type:

Union[SparseVector, List[SparseVector]]
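
A sketch of the output shape (scores shown as placeholders, not computed values):

>>> embedder = DisMaxEmbedder(tie_breaker=0.1)
>>> embedder.fit(["first document", "second document"])
>>> embedder.embed("first")  # doctest: +SKIP
{0: ...}
>>> embedder.embed(["first", "second"])  # doctest: +SKIP
[{0: ...}, {0: ...}]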

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

Modules

  • base – Base classes for sparse embedding transformers.

  • bm25 – BM25 sparse embedding using scikit-learn pipelines.

  • bm25l – BM25L sparse embedding with linear length normalization.

  • bm25plus – BM25+ sparse embedding with smoothing to prevent zero scores.

  • count – Count-based sparse embedding using term frequencies.

  • dismax – DisMax (Disjunctive Maximum) sparse embedding for multi-field search.

  • tfidf – TF-IDF (Term Frequency-Inverse Document Frequency) sparse embedding.