zvec_db.embedders.sparse.dismax

DisMax (Disjunctive Maximum) sparse embedding for multi-field search.

This module implements the DisMax (Disjunctive Maximum) scoring formula, which takes the maximum score across multiple queries or fields rather than summing them. This is useful when you want documents that match at least one query/field well, rather than documents that match all queries/fields moderately.

Classes

DisMaxTransformer

Scikit-learn transformer implementing DisMax scoring.

DisMaxEmbedder

High-level embedder wrapping DisMaxTransformer with zvec-db compatibility.

Example Usage

from zvec_db.embedders import DisMaxEmbedder

embedder = DisMaxEmbedder(
    k1=1.2,
    b=0.75,
    tie_breaker=0.1,
    max_features=4096
)
embedder.fit(documents)
vector = embedder.embed("search query")


class zvec_db.embedders.sparse.dismax.DisMaxTransformer(k1=1.2, b=0.75, tie_breaker=0.0)[source]

Custom scikit-learn transformer implementing the DisMax scoring formula.

DisMax (Disjunctive Maximum) is a scoring function that takes the maximum score across multiple queries or fields, rather than summing them. This is useful when you want documents that match at least one query/field well, rather than documents that match all queries/fields moderately.

The DisMax score for a document is computed as:

\[\text{DisMax}(d) = \max_{q \in Q}(\text{score}_q(d)) + t \times \sum_{q \in Q \setminus \{\text{argmax}\}} \text{score}_q(d)\]
where:
  • \(Q\) is the set of queries or fields

  • \(\text{score}_q(d)\) is the BM25 score of document \(d\) for query \(q\)

  • \(t\) is the tie breaker parameter (0.0 = pure max, 1.0 = sum)

When tie_breaker=0.0, only the maximum score is used (pure DisMax). When tie_breaker=1.0, all scores are summed (equivalent to standard fusion). Intermediate values provide a blend of both approaches.
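The blending rule above can be sketched in a few lines of plain Python (an illustrative helper, not part of the library's API):

```python
def dismax(scores, tie_breaker=0.0):
    """Combine per-query scores with the DisMax rule: the maximum score
    plus tie_breaker times the sum of the remaining scores."""
    if not scores:
        return 0.0
    best = max(scores)
    rest = sum(scores) - best  # every score except the maximum
    return best + tie_breaker * rest
```

With `tie_breaker=0.0` this reduces to `max(scores)`; with `tie_breaker=1.0` it reduces to `sum(scores)`.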

This transformer is typically used in multi-query scenarios where:

  • Different queries target different aspects of relevance.

  • You want to avoid penalizing documents that match only one query well.

  • Summing scores would unfairly advantage documents matching multiple queries.

Parameters:
  • k1 (float) – Term frequency saturation parameter. Controls how quickly term frequency saturates. Higher values mean slower saturation. Typical range: 1.2 to 2.0. Defaults to 1.2.

  • b (float) – Length normalization parameter. Controls the influence of document length. b=1.0 means full length normalization, b=0.0 disables it. Defaults to 0.75.

  • tie_breaker (float) – Tie breaker parameter. When multiple queries match, adds a fraction of non-maximum scores to the maximum. 0.0 = use only maximum score, 1.0 = sum all scores. Typical range: 0.0 to 0.5. Defaults to 0.0 (pure DisMax).

idf_

Computed inverse document frequencies for all terms.

Type:

ndarray

avgdl_

Average document length in the training corpus.

Type:

float

is_fitted_

Whether the transformer has been fitted.

Type:

bool

Example

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([
...     ("count", CountVectorizer()),
...     ("dismax", DisMaxTransformer(k1=1.5, tie_breaker=0.1))
... ])
>>> pipeline.fit(documents)
__init__(k1=1.2, b=0.75, tie_breaker=0.0)[source]

Initialize the DisMax transformer.

Parameters:
  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • tie_breaker (float) – Tie breaker parameter. Defaults to 0.0. Typical range: 0.0-0.5. 0.0 = pure max, 1.0 = sum.

fit(X, y=None)[source]

Compute IDF values and average document length from a count matrix.

Parameters:
  • X (csr_matrix) – Sparse count matrix of shape (n_docs, n_terms).

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted transformer.

Return type:

self

Raises:

ValueError – If the corpus is empty (average document length is zero).

transform(X)[source]

Apply DisMax scoring to a count matrix.

For each document, computes the BM25 score for each term and takes the maximum across terms, with optional tie breaking from other terms.

Parameters:

X (csr_matrix) – Sparse count matrix of shape (n_docs, n_terms).

Returns:

DisMax-weighted sparse matrix of shape (n_docs, 1), where each row contains the DisMax score for that document.

Return type:

csr_matrix

Raises:

RuntimeError – If the transformer has not been fitted.
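The behavior described for fit() and transform() can be approximated with the following NumPy/SciPy sketch; the exact IDF smoothing and edge-case handling used by DisMaxTransformer are assumptions, not the library's implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix

def fit_stats(X):
    """Compute per-term IDF values and the average document length
    from a (n_docs, n_terms) sparse count matrix."""
    n_docs = X.shape[0]
    df = np.asarray((X > 0).sum(axis=0)).ravel()           # document frequency per term
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))   # BM25-style smoothed IDF (assumed)
    avgdl = X.sum() / n_docs                               # average document length
    if avgdl == 0:
        raise ValueError("empty corpus: average document length is zero")
    return idf, avgdl

def dismax_transform(X, idf, avgdl, k1=1.2, b=0.75, tie_breaker=0.0):
    """Score each document: BM25 per term, then max plus tie-broken rest."""
    X = csr_matrix(X, dtype=float)
    doclen = np.asarray(X.sum(axis=1)).ravel()
    scores = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        row = X.getrow(i)
        tf, terms = row.data, row.indices
        denom = tf + k1 * (1.0 - b + b * doclen[i] / avgdl)
        bm25 = idf[terms] * tf * (k1 + 1.0) / denom        # BM25 score per matching term
        if bm25.size:
            best = bm25.max()
            scores[i] = best + tie_breaker * (bm25.sum() - best)
    return csr_matrix(scores.reshape(-1, 1))               # shape (n_docs, 1)
```

Note the (n_docs, 1) output shape: each document collapses to a single DisMax score, matching the transform() contract above.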

class zvec_db.embedders.sparse.dismax.DisMaxEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the DisMax scoring formula.

DisMax (Disjunctive Maximum) takes the maximum score across multiple terms or fields, rather than summing them. This is useful when you want documents that match at least one term well, rather than documents that match all terms moderately.

The DisMax score formula is:

\[\text{DisMax}(d) = \max_{w \in T}(\text{score}_w(d)) + t \times \sum_{w \in T \setminus \{\text{argmax}\}} \text{score}_w(d)\]

where \(T\) is the set of terms and \(t\) is the tie breaker parameter.

This embedder is particularly useful for:

  • Multi-field search (title, content, tags) where matching any field well should rank highly.

  • Disjunctive queries where documents matching any query term should be retrieved.

  • Avoiding score inflation from documents matching many terms weakly.
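As a concrete illustration of the multi-field case, the combination rule for a single document looks like this (the field names and score values here are made up for illustration):

```python
# Hypothetical per-field relevance scores for one document.
field_scores = {"title": 4.2, "content": 1.1, "tags": 0.7}
tie_breaker = 0.1

best = max(field_scores.values())                 # strongest field match: 4.2
rest = sum(field_scores.values()) - best          # remaining fields: 1.1 + 0.7 = 1.8
combined = best + tie_breaker * rest              # 4.2 + 0.1 * 1.8 = 4.38
```

A strong title match dominates the final score, while weaker matches in other fields contribute only a small, tie-breaking boost.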

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • tie_breaker (float) – Tie breaker parameter. Defaults to 0.0. 0.0 = pure maximum, 1.0 = sum all scores.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = DisMaxEmbedder(k1=1.5, tie_breaker=0.1, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • tie_breaker (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the DisMax pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. DisMaxTransformer: Applies DisMax weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

embed(input_text)[source]

Embed text into sparse vectors with DisMax scores.

Unlike other embedders that return a vector with multiple non-zero entries, DisMaxEmbedder returns a single score per document (the maximum term score).

Parameters:

input_text (str | List[str] | List[List[str]]) – Single document or batch of documents.

Returns:

For each document, returns a dictionary with a single entry {0: dismax_score} representing the maximum term score.

Return type:

Union[SparseVector, List[SparseVector]]

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object