This embedder converts text documents into sparse vectors based on term
frequencies (raw counts of each token). It is the simplest sparse embedding
method and serves as a foundation for more advanced techniques like BM25
and TF-IDF.
The embedder accepts raw strings or pre-tokenized input. Any keyword
arguments are forwarded to the underlying CountVectorizer after being
normalized by BaseSparseEmbedder._prepare_vectorizer_params().
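The sketch below illustrates the underlying idea using scikit-learn's
CountVectorizer directly; the corpus and variable names are illustrative,
not part of this API.
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ["the cat sat", "the cat sat on the mat"]
>>> vectorizer = CountVectorizer(max_features=8192)
>>> counts = vectorizer.fit_transform(corpus)  # scipy.sparse CSR matrix of raw term counts
>>> counts.shape  # (n_docs, vocabulary size)
(2, 5)
>>> sorted(vectorizer.vocabulary_)  # keys map tokens to feature indices
['cat', 'mat', 'on', 'sat', 'the']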
Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided,
it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists
of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum vocabulary size (number of features)
to retain. Defaults to 8192.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for
automatic text preprocessing (normalization, stemming, stopwords).
If set, preprocessing is automatically applied during fit() and embed().
This uses PEP 487 [1]_ to generate the set_{method}_request methods. Default
request values are read from the __metadata_request__* class attributes when
present, or inferred from method signatures otherwise.
The __metadata_request__* class attributes are used when a method does not
explicitly accept a given piece of metadata through its arguments, or when
the developer wants a default request value other than None for that metadata.
This is the primary user-facing method for generating embeddings. Unlike
transform() which returns a scipy sparse matrix, this method
returns zvec-compatible dictionaries mapping {feature_index:value}.
The method automatically handles both single documents and batches,
returning a single dictionary for a single input or a list of
dictionaries for batch input.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Single document: dict[int,float] mapping feature indices
to values.
Batch: list[dict[int,float]] with one dictionary per
document.
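As a hedged sketch of the conversion embed() performs, each row of the CSR
matrix can be turned into a {feature_index: value} dictionary as below;
row_to_dict is illustrative, not part of the documented API.
>>> def row_to_dict(matrix, i):
...     start, end = matrix.indptr[i], matrix.indptr[i + 1]
...     return {int(j): float(v) for j, v in
...             zip(matrix.indices[start:end], matrix.data[start:end])}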
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (list[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
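A minimal sketch of the batching loop this method describes; embedder stands
for any fitted embedder from this module, and tqdm is optional.
>>> from tqdm import tqdm
>>> def embed_in_batches(embedder, documents, batch_size=32, show_progress=False):
...     starts = range(0, len(documents), batch_size)
...     if show_progress:
...         starts = tqdm(starts, desc="Embedding")
...     results = []
...     for start in starts:
...         results.extend(embedder.embed(documents[start:start + batch_size]))
...     return results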
This is a convenience method that calls fit() followed by
transform(). It is useful for training and obtaining embeddings
without storing intermediate results.
Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
This method restores the model state from a file previously saved with
save() or save_pretrained(). The preprocessing configuration
and other settings (is_pretokenized, max_features) are also restored.
Returns the preprocessed text (str), or a list of tokens (list) if an HF
tokenizer is configured. If no preprocessing_config is set, the original
text is returned unchanged.
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by
scikit-learn models, while preserving information about the original
input structure to restore the correct return type.
The method handles three configurations:
Pre-tokenized mode: Validates and wraps token lists.
Custom tokenizer: Applies the tokenizer to string inputs.
Plain-text mode: Passes raw strings through to the underlying vectorizer.
A minimal sketch of the single-versus-batch handling follows this list.
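The sketch assumes a bare string means a single document; the function name
is illustrative, not part of the documented API.
>>> def normalize_input(input_text):
...     if isinstance(input_text, str):
...         return [input_text], True   # single document: wrap, unwrap later
...     return list(input_text), False  # already a batch of strings or token lists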
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
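A hedged usage sketch; embedder_cls stands for any embedder class documented
here, and routing must first be enabled globally.
>>> from sklearn import set_config
>>> set_config(enable_metadata_routing=True)  # routing is disabled by default
>>> embedder = embedder_cls().set_fit_request(corpus=True)  # request corpus in fit()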
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It
automatically handles tokenization based on the embedder’s configuration
before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Sparse feature matrix with shape (n_docs,n_features).
Return type:
csr_matrix
Raises:
RuntimeError – If the model has not been fitted or loaded.
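Usage sketch; embedder stands for any fitted embedder documented here.
>>> matrix = embedder.transform(["the cat sat on the mat"])
>>> matrix.shape  # (n_docs, n_features)
>>> matrix.nnz    # number of non-zero term weights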
Sparse embedder implementing the BM25 scoring formula.
This class wires together a CountVectorizer with a lightweight
BM25Transformer. Tokenization behavior is controlled by the two
parameters inherited from BaseSparseEmbedder:
is_pretokenized tells the embedder to expect lists of tokens as input
and skips any preprocessing altogether.
tokenizer lets the caller supply a callable that is applied to every
raw text document before vectorization. When a tokenizer is used, the data
passed to the scikit-learn pipeline also consists of token lists; the
vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided,
it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must be lists of tokens.
Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features to retain.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2.
b (float) – Length normalization parameter. Defaults to 0.75.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for
automatic text preprocessing (normalization, stemming, stopwords).
If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional parameters for CountVectorizer.
Train the BM25 pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25Transformer: Applies BM25 weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration
(custom tokenizer or pre-tokenized mode) before being passed to the
pipeline.
Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless
is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
Returns:
The fitted embedder.
Return type:
self
Raises:
ValueError – If corpus format doesn’t match the configuration.
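For reference, a hedged sketch of BM25 weighting applied to a term-count
matrix, following the standard formulation with the k1 and b parameters
above; the smoothing details of this library's BM25Transformer may differ.
>>> import numpy as np
>>> def bm25_weight(counts, k1=1.2, b=0.75):
...     n_docs = counts.shape[0]
...     doc_len = np.asarray(counts.sum(axis=1)).ravel()   # tokens per document
...     avgdl = doc_len.mean()
...     df = np.asarray((counts > 0).sum(axis=0)).ravel()  # document frequency per term
...     idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))
...     out = counts.tocsr().astype(np.float64, copy=True)
...     for i in range(n_docs):
...         lo, hi = out.indptr[i], out.indptr[i + 1]
...         tf = out.data[lo:hi]
...         norm = k1 * (1 - b + b * doc_len[i] / avgdl)
...         out.data[lo:hi] = idf[out.indices[lo:hi]] * tf * (k1 + 1) / (tf + norm)
...     return out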
This uses PEP 487 [1]_ to generate the set_{method}_request methods. Default
request values are read from the __metadata_request__* class attributes when
present, or inferred from method signatures otherwise.
The __metadata_request__* class attributes are used when a method does not
explicitly accept a given piece of metadata through its arguments, or when
the developer wants a default request value other than None for that metadata.
This is the primary user-facing method for generating embeddings. Unlike
transform() which returns a scipy sparse matrix, this method
returns zvec-compatible dictionaries mapping {feature_index:value}.
The method automatically handles both single documents and batches,
returning a single dictionary for a single input or a list of
dictionaries for batch input.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Single document: dict[int,float] mapping feature indices
to values.
Batch: list[dict[int,float]] with one dictionary per
document.
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (list[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
This is a convenience method that calls fit() followed by
transform(). It is useful for training and obtaining embeddings
without storing intermediate results.
Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
This method restores the model state from a file previously saved with
save() or save_pretrained(). The preprocessing configuration
and other settings (is_pretokenized, max_features) are also restored.
Returns the preprocessed text (str), or a list of tokens (list) if an HF
tokenizer is configured. If no preprocessing_config is set, the original
text is returned unchanged.
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by
scikit-learn models, while preserving information about the original
input structure to restore the correct return type.
The method handles three configurations:
Pre-tokenized mode: Validates and wraps token lists.
Custom tokenizer: Applies the tokenizer to string inputs.
Plain-text mode: Passes raw strings through to the underlying vectorizer.
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It
automatically handles tokenization based on the embedder’s configuration
before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Sparse feature matrix with shape (n_docs,n_features).
Return type:
csr_matrix
Raises:
RuntimeError – If the model has not been fitted or loaded.
Sparse embedder implementing the BM25L scoring formula.
BM25L is a variant of BM25 that uses linear length normalization, making it
more suitable for corpora with highly variable document lengths.
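For reference, the standard BM25L formulation (Lv and Zhai, 2011) first
rescales term frequency by a linear length component and then shifts it by a
small \(\delta\) before saturation; the exact parameterization used by
BM25LTransformer may differ:
\[c' = \frac{\mathrm{tf}_{t,d}}{1 - b + b \cdot |d| / \mathrm{avgdl}}, \qquad
\text{BM25L}(t, d) = \mathrm{idf}(t) \cdot \frac{(k_1 + 1)(c' + \delta)}{k_1 + c' + \delta}\]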
This class wires together a CountVectorizer with a BM25LTransformer.
Tokenization behavior is controlled by the two parameters inherited from
BaseSparseEmbedder:
is_pretokenized tells the embedder to expect lists of tokens as input
and skips any preprocessing altogether.
tokenizer lets the caller supply a callable that is applied to every
raw text document before vectorization. When a tokenizer is used, the data
passed to the scikit-learn pipeline also consists of token lists; the
vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided,
it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists
of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum vocabulary size (number of features)
to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2.
Typical range: 1.2-2.0. Higher values mean slower saturation.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for
automatic text preprocessing (normalization, stemming, stopwords).
If set, preprocessing is automatically applied during fit() and embed().
Train the BM25L pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25LTransformer: Applies BM25L weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration
(custom tokenizer or pre-tokenized mode) before being passed to the
pipeline.
Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless
is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
Returns:
The fitted embedder.
Return type:
self
Raises:
ValueError – If corpus format doesn’t match the configuration.
This uses PEP 487 [1]_ to generate the set_{method}_request methods. Default
request values are read from the __metadata_request__* class attributes when
present, or inferred from method signatures otherwise.
The __metadata_request__* class attributes are used when a method does not
explicitly accept a given piece of metadata through its arguments, or when
the developer wants a default request value other than None for that metadata.
This is the primary user-facing method for generating embeddings. Unlike
transform() which returns a scipy sparse matrix, this method
returns zvec-compatible dictionaries mapping {feature_index:value}.
The method automatically handles both single documents and batches,
returning a single dictionary for a single input or a list of
dictionaries for batch input.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Single document: dict[int,float] mapping feature indices
to values.
Batch: list[dict[int,float]] with one dictionary per
document.
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (list[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
This is a convenience method that calls fit() followed by
transform(). It is useful for training and obtaining embeddings
without storing intermediate results.
Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
This method restores the model state from a file previously saved with
save() or save_pretrained(). The preprocessing configuration
and other settings (is_pretokenized, max_features) are also restored.
Returns the preprocessed text (str), or a list of tokens (list) if an HF
tokenizer is configured. If no preprocessing_config is set, the original
text is returned unchanged.
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by
scikit-learn models, while preserving information about the original
input structure to restore the correct return type.
The method handles three configurations:
Pre-tokenized mode: Validates and wraps token lists.
Custom tokenizer: Applies the tokenizer to string inputs.
Plain-text mode: Passes raw strings through to the underlying vectorizer.
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It
automatically handles tokenization based on the embedder’s configuration
before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Sparse feature matrix with shape (n_docs,n_features).
Return type:
csr_matrix
Raises:
RuntimeError – If the model has not been fitted or loaded.
Sparse embedder implementing the BM25+ scoring formula.
BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents
zero scores for terms with zero term frequency. This can improve retrieval
performance, especially for corpora with many rare terms.
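For reference, the standard BM25+ formulation (Lv and Zhai, 2011) adds
\(\delta\) outside the saturation term, so every occurring term contributes
at least \(\mathrm{idf}(t) \cdot \delta\); the exact parameterization used by
BM25PlusTransformer may differ:
\[\text{BM25+}(t, d) = \mathrm{idf}(t) \cdot \left(\frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{k_1 \left(1 - b + b \cdot |d| / \mathrm{avgdl}\right) + \mathrm{tf}_{t,d}} + \delta\right)\]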
This class wires together a CountVectorizer with a BM25PlusTransformer.
Tokenization behavior is controlled by the two parameters inherited from
BaseSparseEmbedder:
is_pretokenized tells the embedder to expect lists of tokens as input
and skips any preprocessing altogether.
tokenizer lets the caller supply a callable that is applied to every
raw text document before vectorization. When a tokenizer is used, the data
passed to the scikit-learn pipeline also consists of token lists; the
vectorizer is therefore configured to act as an identity transformer.
The two options are mutually exclusive and validated by the base class.
Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided,
it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists
of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum vocabulary size (number of features)
to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2.
Typical range: 1.2-2.0. Higher values mean slower saturation.
b (float) – Length normalization parameter. Defaults to 0.75.
Typical range: 0.5-1.0. b=1.0 means full length normalization.
delta (float) – Smoothing parameter. Defaults to 0.5.
Typical range: 0.4-1.0. Higher values increase the baseline score.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for
automatic text preprocessing (normalization, stemming, stopwords).
If set, preprocessing is automatically applied during fit() and embed().
Train the BM25+ pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration
(custom tokenizer or pre-tokenized mode) before being passed to the
pipeline.
Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless
is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
Returns:
The fitted embedder.
Return type:
self
Raises:
ValueError – If corpus format doesn’t match the configuration.
This uses PEP 487 [1]_ to generate the set_{method}_request methods. Default
request values are read from the __metadata_request__* class attributes when
present, or inferred from method signatures otherwise.
The __metadata_request__* class attributes are used when a method does not
explicitly accept a given piece of metadata through its arguments, or when
the developer wants a default request value other than None for that metadata.
This is the primary user-facing method for generating embeddings. Unlike
transform() which returns a scipy sparse matrix, this method
returns zvec-compatible dictionaries mapping {feature_index:value}.
The method automatically handles both single documents and batches,
returning a single dictionary for a single input or a list of
dictionaries for batch input.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Single document: dict[int,float] mapping feature indices
to values.
Batch: list[dict[int,float]] with one dictionary per
document.
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (list[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
This is a convenience method that calls fit() followed by
transform(). It is useful for training and obtaining embeddings
without storing intermediate results.
Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
This method restores the model state from a file previously saved with
save() or save_pretrained(). The preprocessing configuration
and other settings (is_pretokenized, max_features) are also restored.
Returns the preprocessed text (str), or a list of tokens (list) if an HF
tokenizer is configured. If no preprocessing_config is set, the original
text is returned unchanged.
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by
scikit-learn models, while preserving information about the original
input structure to restore the correct return type.
The method handles three configurations:
Pre-tokenized mode: Validates and wraps token lists.
Custom tokenizer: Applies the tokenizer to string inputs.
Plain-text mode: Passes raw strings through to the underlying vectorizer.
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It
automatically handles tokenization based on the embedder’s configuration
before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Sparse feature matrix with shape (n_docs,n_features).
Return type:
csr_matrix
Raises:
RuntimeError – If the model has not been fitted or loaded.
Sparse embedder implementing the DisMax scoring formula.
DisMax (Disjunctive Maximum) takes the maximum score across multiple terms
or fields, rather than summing them. This is useful when you want documents
that match at least one term well, rather than documents that match all
terms moderately.
The DisMax score formula is:
\[\text{DisMax}(d) = \max_{t \in T} \text{score}_t(d) +
\text{tb} \times \sum_{t \in T \setminus \{t^{*}\}} \text{score}_t(d)\]
where \(\text{tb}\) is the tie_breaker parameter and \(t^{*}\) is the
highest-scoring term.
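For example, with per-term scores 0.8, 0.3, and 0.2 and tie_breaker = 0.1,
the DisMax score is 0.8 + 0.1 × (0.3 + 0.2) = 0.85; with the default
tie_breaker = 0.0 it is simply 0.8.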
This embedder is particularly useful for:
Multi-field search (title, content, tags) where matching any field well
should rank highly.
Disjunctive queries where documents matching any query term should be
retrieved.
Avoiding score inflation from documents matching many terms weakly.
Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided,
it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists
of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum vocabulary size (number of features)
to retain. Defaults to 8192.
k1 (float) – Term frequency saturation parameter. Defaults to 1.2.
Typical range: 1.2-2.0. Higher values mean slower saturation.
b (float) – Length normalization parameter. Defaults to 0.75.
Typical range: 0.5-1.0. b=1.0 means full length normalization.
tie_breaker (float) – Tie breaker parameter. Defaults to 0.0.
0.0 = pure maximum, 1.0 = sum all scores.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for
automatic text preprocessing (normalization, stemming, stopwords).
If set, preprocessing is automatically applied during fit() and embed().
Train the DisMax pipeline on a corpus of documents.
This method builds a scikit-learn pipeline consisting of:
1. CountVectorizer: Tokenizes documents and builds term counts.
2. DisMaxTransformer: Applies DisMax weighting to the count matrix.
The corpus is pre-processed according to the embedder’s configuration
(custom tokenizer or pre-tokenized mode) before being passed to the
pipeline.
Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless
is_pretokenized=True or a custom tokenizer is set.
y (Any) – Ignored; present for scikit-learn compatibility.
Returns:
The fitted embedder.
Return type:
self
Raises:
ValueError – If corpus format doesn’t match the configuration.
Embed text into sparse vectors with DisMax scores.
Unlike other embedders, which return a vector with multiple non-zero
entries, DisMaxEmbedder returns a single score per document (its DisMax
score).
Parameters:
input_text (str | List[str] | List[List[str]]) – Single document or batch of documents.
Returns:
For each document, a dictionary with a single entry {0: dismax_score}
representing the document’s DisMax score.
This uses PEP 487 [1]_ to generate the set_{method}_request methods. Default
request values are read from the __metadata_request__* class attributes when
present, or inferred from method signatures otherwise.
The __metadata_request__* class attributes are used when a method does not
explicitly accept a given piece of metadata through its arguments, or when
the developer wants a default request value other than None for that metadata.
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (list[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
This is a convenience method that calls fit() followed by
transform(). It is useful for training and obtaining embeddings
without storing intermediate results.
Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
This method restores the model state from a file previously saved with
save() or save_pretrained(). The preprocessing configuration
and other settings (is_pretokenized, max_features) are also restored.
Returns the preprocessed text (str), or a list of tokens (list) if an HF
tokenizer is configured. If no preprocessing_config is set, the original
text is returned unchanged.
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by
scikit-learn models, while preserving information about the original
input structure to restore the correct return type.
The method handles three configurations:
Pre-tokenized mode: Validates and wraps token lists.
Custom tokenizer: Applies the tokenizer to string inputs.
Plain-text mode: Passes raw strings through to the underlying vectorizer.
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It
automatically handles tokenization based on the embedder’s configuration
before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Sparse feature matrix with shape (n_docs,n_features).
Return type:
csr_matrix
Raises:
RuntimeError – If the model has not been fitted or loaded.
Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure
that evaluates how relevant a word is to a document in a collection of
documents. It is computed as the product of:
Term Frequency (TF): How often a term appears in a document.
Inverse Document Frequency (IDF): A penalty factor for terms that
appear in many documents.
This embedder supports custom tokenization and pre-tokenized inputs. All
additional keyword arguments are passed through to the underlying
TfidfVectorizer (e.g., min_df, max_df, ngram_range,
sublinear_tf).
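For reference, with scikit-learn’s defaults (smooth_idf=True, norm="l2"),
TfidfVectorizer computes
\[\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \left(\ln \frac{1 + n}{1 + \mathrm{df}(t)} + 1\right)\]
where \(n\) is the number of documents and \(\mathrm{df}(t)\) is the number
of documents containing \(t\), and then L2-normalizes each document row.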
Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided,
it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists
of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum vocabulary size (number of features)
to retain. Defaults to 8192.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for
automatic text preprocessing (normalization, stemming, stopwords).
If set, preprocessing is automatically applied during fit() and embed().
**tfidf_params – Additional keyword arguments passed to
TfidfVectorizer.
This uses PEP 487 [1]_ to generate the set_{method}_request methods. Default
request values are read from the __metadata_request__* class attributes when
present, or inferred from method signatures otherwise.
The __metadata_request__* class attributes are used when a method does not
explicitly accept a given piece of metadata through its arguments, or when
the developer wants a default request value other than None for that metadata.
This is the primary user-facing method for generating embeddings. Unlike
transform() which returns a scipy sparse matrix, this method
returns zvec-compatible dictionaries mapping {feature_index:value}.
The method automatically handles both single documents and batches,
returning a single dictionary for a single input or a list of
dictionaries for batch input.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Single document: dict[int,float] mapping feature indices
to values.
Batch: list[dict[int,float]] with one dictionary per
document.
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (list[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
This is a convenience method that calls fit() followed by
transform(). It is useful for training and obtaining embeddings
without storing intermediate results.
Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
This method restores the model state from a file previously saved with
save() or save_pretrained(). The preprocessing configuration
and other settings (is_pretokenized, max_features) are also restored.
Returns the preprocessed text (str), or a list of tokens (list) if an HF
tokenizer is configured. If no preprocessing_config is set, the original
text is returned unchanged.
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by
scikit-learn models, while preserving information about the original
input structure to restore the correct return type.
The method handles three configurations:
Pre-tokenized mode: Validates and wraps token lists.
Custom tokenizer: Applies the tokenizer to string inputs.
Plain-text mode: Passes raw strings through to the underlying vectorizer.
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a
sub-estimator within a meta-estimator and metadata routing is enabled
with enable_metadata_routing=True (see sklearn.set_config()).
Please check the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It
automatically handles tokenization based on the embedder’s configuration
before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform())
or loaded before calling this method.
Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
Returns:
Sparse feature matrix with shape (n_docs,n_features).
Return type:
csr_matrix
Raises:
RuntimeError – If the model has not been fitted or loaded.
Dense embeddings using Sentence Transformers models locally.
This embedder uses pre-trained models from the sentence-transformers
library to generate semantic embeddings. It supports hundreds of
models available on HuggingFace.
Parameters:
model_name (str, optional) – Name of the model from HuggingFace.
Examples:
- “all-MiniLM-L6-v2” (384 dims, fast)
- “all-mpnet-base-v2” (768 dims, best quality)
- “BAAI/bge-small-en-v1.5” (384 dims, good quality)
Defaults to “all-MiniLM-L6-v2”.
device (Optional[str], optional) – Device to run model on.
“cpu”, “cuda”, or None for auto-detect. Defaults to None.
max_length (Optional[int], optional) – Maximum sequence length.
Defaults to 512.
normalize (bool, optional) – Normalize embeddings to unit length.
Defaults to True for cosine similarity compatibility.
trust_remote_code (bool, optional) – Trust remote code in model.
Defaults to False.
model_kwargs (Optional[Mapping[str, Any]], optional) – Additional keyword arguments
passed to SentenceTransformer constructor. Useful for options like:
- torch_dtype: Model dtype (torch.float16, torch.bfloat16, “auto”)
- trust_remote_code: Trust remote code from HuggingFace Hub
- token: HuggingFace API token for private models
- revision: Model revision to load
- cache_dir: Custom cache directory
- local_files_only: Load only local files
- attn_implementation: Attention implementation (e.g., “flash_attention_2”)
Defaults to None (no additional kwargs).
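A hedged usage sketch of the library call this embedder wraps; the wrapper’s
own methods are documented here, and the model choice and inputs are
illustrative.
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
>>> vectors = model.encode(
...     ["What is machine learning?", "ML is a subset of AI."],
...     normalize_embeddings=True,  # unit-length vectors for cosine similarity
... )
>>> vectors.shape  # (2, 384) for all-MiniLM-L6-v2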
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding
documents in smaller batches. It supports an optional progress bar
for tracking long-running operations.
Parameters:
documents (List[str]) – List of documents to embed.
batch_size (int, optional) – Number of documents per batch.
Defaults to 32.
show_progress (bool, optional) – Show progress bar. Defaults to False.
Dense embedder using OpenAI-compatible /embeddings endpoint.
This embedder uses the /v1/embeddings endpoint to compute dense
vector representations of texts. It’s compatible with OpenAI’s embedding
API format and supports batch processing.
Works with:
- OpenAI API (text-embedding-3-small, text-embedding-3-large, etc.)
- vLLM serving open-source embedding models
- Any OpenAI-compatible API endpoint
Parameters:
model (str) – Model name to use.
OpenAI: “text-embedding-3-small”, “text-embedding-3-large”
vLLM: Model name configured in vLLM
api_key (Optional[str], optional) – API key for authentication.
Defaults to None (reads from OPENAI_API_KEY env var).
dimensions (Optional[int], optional) – Output embedding dimensions.
Only supported by some models (e.g., text-embedding-3-small).
Defaults to None (use model default).
timeout (float, optional) – HTTP request timeout in seconds.
Defaults to 30.0.
encoding_format (str, optional) – Encoding format for embeddings.
“float” for float32 vectors, “base64” for base64-encoded.
Defaults to “float”.
max_batch_size (Optional[int], optional) – Maximum number of texts
to embed in a single batch. None means no limit.
Defaults to None.
truncate_prompt_tokens (Optional[int], optional) – Maximum number of tokens
for prompt truncation. When set, prompts exceeding this limit are truncated.
By default, the API rejects prompts exceeding the model’s maximum length (max_model_len) unless this is set.
Defaults to None (no truncation).
query_prefix (str, optional) – Prefix to add to query texts.
Useful for asymmetric embedding models like E5, GTE, etc.
Example: "query: " for E5 models.
Defaults to “” (no prefix).
passage_prefix (str, optional) – Prefix to add to passage/document texts.
Useful for asymmetric embedding models like E5, GTE, etc.
Example: "passage: " for E5 models.
Defaults to “” (no prefix).
model_kwargs (Optional[Mapping[str, Any]], optional) – Additional keyword arguments
passed to the API request. Useful for options like:
- user: Unique identifier for monitoring and abuse detection
- extra_headers: Additional HTTP headers
- extra_query_params: Additional query parameters
Defaults to None (no additional kwargs).
model_name (str, optional) – Deprecated. Use model instead.
This parameter is kept for backward compatibility.
Defaults to None.
max_retries (int, optional) – Maximum number of retry attempts for transient failures.
Set to 0 to disable retries. Defaults to 3.
initial_delay (float, optional) – Initial delay before first retry in seconds.
Defaults to 1.0.
max_delay (float, optional) – Maximum delay cap in seconds. Defaults to 60.0.
exponential_base (float, optional) – Base for exponential backoff. Defaults to 2.0.
jitter (float, optional) – Random jitter factor (0.0-1.0) to avoid thundering herd.
Defaults to 0.1.
retry_config (Optional[RetryConfig], optional) – Pre-configured retry settings.
If provided, overrides individual retry parameters. Defaults to None.
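A minimal sketch of the backoff schedule these parameters describe, assuming
delays grow exponentially, are capped at max_delay, and receive
multiplicative jitter; RetryConfig’s exact formula may differ.
>>> import random
>>> def retry_delay(attempt, initial_delay=1.0, exponential_base=2.0,
...                 max_delay=60.0, jitter=0.1):
...     delay = min(initial_delay * exponential_base ** attempt, max_delay)
...     return delay * (1 + random.uniform(-jitter, jitter))  # spread retries out
>>> # Attempts 0..3 with defaults: ~1s, ~2s, ~4s, ~8s (each within ±10%).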
>>> # With truncation to handle long prompts
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="embedding",
...     truncate_prompt_tokens=512
... )
>>> # With prefixes for asymmetric models (e.g., E5, GTE)
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     query_prefix="query: ",
...     passage_prefix="passage: "
... )
>>> query_vector = embedder.embed_query("What is machine learning?")
>>> doc_vector = embedder.embed_passage("ML is a subset of AI.")
>>> # With custom retry settings for production
>>> embedder = OpenAIEmbedder(
...     model="text-embedding-3-small",
...     max_retries=5,
...     initial_delay=2.0,
...     max_delay=120.0,
... )
See also
SentenceTransformersEmbedder: Local dense embeddings using HuggingFace models.
RetryConfig: Configuration class for retry behavior.