Sparse and Dense Embedding

Overview

The zvec_db.embedders sub-package provides sparse and dense embedding models for text vectorization.

Sparse Embedders:

All sparse embedders return dictionaries {index: score, ...} compatible with zvec’s SPARSE_FP32 format.

Embedder | When to use
CountEmbedder | Baseline, documents of similar length
BM25Embedder | General use, good IR performance
BM25LEmbedder | Documents with very variable lengths
BM25PlusEmbedder | Many rare terms, need recall
DisMaxEmbedder | Multi-field, match any field
TfidfEmbedder | Relative term importance in corpus

Dense Embedders:

Embedder | When to use
SentenceTransformersEmbedder | Local models (e.g., all-MiniLM-L6-v2)
OpenAIEmbedder | OpenAI API or compatible endpoints (vLLM)
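
The sketch below fits a sparse embedder on a tiny corpus and embeds a query; the corpus and the scores shown in the comment are illustrative only.

>>> from zvec_db.embedders import BM25Embedder
>>> corpus = ["the cat sat on the mat", "dogs chase cats"]
>>> embedder = BM25Embedder().fit(corpus)
>>> vector = embedder.embed("cat on a mat")
>>> vector  # e.g. {2: 0.61, 5: 0.47, ...}, a dict compatible with zvec's SPARSE_FP32 format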

CountEmbedder

class zvec_db.embedders.CountEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]

Count-based sparse embedder wrapping scikit-learn’s CountVectorizer.

This embedder converts text documents into sparse vectors based on term frequencies (raw counts of each token). It is the simplest sparse embedding method and serves as a foundation for more advanced techniques like BM25 and TF-IDF.

The embedder accepts raw strings or pre-tokenized input. Any keyword arguments are forwarded to the underlying CountVectorizer after being normalized by BaseSparseEmbedder._prepare_vectorizer_params().

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = CountEmbedder(min_df=2, ngram_range=(1, 2))
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the embedder on a corpus of documents.

The supplied corpus is normalised according to the instance configuration:

  • is_pretokenized=True - the caller must provide lists of tokens.

  • tokenizer=... - each string in the corpus will be passed through the tokenizer before vectorisation.

  • neither set - raw strings are passed to CountVectorizer directly.

_prepare_corpus handles the validation and transformation logic.

Parameters:

corpus (list[str] | list[list[str]]) – Sequence of documents (strings or token lists depending on configuration).

Returns:

self to allow chaining.
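
Because fit() returns the embedder itself, calls can be chained; a minimal sketch:

>>> documents = ["first document", "second document"]
>>> vector = CountEmbedder().fit(documents).embed("first document")
>>> vector  # {feature_index: count, ...}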

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics
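
Example

A minimal sketch; it assumes the returned dictionary uses exactly the keys listed above.

>>> embedder = CountEmbedder().fit(documents)
>>> _ = embedder.embed("query text")
>>> info = embedder.cache_info()
>>> sorted(info)
['max_size', 'size', 'utilization']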

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix
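
Example

A minimal sketch; the shape values in the comment are illustrative.

>>> matrix = CountEmbedder().fit_transform(documents)
>>> matrix.shape  # (n_docs, n_features), e.g. (2, 4) for a two-document corpus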

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None
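
Example

Since save_pretrained() and from_pretrained() are aliases for save() and load(), a round trip looks like the sketch below; the file path is illustrative.

>>> embedder.save_pretrained("models/count_model.joblib")
>>> restored = CountEmbedder()
>>> restored.from_pretrained("models/count_model.joblib")
>>> restored.embed("query text")  # ready to use without refitting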

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.
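
Example

A sketch of the default configuration (no tokenizer, not pre-tokenized); the exact return values follow the description above and are an assumption.

>>> embedder = CountEmbedder()
>>> embedder.preprocess_input("a single document")
(True, ['a single document'])
>>> embedder.preprocess_input(["doc one", "doc two"])
(False, ['doc one', 'doc two'])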

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.
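
Example

A minimal sketch; the number of features depends on the fitted vocabulary.

>>> embedder = CountEmbedder().fit(documents)
>>> matrix = embedder.transform(["query text", "another query"])
>>> matrix.shape  # (2, n_features) as a scipy csr_matrix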

model: csr_matrix | None
cache_size: int

BM25Embedder

class zvec_db.embedders.BM25Embedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25 scoring formula.

This class wires together a CountVectorizer with a lightweight BM25Transformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function.

  • is_pretokenized (bool) – If True, input documents must be lists of tokens.

  • max_features (Optional[int]) – Maximum number of features to retain.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2.

  • b (float) – Length normalization parameter. Defaults to 0.75.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional parameters for CountVectorizer.
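
Example

A usage sketch in the style of the other embedders; the parameter values are illustrative.

>>> embedder = BM25Embedder(k1=1.5, b=0.8, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])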

__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25 pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25Transformer: Applies BM25 weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

BM25LEmbedder

class zvec_db.embedders.BM25LEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25L scoring formula.

BM25L is a variant of BM25 that uses linear length normalization, making it more suitable for corpora with highly variable document lengths.

This class wires together a CountVectorizer with a BM25LTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25LEmbedder(k1=1.5, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25L pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25LTransformer: Applies BM25L weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

BM25PlusEmbedder

class zvec_db.embedders.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25+ scoring formula.

BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.

This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • delta (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25+ pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.
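
A minimal usage sketch (assuming documents is a list of training strings):

>>> embedder = BM25PlusEmbedder().fit(documents)
>>> matrix = embedder.transform(["query text"])
>>> matrix.shape  # (1, n_features), a scipy csr_matrix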

model: csr_matrix | None
cache_size: int

DisMaxEmbedder

class zvec_db.embedders.DisMaxEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the DisMax scoring formula.

DisMax (Disjunctive Maximum) takes the maximum score across multiple terms or fields, rather than summing them. This is useful when you want documents that match at least one term well, rather than documents that match all terms moderately.

The DisMax score formula is:

\[\text{DisMax}(d) = \max_{t \in T} \text{score}_t(d) + \lambda \times \sum_{t' \in T \setminus \{t^{*}\}} \text{score}_{t'}(d)\]

where \(\lambda\) is the tie_breaker parameter and \(t^{*}\) is the term with the maximum score.
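
For intuition, a small numeric sketch of the formula with made-up per-term scores (illustrative values only, not produced by the embedder):

scores = [3.0, 1.0, 2.0]              # score_t(d) for each query term t
tie_breaker = 0.1

best = max(scores)                     # 3.0, the dominant term
rest = sum(scores) - best              # 1.0 + 2.0 = 3.0
dismax = best + tie_breaker * rest     # 3.0 + 0.1 * 3.0 = 3.3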

This embedder is particularly useful for:

  • Multi-field search (title, content, tags) where matching any field well should rank highly.

  • Disjunctive queries where documents matching any query term should be retrieved.

  • Avoiding score inflation from documents matching many terms weakly.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • tie_breaker (float) – Tie breaker parameter. Defaults to 0.0. 0.0 = pure maximum, 1.0 = sum all scores.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = DisMaxEmbedder(k1=1.5, tie_breaker=0.1, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • tie_breaker (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the DisMax pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. DisMaxTransformer: Applies DisMax weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

embed(input_text)[source]

Embed text into sparse vectors with DisMax scores.

Unlike other embedders that return a vector with multiple non-zero entries, DisMaxEmbedder returns a single score per document (the maximum term score).

Parameters:

input_text (str | List[str] | List[List[str]]) – Single document or batch of documents.

Returns:

For each document, returns a dictionary with a single entry {0: dismax_score} representing the maximum term score.

Return type:

Union[SparseVector, List[SparseVector]]
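
A short usage sketch (the score shown is illustrative):

>>> embedder = DisMaxEmbedder(tie_breaker=0.1)
>>> embedder.fit(documents)
>>> embedder.embed("query text")  # e.g. {0: 2.37} (one max-term score per document)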

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or if the developer would like to specify a request value for that metadata different from the default None.

References

[1] PEP 487 – Simpler customisation of class creation: https://peps.python.org/pep-0487/

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics
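
A quick usage sketch (numbers are illustrative):

>>> embedder.embed("query text")  # populate the cache
>>> embedder.cache_info()  # e.g. {'size': 1, 'max_size': 128, 'utilization': 0.78}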

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix
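
A minimal sketch:

>>> embedder = DisMaxEmbedder(min_df=2)
>>> matrix = embedder.fit_transform(documents)
>>> matrix.shape  # (n_docs, n_features), a scipy csr_matrix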

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.
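
A short sketch of the returned tuple, assuming the default configuration (no custom tokenizer, not pre-tokenized); outputs reflect the behavior described above:

>>> embedder = BM25Embedder()
>>> embedder.preprocess_input("a single query")
(True, ['a single query'])
>>> embedder.preprocess_input(["doc one", "doc two"])
(False, ['doc one', 'doc two'])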

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

TfidfEmbedder

class zvec_db.embedders.TfidfEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]

Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed as the product of:

  • Term Frequency (TF): How often a term appears in a document.

  • Inverse Document Frequency (IDF): A penalty factor for terms that appear in many documents.
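
For reference, scikit-learn's TfidfVectorizer (with its default smooth_idf=True) computes the IDF factor as roughly:

\[\text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1\]

where \(n\) is the number of documents in the corpus and \(\text{df}(t)\) is the number of documents containing term \(t\); each resulting row is then L2-normalized by default.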

This embedder supports custom tokenization and pre-tokenized inputs. All additional keyword arguments are passed through to the underlying TfidfVectorizer (e.g., min_df, max_df, ngram_range, sublinear_tf).

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **tfidf_params – Additional keyword arguments passed to TfidfVectorizer.

Example

>>> embedder = TfidfEmbedder(min_df=2, sublinear_tf=True)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Fit the TF-IDF vectorizer on a corpus of documents.

The corpus is pre-processed according to the embedder’s configuration:

  • Custom tokenizer: Each document is tokenized before vectorization.

  • Pre-tokenized mode: Documents are expected to be lists of tokens.

  • Default: Raw strings are passed directly to TfidfVectorizer.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or if the developer would like to specify a request value for that metadata different from the default None.

References

[1] PEP 487 – Simpler customisation of class creation: https://peps.python.org/pep-0487/

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

Dense Embedding

class zvec_db.embedders.SentenceTransformersEmbedder(model_name='all-MiniLM-L6-v2', device=None, max_length=512, normalize=True, trust_remote_code=False, model_kwargs=None)[source]

Dense embeddings using Sentence Transformers models locally.

This embedder uses pre-trained models from the sentence-transformers library to generate semantic embeddings. It supports hundreds of models available on HuggingFace.

Parameters:
  • model_name (str, optional) – Name of the model from HuggingFace, e.g. "all-MiniLM-L6-v2" (384 dims, fast), "all-mpnet-base-v2" (768 dims, best quality), or "BAAI/bge-small-en-v1.5" (384 dims, good quality). Defaults to "all-MiniLM-L6-v2".

  • device (Optional[str], optional) – Device to run model on. “cpu”, “cuda”, or None for auto-detect. Defaults to None.

  • max_length (Optional[int], optional) – Maximum sequence length. Defaults to 512.

  • normalize (bool, optional) – Normalize embeddings to unit length. Defaults to True for cosine similarity compatibility.

  • trust_remote_code (bool, optional) – Trust remote code in model. Defaults to False.

  • model_kwargs (Optional[Mapping[str, Any]], optional) – Additional keyword arguments passed to the SentenceTransformer constructor. Useful for options such as torch_dtype (model dtype: torch.float16, torch.bfloat16, "auto"), trust_remote_code (trust remote code from HuggingFace Hub), token (HuggingFace API token for private models), revision (model revision to load), cache_dir (custom cache directory), local_files_only (load only local files), and attn_implementation (attention implementation, e.g. "flash_attention_2"). Defaults to None (no additional kwargs).

Example

>>> # Standard embedding
>>> embedder = SentenceTransformersEmbedder(
...     model_name="all-MiniLM-L6-v2",
...     device="cpu"
... )
>>> embedder.fit(["document 1", "document 2"])
>>> vector = embedder.embed("search query")
>>> print(vector.shape)
(384,)
>>> # With model_kwargs for private models
>>> embedder = SentenceTransformersEmbedder(
...     model_name="org/private-model",
...     model_kwargs={"token": "hf_..."}
... )
>>> # With float16 for reduced memory
>>> import torch
>>> embedder = SentenceTransformersEmbedder(
...     model_name="all-MiniLM-L6-v2",
...     model_kwargs={"torch_dtype": torch.float16}
... )

Note

  • Requires the sentence-transformers package

  • Models are downloaded automatically on first use

  • GPU acceleration available if CUDA is installed

See also

OpenAIEmbedder: Dense embeddings via OpenAI-compatible API.

__init__(model_name='all-MiniLM-L6-v2', device=None, max_length=512, normalize=True, trust_remote_code=False, model_kwargs=None)[source]
Parameters:
  • model_name (str)

  • device (str | None)

  • max_length (int | None)

  • normalize (bool)

  • trust_remote_code (bool)

  • model_kwargs (Mapping[str, Any] | None)

property device: str | None

Device to run model on.

Type:

Optional[str]

property trust_remote_code: bool

Trust remote code in model.

Type:

bool

property model_kwargs: Mapping[str, Any]

Additional kwargs passed to the model.

Type:

Mapping[str, Any]

property embedding_dim: int

Dimension of the embedding vectors.

Type:

int

property is_fitted: bool

Whether the embedder has been fitted.

Type:

bool

fit(documents)[source]

Initialize the embedder by loading the model.

For Sentence Transformers, this loads the model. No training is performed as models are pre-trained.

Parameters:

documents (List[str]) – List of documents (used for initialization only).

Returns:

For method chaining.

Return type:

self

embed(input_text)[source]

Generate embeddings for text.

Parameters:

input_text (str | List[str]) – Single document or batch.

Returns:

Single numpy array or list for batch.

Raises:

RuntimeError – If model loading fails.

Return type:

ndarray | list[ndarray]

embed_batch(documents, batch_size=32, show_progress=False)[source]

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (List[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of embedding arrays, one per document.

Return type:

List[np.ndarray]

Example

>>> embedder = SentenceTransformersEmbedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = SentenceTransformersEmbedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | List[str]) – Single document or batch of documents.

Returns:

A single embedding array for a single input, or a list of arrays for a batch input.

Return type:

ndarray | List[ndarray]

load(path)

Load embedder configuration.

Parameters:

path (str) – Path to configuration file.

Return type:

None

save(path)

Save embedder configuration.

Dense models typically don’t need saving as they load pre-trained weights. This saves configuration only.

Parameters:

path (str) – Path to save configuration.

Return type:

None

transform(input_text)

Alias for embed() returning numpy array.

For single input, returns 2D array with shape (1, dim). For batch input, returns 2D array with shape (n, dim).

Parameters:

input_text (str | List[str]) – Single document or batch.

Returns:

2D numpy array of embeddings.

Return type:

ndarray
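
A short sketch; the shapes assume the default all-MiniLM-L6-v2 model, which produces 384-dimensional vectors:

>>> embedder = SentenceTransformersEmbedder().fit(["doc one", "doc two"])
>>> embedder.transform("single query").shape
(1, 384)
>>> embedder.transform(["doc one", "doc two"]).shape
(2, 384)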

class zvec_db.embedders.OpenAIEmbedder(model='text-embedding-3-small', base_url='https://api.openai.com/v1', api_key=None, dimensions=None, timeout=30.0, encoding_format='float', max_batch_size=None, truncate_prompt_tokens=None, query_prefix=None, passage_prefix=None, model_kwargs=None, model_name=None, max_retries=3, initial_delay=1.0, max_delay=60.0, exponential_base=2.0, jitter=0.1, retry_config=None)[source]

Dense embedder using OpenAI-compatible /embeddings endpoint.

This embedder uses the /v1/embeddings endpoint to compute dense vector representations of texts. It’s compatible with OpenAI’s embedding API format and supports batch processing.

Works with:

  • OpenAI API (text-embedding-3-small, text-embedding-3-large, etc.)

  • vLLM serving open-source embedding models

  • Any OpenAI-compatible API endpoint

Parameters:
  • model (str) – Model name to use. For OpenAI: "text-embedding-3-small", "text-embedding-3-large". For vLLM: the model name configured in vLLM.

  • base_url (str, optional) – API base URL. For OpenAI: "https://api.openai.com/v1". For a local vLLM server: "http://localhost:8000/v1". Defaults to "https://api.openai.com/v1".

  • api_key (Optional[str], optional) – API key for authentication. Defaults to None (reads from OPENAI_API_KEY env var).

  • dimensions (Optional[int], optional) – Output embedding dimensions. Only supported by some models (e.g., text-embedding-3-small). Defaults to None (use model default).

  • timeout (float, optional) – HTTP request timeout in seconds. Defaults to 30.0.

  • encoding_format (str, optional) – Encoding format for embeddings. “float” for float32 vectors, “base64” for base64-encoded. Defaults to “float”.

  • max_batch_size (Optional[int], optional) – Maximum number of texts to embed in a single batch. None means no limit. Defaults to None.

  • truncate_prompt_tokens (Optional[int], optional) – Maximum number of tokens for prompt truncation. When set, prompts exceeding this limit are truncated. By default, APIs reject prompts exceeding max_model_len unless this is set. Defaults to None (no truncation).

  • query_prefix (str, optional) – Prefix to add to query texts. Useful for asymmetric embedding models like E5, GTE, etc. Example: “query: “ for E5 models. Defaults to “” (no prefix).

  • passage_prefix (str, optional) – Prefix to add to passage/document texts. Useful for asymmetric embedding models like E5, GTE, etc. Example: “passage: “ for E5 models. Defaults to “” (no prefix).

  • model_kwargs (Optional[Mapping[str, Any]], optional) – Additional keyword arguments passed to the API request. Useful for options such as user (unique identifier for monitoring and abuse detection), extra_headers (additional HTTP headers), and extra_query_params (additional query parameters). Defaults to None (no additional kwargs).

  • model_name (str, optional) – Deprecated. Use model instead. This parameter is kept for backward compatibility. Defaults to None.

  • max_retries (int, optional) – Maximum number of retry attempts for transient failures. Set to 0 to disable retries. Defaults to 3.

  • initial_delay (float, optional) – Initial delay before first retry in seconds. Defaults to 1.0.

  • max_delay (float, optional) – Maximum delay cap in seconds. Defaults to 60.0.

  • exponential_base (float, optional) – Base for exponential backoff. Defaults to 2.0.

  • jitter (float, optional) – Random jitter factor (0.0-1.0) to avoid thundering herd. Defaults to 0.1.

  • retry_config (Optional[RetryConfig], optional) – Pre-configured retry settings. If provided, overrides individual retry parameters. Defaults to None.

Example

>>> # OpenAI API
>>> embedder = OpenAIEmbedder(
...     model="text-embedding-3-small",
...     api_key="sk-..."
... )
>>> vector = embedder.embed("search query")
>>> # vLLM local
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     api_key="not-needed",
...     model="BAAI/bge-m3"
... )
>>> vector = embedder.embed("search query")
>>> # With truncation to handle long prompts
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="embedding",
...     truncate_prompt_tokens=512
... )
>>> # With prefixes for asymmetric models (e.g., E5, GTE)
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     query_prefix="query: ",
...     passage_prefix="passage: "
... )
>>> query_vector = embedder.embed_query("What is machine learning?")
>>> doc_vector = embedder.embed_passage("ML is a subset of AI.")
>>> # With custom retry settings for production
>>> embedder = OpenAIEmbedder(
...     model="text-embedding-3-small",
...     max_retries=5,
...     initial_delay=2.0,
...     max_delay=120.0,
... )
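
For intuition, exponential backoff with jitter typically grows the wait between attempts roughly as sketched below; this is a generic illustration of how the retry parameters above interact, not necessarily the exact formula used by RetryConfig:

import random

def backoff_delay(attempt, initial_delay=1.0, exponential_base=2.0,
                  max_delay=60.0, jitter=0.1):
    # Illustrative delay (seconds) before retry number `attempt` (0-based).
    delay = min(initial_delay * exponential_base ** attempt, max_delay)
    # Add up to +/- `jitter` fraction of random noise to avoid a thundering herd.
    return delay * (1 + random.uniform(-jitter, jitter))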

See also

SentenceTransformersEmbedder: Local dense embeddings using HuggingFace models. RetryConfig: Configuration class for retry behavior.

__init__(model='text-embedding-3-small', base_url='https://api.openai.com/v1', api_key=None, dimensions=None, timeout=30.0, encoding_format='float', max_batch_size=None, truncate_prompt_tokens=None, query_prefix=None, passage_prefix=None, model_kwargs=None, model_name=None, max_retries=3, initial_delay=1.0, max_delay=60.0, exponential_base=2.0, jitter=0.1, retry_config=None)[source]
Parameters:
  • model (str)

  • base_url (str)

  • api_key (str | None)

  • dimensions (int | None)

  • timeout (float)

  • encoding_format (str)

  • max_batch_size (int | None)

  • truncate_prompt_tokens (int | None)

  • query_prefix (str | None)

  • passage_prefix (str | None)

  • model_kwargs (Mapping[str, Any] | None)

  • model_name (str | None)

  • max_retries (int)

  • initial_delay (float)

  • max_delay (float)

  • exponential_base (float)

  • jitter (float)

  • retry_config (RetryConfig | None)

property model_name: str

Model identifier (alias for model for backward compatibility).

Type:

str

property model: str

Model identifier (OpenAI API naming).

Type:

str

property base_url: str

Base URL for the API.

Type:

str

property api_key: str | None

API key for authentication.

Type:

Optional[str]

property dimensions: int | None

Output embedding dimensions.

Type:

Optional[int]

property timeout: float

HTTP request timeout in seconds.

Type:

float

property encoding_format: str

Encoding format for embeddings.

Type:

str

property max_batch_size: int | None

Maximum batch size for embedding.

Type:

Optional[int]

property truncate_prompt_tokens: int | None

Maximum number of tokens for prompt truncation.

Type:

Optional[int]

property query_prefix: str

Prefix added to query texts.

Type:

str

property passage_prefix: str

Prefix added to passage/document texts.

Type:

str

property model_kwargs: Mapping[str, Any]

Additional kwargs passed to the API.

Type:

Mapping[str, Any]

property embedding_dim: int

Dimension of embeddings (available after fit or first embed).

Type:

int

property is_fitted: bool

Whether the embedder has been fitted.

Type:

bool

fit(documents)[source]

Initialize the embedder.

For an API-based embedder, this is a no-op because the model is pre-trained. This method exists for API compatibility.

Parameters:

documents (List[str]) – List of documents (not used, for API compatibility).

Returns:

For method chaining.

Return type:

self

embed(input_text, prefix=None)[source]

Embed texts into dense vectors.

Parameters:
  • input_text (Union[str, List[str]]) – Single text or list of texts to embed.

  • prefix (Optional[str], optional) – Prefix to add to each text. Defaults to None (no prefix).

Returns:

  • If single text: np.ndarray of shape (embedding_dim,)

  • If multiple texts: List[np.ndarray], one array of shape (embedding_dim,) per text

Return type:

Union[np.ndarray, List[np.ndarray]]

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = SentenceTransformersEmbedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | List[str]) – Single document or batch of documents.

Returns:

A single embedding array for a single input, or a list of arrays for a batch input.

Return type:

ndarray | List[ndarray]

embed_query(query)[source]

Embed a query or list of queries with the query prefix.

Parameters:

query (Union[str, List[str]]) – Single query or list of queries to embed.

Returns:

  • If single query: np.ndarray of shape (embedding_dim,)

  • If multiple queries: List[np.ndarray], one array of shape (embedding_dim,) per query

Return type:

Union[np.ndarray, List[np.ndarray]]
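
A usage sketch for an asymmetric model (roughly equivalent to calling embed() with prefix=query_prefix); endpoint and model are illustrative:

>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     query_prefix="query: ",
... )
>>> vec = embedder.embed_query("What is machine learning?")
>>> vec.shape  # (embedding_dim,)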

load(path)

Load embedder configuration.

Parameters:

path (str) – Path to configuration file.

Return type:

None

save(path)

Save embedder configuration.

Dense models typically don’t need saving as they load pre-trained weights. This saves configuration only.

Parameters:

path (str) – Path to save configuration.

Return type:

None

transform(input_text)

Alias for embed() returning numpy array.

For single input, returns 2D array with shape (1, dim). For batch input, returns 2D array with shape (n, dim).

Parameters:

input_text (str | List[str]) – Single document or batch.

Returns:

2D numpy array of embeddings.

Return type:

ndarray

embed_passage(passage)[source]

Embed a passage/document or list of passages with the passage prefix.

Parameters:

passage (Union[str, List[str]]) – Single passage or list of passages to embed.

Returns:

  • If single passage: np.ndarray of shape (embedding_dim,)

  • If multiple passages: List[np.ndarray], one array of shape (embedding_dim,) per passage

Return type:

Union[np.ndarray, List[np.ndarray]]

embed_batch(documents, show_progress=False, prefix=None)[source]

Embed a batch of documents.

Parameters:
  • documents (List[str]) – List of documents to embed.

  • show_progress (bool, optional) – Show progress bar. Not used for API-based embedding. Defaults to False.

  • prefix (Optional[str], optional) – Prefix to add to each document. Defaults to None (no prefix).

Returns:

List of embedding vectors.

Return type:

List[np.ndarray]
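
A minimal sketch for bulk-indexing documents with a passage prefix against a local vLLM endpoint (endpoint and model are illustrative):

>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     max_batch_size=64,
... )
>>> vectors = embedder.embed_batch(corpus, prefix="passage: ")
>>> len(vectors) == len(corpus)
True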