zvec_db.embedders.base

Base classes and type definitions for sparse embedding models.

This module provides the abstract base class BaseSparseEmbedder, which defines a common interface for all sparse embedding models in this package. It handles tokenization, model persistence, and conversion to zvec-compatible formats.

Constants

DEFAULT_MAX_FEATURES: int

Default maximum number of features (non-zero elements) to retain per document. Set to 8192 (2^13), a power of two chosen for memory alignment, balancing vocabulary coverage against memory efficiency.
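
The cap can be tuned per instance; a minimal sketch (BM25Embedder is one of the concrete subclasses documented below):

>>> from zvec_db.embedders import BM25Embedder
>>> embedder = BM25Embedder(max_features=4096)  # keep at most 4096 non-zero features per document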

Module Attributes

SparseVector

A sparse vector represented as a dictionary mapping feature indices to values.

ExtendedList

A corpus that can be either raw strings or pre-tokenized lists.

StrExtendedList

Input text that can be a single document or a batch.

Classes

BaseSparseEmbedder([tokenizer, ...])

Abstract base class for sparse embedding models using scikit-learn.

zvec_db.embedders.base.SparseVector

A sparse vector represented as a dictionary mapping feature indices to values.

alias of dict[int, float]

zvec_db.embedders.base.ExtendedList = list[str] | list[list[str]]

A corpus that can be either raw strings or pre-tokenized lists.

zvec_db.embedders.base.StrExtendedList = str | list[str] | list[list[str]]

Input text that can be a single document or a batch.

class zvec_db.embedders.base.BaseSparseEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, cache_size=1024, preprocessing_config=None)[source]

Abstract base class for sparse embedding models using scikit-learn.

This class provides a unified interface for:

  • Training sparse embedding models (Count, BM25, TF-IDF)

  • Handling custom tokenization or pre-tokenized inputs

  • Converting scipy sparse matrices to zvec-compatible dictionaries

  • Saving and loading trained models

  • LRU caching for repeated embeddings (via LRUCacheMixin)

The class supports two mutually exclusive modes:

  1. Pre-tokenized mode (is_pretokenized=True): Input documents are already tokenized as lists of strings.

  2. Custom tokenizer mode (tokenizer=<callable>): A user-provided function tokenizes each string document before vectorization.

If neither is specified, raw strings are passed directly to the underlying scikit-learn vectorizer.
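
A minimal sketch of the two configurations (str.split stands in for a real tokenizer; BM25Embedder is the concrete subclass used in the examples below):

>>> from zvec_db.embedders import BM25Embedder
>>> # Pre-tokenized mode: documents are already lists of tokens
>>> embedder = BM25Embedder(is_pretokenized=True)
>>> embedder.fit([["the", "cat", "sat"], ["the", "dog", "ran"]])
>>> # Custom tokenizer mode: the callable splits each raw string
>>> embedder = BM25Embedder(tokenizer=str.split)
>>> embedder.fit(["the cat sat", "the dog ran"])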

Parameters:
  • tokenizer (Optional[Callable]) – A callable that takes a string and returns a list of tokens.

  • is_pretokenized (bool) – If True, input documents must be pre-tokenized as lists of strings.

  • max_features (Optional[int]) – Maximum number of features (non-zero elements) to retain per document. Defaults to 8192.

  • cache_size (int) – Maximum number of entries in the LRU embedding cache (see LRUCacheMixin). Defaults to 1024.

  • preprocessing_config (NormalizationConfig | None) – Optional normalization configuration applied to text before tokenization. Defaults to None.

Raises:

ValueError – If both tokenizer and is_pretokenized=True are set.

Example

>>> embedder = MyEmbedder(tokenizer=my_tokenize_fn)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, cache_size=1024, preprocessing_config=None)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • cache_size (int)

  • preprocessing_config (NormalizationConfig | None)

model: csr_matrix | None
preprocess(text)[source]

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
abstractmethod fit(corpus, y=None)[source]

Train the sparse embedding model on a corpus.

Parameters:
  • corpus (ExtendedList) – Training documents (strings or token lists depending on configuration).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder instance.

Return type:

self
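
Since fit() is abstract, each concrete embedder supplies its own training logic. A minimal sketch of a hypothetical subclass (the CountVectorizer choice and the vectorizer attribute are illustrative assumptions, not the package's actual implementation):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> class CountEmbedder(BaseSparseEmbedder):
...     def fit(self, corpus, y=None):
...         # assuming raw-string input: normalize the input shape,
...         # then fit a vectorizer that transform() can reuse
...         _, docs = self.preprocess_input(corpus)
...         self.vectorizer = CountVectorizer()
...         self.vectorizer.fit(docs)
...         return self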

__call__(input_text)[source]

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

>>> embedder = BM25Embedder()
>>> embedder.fit(documents)
>>> vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

preprocess_input(input_text)[source]

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

tuple[bool, list]

Raises:

ValueError – If input format doesn’t match the configuration.
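
A hedged illustration of the returned tuple in the default configuration (the exact output shapes follow the description above and are illustrative):

>>> embedder = BM25Embedder()
>>> embedder.preprocess_input("single doc")
(True, ['single doc'])
>>> embedder.preprocess_input(["doc one", "doc two"])
(False, ['doc one', 'doc two'])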

fit_transform(X, y=None)[source]

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix
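
For example (documents stands in for any training corpus):

>>> embedder = BM25Embedder()
>>> matrix = embedder.fit_transform(documents)
>>> matrix.shape  # (n_docs, n_features)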

transform(input_text)[source]

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.
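
For example:

>>> embedder = BM25Embedder().fit(documents)
>>> matrix = embedder.transform(["query text"])
>>> matrix.shape  # (1, n_features)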

embed(input_text)[source]

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)[source]

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

save(path)[source]

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)[source]

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BaseSparseEmbedder)

Returns:

self – The updated object.

Return type:

object
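
A hedged sketch of enabling routing and requesting the metadata (set_transform_request below works the same way for transform):

>>> from sklearn import set_config
>>> set_config(enable_metadata_routing=True)
>>> # ask meta-estimators to forward `corpus` to this embedder's fit()
>>> embedder = BM25Embedder().set_fit_request(corpus=True)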

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BaseSparseEmbedder)

Returns:

self – The updated object.

Return type:

object

load(path)[source]

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
from_pretrained(path)[source]

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None