zvec_db.embedders.base
Base classes and type definitions for sparse embedding models.
This module provides the abstract base class BaseSparseEmbedder which
defines a common interface for all sparse embedding models in this package.
It handles tokenization, model persistence, and conversion to zvec-compatible
formats.
Constants
- DEFAULT_MAX_FEATURES : int
Default maximum number of features (non-zero elements) to retain per document. Set to 8192 (2^13), a power of two chosen for memory alignment, balancing vocabulary coverage against memory efficiency.
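The per-document truncation that max_features implies can be sketched in plain Python. This is an illustrative helper, not the package's internal implementation; the name truncate_top_k is hypothetical, and the tie-breaking/ranking rule (largest absolute value) is an assumption:

```python
def truncate_top_k(vector: dict[int, float], max_features: int = 8192) -> dict[int, float]:
    """Keep only the max_features entries with the largest absolute values."""
    if len(vector) <= max_features:
        return vector
    # Rank features by magnitude and keep the strongest ones.
    top = sorted(vector.items(), key=lambda kv: abs(kv[1]), reverse=True)[:max_features]
    return dict(top)

# A vector with 3 non-zero features truncated to its 2 strongest ones.
vec = {4: 0.1, 17: 0.9, 42: 0.5}
print(truncate_top_k(vec, max_features=2))  # {17: 0.9, 42: 0.5}
```

A power-of-two cap like 8192 also keeps the retained index/value arrays nicely sized for downstream storage.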
Module Attributes
- SparseVector
A sparse vector represented as a dictionary mapping feature indices to values.
- ExtendedList
A corpus that can be either raw strings or pre-tokenized lists.
- StrExtendedList
Input text that can be a single document or a batch.
Classes
- BaseSparseEmbedder
Abstract base class for sparse embedding models using scikit-learn.
- zvec_db.embedders.base.SparseVector
A sparse vector represented as a dictionary mapping feature indices to values.
- zvec_db.embedders.base.ExtendedList = list[str] | list[list[str]]
A corpus that can be either raw strings or pre-tokenized lists.
- zvec_db.embedders.base.StrExtendedList = str | list[str] | list[list[str]]
Input text that can be a single document or a batch.
- class zvec_db.embedders.base.BaseSparseEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, cache_size=1024, preprocessing_config=None)[source]
Abstract base class for sparse embedding models using scikit-learn.
This class provides a unified interface for:
- Training sparse embedding models (Count, BM25, TF-IDF)
- Handling custom tokenization or pre-tokenized inputs
- Converting scipy sparse matrices to zvec-compatible dictionaries
- Saving and loading trained models
- LRU caching for repeated embeddings (via LRUCacheMixin)
The class supports two mutually exclusive modes:
- Pre-tokenized mode (is_pretokenized=True): Input documents are already tokenized as lists of strings.
- Custom tokenizer mode (tokenizer=<callable>): A user-provided function tokenizes each string document before vectorization.
If neither is specified, raw strings are passed directly to the underlying scikit-learn vectorizer.
- Parameters:
tokenizer (Optional[Callable]) – A callable that takes a string and returns a list of tokens.
is_pretokenized (bool) – If True, input documents must be pre-tokenized as lists of strings.
max_features (Optional[int]) – Maximum number of features (non-zero elements) to retain per document. Defaults to 8192.
cache_size (int)
preprocessing_config (NormalizationConfig | None)
- Raises:
ValueError – If both tokenizer and is_pretokenized=True are set.
Example
>>> embedder = MyEmbedder(tokenizer=my_tokenize_fn)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
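The mutual-exclusion rule described above can be sketched as a standalone check. The function check_modes is hypothetical and only shows the documented ValueError condition; the real constructor does more:

```python
def check_modes(tokenizer, is_pretokenized: bool) -> None:
    # The two modes are mutually exclusive: a tokenizer is pointless
    # if the input is already tokenized.
    if tokenizer is not None and is_pretokenized:
        raise ValueError("Cannot set both `tokenizer` and `is_pretokenized=True`.")

check_modes(tokenizer=None, is_pretokenized=True)        # pre-tokenized mode: ok
check_modes(tokenizer=str.split, is_pretokenized=False)  # custom tokenizer mode: ok
try:
    check_modes(tokenizer=str.split, is_pretokenized=True)
except ValueError as e:
    print(e)
```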
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, cache_size=1024, preprocessing_config=None)[source]
- model: csr_matrix | None
- preprocess(text)[source]
Apply preprocessing to a text (public API).
This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.
- Parameters:
text (str) – Raw text to preprocess.
- Returns:
Preprocessed text (str), or a list of tokens (list) if an HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.
- Return type:
str | list[str]
Example
>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess(" CHAT MANGEAIT ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
- abstractmethod fit(corpus, y=None)[source]
Train the sparse embedding model on a corpus.
- Parameters:
corpus (ExtendedList) – Training documents (strings or token lists depending on configuration).
y – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder instance.
- Return type:
self
- __call__(input_text)[source]
Call shortcut that delegates to embed().
This allows the embedder to be called like a function:
embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
- preprocess_input(input_text)[source]
Determine if input is a single document or batch, and apply tokenization.
This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.
The method handles three configurations:
- Pre-tokenized mode: Validates and wraps token lists.
- Custom tokenizer: Applies the tokenizer to string inputs.
- Default: Wraps strings without modification.
- Parameters:
input_text (StrExtendedList) –
Input to process. Format depends on configuration:
- If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)
- If tokenizer is set: str (single) or list[str] (batch)
- Default: str (single) or list[str] (batch)
- Returns:
A tuple containing:
- is_single (bool): True if input was a single document.
- processed_list (list): Data wrapped as a list for the model.
- Return type:
tuple[bool, list]
- Raises:
ValueError – If input format doesn’t match the configuration.
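The normalization above can be sketched in a simplified form. This is not the package's preprocess_input; it is a hedged stand-in (the name normalize_input is hypothetical) assuming the three configurations behave exactly as documented:

```python
def normalize_input(input_text, tokenizer=None, is_pretokenized=False):
    """Return (is_single, processed_list), mirroring the documented contract."""
    if is_pretokenized:
        # Single doc: list[str]; batch: list[list[str]].
        if input_text and isinstance(input_text[0], list):
            return False, input_text
        return True, [input_text]
    if isinstance(input_text, str):
        # Single string document, tokenized if a tokenizer is configured.
        doc = tokenizer(input_text) if tokenizer else input_text
        return True, [doc]
    # Batch of raw strings.
    docs = [tokenizer(t) for t in input_text] if tokenizer else list(input_text)
    return False, docs

print(normalize_input("hello world", tokenizer=str.split))
# (True, [['hello', 'world']])
print(normalize_input(["a b", "c"], tokenizer=str.split))
# (False, [['a', 'b'], ['c']])
```

Returning the is_single flag alongside the list is what lets embed() hand back a single dict for a single input and a list of dicts for a batch.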
- fit_transform(X, y=None)[source]
Fit the model and transform the data in one step.
This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.
- Parameters:
X – Training corpus (strings or token lists).
y – Ignored; present for scikit-learn compatibility.
- Returns:
Sparse matrix of fitted and transformed data.
- Return type:
csr_matrix
- transform(input_text)[source]
Transform input text into a sparse feature matrix.
This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.
Note
The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.
- Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
- Returns:
Sparse feature matrix with shape (n_docs, n_features).
- Return type:
csr_matrix
- Raises:
RuntimeError – If the model has not been fitted or loaded.
- embed(input_text)[source]
Embed text into sparse vectors as dictionaries.
This is the primary user-facing method for generating embeddings. Unlike transform(), which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.
The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.
Note
The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.
- Parameters:
input_text (StrExtendedList) – Single document or batch of documents.
- Returns:
- Single document: dict[int, float] mapping feature indices to values.
- Batch: list[dict[int, float]] with one dictionary per document.
- Return type:
dict[int, float] | list[dict[int, float]]
- Raises:
RuntimeError – If the model has not been fitted or loaded.
Example
>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
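The csr_matrix-to-dictionary conversion behind embed() can be illustrated without scipy, using the raw CSR arrays (data, indices, indptr). The helper name csr_to_dicts is hypothetical; this sketch only shows the conversion idea, not the package's actual code path:

```python
def csr_to_dicts(data, indices, indptr):
    """Convert CSR arrays into one {feature_index: value} dict per row."""
    out = []
    for row in range(len(indptr) - 1):
        # indptr delimits each row's slice of the data/indices arrays.
        start, end = indptr[row], indptr[row + 1]
        out.append(dict(zip(indices[start:end], data[start:end])))
    return out

# Two documents: row 0 has features 1 and 3, row 1 has feature 2.
data = [0.5, 1.2, 0.8]
indices = [1, 3, 2]
indptr = [0, 2, 3]
print(csr_to_dicts(data, indices, indptr))  # [{1: 0.5, 3: 1.2}, {2: 0.8}]
```

With a real scipy.sparse.csr_matrix m, the same arrays are available as m.data, m.indices, and m.indptr.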
- embed_batch(documents, batch_size=32, show_progress=False)[source]
Embed a large batch of documents with optional progress bar.
This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.
- Parameters:
documents (ExtendedList) – Documents to embed.
batch_size (int) – Number of documents per batch. Defaults to 32.
show_progress (bool) – If True, display a progress bar. Defaults to False.
- Returns:
List of sparse vectors, one per document.
- Return type:
list[dict[int, float]]
Example
>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )
Note
For single documents or small batches, use embed() instead, which includes caching for repeated inputs.
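The batching described above amounts to slicing the corpus into fixed-size chunks and collecting the per-batch results. A minimal sketch, where embed_batch_fn stands in for the real per-batch embedding call:

```python
def embed_in_batches(documents, embed_batch_fn, batch_size=32):
    """Embed documents batch by batch and collect the results in order."""
    vectors = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        vectors.extend(embed_batch_fn(batch))
    return vectors

# Stand-in embedder: maps each doc to its length, processed in batches of 2.
docs = ["a", "bb", "ccc", "dddd", "eeeee"]
result = embed_in_batches(docs, lambda batch: [len(d) for d in batch], batch_size=2)
print(result)  # [1, 2, 3, 4, 5]
```

A progress bar (e.g. tqdm over the range of batch starts) slots naturally into this loop, which is presumably where show_progress hooks in.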
- save(path)[source]
Serialize the model and tokenizer to disk.
The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.
- Parameters:
path (str) – File path where the model will be saved.
- Returns:
The path where the model was saved (same as input).
- Return type:
str
Example
>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
- save_pretrained(path)[source]
Alias for save().
This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).
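The doc specifies joblib for persistence; since the exact serialized state is internal, here is a hedged stand-in using stdlib pickle to show the save/load round-trip pattern (the function names and the toy state dict are hypothetical):

```python
import os
import pickle
import tempfile

def save_state(state: dict, path: str) -> str:
    """Serialize state to path; like save(), return the path written."""
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

def load_state(path: str) -> dict:
    """Restore previously saved state from path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip a toy model state.
state = {"is_pretokenized": False, "max_features": 8192}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
assert save_state(state, path) == path
print(load_state(path))  # {'is_pretokenized': False, 'max_features': 8192}
```

joblib follows the same open/dump/load shape but handles large numpy arrays inside fitted scikit-learn objects more efficiently than plain pickle.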
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
self (BaseSparseEmbedder)
- Returns:
self – The updated object.
- Return type:
BaseSparseEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
self (BaseSparseEmbedder)
- Returns:
self – The updated object.
- Return type:
BaseSparseEmbedder
- load(path)[source]
Load a serialized model and tokenizer from disk.
This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.
- Parameters:
path (str) – File path to the serialized model.
- Returns:
None
- Return type:
None
Example
>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")