Sparse and Dense Embedding

Overview

The zvec_db.embedders sub-package provides sparse and dense embedding models for text vectorization.

Sparse Embedders:

All sparse embedders return dictionaries {index: score, ...} compatible with zvec’s SPARSE_FP32 format.

Embedder | When to use
CountEmbedder | Baseline, documents of similar length
BM25Embedder | General use, good IR performance
BM25LEmbedder | Documents with very variable lengths
BM25PlusEmbedder | Many rare terms, need recall
DisMaxEmbedder | Multi-field, match any field
TfidfEmbedder | Relative term importance in corpus

Dense Embedders:

Embedder | When to use
SentenceTransformersEmbedder | Local models (e.g., all-MiniLM-L6-v2)
OpenAIEmbedder | OpenAI API or compatible endpoints (vLLM)
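
The sketch below fits a sparse embedder on a tiny corpus and embeds a query; the corpus and the scores shown in the comment are illustrative only.

>>> from zvec_db.embedders import BM25Embedder
>>> corpus = ["the cat sat on the mat", "dogs chase cats"]
>>> embedder = BM25Embedder().fit(corpus)
>>> vector = embedder.embed("cat on a mat")
>>> vector  # e.g. {2: 0.61, 5: 0.47, ...}, a dict compatible with zvec's SPARSE_FP32 format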

CountEmbedder

class zvec_db.embedders.CountEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]

Count-based sparse embedder wrapping scikit-learn’s CountVectorizer.

This embedder converts text documents into sparse vectors based on term frequencies (raw counts of each token). It is the simplest sparse embedding method and serves as a foundation for more advanced techniques like BM25 and TF-IDF.

The embedder accepts raw strings or pre-tokenized input. Any keyword arguments are forwarded to the underlying CountVectorizer after being normalized by BaseSparseEmbedder._prepare_vectorizer_params().

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = CountEmbedder(min_df=2, ngram_range=(1, 2))
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the embedder on a corpus of documents.

The supplied corpus is normalised according to the instance configuration:

  • is_pretokenized=True - the caller must provide lists of tokens.

  • tokenizer=... - each string in the corpus will be passed through the tokenizer before vectorisation.

  • neither set - raw strings are passed to CountVectorizer directly.

_prepare_corpus handles the validation and transformation logic.

Parameters:

corpus (list[str] | list[list[str]]) – Sequence of documents (strings or token lists depending on configuration).

Returns:

self to allow chaining.
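
Because fit() returns the embedder itself, calls can be chained; a minimal sketch:

>>> documents = ["first document", "second document"]
>>> vector = CountEmbedder().fit(documents).embed("first document")
>>> vector  # {feature_index: count, ...}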

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics
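
Example

A minimal sketch; it assumes the returned dictionary uses exactly the keys listed above.

>>> embedder = CountEmbedder().fit(documents)
>>> _ = embedder.embed("query text")
>>> info = embedder.cache_info()
>>> sorted(info)
['max_size', 'size', 'utilization']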

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix
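
Example

A minimal sketch; the shape values in the comment are illustrative.

>>> matrix = CountEmbedder().fit_transform(documents)
>>> matrix.shape  # (n_docs, n_features), e.g. (2, 4) for a two-document corpus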

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None
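
Example

Since save_pretrained() and from_pretrained() are aliases for save() and load(), a round trip looks like the sketch below; the file path is illustrative.

>>> embedder.save_pretrained("models/count_model.joblib")
>>> restored = CountEmbedder()
>>> restored.from_pretrained("models/count_model.joblib")
>>> restored.embed("query text")  # ready to use without refitting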

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.
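
Example

A sketch of the default configuration (no tokenizer, not pre-tokenized); the exact return values follow the description above and are an assumption.

>>> embedder = CountEmbedder()
>>> embedder.preprocess_input("a single document")
(True, ['a single document'])
>>> embedder.preprocess_input(["doc one", "doc two"])
(False, ['doc one', 'doc two'])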

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.
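
Example

A minimal sketch; the number of features depends on the fitted vocabulary.

>>> embedder = CountEmbedder().fit(documents)
>>> matrix = embedder.transform(["query text", "another query"])
>>> matrix.shape  # (2, n_features) as a scipy csr_matrix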

model: csr_matrix | None
cache_size: int

BM25Embedder

class zvec_db.embedders.BM25Embedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25 scoring formula.

This class wires together a CountVectorizer with a lightweight BM25Transformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function.

  • is_pretokenized (bool) – If True, input documents must be lists of tokens.

  • max_features (Optional[int]) – Maximum number of features to retain.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2.

  • b (float) – Length normalization parameter. Defaults to 0.75.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional parameters for CountVectorizer.
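
Example

A usage sketch in the style of the other embedders; the parameter values are illustrative.

>>> embedder = BM25Embedder(k1=1.5, b=0.8, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])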

__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25 pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25Transformer: Applies BM25 weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25Embedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

BM25LEmbedder

class zvec_db.embedders.BM25LEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25L scoring formula.

BM25L is a variant of BM25 that uses linear length normalization, making it more suitable for corpora with highly variable document lengths.

This class wires together a CountVectorizer with a BM25LTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25LEmbedder(k1=1.5, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25L pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25LTransformer: Applies BM25L weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25LEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

BM25PlusEmbedder

class zvec_db.embedders.BM25PlusEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the BM25+ scoring formula.

BM25+ extends BM25 by adding a smoothing parameter (delta) that prevents zero scores for terms with zero term frequency. This can improve retrieval performance, especially for corpora with many rare terms.

This class wires together a CountVectorizer with a BM25PlusTransformer. Tokenisation behaviour is controlled by the two parameters inherited from BaseSparseEmbedder:

  • is_pretokenized tells the embedder to expect lists of tokens as input and avoids any preprocessing altogether.

  • tokenizer allows the client to supply a callable that will be executed on every raw text document before vectorisation. When a tokenizer is used the data passed to the scikit-learn pipeline consists of token lists as well; the vectorizer is therefore configured to act as an identity transformer.

The two options are mutually exclusive and validated by the base class.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • delta (float) – Smoothing parameter. Defaults to 0.5. Typical range: 0.4-1.0. Higher values increase the baseline score.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = BM25PlusEmbedder(k1=1.5, b=0.8, delta=0.6, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, delta=0.5, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • delta (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the BM25+ pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. BM25PlusTransformer: Applies BM25+ weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 to set the set_{method}_request methods. It looks for the default request values, which are set using __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer wants to specify a request value for that metadata different from the default None.

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (BM25PlusEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.
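
A minimal usage sketch (assuming documents is a list of training strings):

>>> embedder = BM25PlusEmbedder().fit(documents)
>>> matrix = embedder.transform(["query text"])
>>> matrix.shape  # (1, n_features), a scipy csr_matrix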

model: csr_matrix | None
cache_size: int

DisMaxEmbedder

class zvec_db.embedders.DisMaxEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]

Sparse embedder implementing the DisMax scoring formula.

DisMax (Disjunctive Maximum) takes the maximum score across multiple terms or fields, rather than summing them. This is useful when you want documents that match at least one term well, rather than documents that match all terms moderately.

The DisMax score formula is:

\[\text{DisMax}(d) = \max_{t \in T} \text{score}_t(d) + \lambda \times \sum_{t' \in T \setminus \{t^{*}\}} \text{score}_{t'}(d)\]

where \(\lambda\) is the tie_breaker parameter and \(t^{*}\) is the term with the maximum score.
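
For intuition, a small numeric sketch of the formula with made-up per-term scores (illustrative values only, not produced by the embedder):

scores = [3.0, 1.0, 2.0]              # score_t(d) for each query term t
tie_breaker = 0.1

best = max(scores)                     # 3.0, the dominant term
rest = sum(scores) - best              # 1.0 + 2.0 = 3.0
dismax = best + tie_breaker * rest     # 3.0 + 0.1 * 3.0 = 3.3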

This embedder is particularly useful for:

  • Multi-field search (title, content, tags) where matching any field well should rank highly.

  • Disjunctive queries where documents matching any query term should be retrieved.

  • Avoiding score inflation from documents matching many terms weakly.

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • k1 (float) – Term frequency saturation parameter. Defaults to 1.2. Typical range: 1.2-2.0. Higher values mean slower saturation.

  • b (float) – Length normalization parameter. Defaults to 0.75. Typical range: 0.5-1.0. b=1.0 means full length normalization.

  • tie_breaker (float) – Tie breaker parameter. Defaults to 0.0. 0.0 = pure maximum, 1.0 = sum all scores.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> embedder = DisMaxEmbedder(k1=1.5, tie_breaker=0.1, min_df=2)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, k1=1.2, b=0.75, tie_breaker=0.0, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • k1 (float)

  • b (float)

  • tie_breaker (float)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the DisMax pipeline on a corpus of documents.

This method builds a scikit-learn pipeline consisting of:

  1. CountVectorizer: Tokenizes documents and builds term counts.

  2. DisMaxTransformer: Applies DisMax weighting to the count matrix.

The corpus is pre-processed according to the embedder’s configuration (custom tokenizer or pre-tokenized mode) before being passed to the pipeline.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

embed(input_text)[source]

Embed text into sparse vectors with DisMax scores.

Unlike other embedders that return a vector with multiple non-zero entries, DisMaxEmbedder returns a single score per document (the maximum term score).

Parameters:

input_text (str | List[str] | List[List[str]]) – Single document or batch of documents.

Returns:

For each document, returns a dictionary with a single entry {0: dismax_score} representing the maximum term score.

Return type:

Union[SparseVector, List[SparseVector]]
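
A short usage sketch (the score shown is illustrative):

>>> embedder = DisMaxEmbedder(tie_breaker=0.1)
>>> embedder.fit(documents)
>>> embedder.embed("query text")  # e.g. {0: 2.37} (one max-term score per document)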

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or if the developer would like to specify a request value for that metadata different from the default None.

References

[1] PEP 487 – Simpler customisation of class creation: https://peps.python.org/pep-0487/

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics
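
A quick usage sketch (numbers are illustrative):

>>> embedder.embed("query text")  # populate the cache
>>> embedder.cache_info()  # e.g. {'size': 1, 'max_size': 128, 'utilization': 0.78}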

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix
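
A minimal sketch:

>>> embedder = DisMaxEmbedder(min_df=2)
>>> matrix = embedder.fit_transform(documents)
>>> matrix.shape  # (n_docs, n_features), a scipy csr_matrix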

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.
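
A short sketch of the returned tuple, assuming the default configuration (no custom tokenizer, not pre-tokenized); outputs reflect the behavior described above:

>>> embedder = BM25Embedder()
>>> embedder.preprocess_input("a single query")
(True, ['a single query'])
>>> embedder.preprocess_input(["doc one", "doc two"])
(False, ['doc one', 'doc two'])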

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (DisMaxEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

TfidfEmbedder

class zvec_db.embedders.TfidfEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]

Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed as the product of:

  • Term Frequency (TF): How often a term appears in a document.

  • Inverse Document Frequency (IDF): A penalty factor for terms that appear in many documents.
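
For reference, scikit-learn's TfidfVectorizer (with its default smooth_idf=True) computes the IDF factor as roughly:

\[\text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1\]

where \(n\) is the number of documents in the corpus and \(\text{df}(t)\) is the number of documents containing term \(t\); each resulting row is then L2-normalized by default.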

This embedder supports custom tokenization and pre-tokenized inputs. All additional keyword arguments are passed through to the underlying TfidfVectorizer (e.g., min_df, max_df, ngram_range, sublinear_tf).

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **tfidf_params – Additional keyword arguments passed to TfidfVectorizer.

Example

>>> embedder = TfidfEmbedder(min_df=2, sublinear_tf=True)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Fit the TF-IDF vectorizer on a corpus of documents.

The corpus is pre-processed according to the embedder’s configuration:

  • Custom tokenizer: Each document is tokenized before vectorization.

  • Pre-tokenized mode: Documents are expected to be lists of tokens.

  • Default: Raw strings are passed directly to TfidfVectorizer.

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = BM25Embedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | list[str] | list[list[str]]) – Single document or batch of documents.

Returns:

Sparse vector(s) as dictionaries.

Return type:

dict[int, float] | list[dict[int, float]]

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP 487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or if the developer would like to specify a request value for that metadata different from the default None.

References

[1] PEP 487 – Simpler customisation of class creation: https://peps.python.org/pep-0487/

cache_info()

Get cache statistics.

Returns:

  • size: Current number of cached items

  • max_size: Maximum cache capacity

  • utilization: Current utilization as percentage (0-100)

Return type:

Dictionary with cache statistics

clear_cache()

Clear the embedding cache.

This method removes all cached entries, freeing memory. Useful when you want to force recomputation of all embeddings.

Note

This method is thread-safe.

Return type:

None

embed(input_text)

Embed text into sparse vectors as dictionaries.

This is the primary user-facing method for generating embeddings. Unlike transform() which returns a scipy sparse matrix, this method returns zvec-compatible dictionaries mapping {feature_index: value}.

The method automatically handles both single documents and batches, returning a single dictionary for a single input or a list of dictionaries for batch input.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

  • Single document: dict[int, float] mapping feature indices to values.

  • Batch: list[dict[int, float]] with one dictionary per document.

Return type:

SparseVector | list[SparseVector]

Raises:

RuntimeError – If the model has not been fitted or loaded.

Example

>>> embedder = BM25Embedder().fit(documents)
>>> vector = embedder.embed("search query")
>>> vector  # {42: 0.523, 108: 0.312, ...}
embed_batch(documents, batch_size=32, show_progress=False)

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (list[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of sparse vectors, one per document.

Return type:

list[SparseVector]

Example

>>> embedder = BM25Embedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead, which includes caching for repeated inputs.

fit_transform(X, y=None)

Fit the model and transform the data in one step.

This is a convenience method that calls fit() followed by transform(). It is useful for training and obtaining embeddings without storing intermediate results.

Parameters:
  • X – Training corpus (strings or token lists).

  • y – Ignored; present for scikit-learn compatibility.

Returns:

Sparse matrix of fitted and transformed data.

Return type:

csr_matrix

from_pretrained(path)

Alias for load().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

load(path)

Load a serialized model and tokenizer from disk.

This method restores the model state from a file previously saved with save() or save_pretrained(). The preprocessing configuration and other settings (is_pretokenized, max_features) are also restored.

Parameters:

path (str) – File path to the serialized model.

Returns:

None

Return type:

None

Example

>>> embedder = BM25Embedder()
>>> embedder.load("models/bm25_model.joblib")
preprocess(text)

Apply preprocessing to a text (public API).

This method applies the preprocessing configuration to a single text. It is useful for preprocessing queries or documents before embedding.

Parameters:

text (str) – Raw text to preprocess.

Returns:

Preprocessed text (str) or list of tokens (list) if HF tokenizer is configured. If no preprocessing_config is set, returns the original text unchanged.

Return type:

str | list

Example

>>> from zvec_db.embedders import BM25Embedder
>>> from zvec_db.preprocessing import NormalizationConfig
>>> config = NormalizationConfig.aggressive(language="french")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("  CHAT MANGEAIT  ")
'chat mang'
>>> config = NormalizationConfig.with_hf_tokenizer("gbert-base")
>>> embedder = BM25Embedder(preprocessing_config=config)
>>> embedder.preprocess("Le chat mange")
['le', 'chat', 'man', '##ge']
preprocess_input(input_text)

Determine if input is a single document or batch, and apply tokenization.

This method normalizes all input types into a list format expected by scikit-learn models, while preserving information about the original input structure to restore the correct return type.

The method handles three configurations:

  1. Pre-tokenized mode: Validates and wraps token lists.

  2. Custom tokenizer: Applies the tokenizer to string inputs.

  3. Default: Wraps strings without modification.

Parameters:

input_text (StrExtendedList) –

Input to process. Format depends on configuration:

  • If is_pretokenized=True: list[str] (single) or list[list[str]] (batch)

  • If tokenizer is set: str (single) or list[str] (batch)

  • Default: str (single) or list[str] (batch)

Returns:

A tuple containing:

  • is_single (bool): True if input was a single document.

  • processed_list (list): Data wrapped as a list for the model.

Return type:

Tuple[bool, str | list]

Raises:

ValueError – If input format doesn’t match the configuration.

save(path)

Serialize the model and tokenizer to disk.

The model is saved using joblib, which efficiently handles the scikit-learn pipeline and any fitted parameters.

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved (same as input).

Return type:

str

Example

>>> embedder.fit(documents)
>>> embedder.save("models/bm25_model.joblib")
save_pretrained(path)

Alias for save().

This method is provided for compatibility with common naming conventions in NLP libraries (e.g., Hugging Face Transformers).

Parameters:

path (str) – File path where the model will be saved.

Returns:

The path where the model was saved.

Return type:

str

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

transform(input_text)

Transform input text into a sparse feature matrix.

This method follows the standard scikit-learn transformer API. It automatically handles tokenization based on the embedder’s configuration before passing data to the fitted model.

Note

The model must be fitted (via fit() or fit_transform()) or loaded before calling this method.

Parameters:

input_text (StrExtendedList) – Single document or batch of documents.

Returns:

Sparse feature matrix with shape (n_docs, n_features).

Return type:

csr_matrix

Raises:

RuntimeError – If the model has not been fitted or loaded.

model: csr_matrix | None
cache_size: int

Dense Embedding

class zvec_db.embedders.SentenceTransformersEmbedder(model_name='all-MiniLM-L6-v2', device=None, max_length=512, normalize=True, trust_remote_code=False, model_kwargs=None)[source]

Dense embeddings using Sentence Transformers models locally.

This embedder uses pre-trained models from the sentence-transformers library to generate semantic embeddings. It supports hundreds of models available on HuggingFace.

Parameters:
  • model_name (str, optional) – Name of the model from HuggingFace, e.g. "all-MiniLM-L6-v2" (384 dims, fast), "all-mpnet-base-v2" (768 dims, best quality), or "BAAI/bge-small-en-v1.5" (384 dims, good quality). Defaults to "all-MiniLM-L6-v2".

  • device (Optional[str], optional) – Device to run model on. “cpu”, “cuda”, or None for auto-detect. Defaults to None.

  • max_length (Optional[int], optional) – Maximum sequence length. Defaults to 512.

  • normalize (bool, optional) – Normalize embeddings to unit length. Defaults to True for cosine similarity compatibility.

  • trust_remote_code (bool, optional) – Trust remote code in model. Defaults to False.

  • model_kwargs (Optional[Mapping[str, Any]], optional) – Additional keyword arguments passed to the SentenceTransformer constructor. Useful for options such as torch_dtype (model dtype: torch.float16, torch.bfloat16, "auto"), trust_remote_code (trust remote code from HuggingFace Hub), token (HuggingFace API token for private models), revision (model revision to load), cache_dir (custom cache directory), local_files_only (load only local files), and attn_implementation (attention implementation, e.g. "flash_attention_2"). Defaults to None (no additional kwargs).

Example

>>> # Standard embedding
>>> embedder = SentenceTransformersEmbedder(
...     model_name="all-MiniLM-L6-v2",
...     device="cpu"
... )
>>> embedder.fit(["document 1", "document 2"])
>>> vector = embedder.embed("search query")
>>> print(vector.shape)
(384,)
>>> # With model_kwargs for private models
>>> embedder = SentenceTransformersEmbedder(
...     model_name="org/private-model",
...     model_kwargs={"token": "hf_..."}
... )
>>> # With float16 for reduced memory
>>> import torch
>>> embedder = SentenceTransformersEmbedder(
...     model_name="all-MiniLM-L6-v2",
...     model_kwargs={"torch_dtype": torch.float16}
... )

Note

  • Requires the sentence-transformers package

  • Models are downloaded automatically on first use

  • GPU acceleration available if CUDA is installed

See also

OpenAIEmbedder: Dense embeddings via OpenAI-compatible API.

__init__(model_name='all-MiniLM-L6-v2', device=None, max_length=512, normalize=True, trust_remote_code=False, model_kwargs=None)[source]
Parameters:
  • model_name (str)

  • device (str | None)

  • max_length (int | None)

  • normalize (bool)

  • trust_remote_code (bool)

  • model_kwargs (Mapping[str, Any] | None)

property device: str | None

Device to run model on.

Type:

Optional[str]

property trust_remote_code: bool

Trust remote code in model.

Type:

bool

property model_kwargs: Mapping[str, Any]

Additional kwargs passed to the model.

Type:

Mapping[str, Any]

property embedding_dim: int

Dimension of the embedding vectors.

Type:

int

property is_fitted: bool

Whether the embedder has been fitted.

Type:

bool

fit(documents)[source]

Initialize the embedder by loading the model.

For Sentence Transformers, this loads the model. No training is performed as models are pre-trained.

Parameters:

documents (List[str]) – List of documents (used for initialization only).

Returns:

For method chaining.

Return type:

self

embed(input_text)[source]

Generate embeddings for text.

Parameters:

input_text (str | List[str]) – Single document or batch.

Returns:

Single numpy array or list for batch.

Raises:

RuntimeError – If model loading fails.

Return type:

ndarray | list[ndarray]

embed_batch(documents, batch_size=32, show_progress=False)[source]

Embed a large batch of documents with optional progress bar.

This method is optimized for processing large corpora by embedding documents in smaller batches. It supports an optional progress bar for tracking long-running operations.

Parameters:
  • documents (List[str]) – List of documents to embed.

  • batch_size (int, optional) – Number of documents per batch. Defaults to 32.

  • show_progress (bool, optional) – Show progress bar. Defaults to False.

Returns:

List of embedding arrays, one per document.

Return type:

List[np.ndarray]

Example

>>> embedder = SentenceTransformersEmbedder().fit(corpus)
>>> vectors = embedder.embed_batch(
...     large_corpus,
...     batch_size=64,
...     show_progress=True
... )

Note

For single documents or small batches, use embed() instead.

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = SentenceTransformersEmbedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | List[str]) – Single document or batch of documents.

Returns:

A single embedding array for a single input, or a list of arrays for a batch input.

Return type:

ndarray | List[ndarray]

load(path)

Load embedder configuration.

Parameters:

path (str) – Path to configuration file.

Return type:

None

save(path)

Save embedder configuration.

Dense models typically don’t need saving as they load pre-trained weights. This saves configuration only.

Parameters:

path (str) – Path to save configuration.

Return type:

None

transform(input_text)

Alias for embed() returning numpy array.

For single input, returns 2D array with shape (1, dim). For batch input, returns 2D array with shape (n, dim).

Parameters:

input_text (str | List[str]) – Single document or batch.

Returns:

2D numpy array of embeddings.

Return type:

ndarray
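
A short sketch; the shapes assume the default all-MiniLM-L6-v2 model, which produces 384-dimensional vectors:

>>> embedder = SentenceTransformersEmbedder().fit(["doc one", "doc two"])
>>> embedder.transform("single query").shape
(1, 384)
>>> embedder.transform(["doc one", "doc two"]).shape
(2, 384)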

class zvec_db.embedders.OpenAIEmbedder(model='text-embedding-3-small', base_url='https://api.openai.com/v1', api_key=None, dimensions=None, timeout=30.0, encoding_format='float', max_batch_size=None, truncate_prompt_tokens=None, query_prefix=None, passage_prefix=None, model_kwargs=None, model_name=None, max_retries=3, initial_delay=1.0, max_delay=60.0, exponential_base=2.0, jitter=0.1, retry_config=None)[source]

Dense embedder using OpenAI-compatible /embeddings endpoint.

This embedder uses the /v1/embeddings endpoint to compute dense vector representations of texts. It’s compatible with OpenAI’s embedding API format and supports batch processing.

Works with:

  • OpenAI API (text-embedding-3-small, text-embedding-3-large, etc.)

  • vLLM serving open-source embedding models

  • Any OpenAI-compatible API endpoint

Parameters:
  • model (str) – Model name to use. For OpenAI: "text-embedding-3-small", "text-embedding-3-large". For vLLM: the model name configured in vLLM.

  • base_url (str, optional) – API base URL. For OpenAI: "https://api.openai.com/v1". For a local vLLM server: "http://localhost:8000/v1". Defaults to "https://api.openai.com/v1".

  • api_key (Optional[str], optional) – API key for authentication. Defaults to None (reads from OPENAI_API_KEY env var).

  • dimensions (Optional[int], optional) – Output embedding dimensions. Only supported by some models (e.g., text-embedding-3-small). Defaults to None (use model default).

  • timeout (float, optional) – HTTP request timeout in seconds. Defaults to 30.0.

  • encoding_format (str, optional) – Encoding format for embeddings. “float” for float32 vectors, “base64” for base64-encoded. Defaults to “float”.

  • max_batch_size (Optional[int], optional) – Maximum number of texts to embed in a single batch. None means no limit. Defaults to None.

  • truncate_prompt_tokens (Optional[int], optional) – Maximum number of tokens for prompt truncation. When set, prompts exceeding this limit are truncated. By default, APIs reject prompts exceeding max_model_len unless this is set. Defaults to None (no truncation).

  • query_prefix (str, optional) – Prefix to add to query texts. Useful for asymmetric embedding models like E5, GTE, etc. Example: “query: “ for E5 models. Defaults to “” (no prefix).

  • passage_prefix (str, optional) – Prefix to add to passage/document texts. Useful for asymmetric embedding models like E5, GTE, etc. Example: “passage: “ for E5 models. Defaults to “” (no prefix).

  • model_kwargs (Optional[Mapping[str, Any]], optional) – Additional keyword arguments passed to the API request. Useful for options such as user (unique identifier for monitoring and abuse detection), extra_headers (additional HTTP headers), and extra_query_params (additional query parameters). Defaults to None (no additional kwargs).

  • model_name (str, optional) – Deprecated. Use model instead. This parameter is kept for backward compatibility. Defaults to None.

  • max_retries (int, optional) – Maximum number of retry attempts for transient failures. Set to 0 to disable retries. Defaults to 3.

  • initial_delay (float, optional) – Initial delay before first retry in seconds. Defaults to 1.0.

  • max_delay (float, optional) – Maximum delay cap in seconds. Defaults to 60.0.

  • exponential_base (float, optional) – Base for exponential backoff. Defaults to 2.0.

  • jitter (float, optional) – Random jitter factor (0.0-1.0) to avoid thundering herd. Defaults to 0.1.

  • retry_config (Optional[RetryConfig], optional) – Pre-configured retry settings. If provided, overrides individual retry parameters. Defaults to None.

Example

>>> # OpenAI API
>>> embedder = OpenAIEmbedder(
...     model="text-embedding-3-small",
...     api_key="sk-..."
... )
>>> vector = embedder.embed("search query")
>>> # vLLM local
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     api_key="not-needed",
...     model="BAAI/bge-m3"
... )
>>> vector = embedder.embed("search query")
>>> # With truncation to handle long prompts
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="embedding",
...     truncate_prompt_tokens=512
... )
>>> # With prefixes for asymmetric models (e.g., E5, GTE)
>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     query_prefix="query: ",
...     passage_prefix="passage: "
... )
>>> query_vector = embedder.embed_query("What is machine learning?")
>>> doc_vector = embedder.embed_passage("ML is a subset of AI.")
>>> # With custom retry settings for production
>>> embedder = OpenAIEmbedder(
...     model="text-embedding-3-small",
...     max_retries=5,
...     initial_delay=2.0,
...     max_delay=120.0,
... )
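
For intuition, exponential backoff with jitter typically grows the wait between attempts roughly as sketched below; this is a generic illustration of how the retry parameters above interact, not necessarily the exact formula used by RetryConfig:

import random

def backoff_delay(attempt, initial_delay=1.0, exponential_base=2.0,
                  max_delay=60.0, jitter=0.1):
    # Illustrative delay (seconds) before retry number `attempt` (0-based).
    delay = min(initial_delay * exponential_base ** attempt, max_delay)
    # Add up to +/- `jitter` fraction of random noise to avoid a thundering herd.
    return delay * (1 + random.uniform(-jitter, jitter))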

See also

SentenceTransformersEmbedder: Local dense embeddings using HuggingFace models. RetryConfig: Configuration class for retry behavior.

__init__(model='text-embedding-3-small', base_url='https://api.openai.com/v1', api_key=None, dimensions=None, timeout=30.0, encoding_format='float', max_batch_size=None, truncate_prompt_tokens=None, query_prefix=None, passage_prefix=None, model_kwargs=None, model_name=None, max_retries=3, initial_delay=1.0, max_delay=60.0, exponential_base=2.0, jitter=0.1, retry_config=None)[source]
Parameters:
  • model (str)

  • base_url (str)

  • api_key (str | None)

  • dimensions (int | None)

  • timeout (float)

  • encoding_format (str)

  • max_batch_size (int | None)

  • truncate_prompt_tokens (int | None)

  • query_prefix (str | None)

  • passage_prefix (str | None)

  • model_kwargs (Mapping[str, Any] | None)

  • model_name (str | None)

  • max_retries (int)

  • initial_delay (float)

  • max_delay (float)

  • exponential_base (float)

  • jitter (float)

  • retry_config (RetryConfig | None)

property model_name: str

Model identifier (alias for model for backward compatibility).

Type:

str

property model: str

Model identifier (OpenAI API naming).

Type:

str

property base_url: str

Base URL for the API.

Type:

str

property api_key: str | None

API key for authentication.

Type:

Optional[str]

property dimensions: int | None

Output embedding dimensions.

Type:

Optional[int]

property timeout: float

HTTP request timeout in seconds.

Type:

float

property encoding_format: str

Encoding format for embeddings.

Type:

str

property max_batch_size: int | None

Maximum batch size for embedding.

Type:

Optional[int]

property truncate_prompt_tokens: int | None

Maximum number of tokens for prompt truncation.

Type:

Optional[int]

property query_prefix: str

Prefix added to query texts.

Type:

str

property passage_prefix: str

Prefix added to passage/document texts.

Type:

str

property model_kwargs: Mapping[str, Any]

Additional kwargs passed to the API.

Type:

Mapping[str, Any]

property embedding_dim: int

Dimension of embeddings (available after fit or first embed).

Type:

int

property is_fitted: bool

Whether the embedder has been fitted.

Type:

bool

fit(documents)[source]

Initialize the embedder.

For an API-based embedder, this is a no-op because the model is pre-trained. This method exists for API compatibility.

Parameters:

documents (List[str]) – List of documents (not used, for API compatibility).

Returns:

For method chaining.

Return type:

self

embed(input_text, prefix=None)[source]

Embed texts into dense vectors.

Parameters:
  • input_text (Union[str, List[str]]) – Single text or list of texts to embed.

  • prefix (Optional[str], optional) – Prefix to add to each text. Defaults to None (no prefix).

Returns:

  • If single text: np.ndarray of shape (embedding_dim,)

  • If multiple texts: List[np.ndarray], one array of shape (embedding_dim,) per text

Return type:

Union[np.ndarray, List[np.ndarray]]

__call__(input_text)

Call shortcut that delegates to embed().

This allows the embedder to be called like a function:

embedder = SentenceTransformersEmbedder()
embedder.fit(documents)
vector = embedder("query text")  # equivalent to embedder.embed(...)
Parameters:

input_text (str | List[str]) – Single document or batch of documents.

Returns:

A single embedding array for a single input, or a list of arrays for a batch input.

Return type:

ndarray | List[ndarray]

embed_query(query)[source]

Embed a query or list of queries with the query prefix.

Parameters:

query (Union[str, List[str]]) – Single query or list of queries to embed.

Returns:

  • If single query: np.ndarray of shape (embedding_dim,)

  • If multiple queries: List[np.ndarray], one array of shape (embedding_dim,) per query

Return type:

Union[np.ndarray, List[np.ndarray]]
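
A usage sketch for an asymmetric model (roughly equivalent to calling embed() with prefix=query_prefix); endpoint and model are illustrative:

>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     query_prefix="query: ",
... )
>>> vec = embedder.embed_query("What is machine learning?")
>>> vec.shape  # (embedding_dim,)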

load(path)

Load embedder configuration.

Parameters:

path (str) – Path to configuration file.

Return type:

None

save(path)

Save embedder configuration.

Dense models typically don’t need saving as they load pre-trained weights. This saves configuration only.

Parameters:

path (str) – Path to save configuration.

Return type:

None

transform(input_text)

Alias for embed() returning numpy array.

For single input, returns 2D array with shape (1, dim). For batch input, returns 2D array with shape (n, dim).

Parameters:

input_text (str | List[str]) – Single document or batch.

Returns:

2D numpy array of embeddings.

Return type:

ndarray

embed_passage(passage)[source]

Embed a passage/document or list of passages with the passage prefix.

Parameters:

passage (Union[str, List[str]]) – Single passage or list of passages to embed.

Returns:

  • If single passage: np.ndarray of shape (embedding_dim,)

  • If multiple passages: List[np.ndarray], one array of shape (embedding_dim,) per passage

Return type:

Union[np.ndarray, List[np.ndarray]]

embed_batch(documents, show_progress=False, prefix=None)[source]

Embed a batch of documents.

Parameters:
  • documents (List[str]) – List of documents to embed.

  • show_progress (bool, optional) – Show progress bar. Not used for API-based embedding. Defaults to False.

  • prefix (Optional[str], optional) – Prefix to add to each document. Defaults to None (no prefix).

Returns:

List of embedding vectors.

Return type:

List[np.ndarray]
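
A minimal sketch for bulk-indexing documents with a passage prefix against a local vLLM endpoint (endpoint and model are illustrative):

>>> embedder = OpenAIEmbedder(
...     base_url="http://localhost:8000/v1",
...     model="intfloat/e5-large-v2",
...     max_batch_size=64,
... )
>>> vectors = embedder.embed_batch(corpus, prefix="passage: ")
>>> len(vectors) == len(corpus)
True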