zvec_db.embedders.sparse.tfidf

TF-IDF (Term Frequency-Inverse Document Frequency) sparse embedding.

This module implements TF-IDF embedding using scikit-learn’s TfidfVectorizer. TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents, computed as the product of term frequency and inverse document frequency.

Classes

TfidfEmbedder

Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.

Example Usage

from zvec_db.embedders import TfidfEmbedder

embedder = TfidfEmbedder(
    max_features=4096,
    sublinear_tf=True
)
embedder.fit(documents)
vector = embedder.embed("search query")


class zvec_db.embedders.sparse.tfidf.TfidfEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)

Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed as the product of:

  • Term Frequency (TF): How often a term appears in a document.

  • Inverse Document Frequency (IDF): A penalty factor for terms that appear in many documents.
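As a rough illustration of the two factors (a hand sketch, not the library implementation — TfidfVectorizer additionally L2-normalizes each row by default), the product can be computed directly:

```python
import math

# Toy corpus of pre-tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "ran"],
]
n_docs = len(docs)

def tf(term, doc):
    # Raw term frequency: how often the term appears in the document.
    return doc.count(term)

def idf(term):
    # Smoothed IDF in the form scikit-learn uses with smooth_idf=True:
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

# "cat" occurs in two of three documents, "ran" in only one,
# so "ran" receives the larger IDF weight.
tfidf_cat = tf("cat", docs[1]) * idf("cat")
tfidf_ran = tf("ran", docs[2]) * idf("ran")
```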

This embedder supports custom tokenization and pre-tokenized inputs. All additional keyword arguments are passed through to the underlying TfidfVectorizer (e.g., min_df, max_df, ngram_range, sublinear_tf).
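The pass-through behavior follows the usual kwargs-forwarding pattern; a minimal sketch (names here are illustrative, not the actual zvec_db implementation):

```python
# Hypothetical sketch of the kwargs pass-through described above.
DEFAULTS = {"max_features": 8192}

def build_vectorizer_kwargs(**tfidf_params):
    # Caller-supplied keys (e.g. min_df, sublinear_tf) override the
    # defaults and are otherwise forwarded verbatim to TfidfVectorizer.
    return {**DEFAULTS, **tfidf_params}

kwargs = build_vectorizer_kwargs(min_df=2, sublinear_tf=True)
```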

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum vocabulary size (number of features) retained by the vectorizer. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **tfidf_params – Additional keyword arguments passed to TfidfVectorizer.

Example

>>> embedder = TfidfEmbedder(min_df=2, sublinear_tf=True)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])

__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)

Fit the TF-IDF vectorizer on a corpus of documents.

The corpus is pre-processed according to the embedder’s configuration:

  • Custom tokenizer: Each document is tokenized before vectorization.

  • Pre-tokenized mode: Documents are expected to be lists of tokens.

  • Default: Raw strings are passed directly to TfidfVectorizer.
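The dispatch between these three modes can be sketched as follows (a hypothetical helper for illustration; the function name and checks are not zvec_db's actual code):

```python
from typing import Callable, Optional, Sequence

def prepare_corpus(
    corpus: Sequence,
    tokenizer: Optional[Callable[[str], list]] = None,
    is_pretokenized: bool = False,
):
    if is_pretokenized:
        # Pre-tokenized mode: every document must already be a token list.
        if not all(isinstance(doc, list) for doc in corpus):
            raise ValueError("is_pretokenized=True requires lists of tokens")
        return list(corpus)
    if not all(isinstance(doc, str) for doc in corpus):
        raise ValueError("expected raw string documents")
    if tokenizer is not None:
        # Custom tokenizer: applied to each document before vectorization.
        return [tokenizer(doc) for doc in corpus]
    # Default: raw strings are handed to TfidfVectorizer unchanged.
    return list(corpus)
```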

Parameters:
  • corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.

  • y – Ignored; present for scikit-learn compatibility.

Returns:

The fitted embedder.

Return type:

self

Raises:

ValueError – If corpus format doesn’t match the configuration.

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). See the User Guide for details on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). See the User Guide for details on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (TfidfEmbedder)

Returns:

self – The updated object.

Return type:

object