zvec_db.embedders.sparse.tfidf
TF-IDF (Term Frequency-Inverse Document Frequency) sparse embedding.
This module implements TF-IDF embedding using scikit-learn’s TfidfVectorizer. TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents, computed as the product of term frequency and inverse document frequency.
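To make the product concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it leaves out the L2 normalization that TfidfVectorizer also applies by default. The helper name `tfidf_weights` and the whitespace tokenization are illustrative only and are not part of zvec_db.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: count * idf[t] for t, count in tf.items()})
    return weights


docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tfidf_weights(docs)
# "the" occurs in every document, so it receives the minimum IDF;
# "dog" occurs in only one document, so it is weighted higher.
print(weights[1])
```

Terms that appear in every document end up with the smallest weight, which is the penalty effect the IDF factor is meant to provide.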
Classes
- TfidfEmbedder
Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.
Example Usage
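from zvec_db.embedders import TfidfEmbedder

# Illustrative corpus; substitute your own list of strings.
documents = ["the cat sat on the mat", "the dog chased the cat"]

embedder = TfidfEmbedder(
    max_features=4096,
    sublinear_tf=True,
)
embedder.fit(documents)
vector = embedder.embed("search query")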
Classes
- class zvec_db.embedders.sparse.tfidf.TfidfEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
Sparse TF-IDF embedder using scikit-learn’s TfidfVectorizer.
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed as the product of:
Term Frequency (TF): How often a term appears in a document.
Inverse Document Frequency (IDF): A penalty factor for terms that appear in many documents.
This embedder supports custom tokenization and pre-tokenized inputs. All additional keyword arguments are passed through to the underlying TfidfVectorizer (e.g., min_df, max_df, ngram_range, sublinear_tf).
- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features (vocabulary terms) to retain. Defaults to 8192.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**tfidf_params – Additional keyword arguments passed to TfidfVectorizer.
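As a rough illustration of two of these options, the sketch below mimics in plain Python what scikit-learn documents for `max_features` (keep only the most frequent terms across the corpus) and `sublinear_tf` (replace a raw count tf with 1 + ln(tf)). It assumes that TfidfEmbedder forwards both settings to TfidfVectorizer unchanged; the helper names `capped_vocabulary` and `sublinear` are hypothetical and exist only for this example.

```python
import math
from collections import Counter


def capped_vocabulary(tokenized_docs, max_features):
    """Keep the max_features most frequent terms across the corpus."""
    counts = Counter(t for tokens in tokenized_docs for t in tokens)
    top = [term for term, _ in counts.most_common(max_features)]
    # Assign column indices in sorted order so the mapping is deterministic.
    return {term: i for i, term in enumerate(sorted(top))}


def sublinear(tf):
    """Sublinear scaling: a raw count tf becomes 1 + ln(tf)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


corpus = [["a", "a", "a", "b"], ["a", "b", "c"]]
vocab = capped_vocabulary(corpus, max_features=2)
print(vocab)         # "c" is the least frequent term and is dropped
print(sublinear(3))  # grows with the logarithm of the count, not linearly
```

Capping the vocabulary bounds the dimensionality of every sparse vector, while sublinear scaling dampens the influence of terms repeated many times in one document.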
Example
>>> embedder = TfidfEmbedder(min_df=2, sublinear_tf=True)
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **tfidf_params)[source]
- fit(corpus, y=None)[source]
Fit the TF-IDF vectorizer on a corpus of documents.
The corpus is pre-processed according to the embedder’s configuration:
Custom tokenizer: Each document is tokenized before vectorization.
Pre-tokenized mode: Documents are expected to be lists of tokens.
Default: Raw strings are passed directly to TfidfVectorizer.
- Parameters:
corpus (ExtendedList) – Training documents. Must be strings unless is_pretokenized=True or a custom tokenizer is set.
y – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted embedder.
- Return type:
self
- Raises:
ValueError – If corpus format doesn’t match the configuration.
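The three input modes described for fit() can be sketched as a small dispatch function. This is an assumption about how such validation might look, not the library's actual code: `prepare_corpus` is a hypothetical helper, and the exact checks and error messages in zvec_db may differ.

```python
def prepare_corpus(corpus, tokenizer=None, is_pretokenized=False):
    """Hypothetical helper mirroring the three input modes of fit()."""
    if tokenizer is not None and is_pretokenized:
        raise ValueError("tokenizer and is_pretokenized are mutually exclusive")
    if is_pretokenized:
        # Pre-tokenized mode: every document must be a list of string tokens.
        for doc in corpus:
            if not isinstance(doc, list) or not all(isinstance(t, str) for t in doc):
                raise ValueError("expected lists of tokens when is_pretokenized=True")
        return list(corpus)
    if tokenizer is not None:
        # Custom tokenizer: tokenize each document before vectorization.
        return [tokenizer(doc) for doc in corpus]
    # Default mode: raw strings only.
    if any(not isinstance(doc, str) for doc in corpus):
        raise ValueError("expected raw strings in default mode")
    return list(corpus)


print(prepare_corpus(["a b c"], tokenizer=str.split))
print(prepare_corpus([["a", "b"]], is_pretokenized=True))
```

Passing a list of token lists without setting is_pretokenized=True raises ValueError in this sketch, which corresponds to the "corpus format doesn't match the configuration" case.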
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
self (TfidfEmbedder)
- Returns:
self – The updated object.
- Return type:
TfidfEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
self (TfidfEmbedder)
- Returns:
self – The updated object.
- Return type:
TfidfEmbedder