zvec_db.embedders.sparse.count
Count-based sparse embedding using term frequencies.
This module implements simple count-based sparse embedding using scikit-learn’s CountVectorizer. It converts text documents into sparse vectors based on raw term frequencies (count of each token).
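To make the idea concrete, here is a minimal pure-Python sketch of count-based sparse embedding, approximating what CountVectorizer does internally (the helper names here are illustrative, not part of the zvec_db API):

```python
# Illustrative sketch only: fit a vocabulary over a corpus, then embed a
# document as a sparse {column_index: count} mapping.
from collections import Counter

def fit_vocabulary(corpus):
    """Map each unique whitespace token to a column index."""
    vocab = {}
    for doc in corpus:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def embed_counts(text, vocab):
    """Return a sparse {index: count} vector for one document."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}

corpus = ["sparse vectors store counts", "counts of each token"]
vocab = fit_vocabulary(corpus)
vector = embed_counts("token counts counts", vocab)
```

Tokens unseen during fitting are simply dropped at embed time, mirroring the fit/transform split of the real embedder.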
Classes
- CountEmbedder
Count-based embedder wrapping scikit-learn’s CountVectorizer.
Example Usage
from zvec_db.embedders import CountEmbedder
embedder = CountEmbedder(max_features=4096, binary=True)
embedder.fit(documents)
vector = embedder.embed("search query")
Classes
- class zvec_db.embedders.sparse.count.CountEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
Count-based sparse embedder wrapping scikit-learn's CountVectorizer.

This embedder converts text documents into sparse vectors based on term frequencies (raw counts of each token). It is the simplest sparse embedding method and serves as a foundation for more advanced techniques like BM25 and TF-IDF.
The embedder accepts raw strings or pre-tokenized input. Any keyword arguments are forwarded to the underlying CountVectorizer after being normalized by BaseSparseEmbedder._prepare_vectorizer_params().

- Parameters:
tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.
is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.
max_features (Optional[int]) – Maximum number of features to retain per document. Defaults to 8192.
preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().
**count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).
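The preprocessing_config parameter bundles text-cleanup steps applied before counting. The sketch below shows the kind of normalization such a config might enable; the function name, options, and stopword set are illustrative assumptions, not the actual zvec_db API:

```python
# Hedged sketch of text preprocessing (lowercasing, punctuation stripping,
# stopword removal) of the sort a NormalizationConfig could configure.
import re

STOPWORDS = {"the", "a", "of", "and"}  # example stopword set, not zvec_db's

def preprocess(text, lowercase=True, strip_punct=True, remove_stopwords=True):
    if lowercase:
        text = text.lower()
    if strip_punct:
        # Replace anything that is not a word character or whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)
```

When a config is set, the documented behaviour is that this kind of cleanup runs automatically inside both fit() and embed(), so both sides of the pipeline see identically normalized text.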
Example
>>> embedder = CountEmbedder(min_df=2, ngram_range=(1, 2))
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
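The min_df and ngram_range options in the example above shape which terms enter the vocabulary. This plain-Python sketch mimics their effect without calling CountVectorizer, to show what each one does:

```python
# Sketch: min_df keeps only terms seen in at least that many documents;
# ngram_range=(lo, hi) adds multi-token terms to the vocabulary.
from collections import Counter

def ngrams(tokens, lo, hi):
    """All n-grams for n in [lo, hi], joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def build_vocab(corpus, min_df=1, ngram_range=(1, 1)):
    """Document-frequency filter plus n-gram expansion, in miniature."""
    df = Counter()
    for doc in corpus:
        # set() so each document counts a term at most once.
        df.update(set(ngrams(doc.lower().split(), *ngram_range)))
    return sorted(t for t, c in df.items() if c >= min_df)
```

Raising min_df prunes rare (often noisy) terms, while widening ngram_range trades a larger vocabulary for some word-order sensitivity.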
- __init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
- fit(corpus, y=None)[source]
Train the embedder on a corpus of documents.
The supplied corpus is normalised according to the instance configuration:

- is_pretokenized=True – the caller must provide lists of tokens.
- tokenizer=... – each string in the corpus will be passed through the tokenizer before vectorisation.
- neither set – raw strings are passed to CountVectorizer directly.

_prepare_corpus handles the validation and transformation logic.
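The three input modes above can be sketched as a small dispatch function. This is a hedged approximation of what _prepare_corpus might do, not the real implementation:

```python
# Sketch of corpus normalisation across the three documented modes.
def prepare_corpus(corpus, tokenizer=None, is_pretokenized=False):
    if is_pretokenized:
        # Caller must already supply lists of tokens.
        if not all(isinstance(doc, list) for doc in corpus):
            raise TypeError("is_pretokenized=True requires lists of tokens")
        return list(corpus)
    if tokenizer is not None:
        # Run each raw string through the custom tokenizer.
        return [tokenizer(doc) for doc in corpus]
    # Neither set: raw strings go to the vectorizer unchanged.
    return list(corpus)
```

Because is_pretokenized and tokenizer are mutually exclusive, exactly one branch applies per embedder instance.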
- set_fit_request(*, corpus='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.
- Parameters:
corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.
- Returns:
self – The updated object.
- Return type:
CountEmbedder
- set_transform_request(*, input_text='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.
- Parameters:
input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.
- Returns:
self – The updated object.
- Return type:
CountEmbedder