zvec_db.embedders.sparse.count

Count-based sparse embedding using term frequencies.

This module implements simple count-based sparse embedding using scikit-learn’s CountVectorizer. It converts text documents into sparse vectors based on raw term frequencies (count of each token).
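For background, a minimal sketch of what raw term frequencies look like, using scikit-learn's CountVectorizer directly (the class this module wraps); the values shown in the comments are illustrative:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(["the cat sat on the mat"])
print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(matrix.toarray())                    # [[1 1 1 1 2]] -- "the" occurs twice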


Example Usage

from zvec_db.embedders import CountEmbedder

documents = ["first example document", "second example document"]

embedder = CountEmbedder(max_features=4096, binary=True)
embedder.fit(documents)                  # learn the vocabulary from the corpus
vector = embedder.embed("search query")  # sparse count vector for the query

Classes

CountEmbedder([tokenizer, is_pretokenized, ...])

Count-based sparse embedder wrapping scikit-learn's CountVectorizer.

class zvec_db.embedders.sparse.count.CountEmbedder(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]

Count-based sparse embedder wrapping scikit-learn’s CountVectorizer.

This embedder converts text documents into sparse vectors based on term frequencies (raw counts of each token). It is the simplest sparse embedding method and serves as a foundation for more advanced techniques like BM25 and TF-IDF.

The embedder accepts raw strings or pre-tokenized input. Any keyword arguments are forwarded to the underlying CountVectorizer after being normalized by BaseSparseEmbedder._prepare_vectorizer_params().

Parameters:
  • tokenizer (Optional[Callable]) – Custom tokenizer function. If provided, it will be called on each document before vectorization.

  • is_pretokenized (bool) – If True, input documents must already be lists of tokens. Mutually exclusive with tokenizer.

  • max_features (Optional[int]) – Maximum number of features (vocabulary terms) to retain. Defaults to 8192.

  • preprocessing_config (Optional[NormalizationConfig]) – Configuration for automatic text preprocessing (normalization, stemming, stopwords). If set, preprocessing is automatically applied during fit() and embed().

  • **count_params – Additional keyword arguments passed to CountVectorizer (e.g., min_df, max_df, ngram_range).

Example

>>> documents = ["first example document", "another example document"]
>>> embedder = CountEmbedder(min_df=2, ngram_range=(1, 2))
>>> embedder.fit(documents)
>>> vectors = embedder.embed(["query text"])
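In addition to the raw-string example above, a brief sketch of the custom-tokenizer and pre-tokenized input modes described under the parameters (illustrative data only):

>>> # Custom tokenizer: each string is passed through the callable before vectorization
>>> tok_embedder = CountEmbedder(tokenizer=str.split)
>>> tok_embedder.fit(["first example document", "second example document"])

>>> # Pre-tokenized input: the corpus must already be lists of tokens
>>> pre_embedder = CountEmbedder(is_pretokenized=True)
>>> pre_embedder.fit([["first", "example", "document"], ["second", "example", "document"]])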
__init__(tokenizer=None, is_pretokenized=False, max_features=8192, preprocessing_config=None, **count_params)[source]
Parameters:
  • tokenizer (Callable | None)

  • is_pretokenized (bool)

  • max_features (int | None)

  • preprocessing_config (NormalizationConfig | None)

fit(corpus, y=None)[source]

Train the embedder on a corpus of documents.

The supplied corpus is normalized according to the instance configuration:

  • is_pretokenized=True – the caller must provide lists of tokens.

  • tokenizer=... – each string in the corpus is passed through the tokenizer before vectorization.

  • neither set – raw strings are passed to CountVectorizer directly.

_prepare_corpus handles the validation and transformation logic.

Parameters:

corpus (list[str] | list[list[str]]) – Sequence of documents (strings or token lists depending on configuration).

Returns:

self to allow chaining.
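Because fit() returns the embedder itself, fitting and embedding can be chained; a minimal sketch with placeholder data:

>>> documents = ["first example document", "second example document"]
>>> vector = CountEmbedder(max_features=4096).fit(documents).embed("example query")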

set_fit_request(*, corpus='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • corpus (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for corpus parameter in fit.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object
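A minimal sketch of requesting routing for corpus, assuming scikit-learn 1.3+ with metadata routing enabled; whether routing corpus this way is useful depends on the surrounding meta-estimator:

>>> from sklearn import set_config
>>> set_config(enable_metadata_routing=True)
>>> embedder = CountEmbedder().set_fit_request(corpus=True)  # pass corpus through to fit when routed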

set_transform_request(*, input_text='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • input_text (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for input_text parameter in transform.

  • self (CountEmbedder)

Returns:

self – The updated object.

Return type:

object