zvec_db.embedders.sparse.base

Base classes for sparse embedding transformers.

This module provides abstract base classes that factor out common logic for BM25-family transformers (BM25, BM25L, BM25+, etc.).

Classes

BaseBM25Transformer([k1])

Abstract base class for BM25-family transformers.

class zvec_db.embedders.sparse.base.BaseBM25Transformer(k1=1.2)[source]

Abstract base class for BM25-family transformers.

This class factors out logic shared across BM25 variants:
  • IDF computation from document frequencies

  • Average document length calculation

  • Fit/transform boilerplate with validation

Subclasses must implement the _compute_scores() method to define their specific scoring formula.

Parameters:

k1 (float)

k1

Term frequency saturation parameter.

Type:

float

idf_

Computed inverse document frequencies for all terms.

Type:

ndarray

avgdl_

Average document length in the training corpus.

Type:

float

is_fitted_

Whether the transformer has been fitted.

Type:

bool

Notes

Subclasses should only override:
  • _compute_scores(): Core scoring formula (required)

  • __init__(): To add variant-specific parameters (optional)
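The split between shared boilerplate and the variant-specific hook might look roughly like the sketch below. It is self-contained and does not import the library: SketchBase, SketchBM25, the _compute_scores(tf, doc_len) signature, the b parameter, and the dict-per-document input layout are all assumptions made for illustration. The real class works on csr_matrix inputs, and its hook signature is not documented on this page. IDF weighting is left out to keep the focus on the override point.

```python
from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for BaseBM25Transformer, not the real class."""

    def __init__(self, k1=1.2):
        self.k1 = k1
        self.is_fitted_ = False

    def fit(self, rows, y=None):
        # rows: one {term_index: count} dict per document (assumed layout).
        total_len = sum(sum(row.values()) for row in rows)
        if total_len == 0:
            raise ValueError("empty corpus: average document length is zero")
        self.avgdl_ = total_len / len(rows)
        self.is_fitted_ = True
        return self

    def transform(self, rows):
        # Shared boilerplate: validate, then delegate to the subclass hook.
        if not self.is_fitted_:
            raise RuntimeError("transformer has not been fitted")
        return [
            {t: self._compute_scores(tf, sum(row.values())) for t, tf in row.items()}
            for row in rows
        ]

    @abstractmethod
    def _compute_scores(self, tf, doc_len):
        """Variant-specific formula; this signature is an assumption."""


class SketchBM25(SketchBase):
    """Classic BM25 term-frequency component, with IDF left out for brevity."""

    def __init__(self, k1=1.2, b=0.75):
        super().__init__(k1=k1)  # the shared parameter stays in the base class
        self.b = b               # variant-specific length-normalization weight

    def _compute_scores(self, tf, doc_len):
        norm = 1 - self.b + self.b * doc_len / self.avgdl_
        return tf * (self.k1 + 1) / (tf + self.k1 * norm)
```

In this arrangement a new variant only supplies its own formula and any extra constructor arguments. The fitted-state check and the empty-corpus check live in one place.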

__init__(k1=1.2)[source]

Initialize the base transformer.

Parameters:

k1 (float) – Term frequency saturation parameter. Typical range: 1.2-2.0. Higher values mean slower saturation.
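To make "slower saturation" concrete, the sketch below evaluates the standard BM25 term-frequency factor tf * (k1 + 1) / (tf + k1), with length normalization left out. The helper tf_component is illustrative only and is not part of the library. The factor approaches k1 + 1 as tf grows, so the loop reports what fraction of that ceiling has been reached.

```python
def tf_component(tf, k1):
    """Standard BM25 term-frequency factor, ignoring length normalization."""
    return tf * (k1 + 1) / (tf + k1)


for k1 in (1.2, 2.0):
    # Fraction of the asymptotic ceiling (k1 + 1) reached at each tf.
    fractions = [tf_component(tf, k1) / (k1 + 1) for tf in (1, 2, 5, 10)]
    print(k1, [round(f, 2) for f in fractions])
```

For any given tf, the larger k1 reaches a smaller fraction of its ceiling, so repeated occurrences of a term keep adding weight for longer.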

fit(X, y=None)[source]

Compute IDF values and average document length from a count matrix.

Parameters:
  • X (csr_matrix) – Sparse count matrix of shape (n_docs, n_terms).

  • y (Any) – Ignored; present for scikit-learn compatibility.

Returns:

The fitted transformer.

Return type:

self

Raises:

ValueError – If the corpus is empty (average document length is zero).
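The two statistics that fit computes can be illustrated on a tiny corpus. The sketch below uses plain Python lists instead of a csr_matrix. It assumes the smoothed IDF ln(1 + (N - df + 0.5) / (df + 0.5)); this page does not state which IDF formula the library uses, and fit_statistics is a hypothetical helper.

```python
import math


def fit_statistics(counts):
    """Return (idf, avgdl) for a dense count matrix given as a list of rows."""
    n_docs = len(counts)
    total_len = sum(sum(row) for row in counts)
    if n_docs == 0 or total_len == 0:
        raise ValueError("empty corpus: average document length is zero")
    avgdl = total_len / n_docs
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Assumed IDF variant (always positive); the library may differ.
    idf = [math.log(1 + (n_docs - d + 0.5) / (d + 0.5)) for d in df]
    return idf, avgdl


counts = [
    [2, 1, 0],  # document 0
    [0, 3, 0],  # document 1
    [1, 1, 1],  # document 2
]
idf, avgdl = fit_statistics(counts)
```

Here every document has length 3, so avgdl is 3.0. The term that appears in only one document gets the largest IDF, and the term that appears in all three gets the smallest.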

transform(X)[source]

Apply BM25 scoring to a count matrix.

Parameters:

X (csr_matrix) – Sparse count matrix of shape (n_docs, n_terms).

Returns:

BM25-weighted sparse matrix of the same shape.

Return type:

csr_matrix

Raises:

RuntimeError – If the transformer has not been fitted.