zvec_db.embedders.sparse.base
Base classes for sparse embedding transformers.
This module provides abstract base classes that factor out common logic for BM25-family transformers (BM25, BM25L, BM25+, etc.).
Classes
BaseBM25Transformer | Abstract base class for BM25-family transformers.
- class zvec_db.embedders.sparse.base.BaseBM25Transformer(k1=1.2)
Abstract base class for BM25-family transformers.
This class factors out the logic shared across BM25 variants:
- IDF computation from document frequencies
- Average document length calculation
- Fit/transform boilerplate with validation
Subclasses must implement the _compute_scores() method to define their specific scoring formula.
- Parameters:
k1 (float)
- idf_
Computed inverse document frequencies for all terms.
- Type:
ndarray
Notes
Subclasses should only override:
- _compute_scores(): Core scoring formula (required)
- __init__(): To add variant-specific parameters (optional)

- __init__(k1=1.2)
Initialize the base transformer.
- Parameters:
k1 (float) – Term frequency saturation parameter. Typical range: 1.2-2.0. Higher values mean slower saturation.
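The effect of k1 can be illustrated with the term-frequency component of the textbook BM25 formula. This is a minimal sketch, not code from this module: the helper name bm25_tf_saturation is hypothetical, and length normalization is left out so that only k1 varies.

```python
def bm25_tf_saturation(tf: float, k1: float) -> float:
    """Textbook BM25 term-frequency component, without length normalization."""
    return tf * (k1 + 1.0) / (tf + k1)


for k1 in (1.2, 2.0):
    ceiling = k1 + 1.0  # value the component approaches as tf grows
    # Fraction of the ceiling reached at a few raw term counts.
    fractions = [round(bm25_tf_saturation(tf, k1) / ceiling, 3) for tf in (1, 2, 5, 20)]
    print(f"k1={k1}: fraction of ceiling reached = {fractions}")
```

At the same term count, the larger k1 reaches a smaller fraction of its ceiling, which is what slower saturation means here.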
- fit(X, y=None)
Compute IDF values and average document length from a count matrix.
- Parameters:
X (csr_matrix) – Sparse count matrix of shape (n_docs, n_terms).
y (Any) – Ignored; present for scikit-learn compatibility.
- Returns:
The fitted transformer.
- Return type:
self
- Raises:
ValueError – If the corpus is empty (average document length is zero).
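The statistics that fit() is described as computing can be sketched as follows. This is an illustration under stated assumptions rather than the module's implementation: the function name bm25_corpus_stats is hypothetical, and the IDF variant (the non-negative, Lucene-style form) is an assumption, since the page does not state which formula is used.

```python
import numpy as np
from scipy.sparse import csr_matrix


def bm25_corpus_stats(X: csr_matrix):
    """Return (idf, avgdl) for a count matrix of shape (n_docs, n_terms)."""
    n_docs = X.shape[0]
    # Document length = total term count in each row.
    doc_lengths = np.asarray(X.sum(axis=1)).ravel()
    avgdl = float(doc_lengths.mean()) if n_docs else 0.0
    if avgdl == 0.0:
        raise ValueError("Empty corpus: average document length is zero.")
    # Document frequency = number of rows in which each term occurs.
    df = np.asarray((X > 0).sum(axis=0)).ravel()
    # Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
    idf = np.log1p((n_docs - df + 0.5) / (df + 0.5))
    return idf, avgdl


counts = csr_matrix(np.array([[2, 0, 1],
                              [0, 1, 1],
                              [1, 0, 0]]))
idf, avgdl = bm25_corpus_stats(counts)
print("avgdl:", avgdl)
print("idf:", idf)
```

In this toy corpus the second term appears in only one document, so it receives the largest IDF value.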
- transform(X)
Apply BM25 scoring to a count matrix.
- Parameters:
X (csr_matrix) – Sparse count matrix of shape (n_docs, n_terms).
- Returns:
BM25-weighted sparse matrix of the same shape.
- Return type:
csr_matrix
- Raises:
RuntimeError – If the transformer has not been fitted.
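The division of labor described on this page (shared fit/transform in the base class, scoring formula in the subclass) might look roughly like the sketch below. Everything in it is a stand-in: ToyBM25Base, ToyClassicBM25, the avgdl_ attribute, the b parameter, the IDF variant, and the signature of _compute_scores() are all assumptions, since the page documents none of them.

```python
from abc import ABC, abstractmethod

import numpy as np
from scipy.sparse import csr_matrix


class ToyBM25Base(ABC):
    """Hypothetical stand-in for the documented base class, not its source."""

    def __init__(self, k1: float = 1.2):
        self.k1 = k1
        self.idf_ = None
        self.avgdl_ = None  # assumed attribute name

    def fit(self, X, y=None):
        X = csr_matrix(X)
        doc_lengths = np.asarray(X.sum(axis=1)).ravel()
        avgdl = float(doc_lengths.mean()) if X.shape[0] else 0.0
        if avgdl == 0.0:
            raise ValueError("Empty corpus: average document length is zero.")
        df = np.asarray((X > 0).sum(axis=0)).ravel()
        # Assumed IDF variant (non-negative, Lucene-style).
        self.idf_ = np.log1p((X.shape[0] - df + 0.5) / (df + 0.5))
        self.avgdl_ = avgdl
        return self

    def transform(self, X):
        if self.idf_ is None:
            raise RuntimeError("Call fit() before transform().")
        X = csr_matrix(X, dtype=np.float64)
        doc_lengths = np.asarray(X.sum(axis=1)).ravel()
        return self._compute_scores(X, doc_lengths)

    @abstractmethod
    def _compute_scores(self, X, doc_lengths):
        """Assumed signature: return a csr_matrix with the same shape as X."""


class ToyClassicBM25(ToyBM25Base):
    """Textbook BM25 scoring plugged into the shared boilerplate."""

    def __init__(self, k1: float = 1.2, b: float = 0.75):
        super().__init__(k1=k1)
        self.b = b  # variant-specific length-normalization parameter

    def _compute_scores(self, X, doc_lengths):
        coo = X.tocoo()
        tf = coo.data
        # Per-entry length normalization based on the entry's document.
        norm = 1.0 - self.b + self.b * doc_lengths[coo.row] / self.avgdl_
        weights = self.idf_[coo.col] * tf * (self.k1 + 1.0) / (tf + self.k1 * norm)
        return csr_matrix((weights, (coo.row, coo.col)), shape=X.shape)


counts = csr_matrix(np.array([[2, 0, 1],
                              [0, 1, 1],
                              [1, 0, 0]]))
scores = ToyClassicBM25().fit(counts).transform(counts)
print(scores.toarray())
```

Keeping validation and corpus statistics in the base class means a variant such as BM25L or BM25+ would only need its own _compute_scores() and, optionally, extra constructor parameters.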