BM25S: Orders of magnitude faster lexical search via eager sparse scoring
Xing Han Lù

Abstract
We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on NumPy and SciPy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
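
To make the idea of eager sparse scoring concrete, the following is a minimal sketch rather than the BM25S code itself. It assumes a Lucene-style IDF and default values of k1 = 1.5 and b = 0.75. The function names `build_index` and `search` are hypothetical and are not part of the bm25s API. The point it illustrates is that every (document, term) score can be computed once at indexing time and stored in a SciPy sparse matrix, so that answering a query reduces to selecting and summing columns.

```python
import math
from collections import Counter

import numpy as np
from scipy.sparse import csc_matrix


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute a BM25 score for every (document, term) pair that co-occurs.

    Hypothetical helper for illustration; k1, b and the IDF formula are assumptions.
    """
    n_docs = len(corpus_tokens)
    doc_lens = np.array([len(doc) for doc in corpus_tokens], dtype=np.float64)
    avg_len = doc_lens.mean()

    vocab = {}            # term -> column index
    doc_freq = Counter()  # term -> number of documents containing it
    term_counts = []      # per-document term frequencies
    for doc in corpus_tokens:
        counts = Counter(doc)
        term_counts.append(counts)
        for term in counts:
            vocab.setdefault(term, len(vocab))
            doc_freq[term] += 1

    rows, cols, vals = [], [], []
    for doc_id, counts in enumerate(term_counts):
        # Length normalisation term, shared by all terms in this document.
        norm = k1 * (1.0 - b + b * doc_lens[doc_id] / avg_len)
        for term, tf in counts.items():
            df = doc_freq[term]
            # Lucene-style IDF (assumed here); always positive.
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            rows.append(doc_id)
            cols.append(vocab[term])
            vals.append(idf * tf / (tf + norm))

    # Documents are rows and terms are columns; absent pairs stay implicit zeros.
    scores = csc_matrix((vals, (rows, cols)), shape=(n_docs, len(vocab)))
    return scores, vocab


def search(scores, vocab, query_tokens, top_k=3):
    """Score a query by summing the precomputed columns of its known terms."""
    term_ids = [vocab[t] for t in query_tokens if t in vocab]
    if not term_ids:
        return np.array([], dtype=int), np.array([])
    doc_scores = np.asarray(scores[:, term_ids].sum(axis=1)).ravel()
    top = np.argsort(-doc_scores)[:top_k]
    return top, doc_scores[top]


corpus = [
    ["a", "cat", "sat"],
    ["a", "dog", "ran"],
    ["the", "cat", "and", "the", "cat"],
]
scores, vocab = build_index(corpus)
top_ids, top_scores = search(scores, vocab, ["cat"])
print(top_ids, top_scores)
```

This sketch works only because the assumed variant gives a score of zero to any term that does not occur in a document, which keeps the matrix sparse. Variants that assign a nonzero score to absent terms are the case the abstract's score shifting method addresses, and they are not covered here.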