Created: 2012-03-24 03:39
Updated: 2013-10-21 14:35


Jotting down an idea before I forget. I've used Lucene in the past for searches and I absolutely <3 it. But some of the blogs on repurposing Redis for free-text search are way off: storing all prefixes is such a kludge. So the question is, can we do something better?


This is in a very raw form. Use a simple tokenizer and a bunch of stop words to convert the input document into a set of terms. For each term, compute tf, the term frequency within the document, and store it as the docid's score in that term's sorted set. The docid is a simple opaque document identifier.

<term1>: zset<docid, tf(term in doc)>
<term2>: zset<docid, tf(term in doc)>
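A rough sketch of the indexing step, just to make the layout concrete. It keeps the zsets in a plain Python dict so it runs without a Redis server. The stop-word list, the regex tokenizer, the raw-count tf and the names tokenize / index_document are all assumptions for illustration, not a settled design.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def index_document(index, docid, text):
    """Record tf(term in doc) under each term, mirroring term -> zset<docid, tf>."""
    for term, tf in Counter(tokenize(text)).items():
        # Assumption: tf is the raw count. Against a real server with a
        # redis-py 3+ client r, this line would be r.zadd(term, {docid: tf}).
        index[term][docid] = tf

index = defaultdict(dict)
index_document(index, "doc1", "The quick brown fox and the quick dog")
index_document(index, "doc2", "A lazy dog")
```

Here index["quick"] holds doc1 with tf 2, and index["dog"] holds both documents with tf 1 each.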

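During query time, compute idf from the number of documents that contain the term (zcard for the term really). The raw count on its own would favour common terms, so it has to be inverted. Lucene does a lot more than that with boosts and what not.

weight = idf(t)^2
df(t) = no of documents that contain term (zcard for the term)
idf(t) = 1 + ln(N / (df(t) + 1)) as in Lucene's classic similarity, N = total no of documents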
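A quick sketch of the weighting, with one assumption worth flagging: zcard gives the raw document count, which on its own would favour common terms, so this uses Lucene's classic idf form 1 + ln(N / (df + 1)). num_docs (N) is a hypothetical total document count that would have to be tracked separately, and the corpus numbers are made up.

```python
import math

def idf(num_docs, doc_freq):
    """Lucene classic-similarity style idf: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def term_weight(num_docs, doc_freq):
    """Per-term weight used at query time: idf(t)^2."""
    return idf(num_docs, doc_freq) ** 2

# Hypothetical corpus of 1000 documents; doc_freq would come from zcard.
rare = term_weight(1000, 9)      # term found in 9 documents
common = term_weight(1000, 499)  # term found in 499 documents
```

The rare term ends up with a much larger weight than the common one, which is the whole point of the idf factor.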

So when we do the searching, simply intersect the sorted sets of the query terms, apply the per-term weights as above and aggregate with sum. This effectively gives a set of docids sorted by relevance (read it back with zrevrange to get the best match first). Simplistic, but works.

zinterstore dst num_terms termi ... termk weights wi ... wk aggregate sum
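To see what that command computes, here is a small pure-Python imitation of it over dict-based zsets. It is only a sketch of the semantics (intersection, per-set weights, summed scores), not a Redis client call; zinterstore_sum, the toy index and the weights are made up for illustration.

```python
def zinterstore_sum(index, terms, weights):
    """Mimic ZINTERSTORE ... WEIGHTS ... AGGREGATE SUM over dict-based zsets."""
    # Only docids present under every query term survive the intersection.
    common = set(index.get(terms[0], {}))
    for term in terms[1:]:
        common &= set(index.get(term, {}))
    # Score of a docid = sum over terms of weight(term) * tf(term in doc).
    scores = {
        docid: sum(w * index[t][docid] for t, w in zip(terms, weights))
        for docid in common
    }
    # Highest score first, like reading dst back with ZREVRANGE ... WITHSCORES.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical index: term -> {docid: tf}
index = {
    "redis": {"d1": 3, "d2": 1, "d3": 2},
    "search": {"d1": 1, "d2": 4},
}
results = zinterstore_sum(index, ["redis", "search"], [2.0, 0.5])
# results == [("d1", 6.5), ("d2", 4.0)]; d3 drops out since it lacks "search"
```

Note the intersection gives AND semantics: a document missing any one query term never shows up in the result.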

Code to come soon.

