Show HN: TF-IDF search engine in 30 lines of Scala

boyter · on June 27, 2011

You can do the same in Python (in about the same number of lines, I suspect). Below is the core of the linked implementation (Vector Space) in 15 lines of Python. All you need is something to build a clean concordance of the search terms and of each document; then compare the terms against the documents and sort by the return value of relation.

Actually, the implementation below is usually the first non-trivial thing I try to implement in any language I am learning.

  import math

  class VectorCompare:
    def magnitude(self, concordance):
      total = 0
      for word, count in concordance.items():
        total += count ** 2
      return math.sqrt(total)  # Euclidean length of the count vector

    def relation(self, concordance1, concordance2):
      topvalue = 0
      for word, count in concordance1.items():
        if word in concordance2:
          topvalue += count * concordance2[word]  # dot product over shared words
      denominator = self.magnitude(concordance1) * self.magnitude(concordance2)
      return topvalue / denominator if denominator else 0  # empty input scores 0
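
For anyone wondering what the "build a concordance, compare, sort" part might look like, here is a minimal sketch. The names concordance and search, the regex tokenizer, and the document ids are my own assumptions for illustration, not part of the linked implementation; relation is restated as a plain function so the snippet runs on its own.

```python
import math
import re
from collections import Counter


def concordance(text):
    # Assumed tokenizer: lowercase, then keep runs of letters, digits and apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def relation(c1, c2):
    # Cosine similarity between two word-count mappings.
    dot = sum(count * c2[word] for word, count in c1.items() if word in c2)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


def search(query, documents):
    # documents: mapping of (hypothetical) document id -> raw text.
    q = concordance(query)
    scored = ((relation(q, concordance(text)), doc_id)
              for doc_id, text in documents.items())
    # Drop documents with no shared terms, best match first.
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)
```

Calling search("cat", docs) returns a list of (score, id) pairs. Note this ranks on raw term counts only; there is no idf weighting in this sketch.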

EDIT - Was going to fork this on GitHub and make a quick Python port but corporate firewall got in the way.

boyter · on June 28, 2011

Cannot edit anymore, but here is an example I had written some time ago in Python. http://www.wausita.com/2010/08/build-vector-space-search-eng...

DrJosiah · on June 27, 2011

A bit opaque, but cute. I've got a Redis + Python version (keeps the index in Redis) that I talked about last year: http://dr-josiah.blogspot.com/2010/07/building-search-engine...

whakojacko · on June 27, 2011

As someone who works on search and is a Scala fan, I'm super impressed. Next obvious question: how many lines would it take to add some sort of stemming?

felipehummel · on June 27, 2011

Maybe in a couple of lines we could do a very naive plural removal. I've seen code for stemming; it has tons of different conditions and possibilities, so I guess it would be difficult to condense that into a few lines.
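
To give a rough idea of what "very naive plural removal" could mean, here is a sketch in Python (to match the snippet upthread). The function name naive_singular and the specific suffix rules are assumptions for illustration only; this is nowhere near a real stemmer and gets plenty of words wrong.

```python
def naive_singular(word):
    # Hypothetical helper: strips a few common English plural endings.
    # No dictionary lookup, so it mangles words like "movies" or "does".
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"   # "queries" -> "query"
    if len(word) > 3 and word.endswith(("sses", "xes", "ches", "shes")):
        return word[:-2]         # "boxes" -> "box"
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]         # "documents" -> "document"
    return word
```

It would be applied to each token before counting, on both the query and the documents, so that "document" and "documents" land in the same bucket.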

freakinasshowl · on June 27, 2011

I don't think tf-idf is relevant for a tiny in-memory index >__>

felipehummel · on June 27, 2011

The index does not have to be tiny; the source code is. I tested with a 500k-document dataset.

boyter · on June 27, 2011

Out of curiosity, how long did that take to index?