Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Show HN: TF-IDF search engine in 30 lines of Scala (github.com/felipehummel)
43 points by felipehummel on June 26, 2011 | hide | past | favorite | 8 comments


You can do the same in Python (in about the same amount of lines I suspect). The below is the core of the linked implementation (Vector Space) in 15 lines of Python. All you need is something to build a clean concordance on the search terms/documents, then just compare the terms against the documents and sort based on the return value of relation.

Actually the below implementation is usually the first non trivial thing I try to implement in any language I am learning.

  import math

  class VectorCompare:
    def magnitude(self,concordance):
      total = 0
      for word,count in concordance.iteritems():
        total += count ** 2
      return math.sqrt(total)

    def relation(self,concordance1, concordance2):
      relevance = 0
      topvalue = 0
      for word, count in concordance1.iteritems():
        if concordance2.has_key(word):
          topvalue += count * concordance2[word]
      return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

EDIT - Was going to fork this on GitHub and make a quick Python port but corporate firewall got in the way.


Cannot edit anymore, but here is an example I had written some time ago in Python. http://www.wausita.com/2010/08/build-vector-space-search-eng...


A bit opaque, but cute. I've got a Redis + Python version (keeps the index in Redis) that I talked about last year: http://dr-josiah.blogspot.com/2010/07/building-search-engine...


As someone who works on search and is a scala fan, I'm super impressed. Next obvious question: how many lines would it take to add some sort of stemming?


Maybe in a couple of lines we could do a very naive plural removal. I've seen code for stemming, they have tons of different conditions and possibillities, I guess it would be difficult to condense that into few lines.


i don't think tf-idf relevant for tiny in-memory index >__>


The index does not have to be tiny, the source code is. I tested with a 500k documents dataset.


Out of curiosity how long did that take to index?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: