You can do the same in Python (in about the same amount of lines I suspect). The below is the core of the linked implementation (Vector Space) in 15 lines of Python. All you need is something to build a clean concordance on the search terms/documents, then just compare the terms against the documents and sort based on the return value of relation.
Actually the below implementation is usually the first non trivial thing I try to implement in any language I am learning.
import math
class VectorCompare:
def magnitude(self,concordance):
total = 0
for word,count in concordance.iteritems():
total += count ** 2
return math.sqrt(total)
def relation(self,concordance1, concordance2):
relevance = 0
topvalue = 0
for word, count in concordance1.iteritems():
if concordance2.has_key(word):
topvalue += count * concordance2[word]
return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
EDIT - Was going to fork this on GitHub and make a quick Python port but corporate firewall got in the way.
As someone who works on search and is a scala fan, I'm super impressed. Next obvious question: how many lines would it take to add some sort of stemming?
Maybe in a couple of lines we could do a very naive plural removal. I've seen code for stemming, they have tons of different conditions and possibillities, I guess it would be difficult to condense that into few lines.
Actually the below implementation is usually the first non trivial thing I try to implement in any language I am learning.
EDIT - Was going to fork this on GitHub and make a quick Python port but corporate firewall got in the way.