2012/05/23
To add an efficient search function to the product I work on, I was looking for a good indexer. Elastic Search, a Java indexer managed through a REST API, looks good, but it requires setting up a dedicated server: it is not a library but a full application. Another option was Xapian, which looks efficient but is not very well documented.
Then I discovered Whoosh, a Python library that offers indexing and search features. The documentation and the API make it really easy to use. Its performance is probably worse than that of Elastic Search or Xapian, but it should be enough for a lot of projects. The library provides many search strategies and features (stemming, faceting, highlighting…). In conclusion, if you have a Python project that requires full-text search, you should definitely have a look at it.
To illustrate this article, here is a little snippet I wrote that indexes a list of blog posts stored in a MongoDB database.
```python
import os

from bson.objectid import ObjectId
from pymongo import MongoClient
from whoosh.fields import Schema, ID, KEYWORD, TEXT
from whoosh.index import create_in
from whoosh.query import Term

# Set up the index schema: title and content are indexed as text, tags as
# keywords. Only titles and ids are stored inside the index. The id is marked
# unique so that update_document() replaces an existing entry.
schema = Schema(title=TEXT(stored=True),
                content=TEXT,
                nid=ID(stored=True, unique=True),
                tags=KEYWORD)

# Create the index directory if it does not exist.
if not os.path.exists("index"):
    os.mkdir("index")

# Initialize the index.
index = create_in("index", schema)

# Open the database connection.
connection = MongoClient("localhost", 27017)
db = connection["cozy-home"]
posts = db.posts

# Fill the index with the posts from the database.
writer = index.writer()
for post in posts.find():
    writer.update_document(title=post["title"],
                           content=post["content"],
                           nid=str(post["_id"]),
                           tags=post["tags"])
writer.commit()

# Search the index for posts containing "test", then display each result.
with index.searcher() as searcher:
    results = searcher.search(Term("content", "test"))
    for hit in results:
        # Title and id are stored in the index; the content is read back
        # from MongoDB using the stored id.
        post = posts.find_one(ObjectId(hit["nid"]))
        print(hit["title"])
        print(post["content"])
```