Whoosh: full-text search with Python

To add an efficient search function to the product I work on, I was looking for a good indexer. Elastic Search, a Java indexer that is managed through a REST api, looks good but it requires to set-up a dedicated server: it’s not a library but a full software. Another option was Xapian, looks efficient, but not very well documented.

Then I discovered Whoosh, a Python library which offers indexing and search features. The documentation and the API makes it really easy to use. The performance are probably worst than the Elastic Search or Xapian but it should be enough for a lot of projects. The library provides a lot of search strategies and functionalities (stemming, faceting, highlighting…). In conclusion, if you have a Python project that requires full-text search, you should definitely have a look at it.

To illustrate this article here is a little snippet I wrote that index a list of blog posts located in MongoDB database.

import os

from whoosh.fields import Schema, ID, KEYWORD, TEXT
from whoosh.index import create_in
from whoosh.query import Term

from pymongo import Connection
from bson.objectid import ObjectId

# Set index, we index title and content as texts and tags as keywords.
# We store inside index only titles and ids.
schema = Schema(title=TEXT(stored=True), content=TEXT,
                nid=ID(stored=True), tags=KEYWORD)

# Create index dir if it does not exists.
if not os.path.exists("index"):
    os.mkdir("index")

# Initialize index
index = create_in("index", schema)

# Initiate db connection
connection = Connection('localhost', 27017)
db = connection["cozy-home"]
posts = db.posts

# Fill index with posts from DB
writer = index.writer()
for post in posts.find():
    writer.update_document(title=post["title"],
                           content=post["content"],
                           nid=unicode(post["_id"]),
                           tags=post["tags"])
writer.commit()

# Search inside index for post containing "test", then it displays
# results.
with index.searcher() as searcher:
    result = searcher.search(Term("content", u"test"))[0]
    post = posts.find_one(ObjectId(result["nid"]))
    print result["title"]
    print post["content"]
About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: