Experimenting with Haystack

As a general principle, I put Whoosh in the same category as SQLite: great for getting started, wonderful for single-user or really small-scale apps, but not suitable for large-scale deployment.

This. The more I poke around, the more I'm convinced this is accurate.

I did some experimenting with Whoosh, Xapian, and Solr the other day, and have compiled the following simple stats. I kept periodically running into a memory wall with Solr (see below), so there are benchmarks from my initial setup with 512MB RAM, as well as from an upgrade to 1GB RAM.

(If you're looking for a setup guide for any of these backends, sorry, this isn't that.)

Some background: I'm the developer for a Q&A site forked from OSQA. It runs on Django, and we use django-haystack for search. Based on our indexing rules, we currently have more than 10K indexed questions, and that number grows daily.

$ django-admin.py rebuild_index --noinput
Removing all documents from your index because you said so.
All documents removed.
Indexing 10642 questions.

We're also still ironing out the index schema, so I'm not sure what kind of impact an “optimized schema” will have on any of this.

My expectation was that these would follow a similar progression to SQLite, MySQL, and PostgreSQL (not to bash any of those, of course ;) ). Whoosh would be the worst performer -- everyone uses it because it's the easiest to set up, so it has to suffer somewhere, right? Next would be Xapian: better than Whoosh for largish sites, but you don't see many people talking about it, so it must not be much better. Last, and best, would be Solr. In casual Haystack-related browsing, Solr is the one you see people talking about most for big installations. And it's a standalone server app (runs on Tomcat or Jetty), so it has to be really performant -- right?

Before I tried benchmarking any searches, I did a quick test of how long it took each engine to build our index from scratch (on the original 512MB RAM). Times are the “real” time reported by the time command.

$ time django-admin.py rebuild_index --noinput

Whoosh: 3m03s
Xapian: 2m06s
Solr:   0m18s

No surprises there -- Whoosh is the worst, Xapian is a bit better, and Solr blows them both out of the water.

Next I ran four benchmarks -- single (common) term search, single term sorted search, multiple (common) term search, and multiple term sorted search. Note that these are full queryset evaluations (using list to pull all the results), so they are sort of worst-case scenario type stuff. We don't actually do this in any real code ;) .
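For anyone who wants to repeat this kind of full-evaluation timing outside IPython, here is a minimal sketch using the standard library's timeit module. The helper name best_of and the stand-in workload fake_search are hypothetical, not part of the original benchmark, and taking the best of the repeats is an assumption about how the figures in this post were reported.

```python
import timeit


def best_of(func, repeat=5, number=5):
    """Run func `number` times per round, for `repeat` rounds.

    Returns the best per-call time in seconds. Taking the minimum of the
    rounds is an assumption about how the numbers in this post were reported.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


def fake_search():
    # Hypothetical stand-in workload. In the real benchmark this would be
    # something like: list(SearchQuerySet().auto_query('term'))
    return list(range(1000))


per_call = best_of(fake_search)
print("best of 5: %.3f ms per loop" % (per_call * 1000))
```

Swapping fake_search for a lambda that wraps a real query in list() forces the whole result set to be fetched on every call, which is what makes these numbers a worst case.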

Each engine was set up with the bare minimum of work -- no tweaks, no optimizations, etc. The only change was shortening the indexed Question URLs (returned by get_absolute_url) so that they come in under Xapian's 245-character term limit.
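One way to keep a value within a term limit like that is to truncate it before indexing. This is only a sketch: fit_term is a hypothetical helper rather than anything from Haystack or Xapian, and it is not necessarily how the URLs were shortened for these benchmarks. It also assumes the limit is counted in bytes of the UTF-8 encoding rather than in characters.

```python
XAPIAN_TERM_LIMIT = 245  # the limit quoted above; assumed here to be in bytes


def fit_term(value, limit=XAPIAN_TERM_LIMIT):
    """Return value truncated so its UTF-8 encoding is at most `limit` bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops a multi-byte character that was cut in half
    return encoded[:limit].decode("utf-8", errors="ignore")


long_url = "/questions/1234/" + "a-very-long-slug-" * 20
short_url = fit_term(long_url)
print(len(long_url), len(short_url))
```

Truncating on the encoded bytes rather than on characters keeps non-ASCII slugs within the limit too, at the cost of sometimes ending up a byte or two shorter than the maximum.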

Benchmark setup:

>>> from haystack.query import SearchQuerySet

Benchmarks (512MB RAM, 1GB RAM):

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('term'))
Whoosh: 899ms, 922ms
Xapian: 597ms, 577ms
Solr:   3.55s, 1.22s

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('term').order_by('added_at'))
Whoosh: 5.72s, 5.99s
Xapian: 613ms, 557ms
Solr:   1.17s, 1.22s

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('three term phrase'))
Whoosh: 899ms, 853ms
Xapian: 210ms, 200ms
Solr:   2.24s, 481ms

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('three term phrase').order_by('added_at'))
Whoosh: 6.32s, 6.1s
Xapian: 196ms, 199ms
Solr:   1.32s, 492ms

Whoosh performed pretty badly, which was not surprising; it was roughly six to seven times slower for sorted searches than for unsorted ones, though I couldn't tell you why. Solr (which I previously saw as a sort of search analog to Redis) obviously benefits from more RAM, but still has pretty lousy times for single-term queries. But then Xapian comes along with really good times for all benchmarks. Who'd have thought!
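To put a number on the Whoosh sorting penalty, the sorted and unsorted timings can be divided directly. The sketch below only uses the figures from the benchmarks above; the dictionary keys and the slowdown helper are names made up for illustration.

```python
# Whoosh timings copied from the benchmarks above, in seconds: (512MB, 1GB)
whoosh = {
    "single": (0.899, 0.922),
    "single_sorted": (5.72, 5.99),
    "multi": (0.899, 0.853),
    "multi_sorted": (6.32, 6.1),
}


def slowdown(sorted_times, plain_times):
    """Ratio of sorted to unsorted time for each RAM configuration."""
    return [s / p for s, p in zip(sorted_times, plain_times)]


single_ratio = slowdown(whoosh["single_sorted"], whoosh["single"])
multi_ratio = slowdown(whoosh["multi_sorted"], whoosh["multi"])

print("single term: " + ", ".join("%.1fx" % r for r in single_ratio))
print("multi term:  " + ", ".join("%.1fx" % r for r in multi_ratio))
```

All four ratios land between about 6x and 7.5x, so the penalty is a large constant factor at this index size. One data point per query type says nothing about how it scales with more documents.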

Xapian's triumph was surprising, but at the same time welcome. Xapian is similar to Whoosh in terms of setup (that is, very easy), and we wanted to avoid adding another server app to our deployment setup, which Solr would have required. So we'll be switching to Xapian for our search, and should see a pretty good performance boost as a result.