As a general principle, I put Whoosh in the same category as SQLite: great for getting started, wonderful for single-user or really small-scale apps, but not suitable for large-scale deployment.
This. The more I poke around, the more I'm convinced this is accurate.
I did some experimenting with Whoosh, Xapian, and Solr the other day, and have compiled the following simple stats. I periodically ran into a memory wall with Solr (see below), so there are benchmarks from my initial setup with 512MB RAM, as well as from an upgrade to 1GB RAM.
(If you're looking for a setup guide for any of these backends, sorry, this isn't that.)
Some background: I'm the developer for a Q&A site forked from OSQA. It runs Django, and we use django-haystack for search. Based on our indexing rules, we currently have more than 10K indexed questions, and that number grows daily.
$ django-admin.py rebuild_index --noinput
Removing all documents from your index because you said so.
All documents removed.
Indexing 10642 questions.
We're also still ironing out the index schema, so I'm not sure what kind of impact an “optimized schema” will have on any of this.
My expectations were that these would follow a similar progression to SQLite, MySQL, and PostgreSQL (not to bash any of those, of course ;) ). Whoosh would be the worst performer -- everyone uses it because it's the easiest to set up, so it has to suffer somewhere, right? Next would be Xapian: better than whoosh for largish sites, but you don't see many people talking about it, so it must not be much better. Last, and best, would be Solr. In casual haystack-related browsing, you see the most people talking about Solr for big installations. And it's a standalone server app (runs on tomcat or jetty), so it has to be really performant -- right?
Before I tried benchmarking any searches, I did a quick test of how long it took each engine to build our index from scratch (on the original 512MB RAM). Times are the "real" time reported by:

$ time django-admin.py rebuild_index --noinput
Whoosh: 3:03m
Xapian: 2:06m
Solr:   0:18m
No surprises there -- Whoosh is the worst, Xapian is a bit better, and Solr blows them both out of the water.
Next I ran four benchmarks -- single (common) term search, single term sorted search, multiple (common) term search, and multiple term sorted search. Note that these are full queryset evaluations (using list to pull all the results), sort of worst-case scenario type stuff. We don't actually do this in any real code ;) .
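For anyone who wants to reproduce this kind of measurement outside IPython, the `timeit -r5 -n5` magic maps roughly onto the stdlib `timeit` module with `repeat=5, number=5`. A minimal sketch, using a stand-in function since the real `list(SearchQuerySet().auto_query('term'))` call needs a configured Haystack backend:

```python
import timeit

# Stand-in for list(SearchQuerySet().auto_query('term')) -- the real call
# requires a configured Haystack backend, so we fake some work here.
def run_query():
    return [i for i in range(10000)]

# IPython's "timeit -r5 -n5 expr" is roughly repeat=5, number=5:
# five rounds of five executions each; the best round is what gets reported.
timings = timeit.repeat(run_query, repeat=5, number=5)
best_per_call = min(timings) / 5
print("best of 5 rounds: %.2fms per call" % (best_per_call * 1000))
```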
Each engine was set up with the bare minimum of work -- no tweaks, no optimizations, etc. The only difference was shortening the indexed Question URLs (returned by get_absolute_url) to come in under Xapian's 245-character term limit.
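If your URLs can blow past that limit, the simplest workaround is to clip the value before it's indexed (e.g. in a Haystack SearchIndex prepare method). A minimal sketch -- the helper name is mine, but the 245-character cap is Xapian's term-length limit:

```python
# Xapian rejects terms longer than 245 characters, so clip indexed URLs.
XAPIAN_TERM_LIMIT = 245

def truncate_url(url, limit=XAPIAN_TERM_LIMIT):
    """Clip a URL so it can be indexed as a single Xapian term.

    In Haystack you'd call this from something like a prepare_url()
    method on your SearchIndex subclass.
    """
    return url[:limit]

print(len(truncate_url("/questions/" + "x" * 300)))  # clipped to 245
```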
Benchmark setup:

>>> from haystack.query import SearchQuerySet

Benchmarks (512MB RAM, 1GB RAM):

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('term'))
Whoosh: 899ms, 922ms
Xapian: 597ms, 577ms
Solr: 3.55s, 1.22s

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('term').order_by('added_at'))
Whoosh: 5.72s, 5.99s
Xapian: 613ms, 557ms
Solr: 1.17s, 1.22s

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('three term phrase'))
Whoosh: 899ms, 853ms
Xapian: 210ms, 200ms
Solr: 2.24s, 481ms

>>> timeit -r5 -n5 list(SearchQuerySet().auto_query('three term phrase').order_by('added_at'))
Whoosh: 6.32s, 6.1s
Xapian: 196ms, 199ms
Solr: 1.32s, 492ms
Whoosh performed pretty badly, which was not surprising; it was dramatically worse for sorted searches, though I couldn't tell you why. Solr (which I'd previously pictured as a sort of search analog to Redis) obviously benefits from more RAM, but still posts pretty lousy times for short queries. And then Xapian comes along with really good times across all benchmarks. Who'd have thought!
Xapian's triumph was surprising, but welcome all the same. Xapian is about as easy to set up as Whoosh, and we wanted to avoid adding another server app to our deployment, which Solr would have required. So we'll be switching to Xapian for our search, and should see a pretty good performance boost as a result.
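For the curious, the switch itself is just a settings change. A sketch using the Haystack 1.x-style settings that the xapian-haystack backend expects (the index path is my own placeholder; newer Haystack versions use a HAYSTACK_CONNECTIONS dict instead):

```python
# settings.py (sketch) -- point Haystack at the Xapian backend.
import os

# Requires the xapian-haystack package and the Xapian Python bindings.
HAYSTACK_SEARCH_ENGINE = 'xapian'

# On-disk location for the Xapian index; any writable path works.
HAYSTACK_XAPIAN_PATH = os.path.join(os.path.dirname(__file__), 'xapian_index')
```

After changing the backend, rerun `django-admin.py rebuild_index` to build the new index from scratch.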