Indexing and storage considerations

During this project we have considered several options for storing and indexing our data. Here is a chronological review of what we tried, and the reasoning behind our current setup (CouchDB for storage, SOLR for indexing).

CouchDB

CouchDB stores JSON really easily, provides a simple API and replication, and supports MapReduce views for searching and displaying data.

However, it does not include full text searching. Range queries can be done via MapReduce views, but not if a key is missing: a view defined on ["author","title","year"] requires all three to succeed, and there is no wildcard. Given that I want to index data without stipulating its content, this is an issue.
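
To make the limitation concrete, here is a minimal Python sketch that emulates a view as a sorted list of composite keys. It is not CouchDB code: the records and the prefix_range helper are made up for illustration, and the sorting is plain Python tuple ordering rather than CouchDB's own collation. It only shows why a range query works when the leading parts of the key are known and does not help when they are not.

```python
from bisect import bisect_left

# Hypothetical records; the fields follow the ["author","title","year"] example.
docs = [
    {"author": "Smith", "title": "Alpha", "year": 2009},
    {"author": "Jones", "title": "Beta", "year": 2010},
    {"author": "Smith", "title": "Gamma", "year": 2010},
]

# Emulate the view index: one row per document, kept sorted by composite key.
rows = sorted((d["author"], d["title"], d["year"]) for d in docs)


def prefix_range(rows, prefix):
    """Return the rows whose key starts with `prefix` (a tuple)."""
    start = bisect_left(rows, prefix)  # first row that sorts at or after prefix
    end = start
    while end < len(rows) and rows[end][:len(prefix)] == prefix:
        end += 1
    return rows[start:end]


# Leading component known: the matches sit next to each other in the index.
smith_rows = prefix_range(rows, ("Smith",))

# Only the last component known: the matches are scattered through the index,
# so every row has to be checked (or a second view keyed on year is needed).
rows_2010 = [r for r in rows if r[2] == 2010]
```

In this toy index the two "Smith" rows are adjacent, while the two 2010 rows are separated by a 2009 row, so no single start/end range can pick them out.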

CouchDB with ElasticSearch

ES provides really easy full text indexing. But on my first attempt it did not offer the faceted browsing I had hoped for, and the documentation was a bit confusing too. So although great, it was not up to scratch. ES was very new at the time, and the situation has much improved since. More on ES to follow.

CouchDB with Lucene

Given the lack of faceting in ES, I moved to using Lucene directly with CouchDB. Lucene is what ES and SOLR run on underneath, and is the de facto software for actually doing full text searching. This worked well, and allowed me to query full text indexes from views defined in CouchDB. However, Lucene itself does not do faceting. I knew this, but it was no loss compared with the first ES attempt. I considered doing the faceting on the client side, but with anything more than about 1000 records it was just too slow.
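
For readers unfamiliar with faceting, the sketch below shows roughly what doing it outside the index amounts to. It is written in Python for illustration only (it is not the client code I actually tried), and the records and the facet_counts helper are hypothetical.

```python
from collections import Counter

# Hypothetical search results, as they might come back from a full text query.
records = [
    {"author": "Smith", "year": 2009},
    {"author": "Jones", "year": 2010},
    {"author": "Smith", "year": 2010},
    {"title": "No author or year on this one"},
]


def facet_counts(records, field):
    """Count how often each value of `field` occurs, skipping records without it."""
    return Counter(r[field] for r in records if field in r)


author_facet = facet_counts(records, "author")
year_facet = facet_counts(records, "year")
```

Every record in the result set has to be fetched and walked once per facet, which is the cost that stops being acceptable as result sets grow.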

CouchDB with SOLR

So I used SOLR. There is nothing wrong with SOLR; it is mature and well supported. I had just tried to avoid it on this project as I was looking for a more flexible JSON / schema-less approach. However, given the previous issues, I tried it again. I found that it was actually not too difficult to write a dynamic schema that is flexible enough for me to throw data at it without knowing the keys in advance, and get facet counts back quite easily. Also, later versions of SOLR support JSON in and out, so that is good too. I was settled on SOLR, but then:
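
As a rough illustration of what asking SOLR for facet counts looks like, here is a Python sketch that builds a facet query URL and tidies up the counts that come back. The host, the "bibserver" core name and the author_s / year_s field names are assumptions (the field names suppose a dynamic field pattern such as *_s in the schema); the q, rows, wt, facet and facet.field parameters are standard SOLR query parameters.

```python
from urllib.parse import urlencode

# Hypothetical SOLR location and core name; adjust to the actual install.
SOLR_SELECT = "http://localhost:8983/solr/bibserver/select"


def facet_query_url(query, facet_fields):
    """Build a select URL asking for facet counts as JSON."""
    params = [
        ("q", query),
        ("rows", 0),  # only the counts are wanted here, not the documents
        ("wt", "json"),
        ("facet", "true"),
    ]
    params += [("facet.field", f) for f in facet_fields]
    return SOLR_SELECT + "?" + urlencode(params)


def pair_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[::2], flat[1::2]))


url = facet_query_url("*:*", ["author_s", "year_s"])
counts = pair_counts(["Smith", 2, "Jones", 1])
```

The pair_counts helper is there because SOLR's JSON output, by default, returns each facet field as a flat list alternating values and counts.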

ElasticSearch

At OKCON I met someone who was also using ES and was very enthusiastic about it. He believed that versions later than the one I had tried would enable me to do the faceting I required, and I was encouraged by meeting someone else using ES and hearing about others doing so. I really want the flexibility and scalability of ES, and ideally could use it as storage without CouchDB, but I need to overcome the aforementioned problems first.

Current situation

Still using CouchDB and SOLR, which works well, but reconsidering ES. If it works, and I have no other functional need for CouchDB, this would be a great simplification of the stack. However, I am still working on solving the faceting problem:

Update 6th July 2011

The latest versions of ElasticSearch allow for dynamic definitions of how a document should be indexed, as well as versioning, providing the flexibility required for pulling in the sorts of data we could see in BibServer. It is definitely looking like a good alternative to CouchDB and SOLR. However, it is probably a good idea to support multiple indexing engines anyway: the architecture should support plugging in alternative underlying storage and index solutions.
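
One way such a pluggable architecture could look is sketched below in Python. This is only an illustration of the idea, not BibServer code: the IndexBackend interface, its method names and the InMemoryBackend stand-in are all hypothetical.

```python
from abc import ABC, abstractmethod


class IndexBackend(ABC):
    """Hypothetical interface that the rest of the application would code against."""

    @abstractmethod
    def index(self, doc_id, doc):
        """Store or replace a document."""

    @abstractmethod
    def search(self, text):
        """Return the ids of documents containing `text`."""


class InMemoryBackend(IndexBackend):
    """Toy stand-in; a CouchDB + SOLR or an ES backend would offer the same methods."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, doc):
        self._docs[doc_id] = doc

    def search(self, text):
        needle = text.lower()
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if any(needle in str(value).lower() for value in doc.values())
        )


backend = InMemoryBackend()
backend.index("a", {"title": "Indexing with Lucene"})
backend.index("b", {"title": "Storage in CouchDB", "year": 2011})
hits = backend.search("lucene")
```

With the application talking only to the interface, switching engines would mean writing one new backend class rather than touching the callers.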

May well switch back to ES for continuing development.

On July 2nd, 2011, posted in: bibserver, bibsoup by
