The impact of document IDs on performance of CouchDB
(This text is copied from blog.inoi.fi to my personal blog, as I no longer work there and think this is a keeper)
One major part of our product is a data acquisition framework. The framework is responsible of gathering data on devices and sending it to the server. The server then processes the data and eventually stores it in CouchDB.
CouchDB storage is pretty simple. It's just a single database where the data gets inserted with a random ID. Removals or modifications are hardly done to the data after it has been inserted there, its only function is just to be read later and to be displayed as graphs and such
Things started to get weird
However, the data server started to behave strangely. First symptoms were slow inserts and huge disk usage of the database. For the slow inserts we first suspected the reason was that we were not using CouchDB's bulk document API for inserting the data but after fixing that we only got a minor speed increase. The compaction of the database didn't really help, as it compacted the database too slow, peak rate being 300 documents per second and the average about 100 doc/s. With 8 million docs that takes a while. Worst thing was that the compaction didn't even reduce the disk usage. The database ate 26 gigabytes of disk containing only 8 million documents. That's a whopping 3 kilobytes per document, and the documents were really about 100 bytes in size.
What was even weirder we didn't really see anything strange in the way the server performed. IO ops were low, CPU usage was low and there was a plenty of free memory. Heck, we even could write zeros (dd if=/dev/zero of=/var/somefile.img) on the disk at the rate of 50Mb/s. Even killing most of the other services didn't help: CouchDB just kept being slow. We even upgraded the CouchDB in our test environment to try if it helped, but we only gained small performance gain from that operation.
Let's not be that random
As you might have heard, the problem with the random is that you never can be sure. So after a week or so pondering the issue we stumbled upon on a chapter about bulk inserts and monotonic DocIDs in the excellent CouchDB guide. With a tip from the awesome folks at #couchdb we rebuilt our system using sequential IDs instead of purely random IDs.
We copied the idea of sequential IDs straight from CouchDB which uses them for its auto-generated DocIDs. We just needed to implement it in Python and ended up with the following:
SequentialID is a random 32 character hexadecimal string. The first 26 characters is a prefix that stays the same for approx. 8000 subsequent calls. The suffix is increased monotonically. After around 8000 calls the the prefix is regenerated and the suffix is reseted to a small positive value. We also built another one using ISO-formatted UTC-timestamp with few random digits suffixed.
Now you might think that we would have collisions with the
SequentialID, especially if multiple processes are writing to the same CouchDB database. However, we're not since the prefix is a random generated string and the entropy (i.e. string length) is big enough to make the collision highly unlikely. Plus the SequentialID is never shared between processes it is regenerated for every single one instead.
Performance rocketed through the roof
No fix would be good without performance metrics so Petri wrote a small benchmarking script.
This showed us the huge performance increase we got. The write speed is almost four times as fast with sequential IDs than with random IDs. Not to mention that the database takes a one seventh of the space on the disk! We didn't try the compaction speed, but everything indicates that should be a lot faster too.
How does it work?
The reason why this worked is due to the design how CouchDB stores it data on the disk. Internally CouchDB uses B+-tree (from now on referred as B-tree) to store the data of a single database. Now if the data that is inserted in the B-tree (in this case, the DocIDs) is random, it causes the tree to fan out quickly. As the minimum fill rate is 1/2 for every internal node, the nodes are mostly filled up to the 1/2 (as the data spreads evenly due to its randomness) generating more internal nodes than before.
More the internal nodes the more disk seeks the CouchDB has to perform to write the correct leaf to write the data in to. This is why we didn't see any IO ops stacking up: the disk seeks do not show up in the iostat! And this was the single biggest cause to slow down both reads and especially writes of our CouchDB database.
The problem aren't even limited to the disk seeking. The randomness of the DocIDs causes the tree structure to be changed often (due to the nature of B-trees). When using append only log the database has no other way to do than rewrite the whole new structure of the tree to the end of the append only log. As this gets more common and common as more random data gets poured in, the disk usage grows rapidly, eventually hogging a lot more disk than the data in the database is actually requiring. Not only this, this slow down compaction too, as the compaction needs to constantly rebuild the new database.
The B-tree is why using the (semi-)sequential IDs is a real life saver. The new model causes the database to be filled in orderly fashion and the buckets (i.e. leafs) are filled in instead of leaving them half full. Best part here is that the auto generated IDs by CouchDB (which were not an option for us) already use the sequential ID scheme, so using those IDs you don't really need to worry a thing.
So remember kids: if you cram loads of data in your CouchDB, remember to select your document ID scheme carefully!