It's story time! Recently, I consulted for a client on upgrading from their current big data infrastructure, centered on Azure SQL and Azure Queues, to a more robust and scalable architecture based on Couchbase Server and HDInsight. As part of the Couchbase POC, we ran load tests on various configurations of Couchbase clusters on Azure.
The particular test I'm writing about involved loading a dataset of 100,000,000 documents into the database, and then testing how it handles 10k updates per second.
The test parameters were as follows:
- 100M documents
- 2.5k average JSON document size
- Working set of 10% (10M items)
- GUID-based key (38 bytes)
- 1 replica configured for the bucket
- Azure worker role instances inserting data in a tight loop
- 4 x A5 Azure VM instances (CPU: 2 x 1.66 GHz, RAM: 14 GB) running Windows Server 2012 R2
- Data files on a striped 4 x Azure Blob 30GB volume (120GB total), which gives a theoretical throughput of ~2000 IOPS
- Index files on the VM's temporary drive (D:)
In our case, using Azure blobs as storage, let's assume 30% headroom; we're also using the default 85% high water mark. Applying the cluster sizing guidelines from the Couchbase documentation gives us:
Metadata = 100M * (56 + 38) * 2 = 17.5GB
Total Data = 100M * 2.5K * 2 = 476.8GB
Working Set = 476.8GB * 10% = 47.7GB
Total RAM required = (Metadata + Working Set) * <storage headroom> / <high water mark>
Total RAM = (17.5 + 47.7) * 1.3 / 0.85 = 99.7GB
Using 14GB VMs, and leaving about 1GB for the OS, that's 13GB for Couchbase, meaning we need 99.7 / 13 = 7.67 ~= 8 nodes. We went with 4 to save costs and to see what would happen.
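For readers who want to plug in their own numbers, the sizing arithmetic above can be expressed as a short script. The constants mirror our test parameters; the 56-byte per-item metadata overhead is the figure from the Couchbase sizing guidelines:

```python
import math

GiB = 1024 ** 3
items = 100_000_000
key_size = 38             # GUID-based key, bytes
meta_overhead = 56        # per-item metadata overhead, bytes
doc_size = 2.5 * 1024     # average JSON document size, bytes
copies = 2                # 1 active + 1 replica
working_set_ratio = 0.10  # 10% of items actively accessed
headroom = 1.30           # 30% storage headroom
high_water_mark = 0.85    # default eviction threshold
ram_per_node_gib = 13     # 14 GB VM minus ~1 GB for the OS

metadata = items * (meta_overhead + key_size) * copies / GiB
total_data = items * doc_size * copies / GiB
working_set = total_data * working_set_ratio
total_ram = (metadata + working_set) * headroom / high_water_mark
nodes = math.ceil(total_ram / ram_per_node_gib)

print(f"Metadata:    {metadata:6.1f} GiB")
print(f"Total data:  {total_data:6.1f} GiB")
print(f"Working set: {working_set:6.1f} GiB")
print(f"Total RAM:   {total_ram:6.1f} GiB")
print(f"Nodes:       {nodes}")
```

Running it reproduces the 99.7GB / 8-node figure above.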
We created the 4-node cluster, and began provisioning worker roles to load the initial data. The first worker gave us an average of 4.5k store operations per second. Adding a second worker brought us to around 8k, and adding a third got us to just under 12k store ops per second.
With just one loader, the disk write queues drained faster than they filled; with two loaders it was about even; with three, the write queues grew continuously. It was pretty clear where the disk throughput cap was, so we brought it back down to 2 workers loading the data and left them alone for a while. The loading ran fine for the first 79M documents. Near the end of that range, we noticed that all the nodes were heavily prioritizing active data writes (which go into the disk write queue) over replication (which goes into the replication TAP queue): the write queue held steady at about 1.5M items, while the replication TAP queue grew past 3.5M items and kept climbing. Note that at this point we had around 14GB of RAM taken up by keys and metadata, and the rest by documents. Only around 8% of the active documents were resident (cached) in memory, and less than 3% of the replicas.
That 79M-item point is important, because that's when RAM usage passed the default 85% High Water Mark and Couchbase started trying to evict documents from memory to make room. However, it never managed to push RAM usage back below the High Water Mark; the cluster stayed above 85% RAM use. Around 80M items loaded, the data loader worker roles began getting "Temporary failure" errors en masse. The "Temporary OOM" error counter in the bucket monitor spiked to about the same number as the ops-per-second counter; clearly the nodes were rejecting store operations until the cluster could evict enough items from memory. At the same time, the backoff message counter also spiked to about 10k per second and the replication TAP queue stopped draining, because the nodes were all sending "backoff" messages to each other. CPU usage went from around 60% per node to a steady 100%.
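For what it's worth, temporary-failure responses like these are the kind of thing a client normally handles by retrying with exponential backoff and jitter. A minimal sketch, assuming a hypothetical `store` callable and `TemporaryFailure` exception standing in for whatever your SDK actually exposes; note that no amount of client-side retrying can rescue a cluster that can't evict its way back below the high water mark:

```python
import random
import time

class TemporaryFailure(Exception):
    """Stand-in for the SDK's temporary-OOM / temporary-failure error."""

def store_with_backoff(store, key, doc, max_retries=10):
    """Retry a store operation with exponential backoff and jitter.

    `store` is whatever callable your SDK exposes for a set/upsert;
    it is assumed to raise TemporaryFailure when the node rejects
    the operation with a temporary OOM error.
    """
    delay = 0.05  # start at 50 ms
    for _ in range(max_retries):
        try:
            return store(key, doc)
        except TemporaryFailure:
            # Sleep the base delay plus random jitter, then double
            # the delay, capped at 5 seconds.
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, 5.0)
    raise TemporaryFailure(f"gave up on {key} after {max_retries} retries")
```

In our test, backoff would only have slowed the loaders down; the cluster itself never recovered.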
At this point we stopped the worker role instances that were loading the data, but the cluster never recovered. Memory use never went down below the High Water Mark, the disk write queue stopped draining, and the TAP queue stayed at 4M items plus 1.5M waiting on disk.
To fix the problem we first tried adding another A5 instance node to the cluster, which failed. In hindsight, it's obvious why: rebalancing uses a TAP queue to transfer data between nodes, and we monitored the bucket carefully, watching the replication TAP queue never move even a single item. The rebalance stayed at 0% completion. Deleting items to free up room failed as well. Leaving the cluster with no new incoming data overnight, in the hope that it would manage to free up memory either through working set eviction or by draining the disk write queue, also failed; the cluster remained in exactly the same condition through the night. At that point we considered either shutting down the entire cluster (accepting the loss of any unpersisted data), or using the cbbackup/cbrestore tools to back up the data, then killing the cluster and building a new one.
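The backup-and-rebuild route we considered would look roughly like this; the host, credentials, bucket name, and backup directory are placeholders, and the flags are from the 2.x-era tools, so check the docs for your version:

```shell
# Back up the bucket from a live node to a local directory
cbbackup http://node1:8091 /backups/couchbase -u Administrator -p password -b mybucket

# After rebuilding the cluster, restore the bucket into it
cbrestore /backups/couchbase http://newnode1:8091 -u Administrator -p password -b mybucket
```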
We never finished the data loading stage of the test, so the 10k updates per second test wasn't even on the menu.
Important lessons learned:
- It is possible to bring a Couchbase cluster into an irreversible failure state. I'm not saying this as criticism, but rather to reinforce the next point, which is:
- Skimping on resources (RAM, disk IOPS) when provisioning a Couchbase cluster is very bad.
- The RAM sizing guidelines in the Couchbase documentation are phrased as suggestions. They're not. They are holy commandments.
- Overprovision your cluster. Then, if you see that it's handling the load well and has plenty of headroom, you can scale it down gradually.
- Monitor. Monitor. Monitor! I can't emphasize this point enough. With proper monitoring, you have plenty of warning of the impending failure. You can see the disk queue growing. You can see that nodes are prioritizing active writes over replication. You can see the TAP queues growing fast. You can see when nodes start sending each other backoff messages. And finally, you can see when nodes begin returning temporary OOM errors.
- Without monitoring, your first warning will be your application getting a spectacular number of "temporary failure" errors.
- Always - always! - perform full-scale load testing with your real dataset before trying anything in production. Then double, or triple, the load and see how long it takes the cluster to begin to fail. Note all the warning signs and the conditions under which the cluster fails, then draw appropriate conclusions about your future provisioning needs.
- And finally, do not underprovision your Couchbase cluster!
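To make the monitoring point concrete: Couchbase exposes per-bucket stats over its REST API, so a watchdog only needs to poll them. A sketch, assuming a 2.x-era endpoint and stat names (`ep_queue_size` for the disk write queue, `ep_tmp_oom_errors`, `mem_used`); the URL, bucket name, and stat names should all be verified against your version's stats API:

```python
import json
import urllib.request

# Hypothetical node address and bucket name -- adjust for your cluster.
STATS_URL = "http://localhost:8091/pools/default/buckets/default/stats"

# Stat names assumed from the 2.x stats API; they may differ by version.
WATCHED = ["ep_queue_size", "ep_tmp_oom_errors", "mem_used"]

def latest_samples(stats_json, watched=WATCHED):
    """Extract the most recent value of each watched stat from the
    JSON document returned by the per-bucket /stats endpoint, whose
    samples live under op.samples as lists of recent values."""
    samples = stats_json["op"]["samples"]
    return {name: samples[name][-1] for name in watched if name in samples}

def poll(url=STATS_URL):
    """Fetch the stats endpoint and return the latest watched values."""
    with urllib.request.urlopen(url) as resp:
        return latest_samples(json.load(resp))
```

Wire the output into whatever alerting you already have; a steadily growing `ep_queue_size` or a nonzero `ep_tmp_oom_errors` is exactly the early warning we described above.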