Last week I upgraded the production cluster for one of my customers to Riak 1.3.0 and applied an additional patch. Like last time, I’d like to share my experiences of the process.
The cluster I was about to update had 8 nodes running on EC2. We saw increased latencies in the 99th+ percentiles whenever traffic went up, and even overall, performance in the higher percentiles was sub-optimal.
Reading the riak-users mailing list, I stumbled upon a post mentioning an already closed pull request for riak_kv. I immediately realized that this bug could very well be responsible for the latency issues we were facing. Since the pull request was already finished and relatively small and straightforward, we decided to upgrade to 1.3.0 (from 1.2) and also apply the patch.
The issue with this particular riak_kv bug is that, when you use Bitcask and last-write-wins for your bucket, the vector clock gets incremented every time before it is copied around — for every replica! This leads to a read repair when you read the object back (since the vector clocks do not match). When the repair is done, the new “repaired” version is written back to all replicas, triggering another incorrect increment of the vector clock. If you read the object again, well, you get the idea…
You should really take a look at the read repair counter in case you are using Bitcask and last-write-wins. If the value is suspiciously high (see riak-admin status), you should consider the upgrade and patch too.
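A quick way to eyeball that counter is to grep the node’s stats output; `read_repairs` and `read_repairs_total` are the stat names Riak 1.2/1.3 expose, but they may differ in other versions:

```shell
# Show the read repair counters from the local node's stats.
# read_repairs covers the last 60 seconds, read_repairs_total
# counts since node start; both appear in riak-admin status.
riak-admin status | grep read_repairs
```

If `read_repairs_total` grows steadily under a read-heavy workload even though you are not writing concurrently, that matches the symptom of this bug.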
Preparing the Patch
Before you can apply the patch, you need Erlang R15B01. If you don’t have it, here is what you do (I’m assuming a bare Ubuntu here):
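The original build steps were roughly the following; this is a sketch of building R15B01 from source on a bare Ubuntu box, where the package list and download URL are my assumptions:

```shell
# Install build dependencies (assumed package names for Ubuntu).
sudo apt-get update
sudo apt-get install -y build-essential libncurses5-dev libssl-dev

# Fetch and build Erlang R15B01 from the official source tarball.
wget http://erlang.org/download/otp_src_R15B01.tar.gz
tar -xzf otp_src_R15B01.tar.gz
cd otp_src_R15B01
./configure
make
sudo make install
```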
Depending on your box, this can take a while. When you are done, we can fetch riak_kv, apply the patch and compile it too:
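The gist of it is to build riak_kv at the release tag with the fix applied. A sketch, where the tag name and the patch source are assumptions (use the actual pull request you want to back-port; `<PR>` is a placeholder, not the real number):

```shell
# Fetch riak_kv and check out the tag matching the Riak release.
git clone https://github.com/basho/riak_kv.git
cd riak_kv
git checkout 1.3.0

# Apply the fix. GitHub serves any pull request as a patch file
# by appending .patch to its URL; <PR> is a placeholder here.
wget -O vclock-fix.patch https://github.com/basho/riak_kv/pull/<PR>.patch
git apply vclock-fix.patch

# Build; the Makefile drives rebar and produces ebin/*.beam.
make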
Riak has a pretty nice “feature” to install patches. For Ubuntu, you can simply drop the riak_kv/ebin/riak_kv_vnode.beam file into /usr/lib/riak/lib/basho-patches and it will be picked up once you restart Riak. To check whether your patched module was loaded, you can attach to the Riak console (riak attach) and run code:which('riak_kv_vnode'). If you see "/usr/lib/riak/lib/basho-patches/riak_kv_vnode.beam", your patched riak_kv_vnode module was loaded and you can leave the console via Ctrl-D.
We upgraded all Riak nodes during normal daytime operation by doing a rolling upgrade, that is, upgrading one node at a time, waiting for it to come up and sync again, then continuing with the next one. The whole procedure went fine, there were no surprises, and Riak once again performed very well from an operational perspective. To do a rolling upgrade, just perform the following steps on one node at a time.
First, get Riak 1.3 (for Ubuntu precise 64bit in this case):
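The download would have looked something like this; the URL follows Basho’s usual download layout for that era, but verify it before relying on it:

```shell
# Fetch the Riak 1.3.0 Debian package for Ubuntu precise, 64 bit.
# URL layout is an assumption based on Basho's download structure.
wget http://s3.amazonaws.com/downloads.basho.com/riak/1.3/1.3.0/ubuntu/precise/riak_1.3.0-1_amd64.deb
```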
Then you need to stop Riak via riak stop and back up the existing configuration and ring files in case you need to roll back and restore: sudo tar -czf riak_1.2_backup.tar.gz /var/lib/riak
Now you can install 1.3.0 via sudo dpkg -i riak_1.3.0-1_amd64.deb and change your Riak configuration files (/etc/riak/app.config) according to your needs. You should take a look at the 1.3 release notes. In our case, for example, we decided to deactivate the new active anti-entropy mechanism, since we already had issues with increased response times and disk IO in the past.
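For reference, active anti-entropy is controlled in the riak_kv section of app.config; this is the 1.3 setting as documented in the release notes:

```erlang
%% /etc/riak/app.config (excerpt)
{riak_kv, [
    %% Disable active anti-entropy; {on, []} is the 1.3 default.
    {anti_entropy, {off, []}}
    %% ... other riak_kv settings ...
]}
```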
Copy the patched riak_kv_vnode.beam file to the right spot: sudo cp ~/riak_kv/ebin/riak_kv_vnode.beam /usr/lib/riak/lib/basho-patches. Start up Riak again with riak start. To be sure the patched file is used, see the instructions above. Now you just need to wait for Riak to be fully operational: riak-admin wait-for-service riak_kv riak@host.
You should also watch the log files for anything suspicious. Once riak_kv is up, you might want to wait for outstanding handoff transfers to finish (riak-admin transfers).
Now repeat these steps until your cluster is completely updated.
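Putting the per-node steps from above together, one iteration of the loop looks roughly like this (the host name and file paths are placeholders from the examples in this post):

```shell
# One iteration of the rolling upgrade, run on each node in turn.
riak stop
sudo tar -czf riak_1.2_backup.tar.gz /var/lib/riak   # config + ring backup
sudo dpkg -i riak_1.3.0-1_amd64.deb
# ...merge your settings into /etc/riak/app.config here...
sudo cp ~/riak_kv/ebin/riak_kv_vnode.beam /usr/lib/riak/lib/basho-patches
riak start
riak-admin wait-for-service riak_kv riak@host        # block until KV is up
riak-admin transfers                                 # check outstanding handoffs
```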
Impact of the Upgrade
The impact of the upgrade (and the patch) was quite amazing. We were able to significantly reduce the cluster size (from 8 to 5 nodes) while vastly improving the service quality. I cannot provide you with detailed numbers, but I can show you some graphs.
These are the latencies for a typical day, before we did the upgrade with eight nodes, measured on the application servers performing Riak fetch operations:
We kill requests that take longer than a certain threshold, which is why the 100th percentile appears capped. You can also clearly see when we let Bitcask do its merge in the middle of the night.
Now compare that to after the update and patch, using the same scale, same production traffic, same weekday, one week later:
Again, this is the same graph scale! You cannot even tell when the merge operation is happening. The improvement to the 99th and 100th percentiles was simply amazing. We were able to reduce the aborted request rate to about 0.01%, which is awesome, especially keeping in mind that we also reduced the operational costs by retiring three servers.