Anatomy of a shared memory node failure.

After my previous post about teams, I quickly received several requests to provide more details. Voici!

Gnip's redundant; you can walk up to any of our cloud instances, vaporize it, and Gnip chugs along on its merry way. We use shared memory (via Terracotta) to replicate state across nodes. As you can imagine, shared memory across network nodes isn't cheap. Like anything else, when it's overused, things can melt down.
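For concreteness, here's a minimal sketch of what a Terracotta DSO root looks like; this is not our actual code, and the class and field names are made up. Roughly speaking, you write a plain Java object graph, declare its root field in tc-config.xml, and Terracotta replicates changes made under the root's lock to the other JVMs in the cluster.

```java
import java.util.HashMap;
import java.util.Map;

public class FilterStateRegistry {

    // Hypothetical example: this field would be declared as a clustered root
    // in tc-config.xml. Every object reachable from it becomes shared state
    // that the Terracotta server has to replicate to the other nodes.
    private static final Map<String, Object> sharedFilterState = new HashMap<String, Object>();

    public static void put(String key, Object value) {
        // Under DSO, a synchronized block like this can become a clustered lock,
        // so the write (and the objects it reaches) gets shipped to the TC server.
        synchronized (sharedFilterState) {
            sharedFilterState.put(key, value);
        }
    }

    public static Object get(String key) {
        synchronized (sharedFilterState) {
            return sharedFilterState.get(key);
        }
    }
}
```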

One of our customers started injecting hundreds of thousands more actors into their Filter rules than we'd tested for in a long time (or ever, in the true production environment; there's a "you can never actually replicate production conditions in your staging/demo/review environment" blog post brewing in me). That caused one of the nodes to start working really hard to build the objects to support the additional actors, and Terracotta had to keep up with replicating them all. The number, and size, of the objects we were asking TC to manage (across the clients, and across three TC server nodes: one primary, one secondary, and a third for good measure) caused too much lock contention across the system. TC clients started dropping out because they were spending so much time processing locks that they couldn't keep up their heartbeats to the server. Once a TC client drops out of rotation, it has to be bounced in order to reconnect to the TC server; in shared memory situations, you can't let the object state on the client and the server drift "too" far apart from each other, otherwise you have bigger problems.
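To make the blowup concrete, here's a rough illustration of why "hundreds of thousands of actors" turns into an enormous clustered object graph. The names are hypothetical and this is not our real object model; the point is simply that every actor carries its own little constellation of objects, and if that graph hangs off a clustered root, every one of them is something TC has to lock, replicate, and fault around the cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterObjectBlowup {

    static class Actor {
        final String name;
        // Each actor drags along more small objects of its own.
        final Map<String, String> attributes = new HashMap<String, String>();
        Actor(String name) { this.name = name; }
    }

    static class Filter {
        final List<Actor> actors = new ArrayList<Actor>(); // one entry per rule actor
    }

    public static void main(String[] args) {
        Filter filter = new Filter();
        int actorCount = 500000; // "hundreds of thousands" -- the scale in question
        for (int i = 0; i < actorCount; i++) {
            Actor a = new Actor("actor-" + i);
            a.attributes.put("source", "some-service");
            filter.actors.add(a);
        }
        // Each Actor plus its attribute map is several distinct objects, so a
        // single rule at this scale means millions of clustered objects.
        System.out.println("actors=" + filter.actors.size());
    }
}
```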

So, a node would drop out of the TC network, we'd bounce it, it would come back up, try to recreate all the objects again, and crater. We'd restart it, it'd come back up... rinse, repeat; rinse, repeat. Vicious cycle.

We resolved the issue by dramatically (by several orders of magnitude) reducing the number of objects TC was managing in this code path. We optimized the object model to keep only the bare minimum in TC, enough to preserve our cherished clustered approach; the rest of the state stays put in local VM space and is never shared.
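The shape of the fix looks roughly like this; again, a sketch with hypothetical names, not our actual classes. A small summary object stays in the clustered graph, and the heavy derived structures are rebuilt and cached locally in each JVM, where Terracotta never sees them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class FilterRuntime {

    // Small and shared: just enough state for every node to agree on which
    // rules exist and which actors they reference.
    static class SharedRuleSummary {
        final String ruleId;
        final Set<String> actorIds;
        SharedRuleSummary(String ruleId, Set<String> actorIds) {
            this.ruleId = ruleId;
            this.actorIds = actorIds;
        }
    }

    // Large and local: the expanded matching structures are rebuilt per node
    // and never handed to Terracotta, so they cost local heap instead of
    // clustered locks and replication traffic.
    private final transient Map<String, Object> localMatchIndex = new HashMap<String, Object>();

    public Object matchStructureFor(SharedRuleSummary summary) {
        Object index = localMatchIndex.get(summary.ruleId);
        if (index == null) {
            index = buildIndexLocally(summary); // recompute instead of replicate
            localMatchIndex.put(summary.ruleId, index);
        }
        return index;
    }

    private Object buildIndexLocally(SharedRuleSummary summary) {
        // Placeholder for the expensive expansion that used to live in shared memory.
        return new HashMap<String, Object>(summary.actorIds.size());
    }
}
```

The trade is local heap and a bit of recomputation on each node instead of clustered locks and replication traffic, which is exactly the trade we wanted to make.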

Other lingering side effects got cleaned up in the process, which was nice. We reduced some function call times from 45 minutes at their worst down to 45 seconds, and we cut our TC data set from 16GB to a few hundred megabytes. Along the way we also upgraded to Terracotta 2.7, which further reduced both the in-memory and on-disk data set sizes.