Our transition to AWS/EC2 is complete. It's been a fascinating three weeks of porting. Last night, we internally deployed Gnip to several EC2 instances. Our current partners are still on the "old Gnip," which is hosted in a traditional managed hosting environment, but we'll be cutting everyone over to the cloud soon.
Here's where we wound up:
- We managed to completely drop our dependency on a database for the actual application. I'm stoked about this as DBs suck, though I'm not naive enough to think we'll be able to get away with it for long; feature needs will ultimately force one into the system. We do use Amazon's SimpleDB for account management stuff, but could just as easily be using flat files stashed on S3 (remember, no persistent storage on EC2 yet; lame!).
- We went out on a limb and transitioned from a messaging-based app to traditional object/memory mapping for our data model. To cluster, we're using a network-attached-memory framework called Terracotta. Its basic promise: you write your app code as if it will run on a single instance, run it, and Terracotta manages the memory across an arbitrary number of nodes/instances; essentially, your app's memory is replicated across as many instances as you'd like. Conceptually, super cool and simple; technically, wicked cool memory management across nodes. I'm sure we'll wrestle with memory-management tuning, but the lack of multicast support in EC2 meant we'd have had to cobble together our own point-to-point infrastructure for off-the-shelf message queuing services (we were using ActiveMQ, and we were liking it), or use Amazon's Simple Queue Service, which didn't taste so good after a few bites.
- We're using RightScale to manage the various AWS instances so we wouldn't have to build our own tool to set up/tear down instances. It's a little pricey, but it saved us the few days it would have taken to "do it ourselves."
- Performance. It's slower than running on raw hardware; any virtualized service will be. Our load tests between our hosted hardware and EC2, however, were close enough to go for it. Gnip places a premium on being able to set up and tear down instances based on potentially highly variable load. Our traffic patterns don't match consumer products like web apps, which tend to grow organically and steadily, with a Digg-induced spike here and there. When we gain a partner, we have to flip the switch on handling millions of new requests, overnight.
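The no-database, flat-files-on-S3 idea above boils down to treating account data as JSON blobs behind a key/value interface. Here's a minimal sketch of that shape using local files in place of S3 (the class and field names are mine, purely illustrative, not Gnip's actual code):

```python
import json
import tempfile
from pathlib import Path

class FlatFileAccountStore:
    """Account records as JSON blobs keyed by account id -- the same
    key/value shape you'd get from S3 objects, sketched with local files."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, account_id, record):
        # One blob per account; overwriting the whole blob keeps it simple.
        (self.root / f"{account_id}.json").write_text(json.dumps(record))

    def get(self, account_id):
        path = self.root / f"{account_id}.json"
        return json.loads(path.read_text()) if path.exists() else None

store = FlatFileAccountStore(tempfile.mkdtemp())
store.put("acme", {"plan": "premium", "active": True})
```

Swap the file reads/writes for S3 GET/PUT calls and you have the "no DB" account store; the trade-off is that you give up queries and transactions, which is exactly why feature needs will eventually force a real database in.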
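The Terracotta programming model described above is "write single-instance code, let the framework cluster it." Terracotta itself is JVM-only, so this Python stand-in is only an analogy for the model, not how Terracotta works internally; the names are hypothetical:

```python
import threading

class SharedModel:
    """App code written as if there's one process and one heap.
    Under Terracotta (on the JVM), an object graph like this would be
    declared a clustered "root" in config and replicated across nodes;
    here it's just ordinary in-process state."""

    def __init__(self):
        # Terracotta honors normal language-level locking across the
        # cluster; locally, a plain lock plays the same role.
        self._lock = threading.Lock()
        self._activities = {}

    def record(self, actor, activity):
        with self._lock:
            self._activities.setdefault(actor, []).append(activity)

    def latest(self, actor):
        with self._lock:
            acts = self._activities.get(actor, [])
            return acts[-1] if acts else None

model = SharedModel()
model.record("alice", "posted a photo")
```

The appeal is that none of this code mentions messaging, serialization, or peers; clustering lives entirely in configuration, which is what let us drop the ActiveMQ plumbing.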
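That step-function traffic pattern is why elastic setup/tear-down matters more to us than raw speed. The arithmetic is simple; all the numbers below are made up for illustration, not our real capacity figures:

```python
import math

def instances_needed(peak_rps, per_instance_rps, headroom=0.25):
    """EC2 instances required to absorb peak_rps, with 25% spare
    headroom by default. Purely illustrative capacity math."""
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)

# Steady state: 2,000 req/s at ~500 req/s per instance.
baseline = instances_needed(2_000, 500)          # -> 5 instances

# A new partner flips the switch on ~3M requests/hour (~834 req/s more).
with_partner = instances_needed(2_000 + 834, 500)  # -> 8 instances
```

With consumer-style organic growth you'd add that capacity over months; with a partner signing, the jump from `baseline` to `with_partner` happens in a day, which is exactly the case cloud instances are good at.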
Here's what we'd still like to see from Amazon:
- A multicast solution across EC2 instances.
- A persistent storage solution for EC2 instances/clusters.
- Connection monitoring and optimization across their backend network. If they see that a service is gobbling up the equivalent of a dozen physical machines and doing heavy inter-machine communication (memory level, socket level, whatever), please cordon off those systems so they get the efficiency gains they normally would in a more physically managed hardware scenario. This will hopefully come with time.
- Their own failover scenario between data centers. Mirror Data Center A onto DC B so that if A starts failing, you kick everything over to B. Super pricey, but so what; offer the service and I bet folks will pay. Otherwise, you're doing the macarena to manage failovers; see Oren Michels' post on how they do it over at Mashery; lots of legwork.
Sorry... tripped down memory lane.