Gnip's Head is in the Clouds
Our transition to AWS/EC2 is complete. It's been a fascinating three weeks of porting. Last night, we internally deployed Gnip to several EC2 instances. Our current partners are still on the "old Gnip," which is hosted in a traditional managed hosting environment, but we'll be cutting everyone over to the cloud soon.
Here's where we wound up:
- We managed to completely drop our dependency on a database for the actual application. I'm stoked about this as DBs suck, though I'm not naive enough to think we'll be able to get away with it for long; feature needs will ultimately force one into the system. We do use Amazon's SimpleDB for account management stuff, but could just as easily be using flat files stashed on S3 (remember, no persistent storage on EC2 yet; lame!). There's a rough sketch of both approaches after this list.
- We went out on a limb and transitioned from a messaging-based app to traditional object/memory mapping for our data model. In order to cluster, we're using a network-attached memory framework called Terracotta. Its basic promise is that you write your app code as if it runs on a single instance, run it, and Terracotta manages the memory across an arbitrary number of nodes/instances; basically, replication of your app's memory across as many instances as you'd like. Conceptually, it's super cool and simple; technically, it's wicked cool memory management across nodes. I'm sure we'll wrestle with memory-management tuning fun, but the lack of multicast support in EC2 meant we'd have to either cobble together our own point-to-point infrastructure for off-the-shelf message queuing services (we were using ActiveMQ, and were liking it), or use Amazon's Simple Queue Service, which didn't taste so good after a few bites. (See the shared-memory sketch after this list.)
- We're using RightScale to manage the various AWS instances so we wouldn't have to build our own tool to set up and tear down instances. It's a little pricey, but we saved the few days it would have taken "doing it ourselves" (a sketch after this list shows roughly what that glue would look like).
- Performance. It's slower than running on raw hardware; any virtualized service will be. Our load tests between our hosted hardware and EC2, however, were close enough to go for it. Gnip places a premium on being able to set up and tear down instances based on potentially highly variable load. Our traffic patterns don't match consumer-based products like web apps, which tend to grow organically and steadily, with a Digg-induced spike here and there. When we gain a partner, we have to start handling millions of new requests at the flip of a switch.
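For a rough idea of what "account management stuff" without a real database can look like, here's a minimal sketch using today's AWS SDK for Java (which post-dates this post); the "accounts" SimpleDB domain, the "gnip-accounts" bucket, and the attribute names are made up for illustration, not our actual schema:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

import java.util.Arrays;

public class AccountStore {
    private final AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // SimpleDB flavor: one schema-less item per account, a handful of name/value attributes.
    public void saveAccount(String accountId, String email, String plan) {
        sdb.createDomain(new CreateDomainRequest("accounts")); // idempotent if the domain exists
        sdb.putAttributes(new PutAttributesRequest("accounts", accountId,
                Arrays.asList(
                        new ReplaceableAttribute("email", email, true),
                        new ReplaceableAttribute("plan", plan, true))));
    }

    // Flat-file-on-S3 flavor: same data, one small object per account.
    public void saveAccountAsFlatFile(String accountId, String email, String plan) {
        String body = "email=" + email + "\nplan=" + plan + "\n";
        s3.putObject("gnip-accounts", accountId + ".properties", body);
    }
}
```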
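To make the Terracotta point concrete: the appeal is that clustered code looks like plain single-JVM Java. The class below is a hypothetical example, not our actual data model; the idea is that you'd declare the shared field a root in Terracotta's tc-config.xml, keep synchronizing on it as usual, and the framework replicates the object graph and turns those monitors into cluster-wide locks. No messaging code anywhere.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-instance-looking code. Under Terracotta, "registry" would be
// configured as a shared root, so every node sees the same map and the synchronized
// blocks become cluster-wide locks.
public class SubscriptionRegistry {
    private final Map<String, String> registry = new HashMap<String, String>();

    public void subscribe(String partner, String endpoint) {
        synchronized (registry) { // a plain monitor locally; a clustered lock under Terracotta
            registry.put(partner, endpoint);
        }
    }

    public String endpointFor(String partner) {
        synchronized (registry) {
            return registry.get(partner);
        }
    }
}
```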
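And the "doing it ourselves" that RightScale saved us from is roughly this kind of setup/tear-down glue, sketched here against today's AWS SDK for Java; the AMI ID is a placeholder for an image with the app pre-baked, and "m1.small" is just an example instance type:

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

import java.util.Collections;

public class InstancePool {
    private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

    // Spin up one more node from a pre-baked image when load climbs.
    public String addInstance(String amiId) {
        RunInstancesResult result = ec2.runInstances(new RunInstancesRequest()
                .withImageId(amiId)           // placeholder AMI with the app installed
                .withInstanceType("m1.small")
                .withMinCount(1)
                .withMaxCount(1));
        return result.getReservation().getInstances().get(0).getInstanceId();
    }

    // Tear a node back down when the spike passes.
    public void removeInstance(String instanceId) {
        ec2.terminateInstances(new TerminateInstancesRequest(
                Collections.singletonList(instanceId)));
    }
}
```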
What I want to see from Amazon:
- A multicast solution across EC2 instances.
- A persistent storage solution for EC2 instances/clusters.
- Connection monitoring and optimization across their backend domains. If they see that a service is gobbling up the equivalent of a dozen physical machines and doing heavy machine-to-machine communication (memory level, socket level, whatever), please cordon off those systems so they get the efficiency gains they normally would in a more physically managed hardware scenario. This will hopefully come with time.
- Their own failover scenario between data centers. Mirror Data Center A onto DC B so that if A starts failing, you kick everything over to B. Super pricey, but so what; offer the service and I bet folks will pay. Otherwise, you're doing the Macarena to manage failovers; see Oren Michels' post on how they do it over at Mashery; lots of legwork.
As an aside, I find the advent of "cloud/grid computing" funny. The marketplace is doing the best it can to find the sweet spot, and let's assume it all settles into a warm, cozy place. However, during my time at AOL, I got to know process virtualization that puts all the solutions deployed in the market today to utter and complete shame. It was one of AOL's greatest lost opportunities (there were many). Most of the acronyms have faded from memory, but the "old guard" at AOL had pulled off true server computing miracles (its roots were called SAPI; server API). Imagine a world where the entire stack was custom written from the ground up (OS, socket/connection layers, object models). You could write an app and deploy it across multiple data centers to ensure redundancy, without thinking about clustering; the ability was baked into the API itself. Your app could transition, in real time, from data center to data center (that implies intra-DC machine transfers as well), without a hiccup in the end-user experience. While imaginable in today's stateless web architecture, imagine doing it with stateful, persistent socket connections. Yup... blows your mind, huh?
Sorry... tripped down memory lane.