Wednesday, May 28, 2008

Gnip's Head is in the Clouds


Our transition to AWS/EC2 is complete. It's been a fascinating three weeks of porting. Last night, we internally deployed Gnip to several EC2 instances. Our current partners are still on the "old Gnip," which is hosted in a traditional managed hosting environment, but we'll be cutting everyone over to the cloud soon.

Here's where we wound up:
  • We managed to completely drop our dependency on a database for the actual application. I'm stoked about this as DBs suck, though I'm not naive enough to think we'll be able to get away with it for long; feature needs will ultimately force one into the system. We do use Amazon's SimpleDB for account management stuff, but could just as easily be using flat files stashed on S3 (remember, no persistent storage on EC2 yet; lame!).
  • We went out on a limb and transitioned from a messaging-based app to traditional object/memory mapping for our data model. In order to cluster, we're using a network-attached memory framework called Terracotta. The basic promise is that you write your app code as if it runs on a single instance, run it, and Terracotta manages the memory across an arbitrary number of nodes/instances; basically, replication of your app's memory across as many instances as you'd like (see the sketch after this list). Conceptually it's super cool and simple; technically it's wicked cool memory management across nodes. I'm sure we'll wrestle with memory management tuning fun, but the lack of multi-cast support in EC2 meant we'd have had to cobble together our own point-to-point infrastructure for off-the-shelf message queuing services (we were using ActiveMQ, and were liking it), or use Amazon's Simple Queue Service, which didn't taste so good after a few bites.
  • We're using RightScale to manage the various AWS instances so we don't have to build our own tool to set up/tear down instances. It's a little pricey, but we saved the few days it would have taken "doing it ourselves."
  • Performance. It's slower than running on raw hardware; any virtualized service will be. Our load tests comparing our hosted hardware and EC2, however, were close enough to go for it. Gnip places a premium on being able to set up and tear down instances based on potentially highly variable load. Our traffic patterns don't match consumer-based products like web apps, which tend to grow organically and steadily, with a Digg-induced spike here and there. When we gain a partner, we have to go from zero to handling millions of new requests at the flip of a switch.
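
To make the "network attached memory" bullet above concrete, here's a minimal sketch of the shape that code takes. You write plain, single-instance Java against an ordinary data structure; declaring the field as a shared "root" in Terracotta's tc-config.xml (not shown) is what makes it visible across every node in the cluster. Class, field, and method names here are illustrative, not our actual code.

```java
import java.util.concurrent.ConcurrentHashMap;

// With Terracotta, declaring `activities` as a root in tc-config.xml makes this
// map's contents visible to every JVM in the cluster. The code itself stays
// plain, single-instance Java; no messaging layer in sight.
public class ActivityBuffer {

    // The would-be Terracotta "root": the same object graph on every node.
    private static final ConcurrentHashMap<String, String> activities =
            new ConcurrentHashMap<String, String>();

    public void publish(String activityId, String payload) {
        activities.put(activityId, payload); // becomes visible cluster-wide
    }

    public String fetch(String activityId) {
        return activities.get(activityId);
    }

    public static void main(String[] args) {
        ActivityBuffer buffer = new ActivityBuffer();
        buffer.publish("evt-1", "<activity ... />");
        System.out.println(buffer.fetch("evt-1")); // the same lookup works on any other node
    }
}
```
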
What I want to see from Amazon:
  • A multi-cast solution across Ec2 instances.
  • A persistent storage solution for Ec2 instances/clusters.
  • Connection monitoring and optimization across their backend domains. If they see that a service is gobbling up the equivalent of a dozen physical machines and doing heavy inter-machine communication (memory level, socket level, whatever), please cordon off those systems so they get the efficiency gains they normally would in a more physically managed hardware scenario. This will hopefully come with time.
  • Their own failover scenario between data centers. Mirror Data Center A onto DC B so that if A starts failing, you kick everything over to B. Super pricey, but so what; offer the service and I bet folks will pay. Otherwise, you're doing the Macarena to manage failovers; see Oren Michels' post on how they do it over at Mashery; lots of legwork.
As an aside, I find the advent of "cloud/grid computing" funny. The marketplace is doing the best it can to find the sweet spot, and let's assume it all settles into a warm, cozy place. However, during my time at AOL, I got to know process virtualization that puts all the currently deployed solutions in the market today to utter and complete shame. It was one of AOL's greatest lost opportunities (there were many). Most of the acronyms have faded from memory, but the "old guard" at AOL had pulled off true server computing miracles (its roots were called SAPI: server API). Imagine a world where the entire stack was custom written from the ground up (OS, socket/connection layers, object models). You could write an app and deploy it across multiple data centers to ensure redundancy, without thinking about clustering; the ability was baked into the API itself. Your app could transition, in real-time, from data center to data center (which implies intra-DC machine transfers as well), without a hiccup in end-user experience. While imaginable in today's stateless web architecture, imagine doing it with stateful, persistent socket connections. Yup... blows your mind, huh?

Sorry... tripped down memory lane.

Tuesday, May 27, 2008

Hiring; and Personal Responsibility

Gnip is growing, so I'm face-to-face with the beast that is hiring. Finding smart, compelling, passionate, committed, persistent people is hard. Assuming you find one (big assumption), testing your thesis is the big gamble. Even if you do a half-dozen face-to-face interviews, some "try before you buy" contracting projects, and spend some time with the person socially, you won't know what you've got until a couple of months into the arrangement. It all boils down to, yup, you guessed it, gut.

Some folks recommend contracting for a while before committing to a full-time job offer. While contracting provides some flexibility for both parties, its pros are precisely its cons. There's generally little skin in the game, and it's that skin that can make or break a company. A contractor will generally lose little, if any, sleep over your company, and there's little more motivating to solve a problem than the prospect of losing sleep over it at 2am tonight. I've had good luck with small, short, contained contracts. I've had not-so-good luck with long-term "pretend the contractor is a full-time employee" contracts.

I've seen some funny/odd scenarios wherein a hiring manager blames the new hire for not meeting expectations. While we all screw up here and there and hire the wrong person for the job, this needs to be the exception when determining whether or not you're good at hiring. Sometimes you simply miss. However, you need to assume responsibility for the situation and work to rectify it. That may mean finding a new role (with more, or less, responsibility) for the individual, spending more time with them to help get them on the right track, or letting them go. I've seen situations, however, wherein the hiring manager makes bad call after bad call after bad call. Needless to say, it's the hiring manager's manager who needs to take responsibility for what's going on and rectify it. Ultimately, building a team is a job responsibility (albeit probably the most difficult, and important, one you'll ever have), and if you do it well, you'll be rewarded; vice versa, of course.

Finally, hiring jerks can be fatal. Robert Sutton wrote "The No Asshole Rule", and I highly recommend it. If it looks like a duck, quacks like a duck.... it... is... a... duck.

Monday, May 26, 2008

Cloudy skies and rain.

I'm a data junkie. Monitoring data brings awareness in so many forms. Some of it is very personal, and some of it's great for business. One of my favorite data generators has been the Vantage Pro weather station I mounted to my house five years ago or so. Yes, it's ugly (it took some serious negotiation with my wife before I could install it), but I get great real-time weather data that is highly relevant to my personal being. This post is related to my previous post about sensors (they have, and will continue to, change the world).

Anyway, the image in this post is a dashboard screenshot of my weather station. I also publish station info every 120 seconds here, at Weather Underground. If you live in downtown Boulder, I'm the most relevant station for your up-to-the-minute weather needs.

There isn't a ton of practical info I can gather from the station; most of it just satisfies personal itches for real-time data about my surroundings. But I do know when I don't have to bother watering the lawn (rainfall greater than 0.1" means I don't have to water).

What can I say, I like to understand my surroundings!

Thursday, May 22, 2008

Two things that will change the world.

Last night my backordered "smart" power strip arrived. It kills power to the other outlets when draw on the master outlet drops (e.g., the TV goes to sleep or is powered off). Simple, genius. I had the same level of excitement plugging things into it as I do when a new Apple laptop arrives. Total rush!

Sensors
The world will change when sensors are everywhere. From RFID tags to power consumption sensing, once little bits of information are available, so many cool things can happen. Disclaimer: I'm a data junkie.

Feedback Loops & Awareness
I have friends at two energy management companies that are on fire right now: Tendril and GridPoint. Both firms promise to give me control of, and data about, the power my home uses. Sadly, we've been running on dumb power grids for decades, with zero awareness of peak energy usage times. Viewing my home's power consumption patterns at a URL is something I've wanted ever since buying my house. I feel lucky that the town I live in, Boulder, CO, will be the first in the nation to have smart power panels installed on most of the homes in the city by the end of the year. Closing loops changes how we think and act, and power consumption loops have been left open for far too long.

My Macbook Air...
... is so smart. It senses ambient lighting and adjusts keyboard backlight and screen brightness in real time. Clouds roll in, blocking out the sun, and my laptop adjusts. When I'm sitting in a dark room and someone turns on the lights, my laptop adjusts again. So cool.

lat, lng, alt
With location based sensors, latitude, longitude, and altitude will be associated with vast amounts of new data. Cameras will embed geodata into every image they snap. Phones will literally know where you are, and you can pump that info to products like brightkite with nary a button press (or even automatically).

Exciting times
While I'm totally stoked about what the future will bring with sensors, I'm bummed that I'll personally be further down the food chain in its creation. I'm a software guy, not a hardware guy, and this game starts with hardware and devices. Sadly, that dramatically restricts the playing field for innovation as well. Hardware notoriously evolves like a glacier; software, on the other hand, iterates faster than a speeding bullet. That said, I anxiously await the arrival of the new data, so I can build software to use it and make the world a better place.

Friday, May 9, 2008

Forecast: Sunny Skies & Windy

It took us a few days, but we've ported our system over to AWS. We're not clustered in the cloud yet, but that's next. We'll be conducting load tests next week. My first impression of AWS is posted here.

What's Clear

AWS is a great place for Tier 2/3 web app architectures, but it has a long way to go to make my dreams come true for more complex systems.

Load Balancing

There is no load balancing facility in AWS. You have to build one, and Amazon's elastic IP address product adds a layer you have to contend with. This means "spinning up instances when load goes up" isn't *that* simple if the new instances are supposed to actually share the load.
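
For a sense of what "build one" means at its most naive, here's a sketch of a round-robin TCP forwarder you'd park on the instance holding your elastic IP, handing each new connection to the next backend. Backend addresses and ports are placeholders, and a real version also needs health checks, which this ignores.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class TinyBalancer {
    // Placeholder backend addresses: the EC2 instances you spun up for load.
    private static final String[] BACKENDS = { "10.0.0.11", "10.0.0.12" };
    private static final int BACKEND_PORT = 8080;
    private static final AtomicInteger next = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        ServerSocket listener = new ServerSocket(8000); // front-door port
        while (true) {
            final Socket client = listener.accept();
            // Round-robin: pick the next backend for each new connection.
            final String backend =
                    BACKENDS[(next.getAndIncrement() & Integer.MAX_VALUE) % BACKENDS.length];
            new Thread(new Runnable() {
                public void run() {
                    try {
                        Socket server = new Socket(backend, BACKEND_PORT);
                        pipe(client, server); // client -> backend
                        pipe(server, client); // backend -> client
                    } catch (Exception e) {
                        // backend unreachable; drop the connection
                    }
                }
            }).start();
        }
    }

    // Copies bytes from one socket to the other on its own thread.
    private static void pipe(final Socket from, final Socket to) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    InputStream in = from.getInputStream();
                    OutputStream out = to.getOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                        out.flush();
                    }
                } catch (Exception ignored) {
                } finally {
                    try { from.close(); to.close(); } catch (Exception ignored) {}
                }
            }
        }).start();
    }
}
```
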

Persistent Storage

Until their persistent storage product (not to be confused with S3) sees the light of day, you'll be playing with S3 like a distributed file system, which brings with it the challenges of a distributed file system :-).
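
For reference, here's roughly what "playing with S3" looks like when you stash state there by hand. This is a sketch using S3's original HMAC-SHA1 REST signing; credentials, bucket, and key are placeholders, and there's no retry or error handling (a real client library takes care of all of this).

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Base64;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class S3Stash {
    // Placeholder credentials; substitute your own.
    private static final String ACCESS_KEY = "YOUR_ACCESS_KEY";
    private static final String SECRET_KEY = "YOUR_SECRET_KEY";

    public static void main(String[] args) throws Exception {
        String bucket = "example-gnip-state";        // illustrative bucket name
        String key = "accounts/acct-123.xml";        // illustrative object key
        byte[] body = "<account id=\"123\"/>".getBytes("UTF-8");
        String contentType = "text/xml";

        // The legacy S3 REST auth scheme: an RFC 1123 date plus an HMAC-SHA1
        // signature over a canonical string describing the request.
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String date = fmt.format(new Date());
        String stringToSign = "PUT\n\n" + contentType + "\n" + date + "\n/" + bucket + "/" + key;

        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(SECRET_KEY.getBytes("UTF-8"), "HmacSHA1"));
        String signature = Base64.getEncoder()
                .encodeToString(mac.doFinal(stringToSign.getBytes("UTF-8")));

        URL url = new URL("https://s3.amazonaws.com/" + bucket + "/" + key);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Date", date);
        conn.setRequestProperty("Content-Type", contentType);
        conn.setRequestProperty("Authorization", "AWS " + ACCESS_KEY + ":" + signature);

        OutputStream out = conn.getOutputStream();
        out.write(body);
        out.close();
        System.out.println("PUT " + key + " -> HTTP " + conn.getResponseCode());
    }
}
```
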

Multi-casting

There's no support for it. You heard me right. This means getting your system's services to announce their existence to each other requires manual management of IP addresses and lookup of those addresses *outside* of EC2 entirely. You literally have to use dynamic DNS resolution services external to the system. _K_ _L_ _U_ _G_ _E_.
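
Concretely: each instance registers its public address with an external dynamic DNS provider when it boots (that registration call depends on the provider and isn't shown), and peers find one another by resolving a well-known name. A sketch of the lookup side, with a placeholder hostname:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class PeerDiscovery {
    public static void main(String[] args) throws UnknownHostException {
        // Placeholder hostname: each instance registers itself under this name
        // with an external dynamic DNS service at boot time.
        InetAddress[] peers = InetAddress.getAllByName("peers.gnip.example.com");
        for (InetAddress peer : peers) {
            System.out.println("discovered peer: " + peer.getHostAddress());
        }
        // Note: the JVM caches DNS lookups aggressively; dial down
        // networkaddress.cache.ttl (in java.security) or these results go stale.
    }
}
```
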

In the end... you should consider AWS to be nothing more than the traditional managed server hosting services (e.g. Rackspace) we've been using for years. There are two exceptions, and they both matter: one, the servers are totally virtualized and can be set up/torn down programmatically (that's a tasty proposition); two, multi-cast support is lacking, and that's painful.

I'm hopeful our system performs within our tolerance levels. I want this to work for convenience's sake.

It's all in the timing.

Some good friends of mine started a company called Summize a couple of years or so ago. It's comprised of a few of the top search/math/algorithm brains on planet Earth; I'd argue they have the highest ratio of search genius to employee count. Anyway, their initial, technically sound product, which does/did sentiment analysis for products on the web (better than anything you've ever used), has/had been moving along at a decent, but not radical, pace for a while now. What's funny is that when they applied a relatively tiny fraction of their brains (think weekend project) to something with buzz, Twitter, they became all the rage. Check out their Twitter search API, and their recent mention on ReadWriteWeb. A total testament to market timing. Way to go, guys!

Thursday, May 1, 2008

Tomorrow's Forecast: Cloudy Skies & Sunshine

Our managed hosting solution reached a tipping point yesterday. We hit an environment configuration bump that was going to eat 24-36 of our precious startup development hours. This was the second bump we'd hit with our hosted hardware provider, and it caused me to pick up the phone to evaluate a different vendor. With our provider in flux, we took off our shoes and socks and started to wade into Amazon Web Services (AWS). Programmatic access to server instances running the environment/configuration of our choosing (for the most part) became too tempting. Google's App Engine is way too high-level in its current state to have even been considered. Here's my first impression of AWS. The following assumes you have basic knowledge of EC2 and S3 as concepts.

Account Setup

For some reason Amazon thought they'd leverage their consumer shopping product UI for AWS. C'mon. I don't want to feel like I'm shopping for bathroom soap while setting up X.509 certificates for API use. After getting over the UI, account setup was pretty straightforward. A public key handshake here, a private key stored there, a dash of PKI setup, and we had a full-fledged AWS account ready for accessing EC2 and S3.

Clustering and what-not

AWS is dirt simple if you don't have a complex clustered network topology with lots of services running across multiple machines. If you're hosting a simple shopping website, for example, you should be using AWS; no question. Amazon has no notion of the relative importance of machines in your cluster; all instances (logical machines) are treated as equals. That's great for AWS, but not necessarily for you. EC2 doesn't support multi-cast between machines, so if your model needs self-discovery, you'll need to come up with a homegrown solution, or find one cobbled together on the net.

Queues

If you're using a queueing framework, you may want to consider replacing it with Amazon's version (Simple Queue Service). Doing so alleviates some of the clustering/multi-cast issues I mentioned. If you need machine-to-machine level performance, however, there are potentially significant downsides (note: we haven't finished load testing here, so the performance issues I outline are educated guesses, not empirical data; see the sketch after this list):

  • There's an HTTP version. While this is nice and standard, message overhead is likely much larger than a traditional binary messaging protocol's.
  • There's a SOAP version. While this is nice and standard...
  • Message movement in and out of AWS will cost you financially ($1 per 1M messages), though intra-EC2 communication is free.
  • Message movement in and out of AWS will cost you in latency (2-10 seconds to get a message onto the queue; YIKES!).
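
For context on the "machine-to-machine level performance" bar we're comparing against, here's a sketch of the kind of round-trip timing you can run against a plain JMS broker such as ActiveMQ (broker URL and queue name are placeholders). Every SQS round trip, by contrast, has to traverse HTTP and Amazon's infrastructure.

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueLatency {
    public static void main(String[] args) throws Exception {
        // Assumed broker URL and queue name; substitute your own.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Destination queue = session.createQueue("gnip.latency.test");
        MessageProducer producer = session.createProducer(queue);
        MessageConsumer consumer = session.createConsumer(queue);

        // Time a handful of enqueue -> dequeue round trips.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            producer.send(session.createTextMessage("ping-" + i));
            Message reply = consumer.receive(5000); // block until the broker hands it back
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("round trip " + i + ": " + micros + " µs (got "
                    + (reply == null ? "nothing" : "message") + ")");
        }

        connection.close();
    }
}
```
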

Persistent Storage

While S3 obviously provides persistent storage, the coupling with EC2 appears crude enough that instance-level storage requires some hacking. Amazon is wait-list-beta testing a more streamlined persistent storage model for EC2, and we're not in the beta yet, so I can't comment.

Ecosystem

There are already companies that provide nice automated instance provisioning services with the click of a UI element. If you don't use these (they can be pricey), you'll be building your own or writing scripts to set up/tear down machines on demand.

If I've missed something, or someone has better data, I'd love to hear about it. I have high hopes for cloud computing, and our initial experience is pretty good. I should disclose that we're building infrastructure software that has significant performance needs around message transmission, and a highly dynamic data set. We're not building websites.