A week ago, almost to the minute, the following message was generated by an internal Gnip monitoring server, and sent to the person on-call.
** PROBLEM alert - production-head2/production-head2-gnip is CRITICAL **
It was the start to a very long, non-stop, few days at Gnip.
Much of the rapport on your team gets defined in moments like this. Your team's ability to solve hard, live, problems is thrust into the foreground. About five hours into the ordeal, my appreciation for having focused very hard on bringing software veterans into Gnip was peaking. The problem was being sliced and diced, and the collective experience of everyone on the team was winnowing things down quickly. "It can't be that!" "It must be this!" "I think we should focus here."
- Step 1: we isolated the symptoms. exactly what was going on!?!
- Step 2: we checked configurations/environments
- Step 3: we identified potential code inefficiencies
- Step 4: we verified probabilities
- Step 5: we placed a bet on what we thought the problem was, and wrote code to address it
- Step 6: we watched our hard work pay off; production issue resolved; it was the right bet
Knowing which bet to place comes from experience. The only problem with experience at a startup is that it can be expensive. Like so much in life, you get what you pay for. Had Gnip been tilted toward relatively in-experienced, in-expensive, junior team members, what turned out to be a production blip, could have been a true nightmare for the company.
Glad that problem is behind us, and we all have a nice new chunk of experience to put into the bag of tricks for future use.