A Couple of Caveats on Queuing

July 7th, 2008

“The Great Rotary Duck Race,” benefiting agencies that fight child abuse.

Les’ “Delight Everyone” post is latest greatest addition to the 17th letter of the alphabet for savior conversation.

And believe me I’m a huge fan, and am busy carving out a night sometime this week to play with the RabbitMQ/XMPP bridge (/waves hi Alexis).

But …. there are a couple of caveats:

1) Some writes need to be real time.

Les notes this as well, but I just wanted to emphasize because really, they do.

If you can’t see your changes take effect in a system your understanding of cause and effect breaks down. It doesn’t matter that your understanding is wrong, you still need one to function. Ideally a physical analogy too. There are no real world effects that get queued for later application. Violate the principle of (falsely) seeming to respect real world cause and effect and your users will remain forever confused.

del.icio.us showing you the wrong state when you use the inline editing tool, and Flickr taking a handful of seconds to index a newly tagged photo are both good examples of subtly broken interfaces that can really throw people.

My data, now real time. Everyone else can wait (how long depends on how social your users are).

2) You’ve got to process that queue eventually.

Ideally you can add processing boxes in parallel forever but if your dequeuing rate falls below your queuing rate you are, in technical terms, screwed.

Think about it, if you’re falling behind 1 event per second, processing 1,000,000 events a second, but adding 1,000,001 for example, at the end of the day your 86,400 events in debt and counting. It’s likes losing money on individual sales, but trying to make it up in volume.

Good news: Traffic is spiky and most sites see daily cycles with quiet times.

Bad news: Many highly tuned systems exhibit slow down properties as their backlogs increase. Like a credit card, processing debt can get exponentially unmanageable.

In practice this means that most of the time your queue consumers should be sitting around bored. (see Allspaw’s Capacity Planning slides for more on that theme.)

If you can’t guarantee those real time writes for thems that cares, and mostly bored queue consumers the rest of the time then your queues might not delight you after all.

See also: Twitter, or Architecture Will Not Save You

“On a rooftop in Brooklyn…”

July 6th, 2008

We tromped up to our roof with friends and neighbors to watch the fireworks in the rain.

Tagged: , , ,

June 23rd, 2008

From the road

June 17th, 2008

IMG_3518

IMG_3795

2008/06/16

Sierra Nevadas in the rear view mirror

June 16th, 2008

IMG_3450, originally uploaded by curlyjazz.

A family ritual

June 12th, 2008

A family ritual, originally uploaded by curlyjazz.

We go for coffee at Gayles every Saturday morning I’m in Santa Cruz.

AOEMedia Sponsoring Magpie

June 7th, 2008

AOE media, a TYPO3 & open source provide from Germany, recently agreed to become a sponsor on Magpie. Which is very exciting!

Looking around I see being sponsored by AOE Media puts me in good company as they sponsor a number of interesting projects, including my favorite wiki software Doku

I’m hoping this will allow me to actually spend a little time with Magpie, maybe finally find the time to re-vamp the website into something a little more functional.

And last if you’re someone interested in sponsorship work as a whole on Magpie, or specific feature development get in touch.

Thanks again AOE Media.

Moving to New York

June 3rd, 2008

new york, ny

Some of you have heard, some of you haven’t, but Jasmine and I are getting up next week (June 12th) and moving to New York.

The plan is to go on working for Flickr, and fly back to SF once a month or so. So if you could plan your camps/parties/meetups/conferences accordingly that would be swell.

Looking forward to seeing you all on the East Coast. (And while we haven’t really gotten a feel for the new apartment [in Williamsburg], its looking like we might even have something resembling a guest room.)

We’ll be driving back via the not most direct route (which will surprise no one who knows us) that takes us past the Grand Canyon, Austin, and New Orleans. (thats as far as the map I’m looking at goes) So any suggestions for anything from sights, to food, to places to stay, to good people along those stretches, to great audiobooks to fill up empty bits in the road are all appreciated.

10 Years

June 1st, 2008

And what a long, strange 10 years it’s been.

Twitter, or Architecture Will Not Save You

May 28th, 2008


(circa 2006 Twitter maintenance cat)

Along with a whole slew of smart folks, I’ve been playing the current think game de jour, “How would you re-architect Twitter?”. Unlike most I’ve been having this conversation off and on for a couple of years, mostly with Blaine, in my unofficial “Friend of Twitter” capacity. (the same capacity that I wrote the first Twitter bot in, and have on rare occasion logged into their boxes to play “spot the run away performance issue.”)

For my money Leonard’s Brought to You By the 17th Letter of the Alphabet is probably the best proposed architecture I’ve seen — or at least it matches my own biases when I sat down last month to sketch out how build a Twitter-like thing. But when Leonard and I were chatting last week about this stuff, I was struck what was missing from the larger Blogosphere’s conversation: the issues Twitter is actually facing.

Folks both within Twitter and without have framed the conversation as an architectural challenge. Meanwhile the nattering classes have struck on the fundamental challenge of all social software (namely the network effects) and are reporting that they’ve gotten confirmation from “an individual who is familiar with the technical probelms at Twitter” that indeed Twitter is a social software site!

Living and Dying By the Network

All social software has to deal with the network effect. At scale it’s hard. And all large social software has had to solve it. If you’re looking for the roots of Twitter’s special challenges, you’re going to have to look a bit farther a field.

Though you can hedge your bets with this stuff by making less explicit promises than Twitter does (everything from my friends in a timely fashion is pretty hard promise to keep). Flickr mitigates some of this impact by making promises about recent contacts, not recent photos (there are a fewer people than photos), meanwhile Facebook can hide a slew of sins behind the fact that their newsfeeds are “editorialized”, no claims of completeness anywhere in site. (there is a figure floating around that at least at one point Facebook was dropping 80% of their updates on the floor)

So while architectures that strip down Twitter to queues, and logs could be a huge win, and while thinking about new architectures is the sexy, hard problem we all want to fix, Twitter’s problems are really of a more pedestrian hard, plumbing and ditch digging nature. Which is less fun, but reality.

Growth

Their first problem is growth. Honest to god hockey stick growth is so weird, and wild, and hard, thats it’s hard to imagine and cope with if you haven’t been through it at least once. To quote Leonard again (this from a few weeks ago back when TC thought they’d figured out that Twitter’s problems were Blaine):

“Even if you’re architecturally sound, you’re dealing with development with extremely tight timelines/pressures, so you have to make decisions to pick things that will work but will probably need to eventually be replaced (e.g. DRb for Twitter) — usually you won’t know when and what component will be the limiting factor since you don’t know what the uses cases will be to begin with. Development from prototype on is a series of compromises against the limited resources of man-hours and equipment. In a perfect world, you’d have perfect capacity planning and infinite resources, but if you’ve ever experienced real-world hockey-stick growth on a startup shoestring, you know that’s not the case. If you have, you understand that scaling is the brick that hits you when you’ve gone far beyond your capacity limits and when your machines hit double or triple digit loads. Architecture doesn’t help you one bit there.”

Growth is hard. Dealing with growth is rarely sexy. When your growth goes non-linear you’re tempted to think you’ve stumbled into a whole class of new problems that need wild new thinking. Resist. New ideas should be applied judiciously. Because mostly its plumbing. Tuning your databases, getting your thread buffer sizes right, managing the community, and the abuse.

Intelligence and Monitoring

Growth compounds the other hard problem that Twitter (and almost every sites I’ve seen) has, thery’re running black boxes. Social software is hard to heartbeat, socially or technically. It’s one of the places where our jobs are actually harder than those real time trading systems, and other five nines style hard computing systems.

And it’s a problem Twitter is still struggling to solve. (really you never stop solving it, your next SPOF will always come find you, and then you have something new to monitor) Twitter came late in life to Ganglia, and haven’t had the time to really burnish it. And Ganglia doesn’t ship by default with a graph for what to do when your site needs its memcache servers hot to run. And what do you do when Ganglia starts telling you your recent framework upgrade is causing a 10x increase in data returned from your DBs for the same QPS. Or that your URL shortening service is starting to slow down sporadically adding an extra 30ms burn to message handling. (how do you even graph that?)

Beyond LAMP Needs Better Intelligence

Monitoring and intelligence get even harder as you start to embrace these new architectures. Both because the systems are more complex, but largely because we don’t know what monitoring and resourcing for Web scale queues of data, and distributed hash tables look like. And we don’t yet have the scars from living through the failure scenarios. And we’re rolling our own solutions as it is early days, without the battle hardened tweaks and flags of an Apache or MySQL.

We all know that Jabber has different performance characteristics than the Web (that’s rather the point), but we don’t have the data to quantify what it looks like at network effect impacted scale. (the big IM installs, particularly LJ and Google have talked a bit in public, but their usage patterns tend to be pretty different than stream style APIs. Btw I’ll be talking about this a bit in Portland at OSCON in a few months!)

Recommendations

So I’d add to Leonard’s architecture (and I know Leonard is thinking about this), and the various other cloud architectures emerging that to make it work you need build monitoring and resourcing in from the ground up, or you’re distributed in the cloud queues are going to fail.

And solve the growth issues, with appropriate solutions for growth, which rarely involves architectural solutions.