24 2 / 2012

Keep your QA close…and your users closer

Building great products is every engineer’s passion, all the more so when the product is going to get millions of visits every day. At IGN, we build such products and work hard to make sure our systems deliver quality, scale, predictability, and stability.

One such system is our Social API, which gets about 20K requests a minute. Recently we started noticing this API farm going down for no obvious reason. New Relic showed a massive spike in fetching data from memcached, but that's about all we had. The memcached hit ratio kept dropping whenever this happened, all the way to zero. The outages became much more frequent just as we started debugging critical pieces of code that could potentially fail to scale. We looked at traffic data to see if there was anything unusual. Nothing stood out.

Then, one fine morning, I received this from one of our top users, AngryMrBungle (read from bottom to top):

This DM woke me up at 7 AM, and I thought he could be talking about his browser crashing and not MyIGN. Only later did I realize that he never mentioned a browser. Time to get more info.

That cleared up the entire situation. On my way to work I was able to link this problem to the exact classes and method that needed refactoring. We had a fix in place after a bit of whiteboarding.

Testing was again a challenge: I did not want to bother AngryMrBungle, but at the same time we could not figure out how to test the site under the right circumstances, fast. That's one of the challenges of testing “scalability” with the right mix of requests. I asked him if he could help us test this and he gladly agreed. The whole team had its eyes glued to New Relic. Not a blip. A few minutes later I got a DM from him saying he no longer saw the problems he had faced earlier.

This was good news for the team, and the refactored architecture will be rolled out to other API layers as well.

In summary, it is critical for consumer-facing sites to be in constant touch with their users. Users are far more passionate than we as engineers think, and very helpful in charting the feature set they want. We use UserVoice, and it does not surprise me to see users write detailed feedback and requests and tell us how we can make the UX better. We are refactoring quite a few technology platforms at IGN and moving to a more API-based, open architecture on the back end and a widget-based one on the front end. The idea is to build a platform that scales with user-base and content growth, and the best way to do this is to engage with our audience. We take pride in being on top of this, and I thought this incident was worth mentioning in a blog post. Among the ways we engage with our users (the boards, MyIGN, and Twitter), UserVoice helps us a lot in getting the pulse of the users, and it pretty much puts our product feature-set “direction” on autopilot.

I’ll write a separate blog post explaining the technical side of this “scalability bug” that AngryMrBungle helped uncover.