INSTRUMENTAL BLOG

Instrumental customers,

Unfortunately, this past Friday (April 19th), Instrumental stopped displaying up-to-the-minute data for a period approaching 15 hours. Astute observers of the app and our Twitter account (@instrumentalapp) have likely noticed that this was the worst of a series of graph delays that have occurred over the past month.

I’ll give a technical post mortem later in this message, but I want to underline the most important parts of these delays to you:

  1. Your data may not appear quickly, but it’s NEVER lost.
  2. The changes we made to fix last Friday’s problems should ensure we don’t see problems like we have in the past month.
  3. We’re going to do a better job communicating service information to you.

We’re currently working on a series of changes that should ensure we don’t experience delays of this magnitude again, and that you are never left without up-to-date information. We’ll also be working on better ways to communicate essential information regarding Instrumental’s current status, so that you never have to wonder whether your app is down or Instrumental’s data is delayed.

We hope to have this past month’s unpredictable data delays behind us soon. Internally we still have more work to do to ensure an even more reliable future, but we’ve removed the weak parts of our infrastructure that were the primary cause of the delay.

If you have questions about what happened, or need to address concerns, please email us at support@instrumentalapp.com. We want to make sure you understand exactly what happened, and know that you can expect more reliable behavior in the future.

April 19th Technical Post-Mortem

Background: Instrumental’s data is persisted in two places:

  1. A MongoDB replica set previously hosted on Linode
  2. A series of flat files archived in S3

Writes to either of these two data stores are queued as write documents (collections of metric updates), and separate processes then commit them to the database or S3. Mongo writes under our particular configuration have typically taken roughly 10ms, with little variation. Since March 22nd, 2013, we’ve seen that number vary wildly, with batches of 100 writes taking anywhere from 1s to 10s.
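
A minimal sketch of that decoupling, using in-process queues in place of the real queueing service. Every name here (`run_writer`, `enqueue`, the two lists standing in for the data stores) is hypothetical, and this is an illustration of the pattern rather than the production code.

```python
import queue
import threading


def run_writer(pending, commit):
    """Drain write documents from `pending` and hand each one to `commit`."""
    while True:
        document = pending.get()
        if document is None:  # sentinel: no more documents will arrive
            break
        commit(document)


# Hypothetical stand-ins for the two data stores.
database_rows = []   # one row per metric update
archive_files = []   # one archived entry per write document

db_queue = queue.Queue()
archive_queue = queue.Queue()

writers = [
    threading.Thread(target=run_writer, args=(db_queue, database_rows.extend)),
    threading.Thread(target=run_writer, args=(archive_queue, archive_files.append)),
]
for writer in writers:
    writer.start()


def enqueue(document):
    """Accept a write document without waiting for either store to commit it."""
    db_queue.put(document)
    archive_queue.put(document)


enqueue([("signups", 1), ("logins", 3)])
enqueue([("signups", 2)])

# Shut down: tell each writer to stop, then wait for it to finish its backlog.
for pending in (db_queue, archive_queue):
    pending.put(None)
for writer in writers:
    writer.join()
```

Because accepting a document only means putting it on a queue, a slow or failed store delays commits without blocking or dropping incoming data.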

Description: Around 6 AM EST, while one of the read members was performing an overnight compaction, our Mongo replica set encountered an unknown event that caused our primary member to fail. The other members of the set failed to elect a new primary member, causing all Mongo writes to fail. By 8 AM EST, our incoming SQS queue had grown to 20,000 documents (each usually containing from 100 to 2,000 writes to the Mongo database). During this period, a new replica member was forced to become the primary member so the Mongo database could begin accepting writes again. It was at this time that we observed that our MongoDB write times had begun to vary wildly, and we could not predict when the entire queue would be consumed by the new primary member.
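
The figures above show why no useful estimate was possible. The sketch below turns them into rough bounds, assuming a single serial writer and no new documents arriving. Both are simplifications not stated in the post, and the helper `drain_seconds` is hypothetical.

```python
def drain_seconds(documents, writes_per_document, seconds_per_batch, batch_size=100):
    """Time to commit a backlog when writes are committed in fixed-size batches."""
    batches = documents * writes_per_document / batch_size
    return batches * seconds_per_batch


# 20,000 queued documents, 100 to 2,000 writes each, 1s to 10s per batch of 100.
best = drain_seconds(20_000, 100, 1)       # 20,000 batches at 1s each
worst = drain_seconds(20_000, 2_000, 10)   # 400,000 batches at 10s each

print(f"best case:  {best / 3600:.1f} hours")
print(f"worst case: {worst / 3600:.1f} hours")
print(f"spread:     {worst / best:.0f}x")
```

The absolute numbers mean little, since a real system would likely run several writers in parallel. The point is the spread: when both the document size and the batch time can vary by an order of magnitude or more, the two ends of the estimate are a factor of 200 apart.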

Presuming that the write variability might be caused by overactive shared tenants on the Linode instance assuming primary responsibilities, we rebuilt the previous primary member and had it rejoin the replica set around 1 PM EST. While we did observe slightly faster write times during this period, which let us begin reducing the overall queue size, we still could not accurately predict when the queue would be entirely consumed and the service would be up to date.

Due to the increasing variability we were seeing on Linode (and a host of other small complaints and limitations), we had already begun a slow migration to AWS that included moving queue responsibilities to SNS and SQS, and an eventual primary data store switch to DynamoDB. While the DynamoDB migration was not yet ready for production, enough of our infrastructure automation scripts had been ported to EC2 that we had the tools to spin up a new Mongo replica set on EC2 that used EBS Provisioned IOPS for more consistent behavior. This new replica set was spun up around 5 PM EST, and over the course of the evening was replicating with the Linode hosts at around 1-5s latency.

While this new set was being brought up, the Linode hosts eventually caught up with the queue and were displaying up-to-date data sometime around 8 PM EST. On Saturday, we migrated our database writing processes to AWS and switched primary member responsibility, after which we observed an overall system latency decrease and more consistent behavior.

What Caused The Problem: We’re not certain of the original event that caused the primary member to fail out. We do know that the compaction command on the read member of the database failed, and suspect that the primary might have failed while the compacting member was unavailable, but that is only supposition on our part (and even if it did occur, Mongo’s election behavior should handle this event). That is only part of the picture, however: the delay would not have been nearly as severe were it not for the high variability in IO that we saw on our Linode host. We suspect that the advent of Linode’s “next gen” hardware and network is partly to blame, as its introduction roughly correlates with our IO increases; given our small sample size, however, we cannot make that claim with certainty.

What We Did Wrong: We had observed the increasing IO lag, and had already begun migrating our system to move to DynamoDB. This decision (to migrate to AWS and a new datastore in one swoop) had too many dependencies to happen quickly, and put us in a more dangerous time period in which our ability to recover from a large pending queue was significantly reduced. We should have made the transition more piecemeal, and migrated our database servers to AWS earlier. Additionally, we failed to communicate information to our customers regularly, which caused more uncertainty regarding graph data amongst all customers; lacking an accurate estimate of graphs catching up to date, we should have been more frequent with our updates.

What We Did Right: When we realized we were unable to accurately estimate the time the system would be up to date, we immediately began migrating to EC2; we followed through on the migration to such a point that the overall system state is more reliable than it was last week. We now have tools in place that allow for better growth and handle unexpected events better than those we had available on Linode. Other systems will now be migrated from Linode over the next week, after which the system will become even more performant.

Additionally, the queueing behavior of the system (and the persistent, decoupled attributes which SQS adds) ensured that we were still receiving and archiving customer data even though our primary data store was hours out of date.

What Happens Next: Over the next week, most Instrumental servers will begin to migrate to AWS. Internally we will be discussing whether to tune our AWS Mongo replica set to grow better or to continue apace with our DynamoDB migration. We will also continue discussions regarding the implementation of an in-memory store that will supplement our persistent store, reducing visible system latency and allowing us to better handle persistence lag. Also, we’ll begin planning an in-app method of communicating service status to our customers, so that extraordinary events like Friday’s are visible and up to date.

Thanks, Instrumental customers, for your patience last Friday. Those of you who came into the Campfire chatroom, or talked to us on Twitter, were uniformly awesome in talking to us during an extremely stressful period, and we hope this helps better explain why we couldn’t keep your dashboards and graphs up to date on Friday.

Love,

Instrumental

Happy February everyone! Last week was quite a doozy for site owners, between a new Rails vulnerability for 3.0.x and 2.3.x, high profile sites like Twitter being attacked, and Amazon suffering some rare downtime. Are you monitoring enough of your infrastructure to know when your own site is being attacked? Tracking metrics like failed login attempts, improperly formatted requests and cross domain form submit attempts can alert you in advance to someone snooping around your site. Be careful out there!
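
As one way to act on a metric like failed login attempts, here is a small sketch of a sliding-window alarm. The class name `RateAlarm`, the threshold, and the window length are all hypothetical choices for illustration, not a recommendation for any particular site.

```python
from collections import deque


class RateAlarm:
    """Counts events in a sliding time window and reports when a threshold is passed."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, now):
        """Record one event at time `now` (in seconds); return True if the alarm should fire."""
        self.timestamps.append(now)
        # Forget events that have fallen out of the window.
        while self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


# Fire when more than 3 failed logins are seen within 60 seconds.
failed_logins = RateAlarm(threshold=3, window_seconds=60)
results = [failed_logins.record(t) for t in (0, 10, 20, 30, 100)]
```

Here the fourth failure (at 30 seconds) trips the alarm, while the one at 100 seconds does not, because the earlier failures have aged out of the window.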

  • Twitter dev skr wrote about the Twitter stack, which is a great top level overview of how they develop and monitor the service. As you might imagine, visibility into their production code is critical to their business; tracing code execution, observing JVM effects and graphing core software metrics all allow them to keep on improving Twitter safely.

  • Noah Lorang, 37signals’ data analyst, wrote a great post entitled “Three Charts are all I need” that advocates for relying on simple, effective tools to visualize your data. The line “many infographics are more like infauxgraphics” sums up his point nicely: complicated information visualization techniques are more likely to hide data than they are to aid discovery.

Have a great (and hacker free) week everyone!

What a week! We’ve been improving the service, fixing bugs and talking to awesome new customers. New customers and existing customers alike may find our new help site useful, which you can use to send us feature requests, make suggestions about the service, and get ahold of us if you’ve been having problems.

Some great and late links for you to look at before diving into the week:

  • Jeff Leek, assistant Biostatistics professor at Johns Hopkins, is teaching a course at Coursera on Data Analysis. If you’ve got the time, it looks like a great class to grow your analysis skills with, as well as pick up a little bit of R. Take a look at Jeff’s sneak peek on the class’s topics. (Note: it’s already started, but you should be able to join late if you’d like to review the lectures and course materials.)

  • On the entirely other side of the topical spectrum of data analysis, there’s a fantastic article by David Skok at forentrepreneurs.com called SaaS Metrics 2.0 - A Guide to Measuring and Improving What Matters. The article delves into the financial metrics of running a software as a service business, a topic we know is near and dear to many of our customers.

  • Finally, we’ve got a great bug hunt story from Daniel F Pupius about the lengths he and the Gmail team went to in order to track down some poorly performing JS code in Internet Explorer. Most of us may never need to write an intercepting DLL for Internet Explorer :), but it’s a great lesson in cleverly isolating and measuring the pieces of a complex system.

Have a great week, everyone!

Happy Monday! There have been a lot of great posts this past week that we want to share with you, so let’s get to it.

Jeff Hodges had a great post targeted at developers entitled “Notes on Distributed Systems for Young Bloods” that was full of great tips on growing your application. Of course, we’re particular fans of his point “Metrics are the only way to get your job done”, but there are lots of great points in his article.

Data visualization fans will no doubt find something interesting in AppStoreRankings.net’s App Store visualization tool. Even if you’re not interested in the iOS App development community, there’s something to appreciate in how they visualize the disparate scales of data in a way that makes it easy to comprehend.

Another developer focused post, “The 10 Commandments of Logging” from Brice Figureau, was full of great advice about logging your data. We hope you’re already monitoring your important application data in Instrumental, but the humble log file is often the best place to capture important debugging and state data.

Have a great week!

It’s easy to underestimate the benefits of building lightweight tools. A simple status bar provides substantial insight to devs and admins, is unobtrusive, and can be implemented in an afternoon.

Why should I do this?

By keeping indicators of app health visible during general site usage, errors get caught earlier. Time usually wasted flipping back and forth between the app and sources of key metrics while troubleshooting is reduced. And, you’ll avoid the dreaded, “what the… I keep changing things and nothing is updating… aw crap I’m making changes to development and looking at production”, and similar mixups, thus lowering the frequency of facepalms. Always a good thing.

What to include

Status bars are useful in several scenarios: general site usage, quick check-ins to make sure everything’s ok, and troubleshooting. The items you should include can overlap between these scenarios.

For general site usage and quick check-ins, you’ll want basic activity info as well as any items that will tip you off to emerging issues: think online users, response times, exceptions, and failed jobs. Items that inspire and help maintain momentum are good as well, e.g. recent signups, payments, posts, Tweets, and other user activity.

For troubleshooting, we want info of use when the site is busted, or when a user is experiencing issues. The number of recent errors, failed jobs, slow transactions, server load… you get the picture.

What is, light urple?

Color coding your status bar to match the current environment will nearly eliminate the risk of getting mixed up while flipping between multiple environments. It also minimizes the mental cost of keeping track as you switch back and forth; energy that can be spent elsewhere.
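
A minimal sketch of the idea, assuming the app exposes its environment name as a string. The colors, the fallback, and the function name `status_bar_style` are hypothetical.

```python
# Hypothetical palette: pick colors that are hard to confuse at a glance.
ENVIRONMENT_COLORS = {
    "development": "#2e7d32",  # green
    "staging": "#f9a825",      # amber
    "production": "#c62828",   # red
}
FALLBACK_COLOR = "#9c27b0"     # purple for anything unrecognized


def status_bar_style(environment):
    """Return the inline CSS for the status bar in the given environment."""
    color = ENVIRONMENT_COLORS.get(environment, FALLBACK_COLOR)
    return f"background-color: {color};"


print(status_bar_style("production"))
```

Giving unknown environments their own loud color, rather than reusing one of the known ones, keeps a misconfigured server from passing for production or development.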

Along for the ride

Posting screenshots of new features or error conditions for others to check out is great. But, what environment were you in? What code revision was deployed? When did you take the screenshot? These bits of context need to be communicated at the time of posting, and can easily be lost over time. By adding them to the status bar, they’ll be there in every screenshot you take.

It’s also not a bad idea to throw in site load and user activity metrics, since they’re particularly helpful when dealing with errors. As more stats are added, the context captured in every snapshot will continually improve.
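
One possible shape for that context line is sketched below. The function name, the field order, and the choice to shorten the revision to seven characters are assumptions for illustration.

```python
from datetime import datetime, timezone


def status_bar_context(environment, revision, now=None):
    """Build the context string that should appear in every screenshot."""
    now = now or datetime.now(timezone.utc)
    return f"{environment} | rev {revision[:7]} | {now:%Y-%m-%d %H:%M} UTC"


line = status_bar_context(
    "staging",
    "4f2a9c1d8e7b",
    datetime(2013, 1, 14, 9, 30, tzinfo=timezone.utc),
)
print(line)
```

Rendering the time in UTC avoids a second question ("whose timezone?") when a screenshot is passed around later.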

Digging in

Status bar summaries should link to more in-depth views of the data. This makes it trivial for anyone using the site to take advantage of the data you’re gathering and the services you’re using to provide visibility into your app. Without this, it’s easy for only the implementors of various services to be aware of the availability of that information and to really know what’s going on. Make it easy for everyone to stay informed.

Examples of linkages:

  • Jobs > Resque panel
  • Posts > Posts graph / Individual posts
  • Tweets > Twitter graph / Individual posts
  • Signups > Signup graph / Latest user pages
  • Recent Errors > Airbrake
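
The linkages above can be kept in a single lookup table so that adding a new status bar item also means deciding where it links. In this sketch the paths are made-up placeholders and `render_item` is a hypothetical helper.

```python
from html import escape

# Hypothetical destinations; substitute the real admin pages and service URLs.
STATUS_LINKS = {
    "Jobs": "/admin/resque",
    "Posts": "/admin/graphs/posts",
    "Signups": "/admin/graphs/signups",
}


def render_item(label, value):
    """Render one status bar summary, linked to its detail view when one is known."""
    text = escape(f"{label}: {value}")
    url = STATUS_LINKS.get(label)
    if url is None:
        return text
    return f'<a href="{escape(url)}">{text}</a>'


print(render_item("Jobs", 12))
print(render_item("Load", "0.42"))
```

Items without a known destination still render as plain text, which makes the missing link visible rather than silently broken.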

The attractor

One of the nicest aspects of having a status bar is this: helpful items will be apparent as you experience different scenarios. These helpful bits might otherwise be overlooked, or, worse, might be written down in notes never to be seen or heard from again. But taking a few minutes to implement the status bar is like putting up a whiteboard in the situation room. It will naturally accumulate useful information.

Last week was a busy week for Rails folk, as two serious vulnerabilities were publicized and patched (CVE-2013-0155 and CVE-2013-0156). If you haven’t upgraded yet, it is strongly recommended that you do so now.

Dan McKinley from Etsy had a great post entitled “Whom The Gods Would Destroy They First Give Realtime Analytics”. The post delves into the danger of acting on current information without placing it in the proper context; check it out.

Luke Ludwig from Sport Ngin talks about their experiences upgrading to Ruby 1.9.3, with some great advice about measuring each change in isolation to observe its effects.

Instrumental developer Chris Zelenak created the metriks-instrumental gem to let users of the metriks gem automatically submit their data to Instrumental.

Have a great week everyone!

Ruby developers love EventMachine. Used to drive gems like Thin, network protocols (em-http-request, em-mongo, em-zeromq, et al) and a host of in-house network servers and clients, it’s the Ruby community’s go-to library for the Reactor pattern.

It happened again. You were loading the front page of your app again and the load time took 27 seconds. You’ve seen it before, you think, on every second Tuesday, Arbor Day, and right after every new deploy. You’ve looked at the web server log files, application server log files and your database slow query logs. No luck. In the absence of any facts, you pick the nearest available black box that might be causing the problem.

When we started Instrumental, we knew that one of the big features we wanted to include was a way to interactively query the data you sent us; whether you wanted to view how many times you sent us the data, the top 25 metrics in a given group, or only graph the logarithm of a value, we knew that there was a lot more to your data than just showing you the average.

That’s why we created the Instrumental Query Language to let you express your data in any number of different ways; we launched the query language with a handful of functions, but knew you’d have some great ideas for us. After receiving your feedback, we’ve added things like historical comparison, logarithmic value and basic mathematical operators to the language to let you express your data however you’d like.

We also incorporated the language into the latest version of the Instrumental API, which many of you have been using to build awesome looking dashboards that communicate app data to your whole team. It’s been incredibly exciting to see you build tools that pull your data back out of Instrumental.

The query language is one cool feature among many, and we can’t wait to show you what we have planned for the rest of 2012 and beyond. Thanks for a fantastic year so far.

This is our third update on all the updates we’ve rolled out to Instrumental over 2012. We’ve processed a LOT of data so far this year (over 1 trillion metrics) - at any given moment we’re processing up to 700,000 data points a second from our users.

That data is sent using the instrumental_agent gem, which collects your data and buffers it in a separate thread before sending it to Instrumental. We’ve enhanced the agent’s compatibility with different Ruby versions (JRuby, Ruby 1.8.6, etc.), and included new functions like the `time` method to make it easier to benchmark different pieces of your code.
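
To illustrate the buffering idea (not the gem itself), here is a small Python sketch of an agent that queues measurements and sends them from a background thread. The class `BufferingAgent`, its `gauge` and `flush` methods, and the `send` callback are hypothetical. Only the idea of a `time` helper comes from the description above, and its form here is an assumption.

```python
import queue
import threading
from contextlib import contextmanager
from time import perf_counter


class BufferingAgent:
    """Buffers measurements in memory and sends them from a background thread."""

    def __init__(self, send, max_buffered=1000):
        self._send = send
        self._buffer = queue.Queue(maxsize=max_buffered)
        threading.Thread(target=self._drain, daemon=True).start()

    def gauge(self, metric, value):
        """Queue one measurement; never blocks the calling application code."""
        try:
            self._buffer.put_nowait((metric, value))
            return True
        except queue.Full:
            return False  # buffer is full: drop rather than stall the app

    @contextmanager
    def time(self, metric):
        """Measure how long the enclosed block takes and record it in seconds."""
        started = perf_counter()
        try:
            yield
        finally:
            self.gauge(metric, perf_counter() - started)

    def flush(self):
        """Wait until everything queued so far has been handed to `send`."""
        self._buffer.join()

    def _drain(self):
        while True:
            metric, value = self._buffer.get()
            try:
                self._send(metric, value)
            except Exception:
                pass  # a real agent would log or retry; this sketch just moves on
            finally:
                self._buffer.task_done()


received = []
agent = BufferingAgent(send=lambda metric, value: received.append((metric, value)))
agent.gauge("signups", 1)
with agent.time("report.render"):
    sum(range(1000))
agent.flush()
```

The bounded buffer is the key design choice in this sketch: if the network is slow, the application loses some measurements instead of slowing down or running out of memory.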

We also spent a lot of time talking to you about your experiences with the agent; thanks to your input, we’ve increased the agent’s compatibility with a number of different preforking worker situations like Resque or Unicorn’s worker processes.

Many of our users are pretty awesome developers in their own right, and we saw the always awesome Chris Gaffney of Collective Idea contribute the Instrumental Statsd backend as a way for you to use Statsd clients to send data to Instrumental.

Of course, to process all the data you are sending us, we had to increase our capacity slightly. :) We’ve been building out our infrastructure to handle even greater load, so you never have to worry about sending too much data.

Now that we have all this data, we wanted to really let you do some interesting things with it. Our next update will tell you all about the Instrumental Query Language, and how it helps analyze your data at an even deeper level.

(Source: instrumentalapp.com)