12.3. Responsiveness, Service Level Objectives, and Apdex

Speed is a feature.

—Adam De Boor, Gmail software engineer, Google

Performance is a feature.

—Jeff Atwood, co-founder of Stack Overflow

The best performance improvement is the transition from the nonworking state to the working state.

—John Ousterhout, designer of Magic and Tcl/Tk

As we learned in Section 1.7, availability refers to the fraction of time your site is available and working correctly. For example, Google Apps guarantees its enterprise customers a minimum of “three nines” or 99.9% availability, though Nygard wryly notes (Nygard 2007) that less-disciplined sites provide closer to “two eights” (88.0%).

Why might your app be unavailable? For one thing, it may have crashed because of an unexpected error. While most PaaS services automatically restart a crashed app, restarting can induce delays and harm availability. One way to improve the reliability of software is to make it more robust. Defensive programming is a philosophy that tries to anticipate potential software flaws and write code to handle them. Here are three examples, followed by a sketch of how they might look in code:

  • Check input values. A common cause of problems is for the user to input values that the developer doesn’t expect. Checking that individual values fall within a reasonable range, that a series of data is not too large, and that the collection of inputs is logically consistent can reduce the chances of outages.

  • Check input data type. Another mistake users can make is to enter an unexpected type of data in response to a query. Making sure the user enters a valid type of data increases the chances of success for the app.

  • Catch exceptions. Modern programming languages offer the ability to execute code when exceptions occur, such as arithmetic overflow. Providing code that catches any exception increases the chances that the app will continue to run well even when unexpected events occur.
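
To make these checks concrete, here is a minimal sketch in Python; the function names, the notion of an “order quantity,” and the limit of 10,000 are illustrative assumptions rather than part of any particular app.

    MAX_QUANTITY = 10_000   # assumed "reasonable range" for an order quantity

    def parse_quantity(raw_value):
        """Validate one user-supplied input and return it, or raise ValueError."""
        try:
            quantity = int(raw_value)            # check input data type
        except (TypeError, ValueError):
            raise ValueError(f"expected a whole number, got {raw_value!r}")
        if not 1 <= quantity <= MAX_QUANTITY:    # check input values
            raise ValueError(f"quantity {quantity} is outside 1..{MAX_QUANTITY}")
        return quantity

    def handle_order(raw_value):
        try:                                     # catch exceptions
            return {"status": "ok", "quantity": parse_quantity(raw_value)}
        except Exception as e:
            # One bad request yields an error response; the app keeps running.
            return {"status": "error", "message": str(e)}

For example, handle_order("42") succeeds, while handle_order("lots") or handle_order(0) returns an error response instead of crashing the process serving the request.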

Another availability challenge is a bug that leads to outages but only appears after a long time or under heavy load. A classic example is a resource leak: a long-running process eventually runs out of a resource, such as memory, because it cannot reclaim 100% of the unused resource due to either an application bug or the inherent design of a language or framework. Software rejuvenation is a long-established way to alleviate a resource leak: the Apache web server runs a number of identical worker processes, and when a given worker process has “aged” enough, that process stops accepting requests and dies, to be replaced by a fresh worker. Since only one worker (1/n of total capacity) is “rejuvenated” at a time, this process is sometimes called rolling reboot, and most PaaS platforms employ some variant of it.
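
The sketch below, in Python, shows one way the rolling-reboot idea could be expressed; it is not Apache’s actual implementation, and the worker count, request limit, and function names are assumptions chosen for illustration. Each worker exits after serving a bounded number of requests, so the operating system reclaims whatever it leaked, and the supervisor immediately starts a fresh replacement, one worker at a time.

    import multiprocessing as mp
    import time

    NUM_WORKERS = 4                 # only 1/NUM_WORKERS of capacity restarts at once
    MAX_REQUESTS_PER_WORKER = 1000  # the "age" at which a worker is rejuvenated

    def worker():
        for _ in range(MAX_REQUESTS_PER_WORKER):
            time.sleep(0.001)       # stand-in for handling one request (which may leak)
        # Returning ends the process, so the OS reclaims everything it leaked.

    def supervise():
        workers = [mp.Process(target=worker) for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        while True:                 # runs until interrupted
            for i, w in enumerate(workers):
                if not w.is_alive():                  # this worker has "aged out"
                    w.join()
                    workers[i] = mp.Process(target=worker)
                    workers[i].start()                # replace it with a fresh worker
            time.sleep(0.1)

    if __name__ == "__main__":
        supervise()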

Overprovisioning is often used in anticipation of crash recovery and rolling reboot. The idea is to provide more servers in a tier at any given time than you think you’ll need. For example, by deploying \(n + 1\) servers in a tier when \(n\) would suffice, temporarily losing one server degrades the tier’s capacity by only \(1/(n+1)\). Good values for \(n\) can sometimes be determined empirically by monitoring, as the rest of this chapter describes. However, at large scale, systematic overprovisioning is economically unattractive and may be insufficient by itself. For example, in an app whose database queries are poorly constructed, the database will quickly become the bottleneck, and as Section 12.2 reminds us, databases are generally not amenable to shared-nothing horizontal scaling. In such a situation, overprovisioning the other tiers won’t help. The lesson is that in the end, there’s no substitute for a design that is free of gratuitous bottlenecks to scalability. Later in this chapter we identify some common bottlenecks and how to avoid them.

Of course, it’s not much good if your app is technically “available” but so sluggish that users don’t want to use it. Responsiveness is the perceived delay between when a user takes an action such as clicking on a link and when the user perceives a response, such as new content appearing on the page. Technically, responsiveness has two components: latency, the initial delay before new content starts to arrive, and throughput, the rate at which the rest of the content is delivered. As recently as the mid-1990s, many home users connected to the Internet using telephone modems that took 100 ms (milliseconds) to deliver the first packet of information and show part of the Web page. Telephone modems could sustain at most 56 Kbps (\(56 \times 10^3\) bits per second), so loading a complete Web page or image 50 KBytes in size, or 400 Kbits, took roughly \(400/56 \approx 7\) seconds even at the full rated speed, and more than eight seconds once real-world modem speeds and protocol overhead are accounted for. Since today’s home customers increasingly use broadband connections whose throughput is 1–20 Mbps, responsiveness for Web pages is dominated by latency rather than throughput.

Since responsiveness has such a large effect on user behavior, SaaS operators carefully monitor the responsiveness of their sites. Of course, in practice, not every user interaction with the site takes the same amount of time, so evaluating performance requires appropriately characterizing a distribution of response times. Consider a site on which 8 out of 10 requests complete in 100 ms, 1 out of 10 completes in 250 ms, and the remaining 1 out of 10 completes in 850 ms. If the user satisfaction threshold \(T\) for this site is 200 ms, the average response time of \((8(100) + 1(250) + 1(850))/10 = 190\) ms is indeed below the satisfaction threshold, yet 20% of requests (and therefore, up to 20% of users) are receiving unsatisfactory service. Two definitions are used to measure latency in a way that makes it impossible to ignore the bad experience of even a small number of users; a short calculation applying both to this example follows the list:

  • A service level objective (SLO) usually takes the form of a quantitative statement about the quantiles of the latency distribution over a time window of a given width. For example, “95% of requests within any 5-minute window should have a latency below 100 ms.” In statistical terms, the 95th percentile (the 0.95 quantile) of the latency distribution must not exceed 100 ms.

  • The Apdex score (Application Performance Index) is an open standard that computes a simplified SLO as a number between 0 and 1 inclusive representing the fraction of satisfied users. Given a user satisfaction threshold latency \(T\) selected by the application operator, a request is satisfactory if it completes within time \(T\), tolerable if it takes longer than \(T\) but less than \(4T\), and unsatisfactory otherwise. The Apdex score is then \((\text{Satisfactory} + 0.5 \times \text{Tolerable}) / \text{Number of samples}\). In the example above, the Apdex score would be \((8 + 0.5(1))/10 = 0.85\).
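
The short Python calculation below applies both definitions to the example above; the latencies and the 200 ms threshold come from the text, while the nearest-rank estimate is just one simple way to compute the 95th percentile.

    import math

    latencies_ms = [100] * 8 + [250, 850]    # the ten requests from the example
    T = 200                                  # user satisfaction threshold (ms)
    slo_quantile, slo_target_ms = 0.95, 100  # "95% of requests below 100 ms"

    # SLO check: the 95th percentile (nearest-rank estimate) must stay below the target.
    ranked = sorted(latencies_ms)
    p95 = ranked[math.ceil(slo_quantile * len(ranked)) - 1]
    slo_met = p95 < slo_target_ms            # p95 is 850 ms here, so the SLO is violated

    # Apdex: (Satisfactory + 0.5 * Tolerable) / Number of samples
    satisfactory = sum(1 for l in latencies_ms if l <= T)
    tolerable = sum(1 for l in latencies_ms if T < l <= 4 * T)
    apdex = (satisfactory + 0.5 * tolerable) / len(latencies_ms)   # (8 + 0.5) / 10 == 0.85

Here the 95th-percentile latency of 850 ms violates the example SLO of “95% below 100 ms,” while the Apdex score of 0.85 matches the hand calculation above.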

Of course, the total response time perceived by the users includes many factors beyond your SaaS app’s control. It includes DNS lookup, time to set up the TCP connection and send the HTTP request to the server, and Internet-induced latency in receiving a response containing enough content that the browser can start to draw something (so-called “time to glass,” a term that will soon seem as quaint as “counterclockwise”). Especially when using curated PaaS, SaaS developer/operators have the most control over the code paths in their own apps: routing and dispatch, controller actions, model methods, and database access. We will therefore focus on measuring and improving responsiveness in those components.

For small sites, a perfectly reasonable way to mitigate latency is to overprovision (provide excess resources relative to steady-state) at one or more tiers, as Section 12.2 suggested for the presentation and logic tiers. A few years ago, overprovisioning meant purchasing additional hardware that might sit idle, but pay-as-you-go cloud computing lets you “rent” the extra servers for pennies per hour only when needed. Indeed, companies like RightScale offer just this service on top of Amazon EC2.

As we will see, a key insight is that the same problems that push us out of the “PaaS-friendly” tier are the ones that will hinder the scalability of our post-PaaS solutions, so understanding what kinds of problems they are and how to solve them will serve you well in either situation.

What are the thresholds for user satisfaction on responsiveness? A classic 1968 study from the human-computer interaction literature (Miller 1968) found three interesting thresholds: if a computer system responds to a user action within 100 ms, it’s perceived as instantaneous; within 1 second, the user will still perceive a cause-and-effect connection between their action and the response, but will perceive the system as sluggish; and after about 8 seconds, the user’s attention drifts away from the task while waiting for a response. Surprisingly, more than thirty years later, a scholarly study in 2000 (Bhatti et al. 2000) and another by the independent firm Zona Research in 2001 affirmed the “eight-second rule.” While many believe that a faster Internet and faster computers have raised users’ expectations, the eight-second rule is still used as a general guideline. New Relic, whose monitoring service we introduce later, reported in March 2012 that the average page load for all pages they monitor worldwide was 5.3 seconds and the average Apdex score was 0.86.

Self-Check 12.3.1. For a SaaS app to scale to large numbers of users, it must maintain its ____ and ____ as the number of users increases, without increasing the ____.

Availability; responsiveness; cost per user

Self-Check 12.3.2. True or False: From the perspective of responsiveness, faster is always better.

False. Responses faster than about 100 ms are already perceived as instantaneous, so further speedup is imperceptible, and users tend to abandon a site only when responsiveness slows to about 8 seconds or worse.