Edition 102

Mobile App Performance Testing

What is mobile app performance testing, exactly? What should we care about in terms of app performance, and how can we test app performance as part of our automated test suite? This is a topic I've been thinking a lot about recently, so I convinced Brien Colwell (CTO of HeadSpin) to sit down with me and let me pester him with some questions. I took our conversation and turned it into a sort of FAQ of mobile app performance testing. None of this is specifically tied to Appium--stay tuned for future editions where we look at how to achieve some of the goals discussed in this article, using Appium by itself or in conjunction with other tools and services.

What is performance testing all about?

Really, we could think of performance testing as a big part of a broader concept: UX testing (or User Experience Testing). The idea here is that a user's experience of your app goes beyond the input/output functionality of your app. It goes beyond a lot of the things we normally associate with Appium (though of course it includes all those things too--an app that doesn't work does not provide a good experience!). This focus reflects the state of the app market: it's so crowded with nearly-identical apps that small improvements in UX can mean the difference between life and death for an app.

Years ago, I considered performance testing to be exclusively in the domain of metrics like CPU or memory usage. You only did performance testing when you wanted to be sure your app had no memory leaks, that kind of thing. And this is still an important part of performance testing. But more broadly, performance testing now focuses on attributes or behaviors of your application as they interface with psychological facts about the user, like how long they are prepared to wait for a view to load before moving on to another app. One quick way of summarizing all of this is to define performance testing as ensuring that your app is responsive to the directions of the user, across all dimensions.

What performance metrics should we care most about?

It's true that classic specters of poor software performance, like memory leaks or spinning CPUs, can plague mobile app experience. And there is good reason to measure and profile these metrics. However, the primary cause of a bad user experience these days tends to be network-related. So much of the mobile app experience is dominated by the need to load data over the network that any inefficiency there can cause the user to experience painfully frustrating delays. Also, testing metrics like memory or CPU usage can often be adequately accomplished locally during development, whereas network metrics need to be tested in a variety of conditions in the field.

To this end, we might track metrics like the following:

  • DNS resolution (did the name resolve to an IP that was close by?)
  • Time to first byte (TTFB) for a given response
  • Time delay between send / acknowledge in the TCP stream
  • TLS handshake time
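
To make these numbers concrete, here's a minimal sketch (in Python, against a placeholder example.com host) of collecting a few of these timings by hand. In practice you'd get this data from HAR captures, a proxy, or a device/network testing platform rather than hand-rolled sockets; the point is just to show what each metric represents.

```python
# A minimal sketch of collecting DNS, TCP connect, TLS handshake, and
# time-to-first-byte timings for one request. The host is a hypothetical
# placeholder; substitute your own API endpoint.
import socket
import ssl
import time

HOST = "example.com"  # hypothetical endpoint

t0 = time.monotonic()
ip = socket.getaddrinfo(HOST, 443)[0][4][0]        # DNS resolution
dns_ms = (time.monotonic() - t0) * 1000

t1 = time.monotonic()
raw = socket.create_connection((ip, 443), timeout=10)  # TCP handshake
tcp_ms = (time.monotonic() - t1) * 1000

t2 = time.monotonic()
ctx = ssl.create_default_context()
conn = ctx.wrap_socket(raw, server_hostname=HOST)      # TLS handshake
tls_ms = (time.monotonic() - t2) * 1000

t3 = time.monotonic()
conn.sendall(b"GET / HTTP/1.1\r\nHost: " + HOST.encode() + b"\r\nConnection: close\r\n\r\n")
conn.recv(1)  # block until the first byte of the response arrives
ttfb_ms = (time.monotonic() - t3) * 1000
conn.close()

print(f"DNS {dns_ms:.0f}ms  TCP {tcp_ms:.0f}ms  TLS {tls_ms:.0f}ms  TTFB {ttfb_ms:.0f}ms")
```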

Beyond network metrics, there are a number of other UX metrics to consider:

  • Time-to-interactivity (TTI): how long does it take for an app to become usable after the user has launched it? (This is an extremely important UX metric)
  • Total time a blank screen is shown during a user session
  • Total time loading animations / spinners are shown during a user session
  • Video quality (e.g., as a Mean Opinion Score, or MOS)
  • Time to full load (for progressively-loaded apps)
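
Measuring these isn't tied to any one tool (and future editions will dig into the Appium specifics), but here's a rough sketch of one way to get a TTI-style number with the Appium Python client: relaunch the app and time how long it takes for a key first-screen element to appear. The capabilities, package id, and "home_feed" accessibility id are hypothetical; substitute your own app's details, and note that the exact client API varies by version.

```python
# A rough sketch (not the only way) of timing how long an app takes to become
# usable after launch. All app-specific values are hypothetical placeholders.
import time

from appium import webdriver
from appium.webdriver.common.mobileby import MobileBy
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

APP_ID = "com.example.myapp"  # hypothetical package id

caps = {
    "platformName": "Android",
    "automationName": "UiAutomator2",
    "deviceName": "Android Emulator",
    "app": "/path/to/app.apk",  # hypothetical path to your build
}

driver = webdriver.Remote("http://localhost:4723/wd/hub", caps)
try:
    # Relaunch the app so we time the launch itself, not session startup
    driver.terminate_app(APP_ID)
    start = time.monotonic()
    driver.activate_app(APP_ID)
    # Treat the app as "interactive" once the main screen's content is present
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((MobileBy.ACCESSIBILITY_ID, "home_feed"))
    )
    tti_ms = (time.monotonic() - start) * 1000
    print(f"TTI (warm relaunch): {tti_ms:.0f}ms")
finally:
    driver.quit()
```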

What are some common mistakes mobile app developers make that lead to poor performance?

When developing an application, we are often doing so on top-of-the-line desktop or laptop computers and devices, with fast corporate internet. The performance we experience during development may be so good that it masks issues experienced by the average set of users. Here are a few common mistakes (again, largely network-oriented) that developers make which can radically impact performance:

  • Using HTTP/2 or a custom TCP communication channel instead of sticking with the simplicity of HTTP v1 and optimizing traffic. (I.e., there aren't as many gains as you might expect from a more modern or complicated network stack)
  • Having dependent/blocking network requests that are executed serially; whenever possible, network requests should be executed in parallel (see the sketch after this list).
  • Having too many network requests on a single view. This can be a consequence of a microservices / micro API architecture where, to get the appropriate data, lots of requests are required. For mobile it is essential to make as few requests as possible (ideally only 1 per view), and therefore to sacrifice the purity / composability of API responses in order to aggregate all the data necessary to build a given view.
  • Misconfigured DNS causing devices to talk to servers that are unduly far away.
  • Unoptimized images, video, or audio being sent to regions or networks that can't handle the bandwidth.
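
To illustrate the serial-vs-parallel point, here's a tiny sketch in Python with placeholder URLs: three independent requests issued one after another pay for three round trips, while issuing them concurrently pays for roughly one. The same principle applies in whatever HTTP stack your app actually uses.

```python
# Contrast serial vs parallel requests. The URLs are hypothetical placeholders
# for your own API endpoints.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "https://example.com/api/profile",
    "https://example.com/api/feed",
    "https://example.com/api/notifications",
]

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

# Serial: total time is roughly the sum of the individual request times
t0 = time.monotonic()
for url in URLS:
    fetch(url)
print(f"serial:   {time.monotonic() - t0:.2f}s")

# Parallel: total time is roughly the slowest single request
t0 = time.monotonic()
with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    list(pool.map(fetch, URLS))
print(f"parallel: {time.monotonic() - t0:.2f}s")
```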

How should we think about absolute vs relative performance measurements?

In general, it can often be more useful to track performance relative to a certain baseline, whether that is an accepted standard baseline, or just the first point at which you started measuring performance of your app. However, tracking relative performance can also be a challenge when testing across a range of devices or networks, because relative measures might not be comparing apples to apples. In these cases, looking at absolute values side-by-side can be quite useful as well.
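
As a concrete illustration, a relative check might look something like the sketch below, where the per-device baselines and the regression threshold are made-up numbers for illustration (the threshold echoes the ~30% device-to-device variance discussed later in this article).

```python
# A minimal sketch of tracking a metric relative to a baseline: record a
# baseline TTI per device, then flag runs whose relative change exceeds a
# chosen threshold. All numbers here are hypothetical.
BASELINE_TTI_MS = {"Pixel 3": 820, "Galaxy S9": 950}
REGRESSION_THRESHOLD = 0.30  # treat ~30% as normal device/network variance

def check_tti(device: str, measured_ms: float) -> None:
    baseline = BASELINE_TTI_MS[device]
    change = (measured_ms - baseline) / baseline
    status = "REGRESSION" if change > REGRESSION_THRESHOLD else "ok"
    print(f"{device}: {measured_ms:.0f}ms vs baseline {baseline}ms "
          f"({change:+.0%}) -> {status}")

check_tti("Pixel 3", 870)     # +6%, within normal variance
check_tti("Galaxy S9", 2100)  # +121%, worth investigating
```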

Are there any absolute standard performance targets generally recognized as helpful?

It's true that each team defines UX for their own app. Acceptable TTI measures for an e-commerce app might differ by an order of magnitude or more from acceptable measures for AAA gaming titles. Still, there are some helpful rules of thumb based on HCI (Human-Computer Interaction) research:

  • Any delay over 500ms becomes a "cognitive" event, meaning the user is aware of time having passed.
  • Any delay over 3s becomes a "reflective" event, meaning the user has time to reflect on the fact of time having passed. They can become distracted or choose to go do something else.

These aren't hard-and-fast truths that make sense in every case, and of course nobody has a universal answer, but it's helpful to treat that 500ms number as a good target for any interaction we want to feel "snappy".

(To follow up on some of this research, read up on the human processor model or powers of 10 in UX)
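
If you do adopt these numbers as absolute targets, they can be wired into tests as simple budget assertions. The sketch below assumes you already have measured values from a harness like the TTI example earlier; the helper function and the sample values are hypothetical.

```python
# A sketch of turning the rules of thumb above into test assertions.
SNAPPY_MS = 500       # below this, the delay barely registers cognitively
REFLECTIVE_MS = 3000  # above this, users have time to disengage

def assert_within_budget(label: str, measured_ms: float, budget_ms: float) -> None:
    assert measured_ms <= budget_ms, (
        f"{label} took {measured_ms:.0f}ms, over its {budget_ms:.0f}ms budget"
    )

# e.g., with values produced by your own measurement harness:
assert_within_budget("open product page", 430, SNAPPY_MS)
assert_within_budget("cold app launch", 2400, REFLECTIVE_MS)
```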

How widely should we expect performance to vary across different devices?

In other words, when should we be really concerned about differences in performance between different devices or networks? Actually, differences of around 30% between devices are quite common. This level of difference doesn't usually indicate a severe performance issue, and can (with all appropriate caveats) be regarded as normal variance.

True performance problems can cause differences of 10-100x the baseline measurements--just think how long you've waited for some app views to load when they are downloading too much content over a slow network!

How do you decide which devices to focus on for performance testing?

The answer here is simple to state, if not always simple to practice: test on the devices that bring in the greatest revenue! Obviously this implies that you have some kind of understanding of your user base: where are they located? What devices do they use? What is their typical network speed? And so on. If you don't have this information, try to start tracking it so you can cross-reference it with whatever sales metrics are important for your product (items purchased, time spent in app, whatever).

At that point, if you can pick the top 5 devices that meet these criteria in a given region, you're well positioned to ensure a strong UX.

Conclusion

"Performance" turns out to be quite a broad subcategory of UX, and of course, what we care about at the end of the day is UX, in a holistic way. The more elements of the UX we can begin to measure, the more we will be able to understand the impact of changes in our application. We'll even eventually get to the point where we've identified solid app-specific metric targets, and can fail builds that don't meet these targets, guaranteeing a minimum high level of UX quality for our users. Oh, and our users? They won't know any of this is happening, but they'll love you for it.