Edition 29

Automating Complex Gestures with the W3C Actions API

Mobile apps often involve touch gestures, sometimes quite complex ones. Touch gestures are far more common and crucial for mobile apps than for web apps: even navigating a list of items requires a flick or a swipe on mobile, and the difference between those two gestures can meaningfully change an app's behavior.

A Long Time Ago...

The story of Appium's support for these gestures is as complex as the gestures themselves can be! (Feel free to skip this section if you want to get straight to the details...) Because Appium has the goal of compatibility with WebDriver, we've also had to evolve in step with changes in the WebDriver protocol. When Appium was first developed, there were two different ways of performing actions using WebDriver's existing JSON Wire Protocol. These APIs were designed for automating web browsers driven by a mouse, so needless to say they didn't map well to the more generally gestural world of mobile app use. To make things worse, the iOS and Android automation technologies Appium was built on top of did not expose useful general gesture primitives. They each exposed their own platform-specific API, with commands like swipe that took different parameters and behaved differently from each other, as well as from the intent of the JSON Wire Protocol.

Appium thus faced two challenges: an inadequate protocol spec, and an inadequate and variable set of basic mobile APIs provided by Apple and Google. Our approach was to implement the JSON Wire Protocol as faithfully as possible, but also to provide direct access to the platform-specific APIs, via the executeScript command. We defined a special prefix, mobile:, and implemented access to these non-standard APIs behind that prefix. So, users could run commands like driver.executeScript("mobile: swipe", args) to directly activate the iOS-specific swipe method provided by Apple. It was a bit hacky, but gave users control over whether they wanted to stick to Appium's implementation of the standard JSON Wire Protocol, or gain direct access to the platform-specific methods.
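
For example, here's roughly what such a call looks like in the Java client (a sketch only: the exact argument names vary by driver and Appium version, so "direction" here is purely illustrative):

import java.util.HashMap;
import java.util.Map;

// Invoke a platform-specific gesture via the mobile: prefix. The argument
// names are illustrative; consult your driver's docs for the real ones.
Map<String, Object> args = new HashMap<>();
args.put("direction", "up");
driver.executeScript("mobile: swipe", args);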

Meanwhile, the Appium team learned that the Selenium project was working on a better, more general specification for touch actions. This new API was proposed as part of the new W3C WebDriver Spec which was under heavy development at the time. The Appium team then implemented this new API, and gave Appium clients another way to automate touch actions which we thought would be the standard moving forward. Unfortunately, this was an erroneous assumption. Appium was too quick to implement this new Actions spec---the spec itself changed and was recently ratified in a different incarnation than what Appium had originally supported. At least the spec is better now!

A Happy Confluence

That brings us to today. Over the years the mobile automation technologies Appium uses have themselves evolved, and clever people have uncovered new APIs that allow Appium to perform totally (or almost totally) arbitrary gestures. The W3C WebDriver Spec is also now an official thing, including the most recent incarnation of the Actions API. The confluence of these two factors means that, since Appium 1.8, it's been possible for Appium to support the W3C Actions API for complex and general gestures, for example in drawing a picture (which is what we are going to do in this edition of Appium Pro).

Why do we care about the W3C API specifically? Apart from Appium's desire to match the official WebDriver standard, the Appium clients are built directly on top of the Selenium WebDriver clients. As the Selenium clients change to accommodate only the W3C APIs, that means Appium will need to support them or risk getting out of phase with the updated clients.

The Actions API

The W3C Actions API is very general, which also makes it abstract and a bit hard to understand. Basically, it has the concept of input sources: there can be many of them, and each must be of a certain type (like key or pointer), potentially with a subtype (like mouse, pen, or touch), and must have an id (like "default mouse"). Pointer inputs can register actions like pointerMove, pointerUp, pointerDown, and pause. By defining one or more pointer inputs, each with a set of actions and corresponding parameters, we can define pretty much any gesture we like.

Conceptually, for example, a "zoom" gesture consists of two pointer input sources, each of which would register a series of actions:

Pointer 1 (type touch, id "forefinger")
- pointerMove to zoom origin coordinate, with no duration
- pointerDown
- pointerMove to a coordinate diagonally up and right, with duration X
- pointerUp

Pointer 2 (type touch, id "thumb")
- pointerMove to zoom origin coordinate, with no duration
- pointerDown
- pointerMove to a coordinate diagonally down and left, with duration X
- pointerUp

These input sources, along with their actions, get bundled up into one JSON object and sent to the server when you call driver.perform(). The server then unpacks the input sources and actions and interprets them appropriately, playing each input source's actions at the same time (each action takes up one "tick" of virtual time, which keeps actions synchronized across input sources).
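
To make this concrete, here is a rough sketch of that zoom gesture using the Java client's low-level actions classes (we'll meet these classes properly in the example below; the coordinates, offsets, and duration here are arbitrary assumptions):

import java.time.Duration;
import java.util.Arrays;

import io.appium.java_client.AppiumDriver;
import org.openqa.selenium.Point;
import org.openqa.selenium.interactions.PointerInput;
import org.openqa.selenium.interactions.PointerInput.Kind;
import org.openqa.selenium.interactions.PointerInput.MouseButton;
import org.openqa.selenium.interactions.PointerInput.Origin;
import org.openqa.selenium.interactions.Sequence;

// A sketch of a two-finger zoom: both fingers touch down at the origin and
// move apart diagonally over the same duration. Offsets and timing are
// illustrative assumptions.
private void zoom(AppiumDriver driver, Point origin) {
    Duration moveTime = Duration.ofMillis(500);

    PointerInput forefinger = new PointerInput(Kind.TOUCH, "forefinger");
    Sequence foreActions = new Sequence(forefinger, 0);
    foreActions.addAction(forefinger.createPointerMove(Duration.ZERO, Origin.viewport(), origin.x, origin.y));
    foreActions.addAction(forefinger.createPointerDown(MouseButton.LEFT.asArg()));
    foreActions.addAction(forefinger.createPointerMove(moveTime, Origin.viewport(), origin.x + 100, origin.y - 100));
    foreActions.addAction(forefinger.createPointerUp(MouseButton.LEFT.asArg()));

    PointerInput thumb = new PointerInput(Kind.TOUCH, "thumb");
    Sequence thumbActions = new Sequence(thumb, 0);
    thumbActions.addAction(thumb.createPointerMove(Duration.ZERO, Origin.viewport(), origin.x, origin.y));
    thumbActions.addAction(thumb.createPointerDown(MouseButton.LEFT.asArg()));
    thumbActions.addAction(thumb.createPointerMove(moveTime, Origin.viewport(), origin.x - 100, origin.y + 100));
    thumbActions.addAction(thumb.createPointerUp(MouseButton.LEFT.asArg()));

    // Both sequences are sent together, so they play back simultaneously
    driver.perform(Arrays.asList(foreActions, thumbActions));
}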

Example: Let's Draw a Surprised Face

Let's take a look at some actual Java code. Because the W3C Actions API is so new, there aren't a whole lot of helper methods in the Java client we can use to make our life easier. The helper methods that do exist are pretty basic, mostly implementing moving to and tapping on elements, with code like:

Actions actions = new Actions(driver);
actions.click(element); // moves to the element, then clicks it
actions.perform();

But this is the kind of thing we can pretty much do already, without the Actions API. What about something cool, like drawing arbitrary shapes? Let's teach Appium to draw some circles so we can play around with a "surprised face" picture. (This is just to keep things simple; as an exercise for the reader, it would be interesting to augment the drawing methods to also draw half-circles, so that our face could be more smiley and less surprised.)

If we're going to draw some circles, the first thing we'll need is some math, so we can get the coordinates for points along a circle:

private Point getPointOnCircle(int step, int totalSteps, Point origin, double radius) {
    // How far around the circle we are, in radians
    double theta = 2 * Math.PI * ((double) step / totalSteps);
    // Convert the polar coordinate (theta, radius) into x/y offsets
    int x = (int) Math.floor(Math.cos(theta) * radius);
    int y = (int) Math.floor(Math.sin(theta) * radius);
    return new Point(origin.x + x, origin.y + y);
}

The idea here is that we're going to define a circle by an origin coordinate, a radius, and a number of "steps" that determines how fine-grained the circle should be. If we pass in a value of 4 for totalSteps, for example, our "circle" will actually be a square! The greater the number of steps, the more perfect the circle will appear. Then we use the magic of trigonometry to determine, for a given iteration ("step"), which point our "finger" should move to.

Now we need to use this method to actually do some drawing with Appium:

private void drawCircle(AppiumDriver driver, Point origin, double radius, int steps) {
    Point firstPoint = getPointOnCircle(0, steps, origin, radius);

    // Create a virtual touch "finger" and a sequence of actions for it
    PointerInput finger = new PointerInput(Kind.TOUCH, "finger");
    Sequence circle = new Sequence(finger, 0);

    // Move to the starting point and touch down
    circle.addAction(finger.createPointerMove(NO_TIME, VIEW, firstPoint.x, firstPoint.y));
    circle.addAction(finger.createPointerDown(MouseButton.LEFT.asArg()));

    // Trace the circle one step at a time, ending back at the starting point
    for (int i = 1; i <= steps; i++) {
        Point point = getPointOnCircle(i, steps, origin, radius);
        circle.addAction(finger.createPointerMove(STEP_DURATION, VIEW, point.x, point.y));
    }

    // Lift the finger off the screen and perform the whole sequence
    circle.addAction(finger.createPointerUp(MouseButton.LEFT.asArg()));
    driver.perform(Arrays.asList(circle));
}

In this drawCircle method we see the use of the low-level Actions API in the Java client. Using the PointerInput class we create a virtual "finger" to do the drawing, along with a Sequence of actions corresponding to that input, which we populate as we go. From there we're just calling methods on our input to create specific actions: moving the pointer, touching it to the screen, and lifting it up. (In doing this we utilize some timing and origin constants defined elsewhere; a sketch of plausible definitions follows below.) Finally, we hand the sequence off to the driver to perform! This method is a perfectly general way of drawing a circle with Appium using the W3C Actions API.
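
For reference, the constants used in drawCircle (built from java.time.Duration and PointerInput.Origin) might be defined along these lines. The specific durations are assumptions on my part; the real values live in the full test class linked at the end:

// Plausible definitions for the constants used in drawCircle. The durations
// are illustrative: NO_TIME makes a pointer move instantaneous, while
// STEP_DURATION controls how fast the finger traces each segment.
private static final Duration NO_TIME = Duration.ofMillis(0);
private static final Duration STEP_DURATION = Duration.ofMillis(25);
private static final PointerInput.Origin VIEW = PointerInput.Origin.viewport();

But drawCircle on its own is not yet enough to draw a surprised face. For that, we need to specify which circles we want to draw, and at which coordinates: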

public void drawFace() {
    // Center points for the head, eyes, and mouth
    Point head = new Point(220, 450);
    Point leftEye = head.moveBy(-50, -50);
    Point rightEye = head.moveBy(50, -50);
    Point mouth = head.moveBy(0, 50);

    // Draw each component with an appropriate radius and step count
    drawCircle(driver, head, 150, 30);
    drawCircle(driver, leftEye, 20, 20);
    drawCircle(driver, rightEye, 20, 20);
    drawCircle(driver, mouth, 40, 20);
}

Here we simply define the center points of our various face components (head, eyes, and mouth), and then draw circles of appropriate sizes at those points so that the result kind of looks like a face. Of course, all of this will only work if we can find an app that recognizes our gestures as an attempt to draw something! Luckily, the "ApiDemos" app freely available from Google has such a view inside it, so we can start an Appium session on this app and navigate directly to the .graphics.FingerPaint activity.
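
Session setup might look roughly like this (a sketch only: the server URL, device name, and APP_PATH are assumptions to adapt to your own environment):

import java.net.URL;

import io.appium.java_client.android.AndroidDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

// A minimal session setup sketch. APP_PATH should point at a local copy of
// the ApiDemos apk; depending on your build you may also need appPackage.
DesiredCapabilities caps = new DesiredCapabilities();
caps.setCapability("platformName", "Android");
caps.setCapability("deviceName", "Android Emulator");
caps.setCapability("app", APP_PATH);
caps.setCapability("appActivity", ".graphics.FingerPaint");
AndroidDriver driver = new AndroidDriver(new URL("http://localhost:4723/wd/hub"), caps);

Once we do this, we get the masterpiece below: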

[Animated GIF of Appium drawing a face]

Go ahead and check out the full test class on GitHub to see how all the boilerplate looks or to run this example yourself. Obviously, this is a toy example, but it shows the power of the Actions API to do pretty much any kind of gesture. It's certainly suitable for the typical kinds of gestures employed in most mobile apps. But there's clearly much more creativity to be unlocked here. What will you do with the Actions API? Let me know and maybe I'll showcase your creation in a future edition!