The engineering problem of A/B testing

Maybe you want to segment on device type, or on audience data too.

You can import stats from a database and use a basic Excel sheet, or you can use a tool like Google Analytics that, for instance, allows you to query sequences, which can be very useful in analyzing user behavior.

Visual editors: Tools like Optimizely offer a visual editor that allows you to click your way to designing a new experiment.

Super useful if you don’t have direct access to an engineering team (if you do have that, there are likely better options).

Implementation approaches

To my (admittedly limited) knowledge, there are at least five ways to implement A/B testing:

Canary releases: If you want to test a new variation of your website, you can deploy a new version per variant that you want to test (a feature branch, perhaps), and then route a subset of your users (with sticky sessions) to that new deployment.

To use this, you need a well-managed infrastructure and release pipeline, especially when you want to run multiple tests in parallel, which means many different deployments and the routing complexity that comes with them.

Likely you’ll need a decent amount of traffic, too.

The upsides seem clear, though.

For instance, any failed experiment does not introduce technical debt (code never lands on master, and deployments are just deleted).

Another benefit is that this enforces that a user can only be in one experiment at a time; multiple experiments introduce both technical challenges and uncertainty about experiments influencing each other.
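To make this a bit more concrete, here is a minimal sketch of such a routing layer, assuming a Node proxy built on the http-proxy package; the deployment URLs, cookie name and 10% canary split are made up for illustration:

```ts
import http from "node:http";
import httpProxy from "http-proxy";

// Hypothetical deployment URLs: one per variant, deployed from feature branches.
const targets = {
  control: "http://control.internal:3000",
  variant: "http://variant-b.internal:3000",
};
const proxy = httpProxy.createProxyServer({});

http
  .createServer((req, res) => {
    // Sticky sessions: reuse the bucket from a cookie if present, otherwise assign one.
    const match = /ab_bucket=(control|variant)/.exec(req.headers.cookie ?? "");
    const bucket = match
      ? (match[1] as keyof typeof targets)
      : Math.random() < 0.1
      ? "variant" // send roughly 10% of new users to the canary
      : "control";
    if (!match) {
      res.setHeader("Set-Cookie", `ab_bucket=${bucket}; Path=/; Max-Age=604800`);
    }
    // Forward the request to the deployment that matches the bucket.
    proxy.web(req, res, { target: targets[bucket] });
  })
  .listen(8080);
```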

Split URLs: Historically recommended by Google to prevent SEO issues, this approach uses separate URLs to route users to different experiments.

As an example: /amazing-feature/test-123/b.

The benefit of this approach specifically is that you will not negatively impact any SEO value a given URL on your domain has while you’re experimenting with different designs.

Server-side: Users are bucketed on the server when a page is requested.

A cookie is then set to ensure the user is “stuck” in this bucket, and it’s used to render an interface with whatever experiments the user is in.
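As a rough sketch of that flow, assuming an Express server with cookie-parser (the cookie name, the 50/50 split and the view names are invented for the example):

```ts
import express from "express";
import cookieParser from "cookie-parser";

const app = express();
app.use(cookieParser());
app.set("view engine", "ejs"); // hypothetical view setup

app.get("/product/:id", (req, res) => {
  // Reuse the bucket from the cookie if present, otherwise assign one.
  let bucket = req.cookies["exp_product_page"] as "a" | "b" | undefined;
  if (!bucket) {
    bucket = Math.random() < 0.5 ? "a" : "b";
    // "Stick" the user in this bucket for 30 days.
    res.cookie("exp_product_page", bucket, { maxAge: 30 * 24 * 3600 * 1000 });
  }
  // Render whichever interface matches the user's bucket.
  res.render(bucket === "b" ? "product-variant-b" : "product", { id: req.params.id });
});

app.listen(3000);
```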

You can pretty much do whatever you want: A/B tests, multivariate tests, feature toggles, parallel experiments: it’s all up to you.

For the user, this is one of the best options, because the performance impact is negligible.

However, because you use cookies, the benefits of a CDN are limited.

Cookies introduce variation in requests (especially if users can enter multiple experiments), which leads to cache misses and leaves you without the protection of a CDN.

Client-side: If you don’t have access to the server, or you want to have maximum flexibility, client-side A/B testing is also an option.

In this scenario, either nothing is rendered at first, or the original interface is; as soon as (or slightly before) that happens, the experiments are activated and the interface is augmented based on whatever variant the user is in.
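A bare-bones version of that augmentation step could look like the snippet below; the experiment id, selector and copy are hypothetical, and real tools layer targeting, anti-flicker tricks and analytics on top of this:

```ts
// Persist the assignment so the user sees the same variant on every visit.
function getVariant(experimentId: string): "a" | "b" {
  const key = `exp_${experimentId}`;
  let variant = localStorage.getItem(key) as "a" | "b" | null;
  if (!variant) {
    variant = Math.random() < 0.5 ? "a" : "b";
    localStorage.setItem(key, variant);
  }
  return variant;
}

document.addEventListener("DOMContentLoaded", () => {
  if (getVariant("new-cta-copy") === "b") {
    // Augment the already-rendered interface for the variant.
    const cta = document.querySelector<HTMLButtonElement>("#buy-button");
    if (cta) cta.textContent = "Buy now, pay later";
  }
});
```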

This choice often makes sense when you do not have access to an engineering team, and are using external tools to run experiments.

However, it’s often the worst choice in terms of performance.

As an example, let’s look at how client-side Optimizely is implemented: you embed a blocking script, which forces the browser to wait to display anything on screen until that script is downloaded, compiled and executed.

Additionally, the browser will de-prioritize all other resources (that might be needed to progressively enhance your website) in order to load the blocking script as fast as possible.

To top it off, the browser has to connect to another origin if you do not self-host the script, and the script can only be cached for a couple of minutes (that is, if you want to be able to turn off a conversion-destroying experiment as quickly as possible).

With synthetic tests on mobile connections, I’ve seen Optimizely delay critical events by 1–2 seconds.

Use with caution!

On the edge: If you have a CDN in front of your website, you can use the power of edge workers to run experiments.

I’ll defer to Scott Jehl for details, but the gist of it is that your server renders all variations of your interface, your CDN caches this response, and then when a user loads your website, the cached response is served after the edge worker removes the HTML that is not applicable to the user requesting it.

A very promising approach if you care about performance, because you get the benefits of a CDN without any impact on browser performance.
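To give a rough idea of the mechanics, here is a sketch assuming Cloudflare Workers and its HTMLRewriter, with hypothetical data-variant attributes marking the variant markup in the cached HTML:

```ts
export default {
  async fetch(request: Request): Promise<Response> {
    // Assumes two variants, "a" and "b", and a bucket cookie set elsewhere.
    const cookie = request.headers.get("Cookie") ?? "";
    const bucket = /ab_bucket=b/.test(cookie) ? "b" : "a";
    const removeVariant = bucket === "b" ? "a" : "b";

    // The origin response contains the markup for both variants and is cached by the CDN.
    const cached = await fetch(request);

    // Strip the variant the user should not see before the HTML leaves the edge.
    return new HTMLRewriter()
      .on(`[data-variant="${removeVariant}"]`, {
        element(el) {
          el.remove();
        },
      })
      .transform(cached);
  },
};
```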

The reality of A/B testing

Turns out, A/B testing is hard.

It can be very valuable, and I think you owe it to your bottom line to measure.

However, it’s not a silver bullet, and you have to tailor your approach to the type of company you are (or want to be).

Here’s what I learned at a mid-sized company with roughly 50–100k users a day:

Isolate experiments as much as possible

At my current employer, we are implementing experiments in parallel, and the implementation is always on production as soon as it has been verified, regardless of whether it will be used or not (basically, a feature toggle).

This is mostly due to our tech choices: we have a server-side rendered, re-hydrated, single-page application, which makes it hard to use the canary strategy (because you never go back to a router or load-balancer).

Besides that, we cannot afford the luxury of one experiment per user across the platform, due to a lack of traffic.

In practice, this means that experiments have side-effects.

There are two issues at play here.

Firstly, concurrently implemented experiments make any reasonable expectation of end-to-end test coverage impossible: even a small number like 10 A/B tests creates 2^10 (over a thousand) variations of your application.

To test all those different variations, our tests would take 250 hours instead of 15 minutes.
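For the skeptical, the back-of-the-envelope arithmetic (the 15-minute suite is the real number, the rest follows from it):

```ts
// 10 concurrent A/B tests, each with 2 variants, yields 2^10 combinations.
const variations = 2 ** 10;                       // 1024
const suiteMinutes = 15;                          // one full end-to-end run
const totalHours = (variations * suiteMinutes) / 60;
console.log(variations, Math.round(totalHours));  // 1024 variations, ~256 hours
```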

So, we disable all experiments during our tests.

Which means that any experiment can — and eventually will — break critical (and non-critical) user functionality.

Besides the cache problem I mentioned earlier, this also makes it much harder to reliably reproduce bugs from your error reporting systems (and reproduction is hard enough to begin with!).

Secondly, running multiple experiments across a user’s journey will lead to uncertainty about test results.

Suppose you have an experiment on a product page, and one on search.

If the experiment on search has a big impact on the type of traffic you send to the product page, the results from the product experiment will be skewed.

The best isolation strategy I can think of is canary releases and feature branches.

In my wild, lucid dreams, this is how it works: when you start an experiment, you create a branch that will contain the changes for a variant.

You open a pull request, and a test environment with that change is deployed.

Once it passes review, it is then deployed to production, and the router configuration is updated to send a certain amount of traffic to the variant that you want to test.

You have to look at expected usage, general traffic and a desired duration of the test to determine what traffic split makes sense.

Suppose you estimate that 20% of traffic for a week would be enough. It would then be common to exclude 80% of traffic from the test and split the remaining 20% evenly between an instance running the current version of your website and one running the variant of the experiment.
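As a worked example (the daily traffic figure is made up, but in the 50–100k range mentioned earlier):

```ts
const dailyUsers = 75_000;                              // hypothetical daily traffic
const allocation = 0.2;                                 // 20% of users enter the experiment
const arms = 2;                                         // current version vs. the variant
const days = 7;

const usersPerArm = ((dailyUsers * allocation) / arms) * days;
console.log(usersPerArm);                               // 52,500 users per arm after a week
```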

[A diagram showing an A/B testing architecture]

I can imagine orchestrating this requires a significant engineering effort though, especially when you want to automatically turn experiments on and off, or when you want to use more advanced targeting.

You have to have enough traffic here, and at this point you will see benefits from splitting up your website into smaller deployable units.

For instance, you might want to consider splitting up your frontend into micro-frontends.

If you cannot properly isolate experiments, you could try to accept that not all problems in life are or should be solvable.

If you’re more of a control freak (like me), you might want to consider mutually exclusive experiments — meaning that a user that is in experiment X cannot be in experiment Y at the same time.

This will help eliminate behavioral side-effects.
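One common way to implement that exclusivity is to hash users deterministically into slots and give each experiment a disjoint range of slots. A sketch, with made-up experiment ids and a 50/50 slot layout:

```ts
import { createHash } from "node:crypto";

const SLOTS = 100;

// Each experiment owns a disjoint range of slots, so a user in experiment X
// can never also be in experiment Y.
const experiments = [
  { id: "product-page-cta", from: 0, to: 49 },
  { id: "search-ranking", from: 50, to: 99 },
];

// Hash the user id so the same user always lands in the same slot.
function slotFor(userId: string): number {
  const digest = createHash("sha256").update(userId).digest();
  return digest.readUInt32BE(0) % SLOTS;
}

function experimentFor(userId: string): string | null {
  const slot = slotFor(userId);
  const exp = experiments.find((e) => slot >= e.from && slot <= e.to);
  return exp ? exp.id : null;
}

console.log(experimentFor("user-42")); // always the same single experiment for this user
```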

If you need more testing confidence, you can opt for lower-level testing, like unit or component testing.

Or, you can deal with 250-hour-plus pipelines, whatever floats your boat.

Stick to high standards

One oft-repeated mantra around A/B testing is “it’s just a test, we’ll fix it later”.

The idea here is that you build an MVP and gauge interest, and if there is any, you build a better-designed, better-implemented version as the final version.

In practice I have not seen that work, presumably for two reasons: the first is that the incentive to fix things disappears after they are shipped.

This applies to all parties: engineering, design and product.

The feature has already proven to be an uplift, and spending time redesigning or refactoring will feel unneeded.

And things that feel unneeded — even if they are needed — will not happen, especially in the pressure cooker that is product engineering.

The second reason is that re-implementing an experiment, let alone redesigning it, could have an impact on any previously assumed uplift.

To be absolutely sure, you’d have to run another experiment, now with the production-ready implementation.

Ain’t nobody got time for that, chief.

And here’s the thing: the type of environment that needs to take shortcuts for the implementation of an experiment is also unlikely to allocate time to refactor and/or re-run a successful experiment.

What happens? You accumulate tech debt.

Often not something that is clearly scoped or quantitatively described.

Debt that is hard to put a number on, and hard to make the case for addressing.

Debt that will creep up on you, until finally, everybody gives up and pulls out the Rewrite hammer.

(I’ll refer back to Nicolas’ tweet at this point).

Different standards are not just unwise, they are also confusing.

It’s hard enough to align engineers on one standard, but two? Impossible.

Brace yourself for endless back-and-forths in code reviews.

(On a personal note — as if the former declarations were not just that — lower standards are uninspiring as hell, too.

But maybe that’s just me.)

Aim for impact

VWO, a conversion optimization platform, estimates that only around 1 in 7 experiments succeeds.

The common refrain in the CRO world is that failing is okay, as long as you learn from it.

The assumption here is that knowing what doesn’t work is as valuable as knowing what does.

However, that does not mean you should start experimenting with things that are just guesswork, and/or can be figured out by common sense, experience, or qualitative research.

Every one of those options is cheaper than throwing away over 85% of the capacity of 100k-a-year designers or developers, especially if you take churn into account, which will inevitably rise if your employees feel like all of their contributions are meaningless.

How do you keep morale high and let contributors feel like they’re being valued? Sure, buy-in and emphasizing learning help.

But for me, big bets are the most inspiring.

They allow me to fully use my experience and skill set to make a difference.

Now, what a big bet is depends on the type of company you are, but I wouldn’t consider repositioning or copy experiments to be in that category.

A good indication that you are setting the bar too low is many inconclusive experiments (or experiments that have to run for a long time to reach significance).

If that’s the case, you’re placing too many small bets.

Now, back to that tweet…

Admittedly, I used Nicolas’ tweet mostly to have a nice, stinging introduction to a rather boring topic, but it carries a potent truth: data, or the requirement of data, often leads to inertia.

Data does not have all the answers.

It does not replace a vision, and it is not a strategy.

Define a vision, then use A/B tests to validate your progress in reaching that vision.

Not the other way around.
