This blog series aims to aid teams who are contemplating adding A/B testing to their toolkit but are unsure of which tool to use. In addition to helping with tool selection, the series also provides the entire team with a consistent initial understanding of A/B testing and offers a fundamental introduction to the topic itself. We’ll also go through the most common statistical methods used in A/B tests and show how each tool uses them for evaluating the results.
This is how the series of blog articles will be structured:
- In this blog post we will begin with a general overview of A/B testing. We'll go through the main aspects and notions in A/B testing and analyse pros and cons. We'll also have a detailed overview of the most common statistical methods used to evaluate the results.
- In the blog post(s) that will follow we will examine multiple A/B test tools: specifically, we will assess the tools by employing them to execute various test scenarios involving simulated user interactions. Unlike real A/B tests, where the actual underlying values are unknown, these simulated user interactions enable us to feed the tools with input metric values that we control.
What is an A/B test?
An A/B test is a method of comparing two variants (like two different versions of a webpage or email campaign) to determine which one performs better. Typically, a single metric is used to measure the performance in A/B testing. Most often, user-centric metrics such as engagement, satisfaction, or UI/UX usability are measured. Since the metric is crucial in the outcome of the whole method, it will be further discussed below.
An A/B test is mostly used inside a hypothesis-driven development process. Such a development process applies the scientific approach of:
1. Idea generation
2. Hypothesis creation
3. Experimental design
4. Experimentation
5. Inference
6. Continue to iterate with 1. or abort
In such a hypothesis-driven development process, A/B tests are just one form of experiment to validate a hypothesis. There are other ways to test a hypothesis that may even be faster to implement and bring better results. For example:
- Examining existing data: when testing a hypothesis such as "shopping increases with a faster payment process", you can analyze existing data to establish a correlation between shopping volume and payment process speed. Once the hypothesis is confirmed, in subsequent iterations, you can generate ideas to further enhance the payment process speed.
- Usability prototypes (for example paper prototypes or wireframes): these prototypes provide the capability to lead a customer or a group of customers through a simulated journey, presenting a modified sequence of screens and dialogues. This enables you to observe and discern the elements they appreciate and those they find unfavorable in the dialogue.
A/B tests can be further supplemented with qualitative research (e.g. interviews) to better understand the motives behind user behavior.
Pros and cons of A/B testing
A/B tests are more expensive compared to alternatives since they require an implementation of the variants which should scale to a large part of the production user base. A full implementation of all aspects of the variants is also needed: for instance, it is not enough to show the UI of the new search system, but the search itself has to be implemented too.
Moreover, a significant number of interactions or users is necessary to achieve statistical significance in outcomes. The required number depends on the current metric value and the expected difference between the variants. In a later section we will see how the required number of users' interactions changes in different scenarios.
Some companies also see the risk that an A/B test might negatively influence their business because the change upsets or confuses the users or because the variation that is being tested is really worse than the current version.
On the contrary, if executed correctly, A/B testing enables us to impartially evaluate and compare the effects of a modification without additional limitations (like “a non-representative group of 10 people found variant B better”).
A/B testing may also be the only alternative when no other experiment type is possible or feasible. In practice, A/B tests are often applied in various areas, from application frontend variations and marketing email variations to ML (Machine Learning) models.
- Frontend variations (layout changes, different orders of fields or choices, changes of font / color / size, …) have the issue that there is no solid science for estimating the effect of a modification. A/B tests are especially appealing here if the software changes require minimal effort.
- The performance of ML models (for example for product recommendation or fraud detection) can change in a complex way when the model generation is changed (for example: new features in input, different tuning parameters, a different learning algorithm used). Here A/B tests may be the only way to estimate the end result from the users' perspective. Also here the effort to train and deploy a second model may be small compared to traditional software modifications.
Roles
In hypothesis-driven development processes, there are distinct roles that contribute to the success of the experiment. The experiment designer is responsible for selecting the appropriate type of experiment and designing it to effectively test the hypothesis. The evaluator analyzes the raw data from the experiment and translates it into insights that can inform decision-making. The experimenter combines the responsibilities of the experiment designer and the evaluator; this role can be filled by a variety of professionals such as data scientists, data analysts, or UX/UI designers. Finally, developers are responsible for implementing the necessary changes to the software, particularly in the case of A/B testing.
Aspects of an A/B test
In an A/B test the objects (e.g. the users) are split into disjoint buckets. This splitting is called assignment. Usually there are two buckets (labeled “A” and “B”), but there can also be more. To achieve statistical independence, the objects are randomly assigned to a bucket. Associated with each bucket is a variant: this is a specific version of something we want to test, like a button on a webpage, a different ML model or, in medical studies, a specific treatment.
In clinical studies the buckets (of patients) have special names: the group of patients who receive a medication or treatment is commonly referred to as the treatment group, while the group of patients who do not receive the treatment is called the control group. In business A/B tests, such as those conducted in e-commerce, the terms baseline or control are typically employed to refer to the currently implemented variant, while challenger denotes the proposed new version. Confusingly, in addition to “challenger” you may also come across the terms “variant” and “variation” (as in “measuring the baseline and the variant”). We will not use these terms in this way since we defined “variant” differently.
In most cases, the probability of assignment to a bucket (also referred to as weight) in an A/B test is equal (even split), typically resulting in a 50% allocation to each bucket in a two-bucket scenario. As a result, the final sizes of the buckets will either be equal (which is uncommon) or closely resemble one another. However, it is also possible to deviate from the 50/50 split, for example if you are unsure about the new variant and want to minimise negative effects: an unequal allocation of traffic (e.g. 30/70 or 20/80) may help to reduce this risk, as only a minority of the traffic (30% or 20%) is exposed to a new, unknown variation. But note that the precision/power of an A/B test depends on the size of the smallest bucket, so in general an even split leads to faster statistical significance in your experiment.
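To make this concrete, here is a minimal sketch comparing the total number of users needed for a 50/50 and a 20/80 split. It reuses the statsmodels power analysis that is introduced in detail in the statistics section below; the 5% baseline and 6% target rates are purely illustrative assumptions.

```python
from statsmodels.stats import proportion, power

# illustrative assumption: we want to detect a lift from a 5% to a 6% conversion rate
effect_size = proportion.proportion_effectsize(0.05, 0.06)
analysis = power.TTestIndPower()

# even 50/50 split: nobs1 is the size of each bucket (ratio = nobs2 / nobs1)
n_even = analysis.solve_power(effect_size=effect_size, nobs1=None,
                              alpha=0.05, power=0.8, ratio=1.0)
# uneven 20/80 split: nobs1 is the smaller bucket, the other bucket is 4x as large
n_small = analysis.solve_power(effect_size=effect_size, nobs1=None,
                               alpha=0.05, power=0.8, ratio=4.0)

print(f"50/50 split: {round(2 * n_even)} users in total")
print(f"20/80 split: {round(5 * n_small)} users in total")
```

Even though the smaller bucket itself shrinks, the total number of users (and therefore the runtime) grows with an uneven split.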
What has not been discussed so far is which objects are assigned to the buckets. The most common objects assigned are users, user interactions (like page views, shop visits, etc.), sessions or API requests. Very important here is that the bucket assignment of users and sessions has to be consistent. So, for example, a user should see the same frontend variant (corresponding to their bucket) even when they refresh the browser or log out and log in again. Consistency can be achieved by retaining the initial random bucket selection in a data store or by utilizing consistent hashing, an ingenious technique that eliminates the need for data storage and which will be covered later in this blog post. We will often use the term users to represent the assigned objects, since this is the most common use case.
As part of the experiment design, it is essential to determine the user base that will participate in the test. The test can include all users of a platform, but it is also possible to select a subset based on specific criteria. So an experiment setup may be: show 50% of the users from France a new red button, while the other 50% should see a new blue button and users from all the other countries should see the default black button.
Test types
In addition to the common A/B test there are two other test types:
A so-called A/B/n test is used to test a wider range of options when more than two buckets are needed. This is usually chosen when you don't know which variant is the most desirable and producing variants is cheap. The buckets may or may not have equal assignment probability.
In an A/A test there are two buckets which are assigned to the same variant (resulting in equal behavior of all users). This is commonly used to identify issues in the data pipeline during the introduction of an A/B test tool or after major changes in the data pipeline. Such an A/A test (with 50/50 split) will:
- verify that the assignment produces two buckets of similar size. If this is not the case, a Sample Ratio Mismatch (SRM) has occurred (a minimal check for this is sketched after this list).
- verify that the assignment is random and independent by looking at the metric which should also be similar.
- verify that the statistics part of the A/B test tool works: the tool should not detect any statistically significant differences between the two buckets
- give you an idea what the baseline metric value is before you introduce a new variant
- give you information about how many interactions happen in a given time to help later with the estimation of the experiment run time
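One minimal way to check for a Sample Ratio Mismatch is a chi-square goodness-of-fit test on the observed bucket sizes. The following sketch uses scipy (an extra dependency not used elsewhere in this post) and made-up bucket counts:

```python
from scipy.stats import chisquare

# hypothetical bucket sizes observed in an A/A test with an intended 50/50 split
observed = [50_812, 49_188]
expected = [sum(observed) / 2] * 2  # expected counts under a perfect 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square statistic: {stat:.2f}, p-value: {p_value:.6f}")
# a very small p-value (e.g. below 0.001) is a strong hint at an SRM
```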
There are more advanced experiment setups like the Multi-Armed Bandit, but these will be covered in a later blog post.
More about metrics
Metrics are the way to judge the variants. In this regard they have to be aligned with (other) business metrics; otherwise there is a risk of spending a lot of time and effort on the wrong goal. Some companies set a so-called North Star Metric, which captures the product's essential value to the customer and defines the relationship between customer problems and revenue. Growth-optimization activities center around this metric, providing direction for long-term, sustainable growth throughout the customer lifecycle.
The management business metrics like sales and profit are usually not suitable as an A/B test metric because they lag by days or weeks and are also subject to multiple other influences that cannot be controlled. So usually a more technical metric is chosen which can be calculated quickly (e.g. once per day), is only influenced by the application under test and serves as a proxy for the management business metrics in some way. This is commonly a conversion rate, obtained by dividing the number of users who take a positive action in a specific part of the software by the number of all users visiting it (see the small example below). Other metrics are, for example, basket size, average item price in the basket or average sale per customer. We will frequently use conversion rate as the metric in the examples that follow, but it's important to note that other metrics may also be applicable.
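As a tiny worked example (the numbers are made up), the conversion rate is simply the share of visitors who take the positive action:

```python
visitors = 8_200   # hypothetical: users who saw the product page
buyers = 410       # hypothetical: users who completed a purchase

conversion_rate = buyers / visitors
print(f"Conversion rate: {conversion_rate:.1%}")  # 5.0%
```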
It is usually effortless to identify several metrics that require optimization. However, it is generally preferable to concentrate on a single metric while other metrics can be included as guardrail metrics to ensure that they are not negatively impacted during A/B testing. Consider an example: if your North Star Metric is the number of purchases, you might attempt to boost this figure by either lowering the price (even though profits may decrease) or by reducing both the price and the quality of products (while this maintains the profit, it will likely increase the return rates and decrease customer satisfaction). In this scenario, it becomes crucial to incorporate profit, return rate, and customer satisfaction as guardrail metrics to ensure a comprehensive evaluation of the overall impact.
Another factor that must be taken into account when selecting a metric is the reliance on consistent object assignment into buckets. For instance, if the bucket assignment is determined based on the session, then a user who logs out and logs back in would be treated as a separate session and potentially placed in a different bucket, resulting in varied behavior. It is evident that a metric such as monthly customer profit, which is computed per user and not per session, would be inappropriate, since the metric should provide insight into the value of one variant and would not be useful if the user was exposed to multiple variants during the month. Of course if the bucket assignment is done with the most strict object type (user) you are free in the selection of metrics. But for technical reasons this may require large efforts and a session or request based approach might be preferable.
Statistics
Think about the following scenario: after running a test for one week you look at the metrics for the two variants and see 5% for the baseline and 5.5% for the challenger variant. You conclude that the challenger variant is clearly better (by 10%, no less): you end the test and communicate the outcome.
Is there something wrong with this scenario? Yes, the main problem is the risk that the difference in the metric values is caused purely by chance and not by a difference in user behavior (which is what we wanted to measure). If the difference is caused by chance, we may select the worse-performing variant and therefore cause a negative long-term impact on our metric and business. In more statistical terms, you have to make sure that the difference between the variants is statistically significant. Most of the A/B testing tools on the market can help with this evaluation by delivering an overview of the results with some statistical information. The statistical methods used might differ, though. In the following sections we will go through the two main “schools of thought” among statisticians: the frequentist and the Bayesian approach. For each approach there are many tutorials and blog posts online that you can follow (see, for example, ab-testing-with-python, bayesian-ab-testing-in-python, bayesian-in-pymc3). To clarify the concepts we will also use an example of an A/B test and explore the evaluation methods using Python.
Frequentist approach
This is the most “classical” approach to evaluate the statistical significance of your A/B test results and it uses only output data from your experiments in addition to the metric value of the baseline (for sample size estimation).
There are different types of tests that can be used in the frequentist case. Subsequently, we will walk through the primary steps of a two-tailed, two-sample t-test, which essentially examines whether there is a significant difference in a given metric between two variants (two-sample) in two directions (two-tailed), either positive or negative. There are tools (like Analytics-Toolkit) that prefer to use a one-tailed test: as the name suggests, this measures the difference only in one direction (e.g. is the challenger variant better than the baseline?) and it is suitable in many A/B test scenarios, as one would usually act (e.g. by implementing the new variant) only if a difference is found in a specific direction. However, there might be situations in which knowing the direction of your outcome is important (e.g. whether a new version of your site increases or decreases users' interactions), and the tools we will analyze use a two-sided approach in the frequentist case.
Assume you want to run an A/B test on your website to see if a new version (the challenger variant) of the “buy-now” button on the product page is better than the current version (the baseline variant). You might decide to split your users randomly into two roughly equal sized buckets and to show the baseline variant to all users from the first bucket and the challenger variant to all other users. You use the click-rate (the number of clicks on the button divided by the number of impressions, meaning the number of times the button is shown) as the metric for the A/B test.
In the Frequentist approach the null hypothesis (i.e. there is no significant difference between the click-rates of the two variants) is tested against an alternative hypothesis (i.e. the two variants have different click-rates). The main steps for this test are:
STEP 1: Set a maximal threshold (alpha) for the p-value for the significance of your test.
As we mentioned above, we want to be confident that if we see a difference between the click-rates of the variants, it is not due to chance. In more statistical words, we want the p-value of the test to be below a certain threshold: this is the probability of observing a difference (measured via a test statistic, defined in step 4) at least as large as the one observed, assuming the two variants are statistically equal (i.e. assuming the null hypothesis is true).
So we fix a maximum threshold (this is usually set to 0.05) and we will make sure the p-value of our test is below this threshold before we reject the null-hypothesis. This threshold is commonly called the alpha level.
```python
alpha = 0.05  # threshold for the p-value
```
STEP 2: Calculate the sample size
Obviously you want to see results as soon as possible but to be able to see a significant difference between the variants you have to make sure you have enough objects in your test. The estimation of the number of objects required is called power analysis and it depends on:
- the so-called power of the statistical test, which indicates the probability of finding a difference between the variants assuming that there is an actual difference
- the threshold alpha set above
- the minimum detectable effect (MDE): how big the difference between the variants should be, expressed as a relative (percentage) change.
In the example above, let’s assume that we have measured a conversion rate of 5% on the baseline variant in a given period of time. With the new variant we would like to reach at least 6% (which means an MDE of 20%). How many users do we need in each bucket to reach this in a confident manner?
In the following code example we calculate this using Python and the statsmodels library. Alternatively, you can also use different online tools to get this estimate easily.
```python
from statsmodels.stats import proportion, power

# conversion rate observed on the current variant
cr = 0.05
# minimum detectable effect: the relative lift we would like to reach with the new variant
mde = 0.2
expected_cr = cr * (1 + mde)

print("MDE: {:.2f}".format(mde))
print("Expected click rate: {}".format(expected_cr))

# calculate the effect size of proportions
effect_size = proportion.proportion_effectsize(cr, expected_cr)

nr_users = power.TTestIndPower().solve_power(
    effect_size=effect_size,
    nobs1=None,
    alpha=alpha,
    power=0.8,  # standard value for the power of the test
    ratio=1.0
)
print(f"Nr of users needed in each bucket: {round(nr_users)}")
```
Which will output:
MDE: 0.20
Expected click rate: 0.06
Nr of users needed in each bucket: 8144
Notice that power.TTestIndPower().solve_power needs some input parameters that have to be estimated: we set the power to 0.8 as this is a common value for such statistical tests. To estimate the effect size between the conversion rates (i.e. the magnitude of the difference) we use proportion.proportion_effectsize, which implements Cohen's h formula.
In a later section you will see how the required sample size changes for different baseline click-rates and MDEs. You will notice that the smaller the expected difference, the larger the sample size needed: this means that to detect a smaller MDE your test has to run longer. The short sketch below illustrates this relationship.
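The following sketch simply reuses the same statsmodels call as above in a loop; the baseline rates and MDEs are example values (a subset of the scenarios compared in a table further below):

```python
from statsmodels.stats import proportion, power

def required_sample_size(baseline_cr, mde, alpha=0.05, test_power=0.8):
    """Users needed per bucket to detect a relative lift of `mde` over `baseline_cr`."""
    expected_cr = baseline_cr * (1 + mde)
    effect_size = proportion.proportion_effectsize(baseline_cr, expected_cr)
    return round(power.TTestIndPower().solve_power(
        effect_size=effect_size, nobs1=None, alpha=alpha,
        power=test_power, ratio=1.0))

for baseline_cr, mde in [(0.05, 0.20), (0.05, 0.10), (0.03, 0.10)]:
    print(f"baseline CR {baseline_cr:.0%}, MDE {mde:.0%}: "
          f"{required_sample_size(baseline_cr, mde)} users per bucket")
```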
Notice that the results given by TTestIndPower with effect_size from statsmodels differ from the sample sizes you can get with Evan’s Miller online calculator (the difference becomes smaller as the needed sample size increases). This is due to a different assumption about the standard deviation to use under the null hypothesis (statsmodels uses a pooled estimate, while Evan’s Miller calculator uses the standard deviation of the baseline). For more details have a look at the discussion on Stack Overflow. In general we observed differences between various online calculators (check, for example, the results from Optimizely, section “why is your calculator different from other sample size calculators?”).
In the following sections we will use the results from the Python code from above since it is easier to inspect and understand how the numbers are calculated compared to the online tools mentioned.
Based on the sample size and the frequency of interactions, you can make a first estimate of the running time of your A/B test.
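A back-of-the-envelope sketch of such an estimate, assuming a purely hypothetical traffic figure of 2,000 eligible users per day:

```python
daily_users = 2_000          # hypothetical: eligible users entering the experiment per day
users_needed = 2 * 8_144     # two buckets of the size calculated above

est_runtime_days = users_needed / daily_users
print(f"Estimated runtime: {est_runtime_days:.1f} days")  # roughly 8 days
```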
STEP 3: Run the test (activate the splitting into buckets, serve the two different variants,...) and wait till the number of interactions has reached the calculated sample size.
Be aware that waiting until the sample size is reached is essential for the frequentist evaluation to work properly! The underlying issue is called the peeking problem and it occurs when the A/B test is stopped as soon as a satisfying “significance” (i.e. a low p-value) is reached, before the sample size calculated above is obtained. Indeed, the p-value can vary during the test and might reach low values even if there is no difference between the variants.
STEP 4: Accept/Reject the null-hypothesis depending on the p-value as calculated by a t-test and the threshold alpha from the first step.
Finally, once you have run the test and collected the results, you are ready to see if there is an actual difference between the variants. In an A/B testing scenario where two variants are compared, this is usually achieved via an (independent) two-sample t-test. This test delivers two values:
- test statistic: this is the difference between the means of the two groups divided by a standard error. In our case, the means are the two click-rates.
- p-value (see the explanation above): based on this and on the alpha level set at the beginning we can accept or reject the null hypothesis
Let’s continue with the scenario of the A/B test for the new “buy-now” button introduced above. Assume we stop the test and we have the minimum sample size required from the calculation above: 8200 users in the bucket of the baseline variant and 8240 in the bucket of the challenger. We will make a simulation using numpy arrays of 1 (user click) and 0 (user did not click) and the actual click rates of 5% and 5.7%:
```python
import numpy as np

seed = 42
np.random.seed(seed)

def create_random_rawdata(users, clicks):
    """Generate an array of zeros and ones having:
    length = users and clicks random ones"""
    rawdata = np.array([1] * clicks + [0] * (users - clicks))

    # Shuffle the data
    np.random.shuffle(rawdata)
    return rawdata
```
```python
from statsmodels.stats import weightstats

# number of users for each variation
users_baseline = 8200
users_challenger = 8240

# get number of clicks from conversion rate
clicks_baseline = round(users_baseline * 0.05)
clicks_challenger = round(users_challenger * 0.057)

# create fake data of user interactions using randomly generated arrays of 1s and 0s
data_baseline = create_random_rawdata(users_baseline, clicks_baseline)
data_challenger = create_random_rawdata(users_challenger, clicks_challenger)

# calculate test statistic and p-value
tstat, p, _ = weightstats.ttest_ind(
    x1=data_challenger,
    x2=data_baseline,
    alternative='two-sided',
    usevar='pooled',
    weights=(None, None),
    value=0
)
print("click-rate baseline: {:.2f}%".format(np.mean(data_baseline) * 100))
print("click-rate challenger: {:.2f}%\n".format(np.mean(data_challenger) * 100))

print("Alpha-level: {}\n".format(alpha))
print("T-test statistics: ")
print("p-value: {:.2f}".format(p))
print("tstat: {:.2f}".format(tstat))
```
Which will output:
click-rate baseline: 5.00%
click-rate challenger: 5.70%
Alpha-level: 0.05
T-test statistics:
p-value: 0.04
tstat: 2.00
The p-value of 0.04 is slightly below our threshold alpha (0.05) set at the beginning. Therefore we can reject the null hypothesis and conclude that there is a significant difference between the new version of the “buy-now” button and the current one.
However, the observed effect is only 14%, which is below the 20% MDE we set at the beginning.
It is also possible to avoid using Python and use an online calculator instead.
Bayesian approach
This other approach of evaluating A/B test results is used in some of the newest A/B testing tools and is based on the use of a prior which is then adjusted as you collect data from the results of the test. This approach can be useful when the expected difference between variants is so small that you would need a lot of user interactions to achieve statistically significant results.
In the Bayesian approach the main idea is to estimate the probability that the metric of the challenger is better than the baseline. This is commonly defined as chance to beat control or probability to be best.
Different tutorials can be found online where the Bayesian approach is explained with examples using Python. However, for the sake of completeness, we will go through the main steps of estimating this probability value by following the same example we used in the frequentist approach.
STEP 1: Set a minimum threshold for the probability to be best
We will see in the last point how to calculate the probability that the challenger is better than the baseline based on the current test data. In order to make a safe decision at the end, we need to set a minimum probability we want to reach. A common value for this threshold is 95%.
STEP 2: Choose the prior distribution for the conversion rate
The idea is to find a “good” function to model the probability of observing a certain conversion rate CR. As the conversion rate takes values between 0 and 1, the beta distribution is a good candidate (it is defined on the interval [0,1]). Its probability density function (PDF) is defined as:

$$f(x; a, b) = \frac{x^{a-1}(1-x)^{b-1}}{B(a, b)}$$

where $B(a, b)$ is the beta function, which acts as a normalization constant.
As you can see from the formula, the probability density function depends on two parameters a and b that we need to choose. If you have no prior information on your conversion rate, a good choice is a=b=1, which is a “flat” prior distribution. Notice that the prior distribution chosen depends on the type of metric you want to analyse. For continuous metrics such as revenue, order value, etc. other probability distributions are considered (e.g. gamma distribution).
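As a quick sanity check of the “flat” choice a = b = 1, the following minimal snippet evaluates the beta PDF at a few conversion rates (it uses scipy.stats.beta, which is not otherwise used in this post):

```python
from scipy.stats import beta

a, b = 1, 1  # flat prior: no prior knowledge about the conversion rate
for cr in (0.01, 0.05, 0.5, 0.99):
    # Beta(1, 1) assigns the same density to every value in [0, 1]
    print(f"prior density at CR={cr}: {beta.pdf(cr, a, b):.2f}")  # always 1.00
```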
STEP 3: Calculate the posterior probability of the conversion rates
Once you start collecting data from the test, you can then “update” your estimate of the probability of observing a certain conversion rate. The posterior distribution can be computed from the prior defined above via Bayes’ rule. So, assuming we have collected n user interactions and c clicks for a specific variant, we can adapt the probability density function by updating the parameters a and b with the data we have collected as follows:

$$a \rightarrow a + c, \qquad b \rightarrow b + (n - c)$$
The following Python code generates a graph of the two posterior PDFs:
```python
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(seed=seed)
sim_size = 20000

# extract total nr of users and clicks by variant from the data above
users_baseline = data_baseline.size
clicks_baseline = data_baseline.sum()

users_challenger = data_challenger.size
clicks_challenger = data_challenger.sum()

# set a, b for the prior
a = 1
b = 1

# update a and b for the posterior of baseline and challenger
a_bl = a + clicks_baseline
b_bl = b + (users_baseline - clicks_baseline)

a_ch = a + clicks_challenger
b_ch = b + (users_challenger - clicks_challenger)

# sample from the posterior distribution f(x, a+c, b+(n-c))
posterior_bl = rng.beta(a_bl, b_bl, size=sim_size)
posterior_ch = rng.beta(a_ch, b_ch, size=sim_size)

# plot the distributions
plt.figure()
plt.title("Posterior PDF of baseline and challenger CR")
sns.kdeplot(posterior_bl, color="blue", label="Baseline")
sns.kdeplot(posterior_ch, color="orange", label="Challenger")
plt.legend()
plt.show()
```
We can then use these posteriors to calculate how likely it is that the challenger is going to be better than the baseline in general.
STEP 4: Compute the probability that the challenger is better than the baseline
Now that we have the distributions of baseline and challenger metric, we can estimate the probability of the challenger being better than the baseline by checking how often the distribution of the challenger is higher than the baseline.
```python
ch_bl = posterior_ch > posterior_bl
print("Estimation of the probability that the challenger is better than the baseline: {:.2f}%".format(
    np.mean(ch_bl) * 100))
```
Which will output:
Estimation of the probability that the challenger is better than the baseline: 97.58%
As the probability to be best is (slightly!) higher than the minimum threshold we set in step 1 (95%), we can be confident that the challenger will perform better if we choose it. So we might decide to update our “buy-now” button with the new version of the challenger.
Notice that all we did so far can easily be done with the help of Python modules like bayesian-testing or PyMC3 (the latter has more functionality as it is designed for general Bayesian analysis). For example, with bayesian-testing we can get an evaluation summary as follows (which also returns the value of 97.58% from above):
```python
import pprint
from bayesian_testing.experiments import BinaryDataTest

bayesian_test_agg = BinaryDataTest()

bayesian_test_agg.add_variant_data_agg(name="baseline",
                                       totals=users_baseline,
                                       positives=clicks_baseline,
                                       a_prior=a,
                                       b_prior=b)

bayesian_test_agg.add_variant_data_agg(name="challenger",
                                       totals=users_challenger,
                                       positives=clicks_challenger,
                                       a_prior=a,
                                       b_prior=b)

pprint.pprint(bayesian_test_agg.evaluate(sim_count=sim_size, seed=seed))
```
Which will output:
[{'expected_loss': 0.0070563,
'positive_rate': 0.05,
'positives': 410,
'posterior_mean': 0.05011,
'prob_being_best': 0.02415,
'totals': 8200,
'variant': 'baseline'},
{'expected_loss': 3.43e-05,
'positive_rate': 0.05704,
'positives': 470,
'posterior_mean': 0.05715,
'prob_being_best': 0.97585,
'totals': 8240,
'variant': 'challenger'}]
What we presented here is the more classical Bayesian approach based on the probability of being the best. An alternative approach which is sometimes used is based on the so-called HDI+ROPE decision rule; see this paper for more details.
Comparing the two approaches on the provided example: with the frequentist approach the result is only borderline significant (a p-value of 0.04, just below alpha) and the observed effect of 14% falls short of the 20% MDE the test was planned for, so the case for the new variant remains weak. With the Bayesian approach, on the other hand, the probability of the challenger being better than the baseline is above the threshold we set: so if we are willing to accept the resulting effect size (14%) we can be confident in switching to the new variant.
Pros and cons of Bayesian vs. Frequentist
In the frequentist approach we have to wait until we reach the established sample size, which is based on the MDE we want to reach (and which also has to be guessed/estimated). The Bayesian approach, on the other hand, can reach conclusions faster. In the table below we have collected different simulation outcomes using the code from the examples in the previous sections: for different baseline and challenger conversion rates, we have calculated the sample size needed for the frequentist approach and the corresponding sample size in the Bayesian approach to reach a probability of being best of ~96% (slightly above the threshold).
Minimum sample size for different click-rates of the baseline and MDEs:

| Baseline CR | Challenger CR | MDE | Frequentist Sample Size | Bayesian Sample Size | Ratio |
|---|---|---|---|---|---|
| 3% | 3.2% | 6% | 117 857 | 42 297 | 2.78 |
| 3% | 3.3% | 10% | 53 183 | 18 743 | 2.83 |
| 5% | 5.5% | 10% | 31 218 | 11 137 | 2.80 |
| 5% | 6% | 20% | 8 144 | 2 859 | 2.84 |
Another advantage is that for the Bayesian approach no estimation of the MDE is required, which means one less parameter to estimate or guess. On the other hand, in the Bayesian case one has to select a matching probability distribution for the metric.
|  | Pros | Cons |
|---|---|---|
| Frequentist | no prior assumption needed, only data from the test used | estimation of sample size needed and higher risk of peeking |
| Bayesian | reaches conclusions faster | need to choose a probability distribution (like beta or gamma) for the metric |
Technical details about bucket assignment
One central aspect is the bucket assignment. The assignment function takes some identifier (user-id, session-id or request-id) and outputs the bucket-id to use. There are usually two options to implement such a function in A/B testing tools:
- using a persistent data store: query the data store and, if no entry is found, draw a random number between 0 and 1, assign the user based on the random number and the bucket probabilities, and store the bucket-id in the persistent data store. For web applications, the persistent data store is located on some server with a local cache, since the assignment doesn't change afterwards. One drawback is the added latency in the initial case when the server needs to be contacted.
- consistent hashing: hash the user-id and convert the hash to a number between 0 and 1 (this is now pseudo-random), assign the user based on this number and the bucket probabilities, and return the bucket-id. This variant doesn't need a database or server interaction, but it breaks as soon as a second experiment is performed: in the new experiment each user would end up in the same bucket again. To avoid this, the hash is computed from the user-id and the experiment-id, as the sketch below illustrates.
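Below is a minimal sketch of such a consistent-hashing assignment; the function and parameter names are our own and not taken from any specific tool:

```python
import hashlib

def assign_bucket(object_id: str, experiment_id: str, weights: dict) -> str:
    """Deterministically map an object (e.g. a user) to a bucket.

    Hashing the object-id together with the experiment-id keeps the
    assignment stable per object but independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{object_id}".encode()).hexdigest()
    point = int(digest, 16) / 16 ** len(digest)  # pseudo-random number in [0, 1)
    cumulative = 0.0
    for bucket, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return bucket
    return bucket  # guard against floating-point rounding at the upper edge

# example: 50/50 split; the same user always lands in the same bucket
print(assign_bucket("user-123", "buy-now-button-test", {"A": 0.5, "B": 0.5}))
```

Calling the function twice with the same ids returns the same bucket, while changing the experiment-id reshuffles the users.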
Conclusion & next blog post
In conclusion, we have given a comprehensive overview of A/B testing in this blog post, delving into its fundamental aspects and concepts. Our analysis encompassed a thorough examination of the advantages and disadvantages associated with A/B testing, accompanied by a detailed exploration of the prevalent statistical methods utilised for result evaluation.
In the upcoming post, our focus will shift to a detailed examination of GrowthBook as the initial tool in our series. This will include a description of how we evaluate such tools. Stay tuned for insights into optimising your experimentation processes!