{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "When you have single evaluation metric, you have to know what is the impact of your results to the business side. Analytical speaking, you want to find whether your results is significantly different. You then also want to know about the magnitude and direction of your changes.\n", "\n", "If your results is statisically significant, then you can interpret the results based on the how you characterize the metric and build intuition from it, just as we have discussed in previous blog. You also want to check the variability of the metric that you experiment.\n", "\n", "If your results is not statiscally significant when it really should, then you can do two things. You could subset your experiment by platform, time (day of the week) see what went wrong or different significant if subset by those features. It could lead you to new hypothesis test and understand how your participants reacts. If you just begin in your experiment, you should cross-check your parametric hypothesis and non-parametric hypothesis test.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Effect Size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpg](../galleries/abtesting/5w5.jpg)\n", "\n", "*Screenshot taken from Udacity, A/B Testing, Single Metric Example*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next experiment is about change color and placement of \"Start Now\" button. The metric used is click-through-rate, and the unit of diversion is cookie. There's two things that we can't use analytic standard error. First because our unit of analysis is different than unit of diversion, then the empiric variability and analytical variability expected to be different. Second, more importantly CTR is not following binomial(normal) distribution, instead poisson distribution. The variability can only be calculated empirically.\n", "\n", "Assume Audacity have passed the sanity check for the experiments, want to know analyze whether the changes is worth it for business metric (meaning, the results is statistically significant) and have calculated empirical standard errors. To know which standard errors they want, they calculate the sample size and have 10.000 for each group. \n", "\n", "After they experiment, they have control and experiment grou, number of clocks and pageviews which summarize as above X(number of clicks), N(number of pageviews). We have 0.0035 as our desired standard error earlier, and normalized by scalling factor:\n", "\n", "$$SE = \\frac{SE}{\\sqrt{\\frac{1}{7370}+\\frac{1}{7270}}}$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we build a confidence interval using observed outcome as our point estimate as depicted by the image above. Since we determined that anything outside practical difference 0.01 (dmin) will be significantly different, and indeed the CI is outside of the boundary (typo in the image, CI should be 0.022 to 0.038). So it's recommended to launch the experiment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sign Test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ![jpg](../galleries/abtesting/5w6.jpg)\n", "\n", "*Screenshot taken from Udacity, A/B Testing, Single Metric Example*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we have another example. Suppose we want to experiment changes which will have positive increase every day. Here we have 7-days experiment. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Sign Test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ![jpg](../galleries/abtesting/5w6.jpg)\n", "\n", "*Screenshot taken from Udacity, A/B Testing, Single Metric Example*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we have another example. Suppose we experiment with a change and the metric shows a positive increase on every single day of a 7-day experiment. We want to know how rare it would be, by chance alone, to see an increase on every day, so that we can attribute it to the change in our experiment. Using the normal approximation to the binomial is not an option, since the numbers of successes and failures each need to be at least 5, and we have no failures.\n", "\n", "We can instead calculate the exact binomial probability:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.0078125])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%R choose(7,7) * 0.5**7 * 0.5**0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is 0.0078, but this is only a one-tailed p-value. Because we care about a significant difference in either direction, we multiply it by two; the two-tailed p-value is 0.0156. This is still statistically significant (below the usual alpha of 0.05), so we should consider launching the experiment. (A quick Python cross-check of this calculation appears a bit further below.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![jpg](../galleries/abtesting/5w7.jpg)\n", "\n", "*Screenshot taken from Udacity, A/B Testing, Checking Invariant, Part 2*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take this to another example and calculate the effect size and the sign test." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "Xs_cont = [196, 200, 200, 216, 212, 185, 225, 187, 205, 211, 192, 196, 223, 192] \n", "Ns_cont = [2029, 1991, 1951, 1985, 1973, 2021, 2041, 1980, 1951, 1988, 1977, 2019, 2035, 2007] \n", "Xs_exp = [179, 208, 205, 175, 191, 291, 278, 216, 225, 207, 205, 200, 297, 299] \n", "Ns_exp = [1971, 2009, 2049, 2015, 2027, 1979, 1959, 2020, 2049, 2012, 2023, 1981, 1965, 1993]" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Xcont = sum(Xs_cont)\n", "Ncont = sum(Ns_cont)\n", "Xexp = sum(Xs_exp)\n", "Nexp = sum(Ns_exp)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confidence Interval = (0.0064658534962369341, 0.016736185710796242)\n" ] } ], "source": [ "# Observed difference in overall CTR between experiment and control\n", "d = float(Xexp)/Nexp - float(Xcont)/Ncont\n", "# The empirical SE of 0.0062 was measured with 5,000 pageviews per group;\n", "# rescale it to the actual sample sizes before building the 95% CI\n", "scaled_factor_anl = (np.sqrt(1./5000+ 1./5000))\n", "scaled_factor_emp = (np.sqrt(1./Ncont+ 1./Nexp))\n", "SE = 0.0062*scaled_factor_emp / scaled_factor_anl\n", "m = 1.96*SE\n", "print('Confidence Interval = {}'.format((d-m,d+m)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we see is that analytically this is significantly different, but notice that the practical significance boundary lies inside the interval. So we may want to hold that thought; let's cross-check with a sign test on the day-by-day data." ] }, 
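{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, here is the Python cross-check of the earlier sign-test probability promised above. It is a minimal sketch that assumes scipy is available (scipy is not used anywhere else in this notebook); it should reproduce the R one-liner exactly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy import stats\n", "\n", "# Probability of 7 'successes' (days with an increase) out of 7 days under p = 0.5\n", "p_one_tail = stats.binom.pmf(7, 7, 0.5)\n", "p_two_tail = 2 * p_one_tail  # double it for the two-tailed test\n", "print('one-tailed p = {}, two-tailed p = {}'.format(p_one_tail, p_two_tail))" ] }, 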
] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.DataFrame({'clicks_control':Xs_cont,\n", " 'pageviews_control':Ns_cont,\n", " 'clicks_experiment':Xs_exp,\n", " 'pageviews_experiment':Ns_exp})" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ctr_exp = df['clicks_experiment']/df['pageviews_experiment']\n", "ctr_cont = df['clicks_control']/df['pageviews_control'] \n", "df['isHigher'] = ctr_exp > ctr_cont" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
clicks_controlclicks_experimentpageviews_controlpageviews_experimentisHigher
019617920291971False
120020819912009True
220020519512049False
321617519852015False
421219119732027False
\n", "