Sampling

Synopsis

  • Effect on proportion

    • Percentages

      • Problematic: the same percentage difference can carry different intuitions in different ranges (a 5-point change near 50% feels different from one near 1%)

      • Difference is subjective

    • Odds Ratio

    • LOR (Log Odds Ratio)

      • Most appropriate of the three, since it is symmetric (swapping the groups only flips the sign)

  • Choosing an effect-size indicator is driven by context and familiarity; use whichever makes sense to your audience (see the sketch after this list)

  • Precision

    • How precise is the data?

    • Sources of Errors

      • Sampling Errors

        • Is my sample representative of the population?

      • Measurement Errors

        • Units of measurement (UOM)

        • Improper Recording

      • Random Errors

        • Least important

        • The only source of error with a clean mathematical model behind it, which is why it gets so much attention in statistics classes
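
To make the effect-size options above concrete, here is a small sketch with made-up proportions for a control and a treatment group, comparing a percentage difference, an odds ratio, and a log odds ratio:

import numpy as np

# hypothetical proportions, for illustration only
p_control, p_treatment = 0.20, 0.26

# percentage-point and relative differences: intuitive, but their meaning
# depends on where in the 0-100% range the proportions sit
diff_pp = (p_treatment - p_control) * 100
rel_diff = (p_treatment - p_control) / p_control * 100

# odds ratio, where odds = p / (1 - p)
odds_ratio = (p_treatment / (1 - p_treatment)) / (p_control / (1 - p_control))

# log odds ratio: symmetric between the two groups
log_odds_ratio = np.log(odds_ratio)

diff_pp, rel_diff, odds_ratio, log_odds_ratio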

Imports

import numpy as np
import scipy as sp
import scipy.stats as stats
import matplotlib as mpl
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
import pandas as pd
from IPython.display import display, display_html
import bqplot

from bqplot import LinearScale, Hist, Figure, Axis, ColorScale

from bqplot import pyplot as pltbq
# import seaborn as sns
# seed the random number generator so we all get the same results
np.random.seed(17)

# some nice colors from http://colorbrewer2.org/
COLOR1 = '#7fc97f'
COLOR2 = '#beaed4'
COLOR3 = '#fdc086'
COLOR4 = '#ffff99'
COLOR5 = '#386cb0'

mpl.rcParams['figure.figsize'] = (8.0, 9.0)
%matplotlib inline

Part 1

  • Estimate the average weight of men and women in the US

  • Quantify the uncertainty in the estimate

  • Approach =>

    • Simulate many experiments

    • Compare how results vary from one experiment to another

  • Start by assuming a distribution, then show how to eliminate that assumption (solve without it)

# Weight of women (in kg)

weight = stats.lognorm(0.23, 0, 70.8)
weight.mean(), weight.std()
(72.69764573296688, 16.944043048498038)
help(stats.lognorm)
Help on lognorm_gen in module scipy.stats._continuous_distns object:

class lognorm_gen(scipy.stats._distn_infrastructure.rv_continuous)
 |  A lognormal continuous random variable.
 |
 |  The probability density function for `lognorm` is:
 |
 |  .. math::
 |
 |      f(x, s) = \frac{1}{s x \sqrt{2\pi}}
 |                \exp\left(-\frac{\log^2(x)}{2s^2}\right)
 |
 |  for :math:`x > 0`, :math:`s > 0`.
 |
 |  `lognorm` takes ``s`` as a shape parameter for :math:`s`.
 |
 |  A common parametrization for a lognormal random variable ``Y`` is in
 |  terms of the mean, ``mu``, and standard deviation, ``sigma``, of the
 |  unique normally distributed random variable ``X`` such that exp(X) = Y.
 |  This parametrization corresponds to setting ``s = sigma`` and ``scale =
 |  exp(mu)``.
 |
 |  ... (remainder of the help output, listing the rv_continuous / rv_generic
 |      methods such as pdf, cdf, ppf, sf, rvs, fit, mean, std, etc.,
 |      omitted for brevity)

Note

The log-normal distribution in scipy.stats is specified by three parameters:

  • s (shape) - σ, the standard deviation of the underlying normal distribution

  • scale - the scale parameter; with loc = 0 it equals the median, exp(μ), and stretches or shrinks the distribution

  • loc - the location parameter, which shifts the distribution along the x-axis

For more background on the parametrization, see the note in the help output above.
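
A small sketch of that parametrization (assuming loc = 0): if X ~ Normal(μ, σ) and Y = exp(X), then Y is log-normal with shape s = σ and scale = exp(μ), which is also its median.

# same distribution as stats.lognorm(0.23, 0, 70.8) used below,
# built from the parameters of the underlying normal
mu, sigma = np.log(70.8), 0.23
rv_from_normal = stats.lognorm(s=sigma, loc=0, scale=np.exp(mu))
rv_from_normal.mean(), rv_from_normal.std()   # roughly (72.70, 16.94)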

rv = stats.lognorm(1, 0, 50)  # arbitrary shape, loc, and scale (median), just to see the shape
# xs = np.linspace(1, 100, 100)
ys = rv.rvs(10000)
plt.hist(ys)
Copy to clipboard
(array([9.308e+03, 5.550e+02, 8.500e+01, 3.000e+01, 1.000e+01, 3.000e+00,
        4.000e+00, 3.000e+00, 0.000e+00, 2.000e+00]),
 array([1.12258114e+00, 2.17935158e+02, 4.34747735e+02, 6.51560312e+02,
        8.68372890e+02, 1.08518547e+03, 1.30199804e+03, 1.51881062e+03,
        1.73562320e+03, 1.95243578e+03, 2.16924835e+03]),
 <BarContainer object of 10 artists>)
Copy to clipboard
../_images/Sampling_11_1.png
# For women weight
weight = stats.lognorm(0.23, 0, 70.8)
weight.mean(), weight.std()
Copy to clipboard
(72.69764573296688, 16.944043048498038)
Copy to clipboard
xs = np.linspace(20, 160, 100)
ys = weight.pdf(xs)

plt.plot(xs, ys, linewidth=4, color=COLOR1)
plt.xlabel('weight (kg)')
plt.ylabel('PDF')
plt.show()
Copy to clipboard
../_images/Sampling_13_0.png
def make_sample(n=100):
    sample = weight.rvs(n)
    return sample
Copy to clipboard
sample = make_sample(100)
sample.mean(), sample.std()
Copy to clipboard
(73.04819598338332, 16.22111742103728)
Copy to clipboard
def sample_stat(sample):
    return sample.mean()
Copy to clipboard
sample_stat(sample)
Copy to clipboard
73.04819598338332
Copy to clipboard

Simulating the experiment 1000 times

def computing_sampling_distribution(n=100, iters=1000):
    stats = [sample_stat(make_sample(n)) for i in range(iters)]
    return np.array(stats)
Copy to clipboard
sample_means = computing_sampling_distribution(n=100, iters=1000)
Copy to clipboard
plt.hist(sample_means, color=COLOR2)
plt.show()
Copy to clipboard
../_images/Sampling_21_0.png
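
As a rough check on the simulation, the spread of this sampling distribution should be close to the theoretical standard error of the mean, σ/√n, where σ is the standard deviation of the weight distribution:

# theoretical standard error for samples of size n = 100
se_theory = weight.std() / np.sqrt(100)
# empirical spread of the simulated sample means
se_empirical = sample_means.std()
se_theory, se_empirical   # both should come out around 1.7 kg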
@interact
def sim(n=[10,20,50,100, 200, 1000, 10000], iters=[100, 1000, 10000]):
    sample_means = computing_sampling_distribution(n, iters)
    mean = sample_means.mean()
    std = sample_means.std()
    plt.hist(sample_means, bins=30, color=COLOR2)
    plt.axvline(mean, color=COLOR4)
    plt.axvline(mean -3*std, color=COLOR5)
    plt.axvline(mean +3*std, color=COLOR5)
    
    conf_int = np.percentile(sample_means, [2.5, 97.5]) # 95% confidence interval
    plt.axvline(conf_int[0], color=COLOR1)
    plt.axvline(conf_int[1], color=COLOR1)
    plt.ylabel("count")
    plt.xlabel(f"sample_means(n = {n})")
    plt.title(f" iters={iters},  $\mu_s$={mean:0.4f},  $\sigma_s$={std:0.4f}")
    plt.show()
Copy to clipboard

Application: a simulation monitor

caption = widgets.Label(value='The values of slider1 and slider2 are synchronized')
slider1, slider2 = widgets.IntSlider(description='Slider 1'),\
                   widgets.IntSlider(description='Slider 2')
l = widgets.link((slider1, 'value'), (slider2, 'value'))
display(caption, slider1, slider2)
Copy to clipboard
The values of slider1 and slider2 are synchronized
0
0

Simple plot linked to an Output widget

%matplotlib widget
Copy to clipboard
widget_plot = widgets.Output()
Copy to clipboard
with widget_plot:
    widget_plot.clear_output()
    fig, ax = plt.subplots(1, 1)   # create fig/ax first; clearing ax before it exists raises NameError
#     display(ax.figure)
    plt.show()
widget_plot
Copy to clipboard
ax.plot(range(1, 100))
fig.canvas.draw()
Copy to clipboard
ax.clear()
ax.plot(range(1, 20))
fig.canvas.draw()
Copy to clipboard
# @interact
# def update_plot(n=2):
#     ax.clear()
#     ax.plot(np.linspace(1,100,100)**n)
#     fig.canvas.draw()
Copy to clipboard
widget_plot
Copy to clipboard
widget_pow = widgets.IntSlider(value=2,min=0, max=5)
widget_pow
Copy to clipboard
def update_plot(n=2):
    ax.clear()
    ax.plot(np.linspace(1,100,100)**n)
    fig.canvas.draw()

def handle_change(change):
    update_plot(change.new)
Copy to clipboard
widget_pow.observe(handle_change, "value")
Copy to clipboard
widgets.VBox([widget_plot, widget_pow])
Copy to clipboard

Lognorm Application

widgets.SelectionSlider(
    options=[1,2, 5, 6, 10],
    value=1,
    description='n',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True
)
Copy to clipboard
def sample_stat(sample, kind='mean'):
    if kind =='mean':
        return sample.mean()
    elif kind == 'coeff_variation':
        return sample.mean()/sample.std()
    elif kind == 'min':
        return sample.min()
    elif kind == 'max':
        return sample.max()
    elif kind == 'median':
        return np.percentile(sample, 50)
    elif kind == 'p10':
        return np.percentile(sample, 10) 
    elif kind == 'p90':
        return np.percentile(sample, 90)
    elif kind == 'IQR':
        return np.percentile(sample, 75) - np.percentile(sample, 25)
    
def make_sample(rv, n=100):
    sample = rv.rvs(n)
    return sample

def computing_sampling_distribution(rv, n=100, iters=1000, kind='mean'):
    stats = [sample_stat(make_sample(rv, n), kind) for i in range(iters)]
    return np.array(stats)
Copy to clipboard
sample_stat(sample, kind='mean')
Copy to clipboard
fig, axes = plt.subplots(nrows=2, ncols=2)
axes
Copy to clipboard
ax0, ax1, ax2, ax3 = axes.flatten()
Copy to clipboard
# def figure(figsize=None):
#     'Temporary workaround for traditional figure behaviour with the ipympl widget backend'
#     fig = plt.figure()
#     if figsize:
#         w, h =  figsize
#     else:
#         w, h = plt.rcParams['figure.figsize']
#     fig.canvas.layout.height = str(h) + 'in'
#     fig.canvas.layout.width = str(w) + 'in'
#     return fig


def update_text(ax, x, y, s, align='right'):
    ax.text(x, y, s,
            horizontalalignment=align,
            verticalalignment='top',
            transform=ax.transAxes)
    
# class KindRange(object):
#     def __init__(self, kind, min_d, max_d):
#         self.kind = kind
#         self.min_d = min_d
#         self.max_d = max_d

class LogNormVisualizer(object):
    
    def __init__(self, shape, loc, median, xs=np.linspace(20, 160, 100)):
        plt.close('all')
        self.shape = shape
        self.loc = loc
        self.median = median
        self.w_shape = widgets.FloatText(value=shape, description='shape',disabled=False)
        self.w_loc = widgets.FloatText(value=loc, description='loc',disabled=False)
        self.w_median = widgets.FloatText(value=median, description='median',disabled=False)
        self.w_n = widgets.SelectionSlider(
                                options=[10,20,50,100, 200, 1000, 10000],
                                value=10,
                                description='n',
                                disabled=False,
                                continuous_update=False,
                                orientation='horizontal',
                                readout=True
                            )
        
        self.w_kind = widgets.Dropdown(
                                    options=['mean', 'coeff_variation', 'min', 'max', 'median', 'p10', 'p90', 'IQR'],
                                    value='mean',
                                    description='Statistics',
                                    disabled=False,
                                    continuous_update=False,
                                    orientation='horizontal',
                                    readout=True
                                )
        self.kind = self.w_kind.value
        self.w_iters = widgets.SelectionSlider(
                                options=[100,500, 1000, 5000, 10000],
                                value=100,
                                description='iters',
                                disabled=False,
                                continuous_update=False,
                                orientation='horizontal',
                                readout=True
                            )
        self.rv = stats.lognorm(shape, loc, median)
        self.w_rv = widgets.Output()
        plt.close()
        with self.w_rv:
            self.w_rv.clear_output()
            self.fig, axes = plt.subplots(nrows=4, ncols=1)
            self.ax1, self.ax2, self.ax3, self.ax4 = axes
            self.ax1.set_title("Log Normal Distribution")
            self.ax2.set_title("Statistics")
            self.ax3.set_title("Stat_Variation_n")
            self.ax4.set_title("Stat_Variation_iter")
            self.fig.canvas.toolbar_visible = False
            self.fig.canvas.header_visible = False
            self.fig.canvas.footer_visible = False
#             self.fig.canvas.layout.width = '100%'
#             self.fig.canvas.layout.height = '6in'
#             self.fig.canvas.layout.width = '6in'
            self.fig.canvas.resizable = True
            self.fig.canvas.capture_scroll = True
            self.fig.tight_layout()
            
        self.range_store = {}
        self.w_range_min = widgets.FloatText(value=0, description='r_min',disabled=False)
        self.w_range_max = widgets.FloatText(value=100, description='r_max',disabled=False)
        self.w_range_reinit = widgets.Button(disabled=False, description='ReInitialize')
        self.w_rv.layout.display = 'flex-grow'
        self.xs = xs
        self.sample_d = None
        self.r_min = None
        self.r_max = None
        self.df = pd.DataFrame(columns=['n', 'iters', 'kind','mean_d','std_d'])
        self.update_rv()
        self.link_widgets()
        
#         self.update_rv_plot()
#         self.update_stat_plot()
        
    def init_kind_range_store(self, kind, r_min, r_max):
        self.update_range_store(kind, r_min, r_max)
        self.w_range_min.value = r_min
        self.w_range_max.value = r_max
            
    def update_range_store(self, kind, r_min, r_max):
        self.range_store[kind] = (r_min, r_max)
    
    def link_widgets(self):
        self.w_shape.observe(self.handle_shape_change, "value")
        self.w_loc.observe(self.handle_loc_change, "value")
        self.w_median.observe(self.handle_median_change,'value')
        self.w_n.observe(self.handle_n_change, 'value')
        self.w_iters.observe(self.handle_iters_change, 'value')
        self.w_kind.observe(self.handle_kind_change, 'value')
        self.w_range_min.observe(self.handle_w_range_min_change, 'value')
        self.w_range_max.observe(self.handle_w_range_max_change, 'value')
        self.w_range_reinit.on_click(self.handle_click)
        
#         plt.show()
    
    
    def handle_click(self, b):
        mean = self.sample_d.mean()
        std = self.sample_d.std()
        self.r_min = np.floor(mean-4*std)
        self.r_max = np.ceil(mean+4*std)
        self.init_kind_range_store(self.kind, self.r_min, self.r_max)
        plt.show()
        
    def reset(self):
        self.update_range_store(self.kind, self.r_min, self.r_max)
        self.ax2.set_xlim(self.r_min, self.r_max)
        plt.show()
#         self.fig.canvas.draw()
#         plt.draw()
        
    def handle_w_range_max_change(self, change):
        self.r_min = self.w_range_min.value
        self.r_max = change.new
        self.reset()
        
    def handle_w_range_min_change(self, change):
        self.r_min = change.new
        self.r_max = self.w_range_max.value
        self.reset()

#         plt.show()
    
    def handle_shape_change(self, change):
        self.shape = change.new
        self.update_rv()
        
    def handle_loc_change(self, change):
        self.loc = change.new
        self.update_rv()
        
    def handle_median_change(self, change):
        self.median = change.new
        self.update_rv()
        
    def handle_n_change(self, change):
        self.n = change.new
        self.update_stat_plot()
        self.update_var_plot()
        
    def handle_iters_change(self, change):
        self.iters = change.new
        self.update_stat_plot()
        self.update_var_plot()
        
    def handle_kind_change(self, change):
        self.kind = change.new
        self.update_stat_plot()
        self.ax3.clear()
        self.update_var_plot()
        
    def update_rv_plot(self):
#         self.ax1.set_xlim([np.random.randint(100),100])
        self.ax1.clear()
        self.ax1.set_title("Log Normal Distribution")
        self.ys = self.rv.pdf(self.xs)
        self.ax1.plot(self.xs, self.ys, linewidth=4, color=COLOR1)
        update_text(self.ax1, 0.97, 0.95,  f"$\sigma$={self.shape}")
        update_text(self.ax1, 0.97, 0.85,  f"$loc$={self.loc}")
        update_text(self.ax1, 0.97, 0.75,  f"$\mu$={self.median}")
        plt.show()
        
    def update_stat_plot(self):
        self.ax2.clear()
        self.ax2.set_title(f"Statistics[{self.w_kind.value}]")
        self.sample_d = computing_sampling_distribution(self.rv, self.w_n.value, 
                                                       self.w_iters.value, 
                                                       kind=self.w_kind.value)
        mean = self.sample_d.mean()
        std = self.sample_d.std()
        self.r_min = np.floor(mean-4*std)
        self.r_max = np.ceil(mean+4*std)
        a,b = np.percentile(self.sample_d,[10, 90])
        self.ax2.hist(self.sample_d, bins=30, color=COLOR2)
        self.ax2.axvline(mean, color=COLOR4)
        self.ax2.axvline(mean -3*std, color=COLOR5)
        self.ax2.axvline(mean +3*std, color=COLOR5)
        self.ax2.axvline(a, color=COLOR3)
        self.ax2.axvline(b, color=COLOR3)
        self.df.loc[self.df.shape[0]] = [self.w_n.value, self.w_iters.value, self.w_kind.value, mean, std]
        if self.kind in self.range_store:
            self.ax2.set_xlim(*self.range_store[self.kind])
            pass
        else:
            self.init_kind_range_store(self.kind, self.r_min, self.r_max)
            self.ax2.set_xlim(self.r_min, self.r_max)
        update_text(self.ax2, 0.03, 0.95,  f"$\sigma_d$={std:0.2f}", align='left')
        update_text(self.ax2, 0.03, 0.85,  f"$\mu_d$={mean:0.2f}", align='left')
        update_text(self.ax2, 0.97, 0.95,  f"$p10_d$={a:0.2f}")
        update_text(self.ax2, 0.97, 0.85,  f"$p90_d$={b:0.2f}")
        plt.show()
        
        
    def update_var_plot(self):
#         self.df.plot.scatter('n', 'value', ax=self.ax3)
        sel = self.df[self.df.kind==self.kind]
        sel.plot.scatter(x='n', y='mean_d', s=sel['std_d']*200, ax=self.ax3, logx=True)
        self.ax3.set_ylim(*self.range_store[self.kind])
        sel.plot.scatter(x='iters', y='mean_d', s=sel['std_d']*200, ax=self.ax4, logx=True)
        self.ax4.set_ylim(*self.range_store[self.kind])
#         sns.scatterplot()
        
        plt.show()
        
    def update_rv(self):
        self.rv = stats.lognorm(self.shape, self.loc, self.median)
#         self.w_rv.clear_output()
        self.update_rv_plot()
        self.update_stat_plot()
        self.update_var_plot()
        
    def view(self):
#         with self.w_rv:
#             display(self.ax.figure)
        self.widget_setter_label = widgets.Label("Parameters", position='center')
        self.widget_setter = widgets.VBox([self.widget_setter_label,
                                           widgets.HBox([self.w_shape, self.w_loc, self.w_median])])
        self.widget_simcontrol = widgets.HBox([self.w_n, self.w_iters, self.w_kind])
        self.range_control = widgets.HBox([self.w_range_min, self.w_range_max, self.w_range_reinit])
        ui = widgets.VBox([self.widget_setter,self.widget_simcontrol,self.range_control, self.w_rv])
        return ui
Copy to clipboard
lnv = LogNormVisualizer(0.23, 0, 70.8)
Copy to clipboard
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-34708152cec1> in <module>
----> 1 lnv = LogNormVisualizer(0.23, 0, 70.8)

<ipython-input-25-7235e5af6dae> in __init__(self, shape, loc, median, xs)
     93         self.r_max = None
     94         self.df = pd.DataFrame(columns=['n', 'iters', 'kind','mean_d','std_d'])
---> 95         self.update_rv()
     96         self.link_widgets()
     97 

<ipython-input-25-7235e5af6dae> in update_rv(self)
    233 #         self.w_rv.clear_output()
    234         self.update_rv_plot()
--> 235         self.update_stat_plot()
    236         self.update_var_plot()
    237 

<ipython-input-25-7235e5af6dae> in update_stat_plot(self)
    190         self.ax2.clear()
    191         self.ax2.set_title(f"Statistics[{self.w_kind.value}]")
--> 192         self.sample_d = computing_sampling_distribution(self.rv, self.w_n.value, 
    193                                                        self.w_iters.value,
    194                                                        kind=self.w_kind.value)

TypeError: computing_sampling_distribution() got an unexpected keyword argument 'kind'
Copy to clipboard
../_images/Sampling_46_1.png
lnv.rv.mean(), lnv.rv.std()
Copy to clipboard
# df = lnv.df
Copy to clipboard
# df.loc[df.shape[0]] = [1,2,3,4]; df
Copy to clipboard
lnv.view()
Copy to clipboard
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-73bcf06dbfee> in <module>
----> 1 lnv.view()

NameError: name 'lnv' is not defined
Copy to clipboard
# lnv.df[lnv.df.kind=='mean'].plot.scatter(x='n',y='std_d',  ax=lnv.ax3)
lnv.df[lnv.df.kind=='mean'].plot.scatter(x='n',y='std_d', s='mean_d', ax=lnv.ax3)
# lnv.df
Copy to clipboard
np.percentile?
Copy to clipboard
# plt.xlim?
Copy to clipboard
w = widgets.IntText()
l = widgets.IntText()
Copy to clipboard
t = 2
def handle_change(change):
    t = change.new
    l.value = t
    
w.observe(handle_change, "value")
Copy to clipboard
widgets.VBox([w,l])
Copy to clipboard
# widgets.Label?
Copy to clipboard
# display?
Copy to clipboard

Estimating PI Computationally

  • Area of a circle of radius r = πr²

  • Area of a square with side 2r = 4r²

  • Their ratio is π/4, so for points sampled uniformly in the square, π ≈ 4 × (fraction of points that land inside the circle)

rv = stats.uniform(-1,2)
Copy to clipboard
%matplotlib inline
plt.plot(np.linspace(1,10,10), rv.rvs(10))
plt.show()
Copy to clipboard
../_images/Sampling_63_0.png
@interact
def estimate_pi(n=(1,1000000)):
    x = stats.uniform(-1,2)
    y = stats.uniform(-1,2)
    dist = np.sqrt( x.rvs(n)**2+y.rvs(n)**2)
#     print(dist[:5])
    in_circle = dist <= 1
    pi = sum(in_circle)*4/n
    return pi
#     print(f"Estimated PI {pi} with n:{n}")
s = [estimate_pi(10000) for trials in range(1000)]
Copy to clipboard
sample = np.array(s)
plt.hist(sample, bins=40)
plt.show()
Copy to clipboard
../_images/Sampling_66_0.png
sample.mean(), sample.std(), np.percentile(sample, [5,95])
Copy to clipboard
(3.1407156, 0.017060112445115943, array([3.1116 , 3.16882]))
Copy to clipboard
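
The spread of these estimates agrees with a simple binomial model: each point lands inside the circle with probability p = π/4, and the estimate is 4 times the observed fraction, so its standard error is 4·sqrt(p(1-p)/n). Using the true π only to predict that spread:

p = np.pi / 4                          # probability a point falls inside the circle
se_pred = 4 * np.sqrt(p * (1 - p) / 10000)
se_pred                                # about 0.0164, close to the observed sample.std() above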

Part 2

What if we can’t assume the underlying distribution?

  • Take the Sample

  • Generate Model for Population from Sample

  • Calculate Sampling Statistics by running experiments on this model

sample = [1, 2, 3, 4, 5, 2, 1, 4]
Copy to clipboard
n = len(sample)
Copy to clipboard
np.random.choice(sample,
                 n,
                 replace=True)  # new sample drawn from the original; run the experiment on it and compute the sampling statistic
Copy to clipboard
array([4, 1, 1, 3, 2, 5, 5, 5])
Copy to clipboard
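
Before wrapping this in a class, here is a minimal sketch of the full bootstrap loop using the toy sample above: resample with replacement, compute the statistic on each resample, and summarize the resulting distribution.

boot_means = np.array([np.random.choice(sample, n, replace=True).mean()
                       for _ in range(1000)])
boot_means.mean(), np.percentile(boot_means, [2.5, 97.5])   # point estimate and 95% CI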

Resampling

class Resampler(object):
    def __init__(self, sample, xlim=None, iters=1000):
        self.sample = sample
        self.n = len(sample)
        self.iters = iters 
    
    def resample(self):
        new_sample = np.random.choice(self.sample, self.n, replace=True)
        return new_sample
    
    
    def sample_stat(self, sample):
        return sample.mean()
    
    def computing_sampling_distribution(self):
        stats = [self.sample_stat(self.resample()) for i in range(self.iters)]
        return np.array(stats)
    
    def plot_summary_statistics_distribution(self):
        fig, ax = plt.subplots(1)
        stats = self.computing_sampling_distribution()
        
        mean = stats.mean()
        SE = stats.std()
        
        conf_int = np.percentile(stats, [2.5, 97.5]) # 95% confidence interval
        plt.axvline(mean, color=COLOR1)
        plt.axvline(conf_int[0], color=COLOR1)
        plt.axvline(conf_int[1], color=COLOR1)
#         plt.xlim(mean-4std, )
        plt.hist(stats, color=COLOR4)
        plt.show()
Copy to clipboard
# s = np.random.random([1, 100]).flatten()
s = weight.rvs(10)
rsample = Resampler(s, iters=1000)
rsample.sample.shape
Copy to clipboard
(10,)
Copy to clipboard
# stat = rsample.computing_sampling_distribution()
# stat.shape
Copy to clipboard
rsample.plot_summary_statistics_distribution()
Copy to clipboard
../_images/Sampling_78_0.png
 np.random.choice([1,2,3],3, replace=True)
Copy to clipboard
array([3, 1, 3])
Copy to clipboard
plt.plot(s)
Copy to clipboard
[<matplotlib.lines.Line2D at 0x7feaabb10430>]
Copy to clipboard
../_images/Sampling_80_1.png
xs = np.linspace(20, 160, 100)
plt.plot(weight.pdf(xs))
Copy to clipboard
[<matplotlib.lines.Line2D at 0x7feaabc22d60>]
Copy to clipboard
../_images/Sampling_82_1.png
type(weight)
Copy to clipboard
scipy.stats._distn_infrastructure.rv_frozen
Copy to clipboard
type(sample)
Copy to clipboard
list
Copy to clipboard
class StdResampler(Resampler):
    def sample_stat(self, sample):
        return sample.std()
Copy to clipboard
s = weight.rvs(1000)
rsample = StdResampler(s, iters=1000)
rsample.sample.shape
Copy to clipboard
(1000,)
Copy to clipboard
rsample.plot_summary_statistics_distribution()
Copy to clipboard
../_images/Sampling_87_0.png

Part 3

In this section we apply the approach above to problems with multiple groups (here, two).

mu1, sig1 = 178, 7.7
male_height = stats.norm(mu1, sig1); male_height
Copy to clipboard
<scipy.stats._distn_infrastructure.rv_frozen at 0x7feaab9c58e0>
Copy to clipboard
mu2, sig2 = 163, 7.3
female_height = stats.norm(mu2, sig2); female_height
Copy to clipboard
<scipy.stats._distn_infrastructure.rv_frozen at 0x7feaab9ec4c0>
Copy to clipboard
n = 1000
male_sample = male_height.rvs(1000)
female_sample = female_height.rvs(1000)
male_mean = male_height.mean()
female_mean = female_height.mean()
male_std = male_height.std()
female_std = female_height.std()
Copy to clipboard
difference = (mu1 - mu2)

relative_difference_by_male = difference/mu1*100
relative_difference_by_female = difference/mu2*100
relative_difference_by_male, relative_difference_by_female
Copy to clipboard
(8.426966292134832, 9.202453987730062)
Copy to clipboard
# threshold that is the same number of standard deviations from each mean:
# (male_mean - thresh)/male_std == (thresh - female_mean)/female_std
thresh = (male_mean*female_std+female_mean*male_std)/(male_std+female_std); thresh
Copy to clipboard
170.3
Copy to clipboard
male_overlap = sum((male_sample < thresh))/ len(male_sample)
male_overlap 
Copy to clipboard
0.163
Copy to clipboard
female_overlap = sum((female_sample > thresh))/ len(female_sample);
female_overlap
Copy to clipboard
0.139
Copy to clipboard
overlap = (male_overlap + female_overlap)/2; overlap
Copy to clipboard
0.15100000000000002
Copy to clipboard
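
Because the threshold 170.3 sits exactly one standard deviation below the male mean (178 - 7.7) and one standard deviation above the female mean (163 + 7.3), the analytic misclassification rate is Φ(-1) ≈ 0.159 for each group, consistent with the sampled values above:

analytic_overlap = (male_height.cdf(thresh) + female_height.sf(thresh)) / 2
analytic_overlap   # about 0.159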
# note: dividing by the combined length here is wrong (it roughly halves the value);
# the next cell computes it correctly with .mean()
prob_superiority = sum(male_sample > female_sample)/(len(male_sample)+len(female_sample))
Copy to clipboard
prob_superiority = (male_sample > female_sample).mean()
prob_superiority
Copy to clipboard
0.929
Copy to clipboard
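
For two independent normal groups this probability can also be computed directly: M - F is normal with mean μ1 - μ2 and standard deviation sqrt(σ1² + σ2²), so P(M > F) = Φ((μ1 - μ2) / sqrt(σ1² + σ2²)):

analytic_superiority = stats.norm.cdf((mu1 - mu2) / np.sqrt(sig1**2 + sig2**2))
analytic_superiority   # about 0.92, close to the sampled 0.929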
def overlap(grp1_sample, grp2_sample):
    """
        grp1: Control
        grp2: Treatment
    """
    
#     grp1_sample = grp1.rvs(1000)
#     grp2_sample = grp2.rvs(1000)
    m1 = grp1_sample.mean()
    m2 = grp2_sample.mean()
    s1 = grp1_sample.std()
    s2 = grp2_sample.std()
    # threshold equidistant (in standard deviations) from both group means
    thresh = (m1*s2 + m2*s1)/(s1 + s2)
    
    grp1_overlap = sum(grp1_sample<thresh)/ len(grp1_sample)
    grp2_overlap = sum(grp2_sample>thresh)/ len(grp2_sample)
    misclassification_rate = (grp1_overlap+grp2_overlap)/2
    return misclassification_rate
    
Copy to clipboard
grp1_sample = male_height.rvs(1000)
grp2_sample = female_height.rvs(1000)

overlap(grp1_sample, grp2_sample)
Copy to clipboard
0.166
Copy to clipboard
def prob_superior(grp1_sample, grp2_sample):
    # assumes both samples have the same length
    assert len(grp1_sample) == len(grp2_sample)
    return (grp1_sample > grp2_sample).mean()
Copy to clipboard
prob_superior(grp1_sample, grp2_sample)
Copy to clipboard
0.932
Copy to clipboard
grp1_sample.var()
Copy to clipboard
58.99469052969996
Copy to clipboard
def cohen_effect(grp1_sample, grp2_sample):
    diff = grp1_sample.mean() - grp2_sample.mean()
    var1, var2 = grp1_sample.var(), grp2_sample.var()
    n1, n2 = len(grp1_sample), len(grp2_sample)
    pooled_var = (n1*var1+n2*var2)/(n1+n2)
    d = diff/np.sqrt(pooled_var)
    return d
Copy to clipboard
cohen_effect(grp1_sample, grp2_sample)
Copy to clipboard
1.9942476570475194
Copy to clipboard
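
As a sanity check, Cohen's d computed from the population parameters (with equal group sizes the pooled variance is just the average of the two variances) is very close to the value estimated from the samples:

pooled_sd = np.sqrt((sig1**2 + sig2**2) / 2)
d_population = (mu1 - mu2) / pooled_sd
d_population   # about 2.0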
def color_data(inp, CI):
    """Return 1 for values outside the confidence interval, 0 for values inside."""
    a, b = CI
    t1 = inp < a
    t2 = inp > b
    color_data = (t1 + t2).astype(int)  # np.int is deprecated; plain int works
    return color_data

class MultiGrpResampler(object):
    def __init__(self, sample1, sample2, xlim=None, iters=1000, summary_stat='cohen'):
        self.sample1 = sample1
        self.sample2 = sample2
        self.xlim = xlim
        self.iters = iters
        self.summary_stat = summary_stat
    
    def resample(self):
        new_sample1 = np.random.choice(self.sample1, len(self.sample1), replace=True)
        new_sample2 = np.random.choice(self.sample2, len(self.sample2), replace=True)
        return new_sample1, new_sample2
    
    def calc_summary_stat(self, sample1, sample2):
        if self.summary_stat == 'cohen':
            return cohen_effect(sample1, sample2)
        if self.summary_stat == 'overlap':
            return overlap(sample1, sample2)
        if self.summary_stat == 'prob_superiority':
            return prob_superior(sample1, sample2)
        
        
    def compute_sampling_distribution(self):
        summary_stats = [self.calc_summary_stat(*self.resample()) for i in range(self.iters)]
        return np.array(summary_stats)
    
    
    def sampling_distribution(self):
        summary_stats = self.compute_sampling_distribution()
        
        mean = summary_stats.mean()
        std = summary_stats.std()
        CI = np.percentile(summary_stats, [5, 95]) # 90% confidence interval
        
        return  summary_stats, mean, std, CI
    
    
    def plot_sampling_distribution(self):
        summary_stats, mean, std, CI = self.sampling_distribution()
        bins = 30
#         hist_data = np.histogram(summary_stats, bins)[1]
        x_sc = LinearScale()
        y_sc = LinearScale()
        col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
        y,edges = np.histogram(summary_stats, bins)
        centers = 0.5*(edges[1:]+ edges[:-1])
        cdata = np.array(col_sc.colors)[color_data(centers, CI)].tolist()
        ax_x = Axis(scale=x_sc, tick_format='0.3f')
        ax_y = Axis(scale=y_sc, orientation='vertical')
        vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
        vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
        vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc, 'color': col_sc})
        hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=bins, colors=cdata)
        fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, 
                     title=f'Sampling Distribution {self.summary_stat}' )
#         print(fig.marks[0].colors)
        
        return fig
        
        
        
    
Copy to clipboard
sample1 = male_height.rvs(1000)
sample2 = female_height.rvs(1000)

rsampl = MultiGrpResampler(sample1, sample2,)
summary_stats, mean, std, CI = rsampl.sampling_distribution()
mean, std, CI
Copy to clipboard
(1.988103230980105, 0.05500478400582647, array([1.89740944, 2.07562714]))
Copy to clipboard
def color_data(inp, CI):
    a, b = CI
    t1 = inp < a
    t2 = inp > b
    color_data = (t1 + t2).astype(int)
    return color_data
Copy to clipboard
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
Copy to clipboard
# hist = Hist(data=summary_stats, scales={'summary_statistics': x_sc, 'count':y_sc})

hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc})
Copy to clipboard
len(hist.midpoints)
Copy to clipboard
0
Copy to clipboard
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
ax_x
Copy to clipboard
Axis(scale=LinearScale(), tick_format='0.2f')
Copy to clipboard
color_data(hist.midpoints, CI)
Copy to clipboard
array([], dtype=int64)
Copy to clipboard
np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
Copy to clipboard
[]
Copy to clipboard
hist.bins = 30
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
hist.colors
Copy to clipboard
[]
Copy to clipboard
# Figure(marks=[hist], axes=[ax_x, ax_y], padding=0, title='Sampling Distribution' )
Copy to clipboard
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_mean, vline_a, vline_b
Copy to clipboard
(Lines(colors=['orangered'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.12200944, 2.12200944]), y=array([0, 1])),
 Lines(colors=['steelblue'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.02652263, 2.02652263]), y=array([0, 1])),
 Lines(colors=['steelblue'], interactions={'hover': 'tooltip'}, preserve_domain={'x': False, 'y': True}, scales={'y': LinearScale(allow_padding=False, max=1.0, min=0.0), 'x': LinearScale()}, scales_metadata={'x': {'orientation': 'horizontal', 'dimension': 'x'}, 'y': {'orientation': 'vertical', 'dimension': 'y'}, 'color': {'dimension': 'color'}}, tooltip_style={'opacity': 0.9}, x=array([2.21439821, 2.21439821]), y=array([0, 1])))
Copy to clipboard
Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, title='Sampling Distribution 2' )
Copy to clipboard
[bqplot figure: Sampling Distribution 2]

Function Call (step-by-step figure construction)

bins=30
y,edges = np.histogram(summary_stats, bins)
centers = 0.5*(edges[1:]+ edges[:-1])
cdata = np.array(col_sc.colors)[color_data(centers, CI)].tolist()
cdata
Copy to clipboard
['Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'MediumSeaGreen',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red',
 'Red']
Copy to clipboard
n = 100000
sample1 = male_height.rvs(n)
sample2 = female_height.rvs(1000)

rsampl = MultiGrpResampler(sample1, sample2,summary_stat='cohen')
fig = rsampl.plot_sampling_distribution()
# fig.marks[0].colors = cdata
fig
Copy to clipboard
[bqplot figure: Sampling Distribution cohen]
fig.marks[0]
Copy to clipboard
Hist(bins=30, colors=['Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'MediumSeaGreen', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red'], interactions={'hover': 'tooltip'}, scales={'sample': LinearScale(), 'count': LinearScale(), 'color': ColorScale(colors=['MediumSeaGreen', 'Red'])}, scales_metadata={'sample': {'orientation': 'horizontal', 'dimension': 'x'}, 'count': {'orientation': 'vertical', 'dimension': 'y'}}, tooltip_style={'opacity': 0.9})
Copy to clipboard
fig.marks[0].bins
Copy to clipboard
30
Copy to clipboard
fig
Copy to clipboard
[bqplot figure: Sampling Distribution cohen]
summary_stat = 'cohen'
summary_stats, mean, std, CI = rsampl.sampling_distribution()
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=30)
Copy to clipboard
hist.colors
Copy to clipboard
['steelblue']
Copy to clipboard
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, 
             title=f'Sampling Distribution {summary_stat}' )
Copy to clipboard
fig
Copy to clipboard
[bqplot figure: Sampling Distribution cohen]
hist.bins = 30
hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
hist.colors
Copy to clipboard
[]
Copy to clipboard
fig
Copy to clipboard
[bqplot figure: Sampling Distribution cohen]
summary_stat = 'cohen'
summary_stats, mean, std, CI = rsampl.sampling_distribution()
x_sc = LinearScale()
y_sc = LinearScale()
col_sc = ColorScale(colors=['MediumSeaGreen', 'Red'])
ax_x = Axis(scale=x_sc, tick_format='0.2f')
ax_y = Axis(scale=y_sc, orientation='vertical')
hist = Hist(sample=summary_stats, scales={'sample': x_sc, 'count': y_sc, 'color': col_sc}, bins=30)
# hist.bins = 30
with hist.hold_sync():
    hist.colors = np.array(col_sc.colors)[color_data(hist.midpoints, CI)].tolist()
vline_mean = pltbq.vline(mean, stroke_width=2, colors=['orangered'], scales={'y': y_sc, 'x': x_sc})
vline_a = pltbq.vline(CI[0], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
vline_b = pltbq.vline(CI[1], stroke_width=2, colors=['steelblue'], scales={'y': y_sc, 'x': x_sc})
fig = Figure(marks=[hist, vline_mean, vline_a, vline_b], axes=[ax_x, ax_y], padding=0, 
             title=f'Sampling Distribution {summary_stat}' )
fig
Copy to clipboard
[bqplot figure: Sampling Distribution cohen]
np.histogram([12,3,4,5, 19], bins=3)
Copy to clipboard
(array([3, 1, 1]), array([ 3.        ,  8.33333333, 13.66666667, 19.        ]))
Copy to clipboard
np.histogram?
Copy to clipboard