Estimating the Difference Between the Parameters of Two Binomial Distributions (PyMC3, PyStan)

 

Suppose we have two datasets, each assumed to be generated from a binomial distribution. We want to estimate the difference between them, i.e. the difference between the parameters of the two binomial distributions.
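As a concrete picture of the assumed generative process, here is a minimal numpy sketch (the success probabilities and trial count are illustrative placeholders, not the data used below):

import numpy as np

rng = np.random.RandomState(0)

# Each group's data is assumed to come from its own binomial distribution
theta1_true, theta2_true = 0.5, 0.7  # unknown success probabilities (placeholders)
n = 10                               # number of trials per group

k1_obs = rng.binomial(n, theta1_true)  # observed successes in group 1
k2_obs = rng.binomial(n, theta2_true)  # observed successes in group 2
print(k1_obs, k2_obs)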

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from matplotlib import pyplot as plt
%matplotlib inline
 

PyMC3

In [2]:
import pymc3 as pm
from pymc3 import Beta, Binomial, Model, Deterministic
from pymc3 import traceplot, sample, summary
import theano
theano.config.warn.round=False
 

Define the model in PyMC3. Here both groups have 10 trials, with 5 and 7 successes observed respectively. To express the difference between the parameters, use Deterministic.

In [3]:
model = Model()
with model:
    # Prior
    theta1 = Beta('theta1', alpha=1, beta=1)
    theta2 = Beta('theta2', alpha=1, beta=1)

    # Difference
    delta = Deterministic('delta', theta1 - theta2)

    # Observed
    k1 = Binomial("k1", n=10, p=theta1, observed=5)
    k2 = Binomial("k2", n=10, p=theta2, observed=7)
 

Run MCMC.

In [4]:
with model:
    trace = sample(5000)
 
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -4.8165: 100%|██████████| 200000/200000 [00:31<00:00, 6403.18it/s]
Finished [100%]: Average ELBO = -4.8215
100%|██████████| 5000/5000 [00:07<00:00, 706.35it/s] 
 

Display the results.

In [5]:
# Discard the first 1000 draws as burn-in
summary(trace[1000:], varnames=['delta'])
traceplot(trace[1000:], varnames=['delta'])
plt.show()
 
delta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  -0.168           0.179            0.003            [-0.497, 0.193]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.496         -0.293         -0.175         -0.048         0.194
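Most of delta's posterior mass lies below zero, but the 95% interval still includes it. To quantify the overlap, the draws can be read from the trace by name (a minimal sketch using PyMC3's trace indexing):

# Posterior probability that theta1 < theta2, from the same trace
delta_samples = trace[1000:]['delta']
print((delta_samples < 0).mean())  # roughly 0.8 for this data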

 
 

Try the same thing with different data: this time 100 trials each (10 times as many), with 80 and 70 successes observed.

In [6]:
model = Model()
with model:
    # Prior
    theta1 = Beta('theta1', alpha=1, beta=1)
    theta2 = Beta('theta2', alpha=1, beta=1)

    # Difference
    delta = Deterministic('delta', theta1 - theta2)

    # Observed
    k1 = Binomial("k1", n=100, p=theta1, observed=80)
    k2 = Binomial("k2", n=100, p=theta2, observed=70)

    trace = sample(5000)
    
summary(trace[1000:], varnames=['delta'])
traceplot(trace[1000:], varnames=['delta'])
plt.show()
 
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -9.2531: 100%|██████████| 200000/200000 [00:30<00:00, 6489.16it/s]
Finished [100%]: Average ELBO = -9.2481
100%|██████████| 5000/5000 [00:07<00:00, 692.82it/s]
 
delta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.097            0.059            0.001            [-0.019, 0.215]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.022         0.057          0.098          0.137          0.212

 
 

You can see that the posterior distribution has become much narrower.
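This matches the conjugate analysis: under a Beta(1, 1) prior the exact posterior of each theta is Beta(1 + k, 1 + n - k), whose standard deviation shrinks roughly like 1/sqrt(n). As a minimal sketch, the MCMC summary for delta can be checked by drawing directly from the exact posteriors of the 100-trial example:

import numpy as np

rng = np.random.RandomState(0)

# Exact posteriors under a Beta(1, 1) prior: Beta(1 + k, 1 + n - k)
theta1_post = rng.beta(1 + 80, 1 + 20, size=100000)  # k1=80, n1=100
theta2_post = rng.beta(1 + 70, 1 + 30, size=100000)  # k2=70, n2=100

delta_post = theta1_post - theta2_post
print(delta_post.mean(), delta_post.std())  # roughly 0.098 and 0.060, close to the MCMC summary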

Next, the case where the numbers of trials differ and zero successes are observed in both.

In [7]:
model = Model()
with model:
    # Prior
    theta1 = Beta('theta1', alpha=1, beta=1)
    theta2 = Beta('theta2', alpha=1, beta=1)

    # Difference
    delta = Deterministic('delta', theta1 - theta2)

    # Observed
    k1 = Binomial("k1", n=1, p=theta1, observed=0)
    k2 = Binomial("k2", n=5, p=theta2, observed=0)

    trace = sample(5000)
    
summary(trace[1000:], varnames=['delta'])
traceplot(trace[1000:], varnames=['delta'])
plt.show()
 
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -2.5657: 100%|██████████| 200000/200000 [00:30<00:00, 6568.58it/s]
Finished [100%]: Average ELBO = -2.5605
100%|██████████| 5000/5000 [00:08<00:00, 623.50it/s]
 
delta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.185            0.261            0.006            [-0.285, 0.713]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.284         0.000          0.153          0.368          0.717

 
 

Since both observed counts are zero, a difference of zero is the most plausible value; but k1 comes from only a single trial, so theta1 is poorly constrained and the right tail of delta is long.
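This asymmetry can be reproduced from the exact conjugate posteriors: with zero successes, theta1 follows Beta(1, 2) after 1 trial and theta2 follows Beta(1, 6) after 5 trials, so theta1 can still plausibly be large while theta2 is pushed toward zero. A minimal sketch:

import numpy as np

rng = np.random.RandomState(0)

# Exact posteriors under a Beta(1, 1) prior with zero observed successes
theta1_post = rng.beta(1, 1 + 1, size=100000)  # k1=0, n1=1 -> Beta(1, 2)
theta2_post = rng.beta(1, 1 + 5, size=100000)  # k2=0, n2=5 -> Beta(1, 6)

delta_post = theta1_post - theta2_post
# Right-skewed; should roughly match the MCMC quantiles above
print(np.percentile(delta_post, [2.5, 50, 97.5]))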

 

PyStan

Do the same thing with PyStan.

In [8]:
import pystan
In [9]:
code = """
data {
  int<lower=0> k1;  // observed successes in group 1
  int<lower=1> n1;  // number of trials in group 1
  int<lower=0> k2;  // observed successes in group 2
  int<lower=1> n2;  // number of trials in group 2
}
parameters {
  real<lower=0,upper=1> theta1;
  real<lower=0,upper=1> theta2;
}
transformed parameters {
  // Difference between the two parameters
  real<lower=-1,upper=1> delta;
  delta = theta1 - theta2;
}
model {
  // Priors
  theta1 ~ beta(1, 1);
  theta2 ~ beta(1, 1);
  // Observed
  k1 ~ binomial(n1, theta1);
  k2 ~ binomial(n2, theta2);
}
"""

data = {'k1': 5, 'n1': 10, 'k2': 7, 'n2': 10}
In [10]:
fit = pystan.stan(model_code=code, data=data)
 
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_0f3e4edfa3cbb4c0cf311cacc8bc0578 NOW.
In [11]:
print(fit)
fit.plot()
 
Inference for Stan model: anon_model_0f3e4edfa3cbb4c0cf311cacc8bc0578.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
theta1    0.5  2.4e-3   0.14   0.22    0.4    0.5    0.6   0.77   3408    1.0
theta2   0.67  2.2e-3   0.13   0.39   0.59   0.67   0.76   0.89   3437    1.0
delta   -0.17  3.2e-3   0.19  -0.53   -0.3  -0.17  -0.04   0.22   3667    1.0
lp__   -17.01    0.03   1.07 -19.97 -17.41 -16.68 -16.25 -15.98   1819    1.0

Samples were drawn using NUTS at Sat Jun  3 13:40:12 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
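The posterior draws can also be pulled out of the fit object with extract() and used directly, mirroring the computation done on the PyMC3 trace earlier (a minimal sketch):

samples = fit.extract()        # dict of post-warmup draws, permuted across chains
delta_samples = samples['delta']

# Posterior probability that theta1 < theta2
print((delta_samples < 0).mean())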