ポワソン分布は単位時間あたりλ回起こる出来事の確率分布と説明されることが多い。

これをPythonのシミュレーションで確かめてみる。

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame, Series
from scipy.stats import poisson

%matplotlib inline

シミュレーションを行うには単位時間あたりの長さのほかにシミュレートする時間の長さが必要だ。

単位時間あたり4回起こる出来事を長さが2の時間の中にランダムに配置するとすると、8個のイベントを配置するということになる。

 [random.random() * 2 for _ in range(0, 8)]

[1.7265978877690655,
 0.05240790096914916,
 1.0092316367962268,
 1.7811176759439296,
 1.5008611575306552,
 1.4178547662306291,
 0.3848485501520824,
 1.9601703946981495]

単位時間1の中にいくつ入ったのかを数える。

[random.random() * 2 < 1 for _ in range(0, 8)].count(True)

3

これを10000回繰り返してヒストグラムを描く。

simulations = [
    [random.random() * 2 < 1 for _ in range(0, 8)].count(True)
    for _ in range (0, 10000)]
plt.hist(simulations, bins=range(0, 8+2), normed=True, color='gray')
plt.show()

これは実質的には試行回数が8で確率が1/2の二項分布のシミュレーションを行っていることになる。 8個のイベントしか配置していないから8回よりも多くなることはない。

一方ポワソン分布のパラメタは単位時間あたりの期待値だけで、8のような上限はない。そこで期待値が4に固定されるようにして、長さとイベント数を増やしてみる。

さらにシミュレーション結果の隣にポワソン分布の理論値のヒストグラムを並べるようにする。

lam = 4

lengths = [2, 4, 8, 16]

fig_columns = 2
fig_rows = len(lengths) / fig_columns + (1 if len(lengths) % fig_columns > 0 else 0)
fig, axes = plt.subplots(fig_rows, fig_columns)

for i, length in enumerate(lengths):
    num_events = int(length * lam) # 単位長さ当たりのイベント数num_events/length=lamになるようにする

    simulations = [
        # 長さlengthの中にnum_events個のイベントを配置し、1未満に入ったイベントの数を数える
        [random.random() * length < 1 for _ in range(0, num_events)].count(True)
        for _ in range (0, 10000)] # 10000回シミュレートする

    ax = axes[i / fig_columns, i % fig_columns]
 
    x_range = Series(range(0, int(poisson.ppf(0.99999, lam))))

    (_, _, bar_sim) = ax.hist(simulations, bins=x_range, normed=True, color='gray', width=0.5)
    bar_poi = ax.bar(x_range + 0.5, poisson.pmf(x_range, lam), color='white', width=0.5)

    ax.set_xlabel('count')
    ax.set_ylabel('probability')
    ax.set_title('length=%d, # of events=%d' % (length, num_events))

fig.legend((bar_sim[0], bar_poi[0]), ('simulation', 'poisson'), 'upper right', bbox_to_anchor=(1.2, 1))
fig.subplots_adjust(wspace=0.5, hspace=1)

plt.show()

長さとイベント数が多くなるにつれてシミュレーションの分布がポワソン分布に近づいていることがわかる。

参考にした文献：「ポワソン分布・ポワソン回帰・ポワソン過程」島谷健一郎（近代科学社）

今度は二項分布のパラメタが1つだけあって、同じ二項分布から複数のセッション（n回中k回）がデータとして取られる場合に、その1つだけのパラメタを推定するということを考える。

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from matplotlib import pyplot as plt
%matplotlib inline

PyMC3¶

import pymc3 as pm
from pymc3 import Beta, Binomial, Bernoulli, Model, Deterministic
from pymc3 import traceplot, sample, summary
import theano
theano.config.warn.round=False

PyMC3でモデルを定義する。観察されるデータはk1をk2の2つあり、同じパラメタthetaの二項分布から生み出されているとする。2つの観察データは10回中5回と10回中7回であった。

model = Model()
with model:
    # Prior
    theta = Beta('theta', alpha=1, beta=1)

    # Observed
    k1 = Binomial("k1", n=10, p=theta, observed=5)
    k2 = Binomial("k2", n=10, p=theta, observed=7)

    trace = sample(5000)

summary(trace[1000:])
traceplot(trace[1000:])
plt.show()

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -4.4786: 100%|██████████| 200000/200000 [00:23<00:00, 8437.65it/s]
Finished [100%]: Average ELBO = -4.4647
100%|██████████| 5000/5000 [00:03<00:00, 1276.28it/s]

theta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.591            0.099            0.003            [0.386, 0.778]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.381          0.529          0.596          0.658          0.776

上記ではk1とk2という2つのBinomialを記述したが、下記のように1つのBinomialでobservedを配列にする書き方もできるようだ。

model = Model()
with model:
    # Prior
    theta = Beta('theta', alpha=1, beta=1)

    # Observed
    k = Binomial("k", n=10, p=theta, observed=[5, 7])
    trace = sample(5000)
    
summary(trace[1000:])
traceplot(trace[1000:])
plt.show()

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -4.4718: 100%|██████████| 200000/200000 [00:26<00:00, 7629.06it/s]
Finished [100%]: Average ELBO = -4.4784
100%|██████████| 5000/5000 [00:04<00:00, 1172.66it/s]

theta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.590            0.101            0.002            [0.394, 0.788]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.380          0.525          0.594          0.660          0.780

ところで10回中5回、7回の2つの二項分布のセッションがあるというのは、20個のベルヌーイ分布で12回が1であるという見方をしてもいいのだろうか。

model = Model()
with model:
    # Prior
    theta = Beta('theta', alpha=1, beta=1)

    # Observed
    k = Bernoulli("k", p=theta, observed=[1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0])
    trace = sample(5000)
    
summary(trace[1000:])
traceplot(trace[1000:])
plt.show()

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -14.793: 100%|██████████| 200000/200000 [00:24<00:00, 8070.62it/s]
Finished [100%]: Average ELBO = -14.791
100%|██████████| 5000/5000 [00:04<00:00, 1211.14it/s]

theta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.595            0.100            0.002            [0.396, 0.778]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.393          0.528          0.597          0.666          0.777

この場合でもパラメタの推定値が同様であることがわかる。

PyStan¶

同じことをPyStanでやってみる。まずはk1とk2を別々に定義する書き方から。

import pystan

code = """
data { 
  int<lower=0> k1;
  int<lower=1> n1;
  int<lower=0> k2;
  int<lower=1> n2;
} 
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1, 1);
  k1 ~ binomial(n1, theta);
  k2 ~ binomial(n2, theta);
}
"""

data = {'k1': 5, 'n1': 10, 'k2': 7, 'n2': 10}

fit = pystan.stan(model_code=code, data=data)
print(fit)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_8bbd120c2f3e5793a7ec522da7d4265d NOW.

Inference for Stan model: anon_model_8bbd120c2f3e5793a7ec522da7d4265d.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
theta   0.59  1.9e-3    0.1   0.38   0.53    0.6   0.66   0.78   2781    1.0
lp__  -15.39    0.02   0.71 -17.42 -15.55 -15.11 -14.93 -14.88   2032    1.0

Samples were drawn using NUTS at Sun Jun  4 01:18:06 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Stanでも配列を使うことができる。forを使ってインデクシングするようだ。

code = """
data {
  int N;
  int<lower=0> k[N];
  int<lower=1> n[N];
} 
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1, 1);
  for (i in 1:N)
    k[i] ~ binomial(n[i], theta);
}
"""

data = {'N': 2, 'k': [5, 7], 'n': [10, 10]}

fit = pystan.stan(model_code=code, data=data)
print(fit)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_8ee3da0050cf86e0c9c1a53e4765a6c7 NOW.

Inference for Stan model: anon_model_8ee3da0050cf86e0c9c1a53e4765a6c7.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
theta   0.59  2.1e-3    0.1   0.39   0.52   0.59   0.66   0.77   2295    1.0
lp__  -15.38    0.02   0.72 -17.34 -15.52  -15.1 -14.93 -14.88   1978    1.0

Samples were drawn using NUTS at Sun Jun  4 01:19:16 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

二項分布から生成されると仮定される2つのデータがあったときにその差（2つの二項分布のパラメタの差）を推定することを考える。

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from matplotlib import pyplot as plt
%matplotlib inline

PyMC3¶

import pymc3 as pm
from pymc3 import Beta, Binomial, Model, Deterministic
from pymc3 import traceplot, sample, summary
import theano
theano.config.warn.round=False

PyMC3でモデルを定義する。ここでは両方とも10回の施行で、それぞれ5回と7回が観察されたものとする。パラメタの差を表現するにはDeterministicを使う。

model = Model()
with model:
    # Prior
    theta1 = Beta('theta1', alpha=1, beta=1)
    theta2 = Beta('theta2', alpha=1, beta=1)

    # Difference
    delta = Deterministic('delta', theta1 - theta2)

    # Observed
    k1 = Binomial("k1", n=10, p=theta1, observed=5)
    k2 = Binomial("k2", n=10, p=theta2, observed=7)

MCMCする。

with model:
    trace = sample(5000)

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -4.8165: 100%|██████████| 200000/200000 [00:31<00:00, 6403.18it/s]
Finished [100%]: Average ELBO = -4.8215
100%|██████████| 5000/5000 [00:07<00:00, 706.35it/s]

結果を表示する。

summary(trace[1000:], varnames=['delta'])
traceplot(trace[1000:], varnames=['delta'])
plt.show()

delta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  -0.168           0.179            0.003            [-0.497, 0.193]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.496         -0.293         -0.175         -0.048         0.194

同じことを違うデータでやってみる。試行回数と観察回数をそれぞれ10倍する。

model = Model()
with model:
    # Prior
    theta1 = Beta('theta1', alpha=1, beta=1)
    theta2 = Beta('theta2', alpha=1, beta=1)

    # Difference
    delta = Deterministic('delta', theta1 - theta2)

    # Observed
    k1 = Binomial("k1", n=100, p=theta1, observed=80)
    k2 = Binomial("k2", n=100, p=theta2, observed=70)

    trace = sample(5000)
    
summary(trace[1000:], varnames=['delta'])
traceplot(trace[1000:], varnames=['delta'])
plt.show()

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -9.2531: 100%|██████████| 200000/200000 [00:30<00:00, 6489.16it/s]
Finished [100%]: Average ELBO = -9.2481
100%|██████████| 5000/5000 [00:07<00:00, 692.82it/s]

delta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.097            0.059            0.001            [-0.019, 0.215]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.022         0.057          0.098          0.137          0.212

事後分布の左右の幅が狭くなったのがわかる。

次に試行回数が異なってそれぞれゼロ回の観察だった場合。

model = Model()
with model:
    # Prior
    theta1 = Beta('theta1', alpha=1, beta=1)
    theta2 = Beta('theta2', alpha=1, beta=1)

    # Difference
    delta = Deterministic('delta', theta1 - theta2)

    # Observed
    k1 = Binomial("k1", n=1, p=theta1, observed=0)
    k2 = Binomial("k2", n=5, p=theta2, observed=0)

    trace = sample(5000)
    
summary(trace[1000:], varnames=['delta'])
traceplot(trace[1000:], varnames=['delta'])
plt.show()

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -2.5657: 100%|██████████| 200000/200000 [00:30<00:00, 6568.58it/s]
Finished [100%]: Average ELBO = -2.5605
100%|██████████| 5000/5000 [00:08<00:00, 623.50it/s]

delta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.185            0.261            0.006            [-0.285, 0.713]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.284         0.000          0.153          0.368          0.717

両方とも0なので、差が0のところが最もありそうなのだが、k1の試行回数が少ない分確信がないので右の裾が長い。

PyStan¶

同じことをPyStanでやってみる。

import pystan

code = """
data { 
  int<lower=0> k1;
  int<lower=1> n1;
  int<lower=0> k2;
  int<lower=1> n2;
} 
parameters {
  real<lower=0,upper=1> theta1;
  real<lower=0,upper=1> theta2;
}
transformed parameters {
  real<lower=-1,upper=1> delta;
  delta <- theta1 - theta2;
}
model {
  theta1 ~ beta(1, 1);
  theta2 ~ beta(1, 1);
  k1 ~ binomial(n1, theta1);
  k2 ~ binomial(n2, theta2);
}
"""

data = {'k1': 5, 'n1': 10, 'k2': 7, 'n2': 10}

fit = pystan.stan(model_code=code, data=data)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_0f3e4edfa3cbb4c0cf311cacc8bc0578 NOW.

print(fit)
fit.plot()

Inference for Stan model: anon_model_0f3e4edfa3cbb4c0cf311cacc8bc0578.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
theta1    0.5  2.4e-3   0.14   0.22    0.4    0.5    0.6   0.77   3408    1.0
theta2   0.67  2.2e-3   0.13   0.39   0.59   0.67   0.76   0.89   3437    1.0
delta   -0.17  3.2e-3   0.19  -0.53   -0.3  -0.17  -0.04   0.22   3667    1.0
lp__   -17.01    0.03   1.07 -19.97 -17.41 -16.68 -16.25 -15.98   1819    1.0

Samples were drawn using NUTS at Sat Jun  3 13:40:12 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

前回の記事で二項分布のパラメタ推定をPyMC2で行った。しかしPyMC2をいまから使っていくのも微妙な気がしてきたので新しいPyMC3で書き直す。ついでにPyStanでも同じことをして比べてみる。

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from matplotlib import pyplot as plt
%matplotlib inline

PyMC3¶

まずはPyMC3のモジュールをインポートする。 theano.config.warn.round=Falseは警告を抑止するための設定。

import pymc3 as pm
from pymc3 import Beta, Binomial, Model
from pymc3 import traceplot, sample, summary
import theano
theano.config.warn.round=False

次にモデルを定義する。

model = Model()
with model:
    theta = Beta('theta', alpha=1, beta=1)
    k = Binomial("k", n=10, p=theta, observed=5)

PyMC3ではこのようにwith構文を使ってモデルを定義する。何をしているのかわからなくて面食らうが、見た目上モデル定義がまとまりをなすようにするための工夫だろうか。多分ダイナミックな文脈を作ってBetaやBinomialなどが現在の文脈と結びつくようにしているのかもしれない。

MCMCを行うにはsample関数を使う。

with model:
    trace = sample(5000)

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -2.4075: 100%|██████████| 200000/200000 [00:24<00:00, 8220.59it/s]
Finished [100%]: Average ELBO = -2.4025
100%|██████████| 5000/5000 [00:04<00:00, 1246.86it/s]

summary関数やtraceplot関数で結果を表示することができる。

summary(trace[1000:])
traceplot(trace[1000:])
plt.show()

theta:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.494            0.129            0.003            [0.248, 0.756]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.234          0.407          0.497          0.581          0.745

PyStan¶

次に同じことをPyStanでやってみる。

import pystan

PyStanではモデルは完全にStanの文法で記述することになる。

code = """
data { 
  int<lower=0> k;
} 
parameters {
  real<lower=0,upper=1> theta;
} 
model {
  theta ~ beta(1, 1);
  k ~ binomial(10, theta);
}
"""

data = {'k': 5}

stan関数でMCMCが行われる。

fit = pystan.stan(model_code=code, data=data)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_79f6d134e0766f90fac665338011aa8a NOW.

stan関数の戻り値をprintしたりplotメソッドを呼ぶと結果を表示できる。

print(fit)
fit.plot()

Inference for Stan model: anon_model_79f6d134e0766f90fac665338011aa8a.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
theta    0.5  2.5e-3   0.14   0.24    0.4    0.5    0.6   0.75   2907    1.0
lp__   -8.82    0.02   0.71 -10.93  -9.01  -8.55  -8.37  -8.32   2050    1.0

Samples were drawn using NUTS at Sat Jun  3 12:48:06 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

なんで2つずつ出るのだろうか。これはよくわからない。

PyMCを試すことにする。 Pythonで体験するベイズ推論を買った関係上、PyMC3ではなくてPyMC2を使っている。

この記事の内容は書籍と関係なく、書籍よりずっと初歩的なものである。

import pymc as pm
from pymc.Matplot import plot as mcplot
from pymc import Beta, Binomial, Uniform, Model, MCMC
from pymc.Matplot import plot as mcplot

from matplotlib import pyplot as plt
%matplotlib inline

データが二項分布によって生成されるモデルを考える。二項分布では事象発生の確率を示すパラメタ$\theta$があって、これが推定の対象となる。

$\theta$は0から1までの連続値であり、特に事前の情報を持たないので、 0から1までのすべてが等確率（密度）となる事前分布を使う。これは具体的にはパラメタ$\alpha$と$\beta$がそれぞれ1のベータ分布である。

学習に使うデータとして、10回中5回事象が起こったとしよう。

theta = Beta('$\\theta$', alpha=1, beta=1)
k = Binomial("k", n=10, p=theta, value=5, observed=True)
model = Model([theta, k])

このモデルを使ってMCMCを行い、結果をプロットする。

mcmc = MCMC(model)
mcmc.sample(40000, 10000)

mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 3.6 secPlotting $\theta$

当然だが右側のプロットで示されるように、推定値は0.5の周りになる。 95%区間は点線で示されている。

10回中5回でなく、100回中50回だったら0.5に近いことにもっと確証を持つだろう。

theta = Beta('$\\theta$', alpha=1, beta=1)
k = Binomial("k", n=100, p=theta, value=50, observed=True)
model = Model([theta, k])
mcmc = MCMC(model)
mcmc.sample(40000, 10000)
mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 4.0 secPlotting $\theta$

95%区間が狭くなったことがわかる。

1回中1回、1回中0回、2回中1回も試してみる。

theta = Beta('$\\theta$', alpha=1, beta=1)
k = Binomial("k", n=1, p=theta, value=1, observed=True)
model = Model([theta, k])
mcmc = MCMC(model)
mcmc.sample(40000, 10000)
mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 3.9 secPlotting $\theta$

theta = Beta('$\\theta$', alpha=1, beta=1)
k = Binomial("k", n=1, p=theta, value=0, observed=True)
model = Model([theta, k])
mcmc = MCMC(model)
mcmc.sample(40000, 10000)
mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 3.8 secPlotting $\theta$

theta = Beta('$\\theta$', alpha=1, beta=1)
k = Binomial("k", n=2, p=theta, value=1, observed=True)
model = Model([theta, k])
mcmc = MCMC(model)
mcmc.sample(40000, 10000)
mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 3.9 secPlotting $\theta$

ところで二項分布のパラメタの事前分布としてベータ分布を使うというのは殆どの文献で慣例的にそうなっているのだが、等確率分布ならばUniformを使ってもよいのだろうか。

theta = Uniform('$\\theta$', lower=0, upper=1)
k = Binomial("k", n=10, p=theta, value=5, observed=True)
model = Model([theta, k])
mcmc = MCMC(model)
mcmc.sample(40000, 10000)
mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 3.7 secPlotting $\theta$

theta = Uniform('$\\theta$', lower=0, upper=1)
k = Binomial("k", n=1, p=theta, value=1, observed=True)
model = Model([theta, k])
mcmc = MCMC(model)
mcmc.sample(40000, 10000)
mcplot(mcmc.trace('$\\theta$'))

 [-----------------100%-----------------] 40000 of 40000 complete in 3.8 secPlotting $\theta$

結果に特に違いはないように見える。

「データ指向のソフトウェア品質マネジメント」（野中・小池・小室）の4.2節「テスト工程での不具合検出数の予測」をPythonで追ってみることにする。

今年のソフトウェア品質シンポジウムで著者の一人によるチュートリアルを受講したのだが、それがとても面白かったので購入した。

import math
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
%matplotlib inline

まずは対象データを読み込む。これは書籍のサポートページからダウンロードできる。ダウンロードに書籍記載のパスワードが必要なので生データをここには載せないが、データ自体はFenton (2008)のものである。

ソースコード行数KLOCと不具合検出数Failures以外はカテゴリーデータ（順序尺度）で、D1は開発メンバーの経験、P5はステークホルダーの関与の度合い、P6は顧客とのコミュニケーション、P7は協力会社のマネジメントを表す。

data1 = pd.read_excel(u"4.2節/4-2-data.xls", index_col='Prj')
data1.columns

Index([u'Failures', u'KLOC', u'D1', u'P5', u'P6', u'P7'], dtype='object')

ソースコード行数と不具合検出数のヒストグラムを描くと、いずれも右に裾が長い分布となっている。また、ソースコード行数と不具合検出数には強い正の相関がある。

def scatter_corr_matrix(df):
    pd.scatter_matrix(df)
    plt.show()
    return df.corr()

scatter_corr_matrix(data1)

単回帰分析¶

まずはソースコード行数だけから不具合検出数を予測する単回帰分析を行ってみる。

「データ志向の～」では目的変数と説明変数が右に歪んだ分布となっている場合には、それぞれの対数をとったうえで両変数を対数変換したモデル式を用いるのが典型的であるとして、最初から対数変換した単回帰分析を行っているのだが、ここでは対数変換しない場合とした場合で比べてみることにする。

import statsmodels.formula.api as smf

smf.ols(formula="Failures ~ KLOC", data=data1).fit().summary()

smf.ols(formula="np.log(Failures) ~ np.log(KLOC)", data=data1).fit().summary()

Adj. R-squaredはあまり変わらないのだが、AICで見ると対数変換により改善している。

重回帰分析のための説明変数の検討¶

次に順序尺度を説明変数に加える検討をする。まずは順位尺度の列に順位を表す整数が入っているdata1-1のシートを読む。

data1_1 = pd.read_excel(u"4.2節/4-2-data.xls", index_col='Prj', sheetname='data1-1')
data1_1.describe()

C:\Users\ether\Anaconda2\lib\site-packages\numpy\lib\function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)

ソースコード行数がすでに説明変数に含まれているため、不具合検出密度と各順序尺度の相関（スピアマンの順位相関係数）を計算する。

from scipy.stats import spearmanr
data1_1_d = data1_1.assign(Density = data1_1['Failures'] / data1_1['KLOC'])
data1_1_d.corr(method='spearman')

不具合検出密度と各順序尺度の関係を箱ひげ図で表す。

data1_1_d.boxplot(column='Density', by='P7', return_type='dict')
data1_1_d.boxplot(column='Density', by='D1', return_type='dict')
data1_1_d.boxplot(column='Density', by='P5', return_type='dict')
data1_1_d.boxplot(column='Density', by='P6', return_type='dict')
plt.show()

これを見ると不具合検出密度と相関が高いのはP7（協力会社のマネジメントが高いと不具合検出密度が低い）だが、「データ志向の～」ではP7はデータ件数が少ない(n=7)こと、ソースコード行数との相関係数も極めて大きいこと（多重共線性の疑い）を理由に退けている。

重回帰分析の説明変数をステップワイズで選択¶

P7以外の順序尺度をステップワイズで選択する。まずは全部を使う。

smf.ols(formula="np.log(Failures) ~ np.log(KLOC) + D1 + P5 + P6", data=data1).fit().summary()

ここから1つずつ減らしていく。

formulas = [
    "np.log(Failures) ~ np.log(KLOC) + D1 + P5 + P6",
    "np.log(Failures) ~ np.log(KLOC) + D1 + P5",
    "np.log(Failures) ~ np.log(KLOC) + D1",
    "np.log(Failures) ~ np.log(KLOC)",
]
steps = [[res.llf, res.aic, res.rsquared, res.rsquared_adj, formula]
         for (res, formula) in [(smf.ols(formula=formula, data=data1).fit(), formula) for formula in formulas]]
DataFrame(steps, columns=['Log-Likelihood', 'AIC', 'R-squared', 'Adj. R-squared', 'formula'])

AICおよびAdj. R-squaredで判断するとlog(KLOC)とD1を使うモデルが最も良い。

smf.ols(formula="np.log(Failures) ~ np.log(KLOC) + D1", data=data1).fit().summary()

このサマリーを見るとD1のダミー変数の中でも2_Lと3_MのP値が高い。そこで「データ指向の～」では3_M以下の水準を統合することを提案している。

d1_map = {'1_VL': '3_M', '2_L': '3_M', '3_M': '3_M', '4_H': '4_H', '5_VH': '5_VH'}
data1_d1_new = data1.assign(D1_new = data1['D1'].apply(lambda x: d1_map[x]))

ols_final = smf.ols(formula="np.log(Failures) ~ np.log(KLOC) + D1_new", data=data1_d1_new).fit()
ols_final.summary()

データ加工により、Adj. R-squaredもAICも向上したことがわかる。

モデルの評価¶

予測モデルの評価として、「データ指向の～」ではまず予測値と実測値の相関係数を評価している。

scatter_corr_matrix(
    pd.concat(
        [data1.dropna(subset=['KLOC']).reset_index(drop=True),
         Series(np.exp(ols_final.predict()), name='predict')],
        axis=1)[['Failures', 'predict']])

上記では散布図も表示している。なお、「データ指向の～」（2015年1月13日第4刷）に掲載の散布図はX軸とY軸が逆のように思える（軸ラベル上Xが予測でYが実績なのだが、右上でなく左下の散布図になっている）。

相関係数が0.887であることから、うまく予測できているように見えると判断している。

このくだりにはちょっとよくわからないことろがあって、そもそも予測値と実績値の相関係数というのは予測モデルの評価の仕方として適当なのだろうかという疑問がある。というのはモデルが誤っていて「本来の値の約半分の値を予測する」というものであった場合、予測モデルとしては役に立たないが、相関係数は半分でない場合と同様に高くなるはずだからだ。

次に予測誤差 MRE (Megnitude of Relative Error)と呼ばれる評価指標を出している。こちらのほうが直感的な定義に思える。MREは1つ1つの予測に対する値なので、全体で1つの指標値にするには予測誤差の平均とか中央値とかで見るようだ。

def mre(actual, predicted):
    df = DataFrame({'actual': actual, 'predicted': predicted})
    mre = df.eval("abs(actual - predicted) / actual")
    mre.plot.box()
    plt.show()
    print mre.describe().round(2)
    return mre.median()

mre(
    data1.dropna(subset=['KLOC'])['Failures'],
    np.exp(ols_final.predict()))

count    29.00
mean      0.44
std       0.30
min       0.07
25%       0.22
50%       0.34
75%       0.55
max       1.41
dtype: float64

0.34027626468821626

中央値は34%程度（本文では30%と書いているが）である。

ポワソン回帰分析¶

この節のコラム記事では対数変換したうえでの重回帰分析（最小二乗法）ではなくポワソン回帰を行ったほうがよい予測モデルになるということが書かれている。本文には結果のみで、過程は省かれているが、これを行ってみたい。

Statsmodelsでポワソン回帰を行うにはglmを使う。コラムに掲載された結果と同じになるモデル式は下記だ。

import statsmodels.api as sm

poisson_model = smf.glm(
    formula='Failures ~ np.log(KLOC) + D1 + P5 + P6',
    data=data1,
    family=sm.families.Poisson()).fit()
poisson_model.summary()

scatter_corr_matrix(
    pd.concat(
        [data1.dropna(subset=['KLOC']).reset_index(drop=True),
         Series(poisson_model.predict(), name='predict')],
        axis=1)[['Failures', 'predict']])

予測値と実測値の相関係数は向上している。

一方MREを計算すると中央値では改善するが平均値では悪くなっており、またばらつきも大きくなっている。

mre(
    data1.dropna(subset=['KLOC'])['Failures'],
    poisson_model.predict())

count    29.00
mean      0.49
std       0.58
min       0.00
25%       0.12
50%       0.25
75%       0.52
max       1.83
dtype: float64

0.25178105769280135

こうしてみてくると対数変換の有無、説明変数の選択、順序尺度の統合、回帰手法の選択といった多くの「ノブ」があって、評価尺度も1つではない。モデリングが探索的な過程であると感じられる。

今度は説明変数にカテゴリーデータを含む場合の回帰分析を行う。

%matplotlib inline

対象となるデータはR処理系のcarパッケージに付属しているPrestigeというデータを write.csv(Prestige, "Prestige.csv", quote=FALSE, row.names=TRUE) でCSVにしたものである。

prestige = pd.read_csv("Prestige.csv", index_col=0)
prestige.columns

Index([u'education', u'income', u'women', u'prestige', u'census', u'type'], dtype='object')

このデータの各行は職業で、カナダの国勢調査で得られたデータである。変数は「定性的データ分析」（金明哲）によると次の通り。

education: 平均教育年数, income: 平均所得, women: 女性が占める割合, prestige: 職業地位の得点, census: カナダの国勢調査の職業コード, type: {bc: ブルーカラー, prof: 専門家, wc: ホワイトカラー}

職業コードは無視すると、typeがカテゴリーデータである。

次に散布図行列を書く。pandasのscatter_matrixは残念ながらカテゴリーデータを散布図行列に含めてくれない。もっとも含まれていても相関の解釈はしづらい。

from pandas.tools.plotting import scatter_matrix
scatter_matrix(prestige.drop("census", axis=1), figsize=(7, 7))
plt.show()

目的変数にしたいprestigeと相関を持ちそうなのがeducationとincomeであることがわかる。

重回帰分析にはstatsmodels.formula.apiのほうの関数を使う。というのもこちらを使うとカテゴリー変数に対して自動的にダミー変数を作ってくれるからだ。

import statsmodels.formula.api as smf

mod = smf.ols(formula="prestige ~ education + income + women + type", data=prestige)
res = mod.fit()
print res.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.826
Method:                 Least Squares   F-statistic:                     93.07
Date:                Thu, 10 Nov 2016   Prob (F-statistic):           1.93e-34
Time:                        22:51:20   Log-Likelihood:                -328.48
No. Observations:                  98   AIC:                             669.0
Df Residuals:                      92   BIC:                             684.5
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       -0.8139      5.331     -0.153      0.879       -11.402     9.774
type[T.prof]     5.9052      3.938      1.500      0.137        -1.915    13.726
type[T.wc]      -2.9171      2.665     -1.094      0.277        -8.211     2.377
education        3.6624      0.646      5.671      0.000         2.380     4.945
income           0.0010      0.000      3.976      0.000         0.001     0.002
women            0.0064      0.030      0.212      0.832        -0.054     0.067
==============================================================================
Omnibus:                        0.681   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.711   Jarque-Bera (JB):                0.798
Skew:                          -0.112   Prob(JB):                        0.671
Kurtosis:                       2.619   Cond. No.                     7.48e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.48e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

次に変数選択を行う。ここではAICでもAdj. R-squaredでも同じく、education + income + typeの場合が最もよい。

formulas = [
    "prestige ~ education + income",
    "prestige ~ education + income + type",
    "prestige ~ education + income + women + type",
]
steps = [[res.llf, res.aic, res.rsquared, res.rsquared_adj, formula]
         for (res, formula) in [(smf.ols(formula=formula, data=prestige).fit(), formula) for formula in formulas]]
DataFrame(steps, columns=['Log-Likelihood', 'AIC', 'R-squared', 'Adj. R-squared', 'formula'])

選択された変数で改めて重回帰分析を行う。

mod = smf.ols(formula="prestige ~ education + income + type", data=prestige)
res = mod.fit()
print res.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.828
Method:                 Least Squares   F-statistic:                     117.5
Date:                Thu, 10 Nov 2016   Prob (F-statistic):           1.70e-35
Time:                        22:51:20   Log-Likelihood:                -328.51
No. Observations:                  98   AIC:                             667.0
Df Residuals:                      93   BIC:                             679.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       -0.6229      5.228     -0.119      0.905       -11.004     9.758
type[T.prof]     6.0390      3.867      1.562      0.122        -1.640    13.718
type[T.wc]      -2.7372      2.514     -1.089      0.279        -7.729     2.255
education        3.6732      0.641      5.735      0.000         2.401     4.945
income           0.0010      0.000      4.586      0.000         0.001     0.001
==============================================================================
Omnibus:                        0.500   Durbin-Watson:                   1.787
Prob(Omnibus):                  0.779   Jarque-Bera (JB):                0.652
Skew:                          -0.106   Prob(JB):                        0.722
Kurtosis:                       2.662   Cond. No.                     7.34e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.34e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

多重共線性について確かめるためVIFを計算する。

from statsmodels.stats.outliers_influence import variance_inflation_factor

num_cols = mod.exog.shape[1] # 説明変数の列数
vifs = [variance_inflation_factor(mod.exog, i)
        for i in range(0, num_cols)]
DataFrame(vifs, index=mod.exog_names, columns=["VIF"])

ここではカテゴリー変数typeのダミー変数に対してVIFが計算されているが、「定性的データ分析」のRの実行結果ではtypeそのものに対して6.102131という結果が出ている。これをStatsmodelsでどうすればいいのか（Rがどう計算しているのか）はわからない。いったん10以下であることから多重共線性はないと判断する。

「定性的データ分析」ではprestigeに対するincomeの関係が直線でなく右側が吊り上がっていることに注目し、変数incomeを対数変換することを提案している。これをStatsmodelsで行う場合、式にはlog(income)でなく、numpy.logを使って書く必要があるようだ。

import math
mod = smf.ols(formula="prestige ~ education + np.log(income) + type", data=prestige)
res = mod.fit()
print res.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.855
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                     137.6
Date:                Thu, 10 Nov 2016   Prob (F-statistic):           3.51e-38
Time:                        22:51:21   Log-Likelihood:                -321.97
No. Observations:                  98   AIC:                             653.9
Df Residuals:                      93   BIC:                             666.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept        -81.2019     13.743     -5.909      0.000      -108.493   -53.911
type[T.prof]       6.7509      3.618      1.866      0.065        -0.435    13.937
type[T.wc]        -1.4394      2.378     -0.605      0.546        -6.162     3.283
education          3.2845      0.608      5.401      0.000         2.077     4.492
np.log(income)    10.4875      1.717      6.109      0.000         7.078    13.897
==============================================================================
Omnibus:                        0.073   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.964   Jarque-Bera (JB):                0.027
Skew:                           0.034   Prob(JB):                        0.987
Kurtosis:                       2.956   Cond. No.                         292.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

AICが良くなり、Adj. R-squaredも0.828から0.849に向上している。

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	9.4969	65.505	0.145	0.886	-124.909 143.903
KLOC	11.4165	1.167	9.779	0.000	9.021 13.812

	Failures	KLOC	D1	P5	P6	P7
count	31.000000	29.000000	31.000000	31.000000	31.000000	7.000000
mean	445.225806	39.379310	3.322581	4.032258	4.064516	2.857143
std	510.915695	40.680299	0.979357	0.752058	0.771815	0.899735
min	5.000000	1.000000	1.000000	3.000000	3.000000	2.000000
25%	100.000000	NaN	3.000000	3.500000	3.500000	NaN
50%	209.000000	NaN	4.000000	4.000000	4.000000	NaN
75%	662.500000	NaN	4.000000	5.000000	5.000000	NaN
max	1906.000000	155.000000	5.000000	5.000000	5.000000	4.000000

	Failures	KLOC	D1	P5	P6	P7	Density
Failures	1.000000	0.850419	-0.408264	-0.254397	-0.136753	0.018898	0.395567
KLOC	0.850419	1.000000	-0.114229	-0.055048	-0.031084	0.973329	-0.089946
D1	-0.408264	-0.114229	1.000000	0.488971	0.516152	0.632785	-0.692716
P5	-0.254397	-0.055048	0.488971	1.000000	0.645129	0.083666	-0.435407
P6	-0.136753	-0.031084	0.516152	0.645129	1.000000	-0.591960	-0.373876
P7	0.018898	0.973329	0.632785	0.083666	-0.591960	1.000000	-0.790569
Density	0.395567	-0.089946	-0.692716	-0.435407	-0.373876	-0.790569	1.000000

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	3.6092	0.788	4.582	0.000	1.961 5.258
D1[T.2_L]	-0.0876	0.664	-0.132	0.896	-1.478 1.302
D1[T.3_M]	-0.5328	0.712	-0.748	0.463	-2.023 0.957
D1[T.4_H]	-1.1245	0.665	-1.690	0.107	-2.517 0.268
D1[T.5_VH]	-1.2480	0.775	-1.611	0.124	-2.870 0.373
P5[T.4_H]	-0.3889	0.430	-0.904	0.377	-1.290 0.512
P5[T.5_VH]	0.0721	0.682	0.106	0.917	-1.356 1.500
P6[T.4_H]	0.0671	0.338	0.198	0.845	-0.641 0.775
P6[T.5_VH]	-0.3956	0.667	-0.593	0.560	-1.793 1.001
np.log(KLOC)	0.9235	0.089	10.430	0.000	0.738 1.109

	Log-Likelihood	AIC	R-squared	Adj. R-squared	formula
0	-18.504129	57.008257	0.896515	0.847496	np.log(Failures) ~ np.log(KLOC) + D1 + P5 + P6
1	-18.910947	53.821894	0.893571	0.858094	np.log(Failures) ~ np.log(KLOC) + D1 + P5
2	-19.531827	51.063655	0.888914	0.864765	np.log(Failures) ~ np.log(KLOC) + D1
3	-29.480676	62.961352	0.779383	0.771212	np.log(Failures) ~ np.log(KLOC)

wonderful cool something

ポワソン分布のシミュレーション

複数の観察データからのパラメタ推定 (PyMC3, PyStan)

PyMC3¶

PyStan¶

2つの二項分布のパラメタの差の推定(PyMC3, PyStan)

PyMC3¶

PyStan¶

PyMC3とPyStanによる二項分布のパラメタ推定

PyMC3¶

PyStan¶

PyMCの初歩

「データ指向のソフトウェア品質マネジメント」の「テスト工程での不具合検出数の予測」をPythonで行う

単回帰分析¶

重回帰分析のための説明変数の検討¶

重回帰分析の説明変数をステップワイズで選択¶

モデルの評価¶

ポワソン回帰分析¶

カテゴリーデータを説明変数に含む回帰分析

Dep. Variable:	Failures	R-squared:	0.780
Model:	OLS	Adj. R-squared:	0.772
Method:	Least Squares	F-statistic:	95.64
Date:	Sat, 12 Nov 2016	Prob (F-statistic):	2.29e-10
Time:	23:10:02	Log-Likelihood:	-200.39
No. Observations:	29	AIC:	404.8
Df Residuals:	27	BIC:	407.5
Df Model:	1
Covariance Type:	nonrobust

Omnibus:	0.160	Durbin-Watson:	2.363
Prob(Omnibus):	0.923	Jarque-Bera (JB):	0.091
Skew:	-0.109	Prob(JB):	0.956
Kurtosis:	2.834	Cond. No.	78.8

Dep. Variable:	np.log(Failures)	R-squared:	0.779
Model:	OLS	Adj. R-squared:	0.771
Method:	Least Squares	F-statistic:	95.38
Date:	Sat, 12 Nov 2016	Prob (F-statistic):	2.35e-10
Time:	23:10:02	Log-Likelihood:	-29.481
No. Observations:	29	AIC:	62.96
Df Residuals:	27	BIC:	65.70
Df Model:	1
Covariance Type:	nonrobust

Omnibus:	5.037	Durbin-Watson:	2.194
Prob(Omnibus):	0.081	Jarque-Bera (JB):	1.856
Skew:	-0.181	Prob(JB):	0.395
Kurtosis:	1.815	Cond. No.	9.02

Omnibus:	0.572	Durbin-Watson:	2.426
Prob(Omnibus):	0.751	Jarque-Bera (JB):	0.560
Skew:	0.296	Prob(JB):	0.756
Kurtosis:	2.664	Cond. No.	49.1

Omnibus:	0.773	Durbin-Watson:	2.459
Prob(Omnibus):	0.679	Jarque-Bera (JB):	0.831
Skew:	0.287	Prob(JB):	0.660
Kurtosis:	2.402	Cond. No.	44.1

Omnibus:	0.826	Durbin-Watson:	2.461
Prob(Omnibus):	0.662	Jarque-Bera (JB):	0.841
Skew:	0.241	Prob(JB):	0.657
Kurtosis:	2.318	Cond. No.	15.0

	coef	std err	z	P>\|z\|	[95.0% Conf. Int.]
Intercept	3.9355	0.064	61.320	0.000	3.810 4.061
D1[T.2_L]	-0.3720	0.045	-8.224	0.000	-0.461 -0.283
D1[T.3_M]	-0.7369	0.048	-15.207	0.000	-0.832 -0.642
D1[T.4_H]	-0.9145	0.044	-20.781	0.000	-1.001 -0.828
D1[T.5_VH]	-1.2531	0.088	-14.187	0.000	-1.426 -1.080
P5[T.4_H]	-0.7771	0.035	-21.923	0.000	-0.847 -0.708
P5[T.5_VH]	0.1282	0.086	1.498	0.134	-0.040 0.296
P6[T.4_H]	-0.1638	0.028	-5.910	0.000	-0.218 -0.109
P6[T.5_VH]	-0.5596	0.083	-6.769	0.000	-0.722 -0.398
np.log(KLOC)	0.9393	0.013	70.423	0.000	0.913 0.965

	Log-Likelihood	AIC	R-squared	Adj. R-squared	formula
0	-352.862536	711.725072	0.798001	0.793920	prestige ~ education + income
1	-328.507546	667.015091	0.834857	0.827755	prestige ~ education + income + type
2	-328.483590	668.967180	0.834938	0.825967	prestige ~ education + income + women + type

	VIF
Intercept	53.203117
type[T.prof]	6.295713
type[T.wc]	2.209983
education	5.973932
income	1.681325

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	3.1819	0.622	5.117	0.000	1.895 4.468
D1[T.2_L]	0.0438	0.613	0.071	0.944	-1.225 1.312
D1[T.3_M]	-0.2468	0.565	-0.436	0.667	-1.417 0.923
D1[T.4_H]	-1.0533	0.557	-1.890	0.071	-2.206 0.100
D1[T.5_VH]	-1.1696	0.660	-1.771	0.090	-2.536 0.197
np.log(KLOC)	0.9333	0.082	11.389	0.000	0.764 1.103

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	3.1155	0.279	11.161	0.000	2.541 3.690
D1_new[T.4_H]	-0.9215	0.201	-4.576	0.000	-1.336 -0.507
D1_new[T.5_VH]	-1.0436	0.397	-2.630	0.014	-1.861 -0.226
np.log(KLOC)	0.9110	0.074	12.295	0.000	0.758 1.064

	Failures	predict
Failures	1.000000	0.886818
predict	0.886818	1.000000

	Failures	predict
Failures	1.000000	0.950359
predict	0.950359	1.000000