今度は説明変数にカテゴリーデータを含む場合の回帰分析を行う。

%matplotlib inline

対象となるデータはR処理系のcarパッケージに付属しているPrestigeというデータを write.csv(Prestige, "Prestige.csv", quote=FALSE, row.names=TRUE) でCSVにしたものである。

prestige = pd.read_csv("Prestige.csv", index_col=0)
prestige.columns

Index([u'education', u'income', u'women', u'prestige', u'census', u'type'], dtype='object')

このデータの各行は職業で、カナダの国勢調査で得られたデータである。変数は「定性的データ分析」（金明哲）によると次の通り。

education: 平均教育年数, income: 平均所得, women: 女性が占める割合, prestige: 職業地位の得点, census: カナダの国勢調査の職業コード, type: {bc: ブルーカラー, prof: 専門家, wc: ホワイトカラー}

職業コードは無視すると、typeがカテゴリーデータである。

次に散布図行列を書く。pandasのscatter_matrixは残念ながらカテゴリーデータを散布図行列に含めてくれない。もっとも含まれていても相関の解釈はしづらい。

from pandas.tools.plotting import scatter_matrix
scatter_matrix(prestige.drop("census", axis=1), figsize=(7, 7))
plt.show()

目的変数にしたいprestigeと相関を持ちそうなのがeducationとincomeであることがわかる。

重回帰分析にはstatsmodels.formula.apiのほうの関数を使う。というのもこちらを使うとカテゴリー変数に対して自動的にダミー変数を作ってくれるからだ。

import statsmodels.formula.api as smf

mod = smf.ols(formula="prestige ~ education + income + women + type", data=prestige)
res = mod.fit()
print res.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.826
Method:                 Least Squares   F-statistic:                     93.07
Date:                Thu, 10 Nov 2016   Prob (F-statistic):           1.93e-34
Time:                        22:51:20   Log-Likelihood:                -328.48
No. Observations:                  98   AIC:                             669.0
Df Residuals:                      92   BIC:                             684.5
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       -0.8139      5.331     -0.153      0.879       -11.402     9.774
type[T.prof]     5.9052      3.938      1.500      0.137        -1.915    13.726
type[T.wc]      -2.9171      2.665     -1.094      0.277        -8.211     2.377
education        3.6624      0.646      5.671      0.000         2.380     4.945
income           0.0010      0.000      3.976      0.000         0.001     0.002
women            0.0064      0.030      0.212      0.832        -0.054     0.067
==============================================================================
Omnibus:                        0.681   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.711   Jarque-Bera (JB):                0.798
Skew:                          -0.112   Prob(JB):                        0.671
Kurtosis:                       2.619   Cond. No.                     7.48e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.48e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

次に変数選択を行う。ここではAICでもAdj. R-squaredでも同じく、education + income + typeの場合が最もよい。

formulas = [
    "prestige ~ education + income",
    "prestige ~ education + income + type",
    "prestige ~ education + income + women + type",
]
steps = [[res.llf, res.aic, res.rsquared, res.rsquared_adj, formula]
         for (res, formula) in [(smf.ols(formula=formula, data=prestige).fit(), formula) for formula in formulas]]
DataFrame(steps, columns=['Log-Likelihood', 'AIC', 'R-squared', 'Adj. R-squared', 'formula'])

選択された変数で改めて重回帰分析を行う。

mod = smf.ols(formula="prestige ~ education + income + type", data=prestige)
res = mod.fit()
print res.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.828
Method:                 Least Squares   F-statistic:                     117.5
Date:                Thu, 10 Nov 2016   Prob (F-statistic):           1.70e-35
Time:                        22:51:20   Log-Likelihood:                -328.51
No. Observations:                  98   AIC:                             667.0
Df Residuals:                      93   BIC:                             679.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       -0.6229      5.228     -0.119      0.905       -11.004     9.758
type[T.prof]     6.0390      3.867      1.562      0.122        -1.640    13.718
type[T.wc]      -2.7372      2.514     -1.089      0.279        -7.729     2.255
education        3.6732      0.641      5.735      0.000         2.401     4.945
income           0.0010      0.000      4.586      0.000         0.001     0.001
==============================================================================
Omnibus:                        0.500   Durbin-Watson:                   1.787
Prob(Omnibus):                  0.779   Jarque-Bera (JB):                0.652
Skew:                          -0.106   Prob(JB):                        0.722
Kurtosis:                       2.662   Cond. No.                     7.34e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.34e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

多重共線性について確かめるためVIFを計算する。

from statsmodels.stats.outliers_influence import variance_inflation_factor

num_cols = mod.exog.shape[1] # 説明変数の列数
vifs = [variance_inflation_factor(mod.exog, i)
        for i in range(0, num_cols)]
DataFrame(vifs, index=mod.exog_names, columns=["VIF"])

ここではカテゴリー変数typeのダミー変数に対してVIFが計算されているが、「定性的データ分析」のRの実行結果ではtypeそのものに対して6.102131という結果が出ている。これをStatsmodelsでどうすればいいのか（Rがどう計算しているのか）はわからない。いったん10以下であることから多重共線性はないと判断する。

「定性的データ分析」ではprestigeに対するincomeの関係が直線でなく右側が吊り上がっていることに注目し、変数incomeを対数変換することを提案している。これをStatsmodelsで行う場合、式にはlog(income)でなく、numpy.logを使って書く必要があるようだ。

import math
mod = smf.ols(formula="prestige ~ education + np.log(income) + type", data=prestige)
res = mod.fit()
print res.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.855
Model:                            OLS   Adj. R-squared:                  0.849
Method:                 Least Squares   F-statistic:                     137.6
Date:                Thu, 10 Nov 2016   Prob (F-statistic):           3.51e-38
Time:                        22:51:21   Log-Likelihood:                -321.97
No. Observations:                  98   AIC:                             653.9
Df Residuals:                      93   BIC:                             666.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept        -81.2019     13.743     -5.909      0.000      -108.493   -53.911
type[T.prof]       6.7509      3.618      1.866      0.065        -0.435    13.937
type[T.wc]        -1.4394      2.378     -0.605      0.546        -6.162     3.283
education          3.2845      0.608      5.401      0.000         2.077     4.492
np.log(income)    10.4875      1.717      6.109      0.000         7.078    13.897
==============================================================================
Omnibus:                        0.073   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.964   Jarque-Bera (JB):                0.027
Skew:                           0.034   Prob(JB):                        0.987
Kurtosis:                       2.956   Cond. No.                         292.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

AICが良くなり、Adj. R-squaredも0.828から0.849に向上している。

	Log-Likelihood	AIC	R-squared	Adj. R-squared	formula
0	-352.862536	711.725072	0.798001	0.793920	prestige ~ education + income
1	-328.507546	667.015091	0.834857	0.827755	prestige ~ education + income + type
2	-328.483590	668.967180	0.834938	0.825967	prestige ~ education + income + women + type

wonderful cool something

カテゴリーデータを説明変数に含む回帰分析

	VIF
Intercept	53.203117
type[T.prof]	6.295713
type[T.wc]	2.209983
education	5.973932
income	1.681325