Post on 22-May-2020
Fundamentals of Data Analysis B, W06: GLM
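Theory of regression analysis
Regression model: $\hat{y}_i = b_0 + b_1 x_i$
  $\hat{y}_i$: predicted value of the response variable for observation (participant) $i$
  $b_0$: intercept of the regression line
  $b_1$: slope of the regression line
  $x_i$: explanatory (independent) variable for observation $i$
  Regression line: a linear approximation of the relationship between the explanatory and response variables
Population relationship: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
  $\epsilon_i$: error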
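Theory of regression analysis: hypotheses and assumptions
Zero-mean assumption: $E[\epsilon_i] = 0, \forall i$
  The error (epsilon) is a random variable whose mean is zero.
Constant-variance assumption: $\sigma^2_{\epsilon_i} = \sigma^2_{\epsilon}, \forall i$
  The variance of the error is constant, regardless of the value of the independent variable.
Independence assumption: $\mathrm{cov}(\epsilon_i, \epsilon_j) = 0, \forall i \neq j$
  Each error is independent of the other errors.
Normality assumption: $\epsilon \sim N(0, \sigma^2_{\epsilon})$
  The error follows a normal distribution.
$E[y|x] = E[\beta_0 + \beta_1 x + \epsilon \,|\, x] = E[\beta_0|x] + E[\beta_1 x|x] + E[\epsilon|x] = \beta_0 + \beta_1 x$
$\mathrm{VAR}[y|x] = \mathrm{VAR}[\beta_0 + \beta_1 x + \epsilon \,|\, x] = \mathrm{VAR}[\epsilon|x] = \sigma^2_{\epsilon}$
$y|x \sim N(\beta_0 + \beta_1 x, \sigma^2_{\epsilon})$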
Regression diagnostics
> par(mfrow=c(2,2))
> plot(dat.lm01)
[Figure: the four diagnostic panels produced by plot(dat.lm01): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 16, 27, and 43 are labeled in each panel.]
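The data behind dat.lm01 are not shown on the slides, so the following is only a sketch on simulated data. The names dat.sim and dat.lm.sim and the true parameter values (intercept 0.5, slope 1, error SD 1) are made up for illustration. The errors are generated so that all four assumptions hold, which gives a reference for what "well-behaved" diagnostic plots look like.

```r
# Simulated data (hypothetical values, not the lecture's data set)
set.seed(1)
n   <- 50
x   <- rnorm(n)
eps <- rnorm(n, mean = 0, sd = 1)   # zero mean, constant variance, independent, normal
y   <- 0.5 + 1.0 * x + eps          # population relationship y = beta0 + beta1 * x + error

dat.sim    <- data.frame(x = x, y = y)
dat.lm.sim <- lm(y ~ x, data = dat.sim)

# Same diagnostic call as on the slide: four panels in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(dat.lm.sim)
par(op)                             # restore the previous graphics settings

coef(dat.lm.sim)                    # estimates should be near 0.5 and 1.0
```

Breaking one assumption in the simulation, for example by using sd = abs(x) for the errors, is a quick way to see how each panel reacts.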
When the response variable follows a binomial distribution
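Logistic regression analysis
Logistic regression
  Dependent variable: binary (binomial, Bernoulli)
  Independent variables: quantitative and qualitative
  Purpose: explain or predict membership in one of two groups; explain or predict success vs. failure
Regression analysis
  Dependent variable: quantitative
  Independent variables: quantitative and qualitative
  Purpose: predict a quantitative variable (test differences between groups)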
Study time and earning course credit
Candidate regression models
Model that predicts a probability: P = b0 + b1*X1
P = link(b0 + b1*X1)
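LOGISTIC REGRESSION
$E[y|x] = \pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} \leftrightarrow \log\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x$
$\frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x))}$
$\log\frac{\pi(x)}{1 - \pi(x)} = \mathrm{logit}(\pi(x))$
$\pi(x)$ is bounded below by 0 and above by 1, so $\pi(x)$ can be interpreted as a probability:
$\lim_{\beta_0 + \beta_1 x \to -\infty} \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} = 0$
$\lim_{\beta_0 + \beta_1 x \to +\infty} \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)} = 1$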
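The identities above can be checked numerically. This is a minimal sketch: z stands for the linear predictor (intercept plus slope times x) and its grid of values is arbitrary. plogis() and qlogis() are the logistic and logit functions in base R's stats package.

```r
# z plays the role of the linear predictor; the grid is arbitrary
z <- seq(-10, 10, by = 0.5)

p1 <- exp(z) / (1 + exp(z))    # first form of the logistic function
p2 <- 1 / (1 + exp(-z))        # equivalent second form
p3 <- plogis(z)                # built-in logistic function

all.equal(p1, p2)              # the two algebraic forms agree
all.equal(p1, p3)              # and match the built-in

# The logit recovers the linear predictor
all.equal(log(p1 / (1 - p1)), z)
all.equal(qlogis(p1), z)

range(p1)                      # stays strictly between 0 and 1 on this grid
```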
Example: logistic regression
> dat.lr <- glm(pass ~ study, family = binomial, data = dat)
> summary(dat.lr)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3104  -0.4319   0.1478   0.4921   2.4196
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.14560    0.38588  -8.152 3.59e-16 ***
study        0.27346    0.02922   9.359  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 406.83  on 299  degrees of freedom
Residual deviance: 212.73  on 298  degrees of freedom
AIC: 216.73
Distribution of the response variable: binomial
LOGISTIC regression model
With P(pass) as the dependent variable: $P(\text{pass}) = \frac{1}{1 + \exp(-(b_0 + b_1\,\text{study}))}$
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.14560    0.38588  -8.152 3.59e-16 ***
study        0.27346    0.02922   9.359  < 2e-16 ***
P(pass | study=5)  = 1/(1+exp(-(-3.14560+0.27346*5)))  = 0.145
P(pass | study=15) = 1/(1+exp(-(-3.14560+0.27346*15))) = 0.722
P(pass | study=25) = 1/(1+exp(-(-3.14560+0.27346*25))) = 0.976
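The three probabilities can be reproduced from the printed coefficients alone. This is a small sketch in which b0, b1, study, and p.pass are illustrative names; the coefficient values are copied from the summary output.

```r
# Coefficients copied from the summary output
b0 <- -3.14560   # intercept
b1 <-  0.27346   # slope for study (hours)

study  <- c(5, 15, 25)
p.pass <- plogis(b0 + b1 * study)   # same as 1 / (1 + exp(-(b0 + b1 * study)))
round(p.pass, 3)
```

With the fitted object available, predict(dat.lr, newdata = data.frame(study = c(5, 15, 25)), type = "response") should give the same values without copying coefficients by hand.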
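Interpreting the LOGISTIC REGRESSION
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.14560    0.38588  -8.152 3.59e-16 ***
study        0.27346    0.02922   9.359  < 2e-16 ***
• Change in the log odds for each additional hour of study = 0.273
• Multiplicative change in the odds for each additional hour of study = exp(0.27346) = 1.3145
• For each additional hour of study, the odds of passing (not the probability) are multiplied by 1.31; that is, the odds ratio per hour is 1.31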
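The step from the log-odds coefficient to a multiplicative change in the odds can be written out explicitly. This is a short derivation from the logistic model, writing $\pi(x)$ for the probability of passing at study time $x$ and $\beta_0, \beta_1$ for the intercept and slope:

```latex
\begin{align*}
\mathrm{odds}(x) &= \frac{\pi(x)}{1 - \pi(x)} = \exp(\beta_0 + \beta_1 x) \\
\frac{\mathrm{odds}(x+1)}{\mathrm{odds}(x)}
  &= \frac{\exp(\beta_0 + \beta_1 (x+1))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1) \\
\exp(0.27346) &\approx 1.3145
\end{align*}
```

The ratio does not depend on $x$, which is why the odds change by the same factor for every additional hour, while the change in probability differs from hour to hour.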
Interpreting the LOGISTIC REGRESSION
Probability of passing (10 to 15 hours of study)
> coef = coef(dat.lr)   # coefficient vector; assumed to have been defined this way
> pred.pass.p = 1/(1+exp(-(coef[1]+coef[2]*c(10:15))))
> pred.pass.p
[1] 0.3986719 0.4656686 0.5339273 0.6009388 0.6643720 0.7223802
Odds of passing vs. not passing
> odds = pred.pass.p/(1-pred.pass.p)
> odds
[1] 0.6629855 0.8714978 1.1455882 1.5058815 1.9794888 2.6020479
Change in the odds for each additional hour
> odds[2:6]/odds[1:5]
[1] 1.314505 1.314505 1.314505 1.314505 1.314505
> exp(coef[2])
   study
1.314505
Evaluating the LOGISTIC REGRESSION
> dat <- read.csv("http://www.matsuka.info/data_folder/datWA01.txt")
> dat.lr <- glm(gender~shoesize, family=binomial, data=dat)
> anova(dat.lr, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: gender
Terms added sequentially (first to last)
         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                        69     96.983
shoesize  1   42.247        68     54.737 8.045e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
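Null hypothesis $H_0$: the null (intercept-only) model and the model with shoesize fit the data equally well (equal deviance)
Alternative hypothesis $H_1$: the two models do not fit equally well (the deviances differ)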
Model comparison
> dat.lr0 <- glm(gender~1, family="binomial", data=dat)
> summary(dat.lr0)
Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.153  -1.153  -1.153   1.202   1.202
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.05716    0.23914  -0.239    0.811
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 96.983  on 69  degrees of freedom
Residual deviance: 96.983  on 69  degrees of freedom
AIC: 98.983
Number of Fisher Scoring iterations: 3
$AIC = 2k - 2\ln(L)$
k: number of parameters ("complexity")
L: likelihood ("error")
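The AIC definition can be checked against R's built-in AIC(). This sketch uses simulated binary data because the lecture's data file may not be available. The variables shoesize.sim and gender.sim and the helper aic.manual() are hypothetical, and the generating parameters are made up.

```r
# Simulated binary outcome (hypothetical data, not datWA01.txt)
set.seed(2)
n            <- 70
shoesize.sim <- rnorm(n, mean = 25, sd = 1.5)
gender.sim   <- rbinom(n, size = 1, prob = plogis(0.8 * (shoesize.sim - 25)))

fit0 <- glm(gender.sim ~ 1,            family = binomial)   # null model, k = 1
fit1 <- glm(gender.sim ~ shoesize.sim, family = binomial)   # one predictor, k = 2

# AIC = 2k - 2 ln(L), with k the number of estimated coefficients
aic.manual <- function(fit) {
  k <- length(coef(fit))
  2 * k - 2 * as.numeric(logLik(fit))
}

c(manual = aic.manual(fit0), builtin = AIC(fit0))
c(manual = aic.manual(fit1), builtin = AIC(fit1))

# For ungrouped 0/1 data the residual deviance equals -2 ln(L),
# so AIC is the residual deviance plus 2k
deviance(fit1) + 2 * length(coef(fit1))
```

The last line matches the pattern in the summary output for the null model, where 96.983 + 2 * 1 = 98.983.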
Example: logistic regression
> dat.lrS <- glm(gender~shoesize, family=binomial, data=dat)
> summary(dat.lrS)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -34.3214     7.3489  -4.670 3.01e-06 ***
shoesize      1.3817     0.2969   4.654 3.25e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 96.983  on 69  degrees of freedom
Residual deviance: 54.737  on 68  degrees of freedom
AIC: 58.737
Number of Fisher Scoring iterations: 5
Model comparison
> dat.lrh <- glm(gender~h, family="binomial", data=dat)
> summary(dat.lrh)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -41.01836    9.19792  -4.460 8.21e-06 ***
h             0.24958    0.05603   4.455 8.41e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 96.983  on 69  degrees of freedom
Residual deviance: 56.410  on 68  degrees of freedom
AIC: 60.41
Number of Fisher Scoring iterations: 5
$\chi^2$ test and LOGISTIC REGRESSION
Testing the association between smoking and health (expected counts in parentheses)
                      Smoker    Non-smoker   Total   Smoking rate
Lung cancer patients  52 (40)    8 (20)        60       0.87
Healthy               48 (60)   42 (30)        90       0.53
Total                 100       50            150       0.67
> M <- matrix(c(52,48,8,42), nrow=2)
> rownames(M) <- c("present", "absent")
> colnames(M) <- c("smoker", "non-smoker")
> M
        smoker non-smoker
present     52          8
absent      48         42
> chisq.test(M)
        Pearson's Chi-squared test with Yates' continuity correction
data:  M
X-squared = 16.5312, df = 1, p-value = 4.785e-05
$\chi^2$ test and LOGISTIC REGRESSION
> dat <- as.data.frame(as.table(M))
> colnames(dat) <- c("cancer","smoking","freq")
> dat <- dat[rep(1:nrow(dat), dat$freq), 1:2]   # one row per person
> rownames(dat) <- c()
> # cancer is a factor with levels "present", "absent"; glm treats the first level as failure, so this models P(absent)
> dat.glm <- glm(cancer~smoking, family=binomial, data=dat)
> summary(dat.glm)
Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -0.08004    0.20016    -0.4    0.689
smokingnon-smoker  1.73827    0.43460     4.0 6.34e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Null deviance: 201.90  on 149  degrees of freedom
Residual deviance: 182.44  on 148  degrees of freedom
AIC: 186.44
$\chi^2$ test and LOGISTIC REGRESSION
> anova(dat.glm, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: cancer
Terms added sequentially (first to last)
        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                      149     201.90
smoking  1   19.467       148     182.44 1.023e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> chisq.test(M)
data:  M
X-squared = 16.5312, df = 1, p-value = 4.785e-05
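The two tests give similar but not identical statistics because they measure the discrepancy between observed and expected counts differently. A sketch, assuming only the 2 x 2 counts from the table: the Pearson statistic (with and without Yates' correction) and the likelihood-ratio statistic G2 are computed by hand. E, X2, X2.yates, and G2 are illustrative names.

```r
# Observed counts from the smoking table
M <- matrix(c(52, 48, 8, 42), nrow = 2,
            dimnames = list(cancer  = c("present", "absent"),
                            smoking = c("smoker", "non-smoker")))

# Expected counts under independence: row total * column total / grand total
E <- outer(rowSums(M), colSums(M)) / sum(M)

X2       <- sum((M - E)^2 / E)               # Pearson chi-squared, no correction
X2.yates <- sum((abs(M - E) - 0.5)^2 / E)    # with Yates' continuity correction
G2       <- 2 * sum(M * log(M / E))          # likelihood-ratio (deviance) statistic

c(X2 = X2, X2.yates = X2.yates, G2 = G2)
pchisq(c(X2.yates, G2), df = 1, lower.tail = FALSE)   # p-values on 1 df
```

X2.yates reproduces the X-squared value printed by chisq.test(M), and G2 reproduces the Deviance entry for smoking in the analysis of deviance table.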
Example: newborn survival, cross table
> dat <- read.csv("http://www.matsuka.info/data_folder/cda7-16.csv")
> table(dat)
, , NdaysGESTATION = <=260, survival = no
     Ncigarettes
age     <5   5+
  <30   50    9
  30+   41    4
, , NdaysGESTATION = >260, survival = no
     Ncigarettes
age     <5   5+
  <30   24    6
  30+   14    1
, , NdaysGESTATION = <=260, survival = yes
     Ncigarettes
age     <5   5+
  <30  315   40
  30+  147   11
, , NdaysGESTATION = >260, survival = yes
     Ncigarettes
age     <5   5+
  <30 4012  459
  30+ 1594  124
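Example: newborn survival, variables
  age: age (under 30, or 30 and over)
  Ncigarettes: number of cigarettes smoked (fewer than 5, or 5 or more)
  NdaysGESTATION: length of gestation (260 days or fewer, or 261 days or more)
  survival: survival (yes or no)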
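If the CSV file cannot be downloaded, the same models can be fitted from the 16 cell counts of the cross table by passing the counts as frequency weights. This is a sketch: the data frame counts, the column freq, and the object fit.gest are names introduced here, and the counts are typed in from the table.

```r
# One row per cell of the 2 x 2 x 2 x 2 table; expand.grid varies the first factor fastest
counts <- expand.grid(
  age            = c("<30", "30+"),
  Ncigarettes    = c("<5", "5+"),
  NdaysGESTATION = c("<=260", ">260"),
  survival       = c("no", "yes")
)
counts$freq <- c(50, 41, 9, 4,            # <=260 days, survival = no
                 24, 14, 6, 1,            # >260 days,  survival = no
                 315, 147, 40, 11,        # <=260 days, survival = yes
                 4012, 1594, 459, 124)    # >260 days,  survival = yes
sum(counts$freq)                          # total number of newborns

# Frequency-weighted logistic regression; "no" is the first level, so P(yes) is modeled
fit.gest <- glm(survival ~ NdaysGESTATION, family = binomial,
                weights = freq, data = counts)
round(coef(fit.gest), 4)
```

The coefficients and deviances agree with a fit to the one-row-per-newborn data, but the residual degrees of freedom reported by summary() refer to the 16 cells rather than to the individual newborns.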
LOGISTIC regression model
> dat.glm <- glm(survival~age, family=binomial, data=dat)
> summary(dat.glm)
Call:
glm(formula = survival ~ age, family = binomial, data = dat)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8325   0.1912   0.1912   0.2509   0.2509
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.9931     0.1070  37.333  < 2e-16 ***
age30+       -0.5506     0.1692  -3.253  0.00114 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1435.5  on 6850  degrees of freedom
Residual deviance: 1425.4  on 6849  degrees of freedom
AIC: 1429.4
$P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{age})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{age})}$
LOGISTIC regression model
> dat.glm2 <- glm(survival~Ncigarettes, family=binomial, data=dat)
> summary(dat.glm2)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    3.85097    0.08897  43.283   <2e-16 ***
Ncigarettes5+ -0.39466    0.24391  -1.618    0.106
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1435.5  on 6850  degrees of freedom
Residual deviance: 1433.2  on 6849  degrees of freedom
AIC: 1437.2
$P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{Ncigar})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{Ncigar})}$
LOGISTIC regression model
> dat.glm3 <- glm(survival~NdaysGESTATION, family=binomial, data=dat)
> summary(dat.glm3)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1404   0.1204   0.1204   0.1204   0.6076
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)          1.5959     0.1075   14.84   <2e-16 ***
NdaysGESTATION>260   3.3280     0.1842   18.06   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1435.5  on 6850  degrees of freedom
Residual deviance: 1093.2  on 6849  degrees of freedom
AIC: 1097.2
$P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{Ndays})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{Ndays})}$
LOGISTIC regression model
> dat.glmAllAdd = glm(survival~age+Ncigarettes+NdaysGESTATION, family=binomial, data=dat)
> summary(dat.glmAllAdd)
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)          1.8139     0.1351  13.430  < 2e-16 ***
age30+              -0.4675     0.1803  -2.592  0.00954 **
Ncigarettes5+       -0.4228     0.2624  -1.611  0.10710
NdaysGESTATION>260   3.3098     0.1846  17.929  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1435.5  on 6850  degrees of freedom
Residual deviance: 1084.8  on 6847  degrees of freedom
AIC: 1092.8
Number of Fisher Scoring iterations: 7
$P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays})}$
LOGISTIC regression model
> dat.glmAllMult = glm(survival~age*Ncigarettes*NdaysGESTATION, family=binomial, data=dat)
> summary(dat.glmAllMult)
Coefficients:
                                        Estimate Std. Error z value Pr(>|z|)
(Intercept)                              1.84055    0.15223  12.090   <2e-16 ***
age30+                                  -0.56369    0.23317  -2.418   0.0156 *
Ncigarettes5+                           -0.34889    0.39911  -0.874   0.3820
NdaysGESTATION>260                       3.27844    0.25508  12.853   <2e-16 ***
age30+:Ncigarettes5+                     0.08364    0.72896   0.115   0.9087
age30+:NdaysGESTATION>260                0.17964    0.41026   0.438   0.6615
Ncigarettes5+:NdaysGESTATION>260        -0.43281    0.60829  -0.712   0.4768
age30+:Ncigarettes5+:NdaysGESTATION>260  0.78340    1.34987   0.580   0.5617
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1435.5  on 6850  degrees of freedom
Residual deviance: 1083.3  on 6843  degrees of freedom
AIC: 1099.3
$P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays} + \beta_4\,\mathrm{age}\times\mathrm{Ncigar} + \dots)}{1 + \exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays} + \beta_4\,\mathrm{age}\times\mathrm{Ncigar} + \dots)}$
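Model selection
Candidate logistic regression models:
Model01 (AIC=1429.4): $P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{age})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{age})}$
Model02 (AIC=1437.2): $P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{Ncigar})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{Ncigar})}$
Model03 (AIC=1097.2): $P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{Ndays})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{Ndays})}$
ModelAllAdd (AIC=1092.8): $P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays})}{1 + \exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays})}$
ModelAllMult (AIC=1099.3): $P(s) = \frac{\exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays} + \beta_4\,\mathrm{age}\times\mathrm{Ncigar} + \dots)}{1 + \exp(\beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{Ncigar} + \beta_3\,\mathrm{Ndays} + \beta_4\,\mathrm{age}\times\mathrm{Ncigar} + \dots)}$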
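The comparison across the five candidates amounts to finding the smallest AIC. This is a small sketch using only the AIC values reported above; the vector name aic and the difference from the best model are added here for convenience.

```r
# AIC values copied from the five model summaries
aic <- c(Model01      = 1429.4,
         Model02      = 1437.2,
         Model03      = 1097.2,
         ModelAllAdd  = 1092.8,
         ModelAllMult = 1099.3)

names(which.min(aic))      # model with the smallest AIC
sort(aic - min(aic))       # difference in AIC from the best model
```

With the fitted objects in the workspace, AIC(dat.glm, dat.glm2, dat.glm3, dat.glmAllAdd, dat.glmAllMult) returns the same comparison as a table.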
LOGISTIC regression model
> library(MASS)
> stepAIC(dat.glmAllMult)
Start:  AIC=1099.27
survival ~ age * Ncigarettes * NdaysGESTATION
                                 Df Deviance    AIC
- age:Ncigarettes:NdaysGESTATION  1   1083.6 1097.6
<none>                                1083.3 1099.3
Step:  AIC=1097.63
survival ~ age + Ncigarettes + NdaysGESTATION + age:Ncigarettes + age:NdaysGESTATION + Ncigarettes:NdaysGESTATION
                             Df Deviance    AIC
- Ncigarettes:NdaysGESTATION  1   1083.9 1095.9
- age:Ncigarettes             1   1084.0 1096.0
- age:NdaysGESTATION          1   1084.1 1096.1
<none>                            1083.6 1097.6
Step:  AIC=1095.86
survival ~ age + Ncigarettes + NdaysGESTATION + age:Ncigarettes + age:NdaysGESTATION
                     Df Deviance    AIC
- age:Ncigarettes     1   1084.2 1094.2
- age:NdaysGESTATION  1   1084.4 1094.4
<none>                    1083.9 1095.9
LOGISTIC regression model
> stepAIC(dat.glmAllMult)
....
Step:  AIC=1094.22
survival ~ age + Ncigarettes + NdaysGESTATION + age:NdaysGESTATION
                     Df Deviance    AIC
- age:NdaysGESTATION  1   1084.8 1092.8
<none>                    1084.2 1094.2
- Ncigarettes         1   1086.7 1094.7
Step:  AIC=1092.76
survival ~ age + Ncigarettes + NdaysGESTATION
                 Df Deviance    AIC
<none>                1084.8 1092.8
- Ncigarettes     1   1087.2 1093.2
- age             1   1091.3 1097.3
- NdaysGESTATION  1   1422.4 1428.4
Call:  glm(formula = survival ~ age + Ncigarettes + NdaysGESTATION, family = binomial, data = dat)
Coefficients:
       (Intercept)              age30+       Ncigarettes5+  NdaysGESTATION>260
            1.8139             -0.4675             -0.4228              3.3098
Degrees of Freedom: 6850 Total (i.e. Null);  6847 Residual
Null Deviance:      1436
Residual Deviance: 1085         AIC: 1093
When the data are non-negative
Candidate regression models
Model that predicts a count: count = b0 + b1*X1
count = link(b0 + b1*X1)
POISSON REGRESSION
$E[y|x] = \lambda(x) = \exp(\beta_0 + \beta_1 x) \leftrightarrow \log(\lambda) = \beta_0 + \beta_1 x$
$P(y) = \frac{\lambda^{y} \exp(-\lambda)}{y!}$
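The Poisson probability formula can be compared with R's built-in dpois(). This is a minimal sketch in which the value of lambda and the range of counts are arbitrary choices for illustration.

```r
lambda <- 2.5      # hypothetical mean count
y      <- 0:10     # counts to evaluate

# Poisson probability written out: lambda^y * exp(-lambda) / y!
p.manual <- lambda^y * exp(-lambda) / factorial(y)
p.dpois  <- dpois(y, lambda)

all.equal(p.manual, p.dpois)   # the formula matches the built-in density
sum(dpois(0:100, lambda))      # probabilities sum to 1 (up to rounding)

# The log link keeps the mean positive for any linear predictor
exp(c(-5, 0, 5))
```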
Example: Poisson regression
> dat.pr <- glm(eye.count ~ attr, family = poisson, data = dat)
> summary(dat.pr)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.041779   0.065525   0.638    0.524
attr        0.207579   0.009156  22.672   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 652.310  on 499  degrees of freedom
Residual deviance:  84.548  on 498  degrees of freedom
AIC: 1565.4
Number of Fisher Scoring iterations: 4
Distribution of the response variable: Poisson
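Interpreting the POISSON REGRESSION
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.041779   0.065525   0.638    0.524
attr        0.207579   0.009156  22.672   <2e-16 ***
$\lambda = \exp(0.042 + 0.208 \times \mathrm{attr})$
For each one-unit increase in attractiveness, the expected number of looks at the eyes is multiplied by exp(0.208) = 1.231.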
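The multiplicative interpretation can be checked from the printed coefficients. In this sketch b0, b1, and attr.vals are illustrative names, and the attractiveness values 0 to 5 are arbitrary because the actual range of attr is not shown.

```r
# Coefficients copied from the Poisson regression summary
b0 <- 0.041779   # intercept
b1 <- 0.207579   # slope for attr

attr.vals <- 0:5                          # hypothetical attractiveness values
lambda    <- exp(b0 + b1 * attr.vals)     # expected eye-fixation counts
round(lambda, 3)

# Ratio of expected counts for each one-unit step in attr
lambda[-1] / lambda[-length(lambda)]
exp(b1)                                   # the same constant factor
```

As with the odds in logistic regression, the ratio is constant across the range of the predictor, while the absolute increase in the expected count grows with attr.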