對表17-1資料中的產(chǎn)前護理量與接受護理地點和嬰兒存活情況進行卡方檢驗,如果不考慮接受護理地點的影響,對產(chǎn)前護理量和嬰兒存活情況分別合并后得到(Χ2=5.255,P=0.022),即產(chǎn)前護理量與嬰兒存活情況有關聯(lián):
# 不考慮護理地點的影響,就是2*2列聯(lián)表
M <- matrix(c(20,373,6,316),nrow = 2, byrow = T,
dimnames = list(`產(chǎn)前護理量`=c("少","多"),
`存活情況`=c("死","活")))
M
## 存活情況
## 產(chǎn)前護理量 死 活
## 少 20 373
## 多 6 316
# 進行卡方檢驗
chisq.test(M,correct = F)
##
## Pearson's Chi-squared test
##
## data: M
## X-squared = 5.2555, df = 1, p-value = 0.02188
如果考慮接受護理地點的影響,分別對兩個診所進行假設檢驗時卻發(fā)現(xiàn)產(chǎn)前護理量與嬰兒存活情況是沒有關聯(lián)的,即基于接受護理地點的條件下產(chǎn)前護理量與嬰兒存活情況沒有關聯(lián):
# 診所甲
M1 <- matrix(c(3,176,4,293),nrow = 2, byrow = T,
dimnames = list(`產(chǎn)前護理量`=c("少","多"),
`存活情況`=c("死","活")))
M1
## 存活情況
## 產(chǎn)前護理量 死 活
## 少 3 176
## 多 4 293
# 進行卡方檢驗
chisq.test(M1,correct = F)
## Warning in chisq.test(M1, correct = F): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: M1
## X-squared = 0.083522, df = 1, p-value = 0.7726
# 診所乙
M2 <- matrix(c(17,197,2,23),nrow = 2, byrow = T,
dimnames = list(`產(chǎn)前護理量`=c("少","多"),
`存活情況`=c("死","活")))
M2
## 存活情況
## 產(chǎn)前護理量 死 活
## 少 17 197
## 多 2 23
# 進行卡方檢驗
chisq.test(M2,correct = F)
## Warning in chisq.test(M2, correct = F): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: M2
## X-squared = 9.6186e-05, df = 1, p-value = 0.9922
這是因為產(chǎn)前護理量和接受護理地點之間不是獨立的(Χ2=173.372,P<0.001),這個例子說明由于接受護理量不是一個獨立的變量,故對二維以上的列聯(lián)表資料不能通過簡單合并的方法進行卡方檢驗。
# 產(chǎn)前護理量和護理地點之間的獨立性檢驗
M3 <- matrix(c(179,297,214,25),nrow = 2, byrow = T,
dimnames = list(`產(chǎn)前護理量`=c("少","多"),
`護理地點`=c("甲","乙")))
M3
## 護理地點
## 產(chǎn)前護理量 甲 乙
## 少 179 297
## 多 214 25
# 進行卡方檢驗
chisq.test(M3,correct = F)
##
## Pearson's Chi-squared test
##
## data: M3
## X-squared = 173.37, df = 1, p-value < 2.2e-16
在列聯(lián)表17-1中有8個格子,總頻數(shù)715分布于這8個格子中,假設在列聯(lián)表資料中,這種分布是隨機變量,那么在總樣本量、行合計、列合計、層合計均固定時,我們可以通過這種多項分布建立模型,描述變量之間的關系。這就是我們要著重介紹的列聯(lián)表對數(shù)線性模型,將每個格子的期望頻數(shù)的自然對數(shù)與分類變量之間建立線性聯(lián)系,分析分類變量之間的關系。具體統(tǒng)計分析過程包括建立對數(shù)線性模型、擬合優(yōu)度檢驗及參數(shù)估計。
對數(shù)線性模型的概念
對數(shù)線性模型為層次模型,如果模型中包含了某幾個變量的高階交互效應項時,這幾個變量的低階交互效應項與主效應項也一定包含在模型中。對數(shù)線性模型的建立一般以飽和模型(saturated model))開始,飽和模型包含了所有變量的主效應、低階交互效應和高階交互效應項。通過后退法逐漸排除沒有統(tǒng)計學意義的作用項,最后擬合最優(yōu)的簡化模型即不飽和模型(unsaturated model)。
2*2列聯(lián)表
孫振球醫(yī)學統(tǒng)計學第4版例17-2。一項蒙古族高血壓危險因素的研究中得到性別和血壓的關系數(shù)據(jù),此數(shù)據(jù)由分類變量X(性別)、Y(是否高血壓)構成的2×2二維列聯(lián)表。試分析蒙古族高血壓危險因素研究中性別和血壓的關系。
data17_2 <- haven::read_sav("datasets/例17-02.sav",encoding = "GBK")
#str(data17_2)
data17_2 <- haven::as_factor(data17_2)
str(data17_2)
## tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
## $ 性別: Factor w/ 2 levels "男性","女性": 1 1 2 2
## $ 血壓: Factor w/ 2 levels "正常血壓","高血壓": 1 2 1 2
## $ 頻數(shù): num [1:4] 579 485 1032 483
## ..- attr(*, "format.spss")= chr "F8.0"
data17_2
## # A tibble: 4 × 3
## 性別 血壓 頻數(shù)
## <fct> <fct> <dbl>
## 1 男性 正常血壓 579
## 2 男性 高血壓 485
## 3 女性 正常血壓 1032
## 4 女性 高血壓 483
先改變一下格式,變成矩陣:
M <- matrix(data17_2$頻數(shù),nrow = 2,byrow = T,
dimnames = list(trt = c("男性", "女性"),
effect = c("正常","高血壓")))
M
## effect
## trt 正常 高血壓
## 男性 579 485
## 女性 1032 483
這個其實是2維列聯(lián)表,你用卡方檢驗也可以的:
chisq.test(M,correct = F)
##
## Pearson's Chi-squared test
##
## data: M
## X-squared = 50.046, df = 1, p-value = 1.502e-12
從卡方檢驗的結果來看性別和血壓是獨立的。
下面做對數(shù)線性模型,直接使用R語言自帶的loglin
函數(shù):
fm <- loglin(M, margin=list(1,2), fit=T, param=T)
## 2 iterations: deviation 0
fm$lrt # 查看似然比G^2^
## [1] 49.84297
fm$pearson # 查看卡方值
## [1] 50.04632
1 - pchisq(fm$lrt, fm$df) # 計算似然比G^2^的P值
## [1] 1.665557e-12
1 - pchisq(fm$pearson, fm$df) # 計算卡方的P值
## [1] 1.501577e-12
似然比統(tǒng)計量G2和pearson-x2都和課本一樣,P值小于0.0001。
也可以使用MASS
包中的loglm
擬合對數(shù)線性模型,可以使用公式的形式,結果也更加簡潔易懂:
library(MASS)
# 不飽和模型(沒有交互項)
f <- loglm(`頻數(shù)` ~ `性別` + `血壓`, data = data17_2)
f
## Call:
## loglm(formula = 頻數(shù) ~ 性別 + 血壓, data = data17_2)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 49.84297 1 1.665557e-12
## Pearson 50.04632 1 1.501577e-12
如果要對飽和模型進行擬合優(yōu)度檢驗,可以使用以下代碼:
f1 <- update(f, ~ .^2) # 直接更新模型
anova(f, f1) # 比較飽和模型和不飽和模型
## LR tests for hierarchical log-linear models
##
## Model 1:
## 頻數(shù) ~ 性別 + 血壓
## Model 2:
## 頻數(shù) ~ 性別 + 血壓 + 性別:血壓
##
## Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1 49.84297 1
## Model 2 0.00000 0 49.84297 1 0
## Saturated 0.00000 0 0.00000 0 1
當然你不更新模型,直接擬合一個飽和模型也是可以的:
# 或者重新擬合一個飽和模型
f2 <- loglm(`頻數(shù)` ~ `性別` * `血壓`, data = data17_2)
anova(f,f2) # 比較飽和模型和不飽和模型
## LR tests for hierarchical log-linear models
##
## Model 1:
## 頻數(shù) ~ 性別 + 血壓
## Model 2:
## 頻數(shù) ~ 性別 * 血壓
##
## Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1 49.84297 1
## Model 2 0.00000 0 49.84297 1 0
## Saturated 0.00000 0 0.00000 0 1
結果和課本是一致的。G2=49.843,P值<0.0001,說明飽和模型和簡化模型(沒有交互項的模型)有顯著性差異,簡化模型不能取代飽和模型,確定最終模型為飽和模型。
如何計算OR值?可以使用glm
函數(shù)擬合模型,然后根據(jù)系數(shù)計算:
f3 <- glm(`頻數(shù)` ~ `性別` * `血壓`, data = data17_2, family = poisson())
#f3
coef(f3) # 查看系數(shù)
## (Intercept) 性別女性 血壓高血壓 性別女性:血壓高血壓
## 6.3613025 0.5779515 -0.1771536 -0.5820837
exp(-0.5820837) # 計算OR值
## [1] 0.5587329
除此之外,還可以使用Crosstabs.Loglinear
包中的LOGLINEAR
函數(shù)擬合對數(shù)線性模型,給出的信息超級全面,而且格式和SPSS類似,更容易觀看:
library(Crosstabs.Loglinear)
LOGLINEAR(data = data17_2,
data_type = "counts",
variables = c("性別","血壓"),
Freq = "頻數(shù)")
# 給出的結果,太全面了!
The input data:
血壓
性別 正常血壓 高血壓
男性 579 485
女性 1032 483
K - Way and Higher-Order Effects
K df LR Chi-Square p Pearson Chi-Square p AIC
1 3 291.135 0 319.455 0 326.153
2 1 49.843 0 50.046 0 88.860
0 0 0.000 1 0.000 1 41.017
These are tests that K - Way and Higher-Order Effects are zero, i.e., tests
of the hypothesis that Kth-order and higher interactions are zero
If these effects and all higher order effects are removed from the model,
then here are the consequences.
The df values indicate the number of effects (model terms) that are removed.
The first row, labeled as 1, shows the consequences of removing all of the main
effects and all higher order effects (i.e., everything) from the model. This
usually results in poor fit. A statistically significant chi-square indicates
that the prediction of the cell frequencies is significantly worse than the
prediction that is provided by the saturated model. It would suggest that at
least one of the removed effects needs to be included in the model.
The second row, labeled as 2, shows the consequences of removing all of the
two-way and higher order effects from the model, while keeping the main effects.
A statistically significant chi-square indicates a reduction in prediction success
compared to the saturated model and that at least one of the removed effects needs
to be included in the model.
The same interpretation process applies if there is a K = 3 row, and so on.
A K = 3 row in the table would show the consequences of removing all of the
three-way and higher order effects from the model, while keeping the two-way
interactions and main effects.
A nonsignificant chi-square for a row would indicate that removing the
model term(s) does not significantly worsen the prediction of the cell
frequencies and the term(s) is nonessential and can be dropped from the model.
The bottom row in the table, labeled as 0, is for the saturated mode. It
includes all possible model terms and therefore provides perfect prediction
of the cell frequencies. The AIC values for this model can be helpful in
gaging the relative fit of models with fewer terms.
K-Way Effects
K df LR Chi-Square p Pearson Chi-Square p AIC diff.
1 2 241.292 0 269.409 0 237.292
2 1 49.843 0 50.046 0 47.843
These are tests that the K - Way Effects are zero, i.e., tests whether
interactions of a particular order are zero. The tests are for model
comparisons/differences. For each K-way test, a model is fit with and then
without the interactions and the change/difference in chi-square and
likelihood ratio chi-square values are computed.
For example, the K = 1 test is for the comparison of the model with
all main effects and the intercept with the model with only the intercept.
A statistically significant K = 1 test is (conventionally) considered to
mean that the main effects are not zero and that they are needed in the model.
The K = 2 test is for the comparison of the model with all two-way
interactions, all main effects, and the intercept with the model with
the main effects, and the intercept. A statistically significant K = 2 test
is (conventionally) considered to mean that the two-way interactions are
not zero and that they are needed in the model.
The K = 3 test (if there is one) is for the comparison of the model
with all three-way interactions, all two-way interactions, all main
effects, and the intercept with the model with all two-way interactions,
all main effects, and the intercept. A statistically significant K = 3 test
is (conventionally) considered to mean that the three-way interactions
are not zero and that they are needed in the model, and so on.
The df values for the model comparisons are the df values associated
with the K-way terms.
The above "K - Way and Higher-Order Effects" and "K - Way" tests are for the
ncollective importance of the effects at each value of K. There are not tests
nof individual terms. For example, a significant K = 2 test means that the set
nof two-way terms is important, but it does not mean that every two-way term is
significant.
Partial Associations:
Effect LR.Chi.Square df p AIC.diff.
1
2 性別 79.275 1 0 77.275
3 血壓 162.017 1 0 160.017
These are tests of individual terms in the model, with the restriction that
higher-order terms at each step are excluded. The tests are for differences
between models. For example, the tests of 2-way interactions are for the
differences between the model with all 2-way interactions (and no higher-order
interactions) and the model when each individual 2-way interaction is removed in turn.
Parameter Estimates (SPSS "Model Selection", not "General", Parameter Estimates):
For saturated models, .500 has been added to all observed cells:
Estimate Std. Error z value p CI_lb CI_ub
(Intercept) 6.417 0.021 310.767 0 6.377 6.458
性別1 -0.143 0.021 -6.943 0 -0.184 -0.103
血壓1 0.234 0.021 11.328 0 0.193 0.274
性別1:血壓1 -0.145 0.021 -7.043 0 -0.186 -0.105
Backward Elimination Statistics:
Step GenDel Effects LR_Chi_Square df p AIC
0 Generating Class 性別:血壓 0 0 1 41.017
Deleted Effect 性別:血壓 49.843 1 0 88.86
The hierarchical backward elimination procedure begins with all possible
terms in the model and then removes, one at a time, terms that do not
satisfy the criteria for remaining in the model.
A term is dropped only when it is determined that removing the term does
not result in a reduction in model fit AND if the term is not involved in any
higher order interaction. On each Step above, the focus is on the term that results
in the least-significant change in the likelihood ratio chi-squre if removed.
If the change is not significant, then the term is removed.
The Final Model Formula:
Freq ~ 性別 + 血壓 + 性別:血壓
The Final Model Goodness-of-Fit Tests:
df LR Chi-Square p Pearson Chi-Square p AIC
0 0 1 0 0 41.017
Generalized Linear Model Coefficients for the Final Model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.361 0.042 153.068 0.000
性別女性 0.578 0.052 11.131 0.000
血壓高血壓 -0.177 0.062 -2.878 0.004
性別女性:血壓高血壓 -0.582 0.083 -7.044 0.000
Cell Counts and Residuals:
性別 血壓 Obsd. Freq. Exp. Freq. Residuals Std. Resid. Adjusted Resid.
1 男性 正常血壓 579 579 0 0 0
3 男性 高血壓 485 485 0 0 0
2 女性 正常血壓 1032 1032 0 0 0
4 女性 高血壓 483 483 0 0 0
R*C表
孫振球醫(yī)學統(tǒng)計學第4版例17-3。比較3種方劑治療胃潰瘍的效果,將200名病情類似的患者隨機分到3個治療組,療效如下。試分析3個方劑的治療效果有無差別。
data17_3 <- haven::read_sav("datasets/例17-03.sav",encoding = "GBK")
data17_3 <- haven::as_factor(data17_3)
str(data17_3)
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
## $ 治療方法: Factor w/ 3 levels "甲方劑","乙方劑",..: 1 1 2 2 3 3
## $ 治療效果: Factor w/ 2 levels "有效","無效": 1 2 1 2 1 2
## $ 頻數(shù) : num [1:6] 42 18 38 27 56 19
## ..- attr(*, "format.spss")= chr "F8.0"
data17_3
## # A tibble: 6 × 3
## 治療方法 治療效果 頻數(shù)
## <fct> <fct> <dbl>
## 1 甲方劑 有效 42
## 2 甲方劑 無效 18
## 3 乙方劑 有效 38
## 4 乙方劑 無效 27
## 5 丙方劑 有效 56
## 6 丙方劑 無效 19
直接使用loglm
函數(shù)擬合對數(shù)線性模型:
# 擬合不飽和模型
f <- loglm(`頻數(shù)` ~ `治療方法`+`治療效果`,data = data17_3)
f
## Call:
## loglm(formula = 頻數(shù) ~ 治療方法 + 治療效果, data = data17_3)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 4.310314 2 0.1158850
## Pearson 4.359917 2 0.1130462
進行擬合優(yōu)度檢驗:
f1 <- update(f, ~ .^2) # 飽和模型
anova(f,f1) # 擬合優(yōu)度檢驗
## LR tests for hierarchical log-linear models
##
## Model 1:
## 頻數(shù) ~ 治療方法 + 治療效果
## Model 2:
## 頻數(shù) ~ 治療方法 + 治療效果 + 治療方法:治療效果
##
## Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1 4.310314 2
## Model 2 0.000000 0 4.310314 2 0.11588
## Saturated 0.000000 0 0.000000 0 1.00000
這個結果也是和課本一致的,似然比G2=4.31,p值=0.1159,不拒絕H0,簡化模型可以取代飽和模型。
三維列聯(lián)表
孫振球醫(yī)學統(tǒng)計學第4版例17-4。Vandenbroucke等人采用病例對照研究,研究避孕藥與Fcator-V-Leiden等位基因在靜脈血栓發(fā)生中的作用。共調(diào)查324人,其中病例155人,對照169人。試對避孕藥與基因的交互作用進行分析。
data17_4 <- haven::read_sav("datasets/例17-04.sav",encoding = "GBK")
data17_4 <- haven::as_factor(data17_4)
str(data17_4)
## tibble [8 × 4] (S3: tbl_df/tbl/data.frame)
## $ 人群分組 : Factor w/ 2 levels "病例組","對照組": 1 1 1 1 2 2 2 2
## $ 口服避孕藥暴露水平: Factor w/ 2 levels "暴露","未暴露": 1 1 2 2 1 1 2 2
## $ 基因型 : Factor w/ 2 levels "突變型","野生型": 1 2 1 2 1 2 1 2
## $ 頻數(shù) : num [1:8] 25 84 10 36 2 63 4 100
## ..- attr(*, "format.spss")= chr "F8.0"
data17_4
## # A tibble: 8 × 4
## 人群分組 口服避孕藥暴露水平 基因型 頻數(shù)
## <fct> <fct> <fct> <dbl>
## 1 病例組 暴露 突變型 25
## 2 病例組 暴露 野生型 84
## 3 病例組 未暴露 突變型 10
## 4 病例組 未暴露 野生型 36
## 5 對照組 暴露 突變型 2
## 6 對照組 暴露 野生型 63
## 7 對照組 未暴露 突變型 4
## 8 對照組 未暴露 野生型 100
下面就是建立飽和模型和多個不飽和模型,書中介紹了4種不飽和模型:
然后借助SAS對飽和模型進行逐步篩選,并根據(jù)BIC和G2選擇最終模型。
我們先建立飽和模型和幾個不飽和模型(并沒有完全按照書中來,太麻煩了,不嫌麻煩的可以自己逐個列出來)。
library(MASS)
# 無交互項,只有主效應,完全獨立模型
f1 <- loglm(`頻數(shù)` ~ `人群分組`+`口服避孕藥暴露水平`+`基因型`,data = data17_4)
# 添加1階交互效應,即只有兩個變量的交互,沒有3變量交互
f2 <- update(f1, ~ .^2)
# 添加2階交互效應,即飽和模型,既有兩變量交互,又有3變量交互
f3 <- update(f1, ~ .^3)
下面使用逐步回歸法對飽和模型進行篩選,使用的指標的AIC(AIC和BIC類似,也是值越小說明模型擬合效果越好):
f <- step(f3, direction = "both")
## Start: AIC=16
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 +
## 人群分組:口服避孕藥暴露水平 + 人群分組:基因型 +
## 口服避孕藥暴露水平:基因型 + 人群分組:口服避孕藥暴露水平:基因型
##
## Df AIC
## - 人群分組:口服避孕藥暴露水平:基因型 1 14.096
## <none> 16.000
##
## Step: AIC=14.1
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 +
## 人群分組:口服避孕藥暴露水平 + 人群分組:基因型 +
## 口服避孕藥暴露水平:基因型
##
## Df AIC
## - 口服避孕藥暴露水平:基因型 1 12.097
## <none> 14.096
## + 人群分組:口服避孕藥暴露水平:基因型 1 16.000
## - 人群分組:基因型 1 37.909
## - 人群分組:口服避孕藥暴露水平 1 42.920
##
## Step: AIC=12.1
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 +
## 人群分組:口服避孕藥暴露水平 + 人群分組:基因型
##
## Df AIC
## <none> 12.097
## + 口服避孕藥暴露水平:基因型 1 14.096
## - 人群分組:基因型 1 38.751
## - 人群分組:口服避孕藥暴露水平 1 43.762
f # AIC最小的模型
## Call:
## loglm(formula = 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 +
## 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型,
## data = data17_4, evaluate = FALSE)
##
## Statistics:
## X^2 df P(> X^2)
## Likelihood Ratio 0.09702652 2 0.9526447
## Pearson 0.09561789 2 0.9533159
結果得到的最終模型和書中是一樣的:
頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型
可以再比較一下飽和模型和AIC最小的這個模型:
anova(f3,f)
## LR tests for hierarchical log-linear models
##
## Model 1:
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型
## Model 2:
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型 + 口服避孕藥暴露水平:基因型 + 人群分組:口服避孕藥暴露水平:基因型
##
## Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1 0.09702652 2
## Model 2 0.00000000 0 0.09702652 2 0.95264
## Saturated 0.00000000 0 0.00000000 0 1.00000
結果顯示似然比G2=0.097,P值=0.095(就是上面f
的結果),簡化模型可以取代飽和模型。