【原】R語言對數(shù)線性模型

阿越就是我 2025-03-17 發(fā)布于上海

展開全文

??專注R語言在??生物醫(yī)學中的使用

在一些生物醫(yī)學研究中，很多時候反應變量不符合正態(tài)分布的假設，或者反應變量為屬性變量或離散型變量，往往需要借助于廣義線性模型（generalized linear model）的分析方法。廣義線性模型是一般線性模型的直接推廣，在廣義線性模型中，因變量可以是連續(xù)型的資料也可以是離散型資料，如正態(tài)分布、二項分布、泊松分布等。在廣義線性模型中因變量的總體均值經(jīng)過函數(shù)轉換與自變量的線性預測值建立聯(lián)系。因變量總體均值的轉換函數(shù)稱作連接函數(shù)（link function），連接函數(shù)為單調(diào)可微（連續(xù)且充分光滑）的函數(shù)。常用的廣義線性模型有對數(shù)線性模型、logistic回歸模型、Probit回歸模型、Poisson回歸模型、負二項回歸模型等。本章著重介紹對數(shù)線性模型。

對數(shù)線性模型（log-linear model）多用于列聯(lián)表資料的分析。是將列聯(lián)表資料中各個格子理論頻數(shù)的自然對數(shù)表示為分類變量以及分類變量之間的交互作用的線性模型。在對數(shù)線性模型中，連接函數(shù)為自然對數(shù)，要對反應變量的總體均數(shù)（理論頻數(shù)）進行自然對數(shù)轉換。應用對數(shù)線性模型對列聯(lián)表資料進行分析時，所有變量均為因變量，即不區(qū)分因變量和自變量，模型的假設檢驗通過分析列聯(lián)表單元格的實際頻數(shù)和理論頻數(shù)差異的大小進行推斷，從而分析變量之間的相互聯(lián)系。對數(shù)線性模型是列聯(lián)表資料分析方法的拓展，應用于二維及二維以上多維列聯(lián)表資料的分析，卡方檢驗用于二維列聯(lián)表資料的分析，不能分析二維以上列聯(lián)表中變量的關系。

對數(shù)線性模型和卡方檢驗

孫振球醫(yī)學統(tǒng)計學第4版例17-1。研究產(chǎn)前護理量對嬰兒死亡率的影響，收集了甲乙兩家醫(yī)院的資料，此資料由分類變量X（產(chǎn)前護理量）、Y（嬰兒是否存活）、Z（接受護理地點）構成的2×2×2三維列聯(lián)表，X為行變量、Y為列變量、Z為分層變量。資料見表17-1，試對此資料中3個變量的關系進行分析。

對表17-1資料中的產(chǎn)前護理量與接受護理地點和嬰兒存活情況進行卡方檢驗，如果不考慮接受護理地點的影響，對產(chǎn)前護理量和嬰兒存活情況分別合并后得到（Χ²=5.255，P=0.022），即產(chǎn)前護理量與嬰兒存活情況有關聯(lián)：

# 不考慮護理地點的影響，就是2*2列聯(lián)表
M <- matrix(c(20,373,6,316),nrow = 2, byrow = T,
            dimnames = list(`產(chǎn)前護理量`=c("少","多"),
                            `存活情況`=c("死","活")))
M
##           存活情況
## 產(chǎn)前護理量 死  活
##         少 20 373
##         多  6 316

# 進行卡方檢驗
chisq.test(M,correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  M
## X-squared = 5.2555, df = 1, p-value = 0.02188

如果考慮接受護理地點的影響，分別對兩個診所進行假設檢驗時卻發(fā)現(xiàn)產(chǎn)前護理量與嬰兒存活情況是沒有關聯(lián)的，即基于接受護理地點的條件下產(chǎn)前護理量與嬰兒存活情況沒有關聯(lián)：

# 診所甲
M1 <- matrix(c(3,176,4,293),nrow = 2, byrow = T,
            dimnames = list(`產(chǎn)前護理量`=c("少","多"),
                            `存活情況`=c("死","活")))
M1
##           存活情況
## 產(chǎn)前護理量 死  活
##         少  3 176
##         多  4 293

# 進行卡方檢驗
chisq.test(M1,correct = F)
## Warning in chisq.test(M1, correct = F): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  M1
## X-squared = 0.083522, df = 1, p-value = 0.7726

# 診所乙
M2 <- matrix(c(17,197,2,23),nrow = 2, byrow = T,
            dimnames = list(`產(chǎn)前護理量`=c("少","多"),
                            `存活情況`=c("死","活")))
M2
##           存活情況
## 產(chǎn)前護理量 死  活
##         少 17 197
##         多  2  23

# 進行卡方檢驗
chisq.test(M2,correct = F)
## Warning in chisq.test(M2, correct = F): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  M2
## X-squared = 9.6186e-05, df = 1, p-value = 0.9922

這是因為產(chǎn)前護理量和接受護理地點之間不是獨立的（Χ²=173.372，P<0.001），這個例子說明由于接受護理量不是一個獨立的變量，故對二維以上的列聯(lián)表資料不能通過簡單合并的方法進行卡方檢驗。

# 產(chǎn)前護理量和護理地點之間的獨立性檢驗
M3 <- matrix(c(179,297,214,25),nrow = 2, byrow = T,
            dimnames = list(`產(chǎn)前護理量`=c("少","多"),
                            `護理地點`=c("甲","乙")))
M3
##           護理地點
## 產(chǎn)前護理量  甲  乙
##         少 179 297
##         多 214  25

# 進行卡方檢驗
chisq.test(M3,correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  M3
## X-squared = 173.37, df = 1, p-value < 2.2e-16

在列聯(lián)表17-1中有8個格子，總頻數(shù)715分布于這8個格子中，假設在列聯(lián)表資料中，這種分布是隨機變量，那么在總樣本量、行合計、列合計、層合計均固定時，我們可以通過這種多項分布建立模型，描述變量之間的關系。這就是我們要著重介紹的列聯(lián)表對數(shù)線性模型，將每個格子的期望頻數(shù)的自然對數(shù)與分類變量之間建立線性聯(lián)系，分析分類變量之間的關系。具體統(tǒng)計分析過程包括建立對數(shù)線性模型、擬合優(yōu)度檢驗及參數(shù)估計。

對數(shù)線性模型的概念

對數(shù)線性模型為層次模型，如果模型中包含了某幾個變量的高階交互效應項時，這幾個變量的低階交互效應項與主效應項也一定包含在模型中。對數(shù)線性模型的建立一般以飽和模型（saturated model）)開始，飽和模型包含了所有變量的主效應、低階交互效應和高階交互效應項。通過后退法逐漸排除沒有統(tǒng)計學意義的作用項，最后擬合最優(yōu)的簡化模型即不飽和模型（unsaturated model）。

2*2列聯(lián)表

孫振球醫(yī)學統(tǒng)計學第4版例17-2。一項蒙古族高血壓危險因素的研究中得到性別和血壓的關系數(shù)據(jù)，此數(shù)據(jù)由分類變量X（性別）、Y（是否高血壓）構成的2×2二維列聯(lián)表。試分析蒙古族高血壓危險因素研究中性別和血壓的關系。

data17_2 <- haven::read_sav("datasets/例17-02.sav",encoding = "GBK")
#str(data17_2)
data17_2 <- haven::as_factor(data17_2)
str(data17_2)
## tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
##  $ 性別: Factor w/ 2 levels "男性","女性": 1 1 2 2
##  $ 血壓: Factor w/ 2 levels "正常血壓","高血壓": 1 2 1 2
##  $ 頻數(shù): num [1:4] 579 485 1032 483
##   ..- attr(*, "format.spss")= chr "F8.0"
data17_2
## # A tibble: 4 × 3
##   性別  血壓      頻數(shù)
##   <fct> <fct>    <dbl>
## 1 男性  正常血壓   579
## 2 男性  高血壓     485
## 3 女性  正常血壓  1032
## 4 女性  高血壓     483

先改變一下格式，變成矩陣：

M <- matrix(data17_2$頻數(shù),nrow = 2,byrow = T,
            dimnames = list(trt = c("男性", "女性"),
                            effect = c("正常","高血壓")))

M
##       effect
## trt    正常 高血壓
##   男性  579    485
##   女性 1032    483

這個其實是2維列聯(lián)表，你用卡方檢驗也可以的：

chisq.test(M,correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  M
## X-squared = 50.046, df = 1, p-value = 1.502e-12

從卡方檢驗的結果來看性別和血壓是獨立的。

下面做對數(shù)線性模型，直接使用R語言自帶的loglin函數(shù)：

fm <- loglin(M, margin=list(1,2), fit=T, param=T)
## 2 iterations: deviation 0
fm$lrt # 查看似然比G^2^
## [1] 49.84297
fm$pearson # 查看卡方值
## [1] 50.04632
1 - pchisq(fm$lrt, fm$df) # 計算似然比G^2^的P值
## [1] 1.665557e-12
1 - pchisq(fm$pearson, fm$df) # 計算卡方的P值
## [1] 1.501577e-12

似然比統(tǒng)計量G²和pearson-x²都和課本一樣，P值小于0.0001。

也可以使用MASS包中的loglm擬合對數(shù)線性模型，可以使用公式的形式，結果也更加簡潔易懂：

library(MASS)

# 不飽和模型（沒有交互項）
f <- loglm(`頻數(shù)` ~ `性別` + `血壓`, data = data17_2)
f
## Call:
## loglm(formula = 頻數(shù) ~ 性別 + 血壓, data = data17_2)
## 
## Statistics:
##                       X^2 df     P(> X^2)
## Likelihood Ratio 49.84297  1 1.665557e-12
## Pearson          50.04632  1 1.501577e-12

如果要對飽和模型進行擬合優(yōu)度檢驗，可以使用以下代碼：

f1 <- update(f, ~ .^2) # 直接更新模型
anova(f, f1) # 比較飽和模型和不飽和模型
## LR tests for hierarchical log-linear models
## 
## Model 1:
##  頻數(shù) ~ 性別 + 血壓 
## Model 2:
##  頻數(shù) ~ 性別 + 血壓 + 性別:血壓 
## 
##           Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1   49.84297  1                                    
## Model 2    0.00000  0   49.84297         1              0
## Saturated  0.00000  0    0.00000         0              1

當然你不更新模型，直接擬合一個飽和模型也是可以的：

# 或者重新擬合一個飽和模型
f2 <- loglm(`頻數(shù)` ~ `性別` * `血壓`, data = data17_2)
anova(f,f2) # 比較飽和模型和不飽和模型
## LR tests for hierarchical log-linear models
## 
## Model 1:
##  頻數(shù) ~ 性別 + 血壓 
## Model 2:
##  頻數(shù) ~ 性別 * 血壓 
## 
##           Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1   49.84297  1                                    
## Model 2    0.00000  0   49.84297         1              0
## Saturated  0.00000  0    0.00000         0              1

結果和課本是一致的。G²=49.843，P值<0.0001，說明飽和模型和簡化模型（沒有交互項的模型）有顯著性差異，簡化模型不能取代飽和模型，確定最終模型為飽和模型。

如何計算OR值？可以使用glm函數(shù)擬合模型，然后根據(jù)系數(shù)計算：

f3 <- glm(`頻數(shù)` ~ `性別` * `血壓`, data = data17_2, family = poisson())
#f3
coef(f3) # 查看系數(shù)
##         (Intercept)            性別女性          血壓高血壓 性別女性:血壓高血壓 
##           6.3613025           0.5779515          -0.1771536          -0.5820837
exp(-0.5820837) # 計算OR值
## [1] 0.5587329

除此之外，還可以使用Crosstabs.Loglinear包中的LOGLINEAR函數(shù)擬合對數(shù)線性模型，給出的信息超級全面，而且格式和SPSS類似，更容易觀看：

library(Crosstabs.Loglinear)

LOGLINEAR(data = data17_2,
          data_type = "counts",
          variables = c("性別","血壓"),
          Freq = "頻數(shù)")

# 給出的結果，太全面了！
The input data:

      血壓
性別    正常血壓    高血壓
  男性         579       485
  女性        1032       483


K - Way and Higher-Order Effects

    K    df    LR Chi-Square    p    Pearson Chi-Square    p        AIC
    1     3          291.135    0               319.455    0    326.153
    2     1           49.843    0                50.046    0     88.860
    0     0            0.000    1                 0.000    1     41.017

    These are tests that K - Way and Higher-Order Effects are zero, i.e., tests
    of the hypothesis that Kth-order and higher interactions are zero
    If these effects and all higher order effects are removed from the model,
    then here are the consequences.
    
    The df values indicate the number of effects (model terms) that are removed.
    
    The first row, labeled as 1, shows the consequences of removing all of the main
    effects and all higher order effects (i.e., everything) from the model. This
    usually results in poor fit. A statistically significant chi-square indicates   
    that the prediction of the cell frequencies is significantly worse than the 
    prediction that is provided by the saturated model. It would suggest that at
    least one of the removed effects needs to be included in the model.
    
    The second row, labeled as 2, shows the consequences of removing all of the
    two-way and higher order effects from the model, while keeping the main effects.
    A statistically significant chi-square indicates a reduction in prediction success
    compared to the saturated model and that at least one of the removed effects needs
    to be included in the model.
    
    The same interpretation process applies if there is a K = 3 row, and so on.
    A K = 3 row in the table would show the consequences of removing all of the
    three-way and higher order effects from the model, while keeping the two-way
    interactions and main effects.
    
    A nonsignificant chi-square for a row would indicate that removing the
    model term(s) does not significantly worsen the prediction of the cell
    frequencies and the term(s) is nonessential and can be dropped from the model.
    
    The bottom row in the table, labeled as 0, is for the saturated mode. It
    includes all possible model terms and therefore provides perfect prediction
    of the cell frequencies. The AIC values for this model can be helpful in
    gaging the relative fit of models with fewer terms.


K-Way Effects

    K    df    LR Chi-Square    p    Pearson Chi-Square    p    AIC diff.
    1     2          241.292    0               269.409    0      237.292
    2     1           49.843    0                50.046    0       47.843

    These are tests that the K - Way Effects are zero, i.e., tests whether
    interactions of a particular order are zero. The tests are for model
    comparisons/differences. For each K-way test, a model is fit with and then
    without the interactions and the change/difference in chi-square and
    likelihood ratio chi-square values are computed.
        
    For example, the K = 1 test is for the comparison of the model with
    all main effects and the intercept with the model with only the intercept.
    A statistically significant K = 1 test is (conventionally) considered to
    mean that the main effects are not zero and that they are needed in the model.
        
    The K = 2 test is for the comparison of the model with all two-way
    interactions, all main effects, and the intercept with the model with
    the main effects, and the intercept. A statistically significant K = 2 test
    is (conventionally) considered to mean that the two-way interactions are
    not zero and that they are needed in the model.
        
    The K = 3 test (if there is one) is for the comparison of the model
    with all three-way interactions, all two-way interactions, all main
    effects, and the intercept with the model with all two-way interactions,
    all main effects, and the intercept. A statistically significant K = 3 test
    is (conventionally) considered to mean that the three-way interactions
    are not zero and that they are needed in the model, and so on.
        
    The df values for the model comparisons are the df values associated
    with the K-way terms.
        
    The above "K - Way and Higher-Order Effects" and "K - Way" tests are for the
    ncollective importance of the effects at each value of K. There are not tests
    nof individual terms. For example, a significant K = 2 test means that the set
    nof two-way terms is important, but it does not mean that every two-way term is
    significant.


Partial Associations:

     Effect    LR.Chi.Square    df    p    AIC.diff.
1                                                   
2      性別           79.275     1    0       77.275
3      血壓          162.017     1    0      160.017

    These are tests of individual terms in the model, with the restriction that
    higher-order terms at each step are excluded. The tests are for differences
    between models. For example, the tests of 2-way interactions are for the
    differences between the model with all 2-way interactions (and no higher-order
    interactions) and the model when each individual 2-way interaction is removed in turn.



Parameter Estimates (SPSS "Model Selection", not "General", Parameter Estimates):

    For saturated models, .500 has been added to all observed cells:

               Estimate    Std. Error    z value    p     CI_lb     CI_ub
(Intercept)       6.417         0.021    310.767    0     6.377     6.458
性別1            -0.143         0.021     -6.943    0    -0.184    -0.103
血壓1             0.234         0.021     11.328    0     0.193     0.274
性別1:血壓1      -0.145         0.021     -7.043    0    -0.186    -0.105


Backward Elimination Statistics:

    Step              GenDel      Effects    LR_Chi_Square    df    p       AIC
                                                                               
       0    Generating Class    性別:血壓                0     0    1    41.017
                                                                               
              Deleted Effect    性別:血壓           49.843     1    0     88.86

    The hierarchical backward elimination procedure begins with all possible
    terms in the model and then removes, one at a time, terms that do not
    satisfy the criteria for remaining in the model.
    A term is dropped only when it is determined that removing the term does
    not result in a reduction in model fit AND if the term is not involved in any
    higher order interaction. On each Step above, the focus is on the term that results
    in the least-significant change in the likelihood ratio chi-squre if removed.
    If the change is not significant, then the term is removed.



The Final Model Formula:

Freq ~ 性別 + 血壓 + 性別:血壓


The Final Model Goodness-of-Fit Tests:

    df    LR Chi-Square    p    Pearson Chi-Square    p       AIC
     0                0    1                     0    0    41.017


Generalized Linear Model Coefficients for the Final Model:

                       Estimate    Std. Error    z value    Pr(>|z|)
(Intercept)               6.361         0.042    153.068       0.000
性別女性                  0.578         0.052     11.131       0.000
血壓高血壓               -0.177         0.062     -2.878       0.004
性別女性:血壓高血壓      -0.582         0.083     -7.044       0.000


Cell Counts and Residuals:

     性別        血壓    Obsd. Freq.    Exp. Freq.    Residuals    Std. Resid.    Adjusted Resid.
1    男性    正常血壓            579           579            0              0                  0
3    男性      高血壓            485           485            0              0                  0
2    女性    正常血壓           1032          1032            0              0                  0
4    女性      高血壓            483           483            0              0                  0

R*C表

孫振球醫(yī)學統(tǒng)計學第4版例17-3。比較3種方劑治療胃潰瘍的效果，將200名病情類似的患者隨機分到3個治療組，療效如下。試分析3個方劑的治療效果有無差別。

data17_3 <- haven::read_sav("datasets/例17-03.sav",encoding = "GBK")
data17_3 <- haven::as_factor(data17_3)
str(data17_3)
## tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
##  $ 治療方法: Factor w/ 3 levels "甲方劑","乙方劑",..: 1 1 2 2 3 3
##  $ 治療效果: Factor w/ 2 levels "有效","無效": 1 2 1 2 1 2
##  $ 頻數(shù)    : num [1:6] 42 18 38 27 56 19
##   ..- attr(*, "format.spss")= chr "F8.0"
data17_3
## # A tibble: 6 × 3
##   治療方法 治療效果  頻數(shù)
##   <fct>    <fct>    <dbl>
## 1 甲方劑   有效        42
## 2 甲方劑   無效        18
## 3 乙方劑   有效        38
## 4 乙方劑   無效        27
## 5 丙方劑   有效        56
## 6 丙方劑   無效        19

直接使用loglm函數(shù)擬合對數(shù)線性模型：

# 擬合不飽和模型
f <- loglm(`頻數(shù)` ~ `治療方法`+`治療效果`,data = data17_3)
f
## Call:
## loglm(formula = 頻數(shù) ~ 治療方法 + 治療效果, data = data17_3)
## 
## Statistics:
##                       X^2 df  P(> X^2)
## Likelihood Ratio 4.310314  2 0.1158850
## Pearson          4.359917  2 0.1130462

進行擬合優(yōu)度檢驗：

f1 <- update(f, ~ .^2) # 飽和模型
anova(f,f1) # 擬合優(yōu)度檢驗
## LR tests for hierarchical log-linear models
## 
## Model 1:
##  頻數(shù) ~ 治療方法 + 治療效果 
## Model 2:
##  頻數(shù) ~ 治療方法 + 治療效果 + 治療方法:治療效果 
## 
##           Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1   4.310314  2                                    
## Model 2   0.000000  0   4.310314         2        0.11588
## Saturated 0.000000  0   0.000000         0        1.00000

這個結果也是和課本一致的，似然比G²=4.31，p值=0.1159，不拒絕H0，簡化模型可以取代飽和模型。

三維列聯(lián)表

孫振球醫(yī)學統(tǒng)計學第4版例17-4。Vandenbroucke等人采用病例對照研究，研究避孕藥與Fcator-V-Leiden等位基因在靜脈血栓發(fā)生中的作用。共調(diào)查324人，其中病例155人，對照169人。試對避孕藥與基因的交互作用進行分析。

data17_4 <- haven::read_sav("datasets/例17-04.sav",encoding = "GBK")
data17_4 <- haven::as_factor(data17_4)
str(data17_4)
## tibble [8 × 4] (S3: tbl_df/tbl/data.frame)
##  $ 人群分組          : Factor w/ 2 levels "病例組","對照組": 1 1 1 1 2 2 2 2
##  $ 口服避孕藥暴露水平: Factor w/ 2 levels "暴露","未暴露": 1 1 2 2 1 1 2 2
##  $ 基因型            : Factor w/ 2 levels "突變型","野生型": 1 2 1 2 1 2 1 2
##  $ 頻數(shù)              : num [1:8] 25 84 10 36 2 63 4 100
##   ..- attr(*, "format.spss")= chr "F8.0"
data17_4
## # A tibble: 8 × 4
##   人群分組 口服避孕藥暴露水平 基因型  頻數(shù)
##   <fct>    <fct>              <fct>  <dbl>
## 1 病例組   暴露               突變型    25
## 2 病例組   暴露               野生型    84
## 3 病例組   未暴露             突變型    10
## 4 病例組   未暴露             野生型    36
## 5 對照組   暴露               突變型     2
## 6 對照組   暴露               野生型    63
## 7 對照組   未暴露             突變型     4
## 8 對照組   未暴露             野生型   100

下面就是建立飽和模型和多個不飽和模型，書中介紹了4種不飽和模型：

無二階交互效應的模型
條件獨立模型
聯(lián)合獨立模型
完全獨立模型

然后借助SAS對飽和模型進行逐步篩選，并根據(jù)BIC和G²選擇最終模型。

我們先建立飽和模型和幾個不飽和模型（并沒有完全按照書中來，太麻煩了，不嫌麻煩的可以自己逐個列出來）。

library(MASS)

# 無交互項，只有主效應，完全獨立模型
f1 <- loglm(`頻數(shù)` ~ `人群分組`+`口服避孕藥暴露水平`+`基因型`,data = data17_4)
# 添加1階交互效應，即只有兩個變量的交互，沒有3變量交互
f2 <- update(f1, ~ .^2)
# 添加2階交互效應，即飽和模型，既有兩變量交互，又有3變量交互
f3 <- update(f1, ~ .^3)

下面使用逐步回歸法對飽和模型進行篩選，使用的指標的AIC(AIC和BIC類似，也是值越小說明模型擬合效果越好)：

f <- step(f3, direction = "both")
## Start:  AIC=16
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 
##     人群分組:口服避孕藥暴露水平 + 人群分組:基因型 + 
##     口服避孕藥暴露水平:基因型 + 人群分組:口服避孕藥暴露水平:基因型
## 
##                                      Df    AIC
## - 人群分組:口服避孕藥暴露水平:基因型  1 14.096
## <none>                                  16.000
## 
## Step:  AIC=14.1
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 
##     人群分組:口服避孕藥暴露水平 + 人群分組:基因型 + 
##     口服避孕藥暴露水平:基因型
## 
##                                      Df    AIC
## - 口服避孕藥暴露水平:基因型           1 12.097
## <none>                                  14.096
## + 人群分組:口服避孕藥暴露水平:基因型  1 16.000
## - 人群分組:基因型                     1 37.909
## - 人群分組:口服避孕藥暴露水平         1 42.920
## 
## Step:  AIC=12.1
## 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 
##     人群分組:口服避孕藥暴露水平 + 人群分組:基因型
## 
##                               Df    AIC
## <none>                           12.097
## + 口服避孕藥暴露水平:基因型    1 14.096
## - 人群分組:基因型              1 38.751
## - 人群分組:口服避孕藥暴露水平  1 43.762
f # AIC最小的模型
## Call:
## loglm(formula = 頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 
##     基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型, 
##     data = data17_4, evaluate = FALSE)
## 
## Statistics:
##                         X^2 df  P(> X^2)
## Likelihood Ratio 0.09702652  2 0.9526447
## Pearson          0.09561789  2 0.9533159

結果得到的最終模型和書中是一樣的：

頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型

可以再比較一下飽和模型和AIC最小的這個模型：

anova(f3,f)
## LR tests for hierarchical log-linear models
## 
## Model 1:
##  頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型 
## Model 2:
##  頻數(shù) ~ 人群分組 + 口服避孕藥暴露水平 + 基因型 + 人群分組:口服避孕藥暴露水平 + 人群分組:基因型 + 口服避孕藥暴露水平:基因型 + 人群分組:口服避孕藥暴露水平:基因型 
## 
##             Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
## Model 1   0.09702652  2                                    
## Model 2   0.00000000  0 0.09702652         2        0.95264
## Saturated 0.00000000  0 0.00000000         0        1.00000