An Introduction to the apply Family of Functions

panhoy 2014-07-09

On any R Q&A site or forum, you will come across exchanges like this:
[code]
Q: How do I use a loop to do [...all sorts of strange things...]?
A: No need for a loop, my friend! The apply functions can take care of that!
[/code]
So what exactly is this magical apply function? Below are some simple demonstrations.
Open R, type ??apply, and look at the entries from the base package. You will see the following:
[code]
base::apply Apply Functions Over Array Margins
base::by Apply a Function to a Data Frame Split by Factors
base::eapply Apply a Function Over Values in an Environment
base::lapply Apply a Function over a List or Vector
base::mapply Apply a Function to Multiple List or Vector Arguments
base::rapply Recursively Apply a Function to a List
base::tapply Apply a Function Over a Ragged Array
[/code]
We demonstrate them one by one below.
1.apply
First, the description in the help documentation:
[code]
“Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix. ”
[/code]
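Fine, I know what a vector, an array and a function are, but what on earth are margins? Put simply, setting the MARGIN argument to 1 means rows and setting it to 2 means columns. If it is c(1, 2) (written 1:2 below), admittedly a roundabout thing to do, the function is applied over both rows and columns, that is, to every individual element of the array or matrix. For example: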
[code]
# Create a matrix with 10 rows and 2 columns
m=matrix(c(1:10,11:20),nrow=10,ncol=2)
# Compute the mean of each row of m
apply(m,1,mean)
[1] 6 7 8 9 10 11 12 13 14 15
# Compute the mean of each column of m
apply(m,2,mean)
[1] 5.5 15.5
# Divide every value of m by 2
apply(m,1:2,function(x) x/2)
[,1] [,2]
[1,] 0.5 5.5
[2,] 1.0 6.0
[3,] 1.5 6.5
[4,] 2.0 7.0
[5,] 2.5 7.5
[6,] 3.0 8.0
[7,] 3.5 8.5
[8,] 4.0 9.0
[9,] 4.5 9.5
[10,] 5.0 10.0
[/code]
The last example is only for demonstration; there is a much simpler way to get the same result.
[code]
m/2
[,1] [,2]
[1,] 0.5 5.5
[2,] 1.0 6.0
[3,] 1.5 6.5
[4,] 2.0 7.0
[5,] 2.5 7.5
[6,] 3.0 8.0
[7,] 3.5 8.5
[8,] 4.0 9.0
[9,] 4.5 9.5
[10,] 5.0 10.0
[/code]
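For row and column means specifically, base R also provides rowMeans and colMeans. A minimal sketch, reusing the same matrix m, that checks they agree with the apply calls (the names row_avg and col_avg are only illustrative):

[code]
# Same 10 x 2 matrix as above
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# Built-in shortcuts for the two most common margins
row_avg <- rowMeans(m)
col_avg <- colMeans(m)
# Both comparisons should print TRUE
all.equal(row_avg, apply(m, 1, mean))
all.equal(col_avg, apply(m, 2, mean))
[/code]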

2.by
The description in the help documentation:
[code]
“Function by is an object-oriented wrapper for tapply applied to data frames. ”
[/code]
In fact, this one sentence does not do by justice. Read on and you will see:
[code]
“A data frame is split by row into data frames subsetted by the values of one or more factors, and function FUN is applied to each subset in turn. ”
[/code]
Here we demonstrate with a dataset that contains a categorical variable.
The dataset is the famous iris.
[code]
attach(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
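# Group by Species and compute the means of the first 4 variables
# (colMeans is used because mean() no longer accepts a data frame in current R)
by(iris[,1:4],Species,colMeans)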
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
---------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
--------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
[/code]
In essence, by splits the data into several subsets according to a categorical variable and then applies the function to each subset.
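The split-then-apply idea can be sketched by hand with split and lapply. This only illustrates the idea and is not a claim about how by is implemented internally; the names pieces and species_means are made up for the example:

[code]
# Split the four numeric columns into one data frame per species
pieces <- split(iris[, 1:4], iris$Species)
# Apply colMeans to each piece; the result is a list with one entry per species
species_means <- lapply(pieces, colMeans)
species_means$setosa
[/code]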
3.eapply
The description in the help documentation:
[code]
“eapply applies FUN to the named values from an environment and returns the results as a list. The user can request that all named objects are used (normally names that begin with a dot are not). The output is not sorted and no enclosing environments are searched. ”
[/code]
To understand this passage, the key is to understand what an environment is. An environment is like a small self-contained system inside R, with its own variables, functions and so on. A simple example:
[code]
# Create a new environment
e=new.env()
# Create two variables in e
e$a=1:10
e$b=11:20
# Compute the mean of each variable in e
eapply(e,mean)
$a
[1] 5.5
$b
[1] 15.5
[/code]
Most people probably do not use environments very often. Bioconductor users, however, are an exception!
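As the quoted description says, names that begin with a dot are normally skipped. A short sketch of the all.names argument, using a made-up environment e2 with made-up variable names. Because the output order is not guaranteed, the names are sorted before printing:

[code]
e2 <- new.env()
e2$visible <- 1:4
e2$.hidden <- 5:8
# Default: the dot-prefixed name is skipped
sort(names(eapply(e2, mean)))
# all.names = TRUE includes it as well
sort(names(eapply(e2, mean, all.names = TRUE)))
[/code]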
4.lapply
The description in the help documentation:
[code]
“lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. ”
[/code]
This is the most concise description in the help pages of the apply family. Straight to the demonstration:
[code]
# Create a list
l=list(a=1:10,b=11:20)
# Compute the mean of each element of the list
lapply(l,mean)
$a
[1] 5.5
$b
[1] 15.5
# Compute the sum of each element of the list
lapply(l,sum)
$a
[1] 55
$b
[1] 155
[/code]
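Any extra arguments given after FUN are passed on to the function for every element. A small sketch with a made-up list (scores) that contains a missing value:

[code]
# Hypothetical list with one NA
scores <- list(a = c(1, NA, 3), b = c(4, 5, 6))
# na.rm = TRUE is forwarded to mean() for each element
clean_means <- lapply(scores, mean, na.rm = TRUE)
clean_means
[/code]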
The lapply documentation points us on to sapply, vapply and replicate. Let's go and have a look!
4.1 sapply
The description in the help documentation:
[code]
“sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify="array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify=FALSE, USE.NAMES=FALSE) is the same as lapply(x,f). ”
[/code]
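Put simply: lapply returns a list with two elements, $a and $b, whereas sapply returns a named vector whose elements can be extracted with [["a"]] and [["b"]], or a matrix with columns named a and b.
Demonstration: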
[code]
# Create a list
l=list(a=1:10,b=11:20)
# Compute the means with sapply
l.mean=sapply(l,mean)
# Check the type of the returned result
class(l.mean)
[1] "numeric"
# Extract the mean of element a
l.mean[['a']]
[1] 5.5
[/code]
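When every call returns a vector of the same length greater than one, sapply simplifies to a matrix rather than a vector. A minimal sketch using range (the name l.range is only illustrative):

[code]
l <- list(a = 1:10, b = 11:20)
# range() returns two values per element, so the result is a 2 x 2 matrix
l.range <- sapply(l, range)
l.range
dim(l.range)
colnames(l.range)
[/code]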
4.2 vapply
The description in the help documentation:
[code]
“vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.”
[/code]
Straight to the demonstration:
[code]
l=list(a=1:10,b=11:20)
# Use vapply to compute the five-number summary
l.fivenum=vapply(l,fivenum,c(Min.=0,"1st Qu."=0,Median=0,"3rd Qu."=0,Max.=0))
class(l.fivenum)
[1] "matrix" "array"
# (R 4.0.0 and later; older versions print only "matrix")
# Result
l.fivenum
a b
Min. 1.0 11.0
1st Qu. 3.0 13.0
Median 5.5 15.5
3rd Qu. 8.0 18.0
Max. 10.0 20.0
[/code]
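So, as you can see, vapply returns a matrix here. The column names are the names of the list elements, and the row names are taken from the names given in the FUN.VALUE template (if the template is unnamed, they come from the function's result instead).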
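The "safer" part of the description shows up when a result does not match the declared template: vapply stops with an error, where sapply would quietly return a different shape. A sketch, with the failing call wrapped in tryCatch so the script keeps running (the message string is made up for the example):

[code]
l <- list(a = 1:10, b = 11:20)
# Declared template: one number per element, but range() returns two
outcome <- tryCatch(
  vapply(l, range, numeric(1)),
  error = function(err) "vapply raised an error"
)
outcome
# With a matching template the call succeeds and returns a named numeric vector
l.mean <- vapply(l, mean, numeric(1))
l.mean
[/code]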
4.3 replicate
The description in the help documentation:
[code]
“replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation). ”
[/code]
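replicate is a very powerful function. It has two required arguments: n, the number of replications, and expr, the expression to evaluate repeatedly. There is also an optional argument, simplify (default "array"), which controls whether the results are simplified into a vector or matrix.
Demonstration: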
[code]
replicate(10,rnorm(10))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0615684 0.5398778 0.9815460 -1.352409971 -0.46670108 -0.5609335
[2,] 0.7501444 1.3515495 -1.1324161 0.482136905 0.01806138 -0.2143325
[3,] -1.6764568 -0.5816864 0.4668710 0.016770345 -1.19560774 0.6414898
[4,] -0.4259504 1.6960433 -0.1759500 0.293043551 -0.13894691 1.8681723
[5,] -2.4212326 1.1064597 1.6042605 -1.157019574 2.60824933 -0.6255382
[6,] -0.6131776 -1.7253104 -1.1349404 0.009324671 -2.11739811 -0.8523519
[7,] 0.6331760 -0.5458755 -0.1237157 -0.874786715 0.16970787 -0.3328544
[8,] 0.3754509 0.1577973 1.5376246 0.109439826 -0.30158661 -0.6086636
[9,] 1.1086812 -2.1814234 -0.4258651 -0.152788898 -0.25801517 -0.9072564
[10,] 1.9340591 0.5341643 0.4909151 0.877046384 1.13504362 0.3492340
[,7] [,8] [,9] [,10]
[1,] 0.55758137 -0.2411162 -2.66867275 -1.009182336
[2,] -0.10909235 1.2934438 1.13655059 -0.462670113
[3,] -1.13680550 -0.5422744 0.19473334 -2.053553409
[4,] 0.17695953 -0.9123063 -0.03708775 0.019742325
[5,] 0.08053346 -1.3154510 -1.05838904 0.211655454
[6,] 1.08128078 -1.0607662 -0.25984969 -0.150065431
[7,] 1.45707769 0.3940861 0.59462210 -0.270396491
[8,] -0.24380501 -1.0949531 0.45358256 0.005766857
[9,] -2.00170358 -1.8108618 -0.86100307 2.014660900
[10,] -0.94547942 1.6362386 -0.19392441 -0.729144393
[/code]
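Since the output above is random, it will differ on every run. A sketch that fixes the seed with set.seed for reproducibility and shows the effect of simplify = FALSE. The seed value 42 and the variable names are arbitrary choices for the example:

[code]
set.seed(42)
# Each replication yields one number, so the result is a numeric vector of length 5
sample_means <- replicate(5, mean(rnorm(10)))
length(sample_means)
# simplify = FALSE keeps each replication as a separate list element
draws <- replicate(3, rnorm(2), simplify = FALSE)
class(draws)
length(draws)
[/code]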
5.mapply
The description in the help documentation:
[code]
“mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary. ”
[/code]
If you read the mapply help page carefully, I bet it will make your head spin. Here is a simple example:
[code]
l1=list(a=1:10,b=11:20)
l2=list(c=21:30,d=31:40)
# Sum the corresponding elements of l1 and l2
mapply(sum,l1$a,l1$b,l2$c,l2$d)
[1] 64 68 72 76 80 84 88 92 96 100
l1
$a
[1] 1 2 3 4 5 6 7 8 9 10
$b
[1] 11 12 13 14 15 16 17 18 19 20
l2
$c
[1] 21 22 23 24 25 26 27 28 29 30
$d
[1] 31 32 33 34 35 36 37 38 39 40
[/code]
Read the four vectors vertically, position by position: adding up each column gives the result above.
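The arguments do not all have to play the same role. A further sketch in which the first vector supplies the value and the second supplies the repeat count; the result stays a list because the pieces have different lengths (reps and sums are illustrative names):

[code]
# Calls rep(1, 3), rep(2, 2), rep(3, 1) in turn
reps <- mapply(rep, 1:3, 3:1)
reps
# A custom function over parallel vectors simplifies to a plain vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums
[/code]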
6.rapply
The description in the help documentation:
[code]
“rapply is a recursive version of lapply. ”
[/code]
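This is one of the least helpful descriptions ever written. The recursion only comes into play for nested lists; for a flat list like the one below it makes no visible difference. What stands out about rapply is that it offers an argument, how, that controls the form of the output.
Demonstration: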
[code]
# Create a list
l=list(a=1:10,b=11:20)
# Compute log2 of the elements of l
rapply(l,log2)
a1 a2 a3 a4 a5 a6 a7 a8
0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
a9 a10 b1 b2 b3 b4 b5 b6
3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
b7 b8 b9 b10
4.087463 4.169925 4.247928 4.321928
# Set the output form of the result to list
rapply(l,log2,how="list")
$a
[1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
[9] 3.169925 3.321928
$b
[1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
[9] 4.247928 4.321928
# Compute the means
rapply(l,mean)
a b
5.5 15.5
rapply(l,mean,how="list")
$a
[1] 5.5
$b
[1] 15.5
[/code]
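To sum up, the output of rapply depends on the function and on the how argument. With how="list" the original structure of the data is preserved; otherwise (the default is how="unlist") the result is flattened into a vector.
You can also pass the classes argument to rapply. For example, with a list of mixed types, classes="numeric" makes the function operate only on the numeric elements.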
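A minimal sketch of the classes argument on a made-up nested list (mixed), using double-precision numbers for the numeric elements. It also shows the recursion: the function reaches the vector inside the inner list z.

[code]
# Hypothetical mixed-type, nested list
mixed <- list(x = c(1.5, 2.5), y = "some text", z = list(w = c(10, 20)))
# Only elements of class "numeric" are passed to mean();
# with the default how = "unlist" the non-matching element y is dropped
num_means <- rapply(mixed, mean, classes = "numeric")
num_means
names(num_means)
[/code]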
7.tapply
The description in the help documentation:
[code]
“Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.”
[/code]
Whoa, did that scare you? Don't worry. According to the detailed documentation,
in “tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)” the argument “X” is
“an atomic object, typically a vector.”, and “INDEX” is “list of factors, each of same length as X. The elements are coerced to factors by as.factor.”
Again we use the iris dataset as the example.
[code]
attach(iris)
# Group by Species and compute the mean petal length
tapply(iris$Petal.Length,Species,mean)
setosa versicolor virginica
1.462 4.260 5.552
[/code]
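Since INDEX can be a list of several factors, tapply can also produce one cell per combination of levels. A sketch with small made-up vectors (x, g1 and g2 are hypothetical):

[code]
# Hypothetical measurements with two grouping variables
x  <- c(1, 2, 3, 4, 5, 6)
g1 <- c("a", "a", "b", "b", "a", "b")
g2 <- c("u", "v", "u", "v", "u", "v")
# One cell per combination of g1 and g2; rows follow g1, columns follow g2
cell_sums <- tapply(x, list(g1, g2), sum)
cell_sums
[/code]

Here the cell for (a, u) is 1 + 5 = 6 and the cell for (b, v) is 4 + 6 = 10.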
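A brief summary:
The examples given here are all extremely simple, built on the simplest data and the simplest functions. That is deliberate: for each operation you can inspect the data before and after, which makes it easy to see exactly what the operation did to it.
Of course, the apply functions can do more than what is covered in this article; further uses are left for you to discover.
Here are a few suggestions for using the apply functions. Before reaching for one, think about:
What type is the original data? A vector? A matrix? A data frame? ...
Which subsets of the original data do you want to operate on? Rows? Columns? All elements? ...
What will the operation return? How does the structure of the original data change?
It is just the familiar story of input, operation and output: what do you have, what do you want, and what is needed in between?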