Three important functions in R's tidyr package: usage and examples of gather, spread, and separate

Medical Data Science 2020-05-05

tidyr is a very useful and frequently used package written by Hadley (Hadley Wickham, the author of Tidy Data). It is often used together with the dplyr package, which he also wrote.

Preparation:

First install the tidyr package (the package name must be quoted, otherwise R reports an error):

install.packages("tidyr")

Then load tidyr (here the quotes are optional):

library(tidyr)

gather()

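gather() works like the unpivot operation in Excel (2016 and later): it converts a two-dimensional table whose column names contain values of a variable into a tidy table, similar to a relation in a database (see the example below).

First run ?gather at the R prompt and look at the official documentation: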

gather {tidyr} R Documentation

Gather columns into key-value pairs.

Description

Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables.

Usage

gather(data, key = "key", value = "value", ..., na.rm = FALSE,
  convert = FALSE, factor_key = FALSE)

Arguments

data

A data frame.

key, value

Names of new key and value columns, as strings or symbols.

This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym() (note that this kind of interface where symbols do not represent actual objects is now discouraged in the tidyverse; we support it here for backward compatibility).

... (the three dots are themselves an argument)

A selection of columns. If empty, all variables are selected. You can supply bare variable names, select all variables between x and z with x:z, exclude y with -y. For more options, see the dplyr::select() documentation. See also the section on selection rules below.

na.rm

If TRUE, will remove rows from output where the value column is NA.

convert

If TRUE will automatically run type.convert() on the key column. This is useful if the column types are actually numeric, integer, or logical.

factor_key

If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.

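Notes:

The first argument is the source data, which must be a data frame;

The next two arguments are the key and value names. You choose them yourself, and they become the two new column headers (variable names) of the reshaped table;

The fourth argument selects the columns to gather. If it is omitted, all columns are gathered;

The optional na.rm argument can follow. With na.rm = TRUE, rows whose value is missing (NA) are dropped from the new table.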
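The effect of na.rm and factor_key is easiest to see on a tiny table. The sketch below uses a hypothetical data frame, scores, which is not part of the original article; the column names id, math, and art are assumptions made for illustration only.

```r
library(tidyr)

# Hypothetical toy data: student "b" has no math score
scores <- data.frame(id = c("a", "b"),
                     math = c(90, NA),
                     art = c(80, 70),
                     stringsAsFactors = FALSE)

# Default (na.rm = FALSE): 2 ids x 2 subjects = 4 rows, one of them NA
long_all <- gather(scores, subject, score, math, art)

# na.rm = TRUE drops the row whose score is NA, leaving 3 rows
long_complete <- gather(scores, subject, score, math, art, na.rm = TRUE)

# factor_key = TRUE stores the key as a factor whose levels follow
# the original column order (math before art)
long_factor <- gather(scores, subject, score, -id, factor_key = TRUE)
levels(long_factor$subject)
#> [1] "math" "art"
```

Selecting the columns as math, art or as -id is equivalent here, because id is the only column left out.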

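gather() example

First, construct a data frame stu:

stu <- data.frame(grade = c("A", "B", "C", "D", "E"), female = c(5, 4, 1, 2, 3), male = c(1, 2, 3, 4, 5))

The data frame is what it looks like: the number of students by grade and gender.

The column names female and male are an instance of column names that contain values of a variable: female and male are really values of a "gender" variable, and the numbers below them belong to a variable (attribute) that should be named "count". We therefore keep the original grade column, drop the female and male columns, and add two columns, gender and count, whose values correspond to the original table. gather() does this: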

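gather(stu, gender, count, -grade)

This turns the two gender columns into rows. The first argument is the source data stu, the second and third arguments are the key-value pair (gender, count), and the fourth, -grade, excludes the grade column, so only the remaining two columns are gathered.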

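Looking only at these two columns of the original table, the pairs are:

(female, 5), (female, 4), (female, 1), (female, 2), (female, 3)

(male, 1), (male, 2), (male, 3), (male, 4), (male, 5)

That is, the original column names (attribute names) become the key and the cell values become the value.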
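For reference, the gather() call on stu can be run end to end as in the sketch below. The pivot_longer() call at the end is an addition that is not in the original article: it assumes tidyr 1.0.0 or later, where pivot_longer() is the recommended replacement for gather(), and it returns the same pairs in a different row order.

```r
library(tidyr)

stu <- data.frame(grade = c("A", "B", "C", "D", "E"),
                  female = c(5, 4, 1, 2, 3),
                  male = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)

# Gather every column except grade into (gender, count) pairs
stu_long <- gather(stu, gender, count, -grade)
print(stu_long)
#>    grade gender count
#> 1      A female     5
#> 2      B female     4
#> 3      C female     1
#> 4      D female     2
#> 5      E female     3
#> 6      A   male     1
#> 7      B   male     2
#> 8      C   male     3
#> 9      D   male     4
#> 10     E   male     5

# Assumed alternative for tidyr >= 1.0.0: same pairs, but the rows
# come out grouped by grade rather than by gender
stu_long2 <- pivot_longer(stu, cols = -grade,
                          names_to = "gender", values_to = "count")
```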

From here, the usual statistical analysis can proceed.

separate()

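separate() splits data: it takes a single column that holds two variables and separates them (in the gather() example above, the column names themselves were values of one variable, one name per value). An example shows it best:

separate() example

Construct a new data frame stu2: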

stu2 <- data.frame(grade = c("A", "B", "C", "D", "E"),
                   female_1 = c(5, 4, 1, 2, 3), male_1 = c(1, 2, 3, 4, 5),
                   female_2 = c(4, 5, 1, 2, 3), male_2 = c(0, 2, 3, 4, 6))

This is much like stu above; the 1 and 2 after the gender denote the class.

First reshape it with gather():

stu2_new <- gather(stu2, gender_class, count, -grade)

This works exactly as before and yields 20 rows, one per combination of grade and gender_class.

This is still not a tidy table, because one column (gender_class) holds more than one variable. separate() splits it apart. Its usage is:

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)

The first argument is the data frame to work on;

The second argument is the column to split;

The third argument gives the names of the resulting columns (necessarily more than one) as a character vector;

The fourth argument is the separator, given either as a regular expression or as a number indicating the position at which to split. The documentation says:

If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string. The length of sep should be one less than into.

The remaining arguments are not covered here; see the documentation for details.
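The two forms of sep described above can be compared on a small example. The data frames codes and labels below are hypothetical and only serve to illustrate the argument; convert = TRUE is added to show how the split pieces can be turned into numbers.

```r
library(tidyr)

# Hypothetical data: a letter followed by a digit, with no delimiter
codes <- data.frame(code = c("A1", "B2", "C3"), stringsAsFactors = FALSE)

# Numeric sep: split after the first character
split_codes <- separate(codes, code, c("letter", "digit"),
                        sep = 1, convert = TRUE)
split_codes$letter
#> [1] "A" "B" "C"
split_codes$digit
#> [1] 1 2 3

# Character sep: a regular expression, here a literal underscore
labels <- data.frame(label = c("female_1", "male_2"),
                     stringsAsFactors = FALSE)
split_labels <- separate(labels, label, c("gender", "class"), sep = "_")
split_labels$gender
#> [1] "female" "male"
```

With convert = TRUE the digit column comes back as integer; without it, both new columns stay character.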

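Now split the gender_class column:

separate(stu2_new, gender_class, c("gender", "class"))

Note that the third argument is a vector built with c(). The fourth argument would be "_", but it can be omitted here because the default sep, "[^[:alnum:]]+", matches any run of non-alphanumeric characters, and the underscore is one of them.

The result has four columns: grade, gender, class, and count.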
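The whole stu2 example can be reproduced with the sketch below. It passes sep = "_" explicitly, which is an assumption made for clarity; the call without sep gives the same result on this data.

```r
library(tidyr)

stu2 <- data.frame(grade = c("A", "B", "C", "D", "E"),
                   female_1 = c(5, 4, 1, 2, 3), male_1 = c(1, 2, 3, 4, 5),
                   female_2 = c(4, 5, 1, 2, 3), male_2 = c(0, 2, 3, 4, 6),
                   stringsAsFactors = FALSE)

# Step 1: gather the four gender_class columns into key-value pairs
stu2_new <- gather(stu2, gender_class, count, -grade)

# Step 2: split the key at the underscore into gender and class
stu2_tidy <- separate(stu2_new, gender_class, c("gender", "class"), sep = "_")

head(stu2_tidy, 3)
#>   grade gender class count
#> 1     A female     1     5
#> 2     B female     1     4
#> 3     C female     1     1
```

The class column is character at this point; passing convert = TRUE to separate() would turn it into an integer column.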

spread()

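spread() widens a table: it takes a key-value pair of columns and spreads it across multiple columns.

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

key is the name of the column to spread (its distinct values become the new column names), and value is the column of the original table whose values fill the new columns.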
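One way to get a feel for key and value is to treat spread() as the inverse of gather(). The sketch below reuses the stu data frame from the gather() example and spreads the gathered table back into its original wide shape; the intermediate names stu_long and stu_wide are chosen here for illustration.

```r
library(tidyr)

stu <- data.frame(grade = c("A", "B", "C", "D", "E"),
                  female = c(5, 4, 1, 2, 3),
                  male = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)

# Long form: one row per (grade, gender) pair
stu_long <- gather(stu, gender, count, -grade)

# key = gender: its values "female" and "male" become column names
# value = count: its values fill those two columns
stu_wide <- spread(stu_long, gender, count)

names(stu_wide)
#> [1] "grade"  "female" "male"
```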

An example follows.

spread() example

Construct the data frame stu3:

name<-rep(c("Sally","Jeff","Roger","Karen","Brain"),c(2,2,2,2,2))

test<-rep(c("midterm","final"),5)

class1<-c("A","C",NA,NA,NA,NA,NA,NA,"B","B")

class2<-c(NA,NA,"D","E","C","A",NA,NA,NA,NA)

class3<-c("B","C",NA,NA,NA,NA,"C","C",NA,NA)

class4<-c(NA,NA,"A","C",NA,NA,"A","A",NA,NA)

class5<-c(NA,NA,NA,NA,"B","A",NA,NA,"A","C")

stu3<-data.frame(name,test,class1,class2,class3,class4,class5)

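There are five courses in total; each student takes two of them, and the midterm and final grades are listed.

The original table is clearly untidy: its header contains values of a variable (class1 to class5), so gather() comes first. There are many missing values here, which is where the na.rm = TRUE argument described above is useful. Calling gather(stu3, class, grade, class1:class5, na.rm = TRUE) automatically drops every record (a record is one row) whose grade is missing.

Without na.rm = TRUE, the result keeps all 50 rows, 30 of which have NA as the grade.

Analyzing an "NA" grade for a course the student never took is meaningless, so records with missing values should be discarded in this case.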
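The difference that na.rm makes on stu3 can be checked by counting rows. The sketch below rebuilds stu3 so that it runs on its own; the names with_na and without_na are chosen here for illustration, and stringsAsFactors = FALSE is added so the grade columns stay character in older R versions.

```r
library(tidyr)

name <- rep(c("Sally", "Jeff", "Roger", "Karen", "Brain"), c(2, 2, 2, 2, 2))
test <- rep(c("midterm", "final"), 5)
class1 <- c("A", "C", NA, NA, NA, NA, NA, NA, "B", "B")
class2 <- c(NA, NA, "D", "E", "C", "A", NA, NA, NA, NA)
class3 <- c("B", "C", NA, NA, NA, NA, "C", "C", NA, NA)
class4 <- c(NA, NA, "A", "C", NA, NA, "A", "A", NA, NA)
class5 <- c(NA, NA, NA, NA, "B", "A", NA, NA, "A", "C")
stu3 <- data.frame(name, test, class1, class2, class3, class4, class5,
                   stringsAsFactors = FALSE)

# Keep every cell, including courses a student never took
with_na <- gather(stu3, class, grade, class1:class5)

# Drop the rows whose grade is NA
without_na <- gather(stu3, class, grade, class1:class5, na.rm = TRUE)

nrow(with_na)              # 10 rows x 5 class columns
#> [1] 50
sum(is.na(with_na$grade))  # cells with no grade
#> [1] 30
nrow(without_na)           # 5 students x 2 courses x 2 tests
#> [1] 20
```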

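The table now looks tidy, but every student has four records. Within each course only the test and grade values differ, while the name and class are the same. We often need to analyze midterm and final grades separately, and the table in this form does not lend itself to that.

spread() splits the test column into two columns, midterm and final, whose values are the grades for the two courses each student takes.

To repeat: the second argument is the name of the column to spread, and the third argument is the name of the column in the original table that supplies the values for the new columns.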

stu3_new <- gather(stu3, class, grade, class1:class5, na.rm = TRUE)

spread(stu3_new, test, grade)

The result is a very tidy table of only 10 rows, which is much more convenient to work with.
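The shape of the spread() result can be verified with the self-contained sketch below. The pivot_wider() call at the end is an addition that is not in the original article: it assumes tidyr 1.0.0 or later, where pivot_wider() is the recommended replacement for spread().

```r
library(tidyr)

name <- rep(c("Sally", "Jeff", "Roger", "Karen", "Brain"), c(2, 2, 2, 2, 2))
test <- rep(c("midterm", "final"), 5)
class1 <- c("A", "C", NA, NA, NA, NA, NA, NA, "B", "B")
class2 <- c(NA, NA, "D", "E", "C", "A", NA, NA, NA, NA)
class3 <- c("B", "C", NA, NA, NA, NA, "C", "C", NA, NA)
class4 <- c(NA, NA, "A", "C", NA, NA, "A", "A", NA, NA)
class5 <- c(NA, NA, NA, NA, "B", "A", NA, NA, "A", "C")
stu3 <- data.frame(name, test, class1, class2, class3, class4, class5,
                   stringsAsFactors = FALSE)

stu3_new <- gather(stu3, class, grade, class1:class5, na.rm = TRUE)

# One row per (name, class); test values become the columns final and midterm
stu3_wide <- spread(stu3_new, test, grade)
dim(stu3_wide)
#> [1] 10  4
names(stu3_wide)
#> [1] "name"    "class"   "final"   "midterm"

# Sally's class1 grades: midterm "A", final "C"
stu3_wide[stu3_wide$name == "Sally" & stu3_wide$class == "class1",
          c("midterm", "final")]

# Assumed alternative for tidyr >= 1.0.0
stu3_wide2 <- pivot_wider(stu3_new, names_from = test, values_from = grade)
```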

One last addition: the class column now looks somewhat redundant, and plain numbers would be more concise. parse_number() from the readr package extracts the number (together with mutate() from dplyr). The code is:

install.packages("dplyr")

install.packages("readr")

library(readr)

library(dplyr)

mutate(spread(stu3_new, test, grade), class = parse_number(class))

In the final result the class column holds plain numbers, and the whole table is nice and tidy!
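What parse_number() does to the class column can be seen in isolation. The sketch below uses a hypothetical two-row table, courses, rather than the full stu3 pipeline; the base R line at the end is an assumed fallback for sessions where readr is not installed.

```r
library(readr)
library(dplyr)

# Hypothetical two-row table, used only to illustrate the conversion
courses <- data.frame(name = c("Sally", "Jeff"),
                      class = c("class1", "class2"),
                      stringsAsFactors = FALSE)

# parse_number() drops the non-numeric prefix and returns a double
courses_num <- mutate(courses, class = parse_number(class))
courses_num$class
#> [1] 1 2

# Base R equivalent: strip every non-digit, then convert
base_num <- as.numeric(gsub("[^0-9]", "", courses$class))
base_num
#> [1] 1 2
```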

————————————————

Copyright notice: This is an original article by CSDN blogger "six66667", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.

Original link: https://blog.csdn.net/six66667/article/details/84888644
