Plain text processing in R

gearss 2018-04-25

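Contents:
1. Reading and writing text files
2. Counting and translating characters
3. Concatenating strings
4. Splitting strings
5. Searching strings
6. Replacing strings
7. Extracting strings
8. Formatting strings for output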

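Notes:

Plain text files differ from the tabular text files we usually work with. "Text file" here means a pure text file whose content is mostly strings, whereas tabular text files are data files with regular rows and columns, which are read with functions such as read.table() or read.csv().
Regular expressions are not introduced here.
Packages such as stringr, parser, and R.utils are not covered either, although the functions they provide are undeniably more convenient.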

1. Reading and writing text files

The main functions for reading text files in R are readLines() and scan(). Both accept an encoding argument that declares the encoding of the input, and both return the text as a character vector.

text <- readLines("file.txt", encoding = "UTF-8")  # assumes a file with this name exists
scan("file.txt", what = character(0))  # default: each whitespace-separated word becomes one element
scan("file.txt", what = character(0), sep = "\n")  # each line becomes one element, similar to readLines()
scan("file.txt", what = character(0), sep = ".")  # split at every period, so each sentence becomes one element

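The contents of R objects can likewise be written to a text file, mainly with cat() and writeLines(). By default, cat() writes the elements of a vector separated by single spaces; the sep argument sets a different separator.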

cat(text, file = "file.txt", sep = "\n")
writeLines(text, con = "file.txt", sep = "\n", useBytes = F)
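
The read and write calls above can be combined into a round trip. The following is a minimal sketch, not taken from the original post: it assumes a scratch file created with tempfile() so that it does not depend on a real file.txt, and the variable names are made up for illustration.

```r
# Write a character vector to a temporary file, then read it back.
lines_out <- c("We are the world.", "We are the children!")
tmp <- tempfile(fileext = ".txt")  # hypothetical scratch file

writeLines(lines_out, con = tmp)                # one element per line
lines_in <- readLines(tmp, encoding = "UTF-8")  # one line per element

# scan() with its default separator splits on whitespace instead of lines
words_in <- scan(tmp, what = character(0), quiet = TRUE)

identical(lines_in, lines_out)  # the round trip preserves the lines
length(words_in)                # number of whitespace-separated words

unlink(tmp)  # remove the temporary file
```

Reading with readLines() keeps the line structure, while the default scan() call discards it and returns individual words.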

2. Counting and translating characters

nchar() counts the characters in each element. Note the difference from length(), which counts the elements in the vector.

x <- c("we are the world", "we are the children")
x

## [1] "we are the world"    "we are the children"

nchar(x)

## [1] 16 19

length(x)

## [1] 2


nchar("")

## [1] 0

length("")  # the string is empty, but it is still one element

## [1] 1

The usual functions for translating characters are tolower(), toupper(), and chartr().

dna <- "AgCTaaGGGcctTagct"
dna

## [1] "AgCTaaGGGcctTagct"

tolower(dna)

## [1] "agctaagggccttagct"

toupper(dna)

## [1] "AGCTAAGGGCCTTAGCT"

chartr("Tt", "Uu", dna)  # replace the base T with U

## [1] "AgCUaaGGGccuUagcu"
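
Because chartr() maps characters one to one, it also works when several characters must be swapped at once. As an illustration that is not in the original post, here is a sketch of a reverse complement; rev_complement() is a hypothetical helper name, and it assumes a single input string containing only the bases A, C, G, and T in either case.

```r
# Hypothetical helper: reverse complement of one DNA string.
rev_complement <- function(s) {
  comp <- chartr("ACGTacgt", "TGCAtgca", s)           # swap A<->T and C<->G, keeping case
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # reverse the characters
}

dna <- "AgCTaaGGGcctTagct"
rev_complement(dna)
rev_complement(rev_complement(dna)) == dna  # applying it twice restores the input
```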

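3. Concatenating strings

paste() is the R function for concatenating strings, but it can do considerably more than that: it converts its arguments to character, recycles them, and can collapse the result into a single string.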

paste("control", 1:3, sep = "_")

## [1] "control_1" "control_2" "control_3"


x <- list(a = "aa", b = "bb")
y <- list(c = 1, d = 2)
paste(x, y, sep = "-")

## [1] "aa-1" "bb-2"


paste(x, y, sep = "-", collapse = ";")

## [1] "aa-1;bb-2"

paste(x, collapse = ":")

## [1] "aa:bb"

x  # printing x shows that paste() left the list itself unchanged

## $a
## [1] "aa"
## 
## $b
## [1] "bb"

as.character(x)  # convert an object of another type to character

## [1] "aa" "bb"

unlist(x)

##    a    b 
## "aa" "bb"
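
The difference between sep and collapse is easy to mix up, so here is a short sketch with made-up labels. It also uses paste0(), which is paste() with sep = "".

```r
ids <- paste0("control", 1:3)               # sep = "" between the pieces of each element
one_string <- paste(ids, collapse = ", ")   # collapse joins the elements into one string
header <- paste("group", "id", "value", sep = "\t")  # sep joins the arguments of one call

ids         # "control1" "control2" "control3"
one_string  # "control1, control2, control3"
length(header)  # 1
```

In short, sep goes between the arguments within each result element, and collapse goes between the result elements.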

4. Splitting strings

strsplit() splits strings, and it can match the split points with regular expressions. Its usage is:
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

x is a character vector; the function splits each of its elements in turn.
split is a character vector giving the strings at which to split. It is treated as a regular expression by default; with fixed = TRUE it is matched as literal text, which is faster.
perl = TRUE switches to Perl-compatible regular expressions (PCRE). It does not depend on any installed Perl version, and for long or complex patterns it may be faster.
useBytes controls whether matching is done byte by byte. The default, FALSE, matches character by character.

text <- "We are the world.\nWe are the children!"
text

## [1] "We are the world.\nWe are the children!"

cat(text)  # \n is printed as a newline; R resolves escapes in string literals itself, before any regex is involved

## We are the world.
## We are the children!


strsplit(text, " ")

## [[1]]
## [1] "We"         "are"        "the"        "world.\nWe" "are"       
## [6] "the"        "children!"

strsplit(text, "\\s")  # split at any whitespace character; note the double backslash

## [[1]]
## [1] "We"        "are"       "the"       "world."    "We"        "are"      
## [7] "the"       "children!"

class(strsplit(text, "\\s"))

## [1] "list"

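strsplit() returns a list. To get a plain character vector, flatten it with unlist() as shown earlier. as.character() is not suitable here, because it deparses each list element into a single string instead of flattening it.

One special case: when the split argument is an empty string, the function returns the individual characters.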

strsplit(text, "")

## [[1]]
##  [1] "W"  "e"  " "  "a"  "r"  "e"  " "  "t"  "h"  "e"  " "  "w"  "o"  "r" 
## [15] "l"  "d"  "."  "\n" "W"  "e"  " "  "a"  "r"  "e"  " "  "t"  "h"  "e" 
## [29] " "  "c"  "h"  "i"  "l"  "d"  "r"  "e"  "n"  "!"
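
Two points from this section are worth trying out: split is a regular expression by default, and the result is a list. The sketch below uses made-up inputs to show both.

```r
dotted <- "a.b.c"

as_regex <- strsplit(dotted, ".")[[1]]                  # "." matches any character
as_literal <- strsplit(dotted, ".", fixed = TRUE)[[1]]  # "." matched literally
escaped <- strsplit(dotted, "\\.")[[1]]                 # same result with an escaped regex

as_regex    # five empty strings
as_literal  # "a" "b" "c"

# One list element per input string; unlist() flattens them into one vector
pieces <- strsplit(c("a b", "c d e"), " ")
lengths(pieces)  # 2 3
unlist(pieces)   # "a" "b" "c" "d" "e"
```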

5. Searching strings

Searching strings relies mainly on regular-expression matching. The relevant functions in R are grep(), grepl(), regexpr(), gregexpr(), and regexec().

The usage of grep() and grepl() is:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
grep() returns the indices of the elements of x that match pattern, or the matching elements themselves when value = TRUE. grepl() returns a logical vector of the same length as x indicating whether each element matches. Neither reports where inside an element the match occurs.

text <- c("We are the world", "we are the children")
grep("We", text)  # which elements of text match 'We'

## [1] 1

grep("We", text, invert = T)  # which elements of text do not match 'We'

## [1] 2

grep("we", text, ignore.case = T)  # ignore case when matching

## [1] 1 2

grepl("are", text)  # whether each element of text matches 'are'; returns only TRUE or FALSE

## [1] TRUE TRUE
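
The value argument mentioned above is not demonstrated in the examples, so here is a small sketch. The third string is made up for illustration and is not part of the original data.

```r
lyrics <- c("We are the world", "we are the children", "Heal the world")

matched <- grep("world", lyrics, value = TRUE)  # the matching elements, not their indices
has_world <- grepl("world", lyrics)             # one logical per element

matched
lyrics[!has_world]  # a logical vector is convenient for subsetting
sum(has_world)      # number of elements that match
```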

regexpr(), gregexpr(), and regexec() also search strings. Unlike grep() and grepl(), their results include the position and length of each match, so they can be used to extract substrings. Their usage is:
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)

text <- c("We are the world", "we are the children")
regexpr("e", text)

## [1] 2 2
## attr(,"match.length")
## [1] 1 1
## attr(,"useBytes")
## [1] TRUE

class(regexpr("e", text))

## [1] "integer"

gregexpr("e", text)

## [[1]]
## [1]  2  6 10
## attr(,"match.length")
## [1] 1 1 1
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1]  2  6 10 18
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"useBytes")
## [1] TRUE

class(gregexpr("e", text))

## [1] "list"

regexec("e", text)

## [[1]]
## [1] 2
## attr(,"match.length")
## [1] 1
## 
## [[2]]
## [1] 2
## attr(,"match.length")
## [1] 1

class(regexec("e", text))

## [1] "list"

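regexpr() returns an integer vector with extra attributes, here the length of each match and whether matching was done byte by byte. In the output above, the values 2 and 2 mean that the first match in each element of text starts at the second character, and the match.length attribute gives the length of each match, 1. A value of -1 means no match: regexpr("we", text), for example, returns -1 and 1 with match lengths -1 and 2, because only the second element contains a lowercase "we".
gregexpr() returns the positions of all matches, whereas regexpr() returns only the first match in each element; the result of gregexpr() is a list.
regexec() behaves much like regexpr() in that it returns only the first match, but its result is a list, and in the output above its elements lack the attr(,"useBytes") attribute.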
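
Since these functions report positions and lengths, they can drive an extraction. The sketch below is an illustration rather than the author's method: it uses a simple pattern (a lowercase "w" followed by lowercase letters) and compares manual extraction via substring() with regmatches(), a base R function that does the same bookkeeping.

```r
lyrics <- c("We are the world", "we are the children")

# First run of a lowercase "w" followed by lowercase letters in each element
m <- regexpr("w[a-z]+", lyrics)

# Manual extraction from the start positions and the match.length attribute
starts <- as.integer(m)
lens <- attr(m, "match.length")
by_hand <- substring(lyrics, starts, starts + lens - 1)

# regmatches() performs the same extraction directly
by_regmatches <- regmatches(lyrics, m)

# With gregexpr(), regmatches() returns one character vector per element
all_e <- regmatches(lyrics, gregexpr("e", lyrics))

by_hand
lengths(all_e)
```

Note that regmatches() with a regexpr() result drops elements that have no match, so the manual route is the safer one when the output must stay aligned with the input.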

Besides the searches above, exact matching is sometimes needed, and match() provides it. Its usage is:
match(x, table, nomatch = NA_integer_, incomparables = NULL)
match() returns the index in table of the first element that equals x exactly; otherwise it returns the value of nomatch (NA by default).

text <- c("We are the world", "we are the children", "we")
match("we", text)

## [1] 3

match(2, c(3, 4, 2, 8))

## [1] 3

match("xx", c("abc", "xxx", "xx", "xx"))  # only the index of the first exact match is returned

## [1] 3

match(2, c(3, 4, 2, 8, 2))

## [1] 3

match("xx", c("abc", "xxx"))  # no exact match, so NA is returned

## [1] NA
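
match() is vectorized over x, and the %in% operator is built on it. A brief sketch with made-up lookup values:

```r
lookup <- c("abc", "xxx", "xx", "xx")
queries <- c("xx", "zz", "abc")

idx <- match(queries, lookup)                      # one index (or NA) per query
present <- queries %in% lookup                     # TRUE where match() found something
idx_zero <- match(queries, lookup, nomatch = 0L)   # use 0 instead of NA for misses

idx       # 3 NA 1
present   # TRUE FALSE TRUE
idx_zero  # 3 0 1
```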

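There is also charmatch(), whose usage resembles match() but whose behavior, as the examples below show, is a little odd. It also returns the index in table of the matching string, but it matches from the left end of each string in table (the first character). If there is a single exact match, or no exact match and a single partial match, it returns that index, so an exact match takes precedence over partial ones. If there are several exact matches, or no exact match and several partial matches, it returns 0. If nothing matches from the first character, it returns NA. A related function, pmatch(), works similarly but returns NA rather than 0 for an ambiguous match.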

charmatch("xx", c("abc", "xxa"))

## [1] 2

charmatch("xx", c("abc", "axx"))  # matching starts from the left end

## [1] NA

charmatch("xx", c("xxa", "xxb"))  # not unique

## [1] 0

charmatch("xx", c("xxa", "xxb", "xx"))  # the exact match wins despite two partial matches

## [1] 3

charmatch(2, c(3, 4, 2, 8))

## [1] 3

charmatch(2, c(3, 4, 2, 8, 2))

## [1] 0

I am not sure where such an odd function would be useful; I am rather curious to find out!
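
One plausible use, offered here as a guess rather than anything the post states, is resolving abbreviated option names. The sketch below assumes a made-up set of options, and expand_option() is a hypothetical helper.

```r
choices <- c("mean", "median", "max")  # hypothetical option names

charmatch("med", choices)  # unique partial match: index 2
charmatch("me", choices)   # ambiguous ("mean" and "median"): 0
pmatch("me", choices)      # pmatch() reports the same ambiguity as NA

# Hypothetical helper: expand an abbreviation, or fail if it is unknown or ambiguous
expand_option <- function(abbrev, choices) {
  i <- charmatch(abbrev, choices)
  if (is.na(i)) stop("no option starts with '", abbrev, "'")
  if (i == 0) stop("'", abbrev, "' is ambiguous")
  choices[i]
}

expand_option("max", choices)
expand_option("med", choices)
```

The distinct return values 0 and NA let the caller tell "ambiguous" apart from "unknown", which a plain match() cannot do.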

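6. Replacing strings

sub() and gsub() perform replacement, but they do not modify their input: they return a new object, and the original changes only if the result is assigned back to it, which makes it look like an in-place replacement. (R passes arguments by value, not by reference.)

The usage of sub() and gsub() is: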
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

text <- c("we are the world", "we are the children")
sub("w", "W", text)

## [1] "We are the world"    "We are the children"

gsub("w", "W", text)

## [1] "We are the World"    "We are the children"


sub(" ", "", "abc def ghi")

## [1] "abcdef ghi"

gsub(" ", "", "abc def ghi")

## [1] "abcdefghi"

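As the output shows, sub() replaces only the first match in each element (note the initial letter of "world"), whereas gsub() replaces every match.
Note: gsub() searches each element of the vector and replaces every position that matches the pattern. grep() also searches each element, but it only reports whether an element matches (returning its index in the vector), not how many times it matches. The comparison below is only meant to illustrate this difference and may rarely matter in practice.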

text <- c("we are the world", "we are the children")
grep("w", text)  # returns 1 and 2: both elements contain "w", although the first element contains it twice

## [1] 1 2

gsub("w", "W", text)

## [1] "We are the World"    "We are the children"
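
If the number of matches per element is needed, which grep() cannot provide, one possible approach is to count the matches found by gregexpr(). This is a sketch under that assumption, not something from the original post; lengths() requires R 3.2.0 or later.

```r
lyrics <- c("we are the world", "we are the children")

# gregexpr() finds every match; regmatches() turns positions into the matched strings
w_hits <- regmatches(lyrics, gregexpr("w", lyrics, fixed = TRUE))
w_counts <- lengths(w_hits)  # matches per element

e_counts <- lengths(regmatches(lyrics, gregexpr("e", lyrics, fixed = TRUE)))

w_counts  # 2 1
e_counts  # 3 4
```

Elements without any match yield an empty character vector, so their count comes out as 0.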

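7. Extracting strings

Extracting strings is in some ways similar to splitting them. The usual extraction functions are substr() and substring(). Both extract by position and do not use regular expressions themselves, but combined with regexpr(), gregexpr(), and regexec() they make it easy to pull the required information out of text. Their usage is:
substr(x, start, stop)
substring(text, first, last)
x and text are the character vectors to extract from, start and first are vectors of start positions, and stop and last are vectors of end positions. The two functions differ in the length of the value they return:

substr() returns as many strings as there are elements in its first argument.
substring() returns as many strings as the longest of its three arguments has elements.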

x <- "123456789"
substr(x, c(2, 4), c(4, 5, 8))

## [1] "234"

substring(x, c(2, 4), c(4, 5, 8))

## [1] "234"     "45"      "2345678"


y <- c("12345678", "abcdefgh")
substr(y, c(2, 4), c(4, 5, 8))

## [1] "234" "de"

substring(y, c(2, 4), c(4, 5, 8))

## [1] "234"     "de"      "2345678"

In the output above, x has length 1, so substr() uses only the first value of each of the other two arguments, 2 and 4, regardless of their lengths; these are the start and end positions, and the result is the string "234". substring() instead follows the longest argument, here last. Because first and last differ in length, R's recycling rule for shorter vectors applies: first is extended to c(2, 4, 2), and the function extracts the three substrings from 2 to 4, from 4 to 5, and from 2 to 8.

substring() makes it easy to split a DNA or RNA sequence into codons (three bases per codon).

dna <- paste(sample(c("A", "G", "C", "T"), 12, replace = T), collapse = "")
dna

## [1] "ATAACGCGTGGG"

substring(dna, seq(1, 10, by = 3), seq(3, 12, by = 3))

## [1] "ATA" "ACG" "CGT" "GGG"
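
The call above hard-codes the sequence length of 12. A slightly more general sketch derives the positions from nchar(); split_codons() is a hypothetical helper, and dropping incomplete trailing bases is a design choice made for this example.

```r
# Hypothetical helper: split one sequence into complete codons.
split_codons <- function(seq_string) {
  n <- nchar(seq_string)
  usable <- n - n %% 3                    # ignore one or two trailing bases
  if (usable == 0) return(character(0))   # seq(1, 0, by = 3) would be an error
  starts <- seq(1, usable, by = 3)
  substring(seq_string, starts, starts + 2)
}

split_codons("ATAACGCGTGGG")  # "ATA" "ACG" "CGT" "GGG"
split_codons("ATAACGCG")      # "ATA" "ACG"
```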

8. Formatting strings for output

This topic is somewhat similar to concatenating strings. strtrim() trims strings to a given display width. Its usage is:
strtrim(x, width)
The character vector it returns has the same length as x.

strtrim(c("abcde", "abcde", "abcde"), c(1, 5, 10))

## [1] "a"     "abcde" "abcde"

strtrim(c(1, 123, 12345), 4)  # width is recycled across the elements

## [1] "1"    "123"  "1234"

strtrim() trims each string according to width. If width is larger than the number of characters in a string, the string is left as it is; no padding such as spaces is added.
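
When a fixed column width is wanted, the missing padding has to come from somewhere else. One option, shown here as a sketch with made-up input, is to combine strtrim() with sprintf().

```r
x <- c("a", "abcde", "abcdefghij")

trimmed <- strtrim(x, 5)            # cut anything longer than 5 characters
padded <- sprintf("%-5s", trimmed)  # left-justify and pad with spaces to width 5

padded
nchar(padded)  # every element now has 5 characters
```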

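strwrap() treats a string as a paragraph (single line breaks inside it are ignored; only blank lines separate paragraphs), then indents and wraps it in paragraph style and returns the result line by line. Its usage is:
strwrap(x, width = 0.9 * getOption("width"), indent = 0, exdent = 0, prefix = "", simplify = TRUE, initial = prefix)
Lines are broken at word boundaries so that each line is shorter than width characters (a single word longer than that is not split).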

string <- "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."
string

## [1] "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."

cat(string)

## Each character string in the input is first split into
##  paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries.

strwrap(string)  # the line break is simply ignored

## [1] "Each character string in the input is first split into paragraphs"
## [2] "(or lines containing whitespace only). The paragraphs are then"   
## [3] "formatted by breaking lines at word boundaries."

strwrap(string, width = 40, indent = 4)  # indent the first line

## [1] "    Each character string in the input"
## [2] "is first split into paragraphs (or"    
## [3] "lines containing whitespace only). The"
## [4] "paragraphs are then formatted by"      
## [5] "breaking lines at word boundaries."

strwrap(string, width = 40, exdent = 4)  # indent every line except the first

## [1] "Each character string in the input is" 
## [2] "    first split into paragraphs (or"   
## [3] "    lines containing whitespace only)."
## [4] "    The paragraphs are then formatted" 
## [5] "    by breaking lines at word"         
## [6] "    boundaries."

strwrap(string, width = 40, simplify = F)  # the result is a list rather than a character vector

## [[1]]
## [1] "Each character string in the input is"
## [2] "first split into paragraphs (or lines"
## [3] "containing whitespace only). The"     
## [4] "paragraphs are then formatted by"     
## [5] "breaking lines at word boundaries."

strwrap(string, width = 40, prefix = "******")

## [1] "******Each character string in the"    
## [2] "******input is first split into"       
## [3] "******paragraphs (or lines containing" 
## [4] "******whitespace only). The paragraphs"
## [5] "******are then formatted by breaking"  
## [6] "******lines at word boundaries."
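
The wrapped lines are usually meant to be printed or written out. The sketch below reuses the same example string, checks the line lengths, and joins the lines again; the variable names are made up for illustration.

```r
string <- "Each character string in the input is first split into\n paragraphs (or lines containing whitespace only). The paragraphs are then formatted by breaking lines at word boundaries."

wrapped <- strwrap(string, width = 40)
max(nchar(wrapped))  # every line stays below the requested width

# Join the lines with newlines for cat(), or pass them to writeLines() directly
paragraph <- paste(wrapped, collapse = "\n")
cat(paragraph, "\n")
writeLines(wrapped)
```

Because strwrap() returns one element per output line, writeLines() needs no extra separator, whereas cat() needs the lines joined first.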
