Using R to look up the frequency and real examples of English expressions in COCA, a large corpus of American English
Text processing with R
For the basics, see:
https://sugiura-ken.org/wiki/wiki.cgi/exp?page=R
About the COCA data
https://www.corpusdata.org/formats.asp
Even without the paid 950-million-word version, a sample of 8.9 million words can be used for free.
https://www.corpusdata.org/coca/samples/coca-samples-text.zip
Download this file, unzip it, and put the contents in a folder named COCA.
The data cover eight genres.
https://www.english-corpora.org/coca/help/coca2020_overview.pdf
Checking the files
list.files("COCA")
## [1] "text_acad.txt" "text_blog.txt" "text_fic.txt" "text_mag.txt"
## [5] "text_news.txt" "text_spok.txt" "text_tvm.txt" "text_web.txt"
Each genre is stored in a single text file.
Script that prints each file name with the number of matches for an expression: howmany4()
# copyleft 2022 sugiura@nagoya-u.jp
howmany4 <- function(dore){
  file.zenbu <- list.files()             # every file in the working directory
  kekka <- NULL                          # initialize the variable that collects the results
  for (i in file.zenbu){                 # take the file names one at a time as i
    tmp <- readLines(i, warn = FALSE)
    tmp <- gsub("[.?!]$", "", tmp)       # remove the punctuation mark at the end of each line
    tmp2 <- strsplit(tmp, "[.?!] ")      # split into sentences
    yomikomi <- unlist(tmp2)             # flatten the list into a character vector
    hit <- grep(dore, yomikomi)          # search the sentences of this file only
    kazu <- length(hit)
    kazu <- paste(i, "\t", kazu)         # join the file name i and the number of hits
    kekka <- c(kekka, kazu)              # append kazu to kekka with c()
  }
  return(kekka)
}
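The same counting can also be returned as a data frame, which is easier to sort or plot than tab-joined strings. The following is only a sketch under stated assumptions: the names split_sentences() and count_hits() and the dir argument are hypothetical and not part of the original script, and a temporary folder with two toy files stands in for the COCA folder.

```r
# Hypothetical helpers (not in the original script): same splitting rule as howmany4()
split_sentences <- function(lines) {
  lines <- gsub("[.?!]$", "", lines)       # drop the punctuation mark at the end of each line
  unlist(strsplit(lines, "[.?!] "))        # split at punctuation followed by a space
}

count_hits <- function(pattern, dir = ".") {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  hits <- vapply(files, function(f) {
    sentences <- split_sentences(readLines(f, warn = FALSE))
    length(grep(pattern, sentences))       # number of matching sentences in this file
  }, integer(1))
  data.frame(file = basename(files), hits = unname(hits),
             stringsAsFactors = FALSE)
}

# Toy files (invented text) so that the sketch runs without the COCA data
demo_dir <- file.path(tempdir(), "coca_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("I took a picture . He left .", "Take my picture !"),
           file.path(demo_dir, "text_a.txt"))
writeLines("Nothing relevant here .", file.path(demo_dir, "text_b.txt"))

demo_counts <- count_hits("picture", demo_dir)
print(demo_counts)
```

Passing the folder as an argument avoids calling setwd() before every search; with the real data the call would be count_hits("a monkey on ", "COCA").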
Example run: looking up the idiom “a monkey on your back”
setwd("COCA")
coca.monkey <- howmany4("a monkey on your back")
coca.monkey
## [1] "text_acad.txt \t 0" "text_blog.txt \t 0" "text_fic.txt \t 0"
## [4] "text_mag.txt \t 0" "text_news.txt \t 0" "text_spok.txt \t 0"
## [7] "text_tvm.txt \t 0" "text_web.txt \t 0"
There is not a single example.
Searching for “a monkey on”
setwd("COCA")
coca.monkey2 <- howmany4("a monkey on ")
coca.monkey2
## [1] "text_acad.txt \t 0" "text_blog.txt \t 0" "text_fic.txt \t 0"
## [4] "text_mag.txt \t 0" "text_news.txt \t 0" "text_spok.txt \t 0"
## [7] "text_tvm.txt \t 0" "text_web.txt \t 1"
There is one example.
Script that prints the matching sentences together with their file names: whichOne()
# copyleft 2022 sugiura@nagoya-u.jp
whichOne <- function(dore){
  file.zenbu <- list.files()
  ruiseki <- character(0)                # accumulates the sentences of all files
  for (i in file.zenbu){
    chunk <- readLines(i, warn = FALSE)
    tmp <- gsub("[.?!]$", "", chunk)     # remove the punctuation mark at the end of each line
    tmp2 <- strsplit(tmp, "[.?!] ")      # split into sentences
    tmp3 <- lapply(tmp2, function(y){return (paste(i, ":", y))}) # prefix each sentence with the file name
    yomikomi <- unlist(tmp3)
    ruiseki <- c(ruiseki, yomikomi)      # append what was read to ruiseki with c()
  }
  result <- grep(dore, ruiseki, value = TRUE) # value = TRUE returns the matching sentences themselves
  return(result)
}
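Both scripts rely on the same two-step rule for sentence splitting. The following sketch applies that rule to a single invented line (not taken from COCA) to show what it produces.

```r
# Invented example line, tokenized with spaces before punctuation like the corpus files
line <- "We climbed Mt. Fuji . Was it hard ? Yes !"

step1 <- gsub("[.?!]$", "", line)              # step 1: drop the mark at the end of the line
sentences <- unlist(strsplit(step1, "[.?!] ")) # step 2: split at a mark followed by a space
print(sentences)
```

The rule is simple, so it also splits at the period of an abbreviation such as "Mt." and it discards the punctuation mark itself. This is why the sentences printed by whichOne() have no final period and may occasionally break off at an abbreviation.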
Finding the sentences that contain “a monkey on”
setwd("COCA")
kekka <- whichOne("a monkey on ")
kekka
## [1] "text_web.txt : Last yr/like a salmon quitting the cold ocean-leaping and bucking up his birth stream/I hitchhiked my way from LA with 16 caps in my pocket and a monkey on my back "
The expression was used in English text on the web: “I hitchhiked my way from LA with 16 caps in my pocket and a monkey on my back”.
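Next, look for variations of the expression "take a picture".
List the possible inflected forms of take: take, takes, took, taken, taking
picture can also be plural: picture, pictures
A word may come between take and picture: written as \\w* in the regular expression (the pattern has a space on each side of it, so it matches one intervening word, as in "take a picture")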
setwd("COCA")
coca.picture <- howmany4("(take|takes|took|taken|taking) \\w* (picture|pictures)")
coca.picture
## [1] "text_acad.txt \t 0" "text_blog.txt \t 1" "text_fic.txt \t 4"
## [4] "text_mag.txt \t 5" "text_news.txt \t 2" "text_spok.txt \t 6"
## [7] "text_tvm.txt \t 19" "text_web.txt \t 4"
The frequency differs considerably from genre to genre, from 0 hits in text_acad.txt to 19 in text_tvm.txt.
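What the pattern does and does not match can be checked on a few toy sentences. The sentences below are invented for illustration, and the second pattern (variant) is only an assumed alternative, not the author's: it makes the intervening word optional and adds word boundaries with \\b.

```r
# Invented test sentences (not from COCA)
toy <- c("She took a picture",
         "They take pictures every day",
         "He is taking some pictures",
         "It was a mistake a picture could not fix")

# Pattern used in the text above
original <- "(take|takes|took|taken|taking) \\w* (picture|pictures)"
# Assumed variant: optional intervening word, whole words only
variant <- "\\b(take|takes|took|taken|taking)( \\w+)? pictures?\\b"

orig_match <- grepl(original, toy)
variant_match <- grepl(variant, toy, perl = TRUE)
print(data.frame(toy, orig_match, variant_match))
```

The original pattern skips "take pictures" because nothing stands between the two spaces it requires, and it accepts "mistake a picture" because take is not anchored to a word boundary. Whether the looser or the stricter pattern is preferable depends on what should be counted.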
Looking at the actual sentences
setwd("COCA")
kekka.picture <- whichOne("(take|takes|took|taken|taking) \\w* (picture|pictures)")
kekka.picture
## [1] "text_blog.txt : Please just remember that someone 's worked hard to create a recipe or take a picture , so give credit where credit is due "
## [2] "text_fic.txt : <p> \" Only if you take your pictures after me , \" I say , and they hedge and demur , and push me into the shiny plastic booth "
## [3] "text_fic.txt : \" She took his picture , his legs spread a little , smiling that young , rakish smile of his , and then he took hers "
## [4] "text_fic.txt : \" In those days , who could take a picture "
## [5] "text_fic.txt : I do n't mean you personally , but the man who took your picture for the photography exhibit "
## [6] "text_mag.txt : Next , he takes a picture of my left profile "
## [7] "text_mag.txt : There are a number of non -- water park areas where kids can climb ( on Cookie Mountain , a vinyl cone that participants slide down once they reach the top ) , bounce ( on Ernie 's Bed , a huge , springy air mattress that 's perfect for jumping and leaping ) , and enjoy an assortment of slides , mazes , tunnels , ropes , sand toys , and oversized @ @ @ @ @ @ @ @ @ @ each visit is to climb the dozens of steps to the top ( or face ) of the Big Bird statue , where they yell down for us to take their picture while they proudly stand as if they 've conquered Mt "
## [8] "text_mag.txt : Customers took the pictures and sent the whole camera back to Rochester for developing and reloading "
## [9] "text_mag.txt : Says ABC News Washington bureau chief George Watson : \" Literally interpreted , the proposed rules say you could n't take a picture of a wounded soldier "
## [10] "text_mag.txt : Regardless of how much money you spent on your laptop , it 's wise to keep receipts related to your purchase , take a picture of your laptop , and register it @ @ @ @ @ @ @ @ @ @ That way , losing your laptop wo n't have to be a huge financial burden on top of all the unavoidable hassles you 'll face "
## [11] "text_news.txt : Thomas asked MacIntosh to take a picture as a souvenir "
## [12] "text_news.txt : \" <p> Zeigfinger stopped eating animal products for ethical reasons ; she also wo n't wear leather shoes and asks that a photographer not take her picture sitting on a leather chair "
## [13] "text_spok.txt : @!Mr-M-HIGGS : Well , I took this picture , and as you can see , they 're like two school kids "
## [14] "text_spok.txt : ABRAHAM PALASIOS : I could take a picture with me bloody and everything , and I could take a picture from- ISRAEL TAPEA : But she was not bloody or nothing "
## [15] "text_spok.txt : They 've been people who 've been interested in changing lives whether it 's Henry Ford enabling @ @ @ @ @ @ @ @ @ @ a picnic or George Eastman producing the Brownie camera to take a picture of that picnic "
## [16] "text_spok.txt : So she , in this paper , one of the papers , she took a picture of herself in a red bathrobe"
## [17] "text_spok.txt : And I 've literally had to tell people , like , please do n't take , I remember a really good friend of mine took a picture like that on my birthday , and I was , like , can we not , like , make a face like this when I 'm , like , holding a martini and wearing a low-cut dress "
## [18] "text_spok.txt : @1AL-ROKER@2 : And dad 's taking a picture right there "
## [19] "text_tvm.txt : Can I take your picture "
## [20] "text_tvm.txt : ioney , take another picture "
## [21] "text_tvm.txt : - They took our picture "
## [22] "text_tvm.txt : They took our picturei i @ @ @ @ @ @ @ @ @ @ here "
## [23] "text_tvm.txt : I take the picture , make sure it gets sliiped to her , she uses it "
## [24] "text_tvm.txt : Peyton , Skinner and some friends stopped for e take a picture "
## [25] "text_tvm.txt : I do n't want them taking my picture "
## [26] "text_tvm.txt : Well , what did you take his picture for "
## [27] "text_tvm.txt : I hate it when people take my picture "
## [28] "text_tvm.txt : Please do n't take any pictures "
## [29] "text_tvm.txt : I promise you I will not take a picture "
## [30] "text_tvm.txt : - He 's taking my picture "
## [31] "text_tvm.txt : I asked you not to take any pictures "
## [32] "text_tvm.txt : You said that you would n't take any pictures "
## [33] "text_tvm.txt : I 'm going to take my picture , pay my 17.50 and be @ @ @ @ @ @ @ @ @ @ to cut costs , I was trying to cut class "
## [34] "text_tvm.txt : You took a picture of what happened "
## [35] "text_tvm.txt : - It was n't actual - Hey , Frank , take a picture - Come on , come on , come on , come on , come on - They should give you the show You are better than the other guy - Kevin "
## [36] "text_tvm.txt : Oh , we 're gon na take a picture here "
## [37] "text_tvm.txt : Felix took some pictures "
## [38] "text_web.txt : <p> \" There are a few girls waiting ; sign a few stuffs and take some pictures but be quick , \" Paul instructed and they were joined by a group of security "
## [39] "text_web.txt : You 'll all meet the President of the United States , and we 'll take the picture to prove it "
## [40] "text_web.txt : One of our employees may be watching with a camera , ready to take that picture "
## [41] "text_web.txt : My 5MP camera started taking better pictures than your 10MP shooter "
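To summarize which words fill the slot between take and picture, the middle word can be captured and tabulated. This is a minimal sketch under stated assumptions: the vector examples holds five of the sentences printed above, copied by hand, and the extra pair of parentheses around \\w* is added here only to capture the word; with the real data, the result of whichOne() would take the place of examples.

```r
# Five sentences copied by hand from the output above
examples <- c("text_news.txt : Thomas asked MacIntosh to take a picture as a souvenir ",
              "text_tvm.txt : Can I take your picture ",
              "text_tvm.txt : Please do n't take any pictures ",
              "text_tvm.txt : I promise you I will not take a picture ",
              "text_tvm.txt : Felix took some pictures ")

# Same pattern as above, with the middle word in its own capture group (assumption)
pattern <- "(take|takes|took|taken|taking) (\\w*) (picture|pictures)"

# regexec() returns the full match followed by each capture group
m <- regmatches(examples, regexec(pattern, examples))
middle <- vapply(m, function(x) if (length(x) > 0) x[3] else NA_character_,
                 character(1))

middle_counts <- sort(table(middle), decreasing = TRUE)
print(middle_counts)
```

In this small hand-picked set, "a" occurs twice and "any", "some", and "your" once each; the counts say nothing about the full sample until the same steps are run on all 41 sentences.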
<<<<< End <<<<<
杉浦正利