
Using R to check the frequency and real examples of English expressions in COCA, a large corpus of American English

Text processing with R

For the basics, see:
https://sugiura-ken.org/wiki/wiki.cgi/exp?page=R

About the COCA data

https://www.corpusdata.org/formats.asp

The full 950-million-word version is paid, but a free sample of 8.9 million words is available.

https://www.corpusdata.org/coca/samples/coca-samples-text.zip
Download this, unzip it,
and put the files in a folder named COCA.
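
The download and unzip step can also be done from R. The following is only a sketch: `fetch_coca_samples()` is a hypothetical helper name that is not part of the original post, and `junkpaths = TRUE` assumes you want the text files placed directly in the COCA folder regardless of how the archive is organized internally.

```r
# Hypothetical helper (not from the original post): download the free sample
# and unpack it into a folder named COCA.
fetch_coca_samples <- function(
    url = "https://www.corpusdata.org/coca/samples/coca-samples-text.zip",
    dest_dir = "COCA") {
  zip_path <- tempfile(fileext = ".zip")
  # mode = "wb" keeps the binary zip file intact on Windows
  download.file(url, zip_path, mode = "wb")
  # junkpaths = TRUE ignores any folders inside the archive (an assumption
  # about the desired layout), so the files land directly in dest_dir
  unzip(zip_path, exdir = dest_dir, junkpaths = TRUE)
  list.files(dest_dir)
}

# fetch_coca_samples()   # uncomment to run; needs network access
```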

There are eight genres.
https://www.english-corpora.org/coca/help/coca2020_overview.pdf

Checking the files

list.files("COCA")
## [1] "text_acad.txt" "text_blog.txt" "text_fic.txt"  "text_mag.txt" 
## [5] "text_news.txt" "text_spok.txt" "text_tvm.txt"  "text_web.txt"

Each genre is stored as a single text file.

A script that prints each file name with the number of sentences matching an expression: howmany4()

# copyleft 2022 sugiura@nagoya-u.jp

howmany4 <- function(dore){

  file.zenbu <- list.files()             # every file in the current working directory

  kekka <- NULL                          # initialize the variable that collects the results

  for (i in file.zenbu){                 # take the files one at a time, putting the name in i

    tmp <- readLines(i, warn = FALSE)

    tmp <- gsub("[.?!]$", "", tmp)       # remove the punctuation mark at the end of each line

    tmp2 <- strsplit(tmp, "[.?!] ")      # split into sentences

    yomikomi <- unlist(tmp2)             # flatten the list into a character vector

    hit <- grep(dore, yomikomi)          # search the sentences of this file

    kazu <- length(hit)                  # number of matching sentences

    kazu <- paste(i, "\t", kazu)         # join the file name i and the hit count

    kekka <- c(kekka, kazu)              # append kazu to kekka with c()
  }
  return(kekka)
}
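
To see what the function counts without the corpus at hand, here is a small self-contained sketch. `count_hits()` is a hypothetical variant (not the author's function) that takes the folder as an argument instead of relying on the working directory, and the two toy files are invented for the demonstration.

```r
# Hypothetical variant of howmany4(): the folder is passed in explicitly.
count_hits <- function(pattern, dir = ".") {
  files <- list.files(dir, full.names = TRUE)
  counts <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    lines <- gsub("[.?!]$", "", lines)              # drop line-final punctuation
    sentences <- unlist(strsplit(lines, "[.?!] "))  # split into sentences
    length(grep(pattern, sentences))                # number of matching sentences
  }, integer(1))
  names(counts) <- basename(files)
  counts
}

# Invented toy data standing in for the COCA files
demo_dir <- tempfile("coca_demo")
dir.create(demo_dir)
writeLines("I took a picture . Then I left . She takes a picture too .",
           file.path(demo_dir, "text_a.txt"))
writeLines("Nothing to see here .", file.path(demo_dir, "text_b.txt"))

demo_counts <- count_hits("a picture", demo_dir)
print(demo_counts)
```

Note that the count is the number of sentences containing a match, not the number of matches, so a sentence with two occurrences is counted once.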


Example run: looking up the idiom “a monkey on your back”

setwd("COCA")    # howmany4() looks at the files in the current working directory

coca.monkey <- howmany4("a monkey on your back")

coca.monkey
## [1] "text_acad.txt \t 0" "text_blog.txt \t 0" "text_fic.txt \t 0" 
## [4] "text_mag.txt \t 0"  "text_news.txt \t 0" "text_spok.txt \t 0"
## [7] "text_tvm.txt \t 0"  "text_web.txt \t 0"
  • Not a single example.

Looking up “a monkey on”

setwd("COCA")    # skip if the working directory is already COCA

coca.monkey2 <- howmany4("a monkey on ")

coca.monkey2
## [1] "text_acad.txt \t 0" "text_blog.txt \t 0" "text_fic.txt \t 0" 
## [4] "text_mag.txt \t 0"  "text_news.txt \t 0" "text_spok.txt \t 0"
## [7] "text_tvm.txt \t 0"  "text_web.txt \t 1"
  • There is one example.


A script that prints the matching sentences together with the file name: whichOne()

# copyleft 2022 sugiura@nagoya-u.jp

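whichOne <- function(dore){

  file.zenbu <- list.files()             # every file in the current working directory

  ruiseki <- character(0)                # accumulates the labelled sentences from all files

  for (i in file.zenbu){

    chunk <- readLines(i, warn = FALSE)

    tmp <- gsub("[.?!]$", "", chunk)     # remove the punctuation mark at the end of each line

    tmp2 <- strsplit(tmp, "[.?!] ")      # split into sentences

    tmp3 <- lapply(tmp2, function(y){return (paste(i, ":", y))})  # prefix each sentence with the file name

    yomikomi <- unlist(tmp3)

    ruiseki <- c(ruiseki, yomikomi)      # append what was read to ruiseki with c()

  }

  result <- grep(dore, ruiseki, value = TRUE)   # keep the sentences that match

  return(result)

}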
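
The sentence splitting used in both functions is a simple heuristic: it cuts wherever a period, question mark, or exclamation mark is followed by a space. The sketch below applies the same two steps to one made-up line (not taken from the corpus) to show the effect, including the fact that an abbreviation followed by a period is also treated as a sentence end.

```r
# Made-up line in a space-tokenized style; "demo.txt" is a placeholder name.
line <- "They 've conquered Mt . Everest . Can I take your picture ? Yes !"

no_final  <- gsub("[.?!]$", "", line)               # drop the final mark
sentences <- unlist(strsplit(no_final, "[.?!] "))   # cut at mark + space
labelled  <- paste("demo.txt", ":", sentences)      # same labelling as whichOne()

print(labelled)
```

The first piece ends at "Mt", so results that look truncated after an abbreviation are a side effect of this splitting rule rather than of the search pattern.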


Looking up sentences that contain “a monkey on”

setwd("COCA")    # skip if the working directory is already COCA

kekka <- whichOne("a monkey on ")

kekka
## [1] "text_web.txt : Last yr/like a salmon quitting the cold ocean-leaping and bucking up his birth stream/I hitchhiked my way from LA with 16 caps in my pocket and a monkey on my back "
  • It was used in English on the web: “I hitchhiked my way from LA with 16 caps in my pocket and a monkey on my back”.
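
The one hit has "my" rather than "your", which is why the search for the full idiom found nothing. A pattern that lets the possessive vary would catch both forms. This is a sketch: the list of possessives is an illustrative assumption, and the second and third test strings are invented.

```r
# The possessive list is illustrative, not exhaustive.
monkey_pattern <- "a monkey on (my|your|his|her|its|our|their) back"

examples <- c(
  "with 16 caps in my pocket and a monkey on my back",  # from the web hit
  "you have a monkey on your back",                     # invented
  "a monkey on the roof"                                # invented
)

monkey_hits <- grepl(monkey_pattern, examples)
print(data.frame(examples, monkey_hits))
```

The same pattern string can be passed to howmany4() or whichOne() in place of the fixed phrase.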

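Let's look for variations of the expression "take a picture".

  1. List the possible forms of take: take, takes, took, taken, taking

  2. picture can also be plural: picture, pictures

  3. A word may come between take and picture: the regular expression \\w* (zero or more word characters; the backslash is doubled because it is written inside an R string)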
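
Before running the search on the corpus, it can help to try the pattern built from these three steps on a few short strings. The test strings below are invented. The second pattern is an assumed alternative, not the one used in this post: it makes the middle word optional and adds word boundaries.

```r
# Pattern built from steps 1 to 3 above
pattern <- "(take|takes|took|taken|taking) \\w* (picture|pictures)"

examples <- c("take a picture", "took his picture", "taking better pictures",
              "take pictures", "take a nice picture", "mistake a picture")

original_match <- grepl(pattern, examples)

# Assumed stricter alternative: optional middle word, word boundaries
strict <- "\\b(take|takes|took|taken|taking)( \\w+)? pictures?\\b"
strict_match <- grepl(strict, examples, perl = TRUE)

print(data.frame(examples, original_match, strict_match))
```

With the original pattern, "take pictures" is missed because the pattern needs two spaces, and "mistake a picture" is matched because nothing stops take from matching inside a longer word. Neither pattern allows two intervening words, so "take a nice picture" is missed by both.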

setwd("COCA")    # skip if the working directory is already COCA

coca.picture <- howmany4("(take|takes|took|taken|taking) \\w* (picture|pictures)")

coca.picture
## [1] "text_acad.txt \t 0" "text_blog.txt \t 1" "text_fic.txt \t 4" 
## [4] "text_mag.txt \t 5"  "text_news.txt \t 2" "text_spok.txt \t 6"
## [7] "text_tvm.txt \t 19" "text_web.txt \t 4"

The raw counts differ considerably by genre: 19 in text_tvm.txt but none in text_acad.txt.
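
Because howmany4() returns strings, comparing the genres is easier after turning the result into a table. The sketch below re-enters the output shown above by hand and splits each string at the space, tab, space separator that paste() produced. These are raw counts only. The post does not give the size of each genre file, so no normalization per million words is attempted here.

```r
# The values are copied from the output above.
coca_picture <- c("text_acad.txt \t 0", "text_blog.txt \t 1", "text_fic.txt \t 4",
                  "text_mag.txt \t 5",  "text_news.txt \t 2", "text_spok.txt \t 6",
                  "text_tvm.txt \t 19", "text_web.txt \t 4")

parts <- strsplit(coca_picture, " \t ", fixed = TRUE)   # file name / count

counts <- data.frame(
  file = vapply(parts, `[`, character(1), 1),
  hits = as.integer(vapply(parts, `[`, character(1), 2))
)

counts <- counts[order(-counts$hits), ]   # most frequent genre first
total_hits <- sum(counts$hits)

print(counts)
print(total_hits)
```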

Looking at the actual sentences

setwd("COCA")    # skip if the working directory is already COCA

kekka.picture <- whichOne("(take|takes|took|taken|taking) \\w* (picture|pictures)")

kekka.picture
##  [1] "text_blog.txt : Please just remember that someone 's worked hard to create a recipe or take a picture , so give credit where credit is due "                                                                                                                                                                                                                                                                                                                                                                                                                                                       
##  [2] "text_fic.txt : <p> \" Only if you take your pictures after me , \" I say , and they hedge and demur , and push me into the shiny plastic booth "                                                                                                                                                                                                                                                                                                                                                                                                                                                   
##  [3] "text_fic.txt : \" She took his picture , his legs spread a little , smiling that young , rakish smile of his , and then he took hers "                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [4] "text_fic.txt : \" In those days , who could take a picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
##  [5] "text_fic.txt : I do n't mean you personally , but the man who took your picture for the photography exhibit "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
##  [6] "text_mag.txt : Next , he takes a picture of my left profile "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
##  [7] "text_mag.txt : There are a number of non -- water park areas where kids can climb ( on Cookie Mountain , a vinyl cone that participants slide down once they reach the top ) , bounce ( on Ernie 's Bed , a huge , springy air mattress that 's perfect for jumping and leaping ) , and enjoy an assortment of slides , mazes , tunnels , ropes , sand toys , and oversized @ @ @ @ @ @ @ @ @ @ each visit is to climb the dozens of steps to the top ( or face ) of the Big Bird statue , where they yell down for us to take their picture while they proudly stand as if they 've conquered Mt "
##  [8] "text_mag.txt : Customers took the pictures and sent the whole camera back to Rochester for developing and reloading "                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [9] "text_mag.txt : Says ABC News Washington bureau chief George Watson : \" Literally interpreted , the proposed rules say you could n't take a picture of a wounded soldier "                                                                                                                                                                                                                                                                                                                                                                                                                         
## [10] "text_mag.txt : Regardless of how much money you spent on your laptop , it 's wise to keep receipts related to your purchase , take a picture of your laptop , and register it @ @ @ @ @ @ @ @ @ @ That way , losing your laptop wo n't have to be a huge financial burden on top of all the unavoidable hassles you 'll face "                                                                                                                                                                                                                                                                     
## [11] "text_news.txt : Thomas asked MacIntosh to take a picture as a souvenir "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [12] "text_news.txt : \" <p> Zeigfinger stopped eating animal products for ethical reasons ; she also wo n't wear leather shoes and asks that a photographer not take her picture sitting on a leather chair "                                                                                                                                                                                                                                                                                                                                                                                           
## [13] "text_spok.txt : @!Mr-M-HIGGS : Well , I took this picture , and as you can see , they 're like two school kids "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [14] "text_spok.txt : ABRAHAM PALASIOS : I could take a picture with me bloody and everything , and I could take a picture from- ISRAEL TAPEA : But she was not bloody or nothing "                                                                                                                                                                                                                                                                                                                                                                                                                      
## [15] "text_spok.txt : They 've been people who 've been interested in changing lives whether it 's Henry Ford enabling @ @ @ @ @ @ @ @ @ @ a picnic or George Eastman producing the Brownie camera to take a picture of that picnic "                                                                                                                                                                                                                                                                                                                                                                    
## [16] "text_spok.txt : So she , in this paper , one of the papers , she took a picture of herself in a red bathrobe"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [17] "text_spok.txt : And I 've literally had to tell people , like , please do n't take , I remember a really good friend of mine took a picture like that on my birthday , and I was , like , can we not , like , make a face like this when I 'm , like , holding a martini and wearing a low-cut dress "                                                                                                                                                                                                                                                                                             
## [18] "text_spok.txt : @1AL-ROKER@2 : And dad 's taking a picture right there "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [19] "text_tvm.txt : Can I take your picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [20] "text_tvm.txt : ioney , take another picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [21] "text_tvm.txt : - They took our picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [22] "text_tvm.txt : They took our picturei i @ @ @ @ @ @ @ @ @ @ here "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [23] "text_tvm.txt : I take the picture , make sure it gets sliiped to her , she uses it "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [24] "text_tvm.txt : Peyton , Skinner and some friends stopped for e take a picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
## [25] "text_tvm.txt : I do n't want them taking my picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [26] "text_tvm.txt : Well , what did you take his picture for "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [27] "text_tvm.txt : I hate it when people take my picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [28] "text_tvm.txt : Please do n't take any pictures "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [29] "text_tvm.txt : I promise you I will not take a picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [30] "text_tvm.txt : - He 's taking my picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [31] "text_tvm.txt : I asked you not to take any pictures "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [32] "text_tvm.txt : You said that you would n't take any pictures "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [33] "text_tvm.txt : I 'm going to take my picture , pay my 17.50 and be @ @ @ @ @ @ @ @ @ @ to cut costs , I was trying to cut class "                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [34] "text_tvm.txt : You took a picture of what happened "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [35] "text_tvm.txt : - It was n't actual - Hey , Frank , take a picture - Come on , come on , come on , come on , come on - They should give you the show You are better than the other guy - Kevin "                                                                                                                                                                                                                                                                                                                                                                                                    
## [36] "text_tvm.txt : Oh , we 're gon na take a picture here "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [37] "text_tvm.txt : Felix took some pictures "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [38] "text_web.txt : <p> \" There are a few girls waiting ; sign a few stuffs and take some pictures but be quick , \" Paul instructed and they were joined by a group of security "                                                                                                                                                                                                                                                                                                                                                                                                                     
## [39] "text_web.txt : You 'll all meet the President of the United States , and we 'll take the picture to prove it "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [40] "text_web.txt : One of our employees may be watching with a camera , ready to take that picture "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [41] "text_web.txt : My 5MP camera started taking better pictures than your 10MP shooter "
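
To summarize which words fill the slot between take and picture, the matched part of each sentence can be pulled out with regmatches(). The sketch below uses five of the sentences above, copied by hand. With the real result you would use kekka.picture instead of `hits`.

```r
# Five sentences copied from the output above
hits <- c(
  "text_tvm.txt : Can I take your picture ",
  "text_tvm.txt : Please do n't take any pictures ",
  "text_tvm.txt : Felix took some pictures ",
  "text_mag.txt : Next , he takes a picture of my left profile ",
  "text_news.txt : Thomas asked MacIntosh to take a picture as a souvenir "
)

pattern <- "(take|takes|took|taken|taking) \\w* (picture|pictures)"

matched <- regmatches(hits, regexpr(pattern, hits))        # first match per sentence
middle  <- vapply(strsplit(matched, " "), `[`, character(1), 2)  # the word in between
middle_counts <- table(middle)

print(matched)
print(middle_counts)
```

regexpr() returns only the first match in each sentence, so a sentence with two occurrences contributes one entry.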

<<<<< End <<<<<
杉浦正利
