Working through NLP 100 Exercise 2020 (Rev 2) (言語処理100本ノック): Chapter 1, Warm-up

July 23, 2022, 11:33

I expect to need data-wrangling skills for upcoming data-utilization projects, so I worked through these exercises as practice.
NLP 100 Exercise 2020 (Rev 2): https://nlp100.github.io/ja/

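00. Reversing a string

Obtain the string in which the characters of "stressed" are arranged in reverse order (from the last character to the first).

Approach
Implemented with slicing, which is available on sequence types (lists, tuples, strings, and so on).
s[start : end : step]
start: index where the slice begins; end: index where it stops (exclusive); step: interval between extracted elements
When step is negative, elements are taken from back to front. For reference, the diagram below shows the positive and negative index positions on a string.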

 +---+---+---+---+---+---+
 | P | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1

https://docs.python.org/ja/3/tutorial/introduction.html#strings
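
To see how start, end, and step interact, here is a minimal sketch using the "Python" string from the diagram. The variable names are only for illustration and are not part of the exercise.

```python
# Minimal slicing sketch; `word` and the result names are illustrative.
word = "Python"

first_three = word[0:3]      # indices 0, 1, 2 -> "Pyt"
every_other = word[::2]      # step 2 from the start -> "Pto"
reversed_word = word[::-1]   # negative step walks back to front -> "nohtyP"
back_slice = word[-1:-4:-1]  # from the last character, three characters backwards -> "noh"

print(first_three, every_other, reversed_word, back_slice)
```

Omitting start and end with a negative step, as in word[::-1], covers the whole sequence from the last element to the first.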

Code

text = "stressed"

new_text = text[::-1]
print(new_text)

Output

desserts

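01. "パタトクカシーー"

Obtain the string formed by taking the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenating them.

Approach
Taking the 1st, 3rd, 5th, and 7th characters is the same as taking every other character starting from the first,
so this reuses the slicing from the previous problem with a step of 2.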
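
As a sanity check on that equivalence, the sketch below picks the characters by their explicit 1-based positions and compares the result with the step-2 slice. The name picked is just an illustrative variable.

```python
# Sketch comparing explicit (1-based) positions with a step-2 slice.
text = "パタトクカシーー"

# Positions 1, 3, 5, 7 in the problem are indices 0, 2, 4, 6 in Python.
picked = "".join(text[pos - 1] for pos in (1, 3, 5, 7))

print(picked)
print(picked == text[::2])
```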

Code

text = "パタトクカシーー"

print(text[::2])

Output

パトカー

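02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by concatenating the characters of "パトカー" and "タクシー" alternately, starting from the first character of each.

Approach
Implemented with a for loop.
Each iteration takes one character from each of the two strings, joins the pair, and appends it to New_text.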
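
One possible alternative to indexing by position is to pair the characters with the built-in zip(). This is only a sketch of that variation, and interleaved is an illustrative name.

```python
# Alternative sketch: pair characters with zip() rather than indexing.
text1 = "パトカー"
text2 = "タクシー"

# zip() yields ("パ", "タ"), ("ト", "ク"), ... and stops at the shorter input.
interleaved = "".join(a + b for a, b in zip(text1, text2))

print(interleaved)
```

Because zip() stops at the shorter input, this variant silently drops leftover characters when the two strings differ in length, whereas indexing with range(len(text1)) would raise an IndexError if text2 were shorter.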

Code

text1 = "パトカー"
text2 = "タクシー"

# Append one character from each string per iteration
New_text = ""
for i in range(len(text1)):
    New_text += text1[i]+text2[i]

print(New_text)

Output

パタトクカシーー

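03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

Approach
The commas and the period in the sentence must not be counted, so the processing runs in the following order.
① Remove the punctuation characters
② Split the sentence into words and store them in a list
③ Build the list of character counts

① uses the str.translate method. It takes a translation table as its argument, so the table is built beforehand with str.maketrans.
The table is built from string.punctuation, which holds the punctuation characters.
(print(string.punctuation) shows that it contains !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)
② uses split().
③ takes each word in a for loop, gets its length with len(), and appends it to a list.
https://www.techiedelight.com/ja/remove-punctuations-string-python/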
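
To make the translation-table step more concrete, here is a small sketch on a made-up input. With three arguments, str.maketrans maps every character in the third argument to None, so translate() deletes those characters. The sample string is an assumption for illustration only.

```python
import string

# Sketch of the three-argument form of str.maketrans: every character in the
# third argument is mapped to None, so translate() deletes it.
table = str.maketrans("", "", string.punctuation)

sample = "Hello, world!"  # illustrative input, not from the exercise
cleaned = sample.translate(table)

print(cleaned)
print([len(word) for word in cleaned.split()])
```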

Code

import string

sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

# Remove punctuation
## Build the translation table
tbl = str.maketrans("","",string.punctuation)
## Apply the translation
sentence_re = sentence.translate(tbl)

# Split into words
words = sentence_re.split()

# Build the list of character counts
numlist = []
for word in words:
    numlist.append(len(word))

# Show the list
print(numlist)

Output

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

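04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of words 1, 5, 6, 7, 8, 9, 15, 16, and 19 and the first two characters of every other word, then create an associative array (dictionary or map type) from each extracted string to the position of its word (its number counted from the start of the sentence).

Approach
Splitting the sentence into words works the same way as in the previous problem
(remove punctuation, then split into a list of words).
Extracting the characters from each word and building the dictionary is handled with a for loop and an if statement.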
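
As one possible variation on the for/if approach, the sketch below uses enumerate() with start=1 to get 1-based word positions and builds the mapping from extracted string to position, as the problem statement asks. The names one_char_positions and symbols are illustrative assumptions.

```python
# Sketch: enumerate(..., start=1) gives 1-based word positions directly.
# `one_char_positions` and `symbols` are illustrative names.
sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations "
            "Might Also Sign Peace Security Clause. Arthur King Can.")
one_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# Slicing to 1 or 2 characters never reaches the trailing periods,
# so punctuation removal is not needed in this particular variant.
symbols = {
    word[:(1 if pos in one_char_positions else 2)]: pos
    for pos, word in enumerate(sentence.split(), start=1)
}

print(symbols["H"], symbols["Mi"], symbols["Ca"])
```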

Code

import string

# Split the sentence into words (same as problem 03)
sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
tbl = str.maketrans("","",string.punctuation)
sentence_re = sentence.translate(tbl)
words = sentence_re.split()

# Extract the characters and build a dictionary from extracted string to word position
# (named symbols rather than dict so the built-in dict is not shadowed)
symbols = {}
for i in range(0,len(words)):
    word = words[i]
    if i+1 in (1, 5, 6, 7, 8, 9, 15, 16, 19):
        symbols[word[0]] = i + 1
    else:
        symbols[word[:2]] = i + 1

print(symbols)

Output

{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}

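05. n-gram

Write a function that builds n-grams from a given sequence (a string, a list, and so on). Use this function to obtain the word bi-grams and character bi-grams of the sentence "I am an NLPer".

Approach
Following the quotation below, I understand an N-gram as a chunk of N consecutive units (words or characters), produced by sliding the starting position forward one unit at a time.
Word bi-gram: slide one word at a time, forming chunks of two words
  -> here: [I,am], [am,an], [an,NLPer]
Character bi-gram: slide one character at a time, forming chunks of two characters
  -> here: [I, ], [ ,a], [a,m], [m, ], [ ,a] ... [e,r]
  Note: a space is also treated as a character.
The problem asks for "a function that builds n-grams from a sequence (a string, a list, and so on)", so a single function has to handle both cases.
Strings and lists both support slicing, so a for loop advances the start index i one at a time and target[i:i+n] takes n elements beginning at i.

"An N-gram is a co-occurrence relation (collocation) in which linguistic units in a text (characters, morphemes, parts of speech, and so on) occur adjacently in runs of two units, three units, or in general N units (called 2-grams, 3-grams, and N-grams respectively), and it can be regarded as showing one aspect of a document's characteristics. For example, when the units in a text are arranged as abcdefghijk..., the 2-grams are ab, bc, cd, de, ef, fg, gh, hi, ij, jk, ..., and the 3-grams are abc, bcd, cde, def, efg, fgh, ghi, hij, ...."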

https://www.isc.meiji.ac.jp/~mizutani/mining/n_gram.html
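
The sliding window from the quotation can be sketched as a list comprehension and tried on the same abcdefghijk example. The function name ngram_sketch is an illustrative assumption, not the function used in the solution.

```python
# Sketch of the sliding window; `ngram_sketch` is an illustrative name.
def ngram_sketch(target, n):
    # A sequence of length L has L - n + 1 windows of width n (none if n > L).
    return [target[i:i + n] for i in range(len(target) - n + 1)]

units = "abcdefghijk"
print(ngram_sketch(units, 2))       # character 2-grams: ab, bc, cd, ...
print(ngram_sketch(units, 3))       # character 3-grams: abc, bcd, cde, ...
print(ngram_sketch(["a", "b"], 3))  # n longer than the input gives []
```

The same function works on strings and lists because both support len() and slicing. The only difference is that slices of a string are strings, while slices of a list are lists.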

Code

import string

# Define the n-gram function
def ngram(target,n):
    result = []
    for i in range(0,len(target)-n+1):
        result.append(target[i:i+n])
    return result

sentence = "I am an NLPer"

# Split the sentence into words (same as problem 03)
tbl = str.maketrans("","",string.punctuation)
sentence_re = sentence.translate(tbl)
words = sentence_re.split()

# Word bi-grams
result = ngram(words,2)
print("Word bi-grams:",result)

# Character bi-grams
result = ngram(sentence,2)
print("Character bi-grams:",result)

Output

Word bi-grams: [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
Character bi-grams: ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']

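06. Sets

Obtain the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and Y.

Approach
The character bi-grams come from the function written for the previous problem.
The set type supports these set operations, so the results of the ngram function are converted with set() and the operations are applied to them.
I read "whether 'se' is contained in X and Y" as asking whether 'se' appears in both X and Y, so the check tests whether 'se' is in the intersection of X and Y.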
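
For comparison, here is a sketch that uses the set operators instead of the method calls and checks membership in X and Y separately, which is another reasonable reading of the question. The helper char_bigrams is a hypothetical stand-in so the block runs on its own.

```python
# Sketch using set operators; the bi-gram sets are built inline so the block
# runs on its own. `char_bigrams` is an illustrative helper.
def char_bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

print(sorted(X | Y))  # union, same as X.union(Y)
print(sorted(X & Y))  # intersection, same as X.intersection(Y)
print(sorted(X - Y))  # difference, same as X.difference(Y)

# Checking each set separately shows where 'se' actually occurs.
print("se" in X, "se" in Y)
```

Sets are unordered, so sorted() is used here only to make the printed order stable between runs.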

Code

# Define the n-gram function
def ngram(target,n):
    result = []
    for i in range(0,len(target)-n+1):
        result.append(target[i:i+n])
    return result

text1 = "paraparaparadise"
text2 = "paragraph"

X = set(ngram(text1,2))
print("X =",X)
Y = set(ngram(text2,2))
print("Y = ",Y)

# Union
print("Union:",X.union(Y))

# Intersection
print("Intersection:",X.intersection(Y))

# Difference
print("Difference:",X.difference(Y))

# Is 'se' contained in both X and Y?
print("'se' in both X and Y:",'se' in X.intersection(Y))

Output

X = {'ra', 'di', 'ad', 'pa', 'is', 'ar', 'se', 'ap'}
Y =  {'ap', 'pa', 'ag', 'ar', 'ph', 'gr', 'ra'}
Union: {'ra', 'di', 'ad', 'gr', 'pa', 'is', 'ar', 'ag', 'se', 'ph', 'ap'}
Intersection: {'ra', 'ap', 'pa', 'ar'}
Difference: {'se', 'ad', 'is', 'di'}
'se' in both X and Y: False

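07. Sentence generation from a template

Implement a function that takes arguments x, y, and z and returns the string "x時のyはz" (roughly "y at x o'clock is z"). Then check the result with x=12, y="気温" (temperature), z=22.4.

Approach
Implemented as the problem states.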

Code

def template(x,y,z):
    # Return the formatted string, as the problem asks, instead of printing it
    return f"{x}時の{y}は{z}"

x=12
y="気温"
z=22.4

print(template(x,y,z))

Output

12時の気温は22.4

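08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification.
・A lowercase English letter is replaced with the character whose code is (219 - character code)
・Any other character is output unchanged
Use this function to encrypt and decrypt an English message.

Approach
The code below encrypts and decrypts the string "Message". The key points are as follows.

・Lowercase letters are detected with the islower method
・The ord function converts a character to its character code for the arithmetic, and chr converts the result back to a character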
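
The arithmetic behind the specification can be checked with a short sketch: 219 equals ord("a") + ord("z"), so subtracting a letter's code from 219 mirrors the alphabet, and applying the conversion twice returns the original. The helper mirror is a hypothetical name, and restricting it to ASCII a-z is my own assumption, made because str.islower() is also true for non-ASCII lowercase letters.

```python
# Sketch of why the same function both encrypts and decrypts.
# 219 is ord("a") + ord("z"), so 219 - code mirrors the alphabet: a<->z, b<->y, ...
print(ord("a") + ord("z"))

def mirror(ch):
    # Hypothetical helper for illustration; restricted to ASCII a-z because
    # str.islower() is also true for non-ASCII lowercase letters.
    return chr(219 - ord(ch)) if "a" <= ch <= "z" else ch

print(mirror("a"), mirror("m"), mirror("z"))
print(all(mirror(mirror(ch)) == ch for ch in "abcxyz, ABC"))
```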

Code

def cipher(s):
    result = []
    for char in s:
        if char.islower():
            result.append(chr(219 - ord(char)))
        else:
            result.append(char)

    return ''.join(result)

text = "Message"

# Encrypt
en_text = cipher(text)
print(en_text)

# Decrypt
de_text = cipher(en_text)
print(de_text)

Output

Mvhhztv
Message

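09. Typoglycemia

Write a program that, given a sequence of words separated by spaces, keeps the first and last character of each word and randomly reorders the remaining characters.
Words of length 4 or less are left unchanged.
Give it a suitable English sentence (for example, "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.

Approach
Typoglycemia is the phenomenon in which text remains readable even when, in some of its words, the order of the letters other than the first and last is changed.
The key points of the implementation are as follows.

・random.sample(s[1:-1],len(s)-2) randomly reorders the characters between the first and last.
・' '.join() joins strings with a single space as the separator: https://docs.python.org/ja/3/library/stdtypes.html#str.join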
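
To illustrate the shuffling step in isolation, the sketch below calls random.sample with k equal to the length of the input, which returns a new list containing every character exactly once in random order. The input string is only an example.

```python
import random

# Sketch: random.sample(population, k) with k == len(population) returns a new
# list holding every element once, in random order.
middle = "henomena"  # inner characters of "phenomenal"; illustrative input
shuffled = random.sample(middle, len(middle))

print("".join(shuffled))                   # varies from run to run
print(sorted(shuffled) == sorted(middle))  # same characters, only reordered
```

Note that a random order can happen to equal the original order, so a shuffled word is not guaranteed to differ from its input.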

Code

import random

def Typoglycemia(s):
    # Words of length 4 or less are returned unchanged
    if len(s) <= 4:
        result = s
    else:
        # Shuffle everything between the first and last characters
        templist = random.sample(s[1:-1],len(s)-2)
        result = s[0] + ''.join(templist) + s[-1]

    return result


text = "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind ."

words = text.split()
new_words = []
for word in words:
    new_words.append(Typoglycemia(word))
new_text = ' '.join(new_words)

print(new_text)

Output (varies from run to run)

I cnuo’ldt bliveee that I cluod atualcly undntaesrd what I was rdeinag : the pmahnoenel pewor of the human mind .
