Tokenization in Python

かわだ

October 29, 2023, 06:17

Until now I have used KH Coder for text mining, but I decided to switch to Python for its speed and flexibility. Tokenization in particular is easier to do in Python.

In this post, I will tokenize the following sentence.

"This sentence contains 7 different items!"

Tokenization

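As a preprocessing step for text mining, we tokenize the text, in addition to the normalization covered in the previous post.
Tokenization is the task of splitting a long sentence into smaller units such as words.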

For tokenization, we use a library called NLTK (imported as nltk).

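import nltk
from nltk.tokenize import word_tokenize
# Download the Punkt tokenizer data used by word_tokenize (needed only once).
# Newer NLTK releases may ask for 'punkt_tab' instead; if a LookupError appears, follow its message.
nltk.download('punkt')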

Let's use it right away to split the example sentence.

import nltk
from nltk.tokenize import word_tokenize

sentence = "This sentence contains 7 different items!"

tokens = word_tokenize(sentence)
tokens

Running this returns the sentence as a list. This gives a feel for what it means to handle a sentence as data.

['This', 'sentence', 'contains', '7', 'different', 'items', '!']
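
For comparison, here is a minimal sketch, using only the standard library, of what happens when the same sentence is split on whitespace with str.split(). This is not how NLTK works internally; it only illustrates why a tokenizer is useful. The variable name naive_tokens is introduced here for illustration.

```python
sentence = "This sentence contains 7 different items!"

# Plain whitespace splitting: punctuation stays attached to the preceding word
naive_tokens = sentence.split()
print(naive_tokens)
# ['This', 'sentence', 'contains', '7', 'different', 'items!']
```

The last element is 'items!', whereas word_tokenize returns 'items' and '!' as separate tokens.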

Removing stop words

Stop words are words that appear in almost any text. They become significant noise in text mining.
For example, in "This sentence contains 7 different items!", a word like "this" probably shows up very often. Numbers and symbols are also noise.

Removing punctuation

First, let's remove the punctuation.
We use string, a module in the Python standard library.

import string
print(string.punctuation)

It contains the following characters:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Let's remove these from the list.

import string
import nltk
from nltk.tokenize import word_tokenize

sentence = "This sentence contains 7 different items!"
tokens = word_tokenize(sentence)

token_list = [token for token in tokens
              if token not in string.punctuation]
token_list

The list comprehension stores a token in token_list only if it is not a punctuation character.

['This', 'sentence', 'contains', '7', 'different', 'items']

The "!" is gone.
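
One thing to be aware of: string.punctuation is a single string, so `token not in string.punctuation` is a substring test. It works for one-character tokens such as "!", but a multi-character token such as "..." is not a substring of it and would be kept. The following is a minimal sketch of one alternative, assuming you want to drop every token made up entirely of punctuation characters. The token list is written by hand (the "..." token is added for illustration), and the names punct and is_punct are hypothetical helpers, not part of any library.

```python
import string

# Hand-written tokens for illustration; "..." is a multi-character punctuation token
tokens = ['This', 'sentence', 'contains', '7', 'different', 'items', '...', '!']

# Substring test: "..." is not a substring of string.punctuation, so it is kept
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Set of individual punctuation characters for per-character membership tests
punct = set(string.punctuation)

def is_punct(token):
    """Return True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in punct for ch in token)

char_filtered = [t for t in tokens if not is_punct(t)]

print(substring_filtered)
# ['This', 'sentence', 'contains', '7', 'different', 'items', '...']
print(char_filtered)
# ['This', 'sentence', 'contains', '7', 'different', 'items']
```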

Removing numbers

Next, let's use isalpha() to remove the numbers.

import string
import nltk
from nltk.tokenize import word_tokenize

sentence = "This sentence contains 7 different items!"
tokens = word_tokenize(sentence)

token_list = [token for token in tokens if token.isalpha()]
token_list

The result is shown below. Not only the number but also the "!" is removed.

['This', 'sentence', 'contains', 'different', 'items']

Incidentally, isdecimal can be used to keep only numbers, and isalnum to keep only alphanumeric tokens.
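
A minimal sketch comparing the three string methods on the tokens from the example sentence. The token list is written out by hand so the snippet runs without NLTK, and the variable names are chosen here for illustration.

```python
tokens = ['This', 'sentence', 'contains', '7', 'different', 'items', '!']

# isalpha(): True only when every character is a letter
only_alpha = [t for t in tokens if t.isalpha()]
# isdecimal(): True only when every character is a decimal digit
only_digits = [t for t in tokens if t.isdecimal()]
# isalnum(): True when every character is a letter or a digit
only_alnum = [t for t in tokens if t.isalnum()]

print(only_alpha)   # ['This', 'sentence', 'contains', 'different', 'items']
print(only_digits)  # ['7']
print(only_alnum)   # ['This', 'sentence', 'contains', '7', 'different', 'items']
```

Note that these methods check every character, so a hyphenated word such as "well-known" fails isalpha() and a value such as "3.5" fails isdecimal().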

Removing frequent words

Frequent English words such as "this" and "is" are certain to become noise. Let's remove them. We use stopwords, a corpus that is available through NLTK. First, download it.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

There is a handy function, stopwords.words('language name'). We store its result in a set.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

Specifying "english" prints the set of English stop words.

Incidentally, specifying "chinese" prints the Chinese stop words. Unfortunately, Japanese is not available.

If you want to remove "not" from these stop words, use discard.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words.discard('not')
print(stop_words)
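
A minimal sketch of how discard behaves, using a small hand-written set in place of the NLTK list (sample_stop_words and raised are names assumed here for illustration only). Unlike remove, discard does nothing when the element is not in the set, which makes it a safe choice when you are not sure a word is in the list.

```python
sample_stop_words = {'this', 'is', 'not', 'the'}

sample_stop_words.discard('not')     # removes 'not' from the set
sample_stop_words.discard('banana')  # not in the set: nothing happens, no error

print(sorted(sample_stop_words))
# ['is', 'the', 'this']

# remove() raises KeyError when the element is missing
raised = False
try:
    sample_stop_words.remove('banana')
except KeyError:
    raised = True
print(raised)
# True
```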

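Now let's use this to remove the frequent English words. The stop words are all lowercase, so the sentence is lowercased with tokens = word_tokenize(sentence.lower()). Then a list comprehension builds a list of every token that is not a stop word.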

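from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop word set, keeping "not" as a meaningful word
stop_words = set(stopwords.words('english'))
stop_words.discard('not')

sentence = "This sentence contains 7 different items!"
# Lowercase first, because the stop words are all lowercase
tokens = word_tokenize(sentence.lower())

token_list = [token for token in tokens if token not in stop_words]
token_list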

As a result, "this" has been removed.

['sentence', 'contains', '7', 'different', 'items', '!']
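
The filters shown so far can be combined into a single step. This is a minimal sketch, assuming the tokens have already been produced by word_tokenize. The function name clean_tokens and the tiny sample_stop_words set are hypothetical stand-ins so the snippet runs without NLTK; in practice you would pass the set built from stopwords.words('english').

```python
def clean_tokens(tokens, stop_words):
    """Lowercase tokens, keep alphabetic ones, and drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalpha() and t not in stop_words]

# Hand-written token list, matching the output of word_tokenize shown earlier
tokens = ['This', 'sentence', 'contains', '7', 'different', 'items', '!']
# Tiny stand-in for the NLTK stop word set (assumption for illustration)
sample_stop_words = {'this', 'is', 'a', 'the'}

cleaned = clean_tokens(tokens, sample_stop_words)
print(cleaned)
# ['sentence', 'contains', 'different', 'items']
```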

Closing thoughts

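Most of my work is in English, so this is the English version of tokenization. The amount of information available in English also far exceeds that in other languages, so being able to handle English well comes first. Japanese takes extra processing because, among other things, it has no spaces between words, so get the basics down before tackling Japanese.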