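Looking at the correspondence between English Wikipedia and Simple English Wikipedia (2)
In the previous post, I compared Simple English Wikipedia (hereafter SimpleWiki) with English Wikipedia (hereafter EnWiki) and found that about 200,000 article titles appear in both.
From here on, I prepare to look at the body text of those articles.
When I counted the shared article titles at the end of the previous post, I merged the two dumps into a single JSON file.
Each entry is keyed by the shared article title and holds the shared title (title), the EnWiki article ID (id), the EnWiki body text (text), the SimpleWiki article ID (s_id), and the SimpleWiki body text (s_text).
(omitted)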
length which is typically at the decimeter scale) using the following empirical relationships:\nformula_18.\nSee also.\nReferences.",
"s_id": "76367",
"s_text": "The albedo of an object is the extent to which it reflects light, more specifically light which comes from the Sun. It is defined as the ratio of reflected to incident electromagnetic radiation. It is a unitless measure indicative of a surface's or body's diffuse reflectivity. The word is derived from \"albus\", a Latin word for \"white\".\nOther websites."
},
"A": {
"title": "A",
"id": "290",
"text": "First letter of the Latin alphabet\nA, or a, is the first letter and the first vowel of the Latin alphabet,
(omitted)
Code points.\nThese are the code points for the forms of the letter in various systems\n1 Also for encodings based on ASCII, including the DOS, Windows, ISO-8859 and Macintosh families of encodings.\nUse as a number.\nIn the hexadecimal (base 16) numbering system, A is a number that corresponds to the number 10 in decimal (base 10) counting.\nNotes.\nFootnotes.\nReferences.",
"s_id": "8",
"s_text": "A or a is the first letter of the English alphabet. The small letter, a or α, is used as a lower case vowel.\nWhen it is spoken, ā is said as a long a, a Diphthong of ĕ and y. A is similar to Alphabet of the Greek alphabet. That is not surprising, because it means the same sound.\n\"Alpha and Omega\" (the last letter of the Greek alphabet) means from beginning to the end. In musical notation, the letter A is the symbol of a note in the scale, below B and above G.\nA is the letter that was used to represent a team in an old TV show, The A-Team. A capital a is written \"A\". Use a capital a at the start of a sentence if writing.\nA is also a musical note, sometimes referred to as \"La\".\nWhere it came from.\nThe letter 'A' was in the Phoenician alphabet's aleph. This symbol came from a simple picture of an ox head.\nThis Phoenician letter helped make the basic blocks of later types of the letter. The Greeks later modified this letter and used it as their letter alpha. The Greek alphabet was used by the Etruscans in northern Italy, and the Romans later modified the Etruscan alphabet for their own language.\nUsing the letter.\nThe letter A has six different sounds. It can sound like æ, in the International Phonetic Alphabet, such as the word \"pad\". Other sounds of this letter are in the words \"father\", which developed into another sound, such as in the word \"ace\".\nUse in mathematics.\nIn algebra, the letter \"A\" along with other letters at the beginning of the alphabet is used to represent known quantities.\nIn geometry, capital A, B, C etc. are used to label line segments, lines, etc. Also, A is typically used as one of the letters to label an angle in a triangle.\nIts letter shape is referred to abstractly in Sir William Vallance Douglas Hodge's 5th postulate, the basis for, as one of the Millennium Prize Problems, the Hodge Conjecture.\nReferences."
},
"Alabama": {
"title": "Alabama",
"id": "303",
"text": "U.S. state\nAlabama () is a state in the Southeastern region of the United States, bordered by Tennessee to the north; Georgia to the east; Florida and the Gulf of Mexico to the south; and Mississippi to the west. Alabama is the 30th largest by area and the 24th-most populous of the U.S. states.\nAlabama is nicknamed the \"Yellowhammer State\", after the state bird.
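(omitted)
I could process this JSON directly every time, but in this form text and s_text are hard to read, and if something goes wrong midway through processing, tracking down the cause by eye is painful.
So I expand each EnWiki and SimpleWiki article into its own text file. To keep the article-to-article correspondence established earlier, the two corresponding articles get the same file name, and a new shared ID is added as a prefix.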
f'{_num}_en_{_id}_simple_{_s_id}.txt'
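The file name is structured like this, where {_num} is the newly introduced shared prefix. Corresponding articles are given this same file name, and Simple versus En is distinguished by the directory the file is stored in.
This way, the directory of the file being viewed tells you whether it is an EnWiki or a SimpleWiki article, and the counterpart article can be found under the same file name in the other directory. Because both article IDs are kept in the file name, it is also possible to trace an article back to the dump files to check it.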
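The lookup described above can be sketched roughly as follows. This is only an illustration: the helper names `parse_name` and `counterpart` are hypothetical, and it assumes the two directories are named `en` and `simple` and sit side by side.

```python
import re
from pathlib import Path

# Matches names such as 1000_en_4548_simple_91088.txt
NAME_RE = re.compile(r"^(?P<num>\d+)_en_(?P<id>\d+)_simple_(?P<s_id>\d+)\.txt$")


def parse_name(path):
    """Return (shared prefix, EnWiki id, SimpleWiki id) taken from the file name."""
    m = NAME_RE.match(Path(path).name)
    if m is None:
        raise ValueError(f"unexpected file name: {path}")
    return m["num"], m["id"], m["s_id"]


def counterpart(path):
    """Return the path of the same article in the other directory (en <-> simple)."""
    p = Path(path)
    # Assumption: the parent directory is named exactly "en" or "simple".
    other = {"en": "simple", "simple": "en"}[p.parent.name]
    return p.parent.parent / other / p.name


print(parse_name("text/en/1000_en_4548_simple_91088.txt"))
print(counterpart("text/en/1000_en_4548_simple_91088.txt"))
```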
The script for this formatting step is here.
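As a rough idea of what such an expansion step might look like, here is a minimal sketch. It is not the script used for this post. The function name `expand`, the `out_dir` parameter, and the choice to number `_num` from 0 in file order are all assumptions. It relies only on the JSON layout shown above (id, text, s_id, s_text per title).

```python
import json
from pathlib import Path


def expand(json_path, out_dir="text"):
    """Write each article pair to <out_dir>/en and <out_dir>/simple under one shared file name."""
    with open(json_path, encoding="utf-8") as f:
        articles = json.load(f)

    en_dir = Path(out_dir) / "en"
    simple_dir = Path(out_dir) / "simple"
    en_dir.mkdir(parents=True, exist_ok=True)
    simple_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: _num is simply a running counter over the entries in file order.
    for _num, entry in enumerate(articles.values()):
        name = f'{_num}_en_{entry["id"]}_simple_{entry["s_id"]}.txt'
        # The body text already separates paragraphs and headings with "\n",
        # so writing it unchanged gives one paragraph or heading per line.
        (en_dir / name).write_text(entry["text"] + "\n", encoding="utf-8")
        (simple_dir / name).write_text(entry["s_text"] + "\n", encoding="utf-8")
    return len(articles)
```

Writing both files inside the same loop iteration is what keeps the two names identical.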
Each EnWiki and SimpleWiki article is now a text file, with the same-title relationship preserved.
Let's open one of these files.
$ less text/en/1000_en_4548_simple_91088.txt
Type of snowstorm
A blizzard is a severe snowstorm characterized by strong sustained winds and low visibility, lasting for a prolonged period of time—typically at least three or four hours. A ground blizzard is a weather condition where snow is not falling but loose snow on the ground is lifted and blown by strong winds. Blizzards can have an immense size and usually stretch to hundreds or thousands of kilometres.
Definition and etymology.
In the United States, the National Weather Service defines a blizzard as a severe snow storm characterized by strong winds causing blowing snow that results in low visibilities. The difference between a blizzard and a snowstorm is the strength of the wind, not the amount of snow. To be a blizzard, a snow storm must have sustained winds or frequent gusts that are greater than or equal to with blowing or drifting snow which reduces visibility to or less and must last for a prolonged period of time—typically three hours or more.
Environment Canada defines a blizzard as a storm with wind speeds exceeding accompanied by visibility of or less, resulting from snowfall, blowing snow, or a combination of the two. These conditions must persist for a period of at least four hours for the storm to be classified as a blizzard, except north of the arctic tree line, where that threshold is raised to six hours.
The Australia Bureau of Meteorology describes a blizzard as, "Violent and very cold wind which is laden with snow, some part, at least, of which has been raised from snow covered ground."
While severe cold and large amounts of drifting snow may accompany blizzards, they are not required. Blizzards can bring whiteout conditions, and can paralyze regions for days at a time, particularly where snowfall is unusual or rare.
A severe blizzard has winds over , near zero visibility, and temperatures of or lower. In Antarctica, blizzards are associated with winds spilling over the edge of the ice plateau at an average velocity of .
Ground blizzard refers to a weather condition where loose snow or ice on the ground is lifted and blown by strong winds. The primary difference between a ground blizzard as opposed to a regular blizzard is that in a ground blizzard no precipitation is produced at the time, but rather all the precipitation is already present in the form of snow or ice at the surface.
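(omitted)
As you can see, each line is either a paragraph or a section title.
It is not a single sentence.
So the next preprocessing step is to split the text into sentences.
There are several implementations for splitting English text into sentences, for example NLTK's Punkt, the Moses tokenizer, and so on.
I tried several of them, and for Wikipedia articles, where semicolons, colons, and quotations come up frequently, spaCy's en_core_web_sm model gave good sentence splits, so I went with it.
(As a check, I split this article into sentences with various tools and compared the results.)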
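Given a piece of text (a string), spaCy not only splits it into sentences but also handles part-of-speech tagging, parsing, NER, and most other basic processing with the single call
doc = nlp(text)
That is convenient, so let's attach the token-level information in the same pass.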
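Each file still corresponds to one SimpleWiki or EnWiki article, and I create two files per article:
a file in which each line is one sentence (.txt), and
a file in which each line is one token or EOS (.mecab).
For each token, I write the surface form, the part of speech (coarse and fine-grained tags), the lowercased surface form, the lemma, and whether it is a stop word, separated by tabs.
The files formatted this way are stored in a separate location, keeping the same en and simple directory structure and the same shared file names as before.
Here is the script.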
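The following is a minimal sketch of how such a conversion could be written with spaCy, not the script used for this post. The function names are hypothetical. It also assumes that each input line (one paragraph or heading) is parsed separately, and that the column order is the one visible in the .mecab output: surface form, coarse POS, fine-grained tag, lowercase form, lemma, stop-word flag.

```python
from pathlib import Path


def load_nlp(model="en_core_web_sm"):
    """Load the spaCy pipeline (imported lazily, only when actually needed)."""
    import spacy
    return spacy.load(model)


def token_line(token):
    """Tab-separated: surface, coarse POS, fine tag, lowercase, lemma, stop-word flag."""
    return "\t".join([token.text, token.pos_, token.tag_,
                      token.lower_, token.lemma_, str(token.is_stop)])


def doc_to_lines(doc):
    """Return (sentence lines, token lines) for one parsed paragraph or heading."""
    sent_lines, tok_lines = [], []
    for sent in doc.sents:
        text = sent.text.strip()
        if not text:
            continue
        sent_lines.append(text)
        # Skip pure-whitespace tokens so every token line has six fields.
        tok_lines.extend(token_line(t) for t in sent if not t.is_space)
        tok_lines.append("EOS")  # sentence boundary marker
    return sent_lines, tok_lines


def convert_file(nlp, src, dst_dir):
    """Write <dst_dir>/<name> (one sentence per line) and <dst_dir>/<name>.mecab."""
    src, dst_dir = Path(src), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    sents, toks = [], []
    for line in src.read_text(encoding="utf-8").splitlines():
        if line.strip():
            s, t = doc_to_lines(nlp(line))
            sents.extend(s)
            toks.extend(t)
    (dst_dir / src.name).write_text("\n".join(sents) + "\n", encoding="utf-8")
    (dst_dir / (src.name + ".mecab")).write_text("\n".join(toks) + "\n", encoding="utf-8")
```

A call would look like `convert_file(load_nlp(), "text/en/1000_en_4548_simple_91088.txt", "sentences/en")`, repeated for the simple directory with the same file name.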
$ less sentences/en/1000_en_4548_simple_91088.txt
Type of snowstorm
A blizzard is a severe snowstorm characterized by strong sustained winds and low visibility, lasting for a prolonged period of time—typically at least three or four hours.
A ground blizzard is a weather condition where snow is not falling but loose snow on the ground is lifted and blown by strong winds.
Blizzards can have an immense size and usually stretch to hundreds or thousands of kilometres.
Definition and etymology.
In the United States, the National Weather Service defines a blizzard as a severe snow storm characterized by strong winds causing blowing snow that results in low visibilities.
The difference between a blizzard and a snowstorm is the strength of the wind, not the amount of snow.
To be a blizzard, a snow storm must have sustained winds or frequent gusts that are greater than or equal to with blowing or drifting snow which reduces visibility to or less and must last for a prolonged period of time—typically three hours or more.
Environment Canada defines a blizzard as a storm with wind speeds exceeding accompanied by visibility of or less, resulting from snowfall, blowing snow, or a combination of the two.
These conditions must persist for a period of at least four hours for the storm to be classified as a blizzard, except north of the arctic tree line, where that threshold is raised to six hours.
The Australia Bureau of Meteorology describes a blizzard as, "Violent and very cold wind which is laden with snow, some part, at least, of which has been raised from snow covered ground.
(omitted)
$ less sentences/en/1000_en_4548_simple_91088.txt.mecab
Type NOUN NN type type False
of ADP IN of of True
snowstorm NOUN NN snowstorm snowstorm False
A DET DT a a True
blizzard NOUN NN blizzard blizzard False
is AUX VBZ is be True
a DET DT a a True
severe ADJ JJ severe severe False
snowstorm NOUN NN snowstorm snowstorm False
characterized VERB VBN characterized characterize False
by ADP IN by by True
strong ADJ JJ strong strong False
sustained ADJ JJ sustained sustained False
winds NOUN NNS winds wind False
and CCONJ CC and and True
low ADJ JJ low low False
visibility NOUN NN visibility visibility False
, PUNCT , , , False
lasting VERB VBG lasting last False
for ADP IN for for True
a DET DT a a True
prolonged ADJ JJ prolonged prolonged False
period NOUN NN period period False
of ADP IN of of True
time NOUN NN time time False
— PUNCT : — — False
typically ADV RB typically typically False
at ADV RB at at True
least ADV RBS least least True
three NUM CD three three True
or CCONJ CC or or True
four NUM CD four four True
hours NOUN NNS hours hour False
. PUNCT . . . False
EOS
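The .mecab format is borrowed from the morphological analyzer MeCab. Its default output splits the text into tokens, writes each token and its information on its own line (in a slightly different layout from the tab-separated one used here), and puts a line containing only "EOS" at the end of each sentence.
Having a dedicated end-of-sentence line means you don't need to keep a flag when writing loops, and counting these lines reliably gives the number of sentences in the file.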
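To illustrate both points, here is a small sketch of reading such a file. The function names are hypothetical, and it assumes the fields are tab-separated as described above.

```python
def read_sentences(path):
    """Yield each sentence as a list of token rows (each row is a list of fields)."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line == "EOS":
                # The EOS line itself closes the sentence; no extra flag is needed.
                yield sentence
                sentence = []
            elif line:
                sentence.append(line.split("\t"))


def count_sentences(path):
    """The number of sentences is the number of EOS-only lines."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.rstrip("\n") == "EOS")
```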
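The remaining preprocessing step is to obtain the corresponding sentences between corresponding articles, which I will pick up in the next post.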