【Python】21 lines: XML/HTML parsing　～XML HTML を操ろう～

AyumiKatayama

2023年1月15日 11:30

先に注意事項から。

XML/HTML を扱うプログラムの解説です。

Python のドキュメントで次のように警告されています。

xml.etree.ElementTree モジュールは悪意を持って作成されたデータに対して安全ではありません。信頼できないデータや認証されていないデータをパースする必要がある場合は XML の脆弱性を参照してください。

XML の脆弱性についてはこちら。

さて。
始めましょう。

プログラム

21行プログラムです。

dinner_recipe = '''<html><body><table>
<tr><th>amt</th><th>unit</th><th>item</th></tr>
<tr><td>24</td><td>slices</td><td>baguette</td></tr>
<tr><td>2+</td><td>tbsp</td><td>olive oil</td></tr>
<tr><td>1</td><td>cup</td><td>tomatoes</td></tr>
<tr><td>1</td><td>jar</td><td>pesto</td></tr>
</table></body></html>'''

# From http://effbot.org/zone/element-index.htm
import xml.etree.ElementTree as etree
tree = etree.fromstring(dinner_recipe)

# For invalid HTML use http://effbot.org/zone/element-soup.htm
# import ElementSoup, StringIO
# tree = ElementSoup.parse(StringIO.StringIO(dinner_recipe))

pantry = set(['olive oil', 'pesto'])
for ingredient in tree.getiterator('tr'):
    amt, unit, item = ingredient
    if item.tag == "td" and item.text not in pantry:
        print ("%s: %s %s" % (item.text, amt.text, unit.text))

実行結果

baguette: 24 slices
tomatoes: 1 cup

解説

XLM、HTML を扱うプログラムです。

変数「dinner_recipe」

変数「dinner_recipe」は、HTML で書かれた表です。
この HTML を図示するとこんな風になります。

<html> から </html> までを取り出してテキストで保存しブラウザで表示するとこんな風に見えます。

では、この HTML データで遊びます。

ElementTree

import xml.etree.ElementTree as etree
tree = etree.fromstring(dinner_recipe)

「ElementTree」を import して、「fromstring」を実行します。

「ElementTree」というのは

xml.etree.ElementTree モジュールは、XML データを解析および作成するシンプルかつ効率的な API を実装しています。

「fromstring」は

文字列定数で与えられた XML 断片を解析します。

tree の type は xml.etree.ElementTree.Element。

文字列の構文解析というのは実に面倒です。
この「ElementTree」は、要するに XML やら HTML やらの面倒な構文を解析してくれるライブラリです。

例えば、「getiterator」でイテレータを取り出すことができます。次のようなコードを書いてみます。

for item in tree.getiterator():
    print(item.__getstate__())

するとこのような結果が得られます。

{'tag': 'html', '_children': [<Element 'body' at 0xeb7e5870>], 'attrib': {}, 'text': None, 'tail': None}
{'tag': 'body', '_children': [<Element 'table' at 0xeb7e58d0>], 'attrib': {}, 'text': None, 'tail': None}
{'tag': 'table', '_children': [<Element 'tr' at 0xeb7e5930>, <Element 'tr' at 0xeb7e5a20>, <Element 'tr' at 0xeb7e5ae0>, <Element 'tr' at 0xeb7e5ba0>, <Element 'tr' at 0xeb7e5c60>], 'attrib': {}, 'text': '\n', 'tail': None}
{'tag': 'tr', '_children': [<Element 'th' at 0xeb7e5990>, <Element 'th' at 0xeb7e59c0>, <Element 'th' at 0xeb7e59f0>], 'attrib': {}, 'text': None, 'tail': '\n'}
{'tag': 'th', '_children': [], 'attrib': {}, 'text': 'amt', 'tail': None}
{'tag': 'th', '_children': [], 'attrib': {}, 'text': 'unit', 'tail': None}
{'tag': 'th', '_children': [], 'attrib': {}, 'text': 'item', 'tail': None}
{'tag': 'tr', '_children': [<Element 'td' at 0xeb7e5a50>, <Element 'td' at 0xeb7e5a80>, <Element 'td' at 0xeb7e5ab0>], 'attrib': {}, 'text': None, 'tail': '\n'}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': '24', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'slices', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'baguette', 'tail': None}
{'tag': 'tr', '_children': [<Element 'td' at 0xeb7e5b10>, <Element 'td' at 0xeb7e5b40>, <Element 'td' at 0xeb7e5b70>], 'attrib': {}, 'text': None, 'tail': '\n'}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': '2+', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'tbsp', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'olive oil', 'tail': None}
{'tag': 'tr', '_children': [<Element 'td' at 0xeb7e5bd0>, <Element 'td' at 0xeb7e5c00>, <Element 'td' at 0xeb7e5c30>], 'attrib': {}, 'text': None, 'tail': '\n'}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': '1', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'cup', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'tomatoes', 'tail': None}
{'tag': 'tr', '_children': [<Element 'td' at 0xeb7e5c90>, <Element 'td' at 0xeb7e5cc0>, <Element 'td' at 0xeb7e5cf0>], 'attrib': {}, 'text': None, 'tail': '\n'}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': '1', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'jar', 'tail': None}
{'tag': 'td', '_children': [], 'attrib': {}, 'text': 'pesto', 'tail': None}

「tag」は、 XML/HTML の <　> で記述される部分ですね。
その中にさらに <　> があれば、「_children'」にリンクされます。

tag:HTML の _children は body
tag:body の _children は table
tag:table の _children は tr が5つ
tag:tr の _children は th が3つだったり td が3つだったり

th や td には _children はありません。
そのかわり
<td>tomatoes</td>
などが、「tomatoes」として text に登録されます。

前記の HTML がこうやってイテレータでアクセスできるようになります。こんな風にデータにアクセスできれば後は簡単です。これほど高度なライブラリが「標準」として用意されるのですから有難いことです。C言語では考えられない。

続けます。

pantry = set(['olive oil', 'pesto'])

pantry は変数です。
翻訳すると「貯蔵室」だそうで、「貯蔵室」にある品物を設定しています。「set」はクラス「set」のコンストラクタ。この場合は 'olive oil' と 'pesto' を持ったクラスを作成します。

クラス「set」についてはこちら。

for ingredient in tree.getiterator('tr'):

getiterator については次のような記載がありました。

バージョン 3.2 で非推奨: 代わりに Element.iter() メソッドを使用してください。

あらあら。
「iter」の方がよいようです。

「getiterator」も「iter」も、次のクラスを返します。

<class '_elementtree._element_iterator'>

引数に「'tr'」とあります。
これによって tag が tr のものだけを列挙します。

例えば、最後の ingredient はこんな感じになります。

{'tag': 'tr', '_children': [<Element 'td' at 0xf1aa5c90>, <Element 'td' at 0xf1aa5cc0>, <Element 'td' at 0xf1aa5cf0>], 'attrib': {}, 'text': None, 'tail': '\n'}

type は dics。
_children' には 3つの td を持ちます。
そして次の行。

amt, unit, item = ingredient

ingredient から3つの td を取り出します。

if item.tag == "td" and item.text not in pantry:
    print ("%s: %s %s" % (item.text, amt.text, unit.text))

条件にあったものを標準出力に表示します。
条件は次の2つ。

tag が td であること。 th は除外するということです。

text に pantry を含まないこと。text が 'olive oil' か 'pesto' のものは除外されます。

後書き

XML や HTML を扱えるとなると、いろいろ可能性が広がりそうですよね。
「XML のサイトマップを作る」というプログラムを作ってみようかなぁ。

他に何があるだろう。
ホームページ解析する、とか？

あ。
繰り返しますが、 XML の脆弱性についてはご注意を。