Introduction to Web Scraping in Python

January 5, 2019 17:44

Web scraping means extracting content, such as HTML, from pages on the web.
This article explains how to extract that content using Python.

Python has a well-known third-party library called Requests. In this article we will use Requests to do the web scraping.

Installing requests

First, install requests. With pip, it can be installed using the following command.

$ pip install requests

Fetching a website's HTML with requests

Let's start by fetching some HTML with requests. To do that, we need to send an HTTP request to the web page we want. Run the following code.

# request_test.py

import requests

# URL to fetch
url = "http://example.com"

# Send an HTTP GET request to the URL and receive the response
response = requests.get(url)

# Decode the body using the encoding detected from its content
response.encoding = response.apparent_encoding

# Print the fetched HTML
print(response.text)

Below is the result of running the code above. It fetches the HTML of the site example.com.

$ python request_test.py
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Now let's walk through the code above line by line.
First, import the requests module.

import requests

Set the URL of the website you want to fetch, then pass the variable holding that URL string as the argument to requests.get.

url = "http://example.com"
response = requests.get(url)
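
As a minimal sketch of how this call might be hardened, the helper below wraps requests.get with a timeout and an HTTP status check. The function name fetch_html and the 10-second default are assumptions made for illustration; they do not come from the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML as text (hypothetical helper)."""
    # timeout (in seconds) keeps the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    # Same encoding step as in the article's script
    response.encoding = response.apparent_encoding
    return response.text
```

It would be called as html = fetch_html("http://example.com"), after which html holds the same text that response.text does in the script above.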

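Pages that contain Japanese text may come out garbled. The following line sets the response's encoding to the one requests detects from the body content, so that response.text is decoded correctly.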

response.encoding = response.apparent_encoding
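
To see why the encoding matters, here is a small standard-library sketch that does not use requests at all. It assumes a page whose bytes are Shift_JIS and shows what happens when they are decoded as ISO-8859-1, which is the fallback requests uses for text responses whose headers declare no charset. The sample string is made up for illustration.

```python
# Bytes as a server might send them for a Japanese page encoded in Shift_JIS
raw = "こんにちは".encode("shift_jis")

# Decoding with the wrong codec does not fail, it silently yields garbled text
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the original string
correct = raw.decode("shift_jis")

print(repr(garbled))
print(correct)
```

Assigning response.apparent_encoding to response.encoding is a way of picking the right codec for response.text without knowing it in advance.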

Finally, print the fetched HTML to standard output.

print(response.text)

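Extracting specific strings from HTML

Next, let's extract the title and the text of specific tags from the HTML above.
We will use BeautifulSoup, a well-known library for parsing HTML.

BeautifulSoup can be installed with the following command.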

$ pip install beautifulsoup4

Let's extract the title and the body text from the example.com HTML fetched above. From the HTML below, we want <title>Example Domain</title> and the contents of the p tags inside body.

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Extracting a string from a specific tag

First, write code that extracts the HTML title. Run the following code.

# html_parser.py

import requests
from bs4 import BeautifulSoup

# URL to fetch
url = "http://example.com"

# Send an HTTP GET request to the URL and receive the response
response = requests.get(url)

# Decode the body using the encoding detected from its content
response.encoding = response.apparent_encoding

# Parse the HTML
bs = BeautifulSoup(response.text, 'html.parser')
title_tag = bs.find('title')

# Print the text inside the extracted tag
print(title_tag.text)

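Let's go through the BeautifulSoup part of the code.
Pass the HTML string as the first argument to BeautifulSoup. The HTML fetched with requests earlier is available as response.text. The second argument, 'html.parser', selects the parser built into Python's standard library.

This creates a BeautifulSoup instance.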

bs = BeautifulSoup(response.text, 'html.parser')

Extract the title from the title tag. Pass the name of the tag you want as the argument to the find method.

title_tag = bs.find('title')
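
One detail worth knowing: find returns None when no matching tag exists, so accessing .text on the result would raise AttributeError for a page without a title. A minimal sketch of a guard follows; the helper name get_title is an assumption made for illustration.

```python
from bs4 import BeautifulSoup


def get_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    bs = BeautifulSoup(html, "html.parser")
    title_tag = bs.find("title")
    # find() gives None when the tag is missing, so check before reading .text
    if title_tag is None:
        return None
    return title_tag.text
```

For the example.com HTML this returns "Example Domain"; for a fragment such as "<p>no title</p>" it returns None rather than failing.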

Finally, title_tag.text gives the text inside the tag, which we print.

print(title_tag.text)
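
The p tags mentioned earlier can probably be handled in much the same way. The sketch below uses find_all, which returns every matching tag rather than only the first, and runs on a shortened copy of the example.com HTML embedded as a string so that it works without a network connection. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example.com markup, kept inline so no request is needed
html = """<html><head><title>Example Domain</title></head>
<body><div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div></body></html>"""

bs = BeautifulSoup(html, "html.parser")

# find_all returns a list of every <p> tag; get_text collects the text inside each one
paragraphs = [p.get_text(strip=True) for p in bs.find_all("p")]

# Tag attributes are read with dictionary-style access
link_url = bs.find("a")["href"]

for text in paragraphs:
    print(text)
print(link_url)
```

With the real page, html would simply be replaced by response.text from the script above.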

That concludes this simple example of web scraping with BeautifulSoup and Requests.