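45 [TechCommit Morning Study] How to Extract Real-Estate Listings with a Scrapy Spider and Export Them to Excel

September 9, 2024, 12:44

Hello! I'm Yukiko, a TechCommit member ♬ This article walks through the key points of using Python and Scrapy to scrape listings from the real-estate site SUUMO and export them to Excel.
I wrote it for readers who are learning web scraping with Python.
I hope the parts on using Scrapy and on handling the scraped data are especially useful.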

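0. Prerequisites

This article assumes you know the basics of Scrapy and are comfortable reading HTML. That said, I explain each step in as much detail as I can so that beginners can follow along too.

1. Complete code (sample)

Below is a complete sample of a Scrapy spider that collects listing data and writes it to Excel. It covers pagination, the extraction of each field, and falling back to an empty string when a value is missing.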

import scrapy
from scrapy.http.response.html import HtmlResponse
from urllib.parse import urljoin
import pandas as pd
from datetime import datetime

class SuumoSpider(scrapy.Spider):
    name = "suumo"
    origin = "https://suumo.jp"
    allowed_domains = ["suumo.jp"]
    start_urls = [
        "https://suumo.jp/jj/chintai/ichiran/FR301FC005/?ar=030&bs=040&ta=13&sc=13113&cb=0.0&ct=8.0&mb=0&mt=9999999&et=20&cn=5&shkr1=03&shkr2=03&shkr3=03&shkr4=03&sngz=&po1=25&po2=99&pc=50"
    ]
    
    properties = []  # Scraped rows accumulate here (class attribute, so it is shared across instances)

    def parse(self, response: HtmlResponse):
        self.handle_response(response)

        # Pagination: follow the link when the last pager item is "次へ" (Next)
        paginators = response.css(".pagination-parts")
        if paginators:
            last_paginator = paginators[-1]
            if last_paginator.css("::text").get() == "次へ":
                next_url = last_paginator.css("a").attrib.get("href")
                if next_url:
                    # No callback is given, so Scrapy calls parse() again for the next page
                    yield scrapy.Request(url=urljoin(self.origin, next_url))

    def handle_response(self, response: HtmlResponse):
        for row in response.css("div.property"):
            # Get the listing title
            title = row.css("h2.property_inner-title a::text").get(default='').strip()

            # Get the rent
            rent_price = row.css("div.detailbox-property-point::text").get(default='').strip()

            # Get the property table
            property_table = row.css("div.detailbox > div.detailbox-property > table")

            # Get the floor plan and the exclusive floor area (XPath indices are 1-based)
            plan_of_house = property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[1]/text()').get(default='').strip()
            exclusive_area = property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[2]/text()').get(default='').strip()

            # Get the detail text: strip each fragment, drop empty ones, join with newlines
            detail_texts = row.xpath('.//div[contains(@class, "detailbox-note")]//div[contains(@class, "detailnote-box")]/div/text()').getall()
            detail_texts = [t.strip() for t in detail_texts if t.strip() != ""]
            detail = "\n".join(detail_texts).strip()

            # Temporarily store the scraped result in the list
            self.properties.append({
                "Title": title,
                "Rent Price": rent_price,
                "Plan of House": plan_of_house,
                "Exclusive Area": exclusive_area,
                "Detail": detail,
            })

    def close(self, reason):
        """
        Called when the spider has finished all of its work.
        """
        self.save_to_excel()

    def save_to_excel(self):
        # Convert the collected rows to a DataFrame with pandas
        df = pd.DataFrame(self.properties)

        # Build the file name from the current date and time
        current_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        file_name = f"suumo_data_{current_time}.xlsx"

        # Save directly under the Downloads folder (this path is specific to the author's PC)
        file_path = f"C:/Users/yukik/Downloads/{file_name}"
        df.to_excel(file_path, index=False)

        self.log(f"Excel file saved: {file_path}")

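1 Looping over listings:
response.css("div.property") selects the div elements that hold listing data, and the code loops over them. Each row contains the details of one listing.

2 Getting the listing title:
row.css("h2.property_inner-title a::text").get(default='').strip() gets the listing title and strips leading and trailing whitespace. With default='', an empty string is returned when no value exists.

3 Getting the rent:
row.css("div.detailbox-property-point::text").get(default='').strip() gets the rent and strips whitespace. With default='', the value is an empty string even when nothing is found.

4 Getting the property table:
row.css("div.detailbox > div.detailbox-property > table") gets the table that holds the listing details. The CSS selector picks the table that is a direct child of div.detailbox-property, which is itself a direct child of div.detailbox.

5 Getting the floor plan and the exclusive floor area:
property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[1]/text()').get(default='').strip() gets the floor plan, and property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[2]/text()').get(default='').strip() gets the exclusive floor area. XPath is used to pull the data out of a specific td element in the table.

6 Getting the detail text:
row.xpath('.//div[contains(@class, "detailbox-note")]//div[contains(@class, "detailnote-box")]/div/text()').getall() gets the detail text as a list of fragments. Each fragment is stripped of leading and trailing whitespace with strip(), empty fragments are dropped, and "\n".join(detail_texts) joins the remaining lines.

7 Storing the scraped result:
The extracted values are appended to the self.properties list as a dictionary. This list is saved as an Excel file at the end.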
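
As a small illustration of the final save step, here is a minimal sketch of how the timestamped file name could be built without hard-coding a user-specific folder. The helper name build_output_path and the use of Path.home() / "Downloads" are assumptions for this example, not part of the original spider.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(directory: Path, now: datetime) -> Path:
    """Return <directory>/suumo_data_<timestamp>.xlsx for the given time."""
    # Same timestamp format as the spider's save_to_excel method
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return directory / f"suumo_data_{stamp}.xlsx"


# Assumed location: a "Downloads" folder under the current user's home directory
downloads = Path.home() / "Downloads"
print(build_output_path(downloads, datetime.now()))
```

Passing the directory and the time in as arguments keeps the helper easy to check, and the resulting Path can be handed to df.to_excel in place of the hard-coded string.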

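2. Code flow by method

parse method:
- The spider's default callback. Scrapy calls it with the response for each URL in start_urls, and it passes the response to handle_response to extract the data. If there is a next page, it takes that page's URL and yields a new request, whose response is handled by parse again.
handle_response method:
- Loops over the listings and extracts the title, rent, floor plan, exclusive floor area, and detail text, storing them in the self.properties list.
close method:
- Called after the spider has finished all of its work; it saves the collected data to an Excel file. Note that the shutdown hook documented by Scrapy is named closed(reason), while this sample defines close(reason).
save_to_excel method:
- Saves the data as an Excel file using the pandas library. The current date and time are embedded in the file name.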
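
The pagination decision in parse can be sketched in isolation as follows. This is only an illustration: the helper name next_page_url and the sample href values are made up, and the strip() on the label is an addition that the original spider does not perform.

```python
from urllib.parse import urljoin

ORIGIN = "https://suumo.jp"


def next_page_url(label, href):
    """Return the absolute next-page URL, or None when there is no next page."""
    # Only the pager item labelled "次へ" (Next) should be followed
    if label is None or label.strip() != "次へ" or not href:
        return None
    # Resolve the site-relative href against the origin, as the spider does
    return urljoin(ORIGIN, href)


# Hypothetical href values, not real SUUMO links
print(next_page_url("次へ", "/jj/chintai/ichiran/FR301FC005/?page=2"))
print(next_page_url("5", "/jj/chintai/ichiran/FR301FC005/?page=5"))
```

When the last pager item is a page number rather than "次へ", the helper returns None, which corresponds to the spider yielding no further request and finishing the crawl.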

3. Walkthrough of the handle_response method

handle_response is the central method: it extracts the listing data from the response and processes it ♪
Each part of the code is explained in detail below.

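3-1 Looping over listings

for row in response.css("div.property"):

① Explanation: response.css("div.property") selects the div elements that hold listing data, and the code loops over them. Each row contains the details of one listing.
② How to adapt it: If the HTML structure changes, update the selector to the new class name.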

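3-2 Getting the listing title

title = row.css("h2.property_inner-title a::text").get(default='').strip()

① Explanation: h2.property_inner-title a::text gets the listing title, and strip() removes leading and trailing whitespace. get(default='') returns an empty string when no value exists.
② How to adapt it: Changing the CSS selector (or switching to XPath) for the title lets you handle a different title structure.
③ row.css: row.css is one of Scrapy's selector methods; it extracts HTML elements using a CSS selector. Changing the selector passed to row.css lets you adapt to changes in the HTML structure, which is especially useful when the page layout or class names change.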

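3-3 Getting the rent

rent_price = row.css("div.detailbox-property-point::text").get(default='').strip()

① Explanation: div.detailbox-property-point::text gets the rent, and strip() removes whitespace. With default='', the value is an empty string even when nothing is found.
② How to adapt it: If the rent is marked up differently (a different class name or tag), change the selector accordingly.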

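3-4 Getting the property table

property_table = row.css("div.detailbox > div.detailbox-property > table")

① Explanation: Gets the table that holds the listing details. The CSS selector picks the table that is a direct child of div.detailbox-property, which is itself a direct child of div.detailbox.
② How to adapt it: If the table structure or class names change, update the selector to match.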

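3-5 Getting the floor plan and the exclusive floor area

plan_of_house = property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[1]/text()').get(default='').strip()
exclusive_area = property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[2]/text()').get(default='').strip()

① Explanation: XPath is used to get the floor plan and the exclusive floor area from a specific td element in the table. The div indices [1] and [2] separate the two values; XPath positions start at 1, so div[1] is the first div in the cell.
② How to adapt it: If the values are marked up differently (a different tag or class name), change the XPath expression accordingly.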

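3-6 Getting the detail text

detail_texts = row.xpath('.//div[contains(@class, "detailbox-note")]//div[contains(@class, "detailnote-box")]/div/text()').getall()
detail_texts = [t.strip() for t in detail_texts if t.strip() != ""]
detail = "\n".join(detail_texts).strip()

① Explanation: The detail text comes back as multiple lines, so XPath is used to extract the text from several div elements; each fragment is stripped, empty fragments are dropped, and the rest are joined with newlines.
② How to adapt it: If the structure of the detail section changes, adjust the XPath expression or the text-processing logic.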
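
The clean-up applied to the getall() result can be tried on a plain list of strings. The helper name clean_detail and the sample fragments are made up for this sketch.

```python
def clean_detail(raw_texts):
    """Strip each fragment, drop the empty ones, and join the rest with newlines."""
    stripped = [t.strip() for t in raw_texts]
    return "\n".join(t for t in stripped if t)


# Made-up fragments that imitate what getall() might return from text() nodes
raw = ["\n  Built 5 years ago ", "   ", "2nd floor of 4\n", ""]
detail = clean_detail(raw)
print(repr(detail))
```

Whitespace-only fragments disappear entirely instead of leaving blank lines in the Excel cell.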

3-7 Storing the scraped result

self.properties.append({
    "Title": title,
    "Rent Price": rent_price,
    "Plan of House": plan_of_house,
    "Exclusive Area": exclusive_area,
    "Detail": detail,
})

① Explanation: The extracted values are appended to the self.properties list as a dictionary. This list is saved as an Excel file at the end.
② How to adapt it: Add the fields you need or remove the ones you don't to customize the scraping output.
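
A list of dictionaries like this maps directly onto a pandas DataFrame: each dictionary becomes a row and each key becomes a column. The two rows below are invented sample data.

```python
import pandas as pd

# Invented rows in the same shape as the dictionaries appended by the spider
properties = [
    {
        "Title": "Sample Heights 101",
        "Rent Price": "8万円",
        "Plan of House": "1LDK",
        "Exclusive Area": "40.5m2",
        "Detail": "Built 5 years ago\n2nd floor of 4",
    },
    {
        "Title": "Sample Court 202",
        "Rent Price": "",
        "Plan of House": "2DK",
        "Exclusive Area": "45.0m2",
        "Detail": "",
    },
]

df = pd.DataFrame(properties)
print(df.shape)
print(list(df.columns))
```

Adding or removing a key in the dictionary adds or removes the corresponding column. Writing the frame with df.to_excel(path, index=False) additionally requires an Excel writer engine such as openpyxl to be installed.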

4. Recap of the handle_response method

Here is the handle_response method again in full, as a recap.

def handle_response(self, response: HtmlResponse):
    # Loop over the elements that hold each listing
    for row in response.css("div.property"):
        # Get the listing title
        title = row.css("h2.property_inner-title a::text").get(default='').strip()

        # Get the rent
        rent_price = row.css("div.detailbox-property-point::text").get(default='').strip()

        # Get the property table
        property_table = row.css("div.detailbox > div.detailbox-property > table")

        # Get the floor plan and the exclusive floor area (XPath indices are 1-based)
        plan_of_house = property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[1]/text()').get(default='').strip()
        exclusive_area = property_table.xpath('.//td[contains(@class, "detailbox-property--col3")]/div[2]/text()').get(default='').strip()

        # Get the detail text: strip each fragment, drop empty ones, join with newlines
        detail_texts = row.xpath('.//div[contains(@class, "detailbox-note")]//div[contains(@class, "detailnote-box")]/div/text()').getall()
        detail_texts = [t.strip() for t in detail_texts if t.strip() != ""]
        detail = "\n".join(detail_texts).strip()

        # Temporarily store the scraped result in the list
        self.properties.append({
            "Title": title,
            "Rent Price": rent_price,
            "Plan of House": plan_of_house,
            "Exclusive Area": exclusive_area,
            "Detail": detail,
        })

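4-1 Detailed explanation

① Looping over listings:
- response.css("div.property") selects the div elements that hold listing data, and for row in ... loops over them. This processes every listing on the page.
② Getting the listing title:
- row.css("h2.property_inner-title a::text").get(default='') extracts the listing title, and strip() removes leading and trailing whitespace. default='' makes the call return an empty string when there is no title.
③ Getting the rent:
- row.css("div.detailbox-property-point::text").get(default='') gets the rent, and whitespace is stripped. If there is no rent, the value is an empty string.
④ Getting the property table:
- row.css("div.detailbox > div.detailbox-property > table") gets the table that holds the listing details. Elements inside the table are then selected with further queries.
⑤ Getting the floor plan and the exclusive floor area:
- xpath is used to extract the floor plan and the exclusive floor area from a td element in the table. The indices [1] and [2] separate the two values.
⑥ Getting the detail text:
- row.xpath('.//div[contains(@class, "detailbox-note")]//div[contains(@class, "detailnote-box")]/div/text()') selects the detail text, and getall() returns every matching text node as a list. Whitespace is stripped, empty fragments are dropped, and the text is joined to form the detail field.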

5. Closing

Thank you for reading to the end! I hope this article helps you learn scraping with Scrapy.