[Python×SEO] Categorizing Large Numbers of URLs with Regular Expressions: Single-Category Edition (Polars)

悠生/Python×SEO

March 19, 2024 09:47

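One of the quietly time-consuming tasks in SEO work is categorizing URLs.

If a site's directories are cleanly separated by category, splitting the URLs by directory is enough, but such sites are rare.

Many of you probably think "I wish I could just sort these with regular expressions..." but, because that is too complicated to build with Excel or spreadsheet functions, end up categorizing by hand.

In this article, I show how to categorize URLs in one go using regular expressions in Python, taking the pain out of this task.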

Step 1: Install Polars

As in my previous note, this article uses a library called Polars.
Run the following command in a terminal to install Polars.

pip install polars

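Step 2: Prepare the URL list file (urls.txt)

Create a text file (urls.txt) listing the URLs to check, one per line.
This article assumes fashion e-commerce sites and uses the URL list below. The list mixes URLs from example.com and example.net so that multiple sites can be checked in a single run.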

https://example.com/
https://example.com/jacket/
https://example.com/jacket/short/121110/
https://example.com/coat/
https://example.com/coat/long/35892/
https://example.com/coat/long/35892/kuchikomi/
https://example.com/denim/
https://example.com/denim/23235/
https://example.com/skirt/
https://example.net/
https://example.net/outer/jacket/
https://example.net/outer/jacket/001/
https://example.net/outer/jacket/suit/002/
https://example.net/outer/coat/p-525252/
https://example.net/outer/p-11115/
https://example.net/outer/coat/p-3005/
https://example.net/outer/down/

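example.com is assumed to have the following URL structure.

Top-level category pages (outerwear, etc.): not part of the URL structure
Mid-level category pages (jackets, etc.): directly under the domain (alphabetic)
Subcategory pages (short length, etc.): directly under a mid-level category page (alphabetic)
Product detail pages: directly under a mid-level category or subcategory (digits, any length)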

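In contrast, example.net has the following URL structure.

Top-level category pages (outerwear, etc.): directly under the domain (alphabetic)
Mid-level category pages (jackets, etc.): directly under a top-level category page (alphabetic)
Subcategory pages (short length, etc.): directly under a mid-level category page (alphabetic)
Product detail pages: under a top-level category or subcategory ("p-" followed by digits, any length)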
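
To make the difference between the two structures easier to see, here is a minimal sketch that splits a URL into its path segments using only the standard library. The helper name path_segments is my own, chosen for illustration, and is not part of the script in this article.

```python
from urllib.parse import urlparse


def path_segments(url):
    """Return the non-empty path segments of a URL (hypothetical helper)."""
    return [segment for segment in urlparse(url).path.split("/") if segment]


# example.com: the numeric product ID sits under a mid-level category or subcategory
print(path_segments("https://example.com/jacket/short/121110/"))
# example.net: the top-level category comes first, and product IDs start with "p-"
print(path_segments("https://example.net/outer/coat/p-525252/"))
```

Looking at the segments side by side shows why each site needs its own set of regular expressions.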

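Step 3: Create the per-site regular expression rules (sites_configurations.json)

In this code, the regular expression rules for each site are defined in an external file (sites_configurations.json), which the Python script loads.

The file is a list with one entry per site, and each entry has the following keys.

site_name : Name of the site
base_url : URL of the site's top page
url_patterns : A dict of regular expression patterns. Each key becomes a category name

Note that in this code, rules listed earlier in url_patterns take priority. If a URL can match more than one rule, put the rule you want to win higher up in url_patterns. This works because Python's json module keeps the keys in file order when it loads the object into a dict (Python 3.7 and later).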

The site_name key is optional in this code.
Here is a sample. Note that a backslash has to be written twice in JSON, so \d becomes \\d.

[
    {
        "site_name": "example.com",
        "base_url": "https://example.com",
        "url_patterns": {
            "top": "https://example.com/$",
            "product_detail": "https://example.com/[a-zA-Z]*/([a-zA-Z]*/)*\\d*/$",
            "product_kuchikomi": "https://example.com/[a-zA-Z]*/([a-zA-Z]*/)*\\d*/kuchikomi/$",
            "cat_outer": "https://example.com/(jacket|coat)/",
            "cat_bottoms": "https://example.com/(denim|skirt)/"
        }
    },
    {
        "site_name": "example.net",
        "base_url": "https://example.net",
        "url_patterns": {
            "top": "https://example.net/$",
            "product_detail": "https://example.net/.*/p-\\d*/$",
            "cat_outer": "https://example.net/outer/",
            "cat_bottoms": "https://example.net/bottoms/"
        }
    }
]
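
To illustrate the priority rule, here is a small sketch that uses Python's built-in re module instead of Polars. For patterns this simple the two should behave the same, but that is an assumption on my part. first_matching_category is a hypothetical helper and does not appear in categorize_urls.py.

```python
import re

# The example.com patterns from the sample above, in the same order.
# In a Python raw string the backslash is written once (\d), unlike in JSON (\\d).
EXAMPLE_COM_PATTERNS = {
    "top": r"https://example.com/$",
    "product_detail": r"https://example.com/[a-zA-Z]*/([a-zA-Z]*/)*\d*/$",
    "product_kuchikomi": r"https://example.com/[a-zA-Z]*/([a-zA-Z]*/)*\d*/kuchikomi/$",
    "cat_outer": r"https://example.com/(jacket|coat)/",
    "cat_bottoms": r"https://example.com/(denim|skirt)/",
}


def first_matching_category(url, patterns):
    """Return the key of the first pattern that matches, or "other" (hypothetical helper)."""
    for category, pattern in patterns.items():
        if re.search(pattern, url):
            return category
    return "other"


url = "https://example.com/jacket/short/121110/"
print(first_matching_category(url, EXAMPLE_COM_PATTERNS))  # product_detail

# If the category rules come first, the same URL is absorbed by cat_outer
reordered = dict(reversed(list(EXAMPLE_COM_PATTERNS.items())))
print(first_matching_category(url, reordered))  # cat_outer
```

This is why the product rules sit above the cat_ rules in the sample: a product URL also matches its category rule, so whichever comes first decides the result.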

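Step 4: Create the Python file (categorize_urls.py)

Save the following code as categorize_urls.py.
On lines 8 to 10 you can change the URL list file to read, the configuration file, and the folder where the result CSV is saved.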

import datetime
import json
import os
import time
import polars as pl

# Input and output paths
URLS_TXT = "./urls.txt"
SITES_CONFIGURATIONS_JSON = "./sites_configurations.json"
OUTPUT_FOLDER = "./output/"


def load_sites_configurations_from_json(json_path):
    """Load the per-site settings from a JSON file.

    Args:
        json_path (str): Path to the JSON file

    Returns:
        list: A list of per-site settings (dict)
            site_name (str): Site name
            base_url (str): Base URL of the site
            url_patterns (dict): Regular expressions to match against URLs
    """
    with open(json_path, encoding="utf-8") as f:
        sites_configurations = json.load(f)
    return sites_configurations


def get_dataframes_with_category_columns(dataframe, patterns):
    """Add a category column based on which regular expression each URL matches.

    Args:
        dataframe (DataFrame): DataFrame containing a "url" column
        patterns (dict): Category names mapped to regular expression patterns

    Returns:
        DataFrame: DataFrame with the "url" and "category" columns
    """
    # Every URL starts out as "other"
    df = dataframe.with_columns(pl.lit("other").alias("category"))

    # Add one Boolean column per category holding the regex match result.
    # The category name changes on every iteration, so name the column with alias.
    for cat, pattern in patterns.items():
        df = df.with_columns(
            pl.col("url").str.contains(pattern).fill_null(False).alias(cat)
        )

    # Reflect the match results in the category column.
    # Only rows still marked "other" are updated, so earlier patterns win.
    for cat in patterns.keys():
        df = df.with_columns(
            category=pl.when(pl.col(cat) & (pl.col("category") == "other"))
            .then(pl.lit(cat))
            .otherwise(pl.col("category"))
        )

    # Drop the helper columns and return only url and category
    return df.select(["url", "category"])


# utility functions
def get_urls_from_txt(file_path):
    """Read non-empty lines from a text file as a list of URLs."""
    with open(file_path, "r", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    return urls


def get_current_datetime_as_string():
    """Return the current date and time as a string usable in a file name."""
    return datetime.datetime.now().strftime("%Y%m%d_%H-%M-%S")


# Main process
if __name__ == "__main__":
    # Start time for measuring the processing time
    start_time = time.time()

    # Load the URLs and the settings from their files
    sites_configurations = load_sites_configurations_from_json(SITES_CONFIGURATIONS_JSON)
    urls = get_urls_from_txt(URLS_TXT)
    url_df = pl.DataFrame({"url": urls})

    # For each site, categorize the URLs that belong to it
    categorized_dfs = []
    for site_config in sites_configurations:
        # base_url is a plain string, so compare it as a prefix, not as a regex
        site_df = url_df.filter(pl.col("url").str.starts_with(site_config["base_url"]))
        patterns = site_config["url_patterns"]
        categorized_dfs.append(get_dataframes_with_category_columns(site_df, patterns))
    df = pl.concat(categorized_dfs)

    # Save the result as a CSV file (with a BOM so that Excel detects UTF-8)
    os.makedirs(OUTPUT_FOLDER, exist_ok=True)
    current_datetime = get_current_datetime_as_string()
    output_path = f"{OUTPUT_FOLDER}categorized_urls_{current_datetime}.csv"
    df.write_csv(output_path, include_bom=True)
    print(f'Results saved to "{output_path}"')

    # End time for measuring the processing time
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Elapsed time: {elapsed_time:.2f} seconds")

Step 5: Run categorize_urls.py to categorize the URLs

Enter the following command in a terminal to run categorize_urls.py.

python categorize_urls.py

Result

With the default settings, a CSV file listing each URL and its category is saved in the ./output/ folder.

Unless the URL patterns are complex, it should finish within a few seconds.
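
As a quick way to check the result, the sketch below counts how many URLs fell into each category. It assumes the CSV has the url and category columns written by the script and that the files sit in ./output/. count_categories is a hypothetical helper, not part of categorize_urls.py.

```python
import csv
import glob
from collections import Counter


def count_categories(csv_path):
    """Count rows per category in a result CSV (hypothetical helper)."""
    # utf-8-sig strips the BOM at the start of the file, if there is one
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return Counter(row["category"] for row in csv.DictReader(f))


# The timestamp in the file name sorts chronologically, so the last file is the newest
result_files = sorted(glob.glob("./output/categorized_urls_*.csv"))
if result_files:
    for category, count in count_categories(result_files[-1]).most_common():
        print(category, count)
```

A large count for "other" usually means a pattern is missing or too strict, so this is a handy first check before using the CSV.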

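Summary

To categorize URLs with Polars, you use when, then, and otherwise.

Written naively, however, you need a separate when, then, otherwise for every condition, which is tedious and makes the code hard to follow.

The code in this article instead adds one new column for each key of url_patterns defined in sites_configurations.json, then writes the name of the first column that is True into the category column of the DataFrame.
Note: if you know a better way, please let me know...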