Backing Up Slack with Python

August 25, 2022, 15:48

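Slack is going paid!

I kept fretting about what to do, and now September is only a week away. Then a friend came to me for advice about backing up Slack, and that finally gave me the push to face this problem.

My friend sent me a link to this Qiita article, but it did not work as-is. While I was puzzling over what to do, it occurred to me: "Wait, couldn't all of this be solved in Python?" So I decided to publish the method here.

If, like me, you have not taken any measures yet and you can use Python, please feel free to use what follows!
Note: I do not expect this to harm your PC, but I take no responsibility if something does go wrong, so please keep that in mind.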

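Overview and expected output

Running the Python script below reads a list of workspaces that you prepare in advance and creates one folder per workspace, named after the workspace. Inside each folder it saves the history of every channel and conversation in that workspace (both public and private), as one Excel file per channel or conversation.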

Preparation

I followed this Qiita article almost exactly. It was a huge help. Thank you (a thank-you sent from all the way over here).

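1. Create and configure a Slack App

(1) Go to this page of the Slack API site and create an App via Create App --> From scratch. When asked, select the Workspace whose history you want to save. Any App name is fine.
(2) Click Basic Information in the left column, then click "Permissions" under Add features and functionality on the right.
(3) Further down, under "Scopes", add the following to "User Token Scopes":
- groups:history
- channels:history
- im:history
- mpim:history
- groups:read
- channels:read
- im:read
- mpim:read
- users:read
- channels:write
(4) Under OAuth Tokens for Your Workspace, located above Scopes, click Install to Workspace (approve if you are asked anything).
(5) A User OAuth Token is displayed. Note it down together with the Workspace name, in Excel or similar (you will use it later).
(6) If you have many workspaces, repeat steps (1) to (5) for each one.
- The script below reads an xlsx file that lists Workspace names and Tokens, so add one row per workspace, in the order "workspace name", "Token".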

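2. Set up a Python 3 environment

Install slack_sdk, pandas, and openpyxl with pip or similar (if these three libraries are already installed, you can skip this step).
- [Caution] This does not work on Python 3.9 or later! Please try it with Python 3.8 or earlier! (as of 2022.8.24)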
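Before running the script, it may help to confirm that the three libraries can actually be imported. The following is only a minimal sketch; `missing_packages` is a hypothetical helper name and is not part of the original script.

```python
import importlib.util
import sys

# Hypothetical helper (not part of the original script): returns the names
# from `names` that cannot be found as importable top-level modules.
def missing_packages(names):
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Python version:", sys.version.split()[0])
missing = missing_packages(["slack_sdk", "pandas", "openpyxl"])
if missing:
    # A typical fix would be: pip install slack_sdk pandas openpyxl
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only tells you whether the modules can be found; it does not verify version compatibility.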

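3. Prepare the save location and related files

Prepare a folder (directory) in which to save the Slack history.
Inside it, create an xlsx file named "workspace.xlsx" and, as described in the note at the end of section 1, enter each workspace name and the token of the App installed in that workspace (no header row is needed).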
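If you would rather generate "workspace.xlsx" from Python than type it in Excel, something like the sketch below should produce the layout the script reads with `pd.read_excel("workspace.xlsx", header=None)`. The helper name `write_workspace_list` and the example values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical helper (not part of the original script): writes
# (workspace name, token) pairs to an xlsx file with no header row and
# no index column, one workspace per row.
def write_workspace_list(rows, path="workspace.xlsx"):
    pd.DataFrame(rows).to_excel(path, header=False, index=False)

# Example call with placeholder values; replace them with your own:
# write_workspace_list([("my-workspace", "xoxp-REPLACE-ME")])
```

Writing xlsx files this way relies on openpyxl being installed, which the preparation step above already asks for.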

Running

Save the Python file below in the folder you prepared above, then run it from inside that folder using the Python 3 environment you set up above.

Updated 2022.8.26

# encoding: utf-8
import logging
import os
import pandas as pd
# Import WebClient from Python SDK (github.com/slackapi/python-slack-sdk)
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

encode = "utf-8"
workspace_df = pd.read_excel("workspace.xlsx", header=None)

for index, ws in workspace_df.iterrows():
    workspace = ws[0]
    app_token = ws[1]
    # Create the output folder for this workspace if it does not exist yet
    os.makedirs(f"./{workspace}", exist_ok=True)

    # WebClient instantiates a client that can call API methods
    # When using Bolt, you can use either `app.client` or the `client` passed to listeners.
    client = WebClient(token=app_token)
    logger = logging.getLogger(__name__)

    # Get the names and IDs of the members in the workspace
    # Almost verbatim from the Slack API web page below
    # https://api.slack.com/methods/users.list
    # You probably want to use a database to store any user information ;)
    users_store = {}
    def fetch_users():
        try:
            # Call the users.list method using the WebClient
            # users.list requires the users:read scope
            result = client.users_list()
            save_users(result["members"])

        except SlackApiError as e:
            logger.error("Error fetching users: {}".format(e))

    # Put users into the dict
    def save_users(users_array):
        for user in users_array:
            # Key user info on their unique user ID
            user_id = user["id"]
            # Store the display name: real_name if present, otherwise name
            try:
                users_store[user_id] = user["real_name"]
            except KeyError:
                users_store[user_id] = user["name"]

    fetch_users()
    users_df = pd.DataFrame(list(users_store.items()), columns=["id", "username"])


    # Get the names and IDs of the channels in the workspace
    # Almost verbatim from the Slack API web page below
    # https://api.slack.com/methods/conversations.list
    conversations_store = {}
    def fetch_conversations(conversation_type):
        try:
            # Call the conversations.list method using the WebClient
            result = client.conversations_list(types=conversation_type)
            save_conversations(result["channels"])

        except SlackApiError as e:
            logger.error("Error fetching conversations: {}".format(e))

    # Put conversations into the dict
    def save_conversations(conversations):
        for conversation in conversations:
            # Key conversation info on its unique ID
            conversation_id = conversation["id"]

            # Store the entire conversation object
            # (you may not need all of the info)
            conversations_store[conversation_id] = conversation

    for conversation_type in ["public_channel","private_channel","mpim","im"]:
        fetch_conversations(conversation_type)

    channel_ids_dict = {}
    for conversation_id, conversation in conversations_store.items():
        if "name" in conversation: # channels
            channel_ids_dict[conversation_id] = conversation["name"]
        elif "user" in conversation: # DMs etc.
            # Fall back to the raw user ID if the user is not in users_store
            channel_ids_dict[conversation_id] = users_store.get(conversation["user"], conversation["user"])


    # Get the history of each channel
    # Almost verbatim from the Slack API web page below
    # https://api.slack.com/methods/conversations.history
    conversation_hist_index_list = ['text', 'user', 'ts', 'reply_from', 'team', 'thread_ts', 'reply_count', 'reply_users_count', 'latest_reply', 'reply_users', 'reactions', 'type', 'client_msg_id', 'attachments']
    for channel_id, channelname in channel_ids_dict.items():
        print(f"Building history for {channelname} in {workspace}....")
        conversation_history = []
        try:
            # Uncomment the next line if you get a not_in_channel error
            # client.conversations_join(channel=channel_id)
            # Call the conversations.history method using the WebClient
            # conversations.history returns the first 100 messages by default
            # These results are paginated, see: https://api.slack.com/methods/conversations.history#pagination
            result = client.conversations_history(channel=channel_id, limit=1000)
            conversation_history = result["messages"]

        except SlackApiError as e:
            logger.error("Error fetching conversation history: {}".format(e))

        df_conversation_history = pd.DataFrame(index = conversation_hist_index_list)

        threads_ts = [] # timestamps of posts that have replies

        for i in conversation_history:
            tmp = pd.Series(data=i, name=i['ts'], index=conversation_hist_index_list)
            if "reply_count" in i.keys():
                threads_ts.append(i['ts'])
            df_conversation_history = pd.concat([df_conversation_history, tmp], axis=1)

        # Process replies
        for ts in threads_ts:
            thread_history = []
            try:
                result = client.conversations_replies(channel=channel_id, ts=ts, limit=1000)
                thread_history = result["messages"]

            except SlackApiError as e:
                logger.error("Error fetching thread replies: {}".format(e))

            df_thread_history = pd.DataFrame(index = conversation_hist_index_list)
        
            for i in thread_history:
                i['reply_from'] = ts
                tmp = pd.Series(data=i, name=i['ts'], index=conversation_hist_index_list)
                df_thread_history = pd.concat([df_thread_history, tmp], axis=1)

            df_conversation_history = pd.concat([df_conversation_history, df_thread_history], axis=1)

        df_conversation_history = df_conversation_history.T

        # Convert Unix time to a human-readable date and time
        for i in ["ts", "reply_from", "thread_ts"]: # This step triggers FutureWarnings; please ignore them for now.
            df_conversation_history[i] = pd.to_datetime(df_conversation_history[i], unit="s", utc=True)
            df_conversation_history[i] = df_conversation_history[i].dt.tz_convert('Asia/Tokyo')
            df_conversation_history[i] = df_conversation_history[i].dt.strftime('%Y/%m/%d %H:%M:%S')
        df_conversation_history = df_conversation_history.sort_values("ts")

        # Replace member IDs with names
        df_conversation_history = pd.merge(df_conversation_history , users_df, how = "left", left_on = "user", right_on = "id").drop(["id", "user"], axis=1)
        df_conversation_history = df_conversation_history.rename(columns={"username":"user"}).reindex(columns=conversation_hist_index_list)

        # Remove duplicates
        df_conversation_history = df_conversation_history.drop_duplicates(subset=["text","ts"])

        # Save to Excel
        df_conversation_history.to_excel(f"./{workspace}/{channelname}.xlsx")

Known issues

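For direct messages among multiple people, the file names are hard to read (they come out as mpdm-…… .xlsx). I may fix this when I have time, but the names do appear properly inside the Excel file, so I would appreciate it if you could work with that for now.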
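One possible way to get friendlier file names is to rewrite the group-DM name before saving. The sketch below ASSUMES that such names have the form "mpdm-<user>--<user>--...-<number>"; that format and the helper name `readable_mpdm_name` are assumptions for illustration and are not taken from the script above.

```python
import re

# Hypothetical helper (not part of the original script).
# ASSUMPTION: group-DM names look like "mpdm-<user>--<user>--...-<number>".
# Names that do not match this pattern are returned unchanged.
def readable_mpdm_name(name):
    match = re.fullmatch(r"mpdm-(.+)-\d+", name)
    if match is None:
        return name
    members = match.group(1).split("--")
    return "DM_" + "_".join(members)
```

If the assumption holds, calling this on `channelname` just before the `to_excel` line would be enough; ordinary channel names pass through untouched.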
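(Added 2022.8.26) At present, each conversation can be read only up to a maximum of 1000 messages (replies to a post are not counted unless they were also posted to the channel). If a conversation is large and exceeds 1000 messages, I am sorry for the trouble, but please use the tester on the Slack API conversations.history page, specify a date range, and fetch the data as a JSON file.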
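Another option for large conversations is to follow the cursor that Slack returns in `response_metadata.next_cursor` and request page after page. The following is a minimal sketch of that loop, not the author's method; `fetch_all` and its `fetch_page` parameter are hypothetical names, and the loop is written against a generic callable so it can be tried without a network connection.

```python
# Hypothetical helper (not part of the original script): collects all items
# across the pages of a cursor-paginated Slack Web API method.
# `fetch_page` is any callable that takes a cursor (None for the first page)
# and returns a dict-like response.
def fetch_all(fetch_page, key="messages"):
    items = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        items.extend(page.get(key, []))
        # A non-empty next_cursor means there are more pages to fetch
        cursor = (page.get("response_metadata") or {}).get("next_cursor")
        if not cursor:
            return items
```

With the script's `client`, a call might look like `fetch_all(lambda cursor: client.conversations_history(channel=channel_id, cursor=cursor, limit=200))`. I have not run this against a real workspace, and fetching many pages in a row may run into Slack's rate limits.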
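(Added 2022.8.26) The final step, which converts Unix time into an ordinary date and time, emits a large number of "FutureWarning" messages. These warnings appear because the pandas DataFrame contains empty (NaN) cells that hold no Unix time. They do not affect the behavior, so please ignore them for now.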
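Whatever the exact cause of the warning in a given pandas version, one way to make the conversion more explicit is to cast the timestamp strings to numbers before calling `pd.to_datetime`. This is a sketch under that assumption, not a confirmed fix; `slack_ts_to_jst` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical helper (not part of the original script): converts a column
# of Slack "ts" strings (NaN allowed) into formatted Asia/Tokyo time strings.
# The values are cast to numbers first, so pd.to_datetime receives floats
# rather than strings together with unit="s".
def slack_ts_to_jst(series):
    seconds = pd.to_numeric(series, errors="coerce")
    times = pd.to_datetime(seconds, unit="s", utc=True)
    return times.dt.tz_convert("Asia/Tokyo").dt.strftime("%Y/%m/%d %H:%M:%S")
```

In the script, this would replace the three lines inside the loop over "ts", "reply_from", and "thread_ts". Empty cells stay empty in the output.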
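I have not hit any errors so far, but apparently a "not_in_channel" error occurs from time to time. In that case, uncommenting the "client.conversations_join(channel=channel_id)" line in the code (or adding it just before the conversations_history call if it is missing) should let you retrieve at least the public channels. Apparently, if you run the script once with that line enabled and then comment it out again, everything can be retrieved, including private channels and DMs (I do not know why). The not_in_channel error is ~~described here.~~
→ (Added 2022.8.26) As long as you use a User Token (not a Bot Token), this error does not seem to occur. So if you follow the steps above, you should not see it. If you do see it, check which kind of token you are using.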
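A quick way to check the kind of token is to look at its prefix: Slack user tokens start with "xoxp-" and bot tokens start with "xoxb-". The helper below is a small illustrative sketch; `token_kind` is a hypothetical name and is not part of the script.

```python
# Hypothetical helper (not part of the original script): guesses the kind
# of a Slack token from its prefix.
def token_kind(token):
    if token.startswith("xoxp-"):
        return "user"
    if token.startswith("xoxb-"):
        return "bot"
    return "unknown"
```

For example, running `token_kind(app_token)` on each row of workspace.xlsx before the main loop would flag any bot token that was pasted in by mistake.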

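In closing

First of all, I am very grateful to the friend who showed me that this was possible (in the sense of giving me the push I needed). There may still be plenty of errors I am not aware of (depending on the OS environment and so on), but please do give it a try!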
