Aggregating a DataFrame with rolling

September 14, 2023, 22:02

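Goal

When you aggregate a pandas DataFrame over fixed periods with resample, the rows are grouped into the intervals that resample creates. Sometimes you instead want an aggregate over a fixed period for every row of the original data.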

Conclusion

Use rolling.

Sample

For example, to aggregate data like the following over the past 5 minutes for each row, you need to use rolling.

                         open      high       low     close     volume      closetime quote_asset_volume  trade     taker_buy        taker_sell ignore
opentime
2023-09-13 04:19:00  25877.00  25877.00  25876.99  25876.99    3.24381  1694578799999     83940.04993810    266    1.10062000    28480.74374000      0
2023-09-13 04:20:00  25876.99  25879.70  25876.99  25879.70    2.42946  1694578859999     62867.29750250    176    1.43217000    37060.41785720      0
2023-09-13 04:21:00  25879.70  25889.54  25879.70  25889.54    5.93222  1694578919999    153560.82116000    364    4.01234000   103859.82802150      0
2023-09-13 04:22:00  25889.54  25892.23  25889.53  25892.23    4.03286  1694578979999    104411.25400370    155    1.63533000    42339.86011000      0
2023-09-13 04:23:00  25892.22  25908.30  25892.22  25907.75   15.43625  1694579039999    399847.18983080    348   12.15098000   314743.74357380      0
...                       ...       ...       ...       ...        ...            ...                ...    ...           ...               ...    ...
2023-09-13 12:34:00  26114.99  26140.00  26091.07  26091.47   84.72744  1694608499999   2212712.33725940   1468   39.45336000  1030282.74286180      0
2023-09-13 12:35:00  26091.07  26138.10  26083.33  26133.56   73.16256  1694608559999   1910696.61983500   1254   29.47772000   769604.49889400      0
2023-09-13 12:36:00  26133.56  26165.30  26115.00  26158.51   81.95945  1694608619999   2142623.98408060   1250   47.53452000  1242973.86156010      0
2023-09-13 12:37:00  26158.52  26245.00  26156.87  26238.11  180.19198  1694608679999   4722858.69512200   3833  116.25198000  3046681.93148760      0
2023-09-13 12:38:00  26238.10  26238.11  26162.19  26164.00   96.75866  1694608739999   2534558.00456210   1316   29.58746000   774699.71054490      0

Aggregating with resample produces one row per 5-minute interval.

                         open       low      high     close      volume
opentime
2023-09-13 04:15:00  25877.00  25876.99  25877.00  25876.99     3.24381
2023-09-13 04:20:00  25876.99  25876.99  25928.65  25928.30    62.11364
2023-09-13 04:25:00  25928.30  25922.24  25945.00  25928.23    89.80995
2023-09-13 04:30:00  25928.23  25922.67  25949.19  25939.94    78.05533
2023-09-13 04:35:00  25939.94  25926.19  25939.94  25926.26    30.38798
...                       ...       ...       ...       ...         ...
2023-09-13 12:15:00  26215.00  26198.62  26247.08  26239.01   177.00272
2023-09-13 12:20:00  26239.00  26168.74  26244.45  26171.26   172.40035
2023-09-13 12:25:00  26171.26  26162.13  26203.22  26193.99   170.24136
2023-09-13 12:30:00  26193.99  26016.22  26215.95  26091.47  1113.58251
2023-09-13 12:35:00  26091.07  26083.33  26245.00  26164.00   432.07265
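
To make the bucketing easier to see without calling the API, here is a minimal sketch on synthetic one-minute data. The `make_bars` helper and all values are made up for illustration, and the `'5min'` alias is assumed to be accepted by the installed pandas. The first bar at 04:19 lands alone in the 04:15 bucket, which matches the single-bar first row in the output above.

```python
import pandas as pd


def make_bars(start: str, periods: int) -> pd.DataFrame:
    """Build hypothetical 1-minute bars whose prices simply count upward."""
    index = pd.date_range(start, periods=periods, freq="min", name="opentime")
    base = pd.Series(range(periods), index=index, dtype="float")
    return pd.DataFrame({
        "open": base,
        "high": base + 0.5,
        "low": base - 0.5,
        "close": base + 0.25,
        "volume": 1.0,
    })


bars = make_bars("2023-09-13 04:19:00", 11)  # bars from 04:19 to 04:29

# resample groups rows into fixed, clock-aligned 5-minute buckets
resampled = bars.resample("5min").agg({
    "open": "first",
    "low": "min",
    "high": "max",
    "close": "last",
    "volume": "sum",
})
print(resampled)
```

The result has three rows (04:15, 04:20, 04:25), so the number of output rows depends on the clock boundaries rather than on the number of input rows.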

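Aggregating with rolling gives, for each row, the aggregate of the most recent 5 minutes of data. The window is the last 5 rows, which equals 5 minutes because the bars are 1 minute apart, and the first 4 rows are NaN because their windows are not yet full.
However, rolling does not offer first or last (at least in the pandas version used here), so lambda expressions are needed to get the first and last values.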

                         open       low      high     close     volume
opentime
2023-09-14 04:33:00       NaN       NaN       NaN       NaN        NaN
2023-09-14 04:34:00       NaN       NaN       NaN       NaN        NaN
2023-09-14 04:35:00       NaN       NaN       NaN       NaN        NaN
2023-09-14 04:36:00       NaN       NaN       NaN       NaN        NaN
2023-09-14 04:37:00  26239.30  26227.32  26239.30  26234.05   58.23133
...                       ...       ...       ...       ...        ...
2023-09-14 12:48:00  26417.90  26411.69  26453.19  26437.31  115.75706
2023-09-14 12:49:00  26414.53  26411.69  26453.19  26445.00   97.73054
2023-09-14 12:50:00  26429.52  26429.51  26459.39  26457.40  100.09218
2023-09-14 12:51:00  26448.82  26437.31  26467.25  26457.75   89.49876
2023-09-14 12:52:00  26438.86  26437.31  26467.25  26465.00   77.52727
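
Because an integer window counts rows rather than minutes, the two can differ when a bar is missing. The following sketch uses made-up volumes with the 04:03 bar removed to compare `rolling(5)` with the time-based `rolling("5min")`. The time-based form is an alternative shown for comparison, not what this article's code uses. It needs a sorted DatetimeIndex and, by default, includes rows in the interval (t - 5min, t].

```python
import pandas as pd

# Hypothetical 1-minute volumes with the 04:03 bar missing
index = pd.to_datetime([
    "2023-09-14 04:00", "2023-09-14 04:01", "2023-09-14 04:02",
    "2023-09-14 04:04", "2023-09-14 04:05", "2023-09-14 04:06",
])
volume = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index)

by_rows = volume.rolling(5).sum()       # last 5 rows; NaN until 5 rows exist
by_time = volume.rolling("5min").sum()  # rows within the last 5 minutes

print(pd.DataFrame({"by_rows": by_rows, "by_time": by_time}))
```

At 04:05 the row-based window still reaches back to the 04:00 bar, while the time-based window starts after 04:00. The time-based window also produces a value from the first row on instead of NaN.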

Source code

The source code used for verification is below.

import pandas as pd
import requests


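def create_resample_kline(data: pd.DataFrame, period: int) -> pd.DataFrame:
    """Aggregate OHLCV over the most recent `period` rows for every row."""
    result = data.rolling(period).agg({
        # No 'first'/'last' on rolling here, so pick them by position with .iloc.
        # Plain rows[0] on a datetime-indexed Series relies on a deprecated
        # positional fallback.
        'open': lambda rows: rows.iloc[0],
        'low': 'min',
        'high': 'max',
        'close': lambda rows: rows.iloc[-1],
        'volume': 'sum'
    })
    return result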


data = requests.get('https://api.binance.com/api/v3/klines', params={'symbol':'BTCUSDT', 'interval':'1m'}, timeout=10)
data.raise_for_status()  # stop early on an HTTP error instead of parsing an error body
kline = pd.DataFrame(data.json(),
                     columns=['opentime', 'open', 'high', 'low', 'close', 'volume', 'closetime', 'quote_asset_volume', 'trade', 'taker_buy', 'taker_sell', 'ignore'])
kline['opentime'] = pd.to_datetime(kline['opentime'], unit='ms')
kline.set_index('opentime', inplace=True)
kline = kline.astype({
    'open':'float',
    'high':'float',
    'low':'float',
    'close':'float',
    'volume':'float',
})

# resample ('5min' replaces the deprecated '5T' alias)
result = kline.resample('5min').agg({
    'open':     'first',
    'low':      'min',
    'high':     'max',
    'close':    'last',
    'volume':   'sum'
})
print(result)

# rolling
result = create_resample_kline(kline, 5)
print(result)

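However, aggregating with lambdas becomes slow as the amount of data grows.
To speed it up, the aggregation should be written differently. The faster version is in the paid portion of the original article and is not reproduced here.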
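
Since the faster version is not shown in this text, the following is only one possible lambda-free sketch and not the author's solution. It assumes a fixed row-count window: the first value of the window is then the `open` from `period - 1` rows earlier, and the last value is the current row's `close`. The function name `rolling_kline_vectorized` and the demo data are hypothetical.

```python
import pandas as pd


def rolling_kline_vectorized(data: pd.DataFrame, period: int) -> pd.DataFrame:
    """Hypothetical lambda-free counterpart of create_resample_kline."""
    # The last row of each window is the current row itself;
    # hide it until the window is full so it matches the other columns.
    close = data['close'].copy()
    close.iloc[:period - 1] = float('nan')
    return pd.DataFrame({
        # The first row of the window sits period - 1 rows before the current row.
        'open': data['open'].shift(period - 1),
        'low': data['low'].rolling(period).min(),
        'high': data['high'].rolling(period).max(),
        'close': close,
        'volume': data['volume'].rolling(period).sum(),
    })


# Made-up 1-minute bars for a quick check
index = pd.date_range('2023-09-14 04:33', periods=8, freq='min', name='opentime')
base = pd.Series(range(8), index=index, dtype='float')
demo = pd.DataFrame({
    'open': base,
    'high': base + 2,
    'low': base - 1,
    'close': base + 1,
    'volume': 1.0,
})

fast = rolling_kline_vectorized(demo, 5)
print(fast)
```

This avoids calling a Python function once per window, which is usually where the lambda version spends its time. It only applies to integer windows; a time-based window would need a different approach.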
