港湾統計のExcelデータをPython（GoogleColab）で集計してみる

2022年10月7日 17:57

はじめに

港湾統計は国土交通省ホームページにEXCELデータが公表されているが、ダウンロードして加工して集計して、という作業が面倒だったので、pythonで自動化してみた。

取扱貨物量（総計）

ありがたいことに、各年度のExcelデータの様式が統一されているので、エクセルデータのURLを指定して、DataFrameに格納する関数を作成する。

def  get_cargo(time,url):
  # ブックの指定
  input_file_name = url
  input_book = pd.ExcelFile(input_file_name)

  # 統計結果は2番目のシート、上から2行は無視
  input_sheet_name = input_book.sheet_names
  df = input_book.parse(input_sheet_name[1],skiprows = 2)

  # 都道府県と港湾で集計
  df = df.groupby(['都道府県','港湾']).agg({'合計':  'sum','輸出':  'sum','輸入':  'sum','移出':  'sum','移入':  'sum'})

  # 列の値を行にstack
  df = pd.DataFrame(df.stack()).reset_index()

  # 列名の整形
  column = f'取扱貨物量【{time}】'
  df['港湾'] = df.apply(lambda x:  '{}_{}'.format(x['都道府県'],x['港湾'])  if x['港湾'] != '合計'  else x['都道府県'],  1)
  df[column] = df.apply(lambda x: x[0],  1)
  df['種類'] = df.apply(lambda x: x['level_2'],  1)

 
  # 必要な列だけ抽出
  df = df[['都道府県','港湾','種類',column]]

  return df

年度とURLを指定して関数をテスト

time = '2020年'
url = 'https://www.mlit.go.jp/k-toukei/content/001448034.xlsx'

df_2020 = get_cargo(time,url)
df_2020

あとは、各年次のURLをセットして、

params_cargo = [
  {  'time'  :  '2012年'  ,  'url'  :  'https://www.mlit.go.jp/common/001277668.xls'  },
  {  'time'  :  '2013年'  ,  'url'  :  'https://www.mlit.go.jp/common/001277684.xls'  },
  {  'time'  :  '2014年'  ,  'url'  :  'https://www.mlit.go.jp/common/001277699.xls'  },
  {  'time'  :  '2015年'  ,  'url'  :  'https://www.mlit.go.jp/common/001380149.xlsx'  },
  {  'time'  :  '2016年'  ,  'url'  :  'https://www.mlit.go.jp/k-toukei/content/001448496.xlsx'  },
  {  'time'  :  '2017年'  ,  'url'  :  'https://www.mlit.go.jp/common/001380093.xlsx'  },
  {  'time'  :  '2018年'  ,  'url'  :  'https://www.mlit.go.jp/k-toukei/content/001448540.xlsx'  },
  {  'time'  :  '2019年'  ,  'url'  :  'https://www.mlit.go.jp/k-toukei/content/001448577.xlsx'  },
  {  'time'  :  '2020年'  ,  'url'  :  'https://www.mlit.go.jp/k-toukei/content/001448034.xlsx'  },
]

各年次の結果（DataFrame）をマージする関数を作成

def  get_merge_dataframe(params):
  df_res = pd.DataFrame(index=[], columns=['都道府県','港湾','種類'])
  
  for p in params:
    df = get_cargo(p['time'], p['url'])
    df_res = pd.merge(df_res, df, on=['都道府県','港湾','種類']  , how='outer')
    
return df_res

実行してみる。

df_cargo = get_merge_dataframe(params_cargo)
df_cargo

csvでダウンロードする場合は下記を実行。

from google.colab import files


df_cargo.to_csv('取扱貨物量の推移【総トン数】.csv', encoding="shift_jis")
files.download('取扱貨物量の推移【総トン数】.csv')

取扱貨物量（品目別）

輸出・輸入・移出・移入とも同じ構成のExcelになっているので、EXCELのインポートからDataFameの整形までは共通関数でOK

def get_excel(url):
  # ブックの指定
  input_file_name = url
  input_book = pd.ExcelFile(input_file_name)
  
  # 統計結果は2番目のシート、上から2行は無視
  input_sheet_name = input_book.sheet_names
  df = input_book.parse(input_sheet_name[1],skiprows = 2)

  return df

エクセルをDataFrameに変換する関数

def get_dataframe(time,url):
 
  # EXCELデータをインポート
  df = get_excel(url)
  
  # 都道府県と港湾で集計
  df = df.astype({'トン数': 'int'})
  df = df.groupby(['都道府県','港名','品種名']).agg({'トン数': 'sum'}).reset_index()

  df = df[df['港名'] != '合計']

  df = df.rename(columns={'トン数':time})
  
  return df

繰り返し処理してひとつのDataFrameに整形する関数

def get_merge_dataframe(params):
  df_res = pd.DataFrame(index=[], columns=['都道府県','港名','品種名'])
  
  for p in params:
    df = get_dataframe(p['time'], p['url'])
    df_res = pd.merge(df_res, df, on=['都道府県','港名','品種名'] , how='outer')

  return df_res

輸出

パラメータをセットして

params_export = [
    { 'time' : '2012年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448464.xlsx' },
    { 'time' : '2013年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448471.xlsx' },
    { 'time' : '2014年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448478.xlsx' },
    { 'time' : '2015年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448485.xlsx' },
    { 'time' : '2016年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448499.xlsx' },
    { 'time' : '2017年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448521.xlsx' },
    { 'time' : '2018年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448545.xlsx' },
    { 'time' : '2019年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448582.xlsx' },
    { 'time' : '2020年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448037.xlsx' },
]

実行

df_export = get_merge_dataframe(params_export)
df_export

輸入

パラメータをセットして

params_import = [
    { 'time' : '2012年' , 'url' : 'https://www.mlit.go.jp/common/001277672.xls' },
    { 'time' : '2013年' , 'url' : 'https://www.mlit.go.jp/common/001277688.xls' },
    { 'time' : '2014年' , 'url' : 'https://www.mlit.go.jp/common/001277703.xls' },
    { 'time' : '2015年' , 'url' : 'https://www.mlit.go.jp/common/001277717.xls' },
    { 'time' : '2016年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448500.xlsx' },
    { 'time' : '2017年' , 'url' : 'https://www.mlit.go.jp/common/001277753.xls' },
    { 'time' : '2018年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448546.xlsx' },
    { 'time' : '2019年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448583.xlsx' },
    { 'time' : '2020年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448038.xlsx' },
]

実行

df_import = get_merge_dataframe(params_import)
df_import

移出

パラメータをセットして

params_emigration = [
    { 'time' : '2012年' , 'url' : 'https://www.mlit.go.jp/common/001277673.xls' },
    { 'time' : '2013年' , 'url' : 'https://www.mlit.go.jp/common/001277689.xls' },
    { 'time' : '2014年' , 'url' : 'https://www.mlit.go.jp/common/001277704.xls' },
    { 'time' : '2015年' , 'url' : 'https://www.mlit.go.jp/common/001277718.xls' },
    { 'time' : '2016年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448501.xlsx' },
    { 'time' : '2017年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448522.xlsx' },
    { 'time' : '2018年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448547.xlsx' },
    { 'time' : '2019年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448584.xlsx' },
    { 'time' : '2020年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448039.xlsx' },
]

実行

df_emigration = get_merge_dataframe(params_emigration)
df_emigration

移入

パラメータをセットして

params_immigration = [
    { 'time' : '2012年' , 'url' : 'https://www.mlit.go.jp/common/001277674.xls' },
    { 'time' : '2013年' , 'url' : 'https://www.mlit.go.jp/common/001277690.xls' },
    { 'time' : '2014年' , 'url' : 'https://www.mlit.go.jp/common/001277705.xls' },
    { 'time' : '2015年' , 'url' : 'https://www.mlit.go.jp/common/001380154.xlsx' },
    { 'time' : '2016年' , 'url' : 'https://www.mlit.go.jp/common/001380116.xlsx' },
    { 'time' : '2017年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448523.xlsx' },
    { 'time' : '2018年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448548.xlsx' },
    { 'time' : '2019年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448585.xlsx' },
    { 'time' : '2020年' , 'url' : 'https://www.mlit.go.jp/k-toukei/content/001448040.xlsx' },
]

実行

df_immigration = get_merge_dataframe(params_immigration)
df_immigration

港湾統計のExcelデータをPython（GoogleColab）で集計してみる

はじめに

取扱貨物量（総計）

取扱貨物量（品目別）

輸出

輸入

移出

移入

いいなと思ったら応援しよう！