Random Forest

February 12, 2022, 23:07

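〇An ensemble learning algorithm that uses decision trees as weak learners

　－An algorithm for classification and regression
　－Builds many decision trees and takes a majority vote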

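〇What is decision tree analysis?

An analysis method that identifies which explanatory variables, among the observed variables, affect the target variable, and builds a tree-shaped model from them.
・When to use it → When going through cross-tabulations of many different patterns would be too laborious.
・How to read the result → Check the explanatory variables in order from the top of the tree.

－Prone to overfitting.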

https://www.youtube.com/watch?v=Zj-ed9oKF4s&t=357s
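The overfitting tendency of a single decision tree can be observed with a small experiment. The following is only a sketch: the breast cancer dataset, the 80/20 split, and the depth limit of 3 are assumptions chosen for illustration and do not come from the notes above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# No depth limit: the tree keeps splitting until the training data is fit perfectly
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth limited to 3 (an arbitrary value for comparison)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train_acc = deep_tree.score(X_train, y_train)
deep_test_acc = deep_tree.score(X_test, y_test)
shallow_train_acc = shallow_tree.score(X_train, y_train)
shallow_test_acc = shallow_tree.score(X_test, y_test)

print("deep tree    depth:", deep_tree.get_depth(),
      "train:", deep_train_acc, "test:", deep_test_acc)
print("shallow tree depth:", shallow_tree.get_depth(),
      "train:", shallow_train_acc, "test:", shallow_test_acc)
```

A gap between the training accuracy and the test accuracy of the unrestricted tree is the usual sign of overfitting.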

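〇What is a weak learner?

As the name suggests, a "weak learner" is a model with low predictive accuracy on its own (typically only somewhat better than random guessing).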

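〇What is ensemble learning?

A machine learning approach that improves accuracy by combining the outputs of multiple models, none of which is very accurate by itself.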
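Why combining inaccurate learners can help is easiest to see in an idealized setting. The sketch below assumes a binary problem, an odd number of learners, and learners whose errors are completely independent, each correct with the same probability p. Real learners are correlated, so the gain in practice is smaller. The function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_learners: int, p: float) -> float:
    """Probability that more than half of n independent learners are correct."""
    k_min = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )


# Each learner alone is right only 60% of the time
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three learners the vote is already right 64.8% of the time (3 × 0.6² × 0.4 + 0.6³), and the value keeps rising as learners are added.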

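〇Random forest

A single large learner that combines multiple, different decision trees
－Mitigates the overfitting that a single decision tree is prone to.
－Classification: takes a majority vote.
－Regression: takes the mean (or another representative value) of the trees' predictions.
－Keeps the trees diverse by building each one from different data.

－Each decision tree is built from its own sample drawn from the training data (a bootstrap sample, drawn with replacement).
　→The explanatory variables considered are also chosen at random (a random subset at each split).
－With too few trees the averaging effect is weak, so the model can still overfit.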
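The procedure described above can be sketched by hand with plain decision trees. This is a simplified illustration, not scikit-learn's actual implementation: the tree count of 25, the dataset, and the hard majority vote are assumptions made here (RandomForestClassifier averages the trees' predicted probabilities instead of counting votes).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt": a random subset of features is considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect every tree's prediction: shape (n_trees, n_test_samples)
votes = np.array([tree.predict(X_test) for tree in trees])

# Majority vote over the trees (labels are 0/1)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
forest_acc = (forest_pred == y_test).mean()
tree_accs = [(v == y_test).mean() for v in votes]

print("mean accuracy of single trees:", np.mean(tree_accs))
print("accuracy of the majority vote:", forest_acc)
```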

〇Using random forest

Import from sklearn.ensemble (scikit-learn):

For classification: RandomForestClassifier

For regression: RandomForestRegressor

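【Arguments of RandomForestClassifier() and RandomForestRegressor()】
・n_estimators: number of decision trees
・criterion: split quality measure for the trees (for the classifier: 'gini' or 'entropy'; default 'gini'. The regressor uses different criteria, e.g. 'squared_error' in recent versions)
・max_depth: maximum depth of each tree
・n_jobs: number of jobs to run in parallel
・random_state: seed for the random number generator
・verbose: whether to print messages while the model is being built (default 0)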
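As a quick illustration of the arguments listed above, the sketch below passes each of them to the constructor and reads them back with get_params(). The concrete values (200 trees, depth 5, and so on) are arbitrary choices for this example, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,     # number of trees
    criterion="entropy",  # split quality measure (default is "gini")
    max_depth=5,          # maximum depth of each tree
    n_jobs=-1,            # -1 uses all available CPU cores
    random_state=0,       # fixed seed for reproducible results
    verbose=0,            # 0 prints no progress messages
)

params = model.get_params()
for name in ("n_estimators", "criterion", "max_depth", "n_jobs", "random_state", "verbose"):
    print(name, "=", params[name])

# Values used when nothing is specified
default_params = RandomForestClassifier().get_params()
print("default criterion:", default_params["criterion"])
```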

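【Class methods】
・fit(X, y[, sample_weight]): train the model
・get_params([deep]): get the parameters of the estimator
・predict(X): make predictions
・score(X, y[, sample_weight]): mean accuracy for the classifier, coefficient of determination (R^2) for the regressor
・set_params(**params): set the parameters of the estimator
※feature_importances_: feature importance of each explanatory variable (an attribute available after fit, not a method)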
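The methods above can be tried out on synthetic data. In this sketch the datasets come from make_classification and make_regression, and all sizes and seeds are assumptions for illustration. It also checks what score() returns for each class: accuracy for the classifier and R^2 for the regressor.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score

# Small synthetic datasets (sizes chosen arbitrarily)
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xc, yc)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)

# score() is mean accuracy for the classifier and R^2 for the regressor
clf_score = clf.score(Xc, yc)
reg_score = reg.score(Xr, yr)
print(clf_score, accuracy_score(yc, clf.predict(Xc)))
print(reg_score, r2_score(yr, reg.predict(Xr)))

# feature_importances_ is an attribute that exists after fit(); it sums to 1
print(clf.feature_importances_)

# set_params() changes a parameter; it takes effect at the next fit()
clf.set_params(n_estimators=10)
print(clf.get_params()["n_estimators"], len(clf.estimators_))
```

After set_params() the already trained forest still holds its 50 trees; the new value only applies once fit() is called again.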

〇Classification with random forest

In [1]: # Import libraries
       import pandas as pd
       from sklearn.datasets import load_breast_cancer
       from sklearn.ensemble import RandomForestClassifier
       from sklearn.model_selection import train_test_split
       from sklearn.metrics import accuracy_score

       # Load the data
       breast = load_breast_cancer()
       df_breast = pd.DataFrame(data=breast.data,columns=breast.feature_names)
       df_breast['target'] = breast.target

       # Create the model instance
       clf = RandomForestClassifier()

       # Explanatory variables
       X = df_breast[breast.feature_names].values

       # Target variable
       Y = df_breast['target'].values

       # Split the data (no random_state is set, so the result varies from run to run)
       X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)

       # Train the prediction model
       clf.fit(X_train, y_train)

       # Compute the accuracy
       accuracy_score(y_test,clf.predict(X_test))

Out[1]: 0.9473684210526315

〇Get the feature importances

In [2]: clf.feature_importances_

Out[2]: array([0.02737869, 0.01205152, 0.05916724, 0.07277569, 0.00863201,・・・,0.01103733, 0.021426  , 0.11471481, 0.00877436, 0.0051457 ])

〇Convert the feature importances to a DataFrame and extract the most important explanatory variables.

In [3]: feature_importance = pd.DataFrame({'feature':breast.feature_names,'importances':clf.feature_importances_}).sort_values(by="importances", ascending=False)
       feature_importance.head()

Out[3]:                  feature  importances
       22       worst perimeter     0.158892
       20          worst radius     0.131708
       23            worst area     0.125950
       27  worst concave points     0.114715
       7    mean concave points     0.083558

〇Regression with random forest

# Import libraries
# Note: load_boston was removed in scikit-learn 1.2, so this code needs an older version
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load the data
boston = load_boston()
df_boston = pd.DataFrame(data=boston.data,columns=boston.feature_names)
df_boston['target'] = boston.target

# Create the model instance
clf = RandomForestRegressor(random_state=0)

# Explanatory variables
X = df_boston[boston.feature_names].values

# Target variable
Y = df_boston['target'].values

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2,random_state=0)

# Train the prediction model
clf.fit(X_train, y_train)

# Evaluate with the coefficient of determination (accuracy_score is for classification only)
r2_score(y_test, clf.predict(X_test))

〇Visualize the feature importances (seaborn)

In [1]: # Import libraries
       import matplotlib.pyplot as plt
       import seaborn as sns

       # Feature importances
       feature_importance = pd.DataFrame({'feature':boston.feature_names,'importances':clf.feature_importances_}).sort_values(by="importances", ascending=False)

       # Plot the top five
       sns.barplot(x="importances", y="feature", data=feature_importance.head())
       plt.show()
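Because load_boston is no longer available from scikit-learn 1.2 onward, the regression steps can be reproduced on another bundled dataset. The sketch below swaps in load_diabetes, which is an assumption made here and not the dataset used in these notes. The names reg and importance are likewise chosen for this example.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load a regression dataset that ships with current scikit-learn
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0
)

# Train the regressor
reg = RandomForestRegressor(random_state=0)
reg.fit(X_train, y_train)

# Evaluate with the coefficient of determination on the test data
test_r2 = r2_score(y_test, reg.predict(X_test))
print("R^2 on test data:", test_r2)

# Feature importances as a DataFrame, most important first
importance = pd.DataFrame(
    {"feature": diabetes.feature_names, "importances": reg.feature_importances_}
).sort_values(by="importances", ascending=False)
print(importance.head())
```

The resulting DataFrame has the same two columns as feature_importance above, so it can be passed to the same sns.barplot call.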