Linear Regression: Algorithm and Prediction

April 28, 2022, 21:32

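Let's start with LMS, sometimes called the COBOL of machine learning.
Using a dataset with two features $${x_1, x_2}$$, that is, inputs $${X\in \mathbb{R}^{m\times 2}}$$ and outputs $${Y\in \mathbb{R}^{m}}$$, I summarize here how a Linear Regression model is trained, as notes for my own reference.
Note: this is written from my current understanding, and I will add to and correct it as that understanding improves.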

1. LMS Cost Function

$${\theta}$$ is a parameter vector $${\theta \in \mathbb{R}^{3}}$$ with components $${(\theta_0,\theta_1,\theta_2)}$$. Each input $${x}$$ is extended with $${x_0 = 1}$$, so that $${\theta_0}$$ acts as the intercept.
$${h_\theta(x)=\theta^Tx}$$ is the inner product of $${\theta}$$ and $${x}$$, that is, $${\theta_0x_0+\theta_1x_1+\theta_2x_2}$$.
The cost function $${J(\theta)}$$ is
- $${J(\theta)=\frac{1}{2}\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)}))^2}$$
From here on, we look for the $${\theta}$$ that minimizes $${J(\theta)}$$.
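As a minimal sketch of the cost function above, assuming NumPy and a tiny made-up dataset (the names `cost` and `X_aug` and all the numbers are illustrative, not part of the notebook used later):

```python
import numpy as np

def cost(theta, X_aug, y):
    """J(theta) = 1/2 * sum_i (y_i - theta^T x_i)^2, where X_aug has a leading column of ones."""
    residuals = y - X_aug @ theta
    return 0.5 * np.sum(residuals ** 2)

# Made-up data: 3 examples, 2 features (illustrative values only)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([5.0, 4.0, 7.0])

# Prepend x_0 = 1 to every example
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.zeros(3)
print(cost(theta, X_aug, y))  # 0.5 * (25 + 16 + 49) = 45.0
```

With $${\theta = 0}$$ every prediction is 0, so the cost is just half the sum of the squared targets.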

2. Gradient Descent

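A minimizer of $${J(\theta)}$$ is a point where its derivative with respect to $${\theta \in \mathbb{R}^{3}}$$ is zero, that is, where $${\nabla_\theta J(\theta)=0}$$. Gradient descent approaches such a $${\theta}$$ step by step rather than solving this equation directly.
We find $${\theta}$$ with the following Widrow-Hoff learning rule (LMS: least mean squares), where $${\alpha}$$ is the learning rate:
- $${\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)}$$ (1)
First, consider a single parameter $${\theta_j}$$ and a single data point $${(x, y)}$$, and take the partial derivative of the cost function $${J(\theta)}$$ with respect to $${\theta_j}$$.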
- $${\frac{\partial J(\theta)}{\partial \theta_j}= \frac{\partial}{\partial \theta_j}\frac{1}{2}(y-h_\theta(x))^2}$$
  By the chain rule,
  $${=2\cdot\frac{1}{2}(y-h_\theta(x))\frac{\partial}{\partial \theta_j}(y-h_\theta(x))}$$
  $${=(y-h_\theta(x))\frac{\partial}{\partial \theta_j}(y-\sum_{k=0}^{2}\theta_k x_k)}$$
  $${=(y-h_\theta(x))\cdot(-x_j)}$$
  $${=(h_\theta(x)-y)x_j}$$
  which is the partial derivative we need.
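One way to sanity-check this derivative is to compare it with a finite-difference approximation. The following is a rough sketch assuming NumPy; the function names and the example values are made up for illustration:

```python
import numpy as np

def single_cost(theta, x, y):
    """Cost for one example: 1/2 * (y - theta^T x)^2."""
    return 0.5 * (y - theta @ x) ** 2

def analytic_grad(theta, x, y):
    """Gradient from the derivation: (h_theta(x) - y) * x."""
    return (theta @ x - y) * x

def numeric_grad(theta, x, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (single_cost(theta + step, x, y)
                   - single_cost(theta - step, x, y)) / (2 * eps)
    return grad

# Made-up example with x_0 = 1 (illustrative values only)
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 2.0])
y = 4.0

print(analytic_grad(theta, x, y))
print(numeric_grad(theta, x, y))
```

The two printed vectors should agree up to rounding error, which supports the sign of the final result $${(h_\theta(x)-y)x_j}$$.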
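Substituting this partial derivative into the LMS rule (1) gives the update for the $${i}$$-th data point:
- $${\theta_j := \theta_j - \alpha(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}}$$
Applying this update to each data point in turn trains $${\theta_j}$$. Once $${\theta}$$ has been trained on the data, we can predict y with $${h_\theta(x)=\theta^Tx}$$.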
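The per-example update can be sketched as follows on synthetic data. This is only an illustration under my own assumptions: the data is generated from a known `true_theta` without noise, and the learning rate `alpha = 0.05` and the 50 passes are arbitrary choices that suit features of roughly unit scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a known theta (illustrative only)
true_theta = np.array([1.0, 2.0, -3.0])
X = rng.normal(size=(200, 2))
X_aug = np.hstack([np.ones((200, 1)), X])  # prepend x_0 = 1
y = X_aug @ true_theta                     # noise-free targets

alpha = 0.05
theta = np.zeros(3)
for epoch in range(50):
    for x_i, y_i in zip(X_aug, y):
        error = theta @ x_i - y_i          # h_theta(x) - y for this example
        theta -= alpha * error * x_i       # updates all theta_j at once

print(theta)
```

Because the targets contain no noise, `theta` should end up very close to `true_theta`; with real data it would only approach a best-fit value.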

3. Train model

Let's use LMS to make predictions on the data from Kaggle's House Price problem.
Open a Jupyter notebook on Kaggle and load the data.

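import numpy as np
# Columns 46, 51, 80 are GrLivArea, Bedroom, and SalePrice; skiprows=1 skips the header row
train = np.loadtxt('/kaggle/input/house-prices-dataset/train.csv',dtype='int',delimiter=',',skiprows=1,usecols=(46,51,80))
train.shape
x = train[:,0:2]  # features x1, x2
y = train[:,2]    # target: SalePrice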

The input data used for LMS are:
- X1: GrLivArea: Above grade (ground) living area square feet
- X2: Bedroom: Number of bedrooms above basement level
- Y: SalePrice - the property's sale price in dollars
I chose these three columns and extracted them with usecols=(46,51,80).
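Selecting columns by position is easy to get wrong, so here is a hedged alternative that selects them by name with pandas. The CSV text below is a made-up stand-in for train.csv, and I am assuming the header names are `GrLivArea`, `BedroomAbvGr`, and `SalePrice`; check them against the actual file before relying on this.

```python
import io
import pandas as pd

# A tiny made-up stand-in for train.csv (values are illustrative, not real Kaggle rows)
csv_text = """Id,GrLivArea,BedroomAbvGr,SalePrice
1,1500,3,200000
2,1200,2,150000
3,2000,4,300000
"""

# Assumed header names; verify against the real file
cols = ["GrLivArea", "BedroomAbvGr", "SalePrice"]
train_df = pd.read_csv(io.StringIO(csv_text), usecols=cols)

x = train_df[["GrLivArea", "BedroomAbvGr"]].to_numpy()
y = train_df["SalePrice"].to_numpy()
print(x.shape, y.shape)  # (3, 2) (3,)
```

On Kaggle, the `io.StringIO(csv_text)` argument would be replaced by the path to train.csv.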

import matplotlib.pyplot as plt
# Scatter plot of each feature against the sale price
plt.figure()
plt.scatter(x[:,0],y)
plt.xlabel('x1:GrLivArea')
plt.ylabel('y:house price')
plt.figure()
plt.scatter(x[:,1],y)
plt.xlabel('x2:Bedroom')
plt.ylabel('y:house price')

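# Prepend a column of ones (x_0 = 1) to the features
new_x = np.zeros((x.shape[0], x.shape[1] + 1), dtype=x.dtype)
new_x[:, 0] = 1
new_x[:, 1:] = x
n,d = new_x.shape
theta = np.zeros(d)
alpha = 1e-9  # learning rate
# One pass over the data, applying the LMS update to one row at a time
for i in range(n):
    theta -= alpha*(theta@new_x[i]-y[i])*new_x[i]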

A column of ones is added as $${x_0}$$.
$${\alpha}$$ is set to 1e-9.
I trained on a single pass over all rows of x. The result was:
- $${\theta_0 =0.07}$$
- $${\theta_1 =112.43}$$
- $${\theta_2 =0.18}$$
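A note on why $${\alpha}$$ has to be so small here. For a single example, one LMS update multiplies the prediction error on that same example by $${(1-\alpha\|x\|^2)}$$, so the error shrinks only when $${0 < \alpha < 2/\|x\|^2}$$. With living areas in the thousands of square feet, $${\|x\|^2}$$ is in the millions. The sketch below checks this on one made-up row (the values and the helper name `error_after_update` are illustrative assumptions):

```python
import numpy as np

def error_after_update(theta, x, y, alpha):
    """Apply one LMS update and return the prediction error on the same example."""
    error = theta @ x - y
    new_theta = theta - alpha * error * x
    return new_theta @ x - y

# Made-up row on the scale of the features used here (illustrative values only)
x = np.array([1.0, 1500.0, 3.0])  # x_0 = 1, living area, bedrooms
y = 200000.0
theta = np.zeros(3)

bound = 2.0 / (x @ x)  # the update shrinks the error only if alpha < bound
for alpha in (1e-9, 1e-7, 1e-5):
    before = theta @ x - y
    after = error_after_update(theta, x, y, alpha)
    print(alpha, alpha < bound, abs(after) < abs(before))
```

For this row the bound is a little under 1e-6, so 1e-9 and 1e-7 reduce the error while 1e-5 makes it grow. Rescaling the features would allow a larger learning rate, but that is outside what was done in this post.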

4. Predict data

While we are at it, let's use the resulting $${\theta}$$ to make predictions.
I created predictions on the test data provided by Kaggle.

# Columns 46, 51, 0 are GrLivArea, Bedroom, and Id (test.csv has no SalePrice column)
test = np.loadtxt('/kaggle/input/house-prices-dataset/test.csv',dtype='int',delimiter=',',skiprows=1,usecols=(46,51,0))
test.shape
ID = test[:,2]
x_test = test[:,0:2]

# Prepend x_0 = 1, as was done for the training data
new_x_test = np.zeros((x_test.shape[0], x_test.shape[1] + 1), dtype=x_test.dtype)
new_x_test[:, 0] = 1
new_x_test[:, 1:] = x_test

# h_theta(x) = theta^T x for every test row
predictions = new_x_test.dot(theta)

import pandas as pd
output = pd.DataFrame({'Id': ID, 'SalePrice': predictions})
output.to_csv('sample_submission1.csv', index=False)

Submitting these predictions gave a score of 0.29171, which ranked 3,869th out of 4,401 participants.
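For context on what the score measures: as far as I understand, this competition is evaluated on the root-mean-squared error between the logarithm of the predicted price and the logarithm of the actual price, so lower is better. A minimal sketch of that metric, with a hypothetical helper name `rmse_log` and made-up prices:

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# Made-up prices (illustrative values only)
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([180000.0, 165000.0, 300000.0])

print(rmse_log(y_true, y_true))  # 0.0 for a perfect prediction
print(rmse_log(y_true, y_pred))
```

Because the metric works on logarithms, it penalizes relative errors, so being off by 10% on a cheap house counts about as much as being off by 10% on an expensive one.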

…Next, let's try a Neural Network!