Those Who Unlock the Secrets of Data - An AI Engineer's Challenge

Yoshito Kamizato @ B2B marketer at a generative AI company

November 14, 2024, 10:03

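In an office wrapped in morning haze, Nakamura sipped his coffee and faced his monitor.

"Nakamura-san, I can't get this medical data analysis to work."

Suzuki, the new hire, had come over to ask. The screen was filled with patient test data: blood pressure, heart rate, body temperature, blood test values, several hundred columns in all.

"I see. There are too many dimensions here, and that makes the underlying structure hard to see."

Nakamura began to explain quietly.

"With data like this, you have to strip away what you don't need, the way you polish a rough stone into a gem."

Suzuki listened with interest.

"Let's start by trying PCA in Python. With the scikit-learn library it's easy to implement."

Nakamura began typing with a practiced hand.

"First we import the libraries we need: NumPy, plus StandardScaler and PCA from scikit-learn."

He explained each step aloud as he typed.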

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

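"Preprocessing matters too. PCA picks the directions of largest variance, so a feature measured in large units would dominate the result. Before applying PCA, we put every feature on the same scale. This is called standardization, a form of feature scaling."

Suzuki took careful notes while peering at the screen.

"With StandardScaler, each feature is rescaled to mean 0 and variance 1. Here X is the patient data as an array with one row per patient and one column per test item."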

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
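
To see why the scaling step matters, here is a minimal sketch on made-up data. The two columns (a blood-pressure-like value and a body-temperature-like value) and their distributions are assumptions chosen for illustration, not the data from the story.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical, independent measurements on very different scales
blood_pressure = rng.normal(120, 15, 1000)   # spread of about 15 units
body_temp = rng.normal(36.5, 0.4, 1000)      # spread of about 0.4 units
X_raw = np.column_stack([blood_pressure, body_temp])

# Standardize each column to mean 0 and variance 1
X_std = StandardScaler().fit_transform(X_raw)

# Share of variance captured by each principal component
raw_ratio = PCA(n_components=2).fit(X_raw).explained_variance_ratio_
std_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("Without scaling:", np.round(raw_ratio, 3))
print("With scaling:   ", np.round(std_ratio, 3))
```

Without scaling, the first component is almost entirely the large-unit column. After standardization, both columns contribute on equal terms.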

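"Next we create a PCA instance and transform the data. Here we pass the fraction of variance to keep, which decides how much information is retained."

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_pca = pca.fit_transform(X_scaled)

"A value of 0.95 means scikit-learn keeps the smallest number of principal components whose cumulative explained variance reaches 95%. Adjusting it lets you balance compression against information retained."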
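
As a rough illustration of what a fractional `n_components` does, the sketch below builds a hypothetical 50-column data set driven by 5 hidden factors (an assumption, not the story's data) and checks how many components PCA keeps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 50 observed columns generated from 5 hidden factors plus noise
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 50))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float between 0 and 1 is read as a target fraction of explained variance
pca_demo = PCA(n_components=0.95).fit(X_demo_scaled)

kept = pca_demo.n_components_                            # number of components chosen
retained = pca_demo.explained_variance_ratio_.sum()      # variance they explain together

print(f"Components kept: {kept} of {X_demo.shape[1]}")
print(f"Variance retained: {retained:.3f}")
```

The fitted attribute `n_components_` reports the number actually chosen, which is useful to log when the threshold is tuned.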

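Nakamura checked the output and went on.

"Look. The original data had several hundred dimensions, and this cuts it down sharply while keeping the important structure. Let's check how much of the variance each principal component explains."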

explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

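"From this result you can see that the first few principal components explain most of the variation in the data. That's the power of dimensionality reduction."

Suzuki nodded, impressed.

"Now let's visualize the result."

Nakamura went on to type the plotting code.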

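import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
# Plot against 1..k so the x-axis reads as the number of components kept
n_kept = np.arange(1, len(cumulative_variance_ratio) + 1)
plt.plot(n_kept, cumulative_variance_ratio)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Explained Variance Ratio vs Number of Components')
plt.grid(True)
plt.show()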

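"This graph shows at a glance how many dimensions it takes to capture most of the variance. In machine learning, visualizing like this to deepen your understanding is important too."

Suzuki's eyes lit up at Nakamura's careful explanation.

"Finally, let's plot the first two principal components as a scatter plot."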

plt.figure(figsize=(10, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA Results')

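"Now data that had several hundred dimensions is projected onto a two-dimensional plane. The distribution and relationships in the data are much easier to grasp visually, aren't they?"

Suzuki stared at the screen, moved.

Nakamura kept typing deftly.

"PCA stands for principal component analysis. It finds the directions along which the data varies the most. For medical data, it's especially useful for finding test items and symptoms that move together."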
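
One way to see which test items move together is to inspect the loadings in `components_`. The sketch below is a minimal example on synthetic data; the column names and the hidden factor are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)  # hidden factor shared by the first three columns

feature_names = ["test_a", "test_b", "test_c", "test_d", "test_e"]  # hypothetical names
X_demo = np.column_stack([
    z + 0.3 * rng.normal(size=n),    # rises with the hidden factor
    z + 0.3 * rng.normal(size=n),    # rises with the hidden factor
    -z + 0.3 * rng.normal(size=n),   # falls as the hidden factor rises
    rng.normal(size=n),              # unrelated noise
    rng.normal(size=n),              # unrelated noise
])

pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# Each row of components_ holds one component's weight on every original feature
loadings = pca_demo.components_[0]
order = np.argsort(-np.abs(loadings))
top_features = [feature_names[i] for i in order[:3]]

for i in order:
    print(f"{feature_names[i]}: {loadings[i]:+.2f}")
```

Columns with large loadings of the same sign tend to rise together, and opposite signs indicate columns that move in opposite directions. The overall sign of a component is arbitrary.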

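A scatter plot appeared on the screen.

"Look. This graph compresses hundreds of test items into two dimensions, and you can see patients with similar symptoms naturally forming groups."

Suzuki's eyes sparkled.

"That's amazing. But why did you choose PCA?"

Nakamura smiled and answered.

"The right method depends on the nature of the data. This data has strong linear correlations, so PCA is a good fit. If nonlinear relationships were dominant, we'd reach for methods like t-SNE or UMAP."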

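At the afternoon meeting, they shared the analysis results with the team.

"Using dimensionality reduction, the patients' symptom patterns became clear. Based on this, it looks like we'll be able to propose more effective treatments."

The department head nodded, impressed.

"Excellent work. By the way, is the computational cost manageable?"

"Yes. PCA is a relatively lightweight method, so real-time analysis is possible. That said, when we need a more detailed analysis, we're also considering t-SNE."

The department head leaned forward with interest.

"What exactly is the difference?"

Nakamura stood up, went to the whiteboard, and began explaining as he drew.

"PCA and t-SNE look at data in fundamentally different ways. With n data points and p features, a full PCA costs roughly O(n×p²) when n is larger than p, so it grows only linearly with the number of patients. Exact t-SNE has to compare every pair of points, which is O(n²) per iteration over hundreds of iterations. The Barnes-Hut approximation that scikit-learn uses by default brings that down to about O(n log n), but it's still far heavier than PCA."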
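
scikit-learn's `TSNE` lets you choose between the two algorithms through its `method` parameter. The sketch below runs both on a small random array, which is a hypothetical stand-in for the scaled patient data, and reports the wall-clock time of each.

```python
import time

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_small = rng.normal(size=(200, 20))  # hypothetical stand-in for scaled patient data

timings = {}
embeddings = {}
for method in ("barnes_hut", "exact"):
    # "barnes_hut" is the default approximation; "exact" compares all pairs
    start = time.perf_counter()
    embeddings[method] = TSNE(
        n_components=2, method=method, random_state=42
    ).fit_transform(X_small)
    timings[method] = time.perf_counter() - start

for method, seconds in timings.items():
    print(f"{method}: {seconds:.2f} s, embedding shape {embeddings[method].shape}")
```

At a couple of hundred points the two may take similar time. The difference in scaling only becomes visible as the number of points grows.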

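Nakamura showed a comparison of actual processing times in Python.

import time
from sklearn.manifold import TSNE

# Time PCA (perf_counter is a monotonic clock meant for measuring intervals)
start_time = time.perf_counter()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
pca_time = time.perf_counter() - start_time

# Time t-SNE
start_time = time.perf_counter()
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
tsne_time = time.perf_counter() - start_time

print(f"PCA processing time: {pca_time:.4f} seconds")
print(f"t-SNE processing time: {tsne_time:.4f} seconds")

The results appeared.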

PCA processing time: 0.0234 seconds
t-SNE processing time: 12.8756 seconds

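"As you can see, PCA finishes in about 0.02 seconds, while t-SNE takes about 13 seconds. Even with a data set as small as this one, the gap is that large."

Nakamura went on to show how processing time changes with data size.

# Compare processing times across data sizes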
sizes = [1000, 2000, 5000, 10000]
pca_times = []
tsne_times = []

for size in sizes:
    # Take the first `size` rows as the sample
    X_subset = X_scaled[:size]
    
    # PCA
    start_time = time.time()
    pca.fit_transform(X_subset)
    pca_times.append(time.time() - start_time)
    
    # t-SNE
    start_time = time.time()
    tsne.fit_transform(X_subset)
    tsne_times.append(time.time() - start_time)

# Show the results
for size, p_time, t_time in zip(sizes, pca_times, tsne_times):
    print(f"\nData size: {size}")
    print(f"PCA time: {p_time:.4f} seconds")
    print(f"t-SNE time: {t_time:.4f} seconds")
    print(f"Ratio (t-SNE/PCA): {t_time/p_time:.1f}x")

Data size: 1000
PCA time: 0.0156 seconds
t-SNE time: 3.2451 seconds
Ratio (t-SNE/PCA): 208.0x

Data size: 2000
PCA time: 0.0198 seconds
t-SNE time: 13.4521 seconds
Ratio (t-SNE/PCA): 679.4x

Data size: 5000
PCA time: 0.0354 seconds
t-SNE time: 84.6723 seconds
Ratio (t-SNE/PCA): 2392.4x

Data size: 10000
PCA time: 0.0687 seconds
t-SNE time: 342.8934 seconds
Ratio (t-SNE/PCA): 4991.2x

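"Take a look. As the data grows, the gap between PCA and t-SNE widens rapidly. In these measurements t-SNE's time roughly quadruples each time the data doubles, while PCA's grows only gently. At 10,000 records, t-SNE takes more than five minutes, while PCA needs only about 0.07 seconds."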
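
One way to check how fast each method's time grows is to fit a straight line on a log-log scale: the slope estimates the exponent k in time ∝ n^k. The sketch below uses the timings printed above; the helper name `scaling_exponent` is introduced here for illustration.

```python
import numpy as np

# Timings copied from the output above
sizes = np.array([1000, 2000, 5000, 10000])
pca_times = np.array([0.0156, 0.0198, 0.0354, 0.0687])
tsne_times = np.array([3.2451, 13.4521, 84.6723, 342.8934])

def scaling_exponent(n, seconds):
    """Slope of log(time) against log(n), i.e. k in time ~ n**k."""
    slope, _intercept = np.polyfit(np.log(n), np.log(seconds), 1)
    return slope

pca_exponent = scaling_exponent(sizes, pca_times)
tsne_exponent = scaling_exponent(sizes, tsne_times)

print(f"PCA exponent:   {pca_exponent:.2f}")
print(f"t-SNE exponent: {tsne_exponent:.2f}")
```

For these numbers the t-SNE exponent comes out close to 2, which points to roughly quadratic growth, while the PCA exponent stays below 1. Four data points give only a rough estimate.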

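The department head put a hand to his chin and studied the results.

"I see. So when we need real-time processing, we have no choice but to use PCA."

"Yes. But where the time constraints are looser, as in a nightly batch, a detailed analysis with t-SNE is also worthwhile. When we need to discover finer patterns, for example exploring patient subgroups or spotting unusual cases, I think t-SNE is worth the computation time."

Nakamura typed in new code.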

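# Compare the two visualizations side by side
# Assumes `labels` holds one group label per patient; string labels are
# converted to integer codes because scatter() needs numeric values for c=
label_codes = np.unique(labels, return_inverse=True)[1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# PCA result
scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=label_codes, cmap='viridis')
ax1.set_title('PCA Visualization')

# t-SNE result
scatter2 = ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=label_codes, cmap='viridis')
ax2.set_title('t-SNE Visualization')

plt.colorbar(scatter1, ax=ax1, label='Patient Group')
plt.colorbar(scatter2, ax=ax2, label='Patient Group')
plt.show()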

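"As you can see, t-SNE does a better job of preserving local structure. In particular, I hope you can see that the patients' symptom groups come out more clearly."

Pointing at the two visualizations on the monitor, Nakamura began a more detailed explanation.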

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Generate hypothetical patient data
np.random.seed(42)
n_patients = 1000
n_features = 100

# Create three distinct patient groups
group1 = np.random.normal(0, 1, (n_patients//3, n_features))
group2 = np.random.normal(3, 1, (n_patients//3, n_features))
group3 = np.random.normal(-2, 1, (n_patients//3, n_features))

X = np.vstack([group1, group2, group3])
labels = np.array(['Group A'] * (n_patients//3) + 
                 ['Group B'] * (n_patients//3) + 
                 ['Group C'] * (n_patients//3))

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Run PCA and t-SNE
pca = PCA(n_components=2)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)

X_pca = pca.fit_transform(X_scaled)
X_tsne = tsne.fit_transform(X_scaled)

# Visualization
plt.figure(figsize=(15, 6))

# PCA result
plt.subplot(121)
unique_labels = np.unique(labels)
colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels)))
for label, color in zip(unique_labels, colors):
    mask = labels == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c=[color], label=label, alpha=0.6)
plt.title('PCA Visualization')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.legend()

# t-SNE result
plt.subplot(122)
for label, color in zip(unique_labels, colors):
    mask = labels == label
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], c=[color], label=label, alpha=0.6)
plt.title('t-SNE Visualization')
plt.xlabel('First t-SNE Component')
plt.ylabel('Second t-SNE Component')
plt.legend()

plt.tight_layout()
plt.show()

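Two scatter plots appeared on the screen.

"What matters here is how differently the two methods lay out the data. PCA does separate the groups, but on noisier real-world data the boundaries tend to blur and overlap."

Nakamura pointed at the PCA plot.

"That comes from PCA being a linear transformation. All PCA does is find the directions in which the variance of the data is largest and project onto them."

He then turned to the t-SNE plot.

"Now look at the t-SNE result. Each group is pulled into a tight, clearly separated cluster. That's because t-SNE tries to preserve local distance relationships: points that are neighbors in the original space stay neighbors in the embedding."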
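
How well an embedding keeps local neighborhoods can be put into a number with `sklearn.manifold.trustworthiness`. The sketch below applies it to a small synthetic data set; the group means and sizes are assumptions modeled loosely on the script above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)

# Hypothetical stand-in: three groups of 100 patients with 20 features each
X_demo = np.vstack([
    rng.normal(loc, 1.0, size=(100, 20)) for loc in (0.0, 3.0, -2.0)
])

demo_embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X_demo),
    "t-SNE": TSNE(n_components=2, random_state=42).fit_transform(X_demo),
}

# Scores lie in [0, 1]; higher means that neighbors in the embedding
# are more often true neighbors in the original space
scores = {
    name: trustworthiness(X_demo, emb, n_neighbors=5)
    for name, emb in demo_embeddings.items()
}

for name, score in scores.items():
    print(f"{name} trustworthiness: {score:.3f}")
```

Because the score depends only on neighbor ranks, it can be compared across methods whose axes have different scales.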

He showed a further analysis as well.

# Compute the distance between clusters
from sklearn.metrics import pairwise_distances

def calculate_cluster_separation(X, labels):
    unique_labels = np.unique(labels)
    centroids = []
    for label in unique_labels:
        mask = labels == label
        centroids.append(np.mean(X[mask], axis=0))
    
    distances = pairwise_distances(centroids)
    return np.mean(distances[distances > 0])

pca_separation = calculate_cluster_separation(X_pca, labels)
tsne_separation = calculate_cluster_separation(X_tsne, labels)

print(f"Average cluster separation in PCA: {pca_separation:.2f}")
print(f"Average cluster separation in t-SNE: {tsne_separation:.2f}")

Average cluster separation in PCA: 4.82
Average cluster separation in t-SNE: 78.54

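"Going by these numbers, the clusters sit farther apart in the t-SNE result. One caveat: t-SNE coordinates have an arbitrary scale, so a raw distance can't be compared directly with PCA's, and it's safer to confirm with a scale-free measure such as the silhouette score. Still, clear separation between groups matters a great deal when you think about using this in actual clinical practice."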
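
The sketch below illustrates why a raw centroid distance is hard to compare across embeddings. It takes a hypothetical 2-D embedding, stretches it by a factor of 10, and compares the mean centroid distance with the silhouette score before and after. The data and the helper `mean_centroid_distance` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

rng = np.random.default_rng(0)

# Hypothetical 2-D embedding with three groups of 100 points
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
embedding = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centers])
groups = np.repeat([0, 1, 2], 100)

def mean_centroid_distance(points, labels):
    """Average distance between the centroids of every pair of groups."""
    centroids = np.array([points[labels == g].mean(axis=0) for g in np.unique(labels)])
    dist = pairwise_distances(centroids)
    return dist[np.triu_indices_from(dist, k=1)].mean()

# Same layout, axes multiplied by 10
stretched = embedding * 10.0

dist_original = mean_centroid_distance(embedding, groups)
dist_stretched = mean_centroid_distance(stretched, groups)
sil_original = silhouette_score(embedding, groups)
sil_stretched = silhouette_score(stretched, groups)

print(f"Centroid distance: {dist_original:.2f} -> {dist_stretched:.2f}")
print(f"Silhouette score:  {sil_original:.3f} -> {sil_stretched:.3f}")
```

The centroid distance grows tenfold although the clusters are no better separated, while the silhouette score is unchanged by the rescaling.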

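The department head asked with interest.

"In what kinds of situations is that useful, specifically?"

"For example, when a new patient's data comes in and we have to judge which group that patient belongs to. With PCA we can project the new patient straight onto the existing map. t-SNE has no way to place a single new point, so we rerun it with the new patient included, and in return the groups come out more distinctly."

Nakamura displayed a figure with the new data point added.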

# Add a new data point
new_patient = np.random.normal(3, 1, (1, n_features))
new_patient_scaled = scaler.transform(new_patient)

# PCA can project the new point with the already fitted model
X_pca_new = pca.transform(new_patient_scaled)

# t-SNE has no transform(), so re-embed the old and new points together.
# Coordinates from different t-SNE runs are not comparable, so the existing
# points are also taken from this new run.
X_with_new = np.vstack([X_scaled, new_patient_scaled])
X_tsne_all = TSNE(n_components=2, random_state=42).fit_transform(X_with_new)
X_tsne_old, X_tsne_new = X_tsne_all[:-1], X_tsne_all[-1:]

# Update the visualization
plt.figure(figsize=(15, 6))

# PCA with new point
plt.subplot(121)
for label, color in zip(unique_labels, colors):
    mask = labels == label
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c=[color], label=label, alpha=0.6)
plt.scatter(X_pca_new[:, 0], X_pca_new[:, 1], c='red', marker='*', s=200, label='New Patient')
plt.title('PCA - New Patient Classification')
plt.legend()

# t-SNE with new point
plt.subplot(122)
for label, color in zip(unique_labels, colors):
    mask = labels == label
    plt.scatter(X_tsne_old[mask, 0], X_tsne_old[mask, 1], c=[color], label=label, alpha=0.6)
plt.scatter(X_tsne_new[:, 0], X_tsne_new[:, 1], c='red', marker='*', s=200, label='New Patient')
plt.title('t-SNE - New Patient Classification')
plt.legend()

plt.tight_layout()
plt.show()

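"In this way, the t-SNE map shows at a glance which group the new patient falls into. Because it has to be recomputed each time a patient is added, it suits periodic review better than instant lookup, but a clearer picture of the groups feeds directly into a more accurate diagnostic support system."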
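
When a group has to be assigned to a new patient on the spot, one possible approach (a swapped-in technique, not the method shown in the story) is a nearest-neighbor classifier on top of the PCA projection, since a fitted PCA can transform unseen rows. The sketch below is a minimal version on synthetic data whose group means follow the script above; the pipeline layout is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_per_group, n_features = 333, 100

# Hypothetical training data: three groups with different mean values
group_means = {"Group A": 0.0, "Group B": 3.0, "Group C": -2.0}
X_train = np.vstack([
    rng.normal(mean, 1.0, size=(n_per_group, n_features))
    for mean in group_means.values()
])
y_train = np.repeat(list(group_means.keys()), n_per_group)

# Scale, project to two components, then vote among the 5 nearest patients
classifier = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)
classifier.fit(X_train, y_train)

# A new patient drawn from the same distribution as Group B
new_patient = rng.normal(3.0, 1.0, size=(1, n_features))
predicted_group = classifier.predict(new_patient)[0]
print(f"Predicted group: {predicted_group}")
```

Wrapping the steps in one pipeline ensures the new patient is scaled and projected with the parameters learned from the training data.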

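A fresh sense of anticipation about what data analysis could do filled the meeting room.

The department head stared intently at the figure.

"That said, t-SNE has its own pitfalls. Its parameter settings strongly affect the result."

Nakamura began explaining the specific parameters.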

# Experiment with different t-SNE parameters
tsne_params = [
    {'perplexity': 5, 'learning_rate': 200},
    {'perplexity': 30, 'learning_rate': 200},
    {'perplexity': 50, 'learning_rate': 200}
]

# scatter() needs numeric values for c=, so encode the string labels as integers
label_codes = np.unique(labels, return_inverse=True)[1]

fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for ax, params in zip(axes, tsne_params):
    tsne = TSNE(n_components=2, random_state=42, **params)
    X_tsne = tsne.fit_transform(X_scaled)
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=label_codes, cmap='viridis')
    ax.set_title(f"Perplexity: {params['perplexity']}")
plt.show()

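"Even a single value like perplexity can change how the result looks quite a lot. Roughly speaking, it sets how many nearby points each point treats as its neighbors, and so how local or global a view the embedding takes."

The department head nodded deeply.

"I see. So it comes down to a trade-off between real-time responsiveness and accuracy."

"Yes. For the current system we're planning a two-stage design: PCA as the baseline, with a detailed t-SNE analysis run in the nightly batch. We believe this lets us combine quick decisions with precise analysis."

Expectation for the new possibilities of data analysis filled the meeting room.

At dusk, Nakamura gave Suzuki one last piece of advice.

"A machine learning engineer's job is to find the truth inside the data. Dimensionality reduction is just one of the tools. What matters is the judgment to choose the right method for the purpose."

Nakamura sat back down and opened a Jupyter notebook.

"A concrete example will make it clearer. Let's look at these three data sets."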

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE, MDS, Isomap
import matplotlib.pyplot as plt

# Generate three data sets with different characteristics
np.random.seed(42)

# 1. Data with linear relationships
n_samples = 1000
t = np.random.normal(0, 1, n_samples)
linear_data = np.column_stack([
    t,
    2*t + np.random.normal(0, 0.1, n_samples),
    -0.5*t + np.random.normal(0, 0.1, n_samples),
    3*t + np.random.normal(0, 0.1, n_samples)
])

# 2. Data with nonlinear relationships (spiral)
theta = np.linspace(0, 4*np.pi, n_samples)
r = theta + np.random.normal(0, 0.1, n_samples)
nonlinear_data = np.column_stack([
    r * np.cos(theta),
    r * np.sin(theta),
    r + np.random.normal(0, 0.1, n_samples),
    theta + np.random.normal(0, 0.1, n_samples)
])

# 3. Data with cluster structure
n_per_cluster = n_samples // 3
cluster_data = np.vstack([
    np.random.normal(0, 1, (n_per_cluster, 4)),
    np.random.normal(5, 1, (n_per_cluster, 4)),
    np.random.normal(-5, 1, (n_per_cluster, 4))
])

# Standardize the data
scaler = StandardScaler()
datasets = {
    'Linear': scaler.fit_transform(linear_data),
    'Nonlinear': scaler.fit_transform(nonlinear_data),
    'Clustered': scaler.fit_transform(cluster_data)
}

# Apply each dimensionality reduction method
methods = {
    'PCA': PCA(n_components=2),
    't-SNE': TSNE(n_components=2, random_state=42),
    'Kernel PCA': KernelPCA(n_components=2, kernel='rbf'),
    'MDS': MDS(n_components=2, random_state=42),
    'Isomap': Isomap(n_components=2)
}

# Visualize the results
plt.figure(figsize=(20, 15))
for i, (data_name, data) in enumerate(datasets.items()):
    for j, (method_name, method) in enumerate(methods.items()):
        plt.subplot(3, 5, i*5 + j + 1)
        transformed = method.fit_transform(data)
        plt.scatter(transformed[:, 0], transformed[:, 1], alpha=0.5)
        plt.title(f'{data_name} - {method_name}')
        plt.axis('equal')
plt.tight_layout()
plt.show()

"Look. How well each method works depends heavily on the characteristics of the data. For the linear data PCA is the best fit, and it loses very little information."
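
A simple way to quantify how much information PCA loses is to project the data down, map it back with `inverse_transform`, and measure the reconstruction error. The sketch below does this for a linear data set built like the one above, using a single component; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples = 1000

# Four columns that are all noisy linear functions of the same variable t
t = rng.normal(0, 1, n_samples)
linear_data = np.column_stack([
    t,
    2 * t + rng.normal(0, 0.1, n_samples),
    -0.5 * t + rng.normal(0, 0.1, n_samples),
    3 * t + rng.normal(0, 0.1, n_samples),
])
X_lin = StandardScaler().fit_transform(linear_data)

# Compress to one dimension, then map back to the original four
pca_1d = PCA(n_components=1)
X_lin_reduced = pca_1d.fit_transform(X_lin)
X_lin_restored = pca_1d.inverse_transform(X_lin_reduced)

explained = pca_1d.explained_variance_ratio_[0]
relative_error = np.linalg.norm(X_lin - X_lin_restored) / np.linalg.norm(X_lin)

print(f"Variance explained by 1 component: {explained:.3f}")
print(f"Relative reconstruction error:     {relative_error:.3f}")
```

For centered data the squared relative error equals one minus the explained variance ratio, so the two printed numbers describe the same loss from two angles.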

Nakamura went on to evaluate each method's performance quantitatively.

from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

def evaluate_reduction(X, X_reduced, y=None, n_neighbors=5):
    # Fraction of each test point's nearest neighbors that survive the reduction
    train_idx, test_idx = train_test_split(
        np.arange(len(X)), test_size=0.2, random_state=42
    )

    # Neighbor indices refer to rows of the training subset in both spaces,
    # so the two index sets can be compared directly
    nn_original = NearestNeighbors(n_neighbors=n_neighbors).fit(X[train_idx])
    original_neighbors = nn_original.kneighbors(X[test_idx], return_distance=False)

    nn_reduced = NearestNeighbors(n_neighbors=n_neighbors).fit(X_reduced[train_idx])
    reduced_neighbors = nn_reduced.kneighbors(X_reduced[test_idx], return_distance=False)

    neighbor_preservation = np.mean([
        len(set(o) & set(r)) / len(o)
        for o, r in zip(original_neighbors, reduced_neighbors)
    ])

    results = {
        'Neighbor Preservation': neighbor_preservation
    }

    if y is not None:
        results['Silhouette Score'] = silhouette_score(X_reduced, y)

    return results

# Evaluate on the clustered data
labels = np.repeat([0, 1, 2], n_samples//3)
for method_name, method in methods.items():
    results = evaluate_reduction(
        datasets['Clustered'],
        method.fit_transform(datasets['Clustered']),
        labels
    )
    print(f"\n{method_name} results:")
    for metric, value in results.items():
        print(f"{metric}: {value:.3f}")

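"Look at these evaluation results. For data with cluster structure, t-SNE shows the highest performance. As for computation time, on the other hand..."

Nakamura displayed a comparison of computation times as well.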

import time

for method_name, method in methods.items():
    start_time = time.time()
    method.fit_transform(datasets['Clustered'])
    elapsed_time = time.time() - start_time
    print(f"{method_name} computation time: {elapsed_time:.3f} seconds")

PCA computation time: 0.003 seconds
t-SNE computation time: 8.245 seconds
Kernel PCA computation time: 0.156 seconds
MDS computation time: 3.678 seconds
Isomap computation time: 0.892 seconds

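"So choosing a method involves all sorts of trade-offs: the characteristics of the data, the computational cost, the accuracy required, the need for real-time response. You have to weigh all of these together to pick the best method."

Suzuki nodded with a serious expression.

"So what matters more than the technology itself is the judgment about when and how to use it."

"Exactly. That judgment is where a machine learning engineer's real value lies. Technology is only a tool. How well we wield that tool is the essence of our work."

Note: This story is a work of fiction.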
