[sklearn] The Holdout Method: How to Use train_test_split

Overview of the holdout method
Implementation code using train_test_split
List of arguments, explained concretely with code


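What Is the Holdout Method?

The Holdout Method

The holdout method is a data-splitting technique used to evaluate a machine learning model and estimate its performance.

The dataset is split only once, and the model is evaluated on a single test set.

The holdout method is relatively easy to implement and is efficient for large datasets. However, because the split happens only once, the samples that land in the training set or the test set can be unrepresentative, so the score depends on that particular split.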
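To make the dependence on a single split concrete, here is a minimal sketch that repeats the holdout evaluation with several seeds. The iris dataset and LogisticRegression are assumptions chosen only for illustration; any dataset and estimator would do.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data and model (assumptions, not prescribed by the holdout method)
X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # One holdout split per seed; only the split changes between runs
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print("Holdout accuracy per seed:", [round(s, 3) for s in scores])
print("Spread (max - min):", round(max(scores) - min(scores), 3))
```

If the spread is noticeable, a single holdout score should be read as one sample of the model's performance rather than a definitive number.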

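Differences from Cross-Validation

Cross-validation is an alternative approach.

It splits the dataset into several subsets (folds) and evaluates the model once per fold, using that fold as the test set and the remaining folds for training.
Because every sample is used for both training and testing, the data is used more effectively.

Which of the two to choose depends on the size of the dataset and the goal.

The holdout method suits large datasets and cases where training time needs to be kept down.

Cross-validation, on the other hand, is the common choice for small datasets or when you want to check how stable the model's performance is.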
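For comparison, a cross-validated evaluation might look like the sketch below. It uses scikit-learn's cross_val_score; the iris data, the LogisticRegression model, and the choice of 5 folds are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative data and model (assumptions for this sketch)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 trains and evaluates the model five times, once per fold
cv_scores = cross_val_score(model, X, y, cv=5)

print("Score per fold:", cv_scores)
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())
```

The model is fitted once per fold, which is where the extra cost relative to a single holdout split comes from.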

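Implementing the Holdout Method | train_test_split

The holdout method is easy to implement with sklearn's train_test_split.

As shown below, the original data X and y is split into X_train, X_test, y_train, and y_test, which are then used for training and evaluation.

Arguments such as test_size control the split ratio and the randomness of the split.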

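from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example data and model so the snippet runs as-is (any estimator with fit/score works)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split the data into a training set and a test set (holdout method)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train the model on the training set
model.fit(X_train, y_train)

# Evaluate the model on the test set
score = model.score(X_test, y_test)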

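Arguments of train_test_split

test_size

test_size sets the proportion of the data assigned to the test set; the rest goes to the training set.

For example, with test_size=0.2, 20% of the data is assigned to the test set and 80% to the training set.

The following example compares the splits produced by different test sizes.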

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data  # feature data
y = iris.target  # class labels

# Compare the splits produced by different test sizes
test_sizes = [0.2, 0.3, 0.4]  # list of test sizes

for test_size in test_sizes:
    # Split the data into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42, stratify=y)

    # Print the sizes of the training and test data
    train_data = len(X_train)
    test_data = len(X_test)
    print("Test size:", test_size)
    print("Test Data:", test_data)
    print("Train Data:", train_data)
    print("--------------------")

Test size: 0.2
Test Data: 30
Train Data: 120
--------------------
Test size: 0.3
Test Data: 45
Train Data: 105
--------------------
Test size: 0.4
Test Data: 60
Train Data: 90
--------------------

The 150 samples are split according to the specified proportion.

Values of 0.2 or 0.3 are commonly used.
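test_size does not have to be a fraction. In scikit-learn an integer is interpreted as an absolute number of test samples, and when neither test_size nor train_size is given the test fraction defaults to 0.25. The sketch below illustrates both cases on the iris data; the value 30 is an arbitrary choice for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# An integer test_size is read as an absolute number of test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=42, stratify=y
)
print("Integer test_size:", len(X_train), len(X_test))

# With neither test_size nor train_size given, the default test fraction (0.25) applies
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X, y, random_state=42)
print("Default test_size:", len(X_train_d), len(X_test_d))
```

An integer size is handy when the test set must have a fixed number of samples regardless of how large the dataset grows.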

random_state

Passing a fixed integer to random_state makes every run use the same random seed.

This makes runs reproducible and results easier to compare.

The following example compares splits with and without random_state.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data  # feature data
y = iris.target  # class labels

# Split the data without a random seed
X_train_no_random, X_test_no_random, y_train_no_random, y_test_no_random = train_test_split(X, y, test_size=0.2, stratify=y)
# Split the data with a random seed
X_train_with_random, X_test_with_random, y_train_with_random, y_test_with_random = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Without random_state, run 1:\n", X_train_no_random[:3])
print("With random_state, run 1:\n", X_train_with_random[:3])
print("--------------------")

# Run the same steps a second time

# Split the data without a random seed
X_train_no_random, X_test_no_random, y_train_no_random, y_test_no_random = train_test_split(X, y, test_size=0.2, stratify=y)
# Split the data with a random seed
X_train_with_random, X_test_with_random, y_train_with_random, y_test_with_random = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Without random_state, run 2:\n", X_train_no_random[:3])
print("With random_state, run 2:\n", X_train_with_random[:3])

Without random_state, run 1:
 [[6.4 2.7 5.3 1.9]
 [5.8 2.7 4.1 1. ]
 [7.7 3.  6.1 2.3]]
With random_state, run 1:
 [[4.4 2.9 1.4 0.2]
 [4.9 2.5 4.5 1.7]
 [6.8 2.8 4.8 1.4]]
--------------------
Without random_state, run 2:
 [[6.7 3.  5.  1.7]
 [7.2 3.  5.8 1.6]
 [6.1 2.8 4.  1.3]]
With random_state, run 2:
 [[4.4 2.9 1.4 0.2]
 [4.9 2.5 4.5 1.7]
 [6.8 2.8 4.8 1.4]]

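Without random_state, the first and second runs produce different results, whereas with random_state the results are identical.
In other words, setting random_state lets you split and evaluate under the same conditions every time.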
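Instead of comparing printed rows by eye, reproducibility can also be checked programmatically. The sketch below splits row indices rather than features so that the selected rows can be compared directly; split_test_indices is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def split_test_indices(random_state):
    """Hypothetical helper: return the row indices chosen for the test set."""
    indices = np.arange(len(X))
    _, test_idx = train_test_split(
        indices, test_size=0.2, random_state=random_state, stratify=y
    )
    return test_idx


# Same seed twice, then two different seeds
same_seed = np.array_equal(split_test_indices(42), split_test_indices(42))
different_seed = np.array_equal(split_test_indices(42), split_test_indices(0))

print("Same seed, identical test rows:", same_seed)
print("Different seeds, identical test rows:", different_seed)
```

Splitting indices this way also makes it easy to record exactly which rows went into the test set.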

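stratify

Passing the class labels to stratify performs a stratified split, so that the training and test sets keep the class proportions of the original dataset. The following example compares splits with and without stratify.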

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load the iris dataset
iris = load_iris()
X = iris.data  # feature data
y = iris.target  # class labels

# Split the data without stratify
X_train_no_stratify, X_test_no_stratify, y_train_no_stratify, y_test_no_stratify = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the data with stratify
X_train_stratify, X_test_stratify, y_train_stratify, y_test_stratify = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Compute the proportion of each class in every subset
train_counts_no_stratify = np.bincount(y_train_no_stratify)
train_proportions_no_stratify = train_counts_no_stratify / len(y_train_no_stratify)

test_counts_no_stratify = np.bincount(y_test_no_stratify)
test_proportions_no_stratify = test_counts_no_stratify / len(y_test_no_stratify)

train_counts_stratify = np.bincount(y_train_stratify)
train_proportions_stratify = train_counts_stratify / len(y_train_stratify)

test_counts_stratify = np.bincount(y_test_stratify)
test_proportions_stratify = test_counts_stratify / len(y_test_stratify)

print("Train set without stratify:\n", train_proportions_no_stratify)
print("\nTest set without stratify:\n", test_proportions_no_stratify)
print("\nTrain set with stratify:\n", train_proportions_stratify)
print("\nTest set with stratify:\n", test_proportions_stratify)

Train set without stratify:
 [0.33333333 0.34166667 0.325     ]

Test set without stratify:
 [0.33333333 0.3        0.36666667]

Train set with stratify:
 [0.33333333 0.33333333 0.33333333]

Test set with stratify:
 [0.33333333 0.33333333 0.33333333]

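Without stratify, the class proportions in the training and test sets are left to chance and drift from those of the original dataset.

With stratify, the class proportions in both the training and test sets match those of the original dataset.

In other words, stratified sampling ensures that each class is represented in both the training and test sets in the same proportion as in the full dataset.

This makes training and evaluation fairer and more reliable, and it reduces the risk that a class is under-represented in one of the sets, which matters most for imbalanced data.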
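The iris classes are perfectly balanced, so the effect is small there. The sketch below uses made-up labels with a 90/10 class imbalance (an assumption for illustration, not data from this article) to show how stratify keeps the minority share identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, made-up labels: 90 samples of class 0 and 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Plain random split for contrast: the minority share is left to chance
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("Minority share, full data:        ", y.mean())
print("Minority share, stratified train: ", y_train.mean())
print("Minority share, stratified test:  ", y_test.mean())
print("Minority share, plain random test:", y_test_plain.mean())
```

With only 10 minority samples, a plain random split can easily leave the test set with very few of them, which makes per-class metrics unreliable.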