Hyperparameter tuning with Ray Tune
Original: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html
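Hyperparameter tuning can make the difference between an average model and a highly accurate one. Often, something as simple as choosing a different learning rate or changing the size of a network layer has a dramatic impact on model performance.
Fortunately, there are tools that help you find the best combination of parameters. Ray Tune is an industry-standard tool for distributed hyperparameter tuning. It includes the latest hyperparameter search algorithms, integrates with TensorBoard and other analysis libraries, and natively supports distributed training through Ray's distributed machine learning engine.
In this tutorial, we show you how to integrate Ray Tune into a PyTorch training workflow. We extend the tutorial from the PyTorch documentation that trains a CIFAR10 image classifier.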
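As you will see, only a few small modifications are needed. In particular, we need to
- wrap data loading and training in functions,
- make some network parameters configurable,
- add checkpointing (optional),
- and define the search space for model tuning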
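To run this tutorial, please make sure the following packages are installed:
- ray[tune]: distributed hyperparameter tuning library
- torchvision: for the data transformers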
Setup / Imports
Let's start with the imports:
from functools import partial
import numpy as np
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
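Most of the imports are needed to build the PyTorch model. Only the last three imports are for Ray Tune.
Data loaders
We wrap the data loaders in their own function and pass a global data directory. This way, different trials can share the same data directory.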
def load_data(data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform)
    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform)
    return trainset, testset
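As a small aside on the transform above: ToTensor yields channel values in [0, 1], and Normalize with mean 0.5 and standard deviation 0.5 rescales them to [-1, 1]. The plain-Python sketch below reproduces only that arithmetic. The helper name normalize is hypothetical and merely stands in for what transforms.Normalize computes per channel.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel arithmetic applied by Normalize: (value - mean) / std."""
    return (value - mean) / std

# ToTensor produces values in [0, 1]; check where the endpoints and midpoint land.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```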
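Configurable neural network
We can only tune parameters that are configurable. In this example, we can specify the layer sizes of the fully connected layers: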
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
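The input size 16 * 5 * 5 of fc1 is fixed by the convolution and pooling layers, which is why only l1 and l2 are left as tunable sizes. The sketch below retraces that arithmetic for 32x32 CIFAR10 images using the standard output-size formula for unpadded convolutions and pooling. The helper name conv_output_size is hypothetical and not part of PyTorch.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 32                                      # CIFAR10 images are 32x32
size = conv_output_size(size, kernel_size=5)   # conv1: 32 -> 28
size = conv_output_size(size, 2, stride=2)     # pool:  28 -> 14
size = conv_output_size(size, kernel_size=5)   # conv2: 14 -> 10
size = conv_output_size(size, 2, stride=2)     # pool:  10 -> 5

channels = 16                                  # output channels of conv2
flattened = channels * size * size             # features fed into fc1
print(size, flattened)
```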
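The train function
Now it gets interesting, because we introduce some changes to the example from the PyTorch documentation.
We wrap the training script in a function train_cifar(config, checkpoint_dir=None, data_dir=None). As you can guess, the config parameter receives the hyperparameters we would like to train with. The checkpoint_dir parameter is used to restore checkpoints. The data_dir parameter specifies the directory where we load and store the data, so that multiple runs can share the same data source.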
net = Net(config["l1"], config["l2"])

if checkpoint_dir:
    model_state, optimizer_state = torch.load(
        os.path.join(checkpoint_dir, "checkpoint"))
    net.load_state_dict(model_state)
    optimizer.load_state_dict(optimizer_state)
The learning rate of the optimizer is made configurable, too:
optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
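We also split the training data into a training and a validation subset. We thus train on 80% of the data and calculate the validation loss on the remaining 20%. The batch size with which we iterate over the training and validation sets is configurable as well.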
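To make the split and the effect of the batch size concrete, the sketch below works through the numbers for the 50,000-image CIFAR10 training set. It is plain arithmetic rather than PyTorch code: the helper steps_per_epoch is hypothetical, and it assumes the DataLoader default of keeping the last partial batch.

```python
import math

TRAIN_IMAGES = 50_000  # size of the CIFAR10 training set

train_size = int(TRAIN_IMAGES * 0.8)   # same expression as test_abs in the tutorial
val_size = TRAIN_IMAGES - train_size

def steps_per_epoch(num_samples, batch_size):
    """Mini-batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

print("train:", train_size, "validation:", val_size)
for batch_size in (2, 4, 8, 16):
    print(batch_size,
          steps_per_epoch(train_size, batch_size),
          steps_per_epoch(val_size, batch_size))
```

Smaller batch sizes therefore mean many more optimizer steps per epoch, which is one reason trials with different batch sizes take different amounts of time.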
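Adding (multi) GPU support with DataParallel
Image classification benefits greatly from GPUs. Luckily, we can continue to use PyTorch's abstractions in Ray Tune. Thus, we can wrap our model in nn.DataParallel to support data-parallel training on multiple GPUs: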
device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)
net.to(device)
By using a device variable, we make sure that training also works when no GPU is available. PyTorch requires us to send our data to GPU memory explicitly, like this:
for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)
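The code now supports training on CPUs, on a single GPU, and on multiple GPUs. Notably, Ray also supports fractional GPUs, so we can share GPUs among trials as long as the model still fits in GPU memory. We will come back to that later.
Communicating with Ray Tune
The most interesting part is the communication with Ray Tune: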
with tune.checkpoint_dir(epoch) as checkpoint_dir:
    path = os.path.join(checkpoint_dir, "checkpoint")
    torch.save((net.state_dict(), optimizer.state_dict()), path)

tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
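Here we first save a checkpoint and then report some metrics back to Ray Tune. Specifically, we send the validation loss and accuracy back to Ray Tune. Ray Tune can then use these metrics to decide which hyperparameter configuration leads to the best results. These metrics can also be used to stop badly performing trials early so that no resources are wasted on them.
Saving checkpoints is optional. It is, however, necessary if we want to use advanced schedulers such as Population Based Training. Saving checkpoints also allows us to load the trained models later and validate them on a test set.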
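The save-then-restore pattern can be tried without a GPU or a Ray cluster. The sketch below is an illustration under stated assumptions only: it uses pickle as a stand-in for torch.save and torch.load, plain dictionaries as stand-ins for the state_dict objects, and a temporary directory in place of the directory that Ray Tune provides. The helper names save_checkpoint and load_checkpoint are hypothetical.

```python
import os
import pickle
import tempfile

def save_checkpoint(checkpoint_dir, model_state, optimizer_state):
    """Write both states as one tuple to a file named "checkpoint"."""
    path = os.path.join(checkpoint_dir, "checkpoint")
    with open(path, "wb") as f:
        pickle.dump((model_state, optimizer_state), f)
    return path

def load_checkpoint(checkpoint_dir):
    """Read the (model_state, optimizer_state) tuple back."""
    with open(os.path.join(checkpoint_dir, "checkpoint"), "rb") as f:
        return pickle.load(f)

# Stand-in states; in the tutorial these come from state_dict().
model_state = {"fc3.bias": [0.0] * 10}
optimizer_state = {"lr": 0.01, "momentum": 0.9}

with tempfile.TemporaryDirectory() as checkpoint_dir:
    save_checkpoint(checkpoint_dir, model_state, optimizer_state)
    restored_model, restored_optimizer = load_checkpoint(checkpoint_dir)

print(restored_model == model_state, restored_optimizer == optimizer_state)
```

Storing the model and optimizer state together as one tuple keeps the two in sync, so a restored trial resumes with the optimizer state that belongs to the restored weights.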
Full training function
The full code example looks like this:
def train_cifar(config, checkpoint_dir=None, data_dir=None):
    net = Net(config["l1"], config["l2"])

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

    if checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    trainset, testset = load_data(data_dir)

    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    for epoch in range(10):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

        with tune.checkpoint_dir(epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((net.state_dict(), optimizer.state_dict()), path)

        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")
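As you can see, most of the code is adapted directly from the original example.
Test set accuracy
Commonly, the performance of a machine learning model is tested on a hold-out test set, that is, on data that has not been used for training the model. We wrap this in a function as well: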
def test_accuracy(net, device="cpu"):
    trainset, testset = load_data()

    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return correct / total
The function also expects a device parameter, so we can run the test set validation on a GPU.
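The accuracy computation itself is simple: for each sample, the class with the highest score counts as the prediction, and accuracy is the fraction of predictions that match the labels. The plain-Python sketch below mirrors that logic without tensors. The helper names argmax and accuracy and the scores are made up for illustration.

```python
def argmax(row):
    """Index of the largest score in a row, i.e. the predicted class."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(outputs, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    correct = sum(1 for row, label in zip(outputs, labels) if argmax(row) == label)
    return correct / len(labels)

# Made-up scores for 4 samples and 3 classes, for illustration only.
outputs = [
    [0.1, 2.0, -1.0],   # predicts class 1
    [1.5, 0.2, 0.3],    # predicts class 0
    [0.0, 0.1, 0.9],    # predicts class 2
    [0.7, 0.6, 0.5],    # predicts class 0
]
labels = [1, 0, 1, 0]

print(accuracy(outputs, labels))
```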
Configuring the search space
Lastly, we need to define Ray Tune's search space. Here is an example:
config = {
    "l1": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
    "l2": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([2, 4, 8, 16])
}
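The tune.sample_from() function lets you define your own sampling methods for obtaining hyperparameters. In this example, the l1 and l2 parameters should be powers of 2 between 4 and 256, so either 4, 8, 16, 32, 64, 128, or 256. The lr (learning rate) is sampled log-uniformly between 0.0001 and 0.1, meaning that its logarithm is uniformly distributed over that range. Lastly, the batch size is a choice between 2, 4, 8, and 16.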
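To see what draws from such a search space look like, the sketch below reimplements the three sampling rules with the standard library only. It is not how Ray Tune samples internally; the function names are hypothetical, and the fixed seed is there only to make repeated runs reproducible.

```python
import math
import random

def sample_power_of_two(rng, low_exp=2, high_exp=9):
    """Mirror 2 ** np.random.randint(2, 9): the upper exponent is exclusive."""
    return 2 ** rng.randrange(low_exp, high_exp)

def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config(rng):
    """Draw one hyperparameter combination from the tutorial's search space."""
    return {
        "l1": sample_power_of_two(rng),
        "l2": sample_power_of_two(rng),
        "lr": sample_loguniform(rng, 1e-4, 1e-1),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
samples = [sample_config(rng) for _ in range(5)]
for cfg in samples:
    print(cfg)
```

Sampling the learning rate on a logarithmic scale gives each order of magnitude (1e-4 to 1e-3, 1e-3 to 1e-2, 1e-2 to 1e-1) roughly the same share of draws, whereas plain uniform sampling would place about 90% of them above 0.01.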
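In each trial, Ray Tune now randomly samples a combination of parameters from these search spaces. It then trains a number of models in parallel and finds the best performing one among them. We also use the ASHAScheduler, which terminates badly performing trials early.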
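To build intuition for this kind of early stopping, here is a simplified, synchronous successive-halving sketch. It is not Ray Tune's ASHAScheduler implementation, which makes its decisions asynchronously; the function names and the loss values are hypothetical. The parameters grace_period=1, reduction_factor=2 and max_t=10 are the ones passed to ASHAScheduler in the main function later in this tutorial.

```python
def rung_milestones(grace_period, reduction_factor, max_t):
    """Iterations at which trials are compared: grace_period * reduction_factor**k, below max_t."""
    milestones = []
    t = grace_period
    while t < max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones

def survivors(losses, reduction_factor):
    """Keep the best 1/reduction_factor of the trials (lower loss is better), at least one."""
    keep = max(1, len(losses) // reduction_factor)
    ranked = sorted(losses, key=losses.get)
    return ranked[:keep]

print(rung_milestones(grace_period=1, reduction_factor=2, max_t=10))

# Made-up validation losses after the first epoch, for illustration only.
first_rung = {"trial_a": 2.30, "trial_b": 1.95, "trial_c": 2.10, "trial_d": 2.31}
print(survivors(first_rung, reduction_factor=2))
```

With these settings the comparison points fall at iterations 1, 2, 4 and 8, which corresponds to the "Iter 1.000" through "Iter 8.000" brackets in the example output further down.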
We wrap the train_cifar function with functools.partial to set the constant data_dir parameter. We can also tell Ray Tune what resources should be available for each trial:
gpus_per_trial = 2
# ...
result = tune.run(
    partial(train_cifar, data_dir=data_dir),
    resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
    config=config,
    num_samples=num_samples,
    scheduler=scheduler,
    progress_reporter=reporter,
    checkpoint_at_end=True)
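The functools.partial wrapping used in the call above can be tried in isolation. In the sketch below, train_stub is a hypothetical stand-in for train_cifar that only echoes its arguments, and the path "/tmp/data" is an arbitrary example value.

```python
from functools import partial

def train_stub(config, checkpoint_dir=None, data_dir=None):
    """Hypothetical stand-in for train_cifar that only echoes its arguments."""
    return {"lr": config["lr"], "checkpoint_dir": checkpoint_dir, "data_dir": data_dir}

# Bind data_dir once; the caller then only has to supply config.
trainable = partial(train_stub, data_dir="/tmp/data")

result = trainable({"lr": 0.01})
print(result)
```

Because data_dir is already bound, whoever calls the function only needs to pass the sampled config (and, when restoring, a checkpoint_dir).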
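You can specify the number of CPUs, which are then available, for example, to increase the num_workers of the PyTorch DataLoader instances. The selected number of GPUs is made visible to PyTorch in each trial. Trials do not have access to GPUs that were not requested for them, so you do not have to worry about two trials using the same set of resources.
Here we can also specify fractional GPUs, so something like gpus_per_trial=0.5 is completely valid. The trials then share GPUs with each other. You just have to make sure that the models still fit in GPU memory.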
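As a rough illustration of how per-trial resource requests bound parallelism, the sketch below computes an upper limit on simultaneously running trials. It is plain arithmetic under the assumption that only CPU and GPU counts matter; the actual scheduler may take further factors into account. The function name max_concurrent_trials is hypothetical, and the machine size (32 CPUs, 2 GPUs) is taken from the example output further down.

```python
import math

def max_concurrent_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """Upper bound on simultaneous trials given per-trial resource requests."""
    limits = []
    if cpus_per_trial > 0:
        limits.append(math.floor(total_cpus / cpus_per_trial))
    if gpus_per_trial > 0:
        limits.append(math.floor(total_gpus / gpus_per_trial))
    # With no resource request at all, this calculation gives no bound.
    return min(limits) if limits else None

# 32 CPUs and 2 GPUs, 2 CPUs per trial, varying GPU requests.
for gpus in (0, 0.5, 1, 2):
    print(gpus, max_concurrent_trials(32, 2, cpus_per_trial=2, gpus_per_trial=gpus))
```

With gpus_per_trial=0.5, four trials can share the two GPUs, while gpus_per_trial=2 allows only one trial at a time on this machine.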
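After training the models, we find the best performing one and load the trained network from the checkpoint file. We then obtain the test set accuracy and report everything by printing.
The full main function looks like this: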
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    data_dir = os.path.abspath("./data")
    load_data(data_dir)
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16])
    }
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    reporter = CLIReporter(
        # parameter_columns=["l1", "l2", "lr", "batch_size"],
        metric_columns=["loss", "accuracy", "training_iteration"])
    result = tune.run(
        partial(train_cifar, data_dir=data_dir),
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)

    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

    best_checkpoint_dir = best_trial.checkpoint.value
    model_state, optimizer_state = torch.load(os.path.join(
        best_checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)

    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # You can change the number of GPUs per trial here:
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
Out:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /var/lib/jenkins/workspace/beginner_source/data/cifar-10-python.tar.gz
Extracting /var/lib/jenkins/workspace/beginner_source/data/cifar-10-python.tar.gz to /var/lib/jenkins/workspace/beginner_source/data
Files already downloaded and verified
== Status ==
Memory usage on this node: 4.0/240.1 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 2/32 CPUs, 0/2 GPUs, 0.0/157.71 GiB heap, 0.0/49.37 GiB objects
Result logdir: /var/lib/jenkins/ray_results/DEFAULT_2021-01-05_20-23-08
Number of trials: 1/10 (1 RUNNING)
+---------------------+----------+-------+--------------+------+------+-------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr |
|---------------------+----------+-------+--------------+------+------+-------------|
| DEFAULT_d3304_00000 | RUNNING | | 2 | 4 | 16 | 0.000111924 |
+---------------------+----------+-------+--------------+------+------+-------------+
(pid=1588) Files already downloaded and verified
(pid=1568) Files already downloaded and verified
(pid=1504) Files already downloaded and verified
(pid=1575) Files already downloaded and verified
(pid=1494) Files already downloaded and verified
(pid=1572) Files already downloaded and verified
(pid=1567) Files already downloaded and verified
(pid=1585) Files already downloaded and verified
(pid=1565) Files already downloaded and verified
(pid=1505) Files already downloaded and verified
(pid=1588) Files already downloaded and verified
(pid=1568) Files already downloaded and verified
(pid=1504) Files already downloaded and verified
(pid=1575) Files already downloaded and verified
(pid=1494) Files already downloaded and verified
(pid=1572) Files already downloaded and verified
(pid=1567) Files already downloaded and verified
(pid=1565) Files already downloaded and verified
(pid=1585) Files already downloaded and verified
(pid=1505) Files already downloaded and verified
(pid=1585) [1, 2000] loss: 2.307
(pid=1568) [1, 2000] loss: 2.226
(pid=1565) [1, 2000] loss: 2.141
(pid=1505) [1, 2000] loss: 2.339
(pid=1504) [1, 2000] loss: 2.042
(pid=1572) [1, 2000] loss: 2.288
(pid=1567) [1, 2000] loss: 2.047
(pid=1575) [1, 2000] loss: 2.316
(pid=1494) [1, 2000] loss: 2.322
(pid=1588) [1, 2000] loss: 2.289
(pid=1585) [1, 4000] loss: 1.154
(pid=1505) [1, 4000] loss: 1.170
(pid=1565) [1, 4000] loss: 0.939
(pid=1568) [1, 4000] loss: 1.102
(pid=1504) [1, 4000] loss: 0.916
(pid=1572) [1, 4000] loss: 1.156
Result for DEFAULT_d3304_00003:
accuracy: 0.226
date: 2021-01-05_20-23-37
done: false
experiment_id: d4b00469893d498ea65a729df202882a
experiment_tag: 3_batch_size=16,l1=32,l2=4,lr=0.0012023
hostname: 1a844a452371
iterations_since_restore: 1
loss: 2.083958268547058
node_ip: 172.17.0.2
pid: 1588
should_checkpoint: true
time_since_restore: 27.169169902801514
time_this_iter_s: 27.169169902801514
time_total_s: 27.169169902801514
timestamp: 1609878217
timesteps_since_restore: 0
training_iteration: 1
trial_id: d3304_00003
== Status ==
Memory usage on this node: 9.2/240.1 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: -2.083958268547058
Resources requested: 20/32 CPUs, 0/2 GPUs, 0.0/157.71 GiB heap, 0.0/49.37 GiB objects
Result logdir: /var/lib/jenkins/ray_results/DEFAULT_2021-01-05_20-23-08
Number of trials: 10/10 (10 RUNNING)
+---------------------+----------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|---------------------+----------+-----------------+--------------+------+------+-------------+---------+------------+----------------------|
| DEFAULT_d3304_00000 | RUNNING | | 2 | 4 | 16 | 0.000111924 | | | |
| DEFAULT_d3304_00001 | RUNNING | | 8 | 16 | 32 | 0.077467 | | | |
| DEFAULT_d3304_00002 | RUNNING | | 4 | 8 | 128 | 0.00436986 | | | |
| DEFAULT_d3304_00003 | RUNNING | 172.17.0.2:1588 | 16 | 32 | 4 | 0.00120234 | 2.08396 | 0.226 | 1 |
| DEFAULT_d3304_00004 | RUNNING | | 4 | 16 | 32 | 0.016474 | | | |
| DEFAULT_d3304_00005 | RUNNING | | 4 | 128 | 64 | 0.00757252 | | | |
| DEFAULT_d3304_00006 | RUNNING | | 2 | 64 | 256 | 0.00177236 | | | |
| DEFAULT_d3304_00007 | RUNNING | | 8 | 8 | 8 | 0.000155891 | | | |
| DEFAULT_d3304_00008 | RUNNING | | 2 | 16 | 64 | 0.0310199 | | | |
| DEFAULT_d3304_00009 | RUNNING | | 4 | 4 | 32 | 0.0175239 | | | |
+---------------------+----------+-----------------+--------------+------+------+-------------+---------+------------+----------------------+
(pid=1567) [1, 4000] loss: 0.943
(pid=1494) [1, 4000] loss: 1.155
(pid=1575) [1, 4000] loss: 1.162
(pid=1585) [1, 6000] loss: 0.768
(pid=1505) [1, 6000] loss: 0.780
(pid=1565) [1, 6000] loss: 0.582
(pid=1504) [1, 6000] loss: 0.587
(pid=1568) [1, 6000] loss: 0.770
(pid=1572) [1, 6000] loss: 0.771
(pid=1567) [1, 6000] loss: 0.615
Result for DEFAULT_d3304_00007:
accuracy: 0.1011
date: 2021-01-05_20-23-51
done: true
experiment_id: 947614a8c2a74533be128b929f363bd1
experiment_tag: 7_batch_size=8,l1=8,l2=8,lr=0.00015589
hostname: 1a844a452371
iterations_since_restore: 1
loss: 2.3038805620193483
node_ip: 172.17.0.2
pid: 1494
should_checkpoint: true
time_since_restore: 41.69914960861206
time_this_iter_s: 41.69914960861206
time_total_s: 41.69914960861206
timestamp: 1609878231
timesteps_since_restore: 0
training_iteration: 1
trial_id: d3304_00007
(The intermediate output is too long to show and has been omitted here.)
== Status ==
Memory usage on this node: 4.0/240.1 GiB
Using AsyncHyperBand: num_stopped=10
Bracket: Iter 8.000: -1.4607074356479388 | Iter 4.000: -1.4950430885698218 | Iter 2.000: -1.7041921138763427 | Iter 1.000: -2.301384049916267
Resources requested: 0/32 CPUs, 0/2 GPUs, 0.0/157.71 GiB heap, 0.0/49.37 GiB objects
Result logdir: /var/lib/jenkins/ray_results/DEFAULT_2021-01-05_20-23-08
Number of trials: 10/10 (10 TERMINATED)
+---------------------+------------+-------+--------------+------+------+-------------+---------+------------+----------------------+
| Trial name | status | loc | batch_size | l1 | l2 | lr | loss | accuracy | training_iteration |
|---------------------+------------+-------+--------------+------+------+-------------+---------+------------+----------------------|
| DEFAULT_d3304_00000 | TERMINATED | | 2 | 4 | 16 | 0.000111924 | 1.8605 | 0.2724 | 2 |
| DEFAULT_d3304_00001 | TERMINATED | | 8 | 16 | 32 | 0.077467 | 2.32186 | 0.1017 | 1 |
| DEFAULT_d3304_00002 | TERMINATED | | 4 | 8 | 128 | 0.00436986 | 1.72073 | 0.3849 | 4 |
| DEFAULT_d3304_00003 | TERMINATED | | 16 | 32 | 4 | 0.00120234 | 1.22223 | 0.5619 | 10 |
| DEFAULT_d3304_00004 | TERMINATED | | 4 | 16 | 32 | 0.016474 | 2.31342 | 0.102 | 1 |
| DEFAULT_d3304_00005 | TERMINATED | | 4 | 128 | 64 | 0.00757252 | 1.77391 | 0.3647 | 2 |
| DEFAULT_d3304_00006 | TERMINATED | | 2 | 64 | 256 | 0.00177236 | 1.6539 | 0.467 | 8 |
| DEFAULT_d3304_00007 | TERMINATED | | 8 | 8 | 8 | 0.000155891 | 2.30388 | 0.1011 | 1 |
| DEFAULT_d3304_00008 | TERMINATED | | 2 | 16 | 64 | 0.0310199 | 2.33698 | 0.0983 | 1 |
| DEFAULT_d3304_00009 | TERMINATED | | 4 | 4 | 32 | 0.0175239 | 2.31098 | 0.101 | 1 |
+---------------------+------------+-------+--------------+------+------+-------------+---------+------------+----------------------+
Best trial config: {'l1': 32, 'l2': 4, 'lr': 0.0012023396319256663, 'batch_size': 16}
Best trial final validation loss: 1.2222298237800597
Best trial final validation accuracy: 0.5619
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.5537
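The output above is an example of what running the code can produce.
Most trials were stopped early in order to avoid wasting resources. The best performing trial achieved a validation accuracy of about 56%, which is confirmed on the test set (about 55%).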
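The selection that get_best_trial("loss", "min", "last") performs can be retraced by hand from the final status table. The sketch below is plain Python rather than the Ray Tune API: it copies the last reported values from that table and picks the trial with the lowest loss.

```python
# (trial name, final validation loss, final validation accuracy, training iterations),
# copied from the final status table in the example output.
trials = [
    ("DEFAULT_d3304_00000", 1.8605, 0.2724, 2),
    ("DEFAULT_d3304_00001", 2.32186, 0.1017, 1),
    ("DEFAULT_d3304_00002", 1.72073, 0.3849, 4),
    ("DEFAULT_d3304_00003", 1.22223, 0.5619, 10),
    ("DEFAULT_d3304_00004", 2.31342, 0.102, 1),
    ("DEFAULT_d3304_00005", 1.77391, 0.3647, 2),
    ("DEFAULT_d3304_00006", 1.6539, 0.467, 8),
    ("DEFAULT_d3304_00007", 2.30388, 0.1011, 1),
    ("DEFAULT_d3304_00008", 2.33698, 0.0983, 1),
    ("DEFAULT_d3304_00009", 2.31098, 0.101, 1),
]

# The lowest last-reported loss wins.
best = min(trials, key=lambda trial: trial[1])

# Trials that did not reach all 10 epochs were stopped early by the scheduler.
stopped_early = [name for name, _, _, iterations in trials if iterations < 10]

print(best)
print(len(stopped_early), "of", len(trials), "trials stopped early")
```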
So that's it! You can now tune the parameters of your PyTorch models.
Total running time of the script: (14 minutes 43.400 seconds)