
Eat-pytorch-6-hyperAPI

1. Introduction

1.1 Preface

This series of posts contains my study notes for the Heywhale (和鲸社区) event《20天吃掉那只PyTorch》(Eat That PyTorch in 20 Days); this sixth installment covers PyTorch's high-level API. The source project has 2.8K stars on GitHub. While working through it, the first part of the book《Python深度学习》(Deep Learning with Python), "Fundamentals of Deep Learning", makes good companion reading.

Deep Learning with Python was written by François Chollet, the creator of Keras. It assumes no prior machine-learning knowledge and uses Keras to demonstrate deep-learning best practices through rich examples. The book is easy to follow, contains not a single mathematical formula, and focuses on building the reader's intuition for deep learning.

The first part of Deep Learning with Python consists of the following 4 chapters, which a reader can expect to finish within 20 hours:

  1. What is deep learning
  2. The mathematical building blocks of neural networks
  3. Getting started with neural networks
  4. Fundamentals of machine learning

The outline of this series is as follows:

  • I. PyTorch's modeling workflow
  • II. PyTorch's core concepts
  • III. PyTorch's hierarchy
  • IV. PyTorch's low-level API
  • V. PyTorch's mid-level API
  • VI. PyTorch's high-level API

Finally, all of the data used in this post is available for download from the link below:

Download Now

1.2 PyTorch's High-level API

PyTorch has no official high-level API. One generally builds a model by subclassing nn.Module and writes a custom training loop by hand.
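
A bare-bones version of that pattern looks like the sketch below (the tiny linear model and the random batch are placeholders, not from the original text):

import torch
from torch import nn

model = nn.Linear(2, 1)                        # placeholder model
loss_func = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(8, 2)                   # placeholder batch
labels = torch.randn(8, 1)

for epoch in range(3):
    optimizer.zero_grad()                      # clear old gradients
    loss = loss_func(model(features), labels)  # forward pass
    loss.backward()                            # compute gradients
    optimizer.step()                           # update parameters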

To make training more convenient, the author of the book wrote a Keras-style model interface for PyTorch, torchkeras, which serves as a high-level API on top of PyTorch.

This chapter covers the following aspects of PyTorch's high-level API in detail:

  • 3 ways to build models (subclassing the nn.Module base class, using nn.Sequential, using model containers as helpers)
  • 3 ways to train models (script style, function style, torchkeras.Model class style)
  • Training models on GPU (single-GPU and multi-GPU training)

2. Three Ways to Build Models

A model can be built in any of the following 3 ways:

  1. Subclass the nn.Module base class and build a custom model.

  2. Use nn.Sequential to build the model layer by layer.

  3. Subclass the nn.Module base class and use model containers (nn.Sequential, nn.ModuleList, nn.ModuleDict) to encapsulate parts of the model.

The first way is the most common, the second is the simplest, and the third is the most flexible but also relatively complex. The first way is recommended.

import torch
from torch import nn
from torchkeras import summary

2.1 Subclassing nn.Module to Build a Custom Model

Below is an example of building a custom model by subclassing nn.Module. The layers the model uses are generally defined in the __init__ method, and the model's forward-pass logic is defined in the forward method.

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3)
        self.pool1 = nn.MaxPool2d(kernel_size = 2,stride = 2)
        self.conv2 = nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5)
        self.pool2 = nn.MaxPool2d(kernel_size = 2,stride = 2)
        self.dropout = nn.Dropout2d(p = 0.1)
        self.adaptive_pool = nn.AdaptiveMaxPool2d((1,1))
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(64,32)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(32,1)
        self.sigmoid = nn.Sigmoid()

    def forward(self,x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = self.dropout(x)
        x = self.adaptive_pool(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        y = self.sigmoid(x)
        return y

net = Net()
print(net)

Results:

Net(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout): Dropout2d(p=0.1, inplace=False)
  (adaptive_pool): AdaptiveMaxPool2d(output_size=(1, 1))
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear1): Linear(in_features=64, out_features=32, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=32, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
summary(net,input_shape= (3,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 896
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 1] 33
Sigmoid-11 [-1, 1] 0
================================================================
Total params: 54,273
Trainable params: 54,273
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.011719
Forward/backward pass size (MB): 0.359634
Params size (MB): 0.207035
Estimated Total Size (MB): 0.578388
----------------------------------------------------------------

2.2 Using nn.Sequential to Build a Model Layer by Layer

A model built layer by layer with nn.Sequential does not need a forward method; this approach only suits simple models.

Below are several equivalent ways to build the model with nn.Sequential.

2.2.1 Using the add_module Method

net = nn.Sequential()
net.add_module("conv1",nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3))
net.add_module("pool1",nn.MaxPool2d(kernel_size = 2,stride = 2))
net.add_module("conv2",nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5))
net.add_module("pool2",nn.MaxPool2d(kernel_size = 2,stride = 2))
net.add_module("dropout",nn.Dropout2d(p = 0.1))
net.add_module("adaptive_pool",nn.AdaptiveMaxPool2d((1,1)))
net.add_module("flatten",nn.Flatten())
net.add_module("linear1",nn.Linear(64,32))
net.add_module("relu",nn.ReLU())
net.add_module("linear2",nn.Linear(32,1))
net.add_module("sigmoid",nn.Sigmoid())

print(net)

Results:

Sequential(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout): Dropout2d(p=0.1, inplace=False)
  (adaptive_pool): AdaptiveMaxPool2d(output_size=(1, 1))
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear1): Linear(in_features=64, out_features=32, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=32, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

2.2.2 Using Variadic Arguments

With this approach you cannot give each layer a name.

net = nn.Sequential(
    nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3),
    nn.MaxPool2d(kernel_size = 2,stride = 2),
    nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
    nn.MaxPool2d(kernel_size = 2,stride = 2),
    nn.Dropout2d(p = 0.1),
    nn.AdaptiveMaxPool2d((1,1)),
    nn.Flatten(),
    nn.Linear(64,32),
    nn.ReLU(),
    nn.Linear(32,1),
    nn.Sigmoid()
)

print(net)

Results:

Sequential(
  (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
  (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (4): Dropout2d(p=0.1, inplace=False)
  (5): AdaptiveMaxPool2d(output_size=(1, 1))
  (6): Flatten(start_dim=1, end_dim=-1)
  (7): Linear(in_features=64, out_features=32, bias=True)
  (8): ReLU()
  (9): Linear(in_features=32, out_features=1, bias=True)
  (10): Sigmoid()
)

2.2.3 Using an OrderedDict

from collections import OrderedDict

net = nn.Sequential(OrderedDict(
    [("conv1",nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3)),
     ("pool1",nn.MaxPool2d(kernel_size = 2,stride = 2)),
     ("conv2",nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5)),
     ("pool2",nn.MaxPool2d(kernel_size = 2,stride = 2)),
     ("dropout",nn.Dropout2d(p = 0.1)),
     ("adaptive_pool",nn.AdaptiveMaxPool2d((1,1))),
     ("flatten",nn.Flatten()),
     ("linear1",nn.Linear(64,32)),
     ("relu",nn.ReLU()),
     ("linear2",nn.Linear(32,1)),
     ("sigmoid",nn.Sigmoid())
    ])
)
print(net)

Results:

Sequential(
  (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout): Dropout2d(p=0.1, inplace=False)
  (adaptive_pool): AdaptiveMaxPool2d(output_size=(1, 1))
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear1): Linear(in_features=64, out_features=32, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=32, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
summary(net,input_shape= (3,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 896
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 1] 33
Sigmoid-11 [-1, 1] 0
================================================================
Total params: 54,273
Trainable params: 54,273
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.011719
Forward/backward pass size (MB): 0.359634
Params size (MB): 0.207035
Estimated Total Size (MB): 0.578388
----------------------------------------------------------------

2.3 Subclassing nn.Module and Using Model Containers

That is, subclass the nn.Module base class to build the model, and use model containers to encapsulate parts of it.

When the model structure is fairly complex, we can use model containers (nn.Sequential, nn.ModuleList, nn.ModuleDict) to encapsulate parts of it. This gives the overall model a clearer hierarchy and can sometimes reduce the amount of code.

Note: in the examples below we use only one kind of container at a time, but in practice these containers can be used very flexibly; they can be combined and nested arbitrarily within one model, as in the sketch below.
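
For instance, an nn.Sequential block can live inside an nn.ModuleList. The toy model below (a sketch, not from the original text) nests one container inside the other:

class NestedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Sequential blocks nested inside an nn.ModuleList
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3,32,kernel_size = 3), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32,64,kernel_size = 3), nn.ReLU())
        ])
        self.head = nn.Sequential(nn.AdaptiveMaxPool2d((1,1)), nn.Flatten(), nn.Linear(64,1))

    def forward(self,x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)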

2.3.1 nn.Sequential as a Model Container

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Dropout2d(p = 0.1),
            nn.AdaptiveMaxPool2d((1,1))
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64,32),
            nn.ReLU(),
            nn.Linear(32,1),
            nn.Sigmoid()
        )
    def forward(self,x):
        x = self.conv(x)
        y = self.dense(x)
        return y

net = Net()
print(net)

Results:

Net(
  (conv): Sequential(
    (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
    (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Dropout2d(p=0.1, inplace=False)
    (5): AdaptiveMaxPool2d(output_size=(1, 1))
  )
  (dense): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=64, out_features=32, bias=True)
    (2): ReLU()
    (3): Linear(in_features=32, out_features=1, bias=True)
    (4): Sigmoid()
  )
)

2.3.2 nn.ModuleList as a Model Container

Note that the ModuleList below cannot be replaced by a plain Python list; the sketch after this paragraph shows why.
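
The reason is parameter registration: submodules stored in a plain Python list are invisible to the parent nn.Module, so parameters() never returns their weights and the optimizer never updates them. A quick check (a sketch, not from the original text):

class BadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(2,2)]                 # plain list: NOT registered

class GoodNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(2,2)])  # registered as submodules

print(len(list(BadNet().parameters())))   # 0 -- the optimizer would see nothing
print(len(list(GoodNet().parameters())))  # 2 -- weight and bias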

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Dropout2d(p = 0.1),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(64,32),
            nn.ReLU(),
            nn.Linear(32,1),
            nn.Sigmoid()]
        )
    def forward(self,x):
        for layer in self.layers:
            x = layer(x)
        return x

net = Net()
print(net)

Results:

Net(
  (layers): ModuleList(
    (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
    (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Dropout2d(p=0.1, inplace=False)
    (5): AdaptiveMaxPool2d(output_size=(1, 1))
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=64, out_features=32, bias=True)
    (8): ReLU()
    (9): Linear(in_features=32, out_features=1, bias=True)
    (10): Sigmoid()
  )
)
summary(net,input_shape= (3,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 896
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 1] 33
Sigmoid-11 [-1, 1] 0
================================================================
Total params: 54,273
Trainable params: 54,273
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.011719
Forward/backward pass size (MB): 0.359634
Params size (MB): 0.207035
Estimated Total Size (MB): 0.578388
----------------------------------------------------------------

2.3.3 nn.ModuleDict as a Model Container

Note that the ModuleDict below cannot be replaced by a plain Python dict, for the same reason; see the sketch below.
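
Parameters held in a plain dict are likewise invisible to parameters(). A minimal check (a sketch, not from the original text):

class DictNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers_dict = {"fc": nn.Linear(2,2)}  # plain dict: NOT registered

print(len(list(DictNet().parameters())))  # 0 -- an nn.ModuleDict here would give 2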

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.layers_dict = nn.ModuleDict({
            "conv1": nn.Conv2d(in_channels=3,out_channels=32,kernel_size = 3),
            "pool": nn.MaxPool2d(kernel_size = 2,stride = 2),
            "conv2": nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
            "dropout": nn.Dropout2d(p = 0.1),
            "adaptive": nn.AdaptiveMaxPool2d((1,1)),
            "flatten": nn.Flatten(),
            "linear1": nn.Linear(64,32),
            "relu": nn.ReLU(),
            "linear2": nn.Linear(32,1),
            "sigmoid": nn.Sigmoid()
        })
    def forward(self,x):
        layers = ["conv1","pool","conv2","pool","dropout","adaptive",
                  "flatten","linear1","relu","linear2","sigmoid"]
        for layer in layers:
            x = self.layers_dict[layer](x)
        return x

net = Net()
print(net)

Results:

Net(
  (layers_dict): ModuleDict(
    (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1))
    (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
    (dropout): Dropout2d(p=0.1, inplace=False)
    (adaptive): AdaptiveMaxPool2d(output_size=(1, 1))
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (linear1): Linear(in_features=64, out_features=32, bias=True)
    (relu): ReLU()
    (linear2): Linear(in_features=32, out_features=1, bias=True)
    (sigmoid): Sigmoid()
  )
)
summary(net,input_shape= (3,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 896
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 1] 33
Sigmoid-11 [-1, 1] 0
================================================================
Total params: 54,273
Trainable params: 54,273
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.011719
Forward/backward pass size (MB): 0.359634
Params size (MB): 0.207035
Estimated Total Size (MB): 0.578388
----------------------------------------------------------------

3. Three Ways to Train Models

PyTorch usually requires the user to write a custom training loop, and the code style of these loops varies from person to person.

There are 3 typical styles of training-loop code:

  • script-style training loop
  • function-style training loop
  • class-style training loop

Below, using the training of a classification model on the MNIST dataset as an example, we demonstrate these 3 training styles. For the class-style loop we will use both torchkeras.Model and torchkeras.LightModel.

3.1 Prepare data

import torch
from torch import nn

import torchvision
from torchvision import transforms

transform = transforms.Compose([transforms.ToTensor()])

data_dir = '../data/'
ds_train = torchvision.datasets.MNIST(root=data_dir + "minist/",train=True,download=True,transform=transform)
ds_valid = torchvision.datasets.MNIST(root=data_dir + "minist/",train=False,download=True,transform=transform)

dl_train = torch.utils.data.DataLoader(ds_train, batch_size=128, shuffle=True, num_workers=1)
dl_valid = torch.utils.data.DataLoader(ds_valid, batch_size=128, shuffle=False, num_workers=1)

print(len(ds_train))
print(len(ds_valid))

Results:

60000
10000
  • Inspect a few samples

    # inspect a few samples
    from matplotlib import pyplot as plt

    plt.figure(figsize=(8,8))
    for i in range(9):
        img,label = ds_train[i]
        img = torch.squeeze(img)
        ax=plt.subplot(3,3,i+1)
        ax.imshow(img.numpy())
        ax.set_title("label = %d"%label)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()

    Results: (figure: a 3×3 grid of MNIST sample images with their labels)

3.2 Script Style

The script-style training loop is the most common.

net = nn.Sequential()
net.add_module("conv1",nn.Conv2d(in_channels=1,out_channels=32,kernel_size = 3))
net.add_module("pool1",nn.MaxPool2d(kernel_size = 2,stride = 2))
net.add_module("conv2",nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5))
net.add_module("pool2",nn.MaxPool2d(kernel_size = 2,stride = 2))
net.add_module("dropout",nn.Dropout2d(p = 0.1))
net.add_module("adaptive_pool",nn.AdaptiveMaxPool2d((1,1)))
net.add_module("flatten",nn.Flatten())
net.add_module("linear1",nn.Linear(64,32))
net.add_module("relu",nn.ReLU())
net.add_module("linear2",nn.Linear(32,10))

print(net)

Results:

Sequential(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (dropout): Dropout2d(p=0.1, inplace=False)
  (adaptive_pool): AdaptiveMaxPool2d(output_size=(1, 1))
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear1): Linear(in_features=64, out_features=32, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=32, out_features=10, bias=True)
)
summary(net,input_shape=(1,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 320
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 10] 330
================================================================
Total params: 53,994
Trainable params: 53,994
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.003906
Forward/backward pass size (MB): 0.359695
Params size (MB): 0.205971
Estimated Total Size (MB): 0.569572
----------------------------------------------------------------
  • Model

    import datetime
    import numpy as np
    import pandas as pd
    from sklearn.metrics import accuracy_score

    def accuracy(y_pred,y_true):
        y_pred_cls = torch.argmax(nn.Softmax(dim=1)(y_pred),dim=1).data
        return accuracy_score(y_true,y_pred_cls)

    loss_func = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(params=net.parameters(),lr = 0.01)
    metric_func = accuracy
    metric_name = "accuracy"

    epochs = 3
    log_step_freq = 100

    dfhistory = pd.DataFrame(columns = ["epoch","loss",metric_name,"val_loss","val_"+metric_name])
    print("Start Training...")
    nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("=========="*8 + "%s"%nowtime)

    for epoch in range(1,epochs+1):
        # 1. training loop -------------------------------------------------
        net.train()
        loss_sum = 0.0
        metric_sum = 0.0
        step = 1

        for step, (features,labels) in enumerate(dl_train, 1):

            # zero the gradients
            optimizer.zero_grad()

            # forward pass to compute the loss
            predictions = net(features)
            loss = loss_func(predictions,labels)
            metric = metric_func(predictions,labels)

            # backward pass to compute the gradients
            loss.backward()
            optimizer.step()

            # print batch-level logs
            loss_sum += loss.item()
            metric_sum += metric.item()
            if step%log_step_freq == 0:
                print(("[step = %d] loss: %.3f, "+metric_name+": %.3f") %
                      (step, loss_sum/step, metric_sum/step))

        # 2. validation loop -------------------------------------------------
        net.eval()
        val_loss_sum = 0.0
        val_metric_sum = 0.0
        val_step = 1

        for val_step, (features,labels) in enumerate(dl_valid, 1):
            with torch.no_grad():
                predictions = net(features)
                val_loss = loss_func(predictions,labels)
                val_metric = metric_func(predictions,labels)

            val_loss_sum += val_loss.item()
            val_metric_sum += val_metric.item()

        # 3. record the logs -------------------------------------------------
        info = (epoch, loss_sum/step, metric_sum/step,
                val_loss_sum/val_step, val_metric_sum/val_step)
        dfhistory.loc[epoch-1] = info

        # print epoch-level logs
        print(("\nEPOCH = %d, loss = %.3f,"+ metric_name + \
               " = %.3f, val_loss = %.3f, "+"val_"+ metric_name+" = %.3f")
              %info)
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("\n"+"=========="*8 + "%s"%nowtime)

    print('Finished Training...')

    Results:

    Start Training...
    ================================================================================2022-02-07 19:51:52

    [step = 100] loss: 0.625, accuracy: 0.800
    [step = 200] loss: 0.398, accuracy: 0.874
    [step = 300] loss: 0.314, accuracy: 0.902
    [step = 400] loss: 0.267, accuracy: 0.917

    EPOCH = 1, loss = 0.244,accuracy = 0.924, val_loss = 0.105, val_accuracy = 0.968

    ================================================================================2022-02-07 19:52:59
    [step = 100] loss: 0.109, accuracy: 0.966
    [step = 200] loss: 0.112, accuracy: 0.966
    [step = 300] loss: 0.110, accuracy: 0.966
    [step = 400] loss: 0.106, accuracy: 0.968

    EPOCH = 2, loss = 0.104,accuracy = 0.969, val_loss = 0.063, val_accuracy = 0.981

    ================================================================================2022-02-07 19:53:59
    [step = 100] loss: 0.083, accuracy: 0.974
    [step = 200] loss: 0.085, accuracy: 0.974
    [step = 300] loss: 0.085, accuracy: 0.974
    [step = 400] loss: 0.090, accuracy: 0.973

    EPOCH = 3, loss = 0.090,accuracy = 0.973, val_loss = 0.063, val_accuracy = 0.983

    ================================================================================2022-02-07 19:55:01
    Finished Training...

3.3 Function Style

This style simply wraps the script form in functions.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels=1,out_channels=32,kernel_size = 3),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Dropout2d(p = 0.1),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(64,32),
            nn.ReLU(),
            nn.Linear(32,10)]
        )
    def forward(self,x):
        for layer in self.layers:
            x = layer(x)
        return x

net = Net()
print(net)

Results:

Net(
  (layers): ModuleList(
    (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
    (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Dropout2d(p=0.1, inplace=False)
    (5): AdaptiveMaxPool2d(output_size=(1, 1))
    (6): Flatten(start_dim=1, end_dim=-1)
    (7): Linear(in_features=64, out_features=32, bias=True)
    (8): ReLU()
    (9): Linear(in_features=32, out_features=10, bias=True)
  )
)
summary(net,input_shape=(1,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 320
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 10] 330
================================================================
Total params: 53,994
Trainable params: 53,994
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.003906
Forward/backward pass size (MB): 0.359695
Params size (MB): 0.205971
Estimated Total Size (MB): 0.569572
----------------------------------------------------------------
  • Model-Train_step

    import datetime
    import numpy as np
    import pandas as pd
    from sklearn.metrics import accuracy_score

    def accuracy(y_pred,y_true):
        y_pred_cls = torch.argmax(nn.Softmax(dim=1)(y_pred),dim=1).data
        return accuracy_score(y_true,y_pred_cls)

    model = net
    model.optimizer = torch.optim.SGD(model.parameters(),lr = 0.01)
    model.loss_func = nn.CrossEntropyLoss()
    model.metric_func = accuracy
    model.metric_name = "accuracy"

    def train_step(model,features,labels):
        # training mode: the dropout layers are active
        model.train()

        # zero the gradients
        model.optimizer.zero_grad()

        # forward pass to compute the loss
        predictions = model(features)
        loss = model.loss_func(predictions,labels)
        metric = model.metric_func(predictions,labels)

        # backward pass to compute the gradients
        loss.backward()
        model.optimizer.step()

        return loss.item(),metric.item()

    @torch.no_grad()
    def valid_step(model,features,labels):

        # evaluation mode: the dropout layers are inactive
        model.eval()

        predictions = model(features)
        loss = model.loss_func(predictions,labels)
        metric = model.metric_func(predictions,labels)

        return loss.item(), metric.item()

    # test the effect of train_step
    features,labels = next(iter(dl_train))
    train_step(model,features,labels)

    Results:

      (2.3197388648986816, 0.09375)
  • Model-Train_model

    def train_model(model,epochs,dl_train,dl_valid,log_step_freq):

        metric_name = model.metric_name
        dfhistory = pd.DataFrame(columns = ["epoch","loss",metric_name,"val_loss","val_"+metric_name])
        print("Start Training...")
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("=========="*8 + "%s"%nowtime)

        for epoch in range(1,epochs+1):
            # 1. training loop -------------------------------------------------
            loss_sum = 0.0
            metric_sum = 0.0
            step = 1

            for step, (features,labels) in enumerate(dl_train, 1):

                loss,metric = train_step(model,features,labels)

                # print batch-level logs
                loss_sum += loss
                metric_sum += metric
                if step%log_step_freq == 0:
                    print(("[step = %d] loss: %.3f, "+metric_name+": %.3f") %
                          (step, loss_sum/step, metric_sum/step))

            # 2. validation loop -------------------------------------------------
            val_loss_sum = 0.0
            val_metric_sum = 0.0
            val_step = 1

            for val_step, (features,labels) in enumerate(dl_valid, 1):

                val_loss,val_metric = valid_step(model,features,labels)

                val_loss_sum += val_loss
                val_metric_sum += val_metric

            # 3. record the logs -------------------------------------------------
            info = (epoch, loss_sum/step, metric_sum/step,
                    val_loss_sum/val_step, val_metric_sum/val_step)
            dfhistory.loc[epoch-1] = info

            # print epoch-level logs
            print(("\nEPOCH = %d, loss = %.3f,"+ metric_name + \
                   " = %.3f, val_loss = %.3f, "+"val_"+ metric_name+" = %.3f")
                  %info)
            nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("\n"+"=========="*8 + "%s"%nowtime)

        print('Finished Training...')
        return dfhistory

    epochs = 3
    dfhistory = train_model(model,epochs,dl_train,dl_valid,log_step_freq = 100)

    Results:

    Start Training...
    ================================================================================2022-02-07 19:55:04
    [step = 100] loss: 2.301, accuracy: 0.111
    [step = 200] loss: 2.292, accuracy: 0.131
    [step = 300] loss: 2.284, accuracy: 0.153
    [step = 400] loss: 2.274, accuracy: 0.181

    EPOCH = 1, loss = 2.266,accuracy = 0.206, val_loss = 2.201, val_accuracy = 0.421

    ================================================================================2022-02-07 19:56:02
    [step = 100] loss: 2.185, accuracy: 0.411
    [step = 200] loss: 2.153, accuracy: 0.434
    [step = 300] loss: 2.110, accuracy: 0.454
    [step = 400] loss: 2.046, accuracy: 0.472

    EPOCH = 2, loss = 1.994,accuracy = 0.483, val_loss = 1.559, val_accuracy = 0.684

    ================================================================================2022-02-07 19:57:04
    [step = 100] loss: 1.487, accuracy: 0.603
    [step = 200] loss: 1.387, accuracy: 0.622
    [step = 300] loss: 1.293, accuracy: 0.643
    [step = 400] loss: 1.212, accuracy: 0.663

    EPOCH = 3, loss = 1.158,accuracy = 0.678, val_loss = 0.696, val_accuracy = 0.854

    ================================================================================2022-02-07 19:57:59
    Finished Training...

3.4 Class Style: torchkeras.Model

Here we build the model with torchkeras.Model and train it by calling its compile and fit methods. Training in this form is very concise.

import torchkeras

class CnnModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels=1,out_channels=32,kernel_size = 3),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Dropout2d(p = 0.1),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(64,32),
            nn.ReLU(),
            nn.Linear(32,10)]
        )
    def forward(self,x):
        for layer in self.layers:
            x = layer(x)
        return x

model = torchkeras.Model(CnnModel())
print(model)

Results:

Model(
  (net): CnnModel(
    (layers): ModuleList(
      (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
      (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
      (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (4): Dropout2d(p=0.1, inplace=False)
      (5): AdaptiveMaxPool2d(output_size=(1, 1))
      (6): Flatten(start_dim=1, end_dim=-1)
      (7): Linear(in_features=64, out_features=32, bias=True)
      (8): ReLU()
      (9): Linear(in_features=32, out_features=10, bias=True)
    )
  )
)
model.summary(input_shape=(1,32,32))

Results:

----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 320
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 10] 330
================================================================
Total params: 53,994
Trainable params: 53,994
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.003906
Forward/backward pass size (MB): 0.359695
Params size (MB): 0.205971
Estimated Total Size (MB): 0.569572
----------------------------------------------------------------
  • Model

    from sklearn.metrics import accuracy_score

    def accuracy(y_pred,y_true):
        y_pred_cls = torch.argmax(nn.Softmax(dim=1)(y_pred),dim=1).data
        return accuracy_score(y_true.numpy(),y_pred_cls.numpy())

    model.compile(loss_func = nn.CrossEntropyLoss(),
                  optimizer= torch.optim.Adam(model.parameters(),lr = 0.02),
                  metrics_dict={"accuracy":accuracy})

    dfhistory = model.fit(3,dl_train = dl_train, dl_val=dl_valid, log_step_freq=100)

    Results:

    Start Training ...

    ================================================================================2022-02-07 19:57:59
    {'step': 100, 'loss': 1.004, 'accuracy': 0.653}
    {'step': 200, 'loss': 0.631, 'accuracy': 0.787}
    {'step': 300, 'loss': 0.49, 'accuracy': 0.837}
    {'step': 400, 'loss': 0.417, 'accuracy': 0.863}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 1 | 0.387 | 0.874 | 0.142 | 0.957 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 19:58:54
    {'step': 100, 'loss': 0.162, 'accuracy': 0.95}
    {'step': 200, 'loss': 0.169, 'accuracy': 0.95}
    {'step': 300, 'loss': 0.162, 'accuracy': 0.952}
    {'step': 400, 'loss': 0.157, 'accuracy': 0.953}

    +-------+------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+------+----------+----------+--------------+
    | 2 | 0.16 | 0.953 | 0.111 | 0.971 |
    +-------+------+----------+----------+--------------+

    ================================================================================2022-02-07 19:59:50
    {'step': 100, 'loss': 0.145, 'accuracy': 0.959}
    {'step': 200, 'loss': 0.16, 'accuracy': 0.954}
    {'step': 300, 'loss': 0.161, 'accuracy': 0.954}
    {'step': 400, 'loss': 0.164, 'accuracy': 0.954}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 3 | 0.161 | 0.955 | 0.132 | 0.965 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:00:51
    Finished Training...

3.5 Class Style: torchkeras.LightModel

The following demonstrates how to use torchkeras.LightModel; for more detailed usage, refer to the torchkeras documentation.

import torchkeras
import pytorch_lightning as pl
import torchmetrics as metrics

class CnnNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels=1,out_channels=32,kernel_size = 3),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
            nn.MaxPool2d(kernel_size = 2,stride = 2),
            nn.Dropout2d(p = 0.1),
            nn.AdaptiveMaxPool2d((1,1)),
            nn.Flatten(),
            nn.Linear(64,32),
            nn.ReLU(),
            nn.Linear(32,10)]
        )
    def forward(self,x):
        for layer in self.layers:
            x = layer(x)
        return x

class Model(torchkeras.LightModel):

    # loss, and optional metrics
    def shared_step(self,batch)->dict:
        x, y = batch
        prediction = self(x)
        loss = nn.CrossEntropyLoss()(prediction,y)
        preds = torch.argmax(nn.Softmax(dim=1)(prediction),dim=1).data
        acc = metrics.functional.accuracy(preds, y)
        dic = {"loss":loss,"acc":acc}
        return dic

    # optimizer, and optional lr_scheduler
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.0001)
        return {"optimizer":optimizer,"lr_scheduler":lr_scheduler}

pl.seed_everything(1234)
net = CnnNet()
model = Model(net)

torchkeras.summary(model,input_shape=(1,32,32))
print(model)

Results:

Global seed set to 1234
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 30, 30] 320
MaxPool2d-2 [-1, 32, 15, 15] 0
Conv2d-3 [-1, 64, 11, 11] 51,264
MaxPool2d-4 [-1, 64, 5, 5] 0
Dropout2d-5 [-1, 64, 5, 5] 0
AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
Flatten-7 [-1, 64] 0
Linear-8 [-1, 32] 2,080
ReLU-9 [-1, 32] 0
Linear-10 [-1, 10] 330
================================================================
Total params: 53,994
Trainable params: 53,994
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.003906
Forward/backward pass size (MB): 0.359695
Params size (MB): 0.205971
Estimated Total Size (MB): 0.569572
----------------------------------------------------------------
Model(
  (net): CnnNet(
    (layers): ModuleList(
      (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
      (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
      (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (4): Dropout2d(p=0.1, inplace=False)
      (5): AdaptiveMaxPool2d(output_size=(1, 1))
      (6): Flatten(start_dim=1, end_dim=-1)
      (7): Linear(in_features=64, out_features=32, bias=True)
      (8): ReLU()
      (9): Linear(in_features=32, out_features=10, bias=True)
    )
  )
)
ckpt_cb = pl.callbacks.ModelCheckpoint(monitor='val_loss')

# set gpus=0 will use cpu,
# set gpus=1 will use 1 gpu
# set gpus=2 will use 2gpus
# set gpus = -1 will use all gpus
# you can also set gpus = [0,1] to use the given gpus
# you can even set tpu_cores=2 to use two tpus

trainer = pl.Trainer(max_epochs=10,gpus = 0, callbacks=[ckpt_cb])

trainer.fit(model,dl_train,dl_valid)

Results:

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

| Name | Type | Params
--------------------------------
0 | net | CnnNet | 54.0 K
--------------------------------
54.0 K Trainable params
0 Non-trainable params
54.0 K Total params
0.216 Total estimated model params size (MB)

...

4. Training Models on GPU

Deep-learning training is often very time-consuming: training a model for several hours is routine, training for days is common, and sometimes training takes weeks.

The time cost of training mainly comes from two parts: data preparation and parameter iteration.

  • Data preparation

    When data preparation is still the main bottleneck of training time, we can use more processes to prepare the data; see the DataLoader sketch after this list.

  • Parameter iteration

    When parameter iteration becomes the main bottleneck of training time, the usual approach is to accelerate it with GPUs.
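
In torch.utils.data.DataLoader this is just the num_workers argument (plus pin_memory when the batches will be copied to a GPU); a sketch with a placeholder dataset, not from the original text:

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(1000,2), torch.randn(1000,1))  # placeholder dataset
dl = DataLoader(ds,
                batch_size=128,
                shuffle=True,
                num_workers=4,    # number of subprocesses preparing batches
                pin_memory=True)  # page-locked memory speeds up host-to-GPU copies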

Using a GPU to accelerate a model in PyTorch is very simple: just move the model and the data onto the GPU. The core code is only the following few lines.

  1. Define the model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device) # move the model to cuda
  2. Train the model
features = features.to(device) # move the data to cuda
labels = labels.to(device) # or: labels = labels.cuda() if torch.cuda.is_available() else labels

Training with multiple GPUs is also very simple: just wrap the model as a data-parallel model. After the model is moved to the GPUs, a replica is kept on each GPU, and every batch of data is split evenly across the GPUs for training; a minimal sketch follows.
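
With the built-in nn.DataParallel wrapper this is a one-line change; a minimal sketch (the linear model is a placeholder, not from the original text):

model = nn.Linear(2,1)              # placeholder model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # keep a replica on each GPU; each batch is split across them
model.to(torch.device("cuda:0" if torch.cuda.is_available() else "cpu"))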

4.1 Basic GPU Operations

In a Colab notebook, select GPU under Edit -> Notebook settings -> Hardware accelerator.

Note: the following code only runs correctly on Colab. You can open the example notebook directly via this link: 《torch使用gpu训练模型》: https://colab.research.google.com/drive/1FDmi44-U3TFRCt9MwGn4HIj2SaaWIjHu?usp=sharing

import torch
from torch import nn
  1. Check GPU information

    # 1. check GPU information
    if_cuda = torch.cuda.is_available()
    print("if_cuda=",if_cuda)

    gpu_count = torch.cuda.device_count()
    print("gpu_count=",gpu_count)

    Results:

     if_cuda= False
     gpu_count= 0
  2. Move tensors between GPU and CPU

    Since a GPU is not properly configured on this machine, the code below raises an error.

    # 2. move tensors between GPU and CPU
    tensor = torch.rand((100,100))
    tensor_gpu = tensor.to("cuda:0") # or: tensor_gpu = tensor.cuda()
    print(tensor_gpu.device)
    print(tensor_gpu.is_cuda)

    tensor_cpu = tensor_gpu.to("cpu") # or: tensor_cpu = tensor_gpu.cpu()
    print(tensor_cpu.device)

    Results:

        ---------------------------------------------------------------------------
    RuntimeError Traceback (most recent call last)

    <ipython-input-24-cb760e986663> in <module>
    1 # 2,将张量在gpu和cpu间移动
    2 tensor = torch.rand((100,100))
    ----> 3 tensor_gpu = tensor.to("cuda:0") # 或者 tensor_gpu = tensor.cuda()
    4 print(tensor_gpu.device)
    5 print(tensor_gpu.is_cuda)

    D:\Programs\Anaconda\lib\site-packages\torch\cuda\__init__.py in _lazy_init()
    170 # This function throws if there's a driver initialization error, no GPUs
    171 # are found or any other error occurs
    --> 172 torch._C._cuda_init()
    173 # Some of the queued calls may reentrantly call _lazy_init();
    174 # we need to just return without initializing in that case.

    RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

  3. Move all of a model's tensors to the GPU

    # 3. move all of a model's tensors to the GPU
    net = nn.Linear(2,1)
    print(next(net.parameters()).is_cuda)
    net.to("cuda:0") # moves every parameter tensor of the model to the GPU in place; note that reassigning net = net.to("cuda:0") is not necessary
    print(next(net.parameters()).is_cuda)
    print(next(net.parameters()).device)

    Results:

    False
    ---------------------------------------------------------------------------
    RuntimeError Traceback (most recent call last)

    <ipython-input-25-f51cc87d7898> in <module>
    2 net = nn.Linear(2,1)
    3 print(next(net.parameters()).is_cuda)
    ----> 4 net.to("cuda:0") # 将模型中的全部参数张量依次到GPU上,注意,无需重新赋值为 net = net.to("cuda:0")
    5 print(next(net.parameters()).is_cuda)
    6 print(next(net.parameters()).device)

    D:\Programs\Anaconda\lib\site-packages\torch\nn\modules\module.py in to(self, *args, **kwargs)
    850 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    851
    --> 852 return self._apply(convert)
    853
    854 def register_backward_hook(

    D:\Programs\Anaconda\lib\site-packages\torch\nn\modules\module.py in _apply(self, fn)
    550 # `with torch.no_grad():`
    551 with torch.no_grad():
    --> 552 param_applied = fn(param)
    553 should_use_set_data = compute_should_use_set_data(param, param_applied)
    554 if should_use_set_data:

    D:\Programs\Anaconda\lib\site-packages\torch\nn\modules\module.py in convert(t)
    848 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
    849 non_blocking, memory_format=convert_to_format)
    --> 850 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    851
    852 return self._apply(convert)

    D:\Programs\Anaconda\lib\site-packages\torch\cuda\__init__.py in _lazy_init()
    170 # This function throws if there's a driver initialization error, no GPUs
    171 # are found or any other error occurs
    --> 172 torch._C._cuda_init()
    173 # Some of the queued calls may reentrantly call _lazy_init();
    174 # we need to just return without initializing in that case.

    RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.

  4. Create a data-parallel model that supports multiple GPUs

    # 4. create a data-parallel model that supports multiple GPUs
    linear = nn.Linear(2,1)
    print(next(linear.parameters()).device)

    model = nn.DataParallel(linear)
    print(model.device_ids)
    print(next(model.module.parameters()).device)

    # note: when saving parameters, save those of model.module
    torch.save(model.module.state_dict(), data_dir + "model_parameter.pkl")

    linear = nn.Linear(2,1)
    linear.load_state_dict(torch.load(data_dir + "model_parameter.pkl"))

    Results:

    cpu
    []
    cpu

    <All keys matched successfully>

  5. Empty the CUDA cache

    # 5. empty the CUDA cache

    # this method is very useful when CUDA runs out of memory
    torch.cuda.empty_cache()
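
    To see the effect, you can watch the caching allocator's counters around the call; a sketch (not from the original text; the numbers depend on the machine):

    if torch.cuda.is_available():
        x = torch.rand((1000,1000), device = "cuda:0")
        del x                                # the tensor is freed, but the caching allocator keeps the block
        print(torch.cuda.memory_reserved())  # bytes still held in the cache
        torch.cuda.empty_cache()             # hand cached blocks back to the driver
        print(torch.cuda.memory_reserved())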

4.2 Matrix Multiplication Example

Below we run the same matrix multiplication on the CPU and on the GPU and compare their speed.

import time
import torch
from torch import nn
  • Using the CPU

    # using the CPU
    a = torch.rand((10000,200))
    b = torch.rand((200,10000))
    tic = time.time()
    c = torch.matmul(a,b)
    toc = time.time()

    print(toc-tic)
    print(a.device)
    print(b.device)

    Results:

      0.6119661331176758
      cpu
      cpu
  • Using the GPU

    # using the GPU
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    a = torch.rand((10000,200),device = device) # a tensor can be created directly on the GPU
    b = torch.rand((200,10000)) # or created on the CPU and then moved to the GPU
    b = b.to(device) # or: b = b.cuda() if torch.cuda.is_available() else b
    tic = time.time()
    c = torch.matmul(a,b)
    toc = time.time()
    print(toc-tic)
    print(a.device)
    print(b.device)

    Results:

      0.43799781799316406
      cpu
      cpu    
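
One caveat about the timings above: CUDA kernels launch asynchronously, so wrapping a GPU matmul in time.time() can under-report the cost. When a real GPU is available, synchronizing before reading the clock gives an honest number; a sketch (not from the original text):

if torch.cuda.is_available():
    a = torch.rand((10000,200), device = "cuda:0")
    b = torch.rand((200,10000), device = "cuda:0")
    torch.cuda.synchronize()   # make sure setup work has finished
    tic = time.time()
    c = torch.matmul(a,b)
    torch.cuda.synchronize()   # wait for the kernel before stopping the clock
    print(time.time()-tic)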

4.3 Linear Regression Example

Below we compare the efficiency of training a linear regression model on the CPU and on the GPU.

4.3.1 Using the CPU

  • Prepare data

    # prepare data
    n = 1000000 # number of samples

    X = 10*torch.rand([n,2])-5.0 # torch.rand draws from a uniform distribution
    w0 = torch.tensor([[2.0,-3.0]])
    b0 = torch.tensor([[10.0]])
    Y = X@w0.t() + b0 + torch.normal( 0.0,2.0,size = [n,1]) # @ is matrix multiplication; add Gaussian noise

  • Define model

    # define the model
    class LinearRegression(nn.Module):
        def __init__(self):
            super().__init__()
            self.w = nn.Parameter(torch.randn_like(w0))
            self.b = nn.Parameter(torch.zeros_like(b0))
        # forward pass
        def forward(self,x):
            return x@self.w.t() + self.b

    linear = LinearRegression()

  • Train model

    # train the model
    optimizer = torch.optim.Adam(linear.parameters(),lr = 0.1)
    loss_func = nn.MSELoss()

    def train(epoches):
        tic = time.time()
        for epoch in range(epoches):
            optimizer.zero_grad()
            Y_pred = linear(X)
            loss = loss_func(Y_pred,Y)
            loss.backward()
            optimizer.step()
            if epoch%50==0:
                print({"epoch":epoch,"loss":loss.item()})
        toc = time.time()
        print("time used:",toc-tic)

    train(500)

    Results:

    {'epoch': 0, 'loss': 3.998589515686035}
    {'epoch': 50, 'loss': 3.9990580081939697}
    {'epoch': 100, 'loss': 3.998594045639038}
    {'epoch': 150, 'loss': 3.998589515686035}
    {'epoch': 200, 'loss': 3.998589515686035}
    {'epoch': 250, 'loss': 3.998589515686035}
    {'epoch': 300, 'loss': 3.998589277267456}
    {'epoch': 350, 'loss': 3.998589515686035}
    {'epoch': 400, 'loss': 3.998589515686035}
    {'epoch': 450, 'loss': 3.998589515686035}
    time used: 11.279652833938599

4.3.2 Using the GPU

Note: since a GPU is not properly configured on this machine, this run was actually executed on the CPU; the code logic itself is correct. To reproduce the GPU version, simply use the code below.

  • Prepare data

    # prepare data
    n = 1000000 # number of samples
    X = 10*torch.rand([n,2])-5.0 # torch.rand draws from a uniform distribution
    w0 = torch.tensor([[2.0,-3.0]])
    b0 = torch.tensor([[10.0]])
    Y = X@w0.t() + b0 + torch.normal( 0.0,2.0,size = [n,1]) # @ is matrix multiplication; add Gaussian noise
    # move to the GPU
    print("torch.cuda.is_available() = ", torch.cuda.is_available())
    X = X.cuda()
    Y = Y.cuda()
    print("X.device:",X.device)
    print("Y.device:",Y.device)

    Results:

    torch.cuda.is_available() =  False
    X.device: cpu
    Y.device: cpu

  • Define model

    # define the model
    class LinearRegression(nn.Module):
        def __init__(self):
            super().__init__()
            self.w = nn.Parameter(torch.randn_like(w0))
            self.b = nn.Parameter(torch.zeros_like(b0))
        # forward pass
        def forward(self,x):
            return x@self.w.t() + self.b

    linear = LinearRegression()

    # move the model to the GPU
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    linear.to(device)

    # check whether the model has been moved to the GPU
    print("if on cuda:",next(linear.parameters()).is_cuda)

    Results:

      if on cuda: False
  • Train model

    # train the model
    optimizer = torch.optim.Adam(linear.parameters(),lr = 0.1)
    loss_func = nn.MSELoss()

    def train(epoches):
        tic = time.time()
        for epoch in range(epoches):
            optimizer.zero_grad()
            Y_pred = linear(X)
            loss = loss_func(Y_pred,Y)
            loss.backward()
            optimizer.step()
            if epoch%50==0:
                print({"epoch":epoch,"loss":loss.item()})
        toc = time.time()
        print("time used:",toc-tic)

    train(500)

    Results:

    {'epoch': 0, 'loss': 185.2291717529297}
    {'epoch': 50, 'loss': 33.15369415283203}
    {'epoch': 100, 'loss': 9.041170120239258}
    {'epoch': 150, 'loss': 4.488009929656982}
    {'epoch': 200, 'loss': 4.019914150238037}
    {'epoch': 250, 'loss': 3.9960811138153076}
    {'epoch': 300, 'loss': 3.9955568313598633}
    {'epoch': 350, 'loss': 3.995553493499756}
    {'epoch': 400, 'loss': 3.995553493499756}
    {'epoch': 450, 'loss': 3.995553493499756}
    time used: 11.048993825912476

4.4 torchkeras.Model: Single-GPU Example

The following demonstrates how to train a model on a GPU with torchkeras.Model.

The corresponding CPU training code appeared in "6-2: Three Ways to Train Models" (section 3 above); this example only adds one line on top of it: specify device when calling model.compile.

  1. Prepare data

    !pip install -U torchkeras

    import torch
    from torch import nn

    import torchvision
    from torchvision import transforms

    import torchkeras

    transform = transforms.Compose([transforms.ToTensor()])

    ds_train = torchvision.datasets.MNIST(root=data_dir + "minist/",train=True,download=True,transform=transform)
    ds_valid = torchvision.datasets.MNIST(root=data_dir + "minist/",train=False,download=True,transform=transform)

    dl_train = torch.utils.data.DataLoader(ds_train, batch_size=128, shuffle=True, num_workers=1)
    dl_valid = torch.utils.data.DataLoader(ds_valid, batch_size=128, shuffle=False, num_workers=1)

    print(len(ds_train))
    print(len(ds_valid))

    Results:

     60000
     10000

    Note: in the code above, if num_workers is set to 4, the following error is raised during training:

    RuntimeError: DataLoader worker (pid(s) 9320, 8476, 8504, 4600) exited unexpectedly

    To avoid this error, the code above sets num_workers to 1. (A likely alternative fix is sketched below.)
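
    A common trigger for this error on Windows (which the traceback paths elsewhere in this post suggest) is creating a multi-worker DataLoader outside a main-module guard. If that is the cause, num_workers > 1 may work once the entry point is protected; a sketch (not from the original text, assuming the code runs as a script):

    if __name__ == "__main__":
        dl_train = torch.utils.data.DataLoader(ds_train, batch_size=128,
                                               shuffle=True, num_workers=4)
        for features, labels in dl_train:  # worker subprocesses are spawned here
            print(features.shape)
            break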

    # inspect a few samples
    from matplotlib import pyplot as plt

    plt.figure(figsize=(8,8))
    for i in range(9):
        img,label = ds_train[i]
        img = torch.squeeze(img)
        ax=plt.subplot(3,3,i+1)
        ax.imshow(img.numpy())
        ax.set_title("label = %d"%label)
        ax.set_xticks([])
        ax.set_yticks([])
    plt.show()

    Results: (figure: a 3×3 grid of MNIST sample images with their labels)

  2. Define model

    class CnnModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Conv2d(in_channels=1,out_channels=32,kernel_size = 3),
                nn.MaxPool2d(kernel_size = 2,stride = 2),
                nn.Conv2d(in_channels=32,out_channels=64,kernel_size = 5),
                nn.MaxPool2d(kernel_size = 2,stride = 2),
                nn.Dropout2d(p = 0.1),
                nn.AdaptiveMaxPool2d((1,1)),
                nn.Flatten(),
                nn.Linear(64,32),
                nn.ReLU(),
                nn.Linear(32,10)]
            )
        def forward(self,x):
            for layer in self.layers:
                x = layer(x)
            return x

    net = CnnModel()
    model = torchkeras.Model(net)
    model.summary(input_shape=(1,32,32))

    Results:

    ----------------------------------------------------------------
    Layer (type) Output Shape Param #
    ================================================================
    Conv2d-1 [-1, 32, 30, 30] 320
    MaxPool2d-2 [-1, 32, 15, 15] 0
    Conv2d-3 [-1, 64, 11, 11] 51,264
    MaxPool2d-4 [-1, 64, 5, 5] 0
    Dropout2d-5 [-1, 64, 5, 5] 0
    AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
    Flatten-7 [-1, 64] 0
    Linear-8 [-1, 32] 2,080
    ReLU-9 [-1, 32] 0
    Linear-10 [-1, 10] 330
    ================================================================
    Total params: 53,994
    Trainable params: 53,994
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.003906
    Forward/backward pass size (MB): 0.359695
    Params size (MB): 0.205971
    Estimated Total Size (MB): 0.569572
    ----------------------------------------------------------------

  3. Train model

    from sklearn.metrics import accuracy_score

    def accuracy(y_pred,y_true):
        y_pred_cls = torch.argmax(nn.Softmax(dim=1)(y_pred),dim=1).data
        return accuracy_score(y_true.cpu().numpy(),y_pred_cls.cpu().numpy())
        # note: the data must first be moved to the CPU before converting to numpy arrays

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    model.compile(loss_func = nn.CrossEntropyLoss(),
                  optimizer= torch.optim.Adam(model.parameters(),lr = 0.02),
                  metrics_dict={"accuracy":accuracy},device = device) # note: device is specified here in compile

    dfhistory = model.fit(3,dl_train = dl_train, dl_val=dl_valid, log_step_freq=100)

    Results:

    Start Training ...
    ================================================================================2022-02-07 20:25:25
    {'step': 100, 'loss': 0.865, 'accuracy': 0.712}
    {'step': 200, 'loss': 0.548, 'accuracy': 0.821}
    {'step': 300, 'loss': 0.426, 'accuracy': 0.863}
    {'step': 400, 'loss': 0.362, 'accuracy': 0.885}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 1 | 0.333 | 0.895 | 0.092 | 0.973 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:26:24
    {'step': 100, 'loss': 0.158, 'accuracy': 0.953}
    {'step': 200, 'loss': 0.163, 'accuracy': 0.952}
    {'step': 300, 'loss': 0.161, 'accuracy': 0.953}
    {'step': 400, 'loss': 0.158, 'accuracy': 0.955}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 2 | 0.154 | 0.956 | 0.13 | 0.964 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:27:20
    {'step': 100, 'loss': 0.117, 'accuracy': 0.967}
    {'step': 200, 'loss': 0.13, 'accuracy': 0.964}
    {'step': 300, 'loss': 0.129, 'accuracy': 0.964}
    {'step': 400, 'loss': 0.138, 'accuracy': 0.963}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 3 | 0.142 | 0.962 | 0.108 | 0.973 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:28:18
    Finished Training...

  4. Evaluate model

    import matplotlib.pyplot as plt

    def plot_metric(dfhistory, metric):
        train_metrics = dfhistory[metric]
        val_metrics = dfhistory['val_'+metric]
        epochs = range(1, len(train_metrics) + 1)
        plt.plot(epochs, train_metrics, 'bo--')
        plt.plot(epochs, val_metrics, 'ro-')
        plt.title('Training and validation '+ metric)
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend(["train_"+metric, 'val_'+metric])
        plt.show()

    plot_metric(dfhistory,"loss")

    Results:

    plot_metric(dfhistory,"accuracy")

    Results:

  5. Predict

    model.predict(dl_valid)[0:10]

    Results:

    tensor([[ -1.5347,  13.6357,  11.0183,  -1.2204,   6.1134,  -8.5289,  -1.1970,
    23.4888, -1.5858, 6.7930],
    [ 5.7372, 6.6389, 21.1408, 8.8913, -8.1413, 1.0067, -0.5886,
    7.1115, 3.5795, -5.1366],
    [-10.7933, 32.6671, 6.6553, -9.0977, 5.0349, 3.5976, 0.2760,
    4.0876, -0.5267, 0.7889],
    [ 16.0825, 2.9832, 1.0242, -1.5304, 3.6435, -0.0766, 5.2385,
    -1.9654, 5.8100, 4.1158],
    [ -5.6585, 6.8265, 5.9029, -16.5844, 20.1805, -4.4995, 6.3734,
    8.1729, -6.3171, 3.0607],
    [-11.3707, 42.0231, 9.0211, -11.6561, 10.3878, 2.4693, 2.1876,
    8.7440, -2.0002, 3.4919],
    [ -3.1595, 3.6438, 3.2809, -9.1543, 11.1751, -2.1898, 3.0125,
    4.4434, -3.1949, 1.9520],
    [ 1.0515, -4.7944, 0.4023, 0.7020, 2.4049, -1.9072, -2.7254,
    2.2177, 1.7448, 6.5435],
    [ 0.6616, -1.2879, -0.6804, -2.7895, 0.9016, 7.1269, 2.9752,
    0.0648, 0.6270, -0.4624],
    [ 1.1252, -10.2878, -0.3109, 1.9929, 6.1019, -2.8175, -6.0583,
    4.7262, 4.1898, 15.1963]])

  6. Save model

    # save the model parameters
    torch.save(model.state_dict(), data_dir + "model_parameter.pkl")

    model_clone = torchkeras.Model(CnnModel())
    model_clone.load_state_dict(torch.load(data_dir + "model_parameter.pkl"))

    model_clone.compile(loss_func=nn.CrossEntropyLoss(),
                        optimizer=torch.optim.Adam(model_clone.parameters(), lr=0.02),
                        metrics_dict={"accuracy": accuracy},
                        device=device)  # Note: the device is specified here at compile time

    model_clone.evaluate(dl_valid)

    Results:

     {'val_loss': 0.10813457745549569, 'val_accuracy': 0.9733979430379747}

4.5 torchkeras.Model Multi-GPU Example

Note: the following example needs to run on a machine with multiple GPUs. It will also run on a single-GPU machine, but then only a single GPU is actually used.
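
Before running, it can help to check how many GPUs the machine actually exposes. Below is a minimal, self-contained sketch (using a toy linear layer, not the model from this example) of how `nn.DataParallel` relates to the visible GPU count:

    import torch
    from torch import nn

    toy = nn.Linear(8, 2)  # a toy module just for illustration

    n_gpus = torch.cuda.device_count()  # number of GPUs visible to this process
    print("visible GPUs:", n_gpus)

    if n_gpus >= 1:
        # nn.DataParallel replicates the module on every visible GPU and
        # splits each input batch along dim 0; with one GPU it still runs,
        # but all of the work lands on that single device.
        toy = nn.DataParallel(toy.cuda())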

  1. Prepare data

    import torch
    from torch import nn

    import torchvision
    from torchvision import transforms

    import torchkeras

    transform = transforms.Compose([transforms.ToTensor()])

    ds_train = torchvision.datasets.MNIST(root= data_dir + "minist/",train=True,download=True,transform=transform)
    ds_valid = torchvision.datasets.MNIST(root=data_dir + "minist/",train=False,download=True,transform=transform)

    dl_train = torch.utils.data.DataLoader(ds_train, batch_size=128, shuffle=True, num_workers=1)
    dl_valid = torch.utils.data.DataLoader(ds_valid, batch_size=128, shuffle=False, num_workers=1)

    print(len(ds_train))
    print(len(ds_valid))

    Results:

     60000
     10000
  2. Define model

    class CnnModule(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3),
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5),
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Dropout2d(p=0.1),
                nn.AdaptiveMaxPool2d((1, 1)),
                nn.Flatten(),
                nn.Linear(64, 32),
                nn.ReLU(),
                nn.Linear(32, 10)
            ])

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    net = nn.DataParallel(CnnModule())  # Note this line: the module is wrapped with nn.DataParallel
    model = torchkeras.Model(net)

    model.summary(input_shape=(1, 32, 32))

    Results:

    ----------------------------------------------------------------
    Layer (type) Output Shape Param #
    ================================================================
    Conv2d-1 [-1, 32, 30, 30] 320
    MaxPool2d-2 [-1, 32, 15, 15] 0
    Conv2d-3 [-1, 64, 11, 11] 51,264
    MaxPool2d-4 [-1, 64, 5, 5] 0
    Dropout2d-5 [-1, 64, 5, 5] 0
    AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
    Flatten-7 [-1, 64] 0
    Linear-8 [-1, 32] 2,080
    ReLU-9 [-1, 32] 0
    Linear-10 [-1, 10] 330
    ================================================================
    Total params: 53,994
    Trainable params: 53,994
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.003906
    Forward/backward pass size (MB): 0.359695
    Params size (MB): 0.205971
    Estimated Total Size (MB): 0.569572
    ----------------------------------------------------------------

  3. Train model

    from sklearn.metrics import accuracy_score

    def accuracy(y_pred, y_true):
        y_pred_cls = torch.argmax(nn.Softmax(dim=1)(y_pred), dim=1).data
        # Note: the tensors must be moved to the cpu before they can be converted to numpy arrays
        return accuracy_score(y_true.cpu().numpy(), y_pred_cls.cpu().numpy())

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.compile(loss_func=nn.CrossEntropyLoss(),
                  optimizer=torch.optim.Adam(model.parameters(), lr=0.02),
                  metrics_dict={"accuracy": accuracy},
                  device=device)  # Note: the device is specified here at compile time

    dfhistory = model.fit(3, dl_train=dl_train, dl_val=dl_valid, log_step_freq=100)

    Results:

    Start Training ...
    ================================================================================2022-02-07 20:36:45
    {'step': 100, 'loss': 0.919, 'accuracy': 0.683}
    {'step': 200, 'loss': 0.57, 'accuracy': 0.808}
    {'step': 300, 'loss': 0.442, 'accuracy': 0.853}
    {'step': 400, 'loss': 0.381, 'accuracy': 0.876}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 1 | 0.352 | 0.886 | 0.166 | 0.951 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:38:01
    {'step': 100, 'loss': 0.157, 'accuracy': 0.956}
    {'step': 200, 'loss': 0.146, 'accuracy': 0.958}
    {'step': 300, 'loss': 0.144, 'accuracy': 0.959}
    {'step': 400, 'loss': 0.146, 'accuracy': 0.958}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 2 | 0.141 | 0.96 | 0.095 | 0.971 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:39:01
    {'step': 100, 'loss': 0.109, 'accuracy': 0.969}
    {'step': 200, 'loss': 0.139, 'accuracy': 0.963}
    {'step': 300, 'loss': 0.147, 'accuracy': 0.962}
    {'step': 400, 'loss': 0.144, 'accuracy': 0.963}

    +-------+-------+----------+----------+--------------+
    | epoch | loss | accuracy | val_loss | val_accuracy |
    +-------+-------+----------+----------+--------------+
    | 3 | 0.149 | 0.962 | 0.136 | 0.97 |
    +-------+-------+----------+----------+--------------+

    ================================================================================2022-02-07 20:40:01
    Finished Training...

  4. Evaluate model

    import matplotlib.pyplot as plt

    def plot_metric(dfhistory, metric):
        train_metrics = dfhistory[metric]
        val_metrics = dfhistory['val_' + metric]
        epochs = range(1, len(train_metrics) + 1)
        plt.plot(epochs, train_metrics, 'bo--')
        plt.plot(epochs, val_metrics, 'ro-')
        plt.title('Training and validation ' + metric)
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend(["train_" + metric, 'val_' + metric])
        plt.show()

    plot_metric(dfhistory, "loss")

    Results:

    plot_metric(dfhistory,"accuracy")

    Results:

  5. Predict

    model.predict(dl_valid)[0:10]

    Results:

    tensor([[ -4.9494,   2.9972,   1.9058,  -3.8375,  -1.3504,  -7.7621, -53.7149,
    29.6964, -10.8678, 11.1160],
    [ 3.9030, 4.9350, 11.2674, 5.4047, -1.6845, 2.4025, 3.5151,
    5.4424, 1.9472, 1.8328],
    [ -6.1424, 50.0179, -1.1996, -29.9945, -16.0382, -4.8336, 6.6327,
    -1.6870, -14.6170, -7.5207],
    [ 19.0990, -1.2968, -1.3169, -3.4305, -0.0882, -3.0442, -1.3612,
    -8.9234, 0.4975, 2.5006],
    [-14.6392, 11.3554, 1.8696, -9.9511, 24.8780, 1.3471, 14.9980,
    10.8629, 2.3685, 7.3764],
    [-10.8968, 62.1849, -1.2042, -35.0280, -17.4673, -5.7874, 8.7916,
    1.5346, -17.9422, -9.1584],
    [ -8.9447, 6.1596, 1.5767, -5.9862, 14.6574, 1.3512, 8.4579,
    6.3231, 1.6134, 4.7799],
    [ -0.5005, -7.3161, 0.0935, 0.7475, 1.5700, 1.9768, -3.4311,
    -0.7360, 0.9882, 6.2118],
    [ 0.3860, 0.2817, 0.3428, -0.1901, -0.2304, 0.3337, -0.5194,
    -0.0692, 0.5595, 0.1426],
    [ -1.6844, -17.9149, -0.2369, 2.0809, 4.0330, 4.4971, -7.4476,
    -1.6516, 1.5586, 14.7376]])

  6. Save model

    # save the model parameters
    # note the .module attribute: it unwraps the net from nn.DataParallel
    torch.save(model.net.module.state_dict(), data_dir + "model_parameter.pkl")

    net_clone = CnnModule()
    net_clone.load_state_dict(torch.load(data_dir + "model_parameter.pkl"))

    model_clone = torchkeras.Model(net_clone)
    model_clone.compile(loss_func=nn.CrossEntropyLoss(),
                        optimizer=torch.optim.Adam(model_clone.parameters(), lr=0.02),
                        metrics_dict={"accuracy": accuracy},
                        device=device)
    model_clone.evaluate(dl_valid)

    Results:

     {'val_loss': 0.13572600440570165, 'val_accuracy': 0.9695411392405063}
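
    The `.module` indirection in the save step above matters because `nn.DataParallel` registers the wrapped net as a submodule named `module`, so every parameter key gets a `module.` prefix. A minimal sketch of the effect (with a toy layer, not the model above):

    import torch
    from torch import nn

    net = nn.Linear(4, 2)
    wrapped = nn.DataParallel(net)

    print(list(net.state_dict().keys()))      # ['weight', 'bias']
    print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias']

    # Saving wrapped.module.state_dict() therefore produces keys that load
    # directly into an unwrapped copy of the network.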

4.6 torchkeras.LightModel GPU/TPU Example

Using torchkeras.LightModel, it is very easy to switch training from the cpu to a single gpu, multiple gpus, or even multiple tpus.
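
For reference, a minimal sketch of those hardware switches (the same flags are repeated as comments in the training step below; the API is that of the pytorch_lightning 1.x release used in this post):

    import pytorch_lightning as pl

    trainer = pl.Trainer(max_epochs=10, gpus=0)        # cpu
    # trainer = pl.Trainer(max_epochs=10, gpus=1)      # one gpu
    # trainer = pl.Trainer(max_epochs=10, gpus=[0, 1]) # the two given gpus
    # trainer = pl.Trainer(max_epochs=10, gpus=-1)     # all visible gpus
    # trainer = pl.Trainer(max_epochs=10, tpu_cores=8) # eight tpu cores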

  1. Prepare data

    import torch
    from torch import nn

    import torchvision
    from torchvision import transforms

    import torchkeras

    transform = transforms.Compose([transforms.ToTensor()])

    ds_train = torchvision.datasets.MNIST(root=data_dir + "minist/",train=True,download=True,transform=transform)
    ds_valid = torchvision.datasets.MNIST(root=data_dir + "minist/",train=False,download=True,transform=transform)

    dl_train = torch.utils.data.DataLoader(ds_train, batch_size=128, shuffle=True, num_workers=1)
    dl_valid = torch.utils.data.DataLoader(ds_valid, batch_size=128, shuffle=False, num_workers=1)

    print(len(ds_train))
    print(len(ds_valid))

    Results:

     60000
     10000
  2. Define model

    import torchkeras
    import pytorch_lightning as pl
    import torchmetrics as metrics

    class CnnNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3),
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5),
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Dropout2d(p=0.1),
                nn.AdaptiveMaxPool2d((1, 1)),
                nn.Flatten(),
                nn.Linear(64, 32),
                nn.ReLU(),
                nn.Linear(32, 10)
            ])

        def forward(self, x):
            for layer in self.layers:
                x = layer(x)
            return x

    class Model(torchkeras.LightModel):

        # loss, and optional metrics
        def shared_step(self, batch) -> dict:
            x, y = batch
            prediction = self(x)
            loss = nn.CrossEntropyLoss()(prediction, y)
            preds = torch.argmax(nn.Softmax(dim=1)(prediction), dim=1).data
            acc = metrics.functional.accuracy(preds, y)
            dic = {"loss": loss, "acc": acc}
            return dic

        # optimizer, and optional lr_scheduler
        def configure_optimizers(self):
            optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
            lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.0001)
            return {"optimizer": optimizer, "lr_scheduler": lr_scheduler}

    pl.seed_everything(1234)
    net = CnnNet()
    model = Model(net)

    torchkeras.summary(model, input_shape=(1, 32, 32))
    print(model)

    Results:

    Global seed set to 1234
    ----------------------------------------------------------------
    Layer (type) Output Shape Param #
    ================================================================
    Conv2d-1 [-1, 32, 30, 30] 320
    MaxPool2d-2 [-1, 32, 15, 15] 0
    Conv2d-3 [-1, 64, 11, 11] 51,264
    MaxPool2d-4 [-1, 64, 5, 5] 0
    Dropout2d-5 [-1, 64, 5, 5] 0
    AdaptiveMaxPool2d-6 [-1, 64, 1, 1] 0
    Flatten-7 [-1, 64] 0
    Linear-8 [-1, 32] 2,080
    ReLU-9 [-1, 32] 0
    Linear-10 [-1, 10] 330
    ================================================================
    Total params: 53,994
    Trainable params: 53,994
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.003906
    Forward/backward pass size (MB): 0.359695
    Params size (MB): 0.205971
    Estimated Total Size (MB): 0.569572
    ----------------------------------------------------------------
    Model(
      (net): CnnNet(
        (layers): ModuleList(
          (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
          (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
          (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (4): Dropout2d(p=0.1, inplace=False)
          (5): AdaptiveMaxPool2d(output_size=(1, 1))
          (6): Flatten(start_dim=1, end_dim=-1)
          (7): Linear(in_features=64, out_features=32, bias=True)
          (8): ReLU()
          (9): Linear(in_features=32, out_features=10, bias=True)
        )
      )
    )

  3. Train model

    ckpt_cb = pl.callbacks.ModelCheckpoint(monitor='val_loss')

    # set gpus=0 to use the cpu
    # set gpus=1 to use 1 gpu
    # set gpus=2 to use 2 gpus
    # set gpus=-1 to use all gpus
    # you can also set gpus=[0,1] to use the given gpus
    # you can even set tpu_cores=2 to use two tpus

    trainer = pl.Trainer(max_epochs=10, gpus=0, callbacks=[ckpt_cb])

    trainer.fit(model, dl_train, dl_valid)

    Results:

    GPU available: False, used: False
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs

    | Name | Type | Params
    --------------------------------
    0 | net | CnnNet | 54.0 K
    --------------------------------
    54.0 K Trainable params
    0 Non-trainable params
    54.0 K Total params
    0.216 Total estimated model params size (MB)

    Validation sanity check: 0it [00:00, ?it/s]
    Global seed set to 1234

    ================================================================================2022-02-07 20:43:14
    epoch = 0
    {'val_loss': 2.3015918731689453, 'val_acc': 0.125}

    ...

    Validating: 0it [00:00, ?it/s]

    ================================================================================2022-02-07 20:57:32
    epoch = 9
    {'val_loss': 0.06869181245565414, 'val_acc': 0.9824960231781006}
    {'loss': 0.060179587453603745, 'acc': 0.9833089113235474}

    Note: the number of GPUs to use can be adjusted via the gpus argument; since no GPU is configured on this machine, it is set to zero here.

  4. Evaluate model

    import pandas as pd

    history = model.history
    dfhistory = pd.DataFrame(history)
    dfhistory

    Results:

       val_loss   val_acc      loss       acc  epoch
    0  0.110828  0.964498  0.312312  0.898893      0
    1  0.082492  0.975870  0.115197  0.964164      1
    2  0.069638  0.979134  0.094972  0.971149      2
    3  0.063109  0.981804  0.087392  0.973736      3
    4  0.085854  0.975079  0.081205  0.975974      4
    5  0.070785  0.981013  0.078273  0.977257      5
    6  0.063673  0.983089  0.077390  0.977867      6
    7  0.074923  0.980914  0.068249  0.980172      7
    8  0.071038  0.982298  0.066152  0.981304      8
    9  0.068692  0.982496  0.060180  0.983309      9

  5. Visualization

    import matplotlib.pyplot as plt

    def plot_metric(dfhistory, metric):
        train_metrics = dfhistory[metric]
        val_metrics = dfhistory['val_' + metric]
        epochs = range(1, len(train_metrics) + 1)
        plt.plot(epochs, train_metrics, 'bo--')
        plt.plot(epochs, val_metrics, 'ro-')
        plt.title('Training and validation ' + metric)
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend(["train_" + metric, 'val_' + metric])
        plt.show()

    plot_metric(dfhistory, "loss")

    Results:

    plot_metric(dfhistory,"acc")

    Results:

  6. Test

    results = trainer.test(model, test_dataloaders=dl_valid, verbose=False)
    print(results[0])

    Results:

    Testing: 0it [00:00, ?it/s]
    {'test_loss': 0.06942175328731537, 'test_acc': 0.9822999835014343}

  7. Predict

    def predict(model, dl):
        model.eval()
        preds = torch.cat([model.forward(t[0].to(model.device)) for t in dl])
        result = torch.argmax(nn.Softmax(dim=1)(preds), dim=1).data
        return result

    result = predict(model, dl_valid)
    result

    Results:

     tensor([7, 2, 1,  ..., 4, 5, 6])
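
    Since Softmax is strictly monotonic, taking torch.argmax directly over the raw logits would yield exactly the same predicted classes; the Softmax call here just mirrors the metric computation used during training.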
  8. Save and load

    print(ckpt_cb.best_model_score)
    # load_from_checkpoint returns a restored model rather than modifying
    # the instance in place, so the result must be assigned
    model = model.load_from_checkpoint(ckpt_cb.best_model_path)

    best_net = model.net
    torch.save(best_net.state_dict(), data_dir + "net.pt")

    Results:

     tensor(0.0638)
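
    The printed tensor is the best val_loss recorded by the ModelCheckpoint callback, which was configured above with monitor='val_loss'.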

    net_clone = CnnNet()
    net_clone.load_state_dict(torch.load(data_dir + "net.pt"))
    model_clone = Model(net_clone)
    trainer = pl.Trainer()
    result = trainer.test(model_clone, test_dataloaders=dl_valid, verbose=False)

    print(result)

    Results:

    GPU available: False, used: False
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs

    Testing: 0it [00:00, ?it/s]
    [{'test_loss': 0.06942175328731537, 'test_acc': 0.9822999835014343}]

-------------This blog is over! Thanks for reading!-------------