What is PyTorch?

  • A machine learning framework in Python.
  • Two main features:
    • N-dimensional tensor (matrix) computation (like NumPy) on GPUs
    • Automatic differentiation for training deep neural networks

Tensor

High-dimensional matrices (arrays)

Check its shape with .shape

Create tensor:

  • Directly from data:
x = torch.tensor([[1, -1], [-1, 1]])
x = torch.from_numpy(np.array([[1, -1], [-1, 1]])) # requires import numpy as np
"""
tensor([[ 1, -1],
        [-1,  1]])
"""
  • Tensor of constant zeros & ones:
x = torch.zeros([2, 2])
"""
tensor([[0., 0.], 
        [0., 0.]])
"""
x = torch.ones([1, 2, 5])
"""
tensor([
        [[1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1.]]
        ])
"""

Tensor Operations:

  1. Basic arithmetic

    • Addition: z = x + y
    • Subtraction: z = x - y
    • Summation: y = x.sum()
    • Mean: y = x.mean()
    • Power: y = x.pow(2)
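A quick sketch of these operations on small example tensors (`x` and `y` are illustrative):

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]])
y = torch.tensor([[1., 1.], [1., 1.]])

z = x + y     # element-wise addition
s = x.sum()   # scalar: 1 + 2 + 3 + 4 = 10
m = x.mean()  # scalar: 10 / 4 = 2.5
p = x.pow(2)  # element-wise square
```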
  2. Transpose: swap two specified dimensions

x = torch.zeros([2, 3])
x.shape # torch.Size([2, 3])
x = x.transpose(0, 1)
x.shape # torch.Size([3, 2])
  3. Squeeze: remove a specified dimension of length == 1
x = torch.zeros([1, 2, 3])
x.shape # torch.Size([1, 2, 3])
x = x.squeeze(0)
x.shape # torch.Size([2, 3])
  4. Unsqueeze: insert a new dimension of length == 1
x = torch.zeros([2, 3])
x.shape # torch.Size([2, 3])
x = x.unsqueeze(1) # Insert a dimension with length == 1 at the index
x.shape # torch.Size([2, 1, 3])
  5. Cat: concatenate multiple tensors along an existing dimension
x = torch.zeros([2, 1, 3])
y = torch.zeros([2, 3, 3])
z = torch.zeros([2, 2, 3])
w = torch.cat([x, y, z], dim=1)
w.shape # torch.Size([2, 6, 3])

  6. Stack: stack multiple tensors of the same size along a new dimension
x = torch.zeros([2, 3])
y = torch.zeros([2, 3])
z = torch.stack([x, y], dim=0)
z.shape # torch.Size([2, 2, 3])
  7. Permute: rearrange the dimensions
x = torch.zeros([3, 5, 7])
x = x.permute(1, 2, 0)
x.shape # torch.Size([5, 7, 3])
  8. bmm: batch matrix multiplication of two 3-D tensors
x = torch.zeros([10, 8, 6]) # (B, M, N)
y = torch.zeros([10, 6, 9]) # (B, N, K)
# Must have same B, and N1 == N2
z = torch.bmm(x, y) # (B, M, K)
z.shape # torch.Size([10, 8, 9])

Data Type

| Data type | dtype | Notes |
| --- | --- | --- |
| 32-bit floating point | torch.float32 | Default float type; common for training |
| 64-bit floating point | torch.float64 | High-precision numerical computation |
| 16-bit floating point | torch.float16 | Mixed precision / GPU |
| bfloat16 floating point | torch.bfloat16 | No legacy tensor class |
| 8-bit integer (unsigned) | torch.uint8 | Common for image pixels |
| 8-bit integer (signed) | torch.int8 | Quantization |
| 16-bit integer (signed) | torch.int16 | Rarely used |
| 32-bit integer (signed) | torch.int32 | CUDA indexing |
| 64-bit integer (signed) | torch.int64 | Indices / labels |
| Boolean | torch.bool | Masks |
| Complex (64-bit) | torch.complex64 | FFT |
| Complex (128-bit) | torch.complex128 | FFT |
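As a small sketch, the dtype a tensor gets by default depends on how it is created:

```python
import torch

a = torch.tensor([1, 2, 3])   # integer literals  -> torch.int64
b = torch.tensor([1.0, 2.0])  # float literals    -> torch.float32
c = torch.zeros(2, 2)         # factory functions -> torch.float32
print(a.dtype, b.dtype, c.dtype)
```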

Casting:

  • .to()
import torch
t = torch.randn(3, 4)
t_double = t.to(torch.float64)
t_int = t.to(torch.int32)
  • “Shortcut” methods

| Target type | Shortcut method | Corresponding PyTorch dtype |
| --- | --- | --- |
| Float | t.float() | torch.float32 |
| Double | t.double() | torch.float64 |
| Half | t.half() | torch.float16 |
| Int | t.int() | torch.int32 |
| Long | t.long() | torch.int64 |
| Byte | t.byte() | torch.uint8 |
| Bool | t.bool() | torch.bool |
x = torch.ones(2, 2) # torch.float32
x = x.long() # torch.int64

Device

Tensors & modules are computed on the CPU by default.
Use .to() to move tensors to the appropriate device.

x = x.to("cpu")
x = x.to("cuda")
  • GPU:
    • Check whether your computer has an NVIDIA GPU with torch.cuda.is_available()
    • Multiple GPUs: specify "cuda:0", "cuda:1", "cuda:2", …
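A common pattern (sketched here) is to pick the device once, falling back to the CPU when no GPU is available, and reuse it everywhere:

```python
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.ones(2, 2).to(device)
print(x.device)
```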

Gradient Calculation

x = torch.tensor([[1., 0.], [-1., 1.]], requires_grad=True)
z = x.pow(2).sum()
z.backward()
x.grad # tensor([[ 2., 0.], [-2., 2.]])

Let

$z = \sum_{i,j} x_{ij}^2$

Define

$y_{ij} = x_{ij}^2$, so that $z = \sum_{i,j} y_{ij}$

Gradient of a single element
For any element $x_{ij}$:

$\dfrac{\partial z}{\partial x_{ij}} = 2x_{ij}$

Gradient in matrix form
Therefore, the gradient with respect to $x$ is

$\dfrac{\partial z}{\partial x} = 2x$

Numerical substitution
Given

$x = \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}$

Then

$\dfrac{\partial z}{\partial x} = 2x = \begin{bmatrix} 2 & 0 \\ -2 & 2 \end{bmatrix}$

  • Backpropagation from the Autograd perspective

Computational graph

$x \xrightarrow{(\cdot)^2} y \xrightarrow{\text{sum}} z$

Chain rule

$\dfrac{\partial z}{\partial x_{ij}} = \dfrac{\partial z}{\partial y_{ij}} \cdot \dfrac{\partial y_{ij}}{\partial x_{ij}}$

Where:

$\dfrac{\partial z}{\partial y_{ij}} = 1$ and $\dfrac{\partial y_{ij}}{\partial x_{ij}} = 2x_{ij}$

Thus:

$\dfrac{\partial z}{\partial x_{ij}} = 2x_{ij}$

Leaf Tensor and Autograd

Definition

A leaf tensor is a tensor that

  1. is created directly by the user,
  2. has requires_grad=True, and
  3. is not the result of another operation.
    Non-leaf tensors, in contrast, are intermediate results.
  • Only leaf tensors have their .grad populated automatically after backward().
  • A non-leaf tensor, even though it participates in backpropagation, does not keep its .grad unless you explicitly request it with .retain_grad().
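A minimal sketch of the leaf / non-leaf distinction:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # leaf: created directly by the user
y = x * 2                                 # non-leaf: result of an operation
y.retain_grad()                           # explicitly ask to keep y.grad
z = y.sum()
z.backward()

print(x.is_leaf, y.is_leaf)  # True False
print(x.grad)                # populated automatically: dz/dx = 2
print(y.grad)                # kept only because of retain_grad(): dz/dy = 1
```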

Notices

  1. Gradients accumulate
z.backward(retain_graph=True) # x.grad == 2x (retain_graph is needed to call backward twice)
z.backward() # x.grad == 4x: gradients from both calls add up

So, always clear gradients: optimizer.zero_grad()
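A runnable sketch of this accumulation behavior (note that calling backward() twice on the same graph requires retain_graph=True on the first call):

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
z = x.pow(2).sum()
z.backward(retain_graph=True)      # x.grad == 2x == [2., 4.]
z.backward()                       # gradients accumulate: [4., 8.]
accumulated = x.grad.clone()

x.grad.zero_()                     # clear manually; optimizer.zero_grad() does this for you
```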

  2. Only floating-point / complex tensors can require gradients
    torch.tensor([1, 2, 3], requires_grad=True) # invalid: raises a RuntimeError for integer dtypes

  3. .grad is not part of the computation graph

    • .grad is a buffer
    • It does not track gradients itself

The Training Procedures

Load Data

torch.utils.data.Dataset
torch.utils.data.DataLoader
  • Dataset: Stores data samples and expected values
  • DataLoader: groups data in batches, enables multiprocessing

Dataset is an abstract class: you should subclass it and override its methods.

E.g.

dataset = MyDataset(file)
dataloader = DataLoader(dataset, batch_size, shuffle=True)
# shuffle: True for training and False for testing

How to override?

from torch.utils.data import Dataset, DataLoader
 
class MyDataset(Dataset): 
    def __init__(self, file):  # Read data & preprocess
        self.data = ... 
    
    def __getitem__(self, index): # Returns one sample at a time
        return self.data[index] 
        
    def __len__(self): # Returns the size of the dataset
        return len(self.data) 
 
 
dataset = MyDataset(file) 
dataloader = DataLoader(dataset, batch_size=5, shuffle=False)
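As a runnable sketch (`ToyDataset` is a hypothetical stand-in that uses an in-memory tensor instead of a file), iterating the DataLoader yields batches:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):  # hypothetical stand-in for MyDataset
    def __init__(self):
        self.data = torch.arange(10.)  # 10 toy samples

    def __getitem__(self, index):      # one sample at a time
        return self.data[index]

    def __len__(self):                 # dataset size
        return len(self.data)

dataset = ToyDataset()
dataloader = DataLoader(dataset, batch_size=5, shuffle=False)
batches = [batch for batch in dataloader]
print(len(batches))   # 2 batches of 5 samples each
print(batches[0])     # tensor([0., 1., 2., 3., 4.])
```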

Define Neural Network

Use torch.nn.Module (The super class of all models)

  • model.parameters(): Return all parameters
  • model.children(), model.modules(): Manage its child modules
  • torch.save(model.state_dict(), 'model.pth'), model.load_state_dict(torch.load('model.pth')): Save and load parameters.
  • model.train(), model.eval(): Change modes.

E.g.

import torch
import torch.nn as nn
import torch.nn.functional as F
 
class MyNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
 
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
 
model = MyNet(10, 20, 1)
x = torch.randn(5, 10)  # batch_size=5, input_size=10
y = model(x)
print(y.shape)  # torch.Size([5, 1])
  • torch.nn – Network Layers
    • nn.Linear(in_features, out_features)
      • layer = torch.nn.Linear(32, 64)
        layer.weight.shape # torch.Size([64, 32])
        layer.bias.shape # torch.Size([64])
  • torch.nn – Non-Linear Activation Functions
    • nn.Sigmoid()
    • nn.ReLU()
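As a sketch, nn.Linear computes y = x Wᵀ + b, which is why weight has shape (out_features, in_features):

```python
import torch
import torch.nn as nn

layer = nn.Linear(32, 64)
x = torch.randn(5, 32)   # batch of 5 inputs

y = layer(x)             # shape (5, 64)

# Equivalent manual computation: y = x @ W^T + b
y_manual = x @ layer.weight.T + layer.bias
print(torch.allclose(y, y_manual, atol=1e-5))
```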

More examples:

import torch.nn as nn
 
class MyModel(nn.Module): # Inherit from nn.Module
    def __init__(self): # Initialize your model & define layers
        super(MyModel, self).__init__()
        self.net = nn.Sequential(
                    nn.Linear(10, 32), 
                    nn.Sigmoid(), 
                    nn.Linear(32, 1)
                ) 
    
    def forward(self, x): # Compute output of your NN
        return self.net(x)

Equivalent to:

import torch.nn as nn
 
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer1 = nn.Linear(10, 32)
        self.layer2 = nn.Sigmoid()
        self.layer3 = nn.Linear(32,1)
    
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        return out

Training and Testing

  • torch.nn – Loss Functions
    • Mean Squared Error (for regression tasks): criterion = nn.MSELoss()
    • Cross Entropy (for classification tasks): criterion = nn.CrossEntropyLoss()
    • loss = criterion(model_output, expected_value)
  • torch.optim – Optimization Algorithms
    • For every batch of data:
      1. Call optimizer.zero_grad() to reset gradient of model parameters.
      2. Call loss.backward() to backpropagate gradients of prediction loss.
      3. Call optimizer.step() to adjust model parameters
    • SGD: torch.optim.SGD(model.parameters(), lr, momentum=0)
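The three steps above can be sketched as a single optimization step on a toy regression problem (the model, data, and lr are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 3)  # toy batch
y = torch.randn(4, 1)  # toy targets

optimizer.zero_grad()            # 1. reset gradients of model parameters
loss = criterion(model(x), y)
loss.backward()                  # 2. backpropagate the prediction loss
optimizer.step()                 # 3. adjust model parameters
```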

Entire Procedure

"""
Setup
"""
train_dataset = MyDataset(file1)
validation_dataset = MyDataset(file2)
test_dataset = MyDataset(file3) # read data via MyDataset
tr_set = DataLoader(train_dataset, 16, shuffle=True)
dv_set = DataLoader(validation_dataset, 16, shuffle=False)
tt_set = DataLoader(test_dataset, 16, shuffle=False)
# put dataset into Dataloader
model = MyModel().to(device) # construct model and move to device (cpu/cuda)
criterion = nn.MSELoss() # set loss function
optimizer = torch.optim.SGD(model.parameters(), 0.1) # set optimizer
"""
Training Loops
"""
for epoch in range(n_epochs): # iterate n_epochs
    model.train() # set model to train mode
    for x, y in tr_set: # iterate through the dataloader
        optimizer.zero_grad() # set gradient to zero
        x, y = x.to(device), y.to(device) # move data to device (cpu/cuda)
        pred = model(x) # forward pass (compute output)
        loss = criterion(pred, y) # compute loss
        loss.backward() # compute gradient (backpropagation)
        optimizer.step() # update model with optimizer
"""
Validation Loops
"""
model.eval() # set model to evaluation mode
total_loss = 0 
for x, y in dv_set: # iterate through the dataloader
    x, y = x.to(device), y.to(device) 
    with torch.no_grad(): 
        pred = model(x) 
        loss = criterion(pred, y) 
    total_loss += loss.cpu().item() * len(x) # accumulate loss, weighted by batch size
avg_loss = total_loss / len(dv_set.dataset) # average over the whole set, after the loop
"""
Testing Loops
"""
model.eval() # set model to evaluation mode
preds = [] 
for x in tt_set: # iterate through the dataloader
    x = x.to(device) # move data to device (cpu/cuda)
    with torch.no_grad(): # disable gradient calculation
        pred = model(x) # forward pass (compute output)
        preds.append(pred.cpu()) # collect prediction
  • model.eval(): Changes the behaviour of some model layers, such as dropout and batch normalization.
  • with torch.no_grad(): Disables the autograd engine, preventing calculations from being added to the gradient computation graph. Usually used to prevent accidental training on validation/testing data.
"""
Saving and Loading
"""
torch.save(model.state_dict(), path) # Saving
 
ckpt = torch.load(path)
model.load_state_dict(ckpt) # Loading

Tools

Torchvision

Models

RNN

Optimizers

LambdaLR

Optimizer Wrappers

Customized Loss Function