What is PyTorch?

  • A machine learning framework in Python.
  • Two main features:
    • N-dimensional tensor (matrix) computation (like NumPy) on GPUs
    • Automatic differentiation for training deep neural networks

Tensor

High-dimensional matrices (arrays)

Check its shape with .shape

Create tensor:

  • Directly from data:
x = torch.tensor([[1, -1], [-1, 1]])
x = torch.from_numpy(np.array([[1, -1], [-1, 1]])) # requires import numpy as np
"""
tensor([[ 1, -1],
        [-1,  1]])
"""
  • Tensor of constant zeros & ones:
x = torch.zeros([2, 2])
"""
tensor([[0., 0.], 
        [0., 0.]])
"""
x = torch.ones([1, 2, 5])
"""
tensor([
        [[1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1.]]
        ])
"""

Tensor Operations:

  1. Basic arithmetic

    • Addition: z = x + y
    • Subtraction: z = x - y
    • Summation: y = x.sum()
    • Mean: y = x.mean()
    • Power: y = x.pow(2)
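A quick sketch of these operations on small example tensors (`x` and `y` are illustrative):

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]])
y = torch.tensor([[1., 1.], [1., 1.]])

z = x + y     # element-wise addition
s = x.sum()   # scalar: 1 + 2 + 3 + 4 = 10
m = x.mean()  # scalar: 10 / 4 = 2.5
p = x.pow(2)  # element-wise square
```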
  2. Transpose: swap two specified dimensions

x = torch.zeros([2, 3])
x.shape # torch.Size([2, 3])
x = x.transpose(0, 1)
x.shape # torch.Size([3, 2])
  3. Squeeze: remove a specified dimension of length == 1
x = torch.zeros([1, 2, 3])
x.shape # torch.Size([1, 2, 3])
x = x.squeeze(0)
x.shape # torch.Size([2, 3])
  4. Unsqueeze: insert a new dimension of length == 1
x = torch.zeros([2, 3])
x.shape # torch.Size([2, 3])
x = x.unsqueeze(1) # Insert a dimension with length == 1 at the index
x.shape # torch.Size([2, 1, 3])
  5. Cat: concatenate multiple tensors along an existing dimension
x = torch.zeros([2, 1, 3])
y = torch.zeros([2, 3, 3])
z = torch.zeros([2, 2, 3])
w = torch.cat([x, y, z], dim=1)
w.shape # torch.Size([2, 6, 3])

  6. Stack: stack multiple tensors of the same size along a new dimension
x = torch.zeros([2, 3])
y = torch.zeros([2, 3])
z = torch.stack([x, y], dim=0)
z.shape # torch.Size([2, 2, 3])
  7. Permute: rearrange the dimensions
x = torch.zeros([3, 5, 7])
x = x.permute(1, 2, 0)
x.shape # torch.Size([5, 7, 3])
  8. bmm: batch matrix multiplication of two 3-D tensors
x = torch.zeros([10, 8, 6]) # (B, M, N)
y = torch.zeros([10, 6, 9]) # (B, N, K)
# Must have same B, and N1 == N2
z = torch.bmm(x, y) # (B, M, K)
z.shape # torch.Size([10, 8, 9])

Data Type

| Data type | dtype | Notes |
| --- | --- | --- |
| 32-bit floating point | torch.float32 | Default float type; common for training |
| 64-bit floating point | torch.float64 | High-precision numerical computation |
| 16-bit floating point | torch.float16 | Mixed precision / GPU |
| bfloat16 floating point | torch.bfloat16 | No legacy tensor class |
| 8-bit integer (unsigned) | torch.uint8 | Common for image pixels |
| 8-bit integer (signed) | torch.int8 | Quantization |
| 16-bit integer (signed) | torch.int16 | Rarely used |
| 32-bit integer (signed) | torch.int32 | CUDA indexing |
| 64-bit integer (signed) | torch.int64 | Indices / labels |
| Boolean | torch.bool | Masks |
| Complex (64-bit) | torch.complex64 | FFT |
| Complex (128-bit) | torch.complex128 | FFT |
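As a small sketch, the dtype a tensor gets by default depends on how it is created:

```python
import torch

a = torch.tensor([1, 2, 3])   # integer literals  -> torch.int64
b = torch.tensor([1.0, 2.0])  # float literals    -> torch.float32
c = torch.zeros(2, 2)         # factory functions -> torch.float32
print(a.dtype, b.dtype, c.dtype)
```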

Casting:

  • .to()
import torch
t = torch.randn(3, 4)
t_double = t.to(torch.float64)
t_int = t.to(torch.int32)
  • “Shortcut” methods

| Target type | Shortcut method | Corresponding PyTorch dtype |
| --- | --- | --- |
| Float | t.float() | torch.float32 |
| Double | t.double() | torch.float64 |
| Half | t.half() | torch.float16 |
| Int | t.int() | torch.int32 |
| Long | t.long() | torch.int64 |
| Byte | t.byte() | torch.uint8 |
| Bool | t.bool() | torch.bool |
x = torch.ones(2, 2) # torch.float32
x = x.long() # torch.int64

Device

Tensors & modules are computed on the CPU by default.
Use .to() to move tensors to the appropriate device.

x = x.to("cpu")
x = x.to("cuda")
  • GPU:
    • Check whether your computer has an NVIDIA GPU with torch.cuda.is_available()
    • Multiple GPUs: specify "cuda:0", "cuda:1", "cuda:2", …
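A common pattern (sketched here) is to pick the device once, falling back to the CPU when no GPU is available, and reuse it everywhere:

```python
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.ones(2, 2).to(device)
print(x.device)
```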

Gradient Calculation

x = torch.tensor([[1., 0.], [-1., 1.]], requires_grad=True)
z = x.pow(2).sum()
z.backward()
x.grad # tensor([[ 2., 0.], [-2., 2.]])

Let

$z = \sum_{i,j} x_{ij}^2$

Define

$y_{ij} = x_{ij}^2$, so that $z = \sum_{i,j} y_{ij}$

Gradient of a single element
For any element $x_{ij}$:

$\dfrac{\partial z}{\partial x_{ij}} = 2x_{ij}$

Gradient in matrix form
Therefore, the gradient with respect to $x$ is

$\dfrac{\partial z}{\partial x} = 2x$

Numerical substitution
Given

$x = \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}$

Then

$\dfrac{\partial z}{\partial x} = 2x = \begin{bmatrix} 2 & 0 \\ -2 & 2 \end{bmatrix}$

  • Backpropagation from the Autograd perspective

Computational graph

$x \xrightarrow{(\cdot)^2} y \xrightarrow{\text{sum}} z$

Chain rule

$\dfrac{\partial z}{\partial x_{ij}} = \dfrac{\partial z}{\partial y_{ij}} \cdot \dfrac{\partial y_{ij}}{\partial x_{ij}}$

Where:

$\dfrac{\partial z}{\partial y_{ij}} = 1$ and $\dfrac{\partial y_{ij}}{\partial x_{ij}} = 2x_{ij}$

Thus:

$\dfrac{\partial z}{\partial x_{ij}} = 2x_{ij}$

Leaf Tensor and Autograd

Definition

A leaf tensor is a tensor that

  1. is created directly by the user,
  2. has requires_grad=True, and
  3. is not the result of another operation.
    Non-leaf tensors, in contrast, are intermediate results.
  • Only leaf tensors have their .grad populated automatically after backward().
  • A non-leaf tensor, even though it participates in backpropagation, does not keep its .grad unless you explicitly request it with .retain_grad().
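A minimal sketch of the leaf / non-leaf distinction:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # leaf: created directly by the user
y = x * 2                                 # non-leaf: result of an operation
y.retain_grad()                           # explicitly ask to keep y.grad
z = y.sum()
z.backward()

print(x.is_leaf, y.is_leaf)  # True False
print(x.grad)                # populated automatically: dz/dx = 2
print(y.grad)                # kept only because of retain_grad(): dz/dy = 1
```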

Notices

  1. Gradients accumulate
z.backward(retain_graph=True) # x.grad == 2x (retain_graph is needed to call backward twice)
z.backward() # x.grad == 4x: gradients from both calls add up

So, always clear gradients: optimizer.zero_grad()
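A runnable sketch of this accumulation behavior (note that calling backward() twice on the same graph requires retain_graph=True on the first call):

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
z = x.pow(2).sum()
z.backward(retain_graph=True)      # x.grad == 2x == [2., 4.]
z.backward()                       # gradients accumulate: [4., 8.]
accumulated = x.grad.clone()

x.grad.zero_()                     # clear manually; optimizer.zero_grad() does this for you
```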

  2. Only floating-point / complex tensors can require gradients
    torch.tensor([1, 2, 3], requires_grad=True) # invalid: raises a RuntimeError for integer dtypes

  3. .grad is not part of the computation graph

    • .grad is a buffer
    • It does not track gradients itself

The Training Procedures

Load Data

torch.utils.data.Dataset
torch.utils.data.DataLoader
  • Dataset: Stores data samples and expected values
  • DataLoader: groups data in batches, enables multiprocessing

Dataset is an abstract class: you should subclass it and override its methods.

E.g.

dataset = MyDataset(file)
dataloader = DataLoader(dataset, batch_size, shuffle=True)
# shuffle: True for training and False for testing

How to override?

from torch.utils.data import Dataset, DataLoader
 
class MyDataset(Dataset): 
    def __init__(self, file):  # Read data & preprocess
        self.data = ... 
    
    def __getitem__(self, index): # Returns one sample at a time
        return self.data[index] 
        
    def __len__(self): # Returns the size of the dataset
        return len(self.data) 
 
 
dataset = MyDataset(file) 
dataloader = DataLoader(dataset, batch_size=5, shuffle=False)
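As a runnable sketch (`ToyDataset` is a hypothetical stand-in that uses an in-memory tensor instead of a file), iterating the DataLoader yields batches:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):  # hypothetical stand-in for MyDataset
    def __init__(self):
        self.data = torch.arange(10.)  # 10 toy samples

    def __getitem__(self, index):      # one sample at a time
        return self.data[index]

    def __len__(self):                 # dataset size
        return len(self.data)

dataset = ToyDataset()
dataloader = DataLoader(dataset, batch_size=5, shuffle=False)
batches = [batch for batch in dataloader]
print(len(batches))   # 2 batches of 5 samples each
print(batches[0])     # tensor([0., 1., 2., 3., 4.])
```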

Define Neural Network

Use torch.nn.Module (The super class of all models)

  • model.parameters(): Return all parameters
  • model.children(), model.modules(): Manage its child modules
  • torch.save(model.state_dict(), 'model.pth'), model.load_state_dict(torch.load('model.pth')): Save and load parameters.
  • model.train(), model.eval(): Change modes.

E.g.

import torch
import torch.nn as nn
import torch.nn.functional as F
 
class MyNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
 
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
 
model = MyNet(10, 20, 1)
x = torch.randn(5, 10)  # batch_size=5, input_size=10
y = model(x)
print(y.shape)  # torch.Size([5, 1])
  • torch.nn – Network Layers
    • nn.Linear(in_features, out_features)
      • layer = torch.nn.Linear(32, 64)
        layer.weight.shape # torch.Size([64, 32])
        layer.bias.shape # torch.Size([64])
  • torch.nn – Non-Linear Activation Functions
    • nn.Sigmoid()
    • nn.ReLU()
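As a sketch, nn.Linear computes y = x Wᵀ + b, which is why weight has shape (out_features, in_features):

```python
import torch
import torch.nn as nn

layer = nn.Linear(32, 64)
x = torch.randn(5, 32)   # batch of 5 inputs

y = layer(x)             # shape (5, 64)

# Equivalent manual computation: y = x @ W^T + b
y_manual = x @ layer.weight.T + layer.bias
print(torch.allclose(y, y_manual, atol=1e-5))
```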

More examples:

import torch.nn as nn
 
class MyModel(nn.Module): # Inherit from nn.Module
    def __init__(self): # Initialize your model & define layers
        super(MyModel, self).__init__()
        self.net = nn.Sequential(
                    nn.Linear(10, 32), 
                    nn.Sigmoid(), 
                    nn.Linear(32, 1)
                ) 
    
    def forward(self, x): # Compute output of your NN
        return self.net(x)

Equivalent to:

import torch.nn as nn
 
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer1 = nn.Linear(10, 32)
        self.layer2 = nn.Sigmoid()
        self.layer3 = nn.Linear(32,1)
    
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        return out

Training and Testing

  • torch.nn – Loss Functions
    • Mean Squared Error (for regression tasks): criterion = nn.MSELoss()
    • Cross Entropy (for classification tasks): criterion = nn.CrossEntropyLoss()
    • loss = criterion(model_output, expected_value)
  • torch.optim – Optimization Algorithms
    • For every batch of data:
      1. Call optimizer.zero_grad() to reset gradient of model parameters.
      2. Call loss.backward() to backpropagate gradients of prediction loss.
      3. Call optimizer.step() to adjust model parameters
    • SGD: torch.optim.SGD(model.parameters(), lr, momentum=0)
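The three steps above can be sketched as a single optimization step on a toy regression problem (the model, data, and lr are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 3)  # toy batch
y = torch.randn(4, 1)  # toy targets

optimizer.zero_grad()            # 1. reset gradients of model parameters
loss = criterion(model(x), y)
loss.backward()                  # 2. backpropagate the prediction loss
optimizer.step()                 # 3. adjust model parameters
```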

Entire Procedure

"""
Setup
"""
train_dataset = MyDataset(file1)
validation_dataset = MyDataset(file2)
test_dataset = MyDataset(file3) # read data via MyDataset
tr_set = DataLoader(train_dataset, 16, shuffle=True)
dv_set = DataLoader(validation_dataset, 16, shuffle=False)
tt_set = DataLoader(test_dataset, 16, shuffle=False)
# put dataset into Dataloader
model = MyModel().to(device) # construct model and move to device (cpu/cuda)
criterion = nn.MSELoss() # set loss function
optimizer = torch.optim.SGD(model.parameters(), 0.1) # set optimizer
"""
Training Loops
"""
for epoch in range(n_epochs): # iterate n_epochs
    model.train() # set model to train mode
    for x, y in tr_set: # iterate through the dataloader
        optimizer.zero_grad() # set gradient to zero
        x, y = x.to(device), y.to(device) # move data to device (cpu/cuda)
        pred = model(x) # forward pass (compute output)
        loss = criterion(pred, y) # compute loss
        loss.backward() # compute gradient (backpropagation)
        optimizer.step() # update model with optimizer
"""
Validation Loops
"""
model.eval() # set model to evaluation mode
total_loss = 0 
for x, y in dv_set: # iterate through the dataloader
    x, y = x.to(device), y.to(device) 
    with torch.no_grad(): 
        pred = model(x) 
        loss = criterion(pred, y) 
    total_loss += loss.cpu().item() * len(x) # accumulate loss, weighted by batch size
avg_loss = total_loss / len(dv_set.dataset) # average over the whole set, after the loop
"""
Testing Loops
"""
model.eval() # set model to evaluation mode
preds = [] 
for x in tt_set: # iterate through the dataloader
    x = x.to(device) # move data to device (cpu/cuda)
    with torch.no_grad(): # disable gradient calculation
        pred = model(x) # forward pass (compute output)
        preds.append(pred.cpu()) # collect prediction
  • model.eval(): Changes the behaviour of some model layers, such as dropout and batch normalization.
  • with torch.no_grad(): Disables the autograd engine, preventing calculations from being added to the gradient computation graph. Usually used to prevent accidental training on validation/testing data.
"""
Saving and Loading
"""
torch.save(model.state_dict(), path) # Saving
 
ckpt = torch.load(path)
model.load_state_dict(ckpt) # Loading

Tools

Torchvision

Models

RNN

Optimizers

LambdaLR

Optimizer Wrappers

Customized Loss Function