PyTorch Automatic Differentiation
Kong Liangqian Lv6

Content adapted from https://zhuanlan.zhihu.com/p/67184419

Introduction to requires_grad

Tensors in PyTorch either require gradients or do not. A tensor's requires_grad attribute tells you which, and a plain tensor does not require gradients by default.

In [4]: inp = torch.tensor([1,2,3])

In [5]: inp.requires_grad
Out[5]: False
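
As a small sketch of how the flag can be switched on after creation: the snippet below uses `requires_grad_()`, the in-place setter. It uses floating-point values (unlike the integer tensor above), because PyTorch only allows floating-point and complex tensors to require gradients. The variable names `before` and `after` are illustrative only.

```python
import torch

inp = torch.tensor([1.0, 2.0, 3.0])
before = inp.requires_grad      # plain tensors are not tracked by default

inp.requires_grad_(True)        # in-place switch; needs a floating-point dtype
after = inp.requires_grad

print(before, after)            # False True
```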

However, all parameters of modules in nn require gradients by default, for example:

In [17]: con1 = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2,stride=1,padding=0)

In [18]: con1.weight
Out[18]:
Parameter containing:
tensor([[[[-0.3100, -0.3029],
[-0.4663, -0.0912]]]], requires_grad=True)

In [19]: con1.bias
Out[19]:
Parameter containing:
tensor([0.4589], requires_grad=True)

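In a computation between tensors, if any one of the inputs requires gradients, the output requires gradients as well. The output does not require gradients only when none of the inputs do.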
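
The propagation rule can be checked directly. This is a minimal sketch; the tensors `a`, `b`, `w` and the result names are made up for illustration.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])                      # requires_grad defaults to False
b = torch.tensor([4.0, 5.0, 6.0])                      # also not tracked
w = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)  # tracked

no_grad_out = a + b      # every input has requires_grad=False
grad_out = a * w + b     # one input (w) has requires_grad=True

print(no_grad_out.requires_grad)  # False
print(grad_out.requires_grad)     # True
```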

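Freezing parameters with requires_grad

During training you can freeze part of a network so that the parameters of those layers are no longer updated. This is very useful in transfer learning.
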
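import torch.nn as nn
import torch.optim as optim
import torchvision

# Load a pretrained ResNet-18 and freeze all of its parameters.
# (Newer torchvision releases prefer the `weights=` argument over `pretrained=True`.)
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the original fully connected layer with a new fc layer.
# The parameters of the newly built fc layer default to requires_grad=True.
model.fc = nn.Linear(512, 100)

# Update only the parameters of the fc layer.
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# This freezes all the earlier ResNet layers,
# so training updates only the parameters of the final fc layer.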
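
To see the effect of freezing without downloading pretrained weights, here is a rough sketch on a tiny stand-in network. The `nn.Sequential` model below is hypothetical and only plays the role of the ResNet; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone plus a classification head.
model = nn.Sequential(
    nn.Linear(8, 4),   # plays the role of the pretrained layers
    nn.ReLU(),
    nn.Linear(4, 2),   # plays the role of the old fc layer
)

# Freeze everything, then replace the last layer with a fresh one.
for param in model.parameters():
    param.requires_grad = False
model[2] = nn.Linear(4, 3)  # new parameters default to requires_grad=True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
print(trainable)  # ['2.weight', '2.bias']
print(frozen)     # ['0.weight', '0.bias']

# After a backward pass, frozen parameters have no gradient at all.
loss = model(torch.randn(5, 8)).sum()
loss.backward()
print(model[0].weight.grad)              # None
print(model[2].weight.grad is not None)  # True
```

Listing the names of the parameters with `requires_grad=True` is a quick sanity check before building the optimizer.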

Gradient computation examples

$y = \sum_i 2x_i$

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y = torch.sum(x * 2)

y.backward()
print(x.grad)

Result

tensor([2., 2., 2.])

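Corresponding computation: each $x_i$ contributes $2x_i$ to $y$, so $\frac{\partial y}{\partial x_i} = 2$ for every $i$.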

$y = 2x$

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y = x * 2

y.backward()

Result

RuntimeError: grad can be implicitly created only for scalar outputs

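This happens because backward() called without arguments only works when y is a scalar. Here y is a vector with three elements, so the gradient cannot be created implicitly.

Fixed code
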
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y = x * 2

y.backward(torch.tensor([1.0, 1.0, 1.0]))
print(x.grad)

Result

tensor([2., 2., 2.])
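
One way to read this result: passing a tensor of ones to backward() gives the same gradient as summing y to a scalar first. A minimal sketch, with `x1` and `x2` as illustrative names:

```python
import torch

x1 = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
(x1 * 2).backward(torch.ones(3))   # explicit gradient argument of ones

x2 = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
(x2 * 2).sum().backward()          # reduce to a scalar first

print(x1.grad)  # tensor([2., 2., 2.])
print(x2.grad)  # tensor([2., 2., 2.])
```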

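Actual computation

Because $y$ is a vector, backward(v) computes a vector-Jacobian product: the $j$-th entry of x.grad is $\sum_i v_i \frac{\partial y_i}{\partial x_j} = 2 v_j$. With $v = (1, 1, 1)$ this gives $(2, 2, 2)$.

The tensor passed to backward() therefore acts as a set of coefficients on the gradient. For example, with
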
y.backward(torch.tensor([1,2,2]))
x.grad

Result

tensor([2., 4., 4.])
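
The weighting can be cross-checked against the full Jacobian. The sketch below assumes `torch.autograd.functional.jacobian` is available (it is in recent PyTorch releases); the names `v` and `J` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
v = torch.tensor([1.0, 2.0, 2.0])   # coefficients passed to backward()

y = x * 2
y.backward(v)

# Full Jacobian dy_i/dx_j of the same function; here it is 2 * identity.
J = jacobian(lambda t: t * 2, x.detach())

print(J)
print(x.grad)   # tensor([2., 4., 4.])
print(v @ J)    # same values as x.grad
```

In practice autograd never builds the full Jacobian for backward(); the explicit matrix is computed here only to make the product visible.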

$y = ax$

Note that in this example the gradient is taken with respect to the variable $a$, not $x$.

x = torch.tensor([-1.0, 0.0, 1.0])
a = torch.tensor(2.0, requires_grad=True)
y = torch.sum(x * a)

y.backward()
print(a.grad)

Result

tensor(0.)

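Corresponding computation: $y = \sum_i a x_i$, so $\frac{\partial y}{\partial a} = \sum_i x_i = -1 + 0 + 1 = 0$.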

torch.min

The min function selects a single element. Its gradient is 1 at the position of that element and 0 everywhere else.

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
y = torch.min(x * 2)

y.backward()
print(x.grad)

Result

tensor([2., 0., 0.])

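Corresponding computation

Take the minimum over x and suppose it is $x_1 = -1.0$. Then the $y$ above is equivalent to

$y = 2x_1$

Therefore

$\frac{\partial y}{\partial x_1} = 2, \quad \frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial x_3} = 0$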
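
As a quick check that the nonzero gradient really follows the position of the minimum, here is a sketch with a different, made-up input where the smallest element is not first:

```python
import torch

x = torch.tensor([3.0, -2.0, 5.0, 0.5], requires_grad=True)
y = torch.min(x * 2)
y.backward()

idx = torch.argmin(x).item()   # position of the smallest element
print(idx)      # 1
print(x.grad)   # tensor([0., 2., 0., 0.])
```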

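Note

The examples above call backward() only once. By default backward() frees the graph's intermediate buffers, so in general it can be called only once on the same graph. Passing retain_graph=True allows a second call, as in the following example:
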
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)

y = x * 2

y.backward(torch.tensor([1,2,2]), retain_graph=True)
y.backward(torch.tensor([1,2,2]), retain_graph=True)

print(x.grad)

Result

tensor([4., 8., 8.])

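The second backward pass accumulates onto the result of the first.

To run backward twice without accumulation, the variable's gradient must be set to zero before the second call.
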
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)

y = x * 2

y.backward(torch.tensor([1,2,2]), retain_graph=True)
x.grad.zero_()
y.backward(torch.tensor([1,2,2]), retain_graph=True)

print(x.grad)

Result

tensor([2., 4., 4.])
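
The same accumulation happens when the graph is rebuilt on every iteration, which is the usual situation in a training loop. The sketch below is illustrative only; it also shows setting `x.grad = None` as an alternative to `x.grad.zero_()`.

```python
import torch

x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)

accumulated = []
for step in range(3):
    y = torch.sum(x * 2)    # rebuild the graph every iteration
    y.backward()
    accumulated.append(x.grad.clone())

print(accumulated[-1])      # tensor([6., 6., 6.]): three passes summed

x.grad = None               # alternative to x.grad.zero_(): drop the gradient
y = torch.sum(x * 2)
y.backward()
print(x.grad)               # tensor([2., 2., 2.])
```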

Full example

# -*- coding: utf-8 -*-
import torch
import math


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# For this example, the output y is a linear function of (x, x^2, x^3), so
# we can consider it as a linear layer neural network. Let's prepare the
# tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# In the above code, x.unsqueeze(-1) has shape (2000, 1), and p has shape
# (3,), for this case, broadcasting semantics will apply to obtain a tensor
# of shape (2000, 3)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. The Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flattens the output of the linear layer to a 1D tensor,
# to match the shape of `y`.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(2000):

    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(xx)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

# You can access the first layer of `model` like accessing the first item of a list
linear_layer = model[0]

# For linear layer, its parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')
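
For comparison, the manual update under torch.no_grad() can be handed to an optimizer. This is only a sketch of one possible variant, not part of the original example: it swaps in `torch.optim.SGD` with the same learning rate, runs fewer iterations, and records the losses in a hypothetical `losses` list.

```python
import math
import torch

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)
xx = x.unsqueeze(-1).pow(torch.tensor([1, 2, 3]))  # columns: x, x^2, x^3

model = torch.nn.Sequential(torch.nn.Linear(3, 1), torch.nn.Flatten(0, 1))
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)

losses = []
for t in range(500):
    loss = loss_fn(model(xx), y)
    losses.append(loss.item())

    optimizer.zero_grad()   # replaces model.zero_grad()
    loss.backward()
    optimizer.step()        # replaces the manual update under torch.no_grad()

print(losses[0], losses[-1])
```

The optimizer only touches the parameters it was given, which is the same mechanism the frozen-layer example relies on.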
