The link below is the index of all my posts on fast-reid (BoT person re-identification). If you find any mistakes, please point them out and I will correct them right away. Friends who are interested are welcome to add me on WeChat (17575010159) to discuss the techniques, and if this post helps you, please remember to leave a like, as that is the biggest encouragement for me. Series index: Person Re-identification 02-00: fast-reid (BoT) - Table of Contents - the most complete walkthrough
Highly recommended production-grade project: this is a behavior-analysis project I have actually deployed, made up of three main modules (1. pedestrian detection, 2. pedestrian tracking, 3. action recognition): Behavior Analysis (Commercial Grade) 00 - Table of Contents - the most complete walkthrough
Preface
fast-reid is a person re-identification framework. It is a fairly large project, and its internal machinery is essentially the same as detectron2's. My earlier detectron2 posts were not detailed enough, so this time I intend to walk through everything from beginning to end in full detail. The analysis here proceeds step by step because I already know the code well; when I read the source for the first time, however, I worked backwards from the entry point, and I recommend doing the same the first time you analyze an unfamiliar codebase (you won't need to here: just follow this walkthrough and understanding it will take only minutes). The project is organized around this standard PyTorch project template: https://github.com/L1aoXingyu/Deep-Learning-Project-Template
hooks
By now, following the earlier posts, you should have downloaded the source code and gotten it running. First, locate the following code in fastreid\engine\train_loop.py:
class HookBase:
"""
这是一个Hook相关的基类,其可以使用class:`TrainerBase`进行注册,需要实现四个函数,
四个函数被调用的流程如下
Base class for hooks that can be registered with :class:`TrainerBase`.
Each hook can implement 4 methods. The way they are called is demonstrated
in the following snippet:
.. code-block:: python # 执行训练指令之后
hook.before_train() # 调用before_train()
for iter in range(start_iter, max_iter): # 开始进行迭代,训练数据
hook.before_step() # 调用before_step()
trainer.run_step() # 进行一个epoch的训练
hook.after_step() # 调用before_step()
hook.after_train() # 调用before_step()
Notes:
# 在hook的函数中,我们可以使用self.trainer去访问更多的属性,如迭代次数等等
1. In the hook method, users can access `self.trainer` to access more
properties about the context (e.g., current iteration).
# hook 中的before_step函数和after_step经常是可以相互代替的。如果不需要对时间进行追踪,
建议把需要的一些功能都在after_step中实现,如果和时间相关的一些功能,侧建议在before_step中实现
2. A hook that does something in :meth:`before_step` can often be
implemented equivalently in :meth:`after_step`.
If the hook takes non-trivial time, it is strongly recommended to
implement the hook in :meth:`after_step` instead of :meth:`before_step`.
The convention is that :meth:`before_step` should only take negligible time.
Following this convention will allow hooks that do care about the difference
between :meth:`before_step` and :meth:`after_step` (e.g., timer) to
function properly.
Attributes:
trainer: A weak reference to the trainer object. Set by the trainer when the hook is
registered.
"""
    def before_train(self):
        """
        Called before the first iteration.
        """
        pass
    def after_train(self):
        """
        Called after the last iteration.
        """
        pass
    def before_step(self):
        """
        Called before each iteration.
        """
        pass
    def after_step(self):
        """
        Called after each iteration.
        """
        pass
As you can see, HookBase is only a base class. A class that inherits from it has (at least) these four methods:
def before_train(self):  # called before the first iteration
def after_train(self):   # called after the last iteration
def before_step(self):   # called before each iteration
def after_step(self):    # called after each iteration
The docstring also explains the order in which they are called:
.. code-block:: python
    hook.before_train()                       # called once, before the first iteration
    for iter in range(start_iter, max_iter):  # loop over the training iterations
        hook.before_step()                    # called before each iteration
        trainer.run_step()                    # run one training iteration
        hook.after_step()                     # called after each iteration
    hook.after_train()                        # called once, after the last iteration
From this we get a rough picture of how hooks are structured. Which raises a question: when do we actually need to write a class that inherits from HookBase?
Usage
If you have worked with deep learning and written your own training code, you have certainly run into situations like these:
1. Printing timings (just an example; the actual source does not necessarily implement it this way): you want to record some timings during training and print them. For instance, before each iteration you record the time point start_time (this could be done in before_step()), after each iteration you record the time point end_time, and end_time - start_time (computed in before_step or after_step) gives the duration of one iteration, which you then print. In this case you would create a hook that inherits from HookBase.
2. Learning-rate decay (again just an example; the actual source does not necessarily implement it this way): after every iteration you check whether the current iteration count has reached the point at which the learning rate should be decayed, and if so, decay it (this could be done in after_step()).
There are many more examples like these; I won't list them all here, and we will see plenty of them later in the source code. A minimal sketch of the first case is given below.
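For the first case, a timing hook only needs to override before_step() and after_step(). The snippet below is purely an illustrative sketch of my own (the actual fast-reid source may implement timing differently); it assumes nothing beyond the HookBase interface shown above and that fast-reid is installed so the import works:

import time
from fastreid.engine.train_loop import HookBase

class IterTimerHook(HookBase):
    """Toy hook: measures and prints the wall-clock time of each iteration."""
    def before_step(self):
        # record the time point at which this iteration starts
        self._step_start = time.perf_counter()
    def after_step(self):
        # elapsed time of one full iteration (run_step included);
        # self.trainer is the weak reference set when the hook is registered
        elapsed = time.perf_counter() - self._step_start
        print(f"iter {self.trainer.iter}: {elapsed:.3f}s")

Registering such a hook (we will see register_hooks below) is all it takes for the trainer to call these two methods around every iteration.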
In short, any work that has to happen before training, before each iteration, after each iteration, or after training can be implemented as a hook. Now the next question: if we write a subclass of HookBase with these four methods,
def before_train(self):  # called before the first iteration
def after_train(self):   # called after the last iteration
def before_step(self):   # called before each iteration
def after_step(self):    # called after each iteration
where do they actually get called? In other words, which concrete line of code invokes them? Let's set that question aside for the moment and move on to TrainerBase.
TrainerBase
Still in fastreid\engine\train_loop.py, we find the following code:
class TrainerBase:
"""
Base class for iterative trainer with hooks.
The only assumption we made here is: the training runs in a loop.
A subclass can implement what the loop is.
We made no assumptions about the existence of dataloader, optimizer, model, etc.
Attributes:
iter(int): the current iteration.
start_iter(int): The iteration to start with.
By convention the minimum possible value is 0.
max_iter(int): The iteration to end training.
storage(EventStorage): An EventStorage that's opened during the course of training.
"""
def __init__(self):
self._hooks = []
def register_hooks(self, hooks):
"""把创建的所有hook都注册到self._hooks之中,保存起来
Register hooks to the trainer. The hooks are executed in the order
they are registered.
Args:
hooks (list[Optional[HookBase]]): list of hooks
"""
hooks = [h for h in hooks if h is not None]
for h in hooks:
assert isinstance(h, HookBase)
# To avoid circular reference, hooks and trainer cannot own each other.
# This normally does not matter, but will cause memory leak if the
# involved objects contain __del__:
# See http://engineering.hearsaysocial.com/2013/06/16/circular-references-in-python/
h.trainer = weakref.proxy(self)
self._hooks.extend(hooks)
def train(self, start_iter: int, max_iter: int):
"""
Args:
start_iter, max_iter (int): See docs above
"""
        # logger used to print training information
        logger = logging.getLogger(__name__)
        logger.info("Starting training from iteration {}".format(start_iter))
        # set the starting iteration and the maximum number of iterations
        self.iter = self.start_iter = start_iter
        self.max_iter = max_iter
        # open an EventStorage for recording events and assign it to self.storage
        with EventStorage(start_iter) as self.storage:
            try:
                # call before_train() on every registered hook
                self.before_train()
for self.iter in range(start_iter, max_iter):
self.before_step()
self.run_step()
self.after_step()
except Exception:
logger.exception("Exception during training:")
finally:
self.after_train()
def before_train(self):
for h in self._hooks:
h.before_train()
def after_train(self):
for h in self._hooks:
h.after_train()
def before_step(self):
for h in self._hooks:
h.before_step()
def after_step(self):
for h in self._hooks:
h.after_step()
# this guarantees, that in each hook's after_step, storage.iter == trainer.iter
self.storage.step()
def run_step(self):
raise NotImplementedError
class SimpleTrainer(TrainerBase):
"""
    A simple trainer for the most common type of task:
    single-cost single-optimizer single-data-source iterative optimization.
    It assumes that every step, you:
    1. Compute the loss with a data from the data_loader.
    2. Compute the gradients with the above loss.
    3. Update the model with the optimizer.
    If you want to do anything fancier than this,
    either subclass TrainerBase and implement your own `run_step`,
    or write your own training loop.
"""
def __init__(self, model, data_loader, optimizer):
"""
Args:
model: a torch Module. Takes a data from data_loader and returns a
dict of heads.
data_loader: an iterable. Contains data to be used to call model.
optimizer: a torch optimizer.
"""
super().__init__()
"""
We set the model to training mode in the trainer.
However it's valid to train a model that's in eval mode.
If you want your model (or a submodule of it) to behave
like evaluation during training, you can overwrite its train() method.
"""
model.train()
self.model = model
self.data_loader = data_loader
self._data_loader_iter = iter(data_loader)
self.optimizer = optimizer
    def run_step(self):
        """
        Implement the standard training logic described above.
        """
        # make sure the model is still in training mode
        assert self.model.training, "[SimpleTrainer] model was changed to eval mode!"
        # record the time point at which this iteration starts
        start = time.perf_counter()
        """
        If you want to do something with the data, you can wrap the dataloader.
        """
        data = next(self._data_loader_iter)
        # time spent loading the data for this iteration
        data_time = time.perf_counter() - start
        """
        If you want to do something with the heads, you can wrap the model.
        """
        # forward pass
        outputs, targets = self.model(data)
        # compute the losses
        if isinstance(self.model, DistributedDataParallel):
            loss_dict = self.model.module.losses(outputs, targets)
        else:
            loss_dict = self.model.losses(outputs, targets)
        # sum all individual loss terms
        losses = sum(loss_dict.values())
        # check that the total loss is finite
        self._detect_anomaly(losses, loss_dict)
        # record the losses and the data-loading time as metrics
        metrics_dict = loss_dict
        metrics_dict["data_time"] = data_time
        self._write_metrics(metrics_dict)
        """
        If you need to accumulate gradients or something similar, you can
        wrap the optimizer with your custom `zero_grad()` method.
        """
        self.optimizer.zero_grad()
        # backward pass
        losses.backward()
        """
        If you need gradient clipping/scaling or other processing, you can
        wrap the optimizer with your custom `step()` method.
        """
        self.optimizer.step()
As you can see, run_step(self) in SimpleTrainer(TrainerBase) is the rough flow of one iteration: 1. load the data, 2. run the forward pass, 3. compute the loss, 4. run the backward pass (and the optimizer step).
We can also see the following code in the TrainerBase class:
        # open an EventStorage for recording events and assign it to self.storage
with EventStorage(start_iter) as self.storage:
try:
                # call before_train() on every registered hook
self.before_train()
for self.iter in range(start_iter, max_iter):
self.before_step()
self.run_step()
self.after_step()
except Exception:
logger.exception("Exception during training:")
finally:
self.after_train()
Here, the calls:
self.before_train()
self.before_step()
self.after_step()
self.after_train()
each loop over all registered hooks and invoke the hook method of the same name.
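To see this dispatch in action, here is a small, self-contained toy example (my own illustration, not code from fast-reid); the only assumption is that fast-reid is installed so that HookBase and TrainerBase can be imported from fastreid.engine.train_loop as shown above:

from fastreid.engine.train_loop import HookBase, TrainerBase

class PrintHook(HookBase):
    """Toy hook that reports when each of the four hook methods fires."""
    def before_train(self):
        print("before_train: training is about to start")
    def before_step(self):
        # self.trainer is the weak reference set in register_hooks()
        print(f"  before_step: iter {self.trainer.iter}")
    def after_step(self):
        print(f"  after_step:  iter {self.trainer.iter}")
    def after_train(self):
        print("after_train: training has finished")

class ToyTrainer(TrainerBase):
    """Trivial trainer whose run_step does no real work."""
    def run_step(self):
        pass  # a real trainer would load data, run forward/backward and step the optimizer here

trainer = ToyTrainer()
trainer.register_hooks([PrintHook()])
trainer.train(start_iter=0, max_iter=3)

Running it prints before_train once, then a before_step/after_step pair for each of the three iterations, and finally after_train, which is exactly the calling order described in the HookBase docstring. In real training, SimpleTrainer (or a subclass of it) plays the role of ToyTrainer, and the hooks do useful work such as timing or learning-rate decay.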
Conclusion
At this point we have a basic grasp of the hooks mechanism and of the overall training flow. But something still seems to be missing: how is the data iterator built, how is the model constructed, and how is the validation set evaluated during training? I will cover these in detail in the next post.