The link below is the index of all my posts on fast-reid (BoT person re-identification). If you find any mistakes, please point them out and I will correct them right away. Friends who are interested are welcome to add me on WeChat (17575010159) to discuss the techniques, and if this post helps you, please remember to leave a like, as that is the biggest encouragement for me. Series index: Person Re-identification 02-00: fast-reid (BoT) - Table of Contents - the most complete walkthrough
Highly recommended production-grade project: this is a behavior-analysis project I have actually deployed, made up of three main modules (1. pedestrian detection, 2. pedestrian tracking, 3. action recognition): Behavior Analysis (Commercial Grade) 00 - Table of Contents - the most complete walkthrough
Preface
fast-reid is a person re-identification framework. It is a fairly large project, and its internal machinery is essentially the same as detectron2's. My earlier detectron2 posts were not detailed enough, so this time I intend to walk through everything from beginning to end in full detail. The analysis here proceeds step by step because I already know the code well; when I read the source for the first time, however, I worked backwards from the entry point, and I recommend doing the same the first time you analyze an unfamiliar codebase (you won't need to here: just follow this walkthrough and understanding it will take only minutes). The project is organized around this standard PyTorch project template: https://github.com/L1aoXingyu/Deep-Learning-Project-Template
hooks
By now, following the earlier posts, you should have downloaded the source code and gotten it running. First, locate the following code in fastreid\engine\train_loop.py:
class HookBase:
"""
这是一个Hook相关的基类,其可以使用class:`TrainerBase`进行注册,需要实现四个函数,
四个函数被调用的流程如下
Base class for hooks that can be registered with :class:`TrainerBase`.
Each hook can implement 4 methods. The way they are called is demonstrated
in the following snippet:
.. code-block:: python # 执行训练指令之后
hook.before_train() # 调用before_train()
for iter in range(start_iter, max_iter): # 开始进行迭代,训练数据
hook.before_step() # 调用before_step()
trainer.run_step() # 进行一个epoch的训练
hook.after_step() # 调用before_step()
hook.after_train() # 调用before_step()
Notes:
# 在hook的函数中,我们可以使用self.trainer去访问更多的属性,如迭代次数等等
1. In the hook method, users can access `self.trainer` to access more
properties about the context (e.g., current iteration).
# hook 中的before_step函数和after_step经常是可以相互代替的。如果不需要对时间进行追踪,
建议把需要的一些功能都在after_step中实现,如果和时间相关的一些功能,侧建议在before_step中实现
2. A hook that does something in :meth:`before_step` can often be
implemented equivalently in :meth:`after_step`.
If the hook takes non-trivial time, it is strongly recommended to
implement the hook in :meth:`after_step` instead of :meth:`before_step`.
The convention is that :meth:`before_step` should only take negligible time.
Following this convention will allow hooks that do care about the difference
between :meth:`before_step` and :meth:`after_step` (e.g., timer) to
function properly.
Attributes:
trainer: A weak reference to the trainer object. Set by the trainer when the hook is
registered.
"""
    def before_train(self):
        """
        Called before the first iteration.
        """
        pass
    def after_train(self):
        """
        Called after the last iteration.
        """
        pass
    def before_step(self):
        """
        Called before each iteration.
        """
        pass
    def after_step(self):
        """
        Called after each iteration.
        """
        pass
As you can see, HookBase is only a base class. A class that inherits from it has (at least) these four methods:
def before_train(self):  # called before the first iteration
def after_train(self):   # called after the last iteration
def before_step(self):   # called before each iteration
def after_step(self):    # called after each iteration
The docstring also explains the order in which they are called:
.. code-block:: python
    hook.before_train()                       # called once, before the first iteration
    for iter in range(start_iter, max_iter):  # loop over the training iterations
        hook.before_step()                    # called before each iteration
        trainer.run_step()                    # run one training iteration
        hook.after_step()                     # called after each iteration
    hook.after_train()                        # called once, after the last iteration
From this we get a rough picture of how hooks are structured. Which raises a question: when do we actually need to write a class that inherits from HookBase?
Usage
If you have worked with deep learning and written your own training code, you have certainly run into situations like these:
1. Printing timings (just an example; the actual source does not necessarily implement it this way): you want to record some timings during training and print them. For instance, before each iteration you record the time point start_time (this could be done in before_step()), after each iteration you record the time point end_time, and end_time - start_time (computed in before_step or after_step) gives the duration of one iteration, which you then print. In this case you would create a hook that inherits from HookBase.
2. Learning-rate decay (again just an example; the actual source does not necessarily implement it this way): after every iteration you check whether the current iteration count has reached the point at which the learning rate should be decayed, and if so, decay it (this could be done in after_step()).
There are many more examples like these; I won't list them all here, and we will see plenty of them later in the source code. A minimal sketch of the first case is given below.
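For the first case, a timing hook only needs to override before_step() and after_step(). The snippet below is purely an illustrative sketch of my own (the actual fast-reid source may implement timing differently); it assumes nothing beyond the HookBase interface shown above and that fast-reid is installed so the import works:

import time
from fastreid.engine.train_loop import HookBase

class IterTimerHook(HookBase):
    """Toy hook: measures and prints the wall-clock time of each iteration."""
    def before_step(self):
        # record the time point at which this iteration starts
        self._step_start = time.perf_counter()
    def after_step(self):
        # elapsed time of one full iteration (run_step included);
        # self.trainer is the weak reference set when the hook is registered
        elapsed = time.perf_counter() - self._step_start
        print(f"iter {self.trainer.iter}: {elapsed:.3f}s")

Registering such a hook (we will see register_hooks below) is all it takes for the trainer to call these two methods around every iteration.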
In short, any work that has to happen before training, before each iteration, after each iteration, or after training can be implemented as a hook. Now the next question: if we write a subclass of HookBase with these four methods,
def before_train(self):  # called before the first iteration
def after_train(self):   # called after the last iteration
def before_step(self):   # called before each iteration
def after_step(self):    # called after each iteration
where do they actually get called? In other words, which concrete line of code invokes them? Let's set that question aside for the moment and move on to TrainerBase.
TrainerBase
Still in fastreid\engine\train_loop.py, we find the following code:
class TrainerBase:
"""
Base class for iterative trainer with hooks.
The only assumption we made here is: the training runs in a loop.
A subclass can implement what the loop is.
We made no assumptions about the existence of dataloader, optimizer, model, etc.
Attributes:
iter(int): the current iteration.
start_iter(int): The iteration to start with.
By convention the minimum possible value is 0.
max_iter(int): The iteration to end training.
storage(EventStorage): An EventStorage that's opened during the course of training.
"""
def __init__(self):
self._hooks = []
def register_hooks(self, hooks):
"""把创建的所有hook都注册到self._hooks之中,保存起来
Register hooks to the trainer. The hooks are executed in the order
they are registered.
Args:
hooks (list[Optional[HookBase]]): list of hooks
"""
hooks = [h for h in hooks if h is not None]
for h in hooks:
assert isinstance(h, HookBase)
# To avoid circular reference, hooks and trainer cannot own each other.
# This normally does not matter, but will cause memory leak if the
# involved objects contain __del__:
# See http://engineering.hearsaysocial.com/2013/06/16/circular-references-in-python/
h.trainer = weakref.proxy(self)
self._hooks.extend(hooks)
def train(self, start_iter: int, max_iter: int):
"""
Args:
start_iter, max_iter (int): See docs above
"""
        # logger used to print training information
        logger = logging.getLogger(__name__)
        logger.info("Starting training from iteration {}".format(start_iter))
        # set the starting iteration and the maximum number of iterations
        self.iter = self.start_iter = start_iter
        self.max_iter = max_iter
        # open an EventStorage for recording events and assign it to self.storage
        with EventStorage(start_iter) as self.storage:
            try:
                # call before_train() on every registered hook
                self.before_train()
for self.iter in range(start_iter, max_iter):
self.before_step()
self.run_step()
self.after_step()
except Exception:
logger.exception("Exception during training:")
finally:
self.after_train()
def before_train(self):
for h in self._hooks:
h.before_train()
def after_train(self):
for h in self._hooks:
h.after_train()
def before_step(self):
for h in self._hooks:
h.before_step()
def after_step(self):
for h in self._hooks:
h.after_step()
# this guarantees, that in each hook's after_step, storage.iter == trainer.iter
self.storage.step()
def run_step(self):
raise NotImplementedError
class SimpleTrainer(TrainerBase):
"""
    A simple trainer for the most common type of task:
    single-cost single-optimizer single-data-source iterative optimization.
    It assumes that every step, you:
    1. Compute the loss with a data from the data_loader.
    2. Compute the gradients with the above loss.
    3. Update the model with the optimizer.
    If you want to do anything fancier than this,
    either subclass TrainerBase and implement your own `run_step`,
    or write your own training loop.
"""
def __init__(self, model, data_loader, optimizer):
"""
Args:
model: a torch Module. Takes a data from data_loader and returns a
dict of heads.
data_loader: an iterable. Contains data to be used to call model.
optimizer: a torch optimizer.
"""
super().__init__()
"""
We set the model to training mode in the trainer.
However it's valid to train a model that's in eval mode.
If you want your model (or a submodule of it) to behave
like evaluation during training, you can overwrite its train() method.
"""
model.train()
self.model = model
self.data_loader = data_loader
self._data_loader_iter = iter(data_loader)
self.optimizer = optimizer
    def run_step(self):
        """
        Implement the standard training logic described above.
        """
        # make sure the model is still in training mode
        assert self.model.training, "[SimpleTrainer] model was changed to eval mode!"
        # record the time point at which this iteration starts
        start = time.perf_counter()
        """
        If you want to do something with the data, you can wrap the dataloader.
        """
        data = next(self._data_loader_iter)
        # time spent loading the data for this iteration
        data_time = time.perf_counter() - start
        """
        If you want to do something with the heads, you can wrap the model.
        """
        # forward pass
        outputs, targets = self.model(data)
        # compute the losses
        if isinstance(self.model, DistributedDataParallel):
            loss_dict = self.model.module.losses(outputs, targets)
        else:
            loss_dict = self.model.losses(outputs, targets)
        # sum all individual loss terms
        losses = sum(loss_dict.values())
        # check that the total loss is finite
        self._detect_anomaly(losses, loss_dict)
        # record the losses and the data-loading time as metrics
        metrics_dict = loss_dict
        metrics_dict["data_time"] = data_time
        self._write_metrics(metrics_dict)
        """
        If you need to accumulate gradients or something similar, you can
        wrap the optimizer with your custom `zero_grad()` method.
        """
        self.optimizer.zero_grad()
        # backward pass
        losses.backward()
        """
        If you need gradient clipping/scaling or other processing, you can
        wrap the optimizer with your custom `step()` method.
        """
        self.optimizer.step()
As you can see, run_step(self) in SimpleTrainer(TrainerBase) is the rough flow of one iteration: 1. load the data, 2. run the forward pass, 3. compute the loss, 4. run the backward pass (and the optimizer step).
We can also see the following code in the TrainerBase class:
        # open an EventStorage for recording events and assign it to self.storage
with EventStorage(start_iter) as self.storage:
try:
                # call before_train() on every registered hook
self.before_train()
for self.iter in range(start_iter, max_iter):
self.before_step()
self.run_step()
self.after_step()
except Exception:
logger.exception("Exception during training:")
finally:
self.after_train()
Here, the calls:
self.before_train()
self.before_step()
self.after_step()
self.after_train()
each loop over all registered hooks and invoke the hook method of the same name.
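To see this dispatch in action, here is a small, self-contained toy example (my own illustration, not code from fast-reid); the only assumption is that fast-reid is installed so that HookBase and TrainerBase can be imported from fastreid.engine.train_loop as shown above:

from fastreid.engine.train_loop import HookBase, TrainerBase

class PrintHook(HookBase):
    """Toy hook that reports when each of the four hook methods fires."""
    def before_train(self):
        print("before_train: training is about to start")
    def before_step(self):
        # self.trainer is the weak reference set in register_hooks()
        print(f"  before_step: iter {self.trainer.iter}")
    def after_step(self):
        print(f"  after_step:  iter {self.trainer.iter}")
    def after_train(self):
        print("after_train: training has finished")

class ToyTrainer(TrainerBase):
    """Trivial trainer whose run_step does no real work."""
    def run_step(self):
        pass  # a real trainer would load data, run forward/backward and step the optimizer here

trainer = ToyTrainer()
trainer.register_hooks([PrintHook()])
trainer.train(start_iter=0, max_iter=3)

Running it prints before_train once, then a before_step/after_step pair for each of the three iterations, and finally after_train, which is exactly the calling order described in the HookBase docstring. In real training, SimpleTrainer (or a subclass of it) plays the role of ToyTrainer, and the hooks do useful work such as timing or learning-rate decay.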
Conclusion
At this point we have a basic grasp of the hooks mechanism and of the overall training flow. But something still seems to be missing: how is the data iterator built, how is the model constructed, and how is the validation set evaluated during training? I will cover these in detail in the next post.