[Distributed Training: Synchronous Gradient Updates]

Author: 静静喜欢大白. Published: 2020-08-18 18:18:55

Reposted from https://blog.csdn.net/qq_28626909/article/details/85003392. See also https://www.dazhuanlan.com/2019/12/24/5e022023d8081/, https://www.javaroad.cn/questions/148517 and https://www.jianshu.com/p/7fddb580ab65

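Distributed training can greatly speed up model training in TensorFlow, but how the work is split up and how the parameters should be set both depend heavily on the SyncReplicasOptimizer class.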

Operating system: Ubuntu 16.04

Environment: Python 3.6, NVIDIA driver 384 (4 GPUs), tensorflow-gpu 1.10 + CUDA + cuDNN (choose versions that match your own GPU setup)

Let's walk through the source of SyncReplicasOptimizer.

First, SyncReplicasOptimizer exists specifically to handle synchronous gradient descent in distributed deep learning (if you want asynchronous updates, you can ignore it entirely).

    def __init__(self,
                 opt: Any,
                 replicas_to_aggregate: Any,
                 total_num_replicas: Any = None,
                 variable_averages: Any = None,
                 variables_to_average: Any = None,
                 use_locking: bool = False,
                 name: str = "sync_replicas") -> None

    Construct a sync_replicas optimizer.

    Params:
    opt – The actual optimizer that will be used to compute and apply the gradients. Must be one of the Optimizer classes.
    replicas_to_aggregate – number of replicas to aggregate for each variable update.
    total_num_replicas – Total number of tasks/workers/replicas, could be different from replicas_to_aggregate. If total_num_replicas > replicas_to_aggregate: it is backup_replicas + replicas_to_aggregate. If total_num_replicas < replicas_to_aggregate: Replicas compute multiple batches per update to variables.
    variable_averages – Optional `ExponentialMovingAverage` object, used to maintain moving averages for the variables passed in `variables_to_average`.
    variables_to_average – a list of variables that need to be averaged. Only needed if variable_averages is passed in.
    use_locking – If True use locks for update operation.
    name – string. Optional name of the returned operation.

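In short: opt is the wrapped optimizer that actually computes and applies the gradients, and it must be one of the Optimizer classes. variable_averages is an optional `ExponentialMovingAverage` object that keeps moving averages of the variables handed to it, and variables_to_average is the list of those variables, needed only when variable_averages is given. use_locking turns on locking for the update operation, and name is an optional name for the returned operation.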

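Apart from opt and replicas_to_aggregate, every argument has a default and can be left alone at first. Here we focus on replicas_to_aggregate and total_num_replicas, because these two matter the most.

replicas_to_aggregate: the number of replicas that must be collected from the distributed workers before each update is applied. You can think of it as the number of gradients per update.

total_num_replicas: the total number of workers (replicas) taking part in the computation, which is also the number of batches handed out per step (one batch_size per machine).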

Here is the snippet that shows up most often online when people use SyncReplicasOptimizer:

Code source: https://blog.csdn.net/guotong1988/article/details/53927424

    # Compute and apply gradient updates in synchronous mode
    rep_op = tf.train.SyncReplicasOptimizer(optimizer,
                                            replicas_to_aggregate=len(worker_hosts),
                                            total_num_replicas=len(worker_hosts),
                                            use_locking=True)
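The call above sets both counts to the number of workers. As a rough sketch of how those two arguments could be derived from a cluster description, the helper below turns a comma-separated worker list into the two keyword arguments. The name `sync_replica_counts`, the `num_backup` parameter and the host addresses are made up for illustration; none of them come from TensorFlow or from the original snippet.

```python
def sync_replica_counts(worker_hosts_flag, num_backup=0):
    """Hypothetical helper: derive the two counts from a worker list string."""
    # Split "host:port,host:port,..." into a clean list of workers.
    worker_hosts = [h.strip() for h in worker_hosts_flag.split(",") if h.strip()]
    total = len(worker_hosts)
    if not 0 <= num_backup < total:
        raise ValueError("num_backup must be in [0, number of workers)")
    return {
        # Gradients needed per update: all workers minus the backups.
        "replicas_to_aggregate": total - num_backup,
        # Every worker, backups included, takes part in the computation.
        "total_num_replicas": total,
    }


# Placeholder addresses, assumed only for this example.
kwargs = sync_replica_counts("10.0.0.1:2222,10.0.0.2:2222,10.0.0.3:2222")
print(kwargs)  # {'replicas_to_aggregate': 3, 'total_num_replicas': 3}
```

Under TensorFlow 1.x the result could then be passed on as `tf.train.SyncReplicasOptimizer(optimizer, use_locking=True, **kwargs)`, which reproduces the call above when `num_backup` is 0.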

As we can see, replicas_to_aggregate and total_num_replicas are both set to the number of workers in the cluster. The source, however, adds some extra explanation for these two parameters (from here on, replicas_to_aggregate is called the "replica count" and total_num_replicas the "compute count"):

    replicas_to_aggregate: number of replicas to aggregate for each variable
      update.
    total_num_replicas: Total number of tasks/workers/replicas, could be
      different from replicas_to_aggregate.
      If total_num_replicas > replicas_to_aggregate: it is backup_replicas +
      replicas_to_aggregate.
      If total_num_replicas < replicas_to_aggregate: Replicas compute
      multiple batches per update to variables.

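What does that mean? It means the replica count and the compute count do not have to be equal. Picture three people handing in three exam papers: the three people are the compute count, and the three papers are the replica count.

The source says: If total_num_replicas > replicas_to_aggregate: it is backup_replicas + replicas_to_aggregate.

In other words, if the compute count is larger than the replica count, the surplus workers are backup workers.

For example, if four people hand in three papers, the one whose paper is not collected is treated as the backup worker.

So what is that for? Why bother?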

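In synchronous gradient descent, every worker has to wait until all the others have finished before the next step can start, so waiting is built in. Think of four people taking an exam at different speeds who must all hand in at the same time: the round only ends when the slowest person finishes, which wastes the capacity of the fastest ones. Backup workers were proposed to fix this. Four people sit the exam but we only collect three papers; the slowest paper is simply discarded and the next round begins.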

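Adding backup workers can therefore speed up training with distributed synchronous gradient descent. How many backups should you use? The paper Revisiting Distributed Synchronous SGD puts the ratio at about 20:1.

Backup worker paper: https://pan.baidu.com/s/1R22sfGTZcAy_5KEysJGX7Q (extraction code: 26zi)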
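To make the exam analogy concrete, here is a toy simulation of the waiting time per step. It is only a sketch: the timing model (four workers that each take 0.9 to 1.1 seconds, with one of them occasionally stalling for 2 extra seconds) is invented for illustration and is not taken from the paper or from TensorFlow.

```python
import random


def step_time(worker_times, replicas_to_aggregate):
    """Time until the k-th fastest worker has delivered its gradient."""
    return sorted(worker_times)[replicas_to_aggregate - 1]


random.seed(0)
num_steps = 1000
wait_all = 0.0     # collect 4 of 4: every step waits for the slowest worker
wait_backup = 0.0  # collect 3 of 4: the slowest gradient is dropped
for _ in range(num_steps):
    # Assumed per-step compute times in seconds for 4 workers.
    times = [random.uniform(0.9, 1.1) for _ in range(4)]
    if random.random() < 0.2:  # occasionally one worker stalls
        times[random.randrange(4)] += 2.0
    wait_all += step_time(times, 4)
    wait_backup += step_time(times, 3)

print("mean step time, 4 of 4:", wait_all / num_steps)
print("mean step time, 3 of 4:", wait_backup / num_steps)
```

Because at most one worker stalls in this model, collecting three of four gradients never waits for the straggler, so the second average stays at or below 1.1 seconds while the first one is pulled up by every stall.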

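There is one more case: the compute count is smaller than the replica count. Then some workers compute more than one batch per update. Picture three people who each receive one paper (nothing is left over), yet four papers have to be handed in. Everyone works on their own paper first, whoever finishes first takes on another one, and the round ends once the three of them have handed in four papers in total. What is this good for? I am not sure either, so let's move on.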
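The three cases described so far (equal counts, backup workers, and workers computing several batches) can be summarised in a few lines. This is a minimal sketch that only restates the docstring; the function name `describe_sync_config` and the labels it returns are hypothetical, not part of TensorFlow.

```python
def describe_sync_config(total_num_replicas, replicas_to_aggregate):
    """Classify a (compute count, replica count) pair as the docstring does."""
    if total_num_replicas > replicas_to_aggregate:
        # Surplus workers act as backups; their late gradients are dropped.
        return "backup", total_num_replicas - replicas_to_aggregate
    if total_num_replicas < replicas_to_aggregate:
        # Some replica must contribute more than one batch per update.
        return "multi_batch", replicas_to_aggregate - total_num_replicas
    return "one_to_one", 0


print(describe_sync_config(4, 3))  # four people, three papers -> ('backup', 1)
print(describe_sync_config(3, 4))  # three people, four papers -> ('multi_batch', 1)
print(describe_sync_config(3, 3))  # three people, three papers -> ('one_to_one', 0)
```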

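So far we have talked about the compute count and the replica count as if they were tied to the physical number of machines (the number of GPUs), but in fact neither has to equal it. For example:

Number of GPUs: 3, so len(worker) is 3

Compute count: 10, meaning 10 batches (each of size batch_size) are handed out per step for the 3 workers to compute

Replica count: 8, meaning that once those 10 batches have been computed only 8 gradients are kept; they are averaged, and the parameter server then carries on with the rest of the update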
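The numbers in this example can be played through with a toy aggregator. This is a sketch under assumptions: the round-robin split of batches across workers, the one-dimensional "gradients" and the helper name `aggregate_first_k` are all invented for illustration, and the real optimizer hands out work differently.

```python
def aggregate_first_k(gradients, k):
    """Average the first k gradients to arrive; later ones are discarded."""
    if len(gradients) < k:
        raise ValueError("not enough gradients for one update")
    used = gradients[:k]
    dim = len(used[0])
    mean = [sum(g[i] for g in used) / k for i in range(dim)]
    return mean, len(gradients) - k


# Numbers from the example: 3 workers, 10 batches per step, 8 gradients kept.
num_workers, batches_per_step, grads_needed = 3, 10, 8
# Assumed round-robin split: batch b goes to worker b % num_workers.
assignment = [b % num_workers for b in range(batches_per_step)]
# Toy one-dimensional gradients, listed in their (assumed) arrival order.
arrived = [[float(b)] for b in range(batches_per_step)]

mean, dropped = aggregate_first_k(arrived, grads_needed)
print([assignment.count(w) for w in range(num_workers)])  # [4, 3, 3]
print(mean, dropped)  # [3.5] 2
```

Each worker ends up computing three or four batches per step, and the two gradients that arrive last never reach the averaged update.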

If you spot any problems or mistakes, please leave a comment and point them out. Thanks.
