
Numerical Computation: LBFGS

RuiH.AI, published 2021-12-16 21:28:15

Contents:
  • Preface
  • Quasi-Newton Methods
  • The LBFGS Algorithm
  • Code Example
  • Postscript

Preface

This post covers the final installment of the nonlinear optimization series: the LBFGS algorithm.

Quasi-Newton Methods

The previous post covered the idea behind quasi-Newton methods: approximate the inverse Hessian $H_k^{-1}$ with an iteratively updated matrix $G_k$ satisfying the quasi-Newton condition

$$G_{k+1}\left(J(\mathbf{x}_{k+1})-J(\mathbf{x}_k)\right)=\mathbf{x}_{k+1}-\mathbf{x}_k$$

Writing $s_k = \mathbf{x}_{k+1}-\mathbf{x}_k$ and $y_k = J(\mathbf{x}_{k+1})-J(\mathbf{x}_k)$, construct an update of the form

$$G_{k+1}=G_k+P_k+Q_k, \qquad G_{k+1}y_k = s_k = G_ky_k+P_ky_k+Q_ky_k$$

The DFP algorithm approximates $H_k^{-1}$ with $G_k$:

$$G_{k+1} = G_k+ \frac{s_ks_k^T}{s_k^Ty_k} - \frac{G_ky_ky_k^TG_k}{y_k^TG_ky_k}$$

The BFGS algorithm approximates $H_k^{-1}$ with $B_k^{-1}$:

$$B_{k+1}^{-1} = \left(I - \frac{s_ky_k^T}{y_k^Ts_k}\right)B_k^{-1}\left(I-\frac{y_ks_k^T}{y_k^Ts_k}\right) + \frac{s_ks_k^T}{y_k^Ts_k}$$
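As a quick numerical check of the two update formulas above (my own sketch, not from the original post), the snippet below performs one DFP update and one BFGS inverse-Hessian update on a toy quadratic and verifies that both satisfy the quasi-Newton condition $G_{k+1}y_k = s_k$:

import numpy as np

def dfp_update(G, s, y):
    # DFP: G + s s^T / (s^T y) - (G y)(G y)^T / (y^T G y); G stays symmetric,
    # so (G y)(G y)^T equals G y y^T G
    Gy = G @ y
    return G + np.outer(s, s) / (s @ y) - np.outer(Gy, Gy) / (y @ Gy)

def bfgs_inverse_update(H, s, y):
    # BFGS: (I - rho s y^T) H (I - rho y s^T) + rho s s^T, with rho = 1/(y^T s)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# One step on f(x) = 0.5 x^T A x, whose true Hessian is A (toy values)
A = np.array([[2.0, 0.5], [0.5, 1.0]])
x0, x1 = np.array([1.0, 1.0]), np.array([0.6, 0.8])
s, y = x1 - x0, A @ x1 - A @ x0

for update in (dfp_update, bfgs_inverse_update):
    G = update(np.eye(2), s, y)
    print(update.__name__, np.allclose(G @ y, s))  # both print True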

During the iteration, the approximation matrix $G_k$ or $B_k$ must be stored. When the input dimension is large, the memory this dense matrix requires is substantial: for an input dimension of 10000, storing $G$ in single precision takes $10000^2 \times 4\,\mathrm{B} = 400\,\mathrm{MB}$. Large-scale optimization problems therefore typically use the LBFGS algorithm to reduce memory consumption.

The LBFGS Algorithm

L-BFGS (limited-memory BFGS) is an improvement on BFGS. The main idea is not to store the BFGS iterate $B_k^{-1}$ itself, but rather the vectors produced while computing $B_k^{-1}$; the matrix can be recovered from those vectors, saving memory.

Start by rewriting the BFGS update of $B_{k+1}^{-1}$ in compact form:

$$B_{k+1}^{-1} = \left(I - \frac{s_ky_k^T}{y_k^Ts_k}\right)B_k^{-1}\left(I-\frac{y_ks_k^T}{y_k^Ts_k}\right) + \frac{s_ks_k^T}{y_k^Ts_k}$$

Set

$$\rho_k = \frac{1}{y^T_ks_k}, \qquad V_k=I-\rho_k s_ky_k^T, \qquad H_k=B^{-1}_k$$

so that

$$H_{k+1}=V_{k}H_kV^T_k+\rho_ks_ks_k^T$$

Then expand $H_{k+1}$ recursively:

$$\begin{aligned}
H_{k+1} &= V_kV_{k-1}H_{k-1}V_{k-1}^TV_k^T + V_k\rho_{k-1}s_{k-1}s_{k-1}^TV_k^T + \rho_ks_ks_k^T \\
&= (V_kV_{k-1}\dots V_{0})H_0(V_kV_{k-1}\dots V_{0})^T \\
&\quad + (V_kV_{k-1}\dots V_{1})\rho_0s_0s_0^T(V_kV_{k-1}\dots V_{1})^T \\
&\quad + (V_kV_{k-1}\dots V_{2})\rho_1s_1s_1^T(V_kV_{k-1}\dots V_{2})^T \\
&\quad + \dots \\
&\quad + V_k\rho_{k-1}s_{k-1}s_{k-1}^TV_k^T + \rho_ks_ks_k^T
\end{aligned}$$

If the intermediate vectors $s_i, y_i$ from every iteration are kept, memory consumption still grows as the iterations accumulate. Some of them must therefore be discarded, keeping only the vectors from the most recent $m$ iterations as an approximation:

$$\begin{aligned}
H_{k+1} \approx{}& (V_kV_{k-1}\dots V_{k-m+1})H_{k-m+1}(V_kV_{k-1}\dots V_{k-m+1})^T \\
&+ (V_kV_{k-1}\dots V_{k-m+2})\rho_{k-m+1}s_{k-m+1}s_{k-m+1}^T(V_kV_{k-1}\dots V_{k-m+2})^T \\
&+ \dots \\
&+ V_k\rho_{k-1}s_{k-1}s_{k-1}^TV_k^T + \rho_ks_ks_k^T
\end{aligned}$$

where the oldest matrix $H_{k-m+1}$ is set to a scalar multiple of the identity:

$$H_{k-m+1}=\frac{s_k^Ty_k}{y_k^Ty_k}I$$
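As a sanity check (my own sketch, not part of the original post), the expansion can be evaluated by simply replaying the stored pairs: starting from $H_0$ and applying $H \leftarrow V_iHV_i^T+\rho_is_is_i^T$ from oldest to newest reproduces the product form above, and the newest secant condition $Hy_k=s_k$ then holds by construction:

import numpy as np

def reconstruct_H(pairs, H0):
    # Replay H <- V_i H V_i^T + rho_i s_i s_i^T over the stored (s_i, y_i)
    # pairs, ordered oldest to newest
    H = H0.copy()
    n = H0.shape[0]
    for s, y in pairs:
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
    return H

# Two toy (s, y) pairs with positive curvature s^T y > 0 (made-up values)
pairs = [(np.array([0.4, 0.2]), np.array([0.9, 0.4])),
         (np.array([0.1, 0.3]), np.array([0.25, 0.35]))]
H = reconstruct_H(pairs, np.eye(2))
s_k, y_k = pairs[-1]
print(np.allclose(H @ y_k, s_k))  # True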

That completes the basic idea of LBFGS. However, while computing $H_{k+1}$ this way saves matrix storage, the procedure is convoluted and computationally expensive. LBFGS therefore uses a clever two-loop recursion to streamline the computation of $H_{k+1}$:

[Figure: the two-loop recursion algorithm; original image not preserved]

Here $g_k$ denotes the current gradient, and the resulting $z$ directly gives the update increment $-H_{k+1}g_k$.
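Since the figure is not preserved here, below is the standard two-loop recursion as it is usually stated (e.g., in Nocedal and Wright's Numerical Optimization); it computes $z=H_{k+1}g_k$ directly from the stored pairs without ever forming a matrix:

import numpy as np

def two_loop_recursion(grad, pairs):
    # pairs holds the retained (s_i, y_i), ordered oldest to newest
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):                 # first loop: newest -> oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q = q - alpha * y
        alphas.append(alpha)
    s_k, y_k = pairs[-1]
    gamma = (s_k @ y_k) / (y_k @ y_k)            # H_{k-m+1} = gamma * I
    z = gamma * q
    for (s, y), alpha in zip(pairs, reversed(alphas)):  # second loop: oldest -> newest
        rho = 1.0 / (y @ s)
        beta = rho * (y @ z)
        z = z + s * (alpha - beta)
    return z                                     # z = H_{k+1} @ grad

The descent step is then $\mathbf{x}_{k+1}=\mathbf{x}_k-\alpha z$ for some step size $\alpha$; with the toy pairs from the previous sketch, this returns the same vector as reconstruct_H(pairs, gamma * np.eye(2)) @ grad.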

As this shows, LBFGS is essentially a trade of time for space: it accepts extra computation in exchange for a much smaller memory footprint.

Code Example

Supplementary code, once again an LBFGS implementation written in Python:

import numpy as np
import scipy.optimize
import time
import matplotlib.pyplot as plt


# Forward-difference approximation of the first-order partials of F at (x, y)
def partial_derivate_xy(x, y, F):
    dx = (F(x + 0.001, y) - F(x, y))/0.001
    dy = (F(x, y + 0.001) - F(x, y))/0.001
    return dx, dy


# Forward-difference second-order partials (not used by the LBFGS routine below)
def partial_partial_derivate_xy(x, y, F):
    dx, dy = partial_derivate_xy(x, y, F)
    dxx = (partial_derivate_xy(x + 0.001, y, F)[0] - dx) / 0.001
    dyy = (partial_derivate_xy(x, y + 0.001, F)[1] - dy) / 0.001
    dxy = (partial_derivate_xy(x, y + 0.001, F)[0] - dx) / 0.001
    dyx = (partial_derivate_xy(x + 0.001, y, F)[1] - dy) / 0.001
    return dxx, dyy, dxy, dyx


def non_linear_func(x, y):
    fxy = 0.5 * (x ** 2 + y ** 2)
    return fxy


def non_linear_func_2(x, y):
    fxy = x*x + 2*y*y + 2*x*y + 3*x - y - 2
    return fxy


def non_linear_func_3(x, y):
    fxy = 0.5 * (x ** 2 - y ** 2)
    return fxy


def non_linear_func_4(x, y):
    fxy = x**4 + 2*y**4 + 3*x**2*y**2 + 4*x*y**2 + x*y + x + 2*y + 0.5
    return fxy


def non_linear_func_5(x, y):
    fxy = (np.exp(x) + np.exp(0.5 * y) + x)**2
    return fxy


def quasi_newton_LBFGS_optim(x, y, F, g_list, lr=0.01, m=10):
    # One LBFGS step. g_list holds up to m recent (s_i, y_i) pairs, each
    # stored as a 2-row array [s_i; y_i], oldest first; it is updated in place.
    dx, dy = partial_derivate_xy(x, y, F)
    grad = np.array([dx, dy])

    if len(g_list) == 0:
        G_0 = np.eye(2)
        z = np.matmul(G_0, grad)
    else:
        # First loop: walk the stored pairs from newest to oldest,
        # computing alpha_i and folding it into q
        alpha_list = []
        q = grad
        for i in np.arange(len(g_list) - 1, -1, -1):
            s_i = g_list[i][0, :]
            y_i = g_list[i][1, :]
            p_i = 1. / np.dot(y_i.T, s_i)
            alpha_i = p_i * np.dot(s_i.T, q)
            q = q - alpha_i * y_i
            alpha_list.append(alpha_i)
        # Scaled initial matrix H_{k-m+1} = (s_k^T y_k) / (y_k^T y_k) * I,
        # built from the newest stored pair
        G_0 = np.dot(g_list[-1][0, :].T, g_list[-1][1, :]) / np.dot(g_list[-1][1, :].T, g_list[-1][1, :]) * np.eye(2)
        z = np.matmul(G_0, q)
        # Second loop: oldest to newest; alpha_list was filled newest-first,
        # so it must be indexed in reverse to pair each alpha_i with its (s_i, y_i)
        for i in np.arange(0, len(g_list), 1):
            s_i = g_list[i][0, :]
            y_i = g_list[i][1, :]
            p_i = 1. / np.dot(y_i.T, s_i)
            beta_i = p_i * np.dot(y_i.T, z)
            z = z + s_i * (alpha_list[len(g_list) - 1 - i] - beta_i)

    # Use z only as a search direction and take a fixed-size step (no line search)
    z_direction = z / np.linalg.norm(z)
    vec_delta = -lr * z_direction
    vec_opt = np.array([x, y]) + vec_delta
    x_opt = vec_opt[0]
    y_opt = vec_opt[1]

    # Gradient at the new point, used to form y_k = J(x_{k+1}) - J(x_k)
    dx_new, dy_new = partial_derivate_xy(x_opt, y_opt, F)
    grad_delta = np.array([dx_new - dx, dy_new - dy])

    # Keep only the most recent m pairs (the "limited-memory" part); the
    # original listing is truncated here, so this tail is a best-guess completion
    if len(g_list) >= m:
        g_list.pop(0)
    g_list.append(np.array([vec_delta, grad_delta]))

    return x_opt, y_opt
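Since the listing ends mid-function above, here is a small hypothetical driver for completeness (my own addition, not from the original post): it runs a fixed number of steps on non_linear_func_2 and cross-checks the minimizer against scipy.optimize.minimize with method='L-BFGS-B', which also puts the time and scipy.optimize imports to use. The iteration count and starting point are illustrative assumptions.

if __name__ == '__main__':
    # non_linear_func_2 is convex with its minimum at (-3.5, 2)
    x, y = 0.0, 0.0
    g_list = []
    t0 = time.time()
    for _ in range(3000):
        # each call appends the newest (s_k, y_k) pair and trims g_list to m
        x, y = quasi_newton_LBFGS_optim(x, y, non_linear_func_2, g_list)
    print('LBFGS:', x, y, non_linear_func_2(x, y), '%.3fs' % (time.time() - t0))

    # Cross-check against scipy's reference L-BFGS-B implementation
    res = scipy.optimize.minimize(lambda v: non_linear_func_2(v[0], v[1]),
                                  np.zeros(2), method='L-BFGS-B')
    print('scipy:', res.x, res.fun)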