- Introduction
- Review: Narrow EM and Generalized EM
- The narrow EM procedure and its difficulty
- How generalized EM handles the problem
- Extensions of generalized EM
- Coordinate ascent
- Variant forms within the EM framework
The previous section introduced the generalized EM algorithm; this section covers other variant forms within the EM framework.
## Review: Narrow EM and Generalized EM

### The narrow EM procedure and its difficulty

Recall the decomposition derived for the narrow EM algorithm:
$$\log P(\mathcal X \mid \theta) = ELBO + \mathcal{KL}[\mathcal Q(\mathcal Z) \| P(\mathcal Z \mid \mathcal X,\theta)]$$
where
$$ELBO = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z = \mathbb E_{\mathcal Q(\mathcal Z)}[\log P(\mathcal X,\mathcal Z \mid \theta)] + \mathcal H[\mathcal Q(\mathcal Z)]$$
$$\mathcal{KL}[\mathcal Q(\mathcal Z) \| P(\mathcal Z \mid \mathcal X,\theta)] = - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \log \frac{P(\mathcal Z \mid \mathcal X,\theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z$$
Its iterative search for the optimal model parameter $\theta^{(t+1)}$ reduces to the following two steps:
- Set $\mathcal Q(\mathcal Z) = P(\mathcal Z \mid \mathcal X,\theta)$;
- Based on step 1, choose the $\hat \theta$ that maximizes the ELBO:
$$\hat \theta = \mathop{\arg\max}\limits_{\theta} \left\{ \mathbb E_{\mathcal Q(\mathcal Z)}[\log P(\mathcal X,\mathcal Z \mid \theta)] + \mathcal H[\mathcal Q(\mathcal Z)]\right\} \quad \left(\mathcal H[\mathcal Q(\mathcal Z)] = -\int_{\mathcal Z}\mathcal Q(\mathcal Z) \log \mathcal Q(\mathcal Z)\, d\mathcal Z\right)$$
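When the posterior $P(\mathcal Z \mid \mathcal X,\theta)$ is tractable, the two steps above can be run directly. As an illustrative sketch (not from the source), here is narrow EM on a two-component 1-D Gaussian mixture, where the E-step posterior has a closed form; the synthetic data and initial values are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians with means -2 and 3
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

# Initial parameters theta = (pi, mu, sigma) for K = 2 components
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: Q(Z) = P(Z | X, theta), the closed-form posterior responsibilities
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    gamma = dens / dens.sum(axis=1, keepdims=True)        # shape (N, K)
    # M-step: theta = argmax E_{Q(Z)}[log P(X, Z | theta)], closed form for a GMM
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(np.sort(mu))   # component means recover roughly [-2, 3]
```

Each E-step makes the KL term zero for the current $\theta$, and each M-step raises the ELBO, so the log-likelihood never decreases.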
The difficulty of narrow EM is that $P(\mathcal Z \mid \mathcal X,\theta)$ may be intractable. The root cause is that the distribution $P(\mathcal Z)$ of the latent variable $\mathcal Z$ we define can itself be very complex.
### How generalized EM handles the problem

The core idea of generalized EM: since $P(\mathcal Z \mid \mathcal X,\theta)$ cannot be solved for directly, find the distribution $\hat{\mathcal Q}(\mathcal Z)$ closest to $P(\mathcal Z \mid \mathcal X,\theta)$ and use it in place of $P(\mathcal Z \mid \mathcal X,\theta)$:
- View the ELBO as a function of $\mathcal Q(\mathcal Z)$ and $\theta$:
$$\mathcal L[\mathcal Q(\mathcal Z),\theta] = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \frac{P(\mathcal X,\mathcal Z \mid \theta)}{\mathcal Q(\mathcal Z)}\, d\mathcal Z$$
- Based on the above, $\log P(\mathcal X \mid \theta)$ can be written as:
$$\log P(\mathcal X \mid \theta) = \mathcal L[\mathcal Q(\mathcal Z),\theta] + \mathcal{KL}[\mathcal Q(\mathcal Z) \| P(\mathcal Z \mid \mathcal X,\theta)] \geq \mathcal L[\mathcal Q(\mathcal Z),\theta]$$
Since $\mathcal{KL}[\mathcal Q(\mathcal Z) \| P(\mathcal Z \mid \mathcal X,\theta)] \geq 0$ always holds, $\log P(\mathcal X \mid \theta) \geq \mathcal L[\mathcal Q(\mathcal Z),\theta]$ always holds.
- Taking iteration $t+1$ as an example, the parameter $\theta^{(t)}$ from the previous iteration is known, so $\log P(\mathcal X \mid \theta^{(t)})$ is a fixed constant. Therefore $\hat{\mathcal Q}(\mathcal Z)$ satisfies:
$$\begin{aligned} \hat {\mathcal Q}(\mathcal Z) & = \mathop{\arg\min}\limits_{\mathcal Q(\mathcal Z)} \mathcal{KL}[\mathcal Q(\mathcal Z) \| P(\mathcal Z \mid \mathcal X,\theta^{(t)})] \\ & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z),\theta^{(t)}] \end{aligned}$$
- With the optimal latent posterior $\hat{\mathcal Q}(\mathcal Z)$ for iteration $t+1$ found, set $\mathcal Q^{(t+1)}(\mathcal Z) = \hat{\mathcal Q}(\mathcal Z)$ and proceed to the M-step:
At this point the parameters inside $\mathcal H[\mathcal Q^{(t+1)}(\mathcal Z)]$ are fixed, so it can be treated as a constant:
$$\begin{aligned} \theta^{(t+1)} & = \mathop{\arg\max}\limits_{\theta} \mathcal L[\mathcal Q^{(t+1)}(\mathcal Z),\theta]\\ & = \mathop{\arg\max}\limits_{\theta} \left\{\mathbb E_{\mathcal Q^{(t+1)}(\mathcal Z)}[\log P(\mathcal X,\mathcal Z \mid \theta)] + \mathcal H[\mathcal Q^{(t+1)}(\mathcal Z)]\right\} \\ & = \mathop{\arg\max}\limits_{\theta} \mathbb E_{\mathcal Q^{(t+1)}(\mathcal Z)}[\log P(\mathcal X,\mathcal Z \mid \theta)] \end{aligned}$$
In summary, the generalized EM iteration is:
$$\begin{cases} \mathcal Q^{(t+1)}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z),\theta^{(t)}] \\ \theta^{(t+1)} = \mathop{\arg\max}\limits_{\theta} \mathcal L[\mathcal Q^{(t+1)}(\mathcal Z),\theta] \end{cases}$$
## Extensions of Generalized EM

### Coordinate ascent

Because of this iterative structure, generalized EM is also known as an MM algorithm (Minorize-Maximization). Whether solving for $\mathcal Q(\mathcal Z)$ or for $\theta$, both steps maximize the same objective $\mathcal L[\mathcal Q(\mathcal Z),\theta]$.
Looking at the iteration again:
- First fix the variable $\theta$ in $\mathcal L[\mathcal Q(\mathcal Z),\theta]$ and choose $\mathcal Q(\mathcal Z) = \hat{\mathcal Q}(\mathcal Z)$ to maximize $\mathcal L[\mathcal Q(\mathcal Z),\theta]$;
- Then, with $\mathcal Q(\mathcal Z) = \hat{\mathcal Q}(\mathcal Z)$ fixed, choose $\theta = \hat\theta$ to maximize $\mathcal L[\mathcal Q(\mathcal Z),\theta]$ again.
The iteration therefore produces a sequence:
$$\theta_{init},\ \mathcal Q^{(1)}(\mathcal Z),\ \theta^{(1)},\ \mathcal Q^{(2)}(\mathcal Z),\ \theta^{(2)},\ \cdots,\ \theta^{(t)},\ \mathcal Q^{(t+1)}(\mathcal Z),\ \theta^{(t+1)},\ \cdots$$
This scheme of alternately fixing one variable of the same objective while maximizing over the other is called Coordinate Ascent; its counterpart for minimization is Coordinate Descent, depending on whether the function is being maximized or minimized.
From a dimensional viewpoint, the main idea is that each iteration optimizes along only one coordinate direction while holding the others fixed, turning a multivariate optimization problem into a sequence of univariate ones.
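The alternation above can be sketched on a small concrete objective. In this illustrative example (the function and starting point are assumptions, not from the source), we maximize a concave quadratic by exactly solving for one coordinate at a time, mirroring the $\mathcal Q(\mathcal Z)$ / $\theta$ alternation of generalized EM:

```python
# Coordinate ascent on the concave quadratic f(a, b) = -(a-1)^2 - (b+2)^2 - a*b.
# Each step fixes one coordinate and maximizes exactly over the other.
def f(a, b):
    return -(a - 1) ** 2 - (b + 2) ** 2 - a * b

a, b = 0.0, 0.0                      # arbitrary initialization
history = [f(a, b)]
for _ in range(30):
    a = 1 - b / 2                    # argmax_a f(a, b): df/da = -2(a-1) - b = 0
    b = -2 - a / 2                   # argmax_b f(a, b): df/db = -2(b+2) - a = 0
    history.append(f(a, b))

# The objective is non-decreasing at every sweep, just like the ELBO in EM
assert all(h2 >= h1 - 1e-12 for h1, h2 in zip(history, history[1:]))
print(a, b)                          # converges to the maximizer (8/3, -10/3)
```

Each inner update is a closed-form one-dimensional maximization, which is exactly what makes coordinate ascent attractive: the joint problem may be awkward, but each slice is easy.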
## Variant Forms within the EM Framework

The core of the whole EM framework is still solving for the optimal posterior of the latent variable $\mathcal Z$:
$$\hat P(\mathcal Z \mid \mathcal X,\theta)$$
Since $P(\mathcal Z \mid \mathcal X,\theta)$ is hard to compute, the generalized EM framework approximates it via:
$$\mathcal Q^{(t+1)}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z),\theta^{(t)}]$$
For approximating $P(\mathcal Z \mid \mathcal X,\theta)$ alone, other approaches exist as well, namely approximate inference:
- Deterministic approximation: Variational Inference (VI); the corresponding EM framework is called VBEM (Variational Bayesian EM).
- Stochastic approximation: Monte Carlo methods; the corresponding EM framework is called MCEM (Monte Carlo EM).
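The MCEM idea can be sketched on a toy model where the posterior happens to be known, so sampling is easy to demonstrate. In this illustrative example (the model $z_n \sim \mathcal N(\theta,1)$, $x_n \mid z_n \sim \mathcal N(z_n,1)$ and all numbers are assumptions for the demo), the E-step expectation is replaced by an average over posterior samples:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0
# Latent z_n ~ N(theta, 1); observed x_n = z_n + N(0, 1) noise
N = 2000
z_latent = rng.normal(theta_true, 1.0, N)
x = z_latent + rng.normal(0.0, 1.0, N)

theta = 0.0                            # initialization
for _ in range(20):
    # Monte Carlo E-step: draw samples from the posterior
    # P(z_n | x_n, theta) = N((x_n + theta)/2, 1/2) and use them in place of Q(Z)
    z = rng.normal((x + theta) / 2, np.sqrt(0.5), size=(100, N))
    # M-step: argmax_theta of the sampled complete-data log-likelihood
    # sum_n -(z_n - theta)^2 / 2, which is the sample mean of z
    theta = z.mean()

print(theta)   # approaches theta_true = 2.0
```

In real MCEM the posterior has no closed form and the samples come from a method such as MCMC; the structure of the iteration is otherwise the same.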
This concludes the introduction to the EM algorithm. The next section introduces the first probabilistic generative model solved with EM: the Gaussian Mixture Model.
Related reference: 机器学习-EM算法6(EM的变种)