Bishop PRML - Ch1. Introduction (6)

Posted on September 06, 2020

Information Theory

information
- information(surprisal)
  - $h(x) = -\ln{p(x)}$
- entropy
  - $H[x] = -\sum_x{p(x)\ln{p(x)}} = -\int_x{p(x)\ln{p(x)}\,dx}$
  - expected information (expected surprisal)
thermodynamics
- 총 입자수 $N$ , $i$ 싱태에 $n_i$ 개의 입자가 속함
- multiplicity
  - $W = {N!\over{\prod_in_i!}}$ (Linus Pauling, 1969)
  - the number of microstates corresponding to a macrostate
  - N개의 입자가 가질 수 있는 총 상태의 수
- entropy
  - $S = {R\over{N}}\ln{W} = k\ln{W}$ ( $JK^{-1}$ ) (Ralph Baierlein, 1999)
  - $H = {1\over{N}}\ln{W}$
    - 만약 온도를 섭씨의 스케일을 따른 절대온도를 쓰지 않고, 적절하게 스케일링한 온도 ( $R * \text{absolute temperature}$ )를 쓸 경우
  - $H = {1\over{N}}\ln{W} = {1\over{N}}\ln{N!} - {1\over{N}}\sum_i{\ln{n!}}$
  - $H = -\lim_{N\to{\infty}}\sum_i{\left({n_i\over{N}}\right)\ln{\left({n_i\over{N}}\right)}} = -\sum_i{p(x_i)\ln{p(x_i)}}$
    - $\sum_i{n_i} = N$
    - $p(x_i) = \lim_{N\to{\infty}}(n_i/N)$ (입자가 $x_i$ 상태에 속할 확률)
    - $n_i/N$ 비를 유지시키면서 $N\to{\infty}$
      - $\ln{N}!\simeq{N\ln{N}-N}$ (Stirling’s approximation)
discrete distribution에서 entropy의 성질
- $0 \leq H$
- $p(x_i) = 1, p(x_{j\neq{i}})=0$ 일때 $H=0$
- $p(X)$ 가 uniform할 때 $H$ 가 최대
  - Lagrange multiplier 통해 증명
    - functional
      - $\mathcal{L} = -\sum_i{p(x_i)\ln{p(x_i)}}+\lambda(\sum_ip(x_i)-1)$
    - stationary point를 찾으면 모든 $p(x_i)$ 값이 같은 경우가 됨
    - second derivative를 구하면 음수로, 최대치임을 확인 가능
      - ${\partial^2{\tilde{H}}\over{\partial{p(x_i)}\partial{p(x_j)}}} = -I_{ij}{1\over{p_i}}$
  - Jensen’s inequality를 통해서도 유도 가능
continuous distribution
- $H[x] = -\int{p(x)\ln{p(x)}\,dx}$
  - measure theory에서 실수 변수를 $\Delta$ 너비의 인터벌로 쪼갠 뒤, 각 인터벌의 분포를 discrete로 가정 후, $\lim_{\Delta\to{0}}$ 를 취하여 continuous한 경우에 대한 식을 얻을 수 있음
- $H[\mathbf{x}] = -\int{p(\mathbf{x})\ln{p(\mathbf{x})}\,d\mathbf{x}}$
  - multivariable
continuous distribution에서 entropy의 성질
- Lagrange multiplier적용하기 위하여 constraint셋업
  - $\int_{-\infty}^\infty{p(x)\,dx} = 1$
  - $\int_{-\infty}^\infty{xp(x)\,dx} = \mu$
  - $\int_{-\infty}^\infty{(x-\mu)^2p(x)\,dx} = \sigma^2$
- functinal
  - $\mathcal{L} = -\int_{-\infty}^\infty{p(x)\ln{p(x)}\,dx}\\ +\lambda_1\left(\int_{-\infty}^\infty{p(x)\,dx}-1\right)\\ +\lambda_2\left(\int_{-\infty}^\infty{xp(x)\,dx}-\mu\right)\\ +\lambda_3\left(\int_{-\infty}^\infty{(x-\mu)^2p(x)\,dx}-\sigma^2\right)$
- stationary point
  - $p(x) = \exp\{-1+\lambda_1+\lambda_2x+\lambda_3(x-\mu)^2\}$
  - $p(x) = {1\over{(2\pi\sigma^2)^{1/2}}}\exp\{-{(x-\mu)^2\over{2\sigma^2}}\}$
    - constraint 이용
- 가우시안 분포 하에서 엔트로피 최대가 된다는 것 확인 가능
- 가우시안 분포 하에서의 엔트로피
  - $H[x] = {1\over{2}}\{1+\ln(2\pi\sigma^2)\}$
    - $\sigma^2$ 이 클 수록 엔트로피가 증가
    - $\sigma^2 < 1/(2{\pi}e)$ 일 때, $H(x)<0$
conditional entropy
- $H[\mathbf{y}{\mid}\mathbf{x}] = -\int\int{p(\mathbf{y},\mathbf{x})\ln{p(\mathbf{y}{\mid}\mathbf{x})\,d\mathbf{y}d\mathbf{x}}}$
- conditional probability에 대한 entropy
joint entropy
- $H[\mathbf{x},\mathbf{y}] = H[\mathbf{y}{\mid}\mathbf{x}] + H[\mathbf{x}]$
- $\mathbf{x}$ 와 $\mathbf{y}$ 를 특정하기 위한 정보량은 $\mathbf{x}$ 를 특정하기 위한 정보량과 $\mathbf{x}$ 가 주어졌을 때 $\mathbf{y}$ 를 특정하기 위한 정보량의 합 (덧셈임에 주의)
Kullback-Leibler divergence (relative entropy)
- $D_\mathrm{KL}(p{\parallel}q) = -\int{p(\mathbf{x})\ln{q(\mathbf{x})}\,d\mathbf{x}}-\left(-\int{p(\mathbf{x})\ln{p(\mathbf{x})}\,d\mathbf{x}}\right)\\ = -\int{p(\mathbf{x})\ln\left\{q(\mathbf{x})\over{p(\mathbf{x})}\right\}\,d\mathbf{x}}$
- 비대칭임에 주의 ( $D_\mathrm{KL}(p{\parallel}q){\neq}D_\mathrm{KL}(q{\parallel}p)$ )
- 분포의 dissmilarity 척도
  - 증명 : Jensen’s inequality
    - for convex function $f(x)$
    - $f\left(\sum_{i=1}^M\lambda_ix_i\right) \leq \sum_{i=1}^M\lambda_if(x_i)\quad(\lambda_i \geq 0, \sum_i\lambda_i=1)$
    - $f(\mathbb{E}[x])\leq\mathbb{E}[f(x)]$
    - $f\left(\int{\mathbf{x}p(\mathbf{x})\,dx}\right)\leq\int{f(\mathbf{x})p(\mathbf{x})d\mathbf{x}}$
    - $D_\mathrm{KL}(p{\parallel}q) = -\int{p(\mathbf{x})\ln\left\{q(\mathbf{x})\over{p(\mathbf{x})}\right\}\,d\mathbf{x}} \geq -\ln\int{q(\mathbf{x})\,d\mathbf{x}} = 0$
      - $-\ln{x}$ 는 convex, $\int{q(\mathbf{x})\,d\mathbf{x}}=1$
- $-\int{p(\mathbf{x})\ln{q(\mathbf{x})}\,d\mathbf{x}}$
  - $p$ 의 entropy( $-\int{p(\mathbf{x})\ln{p(\mathbf{x})}\,d\mathbf{x}}$ )를 constant 취급할 경우, 앞 항( $-\int{p(\mathbf{x})\ln{q(\mathbf{x})}\,d\mathbf{x}}$ )만 남음
  - $p$ 가 조건부 레이블 분포, $q$ 가 조건부 예측 분포라고 할 때, 이 값은 곧 cross-entropy가 됨
    - 이 값은 곧 negative log likelihood와 동일
    - multinomial distribution(classification 문제)에서는 곧 cross-entropy error
    - gaussian distribution(regression 문제)에서는 곧 SSE
- 조절 가능한 패러미터 $\boldsymbol{\theta}$ 에 종속된 parametric distribution $q(\mathbf{x}{\mid}\boldsymbol{\theta})$ 를 통하여 알려지지 않은 분포 $p(\mathbf{x})$ 를 찾는 상황을 가정
  - $p(\mathbf{x})$ 는 모르지만 $p(\mathbf{x})$ 에서 샘플링된 학습 데이터셋 $(x_1,\cdots,x_N)$ 은 있는 상태, 데이터셋을 통하여 $p(\mathbf{x})$ 의 기대값 근사 가능
  - 이 때, KLD를 구하면
    - $D_\mathrm{KL}(p{\parallel}q)\simeq{{1\over{N}}\sum_{n=1}^N\{-\ln{q(\mathbf{x}_n{\mid}\boldsymbol{\theta})}+\ln{p(\mathbf{x}_n)}\}}$
  - 두 번째 항은 $\boldsymbol{\theta}$ 에 대하여 독립, 첫 번째 항은 negative log likelihood
mutual information
- $I[\mathbf{x},\mathbf{y}] = D_\mathrm{KL}(p(\mathbf{x},\mathbf{y}){\parallel}p(\mathbf{x})p(\mathbf{y}))\\ =-\int\int{p(\mathbf{x},\mathbf{y})\ln\left({p(\mathbf{x})p(\mathbf{y})\over{p(\mathbf{x},\mathbf{y})}}\right)\,d\mathbf{x}d\mathbf{y}}$
- 두 변수가 얼마나 독립적인지 척도
- $I[\mathbf{x},\mathbf{y}]\geq{0}$
- $\mathbf{x}$ 와 $\mathbf{y}$ 가 서로 독립일 때 $I[\mathbf{x},\mathbf{y}] = {0}$
- $I[\mathbf{x},\mathbf{y}] = H[\mathbf{x}]-H[\mathbf{x}{\mid}\mathbf{y}] = H[\mathbf{x}]-H[\mathbf{y}{\mid}\mathbf{x}]$

Machine Learning

ML Book Bishop PRML