기본개념 미정리 문서

Posted on August 18, 2018

KAIST 문인철교수 강의 / Utah Univ cs5350 / Georgia Tech cs7641 / PRML
참고하여 헷갈릴 수 있는 개념 위주로 정리

MLE (Maximum Likelihood Estimation)

MLE의 정확한 의미

observed data의 등장확률을 최대화하는 모수를 찾음

$\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta)$

Binomial에서의 MLE (Thumbtack flip의 예)

$P(D|\theta) = \theta^{a_H}(1-\theta)^{a_T}$

당연히 이항계수가 없음에 주의. 데이터행렬 그 자체의 Likelihood이므로 순서를 구분한다. 단순하게 독립사건확률의 곱으로 표현 ln 함수가 monotonically increasing 하기 때문에

$\ln{P(D|\theta)} = a_H\ln\theta+a_T\ln(1-\theta)$

$\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta) = \underset{\theta}{\operatorname{argmax}}\ln{P(D|\theta)} = \underset{\theta}{\operatorname{argmax}}\{a_H\ln\theta+a_T\ln(1-\theta)\}$

${d\over{d\theta}}\{a_H\ln\theta+a_T\ln(1-\theta)\} = 0$

${a_H\over\theta}-{a_T\over{1-\theta}} = 0$

$\hat\theta = {a_H\over{a_H+a_T}}$

Simple Error Bound (Hoeffding’s inequality)

$P(|\hat\theta-\theta^*|\geq\epsilon)\leq2\exp{(-2N\epsilon^2)}$

$\hat\theta$ 은 probability $P$ , approximately $\epsilon$ 에 대하여 “PAC(Probably Approximate Correct) learning의 결과”

MLE의 단점

Observation에 따라 그 값이 너무 민감하게 변함
이러한 단점이 보완된 방법으로 MAP가 있음

MAP (Maximum a Posteriori Estimation)

MAP의 정확한 의미

observed data에 대하여 최대 확률을 가지는 모수를 찾음

$\hat\theta = \underset{\theta}{\operatorname{argmax}}P(\theta|D)$

$P(\theta|D) = {P(D|\theta)P(\theta)\over{P(D)}}$

$posterior = {likelihood * prior\over{normalizning\quad{constant}}}$

$P(D|\theta) = \theta^{a_H}(1-\theta)^{a_T}$

$P(\theta) = ?$

Beta Distribution Model 가정 : 변수범위 [0,1], likelihood와 곱할 경우 동일한 형태가 됨

$P(\theta) = {\theta^{\alpha-1}(1-\theta)^{\beta-1}\over{B(\alpha,\beta)}}$

$B(\alpha,\beta) = {\Gamma(\alpha)\Gamma(\beta)\over\Gamma(\alpha+\beta)}$

$\Gamma(\alpha) = (\alpha - 1)!$

$B(\alpha,\beta) = const$

$P(\theta|D)\propto\theta^{a_H}(1-\theta)^{a_T}\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{a_H+\alpha-1}(1-\theta)^{a_T+\beta-1}$

$\hat\theta = \underset{\theta}{\operatorname{argmax}}P(\theta|D)$

$\hat\theta = {a_H+\alpha-1\over{a_H+a_T+\alpha+\beta-2}}$

MLE, MAP 되새김질

MLE: likelihood의 최대값을 바로 계산, 관찰된 데이터를 가장 잘 표현하는 분포와 모수를 추정
MAP: likelihood * prior의 최대값을 계산, 미리 prior의 분포와 모수를 추정한 뒤 이와 likelihood를 곱한 값이 최대가 되는 분포와 모수를 추정
(내 생각) prior의 의미: 동전을 던질 때, 5:5가 나올 수도 있지만, 4:6이 나올 수도 있고, 7:3이 나올 수도 있음 이 자체를 확률이라고 생각하면, 사전확률 자체를 특정한 값이 아닌 확률분포의 일종으로 여길 수 있고 이를 통하여 유동적인 연쇄적 확률분포 추정이 가능해지며, 인위적인 prior모델을 반영해 보다 유연한 확률분포 추정이 가능
데이터에 대한 적절한 가정이 있을 경우 관측한 데이터만을 사용하는 것보다 더 우수한 추정 가능 (더 좋은 가정이 있다면 더 좋은 추정이 가능하다)

Distribution

Gaussian Distribution

$f(x;\mu,\sigma) = {1\over(2\pi)^{1/2}\sigma}exp{\left\{-{(x-\mu)^2\sigma^{-2}\over2}\right\}}$

Multivariable Gaussian Distribution

$f(x;\mu,\Sigma) = {1\over(2\pi)^{D/2}|\Sigma|^{1/2}}exp{\left\{-{(x-\mu)^T\Sigma^{-1}(x-\mu)\over2}\right\}}$

특징: long tail … x의 범위 $(-\infty, +\infty)$

Beta Distribution

$f(x;\alpha,\beta) = {x^{\alpha-1}(1-x)^{\beta-1}\over\int_0^1{u^{\alpha-1}(1-u)^{\beta-1}du}} = {x^{\alpha-1}(1-x)^{\beta-1}\over{B(\alpha,\beta)}} = {\Gamma(\alpha+\beta)\over\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$

특징: x의 범위 $[0, 1]$
mean: ${\alpha\over\alpha+\beta}$
variance: ${\alpha\beta\over(\alpha+\beta)^2(\alpha+\beta+1)}$

Binomial Distribution

$f(k;n,p) = {n\choose{k}}p^k(1-p)^{n-k}$

${n\choose{k}} = {n!\over{k!(n-k)!}}$

특징: discrete
mean: $np$
variance: $np(1-p)$

Multinomial Distribution

$f(x_1,\cdots,x_k;n,p_1,\cdots,p_k) = {n!\over{x_1!{\cdots}x_k!}}p_1^{x_1}{\cdots}p_k^{x_k}$

mean: $np_i$
variance: $np_i(1-p_i)$
covariance: $-np_ip_j$

Gamma function

오일러 적분

$\Gamma(x) = \int_0^\infty{t^{x-1}\over{e^t}}dt$

가우스 극한

$\Gamma(x) = \underset{n\rightarrow\infty}{\lim}{1{\cdots}n\over{x\cdots(x+n)}}n^x = \underset{n\rightarrow\infty}{\lim}{n!(x-1)!\over{(x+n)!}}n^x = \underset{n\rightarrow\infty}{\lim}{n!n^x\over{(x+n)!}}(x-1)!$

가우스 극한의 식은 함수를 유일하게 정하기 위함

$\underset{n\rightarrow\infty}{\lim}{n!n^x\over{(x+n)!}}=1$

이므로

$(x-1)!$

로 수렴

오일러 적분식에서 팩토리얼식 유도

$\Gamma(x+1) = \int_0^\infty{t^x\over{e^t}}dt = \left[-t^xe^{-t}\right]_{t=0}^{t=\infty} - \left({\int_0^\infty}-xt^{x-1}e^{-t}dt\right)$

$= \underset{t\rightarrow\infty}{\lim}(-t^xe^{-t}) - 0 + \left({\int_0^\infty}xt^{x-1}e^{-t}dt\right)$

극한항 L'Hospital's Rule 적용 (혹은 exponential이 polynomial보다 증가속도가 크기 때문에 0으로 수렴)

$= x{\int_0^\infty}t^{x-1}e^{-t}dt = x\Gamma(x)$

$\Gamma(x+1) = x\Gamma(x)$

$\Gamma(1) = 1,{\quad}0! = 1$

$\therefore\Gamma(n+1) = n!,{\quad}\Gamma(n) = (n-1)!$

Beta function

$B(x,y) = \int_0^1t^{x-1}(1-t)^{y-1}dt = {\Gamma(x)\Gamma(y)\over\Gamma(x+y)}$

$B(n,m) = {(n-1)!(m-1)!\over(n+m-2)!}$

Rule-Based ML

perfect world

No observation errors
- Training data is error-free
No inconsistent observations
- Training data is noise-free
No stochastic elements in the system we observe
- Target function is deterministic
Full information in the observations to regenerate the system
- Target function is contained in hypothesis set

Find-S Algorithm

initialize h to the most specific in H ...[0]
for instance x in D:
    if y of x is positive: ...[1]
        for feature f in O:
            if (f in h) == (f in x): ...[1-1]
                do nothing
            else: ...[1-2]
                (f in h) = (f in h) U (f in x)

…[0]: hypothesis h를 할 수 있는 한 가장 specific하게 초기설정 (어떤 feature로도 만족시킬 수 없는 경우로 시작)

…[1]: Dataset의 instance x에 대하여, x의 레이블값이 참일 경우

…[1-1]: h의 f와 x의 f가 동일한 경우 그대로

…[1-2]: h의 f와 x의 f가 다를 경우 둘 모두를 포함하는 범위로 h의 f를 수정

점점 범위를 넓혀 나가서 모든 dataset을 만족하는 최소집합(most specific)의 Rule을 찾음

문제점
- 가장 specific한 범위로부터 widen했기 때문에, 구해지는 hypothesis는 가장 specific한 hypothesis일 뿐이고, 이외에도 수많은 hypothesis가 가능
- 수많은 hypothesis를 converge시킬 수 없음
- Version space(the set of possible hypothesis)를 찾는 방법으로 발전
- Candidate Elimination Algorithm: General boundary와 Specific boundary를 모두 찾음

Candidate Elimination Algorithm

initialize S to maximally specific h in H ...[0-1]
initialize G to maximally general h in H ...[0-2]
for instance x in D:
    if y of x is positive:
        generalize S as much as needed to cover o in x
        remove any h in G, for which h(o) != y
    if y of x is negative:
        specialize G as much as needed to exclude o in x
        remove any h in S, for which h(p) = y
generate h that satisfies (s in S, g in G, g >= h >= s)

Decision Tree

Entropy

$H(X) = {\sum_i}P(x_i)I(x_i) = -{\sum_i}P(x_i){\ln}P(x_i) = -{\int}p(x){\ln}p(x)dx$

$I(x_i)$ : Information (Surprisal) of $x_i$
- 밑이 양수기에 (2, e, 10) 0 ~ 무한대의 값을 가짐
- 확률변수가 나타낼 수 있는 상태의 총합의 로그
- 정보량
- 의외의 정도
$H(X)$ : Entropy
- expected surprisal, mean of information
- 확률변수가 나타낼 수 있는 상태의 총합의 로그의 평균
- 엔트로피가 높다: 계 전체적으로 가능한 상태가 많다 (계산에 따른 일반화 - 균일한 형태)
- 엔트로피가 낮다: 계가 보다 더 특정지어져 있다 (계산에 따른 일반화 - 균일하지 않은 형태)
Cross Entropy
- $-\int{p(x)\ln}q(x)dx$ .
- true distribution p를 distribution p의 coding scheme으로 표현시의 정보량
- p의 확률분포를 q라는 정보량분포에 담아내었을 때의 정보량
- 그 자신의 정보량에 담아내었을 때에 최소값을 가지기 때문에, 무조건 원래 정보량보다 큰 값을 가짐
KL Divergence(Relative Entropy)
- $-\int{p(x)\ln}q(x)dx - (-\int{p(x)\ln}p(x)dx) = -\int{p(x)\ln}{\left\{q(x)\over{p(x)}\right\}}dx$ .
- Cross Entropy와 원래 Entropy와의 차이
- 항상 0보다 큰 값을 가지며, 0일 경우 같은 분포를 의미
- p(x)는 타겟으로, 보통 이 분포의 엔트로피는 고정되어 있기 때문에 타겟과 모델의 차이를 나타낼 때 주로 Cross Entropy를 씀
Conditional Entropy

$H(Y|X) = {\sum_x}p(x)H(Y|X=x)$

$= -{\sum_x}p(x){\sum_y}p(y|x){\ln}p(y|x)$

$= -\sum_x{\sum_y}p(x,y){\ln}p(y|x)$

$= -\sum_{x,y}p(x,y){\ln}p(y|x)$

$= -\sum_{x,y}p(x,y){\ln}\left\{p(x,y)\over{p(x)}\right\}$

$= \sum_{x,y}p(x,y){\ln}\left\{p(x)\over{p(x,y)}\right\}$

Information Gain

$IG(Y, A_i) = H(Y) - H(Y|A_i)$

데이터의 information(surprisal)이 감소한 정도
이를 획득한 정보라고 생각함
값이 클 수록(정보량이 많이 감소하였을 수록) 분류가 잘 되었다고 판단

Top-down Induction Algorithm

IG가 큰 순서대로 feature을 선정, 분기를 생성
ID3 Algorithm

select an open node to split
select a best variable to split
for values of the selected variable:
    sort instances with the value of the selected variable
    put the sorted items under the branch of the value of the variable
    if the sorted items are all in one class:
        close the leaf node of the branch

Linear Regression

Hypothesis

$h: \hat{f}(x;\theta) = \theta_0+\sum_{i=1}^n\theta_ix_i = \sum_{i=0}^n\theta_ix_i\quad(x_0=1)$

$f(x;\theta) = \sum_{i=0}^n\theta_ix_i+e=y$

$f: X\theta+e = Y$

$\hat\theta = \underset\theta{\operatorname{argmin}}(f-\hat{f})^2 = \underset\theta{\operatorname{argmin}}(Y-X\theta)^2 = \underset\theta{\operatorname{argmin}}(Y-X\theta)^T(Y-X\theta)$

$= \underset\theta{\operatorname{argmin}}(Y^TY-2\theta^TX^TY+\theta^TX^TX\Theta) = \underset\theta{\operatorname{argmin}}(\theta^TX^TX\theta-2\theta^TX^TY)$

$\nabla_\theta(\theta^TX^TX\theta-2\theta^TX^TY) = 0$

$2X^TX\theta-2X^TY = 0$

$\theta = (X^TX)^{-1}X^TY$

$(X^TX)^{-1}X^T = X^+(Moore-Penrose\,Pseudoinverse)$

$\theta = X^+Y$

Naive Bayes Classifier

한마디로: prior에 label값에 대하여 feature가 conditionally independent하다고 가정된 feature별 class conditional density의 총 곱을 곱하여 얻은 posterior값을 최대로 하는 label값 선택

Optimal Classification

Bayes Optimal Classifier

$\hat{f} = \underset{f}{\operatorname{argmin}}P(f(X){\neq}Y)$

$\hat{y} = \underset{y}{\operatorname{argmax}}\sum_{\theta_i{\in}\Theta}P(y|\theta_i)P(\theta_i|D)$

$\hat{y} = \underset{y}{\operatorname{argmax}}P(Y=y|X=x) = \underset{y}{\operatorname{argmax}}P(X=x|Y=y)P(Y=y)$

모두 같은 의미, Bayes Classifier의 의미를 정확하게 이해하여야 함
- $\hat{f} = \underset{f}{\operatorname{argmin}}P(f(X){\neq}Y)$ 은 MLE혹은 MAP를 의미하는 것이 아님
- hypothesis space에서 error을 minimization하는 한 가지 hypothesis를 선택하는 의미가 아니라, 전체 space에서 최적의 hypothesis를 찾는 의미
- MAP를 통한 hypothesis가 dominant한 경우에는 이를 통해 classification을 진행해도 문제가 보이지 않으나, dominant하지 않을 경우 가시적인 문제가 발생
- 전체 hypothesis space에 대하여 데이터에 대한 모든 y기대값의 평균이 최대인 y를 선택하는 것과 동일 (Utah univ. cs5350 강의자료)
- 이는 P(Y=y|X=x)를 최대화하는 y를 구하는 것과 동일
- P(Y=y|X=x)를 구하는 과정에서 P(X=x|Y=y) (class conditional density), P(Y=y) (class prior)를 구하는 부분: 무의식적으로 구하지만 그 자체가 데이터를 통해 MLE혹은 MAP를 통하여 구한 Hypothesis들을 통한 기대값의 평균(optimal한 값)을 구하는 것
- P(X=x|Y=y) (class conditional density)를 구할 때, Gaussian / Bernoulli / Multinomial 등의 모형을 적용(Bag of words는 Bernoulli)
- 이상한 점은 유타대 강의에 따르면 class conditional density를 구할 때, 모수 그 자체가 아니라 모수의 분포를 통하여 모든 모수분포에 대한 레이블별 평균확률을 구하는 것이 합당한데, 데이터 사이언스 스쿨과 Scikit-learn패키지에서는 모수를 특정하여 학습을 진행한다. 이 점에 대해서는 추후에 고민해 볼 것
- MAP prediction과 MAP learning은 다르다. 헷갈리지 말 것

Bayes Risk - 추후 재정리

Optimal Classification의 한계

$\hat{y} = \underset{y}{\operatorname{argmax}}P(X=x|Y=y)P(Y=y)$

feature 개수가 많아짐에 따라 feature compound가 급격하게 늘어남 (총 feature = (2^D-1)*k)
충분한 샘플을 확보하기가 힘들기 때문에, sparse한 데이터가 되어 학습에 문제가 발생

Conditional Independence Assumption

$P(X = <x_1, \cdots, x_i>|Y = y) \approx \prod_iP(X_i = x_i|Y = y)$

$P(x_1|x_2,y) = P(x_1|y)$

Conditional(local) vs Marginal Independence

Marginal Independence

$P(X) = P(X|Y)$

Conditional(local) Independence

$P(x_1|x_2,y) = P(x_1|y)$

Conditional Independence는 latent variable이 존재하고, 이에 대해서 dependent하기 때문에, Y가 condition일 경우 상호간에는 independent
Naive Bayes에서는 Label Y를 latent variable로 가정, 이에 대해서 모든 변수가 종속적인 것으로 가정하고 Y를 condition으로 줌
주의: Conditional Independence는 latent variable이 condition일 경우에만 독립

Naive Bayes Classifier

conditional independence 가정시

$\hat{y} = \underset{y}{\operatorname{argmax}}P(X=x|Y=y)P(Y=y)$

$\approx \underset{y}{\operatorname{argmax}}P(Y=y)\prod_{1{\leq}i{\leq}d}P(X_i = x_i|Y = y) = f_{NB}(x)$

해당 식을 MLE식과 비교하여 정확한 차이를 숙지하자 개념적으로는 차이가 크지만, 계산방법에 있어서는 어느정도 유사

Logistic Regression

Logistic Function

$\sigma(z) = {1\over{1+\exp(-z)}}$

Logit Function

$logit(p) = \ln\left({p\over{1-p}}\right)$

Logistic Function의 domain = Logit Function의 range = $(-\infty, +\infty)$ : input feature
Logit Function의 domain = Logistic Function의 range = $(0, 1)$ : 확률을 표현하기에 적합
Logit을 $\theta^Tx$ 로, x의 매퍼를 정의하여 Logistic Function을 x축방향 scaling(variance) 및 translation(bias)
- 본래 bias $\theta_0$ 항이 포함되어 translation이 정의되나, $(x_0 = 1)$ 조건을 통해 간략히 translation을 포함

$z = \ln\left({p\over{1-p}}\right) = X\theta$

$h_\theta(x) = \sigma(\theta^Tx) = {1\over{1+\exp(-\theta^Tx)}}$

다른 방식으로 해석하면 fixed basis function을 이용, $x$ 를 $\phi(x)$ 로 매핑하여 변환 후 판별한다고 해석할 수 있음(PRML Chapter4)
Bernoulli 적용

$P(y|x) = \mu(x)^y(1-\mu(x))^{1-y}$

$\mu(x) = {1\over{1+\exp(-\theta^Tx})} = {1\over{1+\exp(-X\theta})} = P(y=1|x)$

$X\theta = \ln\left(\mu(x)\over{1-\mu(x)}\right)$

$\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta)$

MLE의 대상 D는 상황에 따라 의미가 다르다: 식에서 D가 무엇을 의미하는지 헷갈리지 말 것
Likelihood는 단순히 모수에 대하여 sample(관측값)이 발현할 확률
Likelihood를 과대해석하여 의미를 부여하려 하지 말 것 - 여기서는 모수와 X에 따른 Y의 발현확률
MCLE (Maximum Conditional Likelihood Estimation)

$\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta) = \underset{\theta}{\operatorname{argmax}}{\prod_{1{\leq}i{\leq}N}}P(Y_i|X_i;\theta)$

$= \underset{\theta}{\operatorname{argmax}}\ln\left\{\prod_{1{\leq}i{\leq}N}P(Y_i|X_i;\theta)\right\} = \underset{\theta}{\operatorname{argmax}}\sum_{1{\leq}i{\leq}N}\ln\left\{P(Y_i|X_i;\theta)\right\}$

$P(Y_i|X_i;\theta) = \mu(X_i)^{Y_i}(1-\mu(X_i))^{1-Y_i}$

${\ln}P(Y_i|X_i;\theta) = Y_i\ln\mu(X_i)+(1-Y_i)\ln(1-\mu(X_i))$

$= Y_i(\ln\mu(X_i)-\ln(1-\mu(X_i))+\ln(1-\mu(X_i))$

$= Y_i\left(\ln{\mu(X_i)\over{1-\mu(X_i)}}\right)+\ln(1-\mu(X_i))$

$= Y_iX_i\theta + \ln(1-\mu(X_i))$

$= Y_iX_i\theta - \ln\{1+\exp(X_i\theta)\}$

${\partial\over\partial\theta_j}\left[\sum_{1{\leq}i{\leq}N}Y_iX_i\theta - \ln\{1+\exp(X_i\theta)\}\right]$

$= \sum_{1{\leq}i{\leq}N}X_{i,j}\left(Y_i-{\exp(X_i\theta)\over{1+\exp(X_i\theta)}}\right)$

$= \sum_{1{\leq}i{\leq}N}X_{i,j}(Y_i-P(Y_i=1|X_i;\theta)) = 0$

apply gradient ascent

$\theta_j^{t+1} = \theta_j^t+{h\over{c}}{\partial{f(\theta^t)}\over\partial\theta_j}$

$= \theta_j^t+{h\over{c}}{\sum_{1{\leq}i{\leq}N}X_{i,j}\left(Y_i-{\exp(X_i\theta^t)\over{1+\exp(X_i\theta^t)}}\right)}$

computational하게 계산할 때는 일반적으로 negative log likelihood

$-\sum_{1{\leq}i{\leq}N}{\ln}P(Y_i\|X_i;\theta) = -\sum_{1{\leq}i{\leq}N}\left\{Y_i\ln\mu(X_i)+(1-Y_i)\ln(1-\mu(X_i))\right\}$

을 loss function으로 둔다

Generative - Discriminative Pair

Discriminative Model
- posterior $P(Y=y\|x)$ 를 직접 모델링
- 사전지식이 많지 않지만, Data가 충분히 많을 경우 높은 퍼포먼스
Generative Model
- $P(Y=y\|X) = {P(X\|Y=y)P(Y=y)\over{P(X)}}$ 에서 likelihood를 특정 PDF로 모델링하여 posterior을 계산
- Data가 많지 않지만, 사전지식이 충분히 많을 경우 높은 퍼포먼스
Naive Bayes Classifier - Logistic Regression
QDA, LDA를 코딩하였을 때를 생각해 보자
- 이 때, 나는 QDA는 Generative, LDA는 Discriminative로 구현하였다고 생각하였는데 맞나?
- QDA는 확실히 Generative 모델로 구현
- 이때의 QDA는 Naive조건이 없는 상태에서 Gaussian Distribution을 PDF로 가정한 Bayes Optimal Classifier
- LDA는 베이즈 식을 이용하지 않고 between class scatter과 within class scatter만을 이용하여 구현하였으나, Discriminative라 할 수 있을까?
- 특정 PDF가정을 하지 않았기에, 맞다고 생각한다

Naive Bayes Classifier - Logistic Regression

Naive Bayes Classifier에서 Gaussian Distribution가정, class conditional variance를 동일하다고 가정할 경우, 식을 정리하면 Linear Regression의 형태가 됨

Softmax Regression(Multinomial Logistic Regression)

Softmax Function

$\sigma(z)_j = {e^{z_j}\over{\sum_{k=1}^Ke^{z_k}}}{\quad}(for\,j=1,\cdots,K)$

Softmax Function의 loss function

Bernoulli distribution

$P(y|x) = p_1^{y_1}(1-p_1)^{1-y_1}$

여확률 대신, True와 False의 확률과 라벨을 각각 분리시켜 일반화

$P(y|x) = p_1^{y_1}(1-p_1)^{1-y_1} = p_1^{y_1}p_0^{y_0}$

multinomial에 대하여 확장

$P(y|x) = p_1^{y_1}p_2^{y_2}p_3^{y_3}{\cdots}p_C^{y_C}$

$= \prod_{1{\leq}j{\leq}C}p_j^{y_j}$

negative log likelihood

$-\sum_{1{\leq}i{\leq}N}{\ln}P(Y_{i,j}\|X_i;\theta) = -\sum_{1{\leq}i{\leq}N}\sum_{1{\leq}j{\leq}C}Y_{i,j}\ln{\mu_j(X_{i})}$

결과가 cross entropy의 식이 된다
logistic regression에서도 마찬가지, 분포로부터 얻은 negative log likelihood는 둘다 cross entropy의 식
cross entropy를 이처럼 KL-divergence를 통하여 모델과 타겟의 유사도를 판정한다는 개념 외에도, 분포 자체의 likelihood를 최대화하는 MLE과정에서의 목적함수로서의 개념으로도 이해할 수 있다.

Support Vector Machine

이 부분에 있어서는 이미 이해도가 높기에 간략히 정리하고 넘어가고, 추후 이슈가 발생할 때에 내용을 보완하기로 한다
개념 자체는 간단하기 때문에 기본개념과 dual problem, kernel trick부분만 간단히 정리
추후 linear programming, quadratic programming, dual problem에 대하여 집중적으로 정리해 본다
추후 주제와는 별개로 convex optimization에 중점을 두고 생각해 본다

Problem Definition

hyperplane

$w^Tx+b = 0$

margin boundary

$w^Tx_++b = +\delta$

$w^Tx_-+b = -\delta$

$w^Tx_++b \geq +\delta : y = +1$

$w^Tx_-+b \leq -\delta : y = -1$

$y(w^Tx+b) \geq \delta$

distance

${2\delta\over\lVert{w}\rVert}$

$\delta$ is arbitrary

$y(w^Tx+b)/\delta \geq 1$

$w^Tx+b = 0, \quad (w^Tx+b)/\delta = 0$

두 평면은 동일한 평면이며, 우리는 w와 b의 절대적인 값이 아니라, 비율에만 관심이 있다. 따라서,

constraint

$y(w^Tx+b) \geq 1$

$y(w^Tx+b)-1 \geq 0$

objective function

$\operatorname{min}{\lVert{w}\rVert}$

계산상 편의를 위해

$\operatorname{min}{1\over{2}}\lVert{w}\rVert^2$

quadratic programming을 이용하여 풀거나, lagrange duality를 이용하여 풀거나
알고리즘적인 부분은 여유 있을 때 보고, 지금은 lagrange duality를 이용하여 이차식의 최소값을 gradient descent로 푸는 방식으로 해결하자

Soft-margin and Penalization

Zero-One Loss 적용한 objective function

$\operatorname{min}{1\over{2}}\lVert{w}\rVert^2+C*\#_{error}$

$s.t.\quad(w^Tx_i+b)y_i\geq{1}$

Hinge Loss 적용한 objective function
- slack variable $\xi$ ( $\xi_j > 1$ when mis-classified)
- margin boundary로부터 벗어난 거리를 penalty로
- margin boundary까지는 값이 0이다가, margin boundary를 통과하는 순간 그래프가 경첩 모양으로 꺾여 값이 선형으로 증가하기 때문에 hinge loss라 함

$\operatorname{min}{1\over{2}}\lVert{w}\rVert^2+C\sum_{i}^n\xi_i$

$s.t.{\quad}y_i(w^Tx_i+b)\geq{1-\xi_i},\quad{\xi_i\geq{0}}$

$\xi_i = max(0, \{1-y_i(w^Tx_i+b)\})$

parameter C의 효과
- C가 클 수록 slack variable의 영향(strength of loss function)이 커지고, 이에 따라 soft-margin이 작아짐
- C가 작을 경우 slack variable의 영향(strength of loss function)이 작아지고, 이에 따라 soft-margin이 넓어짐
- C가 0일 경우, slack variable에 대한 제한이 없기에 hyperplane은 제 구실을 하지 못함
- C가 무한대일 경우, slack variable을 허용하지 않기에 hard-margin

Langrangian Dual Problem

Karush-Kuhn-Tucker conditions
- Nonlinear optimization problem

$optimize\,f(x)$

$subject\,to$

$g_i(x)\leq{0}$

$h_i(x) = 0$

KKT - 1: stationarity
- lagrange multiplier method 의 핵심제약조건: 정류점(local max / local min / saddle)을 찾는 제약조건

for maximazing f(x):

$\nabla{f(x^*)} = \sum_{i=1}^{m}\mu_i\nabla{g_i}(x^*)+\sum_{j=1}^{l}\lambda_j\nabla{h_j}(x^*)$

for minimizing f(x):

$-\nabla{f(x^*)} = \sum_{i=1}^{m}\mu_i\nabla{g_i}(x^*)+\sum_{j=1}^{l}\lambda_j\nabla{h_j}(x^*)$

KKT - 2: primal feasibility
- 원 문제로부터 주어진 기본제약조건

$g_i(x^*)\leq{0},{\quad}for\,i=1,\cdots,m$

$h_j(x^*)=0,{\quad}for\,j=1,\cdots,l$

KKT - 3: dual feasibility
- 그라디언트의 부호가 반대일 경우 변수 평면상의 기울기는 동일하나, 함수 축방향 기울기가 반대
- 등식일 경우는 접점만 찾으면 되기에 관계없으나 부등식의 경우 그라디언트 부호에 따라 범위가 뒤집히기 때문에, 부호를 유지해 줘야 함

$\mu_i\geq{0},{\quad}for\,i=1,\cdots,m$

KKT - 4: complementary slackness
- constraint 부등식이 local maximum을 포함할 경우 - “부등식에 관계없이 해당 local maximum이 optimum”
- constraint 부등식이 local maximum을 포함하지 않을 경우 - “optimum은 부등식의 boundary에 존재”

$\mu_ig_i(x^*) = 0,{\quad}for\,i=1,\cdots,m$

Lagrangian

$\operatorname{min}L_P = {1\over{2}}\lVert{w}\rVert^2+C\sum_{i=1}^n\xi_i+\sum_{i=1}^n\alpha_i\{1-y_i(w^Tx_i+b)-\xi_i\}+\sum_{i=1}^n-\beta_i\xi_i$

$= {1\over{2}}\lVert{w}\rVert^2+C\sum_{i=1}^n\xi_i-\sum_{i=1}^n\alpha_i\{y_i(w^Tx_i+b)+\xi_i-1\}-\sum_{i=1}^n\beta_i\xi_i$

Stationary condition

${\partial{L}\over{\partial{w}}} = 0\quad{\partial{L}\over{\partial{b}}} = 0\quad{\partial{L}\over{\partial{\xi}}} = 0$

$w = \sum_{i=1}^n\alpha_iy_ix_i$

$\sum_{i=1}^n\alpha_iy_i = 0$

$\sum_{i=1}^n(C-\alpha_i-\beta_i) = 0$

Primal feasibility condition

$y_i(w^Tx_i+b)\geq{1-\xi_i},\quad{\xi_i\geq{0}}$

Dual feasibility condition

$\alpha_i\geq{0},\quad\beta_i\geq{0}$

Complementary slackness condition

$\alpha_i\{y_i(w^Tx_i+b)+\xi_i-1\} = 0,\quad\beta_i\xi_i = 0$

derived condition (from stationary & dual feasibility)

$C\geq{\alpha_i}\geq{0}$

Dual problem

$\operatorname{min}L_P = {1\over{2}}\lVert{w}\rVert^2+C\sum_{i=1}^n\xi_i-\sum_{i=1}^n\alpha_i\{y_i(w^Tx_i+b)+\xi_i-1\}-\sum_{i=1}^n\beta_i\xi_i$

$\operatorname{max}L_D = \sum_{i=1}^n\alpha_i-{1\over{2}}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_jx_i^Tx_j$

$\left(\sum_{i=1}^n\alpha_iy_i = 0\right)$

$\left(\alpha_i\{y_i(w^Tx_i+b)-1\} = 0\right)$

$\left(C\geq{\alpha_i}\geq{0}\right)$

Primal problem과 Dual problem
- kernel trick을 적용하고자 한다면 dual problem으로 변형시켜야 (가장 중요한 이유)
- multivariable에 대한 optimization인 primal problem에 비하여 dual problem의 경우 Lagrangian multiplier에 대한 optimization이기 때문에 풀기가 좀 더 수월
- 속도 측면에 대해서는 더 생각해 볼 필요가 있음 - dual problem의 경우 feature space 대신 sample space에 대하여 optimization 을 진행하는데, optimization method에 따라 수렴 속도 / 시간복잡도에 있어 primal problem과 차이가 있을 수도? 여유 있을 때 생각해 볼 것

Kernel Trick

Mercer’s theorem
- $K(x,y)$ 가 positive semidefinite할 경우, $K(x_i,x_j) = \phi(x_i)\cdot\phi(x_j) = \phi(x_i)^T\phi(x_j)$ 를 만족하는 $\phi$ 존재
- 다른 공간으로 매핑하는 함수 $\phi$ 의 존재를 알기에, K를 kernel로 사용 가능
- $\phi$ 를 고차원 basis function으로 mapping시킬 경우, 연산량이 대단히 높음 (RBF의 경우 무한대의 차원으로 매핑)
- kernel을 이용하여 이러한 문제를 해결
dual problem 적용

$\operatorname{max}L_D = \sum_{i=1}^n\alpha_i-{1\over{2}}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)$

$= \sum_{i=1}^n\alpha_i-{1\over{2}}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_jK(x_i,x_j)$

$w = \sum_{i=1}^n\alpha_iy_i\phi(x_i)$

$b = y_i-\sum_{i=1}^n\alpha_iy_i\phi(x_i)^T\phi(x_j)$

decision boundary를 확인할 필요 없고, classification만이 관심사

$sign(w\phi(x)+b) = sign\left(\sum_{i=1}^n\alpha_iy_i\phi(x_i)\phi(x)+y_i-\sum_{i=1}^n\alpha_iy_i\phi(x_i)^T\phi(x_j)\right)$

$= sign\left(\sum_{i=1}^n\alpha_iy_iK(x_i,x)+y_i-\sum_{i=1}^n\alpha_iy_iK(x_i,x_j)\right)$

Linear Kernel

$K(x_1,x_2) = x_1^Tx_2$

Polynomial Kernel

$K(x_1,x_2) = (\gamma(x_1^Tx_2)+\theta)^d$

RBF(Radial Basis Function), Gaussian Kernel

$K(x_1,x_2) = \exp(-\gamma\lVert{x_1-x_2}\rVert^2)$

Sigmoid Kernel

$K(x_1,x_2) = \tanh(\gamma(x_1^Tx_2)+\theta)$

Training / Testing & Regularization

Source of error in two folds

Approximation error
- Validity of accuracy
- 측정, 근사간 에러 발생
Generalization error(out-of-sample error)
- Validity of dataset
- dataset이 real world를 정확하게 반영하지 못함

$E_{out}\leq{E_{in} + \Omega}$

$E_{out}:{\quad}Estimation\,error$

$E_{in}:{\quad}Error\,from\,approximation$

$\Omega:{\quad}Error\,from\,variance\,of\,the\,observation$

Bias-Variance Decomposition

(문인철교수 강의보다는 PRML의 식이 더 입맛에 맞아 이쪽의 notation으로 정리)

notation
- $t$ : 입력 데이터 x에 대응되는 target값 (x에 대하여 단일 값이 아닌 분포의 형태를 띔)
- $h$ : target t를 real world의 전체 dataset에 대하여 평균낸 optimal prediction
- $y$ : dataset을 통하여 예측한 hypothesis
- $L$ : loss

optimal prediction

$h(x) = E(t|x) = \int{tp(t|x)}dt$

expected loss

$E(L) = \iint\{y(x)-t\}^2p(x,t)dxdt$

$= \int\{y(x)-h(x)\}^2p(x)dx+\iint\{h(x)-t\}^2p(x,t)dxdt$

우측텀은 sampling시 data의 noise항이므로 hypothesis와 무관계 첫 번째 텀에 대하여 논의 진행

$\int\{y(x)-h(x)\}^2p(x)dx = \int{E_D(\{y(x;D)-h(x)\}^2)p(x)dx}$

$E_D(\{y(x;D)-h(x)\}^2)$

$= E_D(\{y(x;D)-E_D(y(x;D))+E_D(y(x;D))-h(x)\}^2)$

$= E_D(\{y(x;D)-E_D(y(x;D))\}^2+\{E_D(y(x;D))-h(x)\}^2$

$+2\{y(x;D)-E_D(y(x;D))\}\{E_D(y(x;D))-h(x)\})$

$E_D(y(x;D))-h(x) = const$

$E_D(y(x;D)-E_D(y(x;D))) = 0$

$\therefore{\int{E_D(\{y(x;D)-h(x)\}^2)p(x)dx}}$

$= \int{\{E_D(y(x;D))-h(x)\}^2p(x)dx}+\int{E_D(\{y(x;D)-E_D(y(x;D))\}^2)p(x)dx}$

$\therefore{E(L) = \iint\{y(x)-t\}^2p(x,t)dxdt}$

$= \int{\{E_D(y(x;D))-h(x)\}^2p(x)dx}\quad(bias)^2$

$+ \int{E_D(\{y(x;D)-E_D(y(x;D))\}^2)p(x)dx}\quad(variance)$

$+ \iint\{h(x)-t\}^2p(x,t)dxdt\quad(noise)$

bias: model이 real world를 충분하게 표현할 수 없는 한계
- to reduce: more complex model
variance: sample과 infinite dataset의 괴리
- to reduce: more data
Dataset이 제한되어있을 때, bias와 variance는 model complexity를 변수로 하는 tradeoff관계
- Model Complexity를 높임 (unstable, specific, overfitting)
- variance증가, bias감소
- dataset에 대하여 높은 성능
- real world에 대하여 unstable
- Model Complexity를 낮춤 (stable, general, underfitting)
- variance감소, bias증가
- dataset을 잘 표현하지 못함
- real world에 대하여 stable

Occam’s Razor

오차가 비슷할 경우, complexity가 낮은 모델을 선택

Cross Validation

k-fold cross validation
- k개의 subset
- k-1개의 subset은 training
- 1개의 subset testing
LOOCV(Leave One Out Cross Validation)
- k-fold cross validation의 extreme case(k = 데이터샘플 수)

Performance metrics

|||실제정답| |||Positive|Negative| |:---:|:---:|:---:|:---:| 실험결과|Positive|TP(True Positive)|FP(False Positive)| ||Negative|FN(False Negative)|TN(True Negative)|

Accuracy = ${TP+TN\over{TP+FP+FN+TN}}$
Precision = ${TP\over{TP+FP}}$
Recall = ${TP\over{TP+FN}}$
True negative rate(specificity) = ${TN\over{TN+FP}}$
Precision
- Positive로 판정한 data중 실제로 positive인 data
- Precision을 높일 경우 False Positive를 최소화하게 됨, positive 판정에 대하여 엄격
- 높은 Precision을 요구하는 예로 spam filter가 있음
Recall
- 실제로 positive인 data중 positive로 판정한 data
- Recall의 높일 경우 False Negative를 최소화하게 됨, negative 판정에 대하여 엄격
- 높은 Recall을 요구하는 예로 CRM이 있음
F-measure

$F_\beta={(1+\beta^2)*precision * recall\over\beta^2*precision+recall}$

$F_1$ : evenly weighted
$F_2$ : emphasizes recall
$F_0.5$ : emphasizes precision

Regularization

PRML과 (http://enginius.tistory.com/476)의 게시글 정리, 블로그의 게시글에 추가적으로 볼 내용 있으나, Legendre polynomial / Augmented error파트는 다 읽지 못함 추후에 반드시 공부할 것

$E_{out}\leq{E_{in} + \Omega}$

$E(w) = E_D(w)+\lambda{E_W(w)}$

sacrifice perfect fit, complexity를 낮추는 효과(variance control)
weight decay : coefficient(weight)의 크기를 variance의 요인(복잡도)의 평가 기준으로 삼음

$E(w) = {1\over{2}}\sum_{n=1}^N\{t_n-w^T\phi(x_n)\}^2+{1\over{2}}\sum_{j=1}^M\lVert{w_j}\rVert^q$

$L_2$ regularization(Ridge regression: q = 2)
- w벡터의 Euclidean norm의 제곱을 regularizer로 삼음
- 람다값이 높아질수록 가중치w의 분포가 정규분포에 가까워짐
$L_1$ regularization(Lasso regression: q = 1)
- 람다값이 충분히 클 경우, 많은 feature에 대한 w값이 0이 되거나 0에 가까워짐
- 따라서 feature selection이 되고, sparse모델의 형태가 되는데, 몇 개의 basis function만이 모델의 식을 만드는데 사용되게 됨

Mahalanobis Distance / Singular Value Decomposition (과정외)

$D_M^2 = (x-\mu)^T\Sigma^{-1}(x-\mu)$

$\overline{X_{D}} = USV^T$

$\Sigma = \overline{X_{D}}^T\overline{X_{D}} = VSU^TUSV^T = VSSV^T$

$\Sigma^+ = VS^+S^+V^T$

$D_M^2 = (x-\mu)^T\Sigma^{-1}(x-\mu) = (x-\mu)^TVS^+S^+V^T(x-\mu)$

$= \{(x-\mu)^TVS^+\}\{(x-\mu)^TVS^+\}^T$

Machine Learning