Learning

Probably Approximately Correct Learning (PAC)

Key assumptions:

  • Training and testing data are generated i.i.d. according to a fixed but unknown distribution $D$.
  • The "quality" of a hypothesis (classification function) $h \in H$ should take the unknown distribution $D$ into account.
  • The risk functional (the "average error" made by $h \in H$):

$$\text{err}_D(h) = D\{(x, y) \in X \times \{-1, 1\} \mid h(x) \neq y\}$$
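
As a concrete illustration, here is a minimal Python sketch that estimates $\text{err}_D(h)$ for one fixed hypothesis by sampling from a synthetic distribution; the distribution, the weight vector, and all names are illustrative assumptions, not part of the PAC model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_D(n):
    # Illustrative stand-in for the unknown distribution D: x ~ N(0, I) in R^2,
    # y = sign of the first coordinate, flipped with probability 0.1 (label noise).
    X = rng.normal(size=(n, 2))
    y = np.sign(X[:, 0])
    y[rng.random(n) < 0.1] *= -1
    return X, y

def h(X):
    # A fixed hypothesis h in H: a linear classifier x -> sign(w . x).
    return np.sign(X @ np.array([1.0, 0.5]))

# err_D(h) = D{(x, y) : h(x) != y}, estimated by the error frequency on a large sample.
X, y = sample_D(100_000)
print("estimated err_D(h) ~", np.mean(h(X) != y))
```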

Generalization Error of the PAC Model

  • Let $S = ((x_1, y_1), \ldots, (x_l, y_l))$ be a set of $l$ training examples drawn according to $D$.
  • Treat the generalization error $\text{err}_D(h_S)$ as a random variable depending on the random selection of $S$.
  • Find a bound on the tail of the distribution of the random variable $\text{err}_D(h_S)$ of the form $\varepsilon = \varepsilon(l, H, \delta)$.
  • $\varepsilon = \varepsilon(l, H, \delta)$ is a function of $l$, $H$ and $\delta$, where $1 - \delta$ is the confidence level of the error bound, chosen by the learner.

Probably Approximately Correct

$$\Pr\left(\left\{\, \text{err}_D(h_S) > \varepsilon(l, H, \delta) \,\right\}\right) < \delta$$

  • $\varepsilon(l, H, \delta)$ does not depend on the unknown distribution $D$.
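
The two lists above can be made concrete with a small simulation: draw many training sets $S$, learn $h_S$ from each, and estimate the tail probability $\Pr(\text{err}_D(h_S) > \varepsilon)$ directly. Everything in this sketch — the finite threshold class, the noisy synthetic distribution, the choice of ERM as the learner, the sample sizes — is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite hypothesis class H: threshold classifiers h_t(x) = sign(x - t) on the real line.
thresholds = np.linspace(-2, 2, 41)

def sample_D(n):
    # Synthetic stand-in for the unknown D: true threshold at 0, 10% label noise.
    x = rng.normal(size=n)
    y = np.sign(x)
    y[rng.random(n) < 0.1] *= -1
    return x, y

def erm(x, y):
    # h_S: the hypothesis in H with the smallest training error on S.
    errs = [np.mean(np.sign(x - t) != y) for t in thresholds]
    return thresholds[int(np.argmin(errs))]

def true_err(t, n_test=50_000):
    x, y = sample_D(n_test)
    return np.mean(np.sign(x - t) != y)

# err_D(h_S) is a random variable (it depends on the draw of S); estimate its tail probability.
l, eps, trials = 50, 0.2, 500
tail = np.mean([true_err(erm(*sample_D(l))) > eps for _ in range(trials)])
print(f"estimated Pr(err_D(h_S) > {eps}) ~ {tail}")
```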

Find the Minimum Expected Risk

  • The expected misclassification error made by $h \in H$ is

$$R[h] = \int_{X \times \{-1, 1\}} \frac{1}{2} \, |h(x) - y| \; dp(x, y)$$
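
Since $y$ and $h(x)$ both take values in $\{-1, 1\}$, the integrand $\frac{1}{2}|h(x) - y|$ is exactly the 0–1 loss (it equals $1$ when $h(x) \neq y$ and $0$ otherwise), so the expected risk is just the misclassification probability under $p$:

$$R[h] = \Pr_{(x, y) \sim p}\left( h(x) \neq y \right),$$

which coincides with $\text{err}_D(h)$ above when $p$ is the unknown distribution $D$.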

The Ideal Hypothesis

  • The ideal hypothesis $h_{\text{opt}}$ should have the smallest expected risk:

$$R[h_{\text{opt}}] \leq R[h], \quad \forall h \in H$$

Empirical Risk Minimization (ERM)

  • Replace the expected risk over $p(x, y)$ by an average over the training examples, i.e. the empirical risk $R_{\text{emp}}[h] = \frac{1}{l} \sum_{i=1}^{l} \frac{1}{2} |h(x_i) - y_i|$.
  • The ERM hypothesis $h_{\text{emp}}$ minimizes the empirical risk:

$$R_{\text{emp}}[h_{\text{emp}}] \leq R_{\text{emp}}[h], \quad \forall h \in H$$

  • Focusing only on the empirical risk will cause overfitting (see the sketch below).
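
A minimal sketch of the overfitting point, under the same toy distribution as above (the distribution and both hypotheses are illustrative assumptions): an ERM threshold from a small class $H$ versus a 1-nearest-neighbour "memorizer" that drives the empirical risk to zero but generalizes worse.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_D(n):
    # Synthetic stand-in for the unknown D: noisy threshold at 0, 10% label noise.
    x = rng.normal(size=n)
    y = np.sign(x)
    y[rng.random(n) < 0.1] *= -1
    return x, y

x_tr, y_tr = sample_D(30)          # small training sample S
x_te, y_te = sample_D(50_000)      # large sample to approximate the true risk R[h]

def emp_risk(pred, x, y):
    # Average 0-1 loss of the predictor on the given examples.
    return np.mean(pred(x) != y)

# ERM over a small class H = {sign(x - t)}: pick the threshold with minimal training error.
thresholds = np.linspace(-2, 2, 41)
t_erm = thresholds[int(np.argmin([emp_risk(lambda x, t=t: np.sign(x - t), x_tr, y_tr)
                                  for t in thresholds]))]

def nn_memorizer(x):
    # 1-nearest-neighbour "memorizer" (a very rich class): zero empirical risk on S.
    idx = np.argmin(np.abs(x[:, None] - x_tr[None, :]), axis=1)
    return y_tr[idx]

for name, h in [("ERM threshold", lambda x: np.sign(x - t_erm)), ("memorizer", nn_memorizer)]:
    print(f"{name:>14}: R_emp = {emp_risk(h, x_tr, y_tr):.2f}, R (approx.) = {emp_risk(h, x_te, y_te):.3f}")
```

On a typical run the memorizer attains zero training error but a noticeably higher approximate true risk than the simple threshold, which is exactly the overfitting effect the bullet describes.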

VC Confidence

  • The following inequality holds with probability $1 - \delta$, where $v$ denotes the VC dimension of $H$:

$$R[h] \leq R_{\text{emp}}[h] + \sqrt{\frac{v \left( \log(2l/v) + 1 \right) - \log(\delta/4)}{l}}$$

Let

$$E(v, l, \delta) = \sqrt{\frac{v \left( \log(2l/v) + 1 \right) - \log(\delta/4)}{l}}$$
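
A direct transcription of $E(v, l, \delta)$ as a small Python helper (a sketch; the values in the example call are arbitrary):

```python
import numpy as np

def vc_confidence(v, l, delta):
    # E(v, l, delta) = sqrt(( v * (log(2l/v) + 1) - log(delta/4) ) / l)
    return np.sqrt((v * (np.log(2 * l / v) + 1) - np.log(delta / 4)) / l)

# Example: VC dimension 10, 1000 training examples, 95% confidence (delta = 0.05).
print(vc_confidence(v=10, l=1_000, delta=0.05))
```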

Remarks:

  • As $l \to \infty$, $E(v, l, \delta) \to 0$.
  • The larger the hypothesis space (the larger the VC dimension $v$), the greater the chance of overfitting, since there are many more hypotheses to select from $H$; correspondingly, $E(v, l, \delta)$ increases with $v$.
  • $E(v, l, \delta)$ increases as $\delta$ decreases, i.e. as the required confidence level $1 - \delta$ becomes larger.
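
The three remarks can be checked numerically; this self-contained snippet restates $E$ and the particular values of $v$, $l$ and $\delta$ are arbitrary.

```python
import numpy as np

E = lambda v, l, delta: np.sqrt((v * (np.log(2 * l / v) + 1) - np.log(delta / 4)) / l)

print([round(E(10, l, 0.05), 3) for l in (100, 1_000, 10_000, 100_000)])  # shrinks toward 0 as l grows
print([round(E(v, 1_000, 0.05), 3) for v in (5, 20, 80)])                 # grows with the VC dimension v
print([round(E(10, 1_000, d), 3) for d in (0.1, 0.01, 0.001)])            # grows as delta decreases
```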

Structural Risk Minimization (SRM)

  • The structural risk is less than or equal to the empirical risk (training error) plus the VC confidence (error bound): $R[h] \leq R_{\text{emp}}[h] + E(v, l, \delta)$.
  • For linear classifiers, $\|w\|_2^2 \propto$ VC bound.
  • $\min \text{VC bound} \Rightarrow \min \|w\|_2^2 \Rightarrow \text{max margin}$ (see the sketch below).
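
To see the $\min \|w\|_2^2 \Rightarrow$ max-margin connection in code, here is a sketch using scikit-learn's linear SVM; the toy data and the large-$C$ approximation of a hard margin are assumptions of the sketch, not part of the notes above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Toy linearly separable data in R^2.
X = np.vstack([rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

# A hard-margin linear SVM (approximated with a large C) minimises ||w||_2^2 subject to
# y_i (w . x_i + b) >= 1; the geometric margin it attains is 2 / ||w||_2.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w = clf.coef_.ravel()
print("||w||_2^2 =", w @ w, " margin 2/||w|| =", 2 / np.linalg.norm(w))
```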

Shattered and VC dimension

Shattered

  • A given training set $S$ is shattered by $H$ if and only if, for every labeling of $S$, there exists some $h \in H$ consistent with that labeling (see the brute-force sketch below).
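
A brute-force sketch of this definition for $H = $ oriented hyperplanes: enumerate every labeling and test whether some linear classifier realizes it. Using a large-$C$ linear SVM as the separability test is an assumption of the sketch, not part of the definition.

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC

def is_shattered(points):
    # S is shattered by oriented hyperplanes iff every one of the 2^m labelings is
    # realised by some linear classifier; separability of a labeling is tested by
    # whether a (near) hard-margin linear SVM fits it exactly.
    X = np.asarray(points, dtype=float)
    for labels in product([-1, 1], repeat=len(X)):
        y = np.array(labels)
        if len(set(labels)) == 1:      # all-positive / all-negative labelings are trivially realisable
            continue
        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        if np.any(clf.predict(X) != y):
            return False
    return True

print(is_shattered([[0, 0], [1, 0], [0, 1]]))            # True: a triangle in R^2 is shattered
print(is_shattered([[0, 0], [1, 0], [0, 1], [1, 1]]))    # False: the XOR labeling is not separable
```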

Theorem 1:

Consider a set of $m$ points in $\mathbb{R}^n$ and choose any one of them as the origin. Then the $m$ points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.
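
Theorem 1 gives a much cheaper test than the brute-force check above. A sketch, picking the first point as the origin (which the theorem allows):

```python
import numpy as np

def shatterable_by_hyperplanes(points):
    # Theorem 1: choose one of the m points as origin; the set can be shattered by
    # oriented hyperplanes iff the position vectors of the remaining m - 1 points
    # (relative to that origin) are linearly independent.
    pts = np.asarray(points, dtype=float)
    rest = pts[1:] - pts[0]
    return np.linalg.matrix_rank(rest) == len(pts) - 1

print(shatterable_by_hyperplanes([[0, 0], [1, 0], [0, 1]]))           # True
print(shatterable_by_hyperplanes([[0, 0], [1, 0], [2, 0]]))           # False: collinear
print(shatterable_by_hyperplanes([[0, 0], [1, 0], [0, 1], [1, 1]]))   # False: rank can be at most 2 in R^2
```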

VC-dimension

The Vapnik–Chervonenkis dimension $VC(H)$ of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ that can be shattered by $H$ (if such a largest subset exists).

Example:

$$H = \{\text{all oriented hyperplanes in } \mathbb{R}^n\} \implies VC(H) = n + 1$$
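
A quick numerical sanity check of $VC(H) = n + 1$ using the Theorem 1 criterion; the choice $n = 3$ and the particular points are arbitrary.

```python
import numpy as np

n = 3
# The n + 1 vertices of a simplex in R^n satisfy the Theorem 1 criterion, hence are shattered:
simplex = np.vstack([np.zeros(n), np.eye(n)])
print(np.linalg.matrix_rank(simplex[1:] - simplex[0]) == n)        # True -> these n + 1 points are shattered
# No set of n + 2 points can satisfy it, since at most n difference vectors
# in R^n can be linearly independent:
extra = np.vstack([simplex, np.ones(n)])
print(np.linalg.matrix_rank(extra[1:] - extra[0]) == n + 1)        # False -> n + 2 points are never shattered
```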