Learning
Probably Approximately Correct Learning (PAC)
Key assumptions:
- Training and testing data are generated i.i.d. according to a fixed but unknown distribution D.
- The "quality" of a hypothesis (classification function) h∈H should take the unknown distribution D into account, i.e., the "average error" made by h∈H over D.
- This is captured by a risk functional: the quality of h is its expected loss under D (a small estimation sketch follows).
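To make this concrete, here is a minimal Monte Carlo sketch. The distribution (Gaussian x with 10% label noise) and the threshold hypothesis are illustrative assumptions of mine, not part of the PAC setup itself; the point is only that the "quality" of h is an average over the unknown D.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_D(m):
    """Draw m i.i.d. examples from a fixed (but, to the learner,
    unknown) D: x ~ N(0,1), clean label sign(x), 10% label noise."""
    x = rng.normal(size=m)
    y = np.sign(x)
    flip = rng.random(m) < 0.10          # label noise keeps err_D(h) > 0
    y[flip] = -y[flip]
    return x, y

def h(x, t=0.3):
    """A hypothetical threshold hypothesis h(x) = sign(x - t)."""
    return np.where(x > t, 1.0, -1.0)

# The risk of h is its average 0-1 loss under D, estimated by sampling.
x, y = sample_D(1_000_000)
print("estimated err_D(h):", np.mean(h(x) != y))
```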
Generalization Error of the PAC Model
- Let S = ((x₁, y₁), …, (x_l, y_l)) be a set of l training examples drawn i.i.d. according to D.
- Treat the generalization error err_D(h_S) as a random variable depending on the random selection of S.
- Find a bound on the tail of the distribution of the random variable err_D(h_S) in the form ε = ε(l, H, δ).
- ε = ε(l, H, δ) is a function of l, H, and δ, where 1−δ is the confidence level of the error bound, chosen by the learner.
Probably Approximately Correct
$$\Pr\{\,\mathrm{err}_D(h_S) > \varepsilon(l, H, \delta)\,\} < \delta$$

- ε(l, H, δ) does not depend on the unknown distribution D. A small simulation of this statement follows.
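Below is a simulation under toy assumptions of my choosing (D uniform on [0,1], a threshold concept, a consistent learner, none of which come from the notes); it estimates the tail probability Pr{err_D(h_S) > ε} over repeated draws of S.

```python
import numpy as np

rng = np.random.default_rng(1)
l, eps, trials = 100, 0.05, 2000

def learn(x, y):
    """A consistent learner: threshold at the largest x labeled -1."""
    neg = x[y < 0]
    return neg.max() if neg.size else 0.0

bad = 0
for _ in range(trials):
    x = rng.random(l)                      # D = uniform on [0, 1]
    y = np.where(x > 0.5, 1, -1)           # true concept: threshold at 0.5
    t = learn(x, y)
    bad += abs(t - 0.5) > eps              # err_D(h_t) = |t - 0.5| here

# For this toy class, Pr{err > eps} <= (1 - eps)^l ~= 0.006 at l = 100,
# eps = 0.05 -- a bound of the form eps(l, H, delta) that holds
# regardless of the details of D.
print("empirical Pr{ err_D(h_S) > eps }:", bad / trials)
```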
Find the Minimum Expected Risk
- The expected misclassification error (expected risk) made by h∈H is
$$R(h) = \mathbb{E}_{(x,y)\sim D}\left[\mathbf{1}\{h(x) \neq y\}\right] = \int \mathbf{1}\{h(x) \neq y\}\, p(x, y)\, dx\, dy$$
The Ideal Hypothesis
- The ideal hypothesis h*_opt is the one with the smallest expected risk:
$$h^{*}_{\mathrm{opt}} = \arg\min_{h \in H} R(h)$$
Empirical Risk Minimization (ERM)
- Replace the expected risk over p(x, y) by an average over the training examples.
- The empirical risk is
$$R_{\mathrm{emp}}(h) = \frac{1}{l}\sum_{i=1}^{l} \mathbf{1}\{h(x_i) \neq y_i\}$$
- Focusing only on the empirical risk causes overfitting, as the sketch below illustrates.
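A minimal sketch of that overfitting remark, with an assumed toy setup (a 1-nearest-neighbour predictor and 30% label noise, both my choices): the hypothesis memorises the sample, so the empirical risk is zero, while the true risk stays far higher.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m, noise=0.3):
    """x uniform on [-1, 1], clean label sign(x), 30% label noise."""
    x = rng.uniform(-1, 1, size=m)
    y = np.where(x > 0, 1, -1)
    flip = rng.random(m) < noise
    return x, np.where(flip, -y, y)

x_tr, y_tr = sample(200)
x_te, y_te = sample(10_000)

def h_1nn(x):
    """Predict the label of the nearest training point (pure memorisation)."""
    idx = np.abs(x[:, None] - x_tr[None, :]).argmin(axis=1)
    return y_tr[idx]

print("empirical risk:", np.mean(h_1nn(x_tr) != y_tr))    # 0.0 exactly
print("true risk (est):", np.mean(h_1nn(x_te) != y_te))   # roughly 0.42
```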
VC Confidence
- The following inequality holds with probability 1−δ:
$$R(h) \le R_{\mathrm{emp}}(h) + E(v, l, \delta)$$
where v = VC(H) (defined below) and
$$E(v, l, \delta) = \sqrt{\frac{v\,(\log(2l/v) + 1) - \log(\delta/4)}{l}}$$
Remarks:
- As l→∞, E(v,l,δ)→0.
- As the hypothesis space grows, the chance of overfitting increases: with many more hypotheses to choose from in H, the probability that some of them fit the sample by chance goes up. Accordingly, E(v, l, δ) grows with v.
- E(v, l, δ) increases as δ decreases, i.e., as the required confidence 1−δ becomes larger (see the helper below).
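A small helper that simply transcribes the formula for E(v, l, δ) above illustrates all three remarks:

```python
import numpy as np

def vc_confidence(v, l, delta):
    """E(v, l, delta) = sqrt((v*(log(2l/v) + 1) - log(delta/4)) / l)."""
    return np.sqrt((v * (np.log(2 * l / v) + 1) - np.log(delta / 4)) / l)

for l in (100, 10_000, 1_000_000):                 # remark 1: shrinks as l grows
    print(f"l={l:>9}: E = {vc_confidence(10, l, 0.05):.4f}")
print(vc_confidence(50, 10_000, 0.05))             # remark 2: grows with v
print(vc_confidence(10, 10_000, 0.001))            # remark 3: grows as delta falls
```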
Structural Risk Minimization (SRM)
- The structural risk is less than or equal to the empirical risk (training error) plus the VC confidence bound: R(h) ≤ R_emp(h) + E(v, l, δ).
- For hyperplane classifiers, ‖w‖₂² ∝ VC bound.
- min VC bound ⇔ min ‖w‖₂² ⇔ max margin, as the derivation below shows.
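The last equivalence can be spelled out. Assuming the usual canonical-hyperplane normalization min_i |⟨w, x_i⟩ + b| = 1 (a convention, not something stated above), each constraint y_i(⟨w, x_i⟩ + b) ≥ 1 gives

$$\mathrm{dist}(x_i, \text{hyperplane}) = \frac{\lvert\langle w, x_i\rangle + b\rvert}{\|w\|_2} \;\ge\; \frac{1}{\|w\|_2}, \qquad \text{margin} = \frac{2}{\|w\|_2}$$

so maximizing the margin is exactly minimizing ‖w‖₂².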
Shattering and VC Dimension
Shattering
- A given training set S is shattered by H iff for every labeling of S, ∃h∈H consistent with this labeling.
Theorem 1:
Consider some set of m points in R^n and choose any one of them as the origin. Then the m points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent (a sketch of this test follows).
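Theorem 1 yields a direct computational test. A minimal numpy sketch (the function name and example points are mine, for illustration):

```python
import numpy as np

def shatterable_by_hyperplanes(points):
    """points: (m, n) array. Apply Theorem 1: take points[0] as the
    origin and test linear independence of the remaining position
    vectors via the matrix rank."""
    vecs = points[1:] - points[0]
    return np.linalg.matrix_rank(vecs) == len(vecs)

tri  = np.array([[0., 0.], [1., 0.], [0., 1.]])           # 3 points in R^2
quad = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]]) # 4 points in R^2
print(shatterable_by_hyperplanes(tri))    # True:  2 independent vectors
print(shatterable_by_hyperplanes(quad))   # False: 3 vectors cannot be
                                          # independent in R^2
```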
VC-dimension
The Vapnik–Chervonenkis dimension VC(H) of a hypothesis space H defined over the input space X is the size of the largest finite subset of X shattered by H; if arbitrarily large finite subsets of X can be shattered, VC(H) is infinite.
Example:
$$H = \{\text{all hyperplanes in } \mathbb{R}^n\} \;\Rightarrow\; VC(H) = n + 1$$
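For n = 2 the example can be spot-checked by brute force. The sketch below (assuming scipy is available; all names are illustrative) tests every labeling for linear separability via an LP feasibility problem: three affinely independent points are shattered, while the four corners of the unit square are not (the XOR labeling fails). Note this only certifies VC(H) ≥ 3 and exhibits one non-shattered 4-point set; the full claim VC(H) = n + 1 also needs that no set of n + 2 points can be shattered.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """LP feasibility: does some (w, b) give y_i (w.x_i + b) >= 1 for all i?"""
    m, n = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])   # variables (w, b)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1))
    return res.success

def shattered(X):
    """True iff every labeling of X is realised by some hyperplane."""
    return all(separable(X, np.array(lab))
               for lab in itertools.product((-1, 1), repeat=len(X)))

three = np.array([[0., 0.], [1., 0.], [0., 1.]])
four  = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(shattered(three))   # True:  so VC(H) >= 3 in R^2
print(shattered(four))    # False: the XOR labeling is not separable
```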