Learning
Probably Approximately Correct Learning (PAC)
Key assumptions:
- Training and testing data are generated i.i.d. according to a fixed but unknown distribution $\mathbf D$.
- The “quality” of a hypothesis (classification function) $h \in H$ should take the unknown distribution $\mathbf D$ into account, i.e. it should measure the “average error” made by $h$ on examples drawn from $\mathbf D$.
- This quality is measured by a risk functional (spelled out below).
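A standard way to make the last point precise is to take the risk functional to be the expected loss of $h$ under $\mathbf D$:
\[R(h)=\int L\bigl(y,h(x)\bigr)\,d\mathbf D(x,y),\]
where $L$ is a loss function chosen for the task; for binary classification the 0-1 loss is used, so $R(h)$ becomes the probability that $h$ misclassifies an example drawn from $\mathbf D$.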
Generalization Error of the PAC Model
- Let $S=\{(x_1,y_1),\dots,(x_l,y_l)\}$ be a set of training examples drawn i.i.d. according to $\mathbf D$.
- Treat the generalization error $\text{err}_{\mathbf D}(h_S)$ of the hypothesis $h_S$ learned from $S$ as a random variable depending on the random selection of $S$.
- Find a bound on the tail of the distribution of the random variable $\text{err}_{\mathbf D}(h_S)$ in the form $\varepsilon=\varepsilon(l,H,\delta)$.
- $\varepsilon=\varepsilon(l,H,\delta)$ is a function of $l$, $H$ and $\delta$, where $1-\delta$ is the confidence level of the error bound and is chosen by the learner.
Probably Approximately Correct
\[\text{Pr}\bigl(\{\, \text{err}_{\mathbf D}(h_S) > \varepsilon(l,H,\delta) \,\}\bigr)< \delta\]
- $\varepsilon(l,H,\delta)$ does not depend on the unknown distribution $\mathbf D$.
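For a finite hypothesis space, a bound of exactly this form can be obtained by a standard argument (Hoeffding's inequality plus a union bound over $H$, stated here without proof):
\[\text{Pr}\Bigl(\sup_{h\in H}\bigl|\text{err}_{\mathbf D}(h)-\text{err}_{\text{emp}}(h)\bigr|>\varepsilon\Bigr)\le 2|H|\,e^{-2l\varepsilon^{2}},\]
and setting the right-hand side to $\delta$ gives
\[\varepsilon(l,H,\delta)=\sqrt{\frac{\ln|H|+\ln(2/\delta)}{2l}},\]
which depends only on $l$, $H$ and $\delta$, not on $\mathbf D$. Here $\text{err}_{\text{emp}}(h)$ is the empirical (training) error defined in the ERM section below.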
Find the Minimum Expected Risk
- The expected misclassification error made by $h \in H$ is
\[\text{err}_{\mathbf D}(h)=\int \tfrac{1}{2}\,\bigl|h(x)-y\bigr|\,dp(x,y)=\Pr_{(x,y)\sim \mathbf D}\bigl[h(x)\neq y\bigr]\]
for labels $y \in \{-1,+1\}$.
The Ideal Hypothesis
- The ideal hypothesis $h^{*}_{\text{opt}}$ should have the smallest expected risk:
\[h^{*}_{\text{opt}}=\arg\min_{h \in H}\ \text{err}_{\mathbf D}(h)\]
Empirical Risk Minimization (ERM)
- Replace the expected risk over $p(x,y)$ by an average over the training examples.
- The empirical risk is
\[\text{err}_{\text{emp}}(h)=\frac{1}{l}\sum_{i=1}^{l}\tfrac{1}{2}\,\bigl|h(x_i)-y_i\bigr|\]
- Focusing only on the empirical risk causes overfitting (see the numerical sketch below).
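A minimal numerical sketch of this effect (using squared loss and polynomial hypotheses purely for illustration; the qualitative point does not depend on those choices): a rich hypothesis class can drive the empirical risk to zero while the true risk, estimated on a large fresh sample from the same distribution, stays large.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n examples (x, y) i.i.d. from a synthetic distribution D."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = sample(10)     # small training set S
x_test, y_test = sample(10_000)   # large fresh sample approximates err_D

for degree in (1, 9):
    # ERM over polynomials of the given degree (least squares fit)
    coeffs = np.polyfit(x_train, y_train, degree)
    emp_risk = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    true_risk = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: empirical risk {emp_risk:.4f}, estimated true risk {true_risk:.4f}")

# The degree-9 polynomial interpolates the 10 training points (empirical risk ~ 0)
# but its true risk is far larger than that of the degree-1 fit:
# minimizing empirical risk alone overfits.
```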
VC Confidence
- The following inequality holds with probability $1-\delta$, where $v$ is the VC dimension of $H$ (defined in the last section below):
\[\text{err}_{\mathbf D}(h)\;\le\;\text{err}_{\text{emp}}(h)+\sqrt{\frac{v\left(\log\frac{2l}{v}+1\right)-\log\frac{\delta}{4}}{l}}\]
Denote the second term by
\[E(v,l,\delta)=\sqrt{\frac{v\left(\log\frac{2l}{v}+1\right)-\log\frac{\delta}{4}}{l}}\]
Remarks:
- As $l \rightarrow \infty$, $E(v,l,\delta) \rightarrow 0$.
- The larger the hypothesis space (the larger $v$), the larger $E(v,l,\delta)$ and the greater the chance of overfitting, since with many more hypotheses to choose from in $H$ it becomes more likely that one fits the training data well by chance.
- $E(v,l,\delta)$ increases as $\delta$ decreases, i.e. as we ask for a larger confidence level $1-\delta$ (the sketch below evaluates $E$ numerically).
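A small sketch that simply evaluates $E(v,l,\delta)$ (natural logarithms assumed) to make the three remarks concrete:

```python
import math

def vc_confidence(v, l, delta):
    """E(v, l, delta) = sqrt((v * (ln(2l/v) + 1) - ln(delta/4)) / l)."""
    return math.sqrt((v * (math.log(2 * l / v) + 1) - math.log(delta / 4)) / l)

# Remark 1: E shrinks toward 0 as the sample size l grows.
print([round(vc_confidence(10, l, 0.05), 3) for l in (100, 1_000, 10_000, 100_000)])
# Remark 2: E grows with the VC dimension v, i.e. with a richer hypothesis space.
print([round(vc_confidence(v, 1_000, 0.05), 3) for v in (5, 10, 50, 200)])
# Remark 3: E grows as delta shrinks, i.e. as the confidence level 1 - delta approaches 1.
print([round(vc_confidence(10, 1_000, d), 3) for d in (0.1, 0.05, 0.01, 0.001)])
```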
Structural Risk Minimization (SRM)
- The structural risk is less than or equal to the empirical risk (training error) plus the VC confidence term $E(v,l,\delta)$ (the error bound).
- $\Vert w \Vert_2^2 \propto \text{VC bound}$ (made precise below)
- $\min \text{VC bound} \Leftrightarrow \min \Vert w \Vert_2^2 \Leftrightarrow \text{Max margin}$
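The relation between $\Vert w \Vert_2^2$ and the VC bound can be stated more precisely by a standard result for margin classifiers (quoted here without proof): if the training points lie inside a ball of radius $R$ and we only consider canonical separating hyperplanes with $\Vert w \Vert_2 \le A$ (so the margin is at least $1/A$), then the VC dimension of this restricted hypothesis class satisfies
\[\text{VC}\;\le\;\min\bigl(\lceil R^{2}A^{2}\rceil,\;n\bigr)+1.\]
Hence a smaller $\Vert w \Vert_2^2$ means a larger margin and a smaller VC bound, which shrinks $E(v,l,\delta)$ and with it the structural risk.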
Shattering and VC Dimension
Shattered
- A given training set $S$ is shattered by $H$ iff for every labeling of $S$, $\exists h \in H$ consistent with this labeling.
Theorem 1:
Consider some set of $m$ points in $\mathbb R^n$ and choose any one of the points as the origin. Then
the $m$ points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent.
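As an illustration: for three non-collinear points in $\mathbb R^2$, choose any one of them as the origin; the position vectors of the other two are linearly independent, so the three points can be shattered. For any four points in $\mathbb R^2$, the three remaining position vectors can never be linearly independent, so no set of four points is shattered. This is consistent with $\text{VC}(H)=n+1=3$ for hyperplanes in $\mathbb R^2$, as in the example below.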
VC-dimension
The Vapnik-Chervonenkis dimension $\text{VC}(H)$ of a hypothesis space $H$ defined over the input space $X$ is the size of the largest finite subset of $X$ shattered by $H$ (if arbitrarily large finite subsets of $X$ can be shattered, $\text{VC}(H)=\infty$).
Example:
\[H=\{ \text{all hyperplanes in } \mathbb R^n \}\Rightarrow \text{VC}(H)=n+1\]
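A brute-force sketch of this example for $n=2$, assuming NumPy and SciPy are available: each $\pm 1$ labeling of a point set is checked for linear separability by testing feasibility of $y_i(w\cdot x_i+b)\ge 1$ with a linear program. Three non-collinear points are shattered, while the four XOR points are not (this illustrates, but does not by itself prove, that no four points in $\mathbb R^2$ can be shattered).

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True iff some oriented hyperplane (w, b) realizes the labeling y_i (w.x_i + b) >= 1."""
    n = points.shape[1]
    # Variables: (w_1, ..., w_n, b).  Constraints: -y_i * (w.x_i + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    return res.success  # feasible <=> this labeling is linearly separable

def shattered(points):
    """True iff every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(points)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])          # non-collinear
xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # XOR configuration

print(shattered(triangle))  # True:  3 points shattered, consistent with VC = 2 + 1 = 3
print(shattered(xor))       # False: this 4-point set cannot be shattered
```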