One-Inclusion Graphs and the Optimal Sample Complexity of PAC Learning: Part 2

We’re back with the second blog post by Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy, continuing on the optimal sample complexity of PAC learning. If you missed the first post, check it out here.


In the last blog post, we saw the transductive model of learning, the one-inclusion graph (OIG) algorithm and how this led to optimal in-expectation bounds for learning VC classes. But, we also saw that the OIG algorithm itself does not achieve non-trivial high probability risk bounds. In this blog post, we will see how to modify the one-inclusion graph algorithm to achieve high probability bounds. Towards this, we will see a connection to online learning and a technique known as online-to-batch.

Further, we will connect this back to the notion of exchangeability which, as we mentioned before, will play a key role.

High Probability PAC Bounds

From the previous blog post, recall that the one-inclusion graph algorithm achieved optimal results in expectation over the training data. That is, the algorithm achieved

\mathbb{E}_{S} \left[ \Pr_{X \sim P} \left[\widehat{f}_{\mathrm{OIG}}(X) \neq f^\star(X) \right]\right]  \le \frac{d}{n+1}.

It is natural to ask whether this would imply optimal results in the PAC model. That is, when we require

\Pr_{S} \left[ \Pr_{X \sim P} \left[\widehat{f}_{\mathrm{OIG}}(X) \neq f^\star(X) \right]  \geq \epsilon \right] \le \delta .

where \epsilon and \delta are parameters. Unfortunately, we saw in the previous blog post that directly implementing the one-inclusion graph algorithm provably does not beat just applying Markov’s inequality to the expected risk; that is, we only get \epsilon = d/(n\delta).
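Concretely, applying Markov’s inequality to the in-expectation bound above gives

\Pr_{S} \left[ \Pr_{X \sim P} \left[\widehat{f}_{\mathrm{OIG}}(X) \neq f^\star(X) \right] \geq \epsilon \right] \leq \frac{1}{\epsilon} \, \mathbb{E}_{S} \left[ \Pr_{X \sim P} \left[\widehat{f}_{\mathrm{OIG}}(X) \neq f^\star(X) \right]\right] \leq \frac{d}{(n+1)\epsilon},

which is at most \delta once \epsilon \geq d/((n+1)\delta).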

However, we will now describe a general framework that enables transferring many of these same guarantees to the PAC setting as well. The framework is a variant of the online-to-batch technique to convert strong predictors in the online setting to the batch setting of PAC learning. A key aspect of this online-to-batch conversion is that rather than building on worst case online guarantees, we will need to rely on the average performance of the online algorithm in order to get meaningful PAC guarantees.

Online-to-batch: Deterministic Error Bound

To describe the online-to-batch framework, we will use the same notation, with \mathcal{X} and \mathcal{Y} being the input and label spaces and \mathcal{F} being a function class of mappings from \mathcal{X} to \mathcal{Y}. In the online setting, a learner observes a series of samples (x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y} with the guarantee that the sequence is realizable with respect to \mathcal{F}, i.e. there exists a function f^{\star} \in \mathcal{F} such that y_i = f^{\star} (x_i) for all i. For each i, the learner must predict the label of x_i given only the data points up to index i - 1, and is penalized whenever its prediction is incorrect. That is, a learner \mathcal{A} aims to minimize the number of mistakes:

\sum_{i = 1}^n \mathbb{I} \left\{y_i \neq \mathcal{A} (x_i; \{(x_j, y_j)\}_{j = 1}^{i - 1}) \right\}.
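To make the interaction concrete, here is a minimal sketch of this protocol in Python; the learner’s `predict`/`update` interface is hypothetical, chosen purely for illustration.

```python
# Minimal sketch of the realizable online protocol. The learner
# interface (predict/update) is hypothetical, for illustration only.
def count_mistakes(learner, sequence):
    """Run the online protocol on (x, y) pairs and count mistakes.

    At step i the learner has seen only the first i - 1 labeled
    examples, predicts a label for x_i, and is then shown y_i.
    """
    mistakes = 0
    for x, y in sequence:
        prediction = learner.predict(x)  # uses only past examples
        if prediction != y:
            mistakes += 1
        learner.update(x, y)  # reveal the true label
    return mistakes
```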

Note that the key difference between the online and the PAC setting is that (x_t, y_t) can depend fully on the predictions of the learner in the past and on the previous examples (x_1, y_1), \dots, (x_{t-1}, y_{t-1}). This setting corresponds to one in which an adversary, at each time t, picks (x_t, y_t) with full knowledge of the entire history of the interaction so far, potentially with the aim of maximizing the number of mistakes made by the learner. A learner that achieves low error in this setting learns the class in a very strong sense.

Suppose now that we have access to a good prediction strategy for the online setting. Formally, a strategy \mathcal{A} that, for some deterministic and fixed M > 0, satisfies for every realizable sequence:

\sum_{i = 1}^n \mathbb{I} \left\{y_i \neq \mathcal{A} (x_i; \{(x_j, y_j)\}_{j = 1}^{i - 1}) \right\} \leq M.

Can we now use this algorithm to derive one for the PAC setting? To see how such guarantees transfer, consider the simpler setting where the sequence (X_1, Y_1), \dots, (X_n, Y_n) is such that the X_t are drawn i.i.d. from a distribution \mathcal{D} with Y_t = f^{\star} (X_t) for some function f^{\star} \in \mathcal{F}. Now, suppose we run our algorithm \mathcal{A} on this randomly generated sequence and obtain a series of predictors f^{i} (\cdot) = \mathcal{A} (\cdot ; \{(X_j, Y_j)\}_{j = 1}^{i}), so that f^{i - 1} is the predictor used at step i. How would these perform on the true underlying distribution \mathcal{D}? As a first attempt, consider the average risk of the classifiers f^{i}, where we use the fact that \mathcal{A} is a strong online predictor:

\sum_{i = 1}^n \mathcal{R} (f^{i - 1}) = \sum_{i = 1}^n \left( \mathcal{R} (f^{i - 1}) - \mathbb{I} \left\{Y_i \neq f^{i - 1} (X_i) \right\} \right) + \sum_{i = 1}^n \mathbb{I} \left\{Y_i \neq f^{i - 1} (X_i) \right\}

\leq \sum_{i = 1}^n \left( \mathcal{R} (f^{i - 1}) - \mathbb{I} \left\{ Y_i \neq f^{i - 1} (X_i) \right\} \right) + M.

The first term above corresponds to the generalization error: it compares the population risk of each f^{i - 1} to its loss on the next, fresh sample. Given the above decomposition, we are left with two issues. The first is that the above only bounds the average risk of the f^{i}. The second is obtaining a bound on the generalization term above. As we will see, the first issue is easy to resolve. For the second, it is not immediately clear why the generalization term should be small, though it is 0 in expectation: since (X_i, Y_i) is independent of (X_1, Y_1), \dots, (X_{i-1}, Y_{i-1}), the conditional expectation of the loss on (X_i, Y_i) given the past is exactly the risk \mathcal{R}(f^{i - 1}). But why should it concentrate? Each term in the sum is the difference between the population risk of f^{i - 1} and the loss evaluated on a single fresh sample drawn from \mathcal{D}. Since the sample is drawn independently at each step, this can be utilized to establish the concentration of the generalization term, as the following lemma shows.

Lemma: Let W_1, \dots, W_n be a sequence of \{0, 1\}-valued random variables adapted to a filtration \mathcal{Z}_1 \subseteq \mathcal{Z}_2 \subseteq \dots \subseteq \mathcal{Z}_n. Then, we have:

\sum_{i = 1}^n W_i \leq 4 \left( \sum_{i = 1}^n \mathbb{E} [W_i \mid \mathcal{Z}_{i - 1}] + \log (2 / \delta) \right)

\sum_{i = 1}^n W_i \geq \frac{1}{2}\sum_{i = 1}^n \mathbb{E} [W_i \mid \mathcal{Z}_{i - 1}] - 2 \log (2 / \delta)

with probability at least 1 - \delta.
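To apply the lemma in our setting, take W_i = \mathbb{I} \left\{ Y_i \neq f^{i - 1} (X_i) \right\} and let \mathcal{Z}_i be the \sigma-algebra generated by (X_1, Y_1), \dots, (X_i, Y_i), so that \mathbb{E} [W_i \mid \mathcal{Z}_{i - 1}] = \mathcal{R} (f^{i - 1}). Rearranging the second inequality of the lemma and using the mistake bound \sum_{i} W_i \leq M gives

\sum_{i = 1}^n \mathcal{R} (f^{i - 1}) = \sum_{i = 1}^n \mathbb{E} [W_i \mid \mathcal{Z}_{i - 1}] \leq 2 \sum_{i = 1}^n W_i + 4 \log (2 / \delta) \leq 2 M + 4 \log (2 / \delta).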

As a consequence, we get:

\sum_{i = 1}^n \mathcal{R} (f^{i - 1}) \leq 5 (M + \log (1 / \delta)),

with probability at least 1 - \delta. Hence, beyond the average risk of the classifiers f^i being small in expectation, it is also small with high probability. To construct a predictor from the f^i, we can simply take the majority vote. To see why \hat{f} = \mathrm{Maj} (\{f^i\}_{i = 1}^n) has small risk, observe the following:

\mathcal{R} (\hat{f}) = \mathbb{E}_{(X, Y) \sim \mathcal{D}} [\mathbb{I} \left\{\hat{f} (X) \neq Y\right\}] \leq \frac{2}{n} \mathbb{E}_{(X, Y) \sim \mathcal{D}} \left[\sum_{i = 1}^n \mathbb{I} \left\{ f^i (X) \neq Y\right\}\right] \leq 10 \left(\frac{M + \log (1 / \delta)}{n} \right),

as the majority classifier \hat{f} only makes an error when more than half of the f^i make an error. This online-to-batch conversion was first discovered by Nick Littlestone a few decades ago [Lit89].1
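As a quick illustration, here is a minimal sketch of this conversion in Python. The learner’s `predict`/`update` interface is the same hypothetical one as in the earlier sketch, and we use `copy.deepcopy` to freeze each intermediate prediction rule f^{i - 1}:

```python
import copy

# Minimal sketch of Littlestone-style online-to-batch conversion via a
# majority vote. The predict/update learner interface is hypothetical.
def online_to_batch(learner, sample):
    """Run the online learner over the sample and return Maj({f^i})."""
    predictors = []
    for x, y in sample:
        # Freeze the current prediction rule f^{i-1} before updating.
        predictors.append(copy.deepcopy(learner))
        learner.update(x, y)

    def majority_vote(x):
        votes = sum(f.predict(x) for f in predictors)  # labels in {0, 1}
        return int(2 * votes > len(predictors))

    return majority_vote
```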

Given the above result, we would be done if there existed good prediction strategies for the online setting. Unfortunately, this is not the case: even for hypothesis classes as simple as linear thresholds in one dimension, any online prediction strategy must incur error \Omega (n) on some realizable sequence.2 This precludes the possibility of obtaining good PAC guarantees from the worst-case guarantees of online learning algorithms.
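To see where this lower bound comes from, here is a sketch of the folklore bisection adversary for thresholds f_t (x) = \mathbb{I} \{x \geq t\} on [0, 1] (this construction is standard, not specific to the paper): the adversary always answers the opposite of the learner’s prediction while maintaining a nonempty interval of consistent thresholds, so the learner errs on every round and yet the sequence remains realizable.

```python
# Sketch of the folklore bisection adversary against thresholds
# f_t(x) = 1{x >= t} on [0, 1]: any deterministic online learner is
# forced to err on every round, yet the sequence stays realizable.
def adversary_vs_thresholds(learner, n):
    lo, hi = 0.0, 1.0  # thresholds t in (lo, hi] remain consistent
    mistakes = 0
    for _ in range(n):
        x = (lo + hi) / 2
        y = 1 - learner.predict(x)  # answer opposite to the learner
        # Label 1 at x means t <= x; label 0 means t > x.
        if y == 1:
            hi = x
        else:
            lo = x
        learner.update(x, y)
        mistakes += 1  # the learner erred by construction
    return mistakes  # always equals n
```

In the next section, we discuss how this limitation can be overcome in the PAC setting by exploiting the exchangeability of the input sequences, allowing us to take advantage of their beyond-worst-case nature.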

Online-to-batch: PAC Error Bound

Given the worst-case lower bound for online learning, it is natural to conclude that the online-to-batch framework cannot yield non-trivial conclusions in the PAC setting. However, a more careful look at the argument from the previous section shows that we only require an online learning algorithm that performs well on sequences generated i.i.d. from a distribution. One might hope that instances drawn from a distribution are “average-case” and hence easier, and that this alone would lead to a PAC guarantee. But it turns out that, in most cases, the choice of the instances x_1, \dots, x_n matters less than the order in which they are presented to the learner. For the online-to-batch framework of the previous section, however, the fact that the error bound M holds for any sequence (in any order) of examples is crucial. Another way of seeing this is to note that we had to bound

\sum_{i = 1}^n \mathbb{I}  \left\{Y_i \neq f^{i - 1} (X_i)\right\}

with high probability over the choice of the X_i. If one tried to apply standard martingale-type arguments to this, one would be required to bound

\mathbb{E}[\mathbb{I} \left\{Y_i \neq f^{i - 1} (X_i) \right\} | (X_1, Y_1),\dots, (X_{i-1} , Y_{i-1}) ]

in terms of the VC dimension. Note that in this argument, we would need to condition on (X_1, Y_1), \dots, (X_{i-1}, Y_{i-1}). This would nullify any simple hope of using the average-case nature of the X_i.

But let us recall what type of guarantee we have already seen from transductive learning. Note that one can interpret the OIG algorithm as an algorithm \mathcal{A} with a guarantee akin to

\frac{1}{i!} \sum_{ \pi } \mathbb{I} \left[ Y_{\pi(i)} \neq \mathcal{A} \left( X_{ \pi(i) } ; \{ (X_{ \pi( k) } , Y_{\pi(k)} ) \}_{ k \neq i } \right) \right] \leq \frac{d}{i},

where the sum is over all permutations \pi of \{1, \dots, i\}. But this seems insufficient for what we need: to naively apply the martingale argument, we don’t get to average over the choice of the test point again.

But the above way of thinking about the problem is still useful. The first high-level point is that, in order to evaluate the loss while using the average-case nature of the instances, we would ideally like to condition on as little as possible. This allows us to preserve more randomness with which to exploit the average-case nature of the problem. So, what should we condition on?

The first observation is a key property that mathematically characterizes why i.i.d. sequences are unlikely to behave like worst-case ones: exchangeability, i.e. the fact that an i.i.d. sequence has the same distribution when presented in a random order. Recall that a sequence of random variables (X_1, \dots, X_n) is exchangeable if for any permutation \pi, (X_{\pi (1)}, \dots, X_{\pi (n)}) has the same distribution as (X_1, \dots, X_n). Since each element of an i.i.d. sequence is drawn independently from \mathcal{D}, i.i.d. sequences are exchangeable. This hints at the fact that, in order to evaluate quantities under an i.i.d. sequence, it might suffice to condition on the values \{(x_1, y_1), \dots, (x_n, y_n)\} appearing in the sequence (X_1, Y_1), \dots, (X_n, Y_n), but not on their order.

We have observed that for the X_i, the order does not matter. But this alone is not sufficient. Whatever we decide to condition on, we would still like to be able to evaluate the loss \mathbb{I} \left[ Y_j \neq \mathcal{A} (X_j ; \{ (X_1, Y_1) , \dots, (X_{j-1} , Y_{j-1}) \} ) \right]. This could a priori depend on the order of (X_1, Y_1), \dots, (X_{j-1}, Y_{j-1}), since the algorithm is allowed to use the data set in whatever order it likes. But wait! We noted earlier that the OIG algorithm does not depend on the order, i.e. it is permutation invariant. Thus, it seems we should condition on the set of data points \{X_1, \dots, X_i\}, and importantly not on their order.

To formalize this notion, we introduce some notation that clarifies the conditioning in our argument. Let \pi be a random permutation of \{1, \dots, n\}. Let f^i be the predictor produced from the first i examples (x_{\pi(1)}, y_{\pi(1)}), \dots, (x_{\pi(i)}, y_{\pi(i)}) by some prediction algorithm that we will choose later. Further, define

\mathcal{Z}_i = \{\pi (1), \dots, \pi (i)\} \text{ and } W_i = \mathbb{I} \left\{f^i (x_{\pi (i + 1)}) \neq y_{\pi(i+1)} \right\}.

The random quantity \mathcal{Z}_i in the above expression records which elements appear in the first i positions of x_{\pi (1)}, \dots, x_{\pi (n)}, but crucially not their order, nor the (i + 1)^{th} element. In order for this conditioning to be useful, W_i should ideally depend only on this set of elements (together with the test point), and not on their order.

This is precisely the case when f^i is permutation invariant: conditioned on \mathcal{Z}_i, the function f^i is determined, since it does not depend on the ordering of the first i elements.

We can now apply martingale arguments to the W_i. In order to do this, we need to evaluate \mathbb{E} \left[W_{i - 1} \mid \mathcal{Z}_i \right]. We have:

\mathbb{E} \left[ W_{i - 1} \mid \mathcal{Z}_i \right] \leq M_i

where M_i is the expected error of f^{i - 1} on a test point chosen uniformly at random from \{ x_{\pi(1)} , \dots , x_{\pi(i)}\}, trained on the remaining i - 1 points.

This is due to the fact that, conditioned on \mathcal{Z}_i, the test point for W_{i - 1}, namely x_{\pi (i)}, is distributed uniformly among the elements of the set \{x_{\pi (1)}, \dots, x_{\pi (i)}\}. The key observation is that this exactly corresponds to the error of f^{i - 1} in the transductive model. Thus, we just need to choose a permutation-invariant algorithm that works in the transductive model. As we saw earlier, this is exactly the one-inclusion graph algorithm, for which we have the bound:

\mathbb{E} \left[ W_{i - 1} \mid \mathcal{Z}_i \right] \leq \frac{d}{i}.
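Spelled out in our notation: conditioned on \mathcal{Z}_i, the index \pi(i) is uniform over \mathcal{Z}_i, so

\mathbb{E} \left[ W_{i - 1} \mid \mathcal{Z}_i \right] = \frac{1}{i} \sum_{j \in \mathcal{Z}_i} \mathbb{I} \left\{ \mathcal{A} \left( x_j ; \{ (x_k, y_k) \}_{k \in \mathcal{Z}_i \setminus \{j\}} \right) \neq y_j \right\},

which is exactly the leave-one-out (transductive) error of \mathcal{A} on the i points indexed by \mathcal{Z}_i, precisely the quantity that the one-inclusion graph guarantee controls.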

Furthermore, since each W_i is a function only of the set of elements \{x_{\pi (1)}, \dots, x_{\pi (i)}\} and the test point x_{\pi (i + 1)}, we may apply our concentration inequality again to obtain the following bound on the risks:

\frac{1}{n} \sum_{i = 1}^n \mathcal{R} (f^{i - 1}) \leq O \left( \frac{1}{n} \sum_{i = 1}^n \frac{d}{i} + \frac{\log(1/\delta)}{n} \right) = O \left( \frac{d \log n + \log(1/\delta)}{n} \right)

with probability at least 1 - \delta. But this does not seem to improve over the bound for ERM that we discussed earlier.

One final trick fixes this issue: consider the predictors f^{i} only for i > n/2. This still allows us to average over n/2 predictors, but now we have

\frac{2}{n} \sum_{i > n/2} \mathcal{R} (f^{i - 1}) \leq O \left( \frac{1}{n} \sum_{i>n/2} \frac{d}{i} + \frac{\log(1/\delta)}{n} \right) \leq O \left(\frac{d + \log(1/\delta) }{n} \right).
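The last inequality follows from a one-line calculation: the sum has at most n/2 terms, each with i > n/2, so

\frac{1}{n} \sum_{i > n/2} \frac{d}{i} \leq \frac{1}{n} \cdot \frac{n}{2} \cdot \frac{2d}{n} = \frac{d}{n}.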

Hence, by simply exploiting the permutation invariance of the distribution generating the input sequence, we obtain a high-probability bound on the mistake count that is substantially better than the worst-case bound incurred on arbitrary realizable input sequences. This mistake bound, together with the discussion in the previous section, now yields PAC learners with optimal high-probability guarantees.

Theorem 3 [Optimal Sample Complexity of PAC Learning, [AACSZ23a]] Let \mathcal{X} be a domain and \mathcal{F} be a class of functions from \mathcal{X} to \{0,1\} with VC dimension d. Let \epsilon, \delta \in (0,1). Let

n = \Omega\left( \frac{d + \log \left( 1 / \delta \right) }{\epsilon} \right).


Let S = \left\{ (x_i , f^{\star} (x_i) ) \right\}_{i = 1}^n be a sample of size n with x_1, \dots, x_n drawn i.i.d. from \mathcal{D} and f^{\star} \in \mathcal{F}. There is a prediction strategy \hat{f} such that, with probability at least 1 - \delta over the choice of the training set S, we have

\Pr_{x \sim \mathcal{D} } \left[ \hat{f}(x) \neq f^{\star}(x) \right] \leq \epsilon.
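The prediction strategy in the theorem admits a short description. Below is a minimal sketch in Python, where `oig_predict(x, labeled_data)` stands in as a hypothetical black box for the (permutation-invariant) one-inclusion graph prediction; only its permutation invariance and transductive guarantee matter for the argument.

```python
import random

# Minimal sketch of the PAC learner from the theorem: a majority vote
# over OIG predictors trained on prefixes of length greater than n/2.
# oig_predict(x, labeled_data) is a hypothetical black-box,
# permutation-invariant one-inclusion graph predictor.
def pac_learner(sample, oig_predict):
    n = len(sample)
    data = list(sample)
    random.shuffle(data)  # harmless: an i.i.d. sample is exchangeable

    # The prefix data sets defining f^i for i > n/2.
    prefixes = [data[:i] for i in range(n // 2 + 1, n + 1)]

    def majority_vote(x):
        votes = sum(oig_predict(x, prefix) for prefix in prefixes)
        return int(2 * votes > len(prefixes))

    return majority_vote
```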

Conclusions

In this pair of blog posts, we discussed the problem of binary classification in both the PAC and transductive models of learning. We talked about an elegant learning algorithm in the transductive learning model based on the idea of one-inclusion graphs. We then saw how the classic notion of online-to-batch conversion can be modified in a simple way in order to port strong learning guarantees from the transductive model to the PAC model in an optimal way. Though the focus of this blog post was on binary classification, the ideas we discussed naturally extend to other settings, such as multiclass classification and regression, and these techniques can thus be seen as a unifying perspective on optimal PAC bounds in various models of realizable learning. For more details, check out the full paper [AACSZ23a] (and maybe also [AACSZ23b]).

An interesting direction for future study is whether the techniques presented here could lead to proofs of optimal PAC bounds in the agnostic setting, where all known such bounds go through so-called chaining techniques. It would be interesting to develop techniques that unify approaches to obtaining optimal PAC bounds in both the realizable and agnostic cases.

References

  • [Lit89] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269–284. Morgan Kaufmann, 1989.
  • [AACSZ23a] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. Optimal PAC bounds without uniform convergence, In Foundations of Computer Science (FOCS), 2023.
  • [AACSZ23b] Ishaq Aden-Ali, Yeshwanth Cherapanamjeri, Abhishek Shetty, and Nikita Zhivotovskiy. The one-inclusion graph algorithm is not always optimal. In Conference on Learning Theory (COLT), 2023.
  1. The argument presented here is slightly simpler. In his original argument, Littlestone used a small holdout set to pick the best hypothesis in the sequence instead of taking a majority vote. ↩︎
  2. The notion of complexity, analogous to the VC dimension, that captures this setting is known as the Littlestone dimension. It turns out that even for simple classes such as the set of thresholds, this complexity is unbounded. ↩︎
