Rademacher Complexity Bounds for a Penalized Multiclass Semi-Supervised Algorithm

We propose Rademacher complexity bounds for multiclass classifiers trained with a two-step semi-supervised model. In the first step, the algorithm partitions the partially labeled data and then identifies dense clusters containing $\kappa$ predominant classes using the labeled training examples, such that the proportion of their non-predominant classes is below a fixed threshold. In the second step, a classifier is trained by minimizing a margin empirical loss over the labeled training set and a penalization term measuring the inability of the learner to predict the $\kappa$ predominant classes of the identified clusters. The resulting data-dependent generalization error bound involves the margin distribution of the classifier, the stability of the clustering technique used in the first step, and Rademacher complexity terms corresponding to the partially labeled training data. Our theoretical results exhibit convergence rates extending those proposed in the literature for the binary case, and experimental results on different multiclass classification problems show empirical evidence that supports the theory.


Introduction
Learning with partially labeled data, or semi-supervised learning (SSL), has been an active field of study in the machine learning community over the past twenty years. In this setting, labeled examples are usually assumed to be too few to train an efficient supervised model, while unlabeled training examples contain valuable information on the prediction problem at hand, whose exploitation may lead to a better prediction function. For this scenario, we assume available a set of labeled training examples S_ℓ = (x_i, y_i)_{1≤i≤n} ∈ (X × Y)^n drawn i.i.d. with respect to a fixed but unknown probability distribution D over X × Y, and a set of unlabeled training examples S_u = (x_{n+i})_{1≤i≤u} ∈ X^u assumed to be drawn from the marginal distribution D_X over the domain X. If S_u is empty, the problem reduces to the supervised learning framework. The other extreme case corresponds to the situation where S_ℓ is empty, for which the problem reduces to unsupervised learning.
The issue of learnability with partially labeled data has been studied under three related yet different hypotheses, namely the smoothness assumption, the cluster assumption and low density separation (Chapelle, Schölkopf, & Zien, 2006; Zhu, 2005), and many advances have been made on both the algorithmic and theoretical fronts under these settings.
Although the classification problems for which the design of SSL techniques is appealing are multiclass in nature, the majority of theoretical results for semi-supervised learning have considered the binary case (Kääriäinen, 2005; Leskes, 2005; Amini, Laviolette, & Usunier, 2008a; El-Yaniv & Pechyony, 2009; Balcan & Blum, 2010; Urner, Shalev-Shwartz, & Ben-David, 2011). In this paper, we tackle the learning ability of multiclass classifiers trained on partially labeled data by first identifying dense clusters covering labeled and unlabeled examples, and then minimizing an objective composed of the margin empirical loss of the classifier over the labeled training set and a penalization term measuring the inability of the learner to predict the predominant classes of the dense clusters.
Our main result is a data-dependent generalization error bound for classifiers trained under this setting. The bound exhibits a complexity term depending on the effectiveness of the clustering technique in finding homogeneous regions of examples belonging to each class, the margin distribution of the classifiers, and the Rademacher complexities of the class of functions in use, defined over labeled and unlabeled data. The convergence rates deduced from the bound extend those proposed in the literature for the binary case. Further, experiments carried out on text and image classification problems show that the proposed approach yields improved classification performance compared to extensions of state-of-the-art SSL algorithms to the multiclass classification case.
In the following section, we first define our framework and then the learning task we address. Section 3 presents the Rademacher generalization bound for a classifier trained with the proposed algorithm. Section 4 positions our theoretical findings with respect to the state of the art, and finally, Section 5 details experimental results that support this approach.

Penalized semi-supervised multiclass classification
We are interested in the study of multiclass classification problems where the output space is Y = {1, . . ., K}, with K > 2. The semi-supervised multiclass classification algorithm that we consider is tailored under the cluster assumption and operates in two steps, described in the following sections.

Partitioning of data and identifying κ-uniformly bounded clusters with level η
The first step consists in partitioning the unlabeled training observations into G > 0 separate clusters with a clustering algorithm A trained on S_u; the resulting partition is denoted by Π_{S_u}.
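As an illustration of this first step, the following minimal sketch produces a partition Π_{S_u} from the unlabeled examples; scikit-learn's KMeans is used only as a stand-in for the generic algorithm A (the experiments in Section 5 use the Nearest Neighbor Clustering technique of Bubeck & Luxburg, 2009), and the function name is illustrative.

```python
# Minimal sketch of the first step: produce the partition Pi_{S_u} of the
# unlabeled sample. KMeans is a stand-in for the generic clustering algorithm A.
from sklearn.cluster import KMeans

def partition_unlabeled(X_unl, G, seed=0):
    """Fit the clustering algorithm on S_u and return (fitted model,
    cluster index in {0, ..., G-1} of every unlabeled example)."""
    algo = KMeans(n_clusters=G, random_state=seed, n_init=10).fit(X_unl)
    return algo, algo.labels_
```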
Clusters of Π_{S_u} that are well covered by classes in the labeled training set are then kept for learning the classifier (Section 2.2). Formally, for a fixed κ ∈ {1, . . ., K}, let Y_κ(C) be the κ most predominant classes of Y present in cluster C ∈ Π_{S_u}. We then define the κ-uniformly bounded clusters with level η, C_κ(η), as the set of clusters within Π_{S_u} that are covered by their κ most predominant classes, in the sense that the proportion of other classes within C not belonging to Y_κ(C) is at most η/G:

C_κ(η) = { C ∈ Π_{S_u} : P_n({(x, y) ∈ S_ℓ : x ∈ C ∧ y ∉ Y_κ(C)}) ≤ η/G },     (1)

where P_n is the uniform probability distribution over S_ℓ, defined for any subset B ⊆ S_ℓ as P_n(B) = (1/n) card(B).

Table 1: Summary of the notation used throughout the paper.
- ∆_n(A_Z, A_{Z'}, S): distance between two clusterings A_Z and A_{Z'} estimated over a sample S (Eq. 9)
- Π_{S_u}: partition of the unlabeled set obtained by A_{S_u}
- Π⋆: limit clustering of the input space obtained by A⋆, a particular instantiation of A
- C_κ(η): the set of κ-uniformly bounded clusters (Eq. 1)
- m_h(x, y): the margin of an example (x, y) over the whole set Y (Eq. 3)
- m'_h(x, Y_κ(C)): the margin of an unlabeled example taken with respect to Y_κ(C) (Eq. 7)
- µ_h(x) = argmax_{y∈Y} h(x, y): the class prediction of h ∈ R^{X×Y} for an example x
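The identification of C_κ(η) from Eq. (1) can be sketched as follows; the function name and arguments are illustrative, and the cluster indices of the labeled examples are assumed to come from the partition of the previous sketch (e.g. algo.predict(X_lab)).

```python
from collections import Counter

def kappa_bounded_clusters(assign_lab, y_lab, G, kappa, eta):
    """Return {cluster index: its kappa predominant classes Y_kappa(C)} for the
    clusters kept in C_kappa(eta), following Eq. (1).

    assign_lab : cluster index in {0, ..., G-1} of every labeled example.
    y_lab      : class label of every labeled example.
    """
    n = len(y_lab)
    confident = {}
    for c in range(G):
        labels_in_c = [y for a, y in zip(assign_lab, y_lab) if a == c]
        if not labels_in_c:
            continue
        counts = Counter(labels_in_c)
        predominant = [y for y, _ in counts.most_common(kappa)]  # Y_kappa(C)
        # proportion, under the uniform distribution P_n over S_l, of labeled
        # examples of C whose class is not among the kappa predominant ones
        n_off = sum(cnt for y, cnt in counts.items() if y not in predominant)
        if n_off / n <= eta / G:
            confident[c] = predominant
    return confident
```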

Learning objective
In the second step, we address the learning problem of finding, in a hypothesis set H ⊆ R^{X×Y}, a scoring function h ∈ H with low risk:

R(h) = E_{(x,y)∼D} [½_{m_h(x,y) ≤ 0}],     (2)

where ½_π is the indicator function of the predicate π and m_h(x, y) is the margin of the function h at an example (x, y) (Koltchinskii & Panchenko, 2002):

m_h(x, y) = h(x, y) − max_{y'∈Y\{y}} h(x, y').     (3)

This is achieved by minimizing a penalized empirical loss, defined for a given ρ > 0 (Eq. 4), composed of an empirical margin loss of h ∈ H on the labeled training set S_ℓ (Eq. 5) and a penalization term (Eq. 6) that reflects the ability of the hypothesis h ∈ H to identify the κ most predominant classes within the disjoint clusters of C_κ(η), where

m'_h(x, Y_κ(C)) = max_{y∈Y_κ(C)} h(x, y) − max_{y∈Y\Y_κ(C)} h(x, y)     (7)

is the margin of an unlabeled example x ∈ C taken with respect to the set of κ predominant classes Y_κ(C), and Φ_ρ : R → [0, 1] is the ρ-margin loss defined as (Koltchinskii & Panchenko, 2002):

Φ_ρ(t) = min(1, max(0, 1 − t/ρ)).     (8)

Table 1 summarizes the notation used throughout the paper, and the pseudo-code of the proposed two-step approach, referred to as Penalized Multiclass Semi-Supervised Learning (PMS²L) in the following, is given in Algorithm 1.

Algorithm 1: Pseudo-code of the PMS²L algorithm
Input: hypothesis space H; G the number of clusters; A_{S_u} : X → {1, . . ., G} the clustering algorithm found on S_u; κ ∈ N*; and η > 0.
Stage 1: Using the labeled examples S_ℓ, identify the κ-uniformly bounded clusters in Π_{S_u} with level η, C_κ(η)  // in accordance with Eq. (1)
Stage 2: Find a hypothesis h* ∈ H that minimizes the penalized objective function (Eq. 4)
Output: h*
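To make the two quantities entering the objective concrete, the sketch below evaluates the empirical margin loss on S_ℓ and a penalization term over the confident clusters, using the margins of Eqs. (3) and (7) and the ρ-margin loss of Eq. (8). The exact weighting of the two terms in Eqs. (4)-(6) follows the paper; here a single hypothetical trade-off weight lam is used for illustration, and all names are assumptions.

```python
import numpy as np

def rho_margin_loss(t, rho):
    """rho-margin loss Phi_rho of Eq. (8): 1 for t <= 0, 1 - t/rho on [0, rho],
    0 for t >= rho."""
    return np.clip(1.0 - np.asarray(t) / rho, 0.0, 1.0)

def penalized_objective(scores, X_lab, y_lab, X_unl, assign_unl, confident,
                        rho, lam=1.0):
    """Sketch of the penalized empirical loss minimized in the second stage.

    scores(X)  : (m, K) array of class scores h(x, y) for the rows of X.
    confident  : {cluster index: Y_kappa(C)}, as returned by the previous sketch.
    lam        : hypothetical trade-off weight between the two terms.
    """
    S = scores(X_lab)
    K = S.shape[1]
    idx = np.arange(len(y_lab))
    true = S[idx, y_lab]
    S_rest = S.copy()
    S_rest[idx, y_lab] = -np.inf
    margins = true - S_rest.max(axis=1)                # m_h(x, y), Eq. (3)
    labeled_loss = rho_margin_loss(margins, rho).mean()

    penalty, total = 0.0, 0
    U = scores(X_unl)
    for c, predominant in confident.items():           # clusters of C_kappa(eta)
        in_c = U[assign_unl == c]
        if len(in_c) == 0 or len(predominant) == K:
            continue
        others = [y for y in range(K) if y not in predominant]
        # m'_h(x, Y_kappa(C)), Eq. (7)
        m_prime = in_c[:, predominant].max(axis=1) - in_c[:, others].max(axis=1)
        penalty += rho_margin_loss(m_prime, rho).sum()
        total += len(in_c)
    penalty = penalty / max(total, 1)
    return labeled_loss + lam * penalty
```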
The algorithm shares similarities with the algorithms proposed in (Amini, Truong, & Goutte, 2008b; Urner et al., 2011), where the k-NN technique was used to increase the size of the labeled training data by pseudo-labeling unlabeled examples that are in the nearest neighborhood of labeled examples, for binary classification and bipartite ranking. In (Rigollet, 2007), another two-step semi-supervised procedure is proposed, in which the first stage produces a clustering of the feature space derived from the unlabeled data, and each unlabeled observation in a given cluster is then assigned the majority class of the labeled examples within that cluster.
In the present work we tackle a more general situation by considering multiclass classification problems and by relaxing the pseudo-labeling part, which may be too aggressive in the multiclass case. Our analysis is based on the ability of a clustering technique to capture the structure of the data and on the ability of the classifier to identify predominant classes in κ-uniformly bounded clusters, leading to a multiclass definition of the cluster assumption which states that penalization over κ-uniformly bounded clusters with a bounded confidence level η helps learning.

Theoretical study
We now analyze how the use of unlabeled training data can improve generalization performance in some cases. Essentially, the trade-off is that clustering offers additional knowledge on the problem and therefore potentially helps learning, but it can also be of low quality, which may degrade performance.

Stable clustering with the bounded difference property
Before stating the results, let us first introduce the notation used in the remainder of this section. We consider a hard clustering algorithm A_Z, defined as a function learned over a finite sample Z.
Our analysis is based on a notion of stability of the clustering algorithm A, measured by the average proportion of examples in a given sample of size n that are in the exclusive disjunction of the clusters (present in one and absent from the other) found by A over two samples Z and Z′. This distance is defined as:

∆_n(A_Z, A_{Z'}, S) = min_π (1/n) Σ_{x∈S} ½_{π(A_Z(x)) ≠ A_{Z'}(x)},     (9)

where S is a sample of size n and π : {1, . . ., G} → {1, . . ., G} ranges over permutations of the cluster indices. It is straightforward to show that ∆_n defines a true metric, sometimes referred to as the minimal matching distance (Luxburg, 2010), on the space of clusterings (see Th. 6 in the Appendix). The clustering algorithm A is then said to obey the bounded difference property if and only if, for any i.i.d. samples Z, Z′ ∼ D_X^{|Z|} differing in exactly one observation and any i.i.d. sample S ∼ D_X^n of size n, there exists a universal constant L such that the inequality of Eq. (10) holds. For some clustering algorithms, such as k-means or k-hyperplane clustering, it has been shown that the bounded difference property is tightly related to their (in)stability. We refer to (Luxburg, 2010; Luxburg, Bousquet, & Belkin, 2004; Rakhlin & Caponnetto, 2006; Thiagarajan, Ramamurthy, & Spanias, 2011) and the references therein for the algorithmic details as well as various notions of clustering instability, and to (Shamir & Tishby, 2007) for the relation between the bounded difference property, stability and model selection. Furthermore, a clustering algorithm A that obeys the bounded difference property is said to be stable if, for any distribution D_X over X, there exists a unique limit clustering of the input space Π⋆, obtained by a particular instantiation of the algorithm denoted by A⋆, such that the condition of Eq. (11) holds for any sample Z drawn i.i.d. from D_X and any sample of size n drawn i.i.d. from the same distribution. In this case, it is possible to (tightly) upper-bound the distance ∆_n(A_{S_u}, A⋆, S_ℓ) between A⋆ and the algorithm A trained on any unlabeled training set S_u, estimated over the labeled training set S_ℓ, as stated in the following lemma.
Lemma 1. Let S_ℓ = (x_i, y_i)_{1≤i≤n} and S_u = (x_{n+i})_{1≤i≤u} be labeled and unlabeled training sets drawn i.i.d. according to a probability distribution D over X × Y and its marginal D_X, respectively. For any 1 > δ > 0 and any stable clustering algorithm A that obeys the bounded difference property with constant L > 0, the average proportion of examples in S_ℓ that are in the exclusive disjunction of the clusters found by A on S_u and by A⋆ is upper-bounded with probability at least 1 − δ. The proof is given in Appendix B. This result suggests that for any labeled and unlabeled training data, if a clustering algorithm obeys the bounded difference property and is stable, then with high probability Π_{S_u} covers the labeled training data as well as the limit partition Π⋆ does (i.e. most of the labeled examples are likely to lie in the intersection Π_{S_u} ∩ Π⋆).
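For completeness, the minimal matching distance ∆_n of Eq. (9) can be computed exactly with the Hungarian algorithm; the sketch below assumes both clusterings are represented by their cluster indices on the same evaluation sample of size n, and the function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minimal_matching_distance(labels_a, labels_b, G):
    """Delta_n of Eq. (9): fraction of examples of the evaluation sample on which
    the two clusterings disagree under the best permutation of cluster indices."""
    n = len(labels_a)
    # agreement[i, j] = number of examples put in cluster i by the first
    # clustering and in cluster j by the second one
    agreement = np.zeros((G, G), dtype=int)
    for a, b in zip(labels_a, labels_b):
        agreement[a, b] += 1
    # the optimal permutation maximizes the total agreement (Hungarian algorithm)
    rows, cols = linear_sum_assignment(-agreement)
    return 1.0 - agreement[rows, cols].sum() / n
```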

Semi-supervised Data-dependent bounds
Based on the previous lemma, we can define situations where the Empirical Risk Minimization principle of algorithm PMS²L becomes consistent. This result is stated in Theorem 3, which provides bounds on the generalization error of a multiclass classifier trained with the penalized empirical loss defined above (Eq. 4).
The notion of function class capacity used in the bounds is the labeled and unlabeled Rademacher complexities of the function class F_H = {f : x ↦ h(x, y) : y ∈ Y, h ∈ H}, defined respectively as:

R_n(F_H) = E_σ [ sup_{f∈F_H} (1/n) Σ_{i=1}^{n} σ_i f(x_i) ]  and  R_u(F_H) = E_σ [ sup_{f∈F_H} (1/u) Σ_{i=1}^{u} σ_i f(x_{n+i}) ],

where the σ_i's, called Rademacher variables, are independent uniform random variables taking values in {−1, +1}; i.e. ∀i, P(σ_i = 1) = P(σ_i = −1) = 1/2. The proof of the theorem is based on the following lemma, which provides a generalization bound over the true risk of any classifier h found by algorithm PMS²L and estimated within a single confident cluster, with respect to the estimated empirical risk.

Lemma 2. Let H ⊆ R^{X×Y} be a hypothesis set where Y = {1, . . ., K}, and let S_ℓ = (x_i, y_i)_{1≤i≤n} and S_u = (x_{n+i})_{1≤i≤u} be two sets of labeled and unlabeled training data drawn i.i.d. according to a probability distribution over X × Y and the marginal distribution D_X, respectively. Fix ρ > 0 and κ ∈ {1, . . ., K}; then for any 1 > δ > 0, a multiclass classification generalization error bound holds with probability at least 1 − δ for all h ∈ H learned by Algorithm 1 over a single κ-uniformly bounded cluster C_j ∈ C_κ(η) derived from S_u by a clustering algorithm A_{S_u} that partitions the input space into G clusters.

The proof is provided in Appendix B. From this result and Lemma 1, we can then derive a data-dependent generalization bound for any semi-supervised multiclass prediction function found by algorithm PMS²L, as stated below.
Theorem 3. Let H ⊆ R^{X×Y} be a hypothesis set where Y = {1, . . ., K}, and let S_ℓ = ((x_i, y_i))_{i=1}^{n} and S_u = (x_i)_{i=n+1}^{n+u} be two sets of labeled and unlabeled training data drawn i.i.d. according to a probability distribution over X × Y and the marginal distribution D_X, respectively. Fix ρ > 0 and κ ∈ {1, . . ., K}, and consider a clustering algorithm A that obeys the bounded difference property with constant L and is stable. If the κ-uniformly bounded clusters found in Π_{S_u} are such that the confidence level η satisfies η ≤ ∆_n(A_{S_u}, A⋆, S_ℓ), then for any 1 > δ > 0 and all h ∈ H found by the PMS²L algorithm using A_{S_u}, a multiclass classification generalization error bound holds with probability at least 1 − δ.

The proof is provided in Appendix B. This result implies that, with stable clustering algorithms obeying the bounded difference property, if the proportion of classes other than the κ predominant ones in the confident clusters is less than the proportion of labeled examples in the exclusive disjunction of the limit clusters and those found using the unlabeled training data, then with the strategy defined in algorithm PMS²L we can expect to have interesting situations for learning prediction models, as stated in the following corollary.
Consider kernel-based hypotheses with K : X × X → R a PSD kernel and Φ : X → H its associated feature mapping, and define

H_B = { h ∈ R^{X×Y} : h(x, y) = ⟨W_y, Φ(x)⟩, W = (W_1, . . ., W_K), ‖W‖_{H,2} ≤ B },

where ‖W‖_{H,2} is the Frobenius norm of the parameter matrix for a linear kernel, or more generally the L_{H,2} group norm of W, defined as ‖W‖_{H,2} = (Σ_{y=1}^{K} ‖W_y‖_H²)^{1/2}. In this case, we can derive the following corollary from Theorem 3.

Corollary 4. Let K : X × X → R be a PSD kernel and let Φ : X → H be the associated feature mapping. Assume that there exists R > 0 such that K(x, x) ≤ R² for all x ∈ X. Then for any 1 > δ > 0 and under the conditions and definitions of Theorem 3, a multiclass classification error bound holds for all hypotheses h ∈ H_B learned by the proposed algorithm over the set of κ-uniformly bounded clusters C_κ(η), with probability at least 1 − δ.

Proof. From Proposition 8.1 in (Mohri, Rostamizadeh, & Talwalkar, 2012) and the Cauchy-Schwarz inequality Σ_j a_j b_j ≤ (Σ_j a_j²)^{1/2} (Σ_j b_j²)^{1/2} applied with b_j = 1 and a_j = u_η(j), ∀j, the Rademacher complexity of the class of linear classifiers in the feature space can be bounded accordingly, where u_η(j) is the number of unlabeled examples in the η-confident cluster C_j and u_η = Σ_j u_η(j) is the total number of unlabeled examples within the set of confident clusters C_κ(η),
and also R_n(F_H) ≤ 2RB√(n − n_η)/n, with n_η the number of labeled examples falling in the confident clusters. Applying the Cauchy-Schwarz inequality again, we finally obtain the stated bound. □

The non-empirical terms of this bound determine the convergence rate of the proposed penalized semi-supervised multiclass algorithm and hence, following (Vapnik, 2000, Theorem 2.1, p. 38), give insights on its consistency. These terms are better explained using orders of magnitude (Knuth, 1976). If we now consider the common situation in semi-supervised learning where u ≫ n, the convergence rate of the bound of Corollary 4 is of the order given in (15), where the Õ(·) notation hides constants and logarithmic factors (Knuth, 1976). In the following section we present an overview of the related work and show that, in the case where the clustering technique A captures the true structure of the data, as measured by the set of κ-uniformly bounded clusters with level η, the convergence rate (15) obtained for linear kernel-based hypotheses is a direct extension of the dimension-free convergence rates proposed in semi-supervised learning for the binary case.
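As a numerical illustration of the Rademacher complexity terms appearing in Corollary 4, the sketch below estimates the empirical Rademacher complexity of the norm-bounded linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B}, for which the supremum has a closed form; it is only a sanity check of the RB/√m scaling under stated assumptions, not the exact class F_H used in the proofs, and all names are illustrative.

```python
import numpy as np

def empirical_rademacher_linear(X, B, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{||w|| <= B} (1/m) sum_i sigma_i <w, x_i>
    = (B/m) E_sigma ||sum_i sigma_i x_i||, for an (m, d) sample X."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        vals.append(B * np.linalg.norm(sigma @ X) / m)
    return float(np.mean(vals))

# Example: with rows of norm at most R = 1, the estimate stays below R * B / sqrt(m).
X = np.random.default_rng(1).normal(size=(500, 20))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
print(empirical_rademacher_linear(X, B=1.0), 1.0 / np.sqrt(500))
```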
In the opposite case, n ≫ u, the clustering-based penalization over unlabeled data does not help learning and can even make the bounds worse than in the supervised case. The same situation occurs when the number of classes is comparable to the number of examples, so that one cannot assess whether a cluster is consistent or not.
Finally, we would like to emphasize that our main target is the most practical case, with u ≫ n and a number of classes comparable to the number of clusters.

Related works and discussion
Semi-supervised learning (SSL) approaches exploit the geometry of data to learn a prediction function from partially labeled training sets (Seeger, 2000). The three main families of SSL techniques, namely graphical, generative and discriminant approaches, were mostly developed for the binary case and tailored under the smoothness, low density separation and cluster assumptions (Zhu, 2005; Chapelle et al., 2006; Amini & Usunier, 2015).
Graphical approaches construct an empirical graph in which the nodes represent the training examples and the edges reflect the similarity between them. These approaches are mostly based on label spreading algorithms that propagate the class label of each labeled node to its neighbors (Zhou, Bousquet, Lal, Weston, & Schölkopf, 2003; Zhu, 2002). Generative approaches naturally exploit the geometry of data by modelling their marginal distribution. These methods are developed under the cluster assumption and use the Bayes rule to make decisions. In the seminal work of (Castelli & Cover, 1995), it is shown that, without extra assumptions relating the marginal distribution to the true distribution of labels, a sample of unlabeled data is of (almost) no help for learning purposes. The work of (Ben-David, Lu, & Pál, 2008) further investigated the limitations of semi-supervised learning and concluded that theoretical results for semi-supervised learning should be accompanied by an extra assumption on the true label distribution.
Discriminant approaches directly find the decision boundary without making any assumption on the marginal distribution of examples. The two most popular discriminant models are without doubt co-training (Blum & Mitchell, 1998) and Transductive SVMs (Vapnik, 2000). The co-training algorithm assumes that each observation is produced by two sources of information and that each view-specific representation is rich enough to learn the parameters of the associated classifier when enough labeled examples are available. The two classifiers are first trained separately on the labeled data. A subset of unlabeled examples is then randomly drawn and pseudo-labeled by each of the classifiers; the output estimated by the first classifier becomes the desired output for the second classifier, and reciprocally. Under this setting, (Leskes, 2005) proposed a Rademacher complexity bound in which unlabeled data are used to decrease the disagreement between hypotheses from a class of functions H, and proved that in some cases the bound on the excess risk |R(h) − R(h, S_ℓ)| for any h ∈ H is of the order Õ(n^{-1/2} + u^{-1/2}). Another study in this line of research is (Tolstikhin, Zhivotovskiy, & Blanchard, 2015). Transductive learning, in contrast, produces a prediction function for only a fixed set of unlabeled examples. Transductive algorithms generally use the distribution of the unsigned margins of unlabeled examples to guide the search for a prediction function, and find the hyperplane in a feature space that best separates the labeled examples while not passing through high-density regions. The notion of transductive Rademacher complexity was introduced in (El-Yaniv & Pechyony, 2009); in the best case, the excess risk bound proposed in that paper is of the order Õ(u min(u, n)/(n + u)).
Our two-step multiclass SSL approach lies in between generative and discriminant approaches, and hence bears similarity with the study of (Urner et al., 2011). The main difference, however, is that the proposed approach does not rely on any pseudo-labeling mechanism and that our analysis is based on the Rademacher complexity, leading to dimension-free data-dependent bounds. On another level, and under the PAC-Bayes setting, (Kääriäinen, 2005) showed that in the realizable case where the hypothesis set contains the Bayes classifier, the obtained excess risk bound takes the form inf_{f∈F_0} sup_{g∈F_0} d(f, g) + Õ(u^{-1/2}), where d(f, g) is a normalized empirical disagreement between two hypotheses that correctly classify the labeled set and can be of order at least Õ(n^{-1/2}). The convergence rates of the mentioned bounds are summed up in Table 2. From these results, it becomes apparent that the convergence rate deduced from Corollary 4 (Equation 15) extends those found in (Kääriäinen, 2005; Leskes, 2005) to multiclass classification.

Experimental Results
We perform experiments on six publicly available datasets. The first three, Fungus, Birds and Athletics, are aggregations of leaf nodes descending from parent nodes in the ImageNet hierarchy. Each image is characterized by a Fisher vector representation as described in (Harchaoui, Douze, Paulin, Dudík, & Malick, 2012). The three other collections are the MNIST database of handwritten digits, the pre-processed 20 Newsgroups (20-NG) collection and the USPS dataset. Table 2 summarizes the characteristics of these datasets. The proportions of training and test sets were kept fixed to those given in the released data files. Within the training set (S_ℓ ∪ S_u) we randomly sampled labeled examples S_ℓ of different sizes and used the remaining examples as unlabeled data. To validate the proposed penalized multiclass semi-supervised learning approach (PMS²L), we compared its results with a multiclass extension of a popular SSL algorithm from each of the generative, graphical and discriminant families. More precisely, we considered the extension of the label propagation algorithm to the multiclass case (McLP) proposed by (Wang, Tu, & Tsotsos, 2013), a generative SSL model based on a mixture of Gaussians (S²GM), the extension of TSVM (Vapnik, 2000) to the multiclass case (McTSVM), and a purely supervised technique that does not use any unlabeled examples in the training stage (SUP).
As the clustering algorithm A, we employed the Nearest Neighbor Clustering technique proposed in (Bubeck & Luxburg, 2009), and fixed m = 4K, κ = 2 and η = 10⁻³, meaning that each cluster in C_κ(η) is mainly composed of the two most predominant classes within it. For the second stage of PMS²L, as well as for SUP and McTSVM, we adopted the aggregated one-versus-all approach using a linear kernel SVM that respects the conditions of Corollary 4; the penalized objective function can easily be implemented using convex optimization tools with convex surrogates of the 0/1 loss. The parameter C of the SVM classifier is determined by five-fold cross-validation over a logarithmic range between 10⁻⁴ and 10⁴ on the available labeled training data. Results are evaluated on the test set using accuracy, and the reported performance is averaged over 25 random (labeled/unlabeled/test) splits of the initial collections.
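For reference, the purely supervised part of this setup (the aggregated one-versus-all linear SVM with its C parameter selected by five-fold cross-validation on a logarithmic grid) can be sketched as follows with scikit-learn; this corresponds to the SUP baseline and to the base learner of the second stage, not to the full penalized objective of Eq. (4), and the function name is illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_linear_ova_svm(X_lab, y_lab):
    """One-versus-all linear SVM with C chosen by five-fold cross-validation
    on a logarithmic grid between 1e-4 and 1e4, as in the experimental setup."""
    grid = {"estimator__C": np.logspace(-4, 4, 9)}
    search = GridSearchCV(OneVsRestClassifier(LinearSVC()), grid, cv=5)
    search.fit(X_lab, y_lab)
    return search.best_estimator_
```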
Table 3 summarizes the results obtained by SUP, PMS²L, McLP, S²GM and McTSVM when a very small proportion of labeled training data is used to learn the models. We use boldface to indicate the highest performance rates, and the symbol ↓ indicates that a performance is significantly worse than the best result according to a Wilcoxon rank sum test at a p-value threshold of 0.05 (Lehmann, 1975). From these results it becomes clear that:
- The algorithm PMS²L performs significantly better than the four other algorithms, and it improves over SUP by an average of 1.5 to 6.5% on the different datasets.
- McLP and McTSVM also perform better than SUP, though not to the same extent, while the mixture of Gaussians S²GM does worse than SUP, especially when the dimension of the problem is high.
- Finally, the difference in performance between PMS²L and McTSVM is smaller than the one between the former and McLP.
Our analysis of these results is that the Nearest Neighbor Clustering technique (Bubeck & Luxburg, 2009) is effectively able to map the considered data into homogeneous clusters containing mostly unlabeled examples of the same classes as the κ = 2 most predominant classes within them. In this case, the penalization term of the objective function used to learn the classifier (Equation 4) helps to pick a better hypothesis in the set of linear classifiers than when only labeled training data are used. Hence, for unlabeled examples within a given cluster, the constraint of predicting the same classes as the κ = 2 most predominant classes of that cluster forces the decision boundary to pass through regions where the unsigned margins of unlabeled examples are small. As stated in Section 4, this is exactly how TSVM works, and the proximity of the results of McTSVM and PMS²L, compared to the two other SSL algorithms, can be explained by the similarity of the assumptions underlying these models. However, the fundamental difference between the two algorithms, namely the iterative pseudo-labeling of unlabeled examples (or not), means that when the proportion of labeled training data is small, the iterative pseudo-labeling steps of McTSVM inject noise into the learning process at the same level as, or even more than, the true labeled information. The question therefore arises as to how these two techniques behave when more labeled training data are available at the learning phase.
To analyze this situation more finely, we compared SUP, PMS²L and McTSVM for increasing sizes of the labeled training set. Figure 2 illustrates this by showing the accuracy (in percentage) with respect to the number of labeled examples in the initial labeled training set S_ℓ. The main observations drawn from these results are:
- As expected, all performance curves increase monotonically with the additional labeled data and converge to the same performance. We note that when all the labeled training data are used for learning, the linear SVM gives the same results as those reported in the state of the art (e.g. the MLP model with no hidden layer on USPS (LeCun, Bottou, Bengio, & Haffner, 2001) and (Maji & Malik, 2009)).
- Though McTSVM takes advantage of unlabeled data in its learning process, it is outperformed by PMS²L.
- On ImageNet Birds and MNIST, a non-negligible quantity of labeled examples is necessary for SUP to match the performance of PMS²L learned with the same proportion of labeled data as in Table 3, together with the remaining unlabeled training data.
This behaviour first suggests that when enough labeled data are available, unlabeled data no longer help the learning algorithm, in contrast to the reverse situation. These results also suggest that for SSL discriminant techniques designed under the low density separation hypothesis, a more suitable approach than the pseudo-labeling strategy used in most of these techniques would be to incorporate a penalization term over unlabeled examples into the objective of the learning algorithm, such as the one proposed in Equation 4.

Conclusion
The contributions of this paper are twofold. First, we proposed a bound on the risk of a multiclass classifier trained over partially labeled training data. We derived data-dependent bounds on the generalization error of a classifier trained by minimizing an objective function that consists of an empirical risk term, estimated over the labeled training set, and a penalization term corresponding, within the κ-uniformly bounded set of clusters, to the proportion of unlabeled examples of each cluster whose predicted class does not belong to the set of the associated κ predominant classes. The analysis of this bound for kernel-based hypotheses reveals a convergence rate that extends, to the multiclass case, other rates on the excess risk proposed in the literature. Empirical results on various datasets support our findings by showing that the proposed algorithm is competitive with different extensions of binary semi-supervised learning algorithms and that it can significantly increase classification performance in the most interesting situation, when few labeled data are available for training.
Appendix B: Proofs

Proof of Lemma 1. Inequality (Eq. 17) is due to the triangle inequality with absolute values, and (Eq. 18) results from (Eq. 16) and the bounded difference property of algorithm A (Eq. 10).
Then by McDiarmid's inequality (Appendix, Th. 5), for any ε > 0 we obtain the concentration of ∆_n(A_{S_u}, A⋆, S_ℓ) around its expectation over S_u. Setting the right-hand side to δ/2 and solving for ε, we obtain that with probability at least 1 − δ/2 the deviation bound (Eq. 19) holds, where the last inequality is due to the stability of the clustering algorithm A (Eq. 11). Furthermore, we bound φ(S_u) = E_{S_ℓ∼D^n}[∆_n(A_{S_u}, A⋆, S_ℓ)] in terms of S_ℓ, using again McDiarmid's inequality for any ε > 0. Indeed, if we consider the multivariate function ψ : S_ℓ ↦ ∆_n(A_{S_u}, A⋆, S_ℓ), changing a single labeled observation in S_ℓ cannot change ∆_n(A_{S_u}, A⋆, S_ℓ) by more than 1/n by definition (Eq. 9). Hence, by setting the right-hand side to δ/2 and solving for ε, we obtain that with probability greater than 1 − δ/2 the deviation bound (Eq. 20) holds. Applying the union bound to both inequalities (Eq. 19) and (Eq. 20), we finally obtain the statement of the lemma for any labeled and unlabeled training sets S_ℓ and S_u, with probability at least 1 − δ. □

Lemma 2. Let H ⊆ R^{X×Y} be a hypothesis set where Y = {1, . . ., K}, and let S_ℓ = (x_i, y_i)_{1≤i≤n} and S_u = (x_{n+i})_{1≤i≤u} be two sets of labeled and unlabeled training data drawn i.i.d. according to a probability distribution over X × Y and the marginal distribution D_X, respectively. Fix ρ > 0 and κ ∈ {1, . . ., K}; then for any 1 > δ > 0, the multiclass classification generalization error bound of the lemma holds with probability at least 1 − δ for all h ∈ H learned by Algorithm 1 over a single κ-uniformly bounded cluster C_j ∈ C_κ(η) derived from S_u by a clustering algorithm A_{S_u} that partitions the input space into G clusters.

Proof. We start with the decomposition of the risk estimated over a single κ-uniformly bounded cluster C_j ∈ C_κ(η), by considering the two situations where the prediction µ_h(x) = argmax_{y∈Y} h(x, y) falls within the set of confident clusters and outside of it, respectively (Eq. 21). The first term in this inequality involves the margin of examples and can be upper-bounded using the definition of the ρ-margin loss (Eq. 8) estimated over the labeled examples that are in cluster C_j, where m_h(x, y, Y'_κ) = h(x, y) − max_{y'∈Y'_κ\{y}} h(x, y') for x ∈ C_j. The expected risk over a single cluster C_j can be decomposed through the conditional risk. From the data-dependent Bennett inequality (Appendix A, Theorem 7), we have with probability at least 1 − δ/4 the deviation bound (Eq. 23), where n_η(j) = |S_ℓ ∩ C_j| and the sample variance is upper-bounded as in (Eq. 24); from (23) and (24) we then have (Eq. 25) with probability at least 1 − δ/4. Further, the ρ-margin loss Φ_ρ(·) (Eq. 8) is 1/ρ-Lipschitz, so from the multiclass classification generalization bound proposed in (Lei et al., 2015) (Appendix A, Theorem 11), it follows that for any fixed set Y'_κ ⊂ Y with |Y'_κ| ≤ κ and any 1 > δ > 0, the bound (Eq. 26), involving the empirical Rademacher complexity over S_ℓ ∩ C_j, holds for all h ∈ H with probability at least 1 − δ/(4K^κ). Now, considering every possible set of κ predominant classes Y_κ in C_j, and using the union bound together with the inequality Σ_{i=1}^{κ} C(K, i) ≤ 2K^κ, it follows from (25) and (26) that with probability at least 1 − δ/2 the bound (Eq. 27) holds. By decomposing the sum in the first term of this inequality, and considering the two cases where the class label y is within or outside Y_κ, we obtain a corresponding bound for any sample S_ℓ and any set of predominant classes Y_κ; from definition (1) we have (1/n) Σ_{(x,y)∈S_ℓ} ½_{y∉Y_κ ∧ x∈C_j} ≤ η/G, and hence (Eq. 28). Further, the second term in inequality (21), for any set Y_κ ⊂ Y with |Y_κ| ≤ κ, can be upper-bounded using the unlabeled data that are in cluster C_j, where
m'_h(x, Y_κ) = max_{y∈Y_κ(C_j)} h(x, y) − max_{y∈Y\Y_κ(C_j)} h(x, y) for x ∈ C_j. As the ρ-margin loss takes its values in [0, 1], from the standard Rademacher complexity bound (Appendix A, Theorem 9) over the i.i.d. sample S_u ∩ C_j, for any 1 > δ > 0 and Y_κ ⊆ Y the bound (Eq. 29) holds with probability at least 1 − δ/4. Due to the monotonicity of the supremum, we have the inequality (Eq. 30) for any C_j ∈ C_κ(η), and by Lemma 8 (Appendix A) we obtain the corresponding complexity bound. Similarly to (25), we have (Eq. 31) with probability at least 1 − δ/4. Thus, by (30) and (31) and the union bound, we have (Eq. 32) with probability at least 1 − δ/2. The statement of the lemma follows from the inequalities (21), (28), (32) and the union bound. □
Theorem 3. Let H ⊆ R^{X×Y} be a hypothesis set where Y = {1, . . ., K}, and let S_ℓ = ((x_i, y_i))_{i=1}^{n} and S_u = (x_i)_{i=n+1}^{n+u} be two sets of labeled and unlabeled training data drawn i.i.d. according to a probability distribution over X × Y and the marginal distribution D_X, respectively. Fix ρ > 0 and κ ∈ {1, . . ., K}, and consider a clustering algorithm A that obeys the bounded difference property with constant L and is stable. If the κ-uniformly bounded clusters found in Π_{S_u} are such that the confidence level η satisfies η ≤ ∆_n(A_{S_u}, A⋆, S_ℓ), then for any 1 > δ > 0 and all h ∈ H found by the PMS²L algorithm using A_{S_u}, the multiclass classification generalization error bound of the theorem holds with probability at least 1 − δ.

Proof. By the Cauchy-Schwarz inequality, and then by fixing b_i = 1 for all i ∈ {1, . . ., G}, we can bound the last two terms on the right-hand side of the inequality. The result then follows from the inequalities (a + b + c)² ≤ 3(a² + b² + c²) for all a, b, c > 0, 5√3 < 9, (33), (34), (35), (37) and the union bound. □
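For concreteness, the McDiarmid step invoked in the proof of Lemma 1 for the function ψ : S_ℓ ↦ ∆_n(A_{S_u}, A⋆, S_ℓ), whose bounded differences are at most 1/n, takes the following form; this is an illustrative instance only, and the analogous step over S_u uses the constant from the bounded difference property of A.

```latex
\Pr\Big(\psi(S_\ell) - \mathbb{E}_{S_\ell}[\psi(S_\ell)] \ge \epsilon\Big)
  \;\le\; \exp\!\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n}(1/n)^2}\right)
  \;=\; \exp\!\left(-2 n \epsilon^2\right),
\qquad
\exp(-2n\epsilon^2)=\tfrac{\delta}{2}
  \;\iff\;
  \epsilon=\sqrt{\frac{\ln(2/\delta)}{2n}} .
```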

Figure 2: Accuracy in percentage with respect to the proportion of labeled examples in the initial training set for ImageNet Birds (a), Athletics (b), Fungus (c), 20-NG (d), MNIST (e), and USPS (f). Each reported performance on the test set is averaged over 25 random (labeled/unlabeled/test) splits of the initial collections.

Table 2: Summary of the convergence rates of dimension-free bounds on the excess risk for different SSL approaches: binary (Kääriäinen, 2005), binary (Leskes, 2005), binary (Balcan & Blum, 2010), and multiclass (Corollary 4).

Table 2: Characteristics of the datasets used in our experiments. Columns: dataset, size of the training set |S_ℓ ∪ S_u|, size of the test set, dimension d, and number of classes K.