An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory

In this paper, the problem of one-bit compressed sensing (OBCS) is formulated as a problem in probably approximately correct (PAC) learning. It is shown that the Vapnik-Chervonenkis (VC-) dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is bounded below by $k \lg (n/k)$ and above by $2k \lg (n/k)$, plus some round-off terms. By coupling this estimate with well-established results in PAC learning theory, we show that a consistent algorithm can recover a $k$-sparse vector with $O(k \lg (n/k))$ measurements, given only the signs of the measurement vector. This result holds for \textit{all} probability measures on $\mathbb{R}^n$. It is further shown that random sign-flipping errors result only in an increase in the constant in the $O(k \lg (n/k))$ estimate. Because constructing a consistent algorithm is not straightforward, we present a heuristic based on the $\ell_1$-norm support vector machine, and illustrate that its computational performance is superior to a currently popular method.


Introduction
The field of "compressed sensing" has become very popular in recent years, with an explosion in the number of papers. Stated briefly, the core problem in compressed sensing is to recover a high-dimensional sparse (or nearly sparse) vector $x$ from a small number of measurements of $x$. In the traditional problem formulation, the measurements are linear, consisting of $m$ real numbers $y_i = \langle a_i, x \rangle$, $i = 1, \ldots, m$, where the measurement vectors $a_i \in \mathbb{R}^n$ are chosen by the learner. More recently, attention has focused on so-called one-bit compressed sensing, referred to hereafter as OBCS in the interest of brevity. In OBCS the measurements consist not of the inner products $\langle a_i, x \rangle$, but rather just the signs of these inner products, i.e., a "one-bit" representation of these measurements. In much of the OBCS literature, the vectors $a_i$ are chosen at random from some specified probability distribution, often Gaussian. Because the sign of $\langle a_i, x \rangle$ is unchanged if $x$ is replaced by any positive multiple of $x$, it is obvious that under this model one can at best aspire to recover the unknown vector $x$ only to within a positive multiple, or equivalently, to recover the normalized vector $x/\|x\|_2$. This limitation can be overcome by choosing the measurements to consist of the signs of inner products $\langle a_i, x \rangle + b_i$, where again $a_i, b_i$ are selected at random.
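As a concrete illustration of this measurement model, the following sketch (in Python, with illustrative parameter values of our own choosing) generates a $k$-sparse vector together with both linear and affine one-bit measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 100, 5, 40          # ambient dimension, sparsity, number of measurements

# A k-sparse unknown vector x (support and values chosen arbitrarily here).
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)

# Random measurement vectors (a_i, b_i); Gaussian is a common choice.
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# One-bit measurements: only the signs are observed.
y_linear = np.sign(A @ x)          # sign(<a_i, x>): invariant to positive scaling of x
y_affine = np.sign(A @ x + b)      # sign(<a_i, x> + b_i): breaks the scale ambiguity
```

Replacing `x` by `2 * x` leaves `y_linear` unchanged, which is exactly the scale ambiguity described above; the affine measurements do not share this invariance.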
The current status of OBCS is that while several algorithms have been proposed, theoretical analysis is available for only a few. Moreover, in cases where theoretical analysis is available, the sample complexity is extremely high. In the present paper, we interpret OBCS as a problem in probably approximately correct (PAC) learning theory, which is a well-established branch of statistical learning theory. By doing so, we are able to draw from the wealth of results that are already available, and thereby address many of the currently outstanding issues. In PAC learning theory, a central role is played by the so-called Vapnik-Chervonenkis, or VC-, dimension of the collection of concepts to be learned. The principal result of the present paper is that the VC-dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is bounded below by $k \lg(n/k)$ and above by $2k \lg(n/k)$, plus some round-off terms. Using this bound, we are able to establish the following results for the case where $x \in \mathbb{R}^n$ has no more than $k$ nonzero components:
• In principle, OBCS is possible whenever the measurement vector $(a_i, b_i)$ is drawn at random from any arbitrary probability distribution. Moreover, if a consistent algorithm can be devised, then the number of measurements is $O(k \ln(n/k))$.
• There is also a lower bound on the OBCS problem. Specifically, there exists a probability distribution on $\mathbb{R}^n$ such that, if $(a_i, b_i)$ is drawn at random from this distribution, then the number of measurements required to learn each $k$-sparse $n$-dimensional vector is bounded below by $\Omega(k \ln(n/k))$.
• In view of the intractability of constructing a consistent algorithm for this problem, an algorithm based on the $\ell_1$-norm support vector machine is proposed for recovering the unknown sparse vector $x$ approximately. The algorithm is evaluated on a test problem, and it is shown that it performs better than a currently popular method.
• It is shown that PAC learning with finite VC-dimension is robust under random flipping of labels, even when the flipping probability is not known. Thus, OBCS is still possible in the case where the actual information available to the learner consists of the sign of $\langle a_i, x \rangle$ passed through a binary symmetric channel that flips 0 to 1 and vice versa with some probability $\alpha < 0.5$. Moreover, the number of measurements required is still $O(k \ln(n/k))$, but with a larger constant under the $O$ symbol.
• If the samples $a_i$ are not independent, but are $\beta$-mixing, learning is still possible, and explicit estimates are available for the rate of learning.
The paper is organized as follows: In Section 2, a brief review is given of some recent papers in OBCS. In Section 3, some parts of PAC learning theory that are relevant to OBCS are reviewed. In particular, it is shown how OBCS can be formulated as a problem in PAC learning, so that OBCS can be addressed by finding upper bounds on the VC-dimension of half-spaces generated by $k$-sparse vectors. In Section 4, both upper and lower bounds are derived for the VC-dimension of half-spaces generated by $k$-sparse vectors. In Section 5, the standard results in PAC learning theory, namely that concept classes with finite VC-dimension are PAC learnable, are extended to the case where measurements are noisy. While some such results are known in the literature, they require the probability of mislabelling to be known; no such assumption is made here. In Section 6, we first present a conceptual algorithm arising from applying PAC learning theory to OBCS. However, since this algorithm is not computationally feasible, we then present a tractable algorithm based on the $\ell_1$-norm support vector machine. A numerical example is presented in Section 7, where it is shown that our suggested algorithm performs better than a currently popular method. Finally, Section 8 contains a discussion of some issues that merit further investigation.

Brief Review of One-Bit Compressed Sensing
By now there is a substantial literature regarding the traditional compressed sensing formulation, out of which only a few references are cited here in the interests of brevity. Book-length treatments of compressed sensing can be found in [1,2,3,4]. Amongst these, [2] contains a thorough discussion of virtually every aspect of compressed sensing theory. A recent volume [5] is a compendium of articles on a variety of topics. The first paper in this volume [6] is a survey of the basic results in compressed sensing. Another paper [7] provides a very general framework for sparse regression that can be used, among other things, to analyze compressed sensing algorithms. Each of these papers contains an extensive bibliography.
Throughout this paper, $n$ denotes some fixed and large integer. For $x \in \mathbb{R}^n$, let $\mathrm{supp}(x)$ denote the support of the vector, and let $\Sigma_k$ denote the set of $k$-sparse vectors in $\mathbb{R}^n$; that is,
$$\mathrm{supp}(x) := \{ i : x_i \neq 0 \}, \quad \Sigma_k := \{ x \in \mathbb{R}^n : |\mathrm{supp}(x)| \leq k \},$$
where both $n$ and $k$ are known integers with $k \ll n$. The basic problem in compressed sensing is to design an $m \times n$ matrix $A$ where $m \ll n$, together with a decoder map $\Delta : \mathbb{R}^m \to \mathbb{R}^n$, such that $\Delta(Ax) = x$ for all $x \in \Sigma_k$; that is, $x$ can be recovered exactly from the $m$-dimensional vector of linear measurements $y = Ax$. Variations of the problem include the case where $x$ is only "nearly sparse," and/or $y = Ax + \eta$ where $\eta$ is a measurement noise. By far the most popular method for recovering a sparse vector is $\ell_1$-norm minimization. If $y = Ax + \eta$, the approach is to define
$$\Delta(y) := \arg\min_{z \in \mathbb{R}^n} \|z\|_1 \; \text{ subject to } \; \|y - Az\|_2 \leq \epsilon, \quad (1)$$
where $\epsilon$ is a known upper bound on $\|\eta\|_2$. In a series of papers by Candès, Tao, Donoho, and others, it is demonstrated that if the matrix $A$ is chosen so as to satisfy the so-called Restricted Isometry Property (RIP), then the decoder $\Delta$ defined in (1) produces a good approximation to $x$, and recovers it exactly if $x \in \Sigma_k$ and $\eta = 0$. See for example [8,9,10,11], as well as the survey paper [6] and the comprehensive book [2]. Moreover, it is shown in [8] that if the elements of $A$ are samples of independent and identically distributed (i.i.d.) normal random variables (denoted by $a_{ij} \sim \mathcal{N}(0,1)$), then with high probability the resulting normalized matrix $(1/\sqrt{m})A$ satisfies the RIP.
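The decoder in (1) can be computed by linear programming. The sketch below treats the noiseless case ($\epsilon = 0$, so the constraint becomes $Az = y$), using the standard reformulation with auxiliary variables $t_j \geq |z_j|$; the solver choice (`scipy.optimize.linprog`) and parameter values are ours:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, k, m = 30, 2, 15
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = [1.0, -2.0]
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x                           # noiseless linear measurements

# min ||z||_1  s.t.  Az = y, written as an LP in (z, t) with |z_j| <= t_j:
#   min sum(t)   s.t.   z - t <= 0,  -z - t <= 0,  Az = y
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.block([[I, -I], [-I, -I]])
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * n + [(0, None)] * n)
z = res.x[:n]                       # the l1-minimal solution consistent with y
```

Since the true $x$ is itself feasible, the recovered `z` always satisfies $\|z\|_1 \leq \|x\|_1$; exact recovery then holds under RIP-type conditions on $A$.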
The remainder of the section is devoted to a discussion of the one-bit compressed sensing (OBCS) problem. One-bit compressed sensing is introduced in [12]. In that paper, it is assumed that the measurement $y_i$ equals the bipolar quantity $y_i = \mathrm{sign}(\langle a_i, x \rangle)$, as opposed to the real number $\langle a_i, x \rangle$. Because the measurements remain invariant if $x$ is replaced by any positive multiple of $x$, there is no loss of generality in assuming that $\|x\|_2 = 1$. A greedy algorithm called "renormalized fixed point iteration" is introduced, in which an $\ell_1$-norm objective plus a regularizing penalty on sign violations is minimized over the unit sphere. The optimization problem is non-convex due to the constraint $\|z\|_2 = 1$. Only simulations are provided, but no theoretical results. In [13], the focus is on recovering the support set of the unknown vector $x$ from noise-corrupted measurements of the form $\mathrm{sign}(\langle a_i, x \rangle + \eta_i)$, where the noise vector $\eta$ consists of pairwise independent Gaussian signals. A nonadaptive algorithm is presented that makes use of Hoeffding's inequality applied to the expected value of the covariance of the signs of two Gaussian random variables. An adaptive algorithm is also presented. In [14], a new greedy algorithm called "matched signed pursuit" is presented. The optimization problem is not convex; as a result there are no theoretical results. The algorithm is similar to the CoSaMP algorithm for the conventional compressed sensing problem [15].
In [16], one begins with a constant $\epsilon_{\mathrm{opt}}$ that satisfies a suitable inequality involving $k$, $m$, and $n$, where $e$ denotes the base of the natural logarithm. Then the following result is shown: Let $A \sim \mathcal{N}^{m \times n}(0,1)$ consist of $mn$ pairwise independent normal random variables, and let $y_i = \mathrm{sign}(\langle a_i, x \rangle)$. Fix $\epsilon > 0$ and $\delta \in (0,1)$; then, provided $m$ is sufficiently large, for every pair $x, s \in \Sigma_k$,
$$\mathrm{sign}(Ax) = \mathrm{sign}(As) \implies \|x - s\|_2 \leq \epsilon,$$
with probability at least $1 - \delta$. In words, this result means that if we can find a $k$-sparse vector $s$ that is consistent with the observation vector $y$, then $s$ is close to $x$. In fact, $s$ can be made as close to $x$ as desired by increasing the number of measurements $m$. Unfortunately, this result is not practical, because finding such a vector $s$ is equivalent to finding a minimal $\ell_0$-norm solution consistent with the observations, which is known to be an NP-hard problem [17].
In [18], the authors focus on vectors $x \in \mathbb{R}^n$ that satisfy an inequality of the form $\|x\|_1 / \|x\|_2 \leq \sqrt{s}$. Note that if $x$ is $s$-sparse, then it satisfies the above inequality, though of course the converse is not true. Thus they use the ratio $\|x\|_1 / \|x\|_2$ as a proxy for $\|x\|_0$. They choose measurement vectors $a_i \in \mathbb{R}^n$ according to the Gaussian distribution, or more generally, any radially invariant distribution; this means that, under the chosen probability distribution on the vector $a \in \mathbb{R}^n$, the normalized vector $a / \|a\|_2$ is uniformly distributed on the sphere $S^{n-1} \subseteq \mathbb{R}^n$. With these randomly generated measurement vectors, the measured quantities are $y_i = \mathrm{sign}(\langle a_i, x \rangle)$. The authors propose to estimate $x$ via the convex optimization problem (2), where $m$ is the number of measurements. They show that if $m$ satisfies a lower bound of the form (3) for some universal constant $C$, then with probability $\geq 1 - \exp(-c\delta m)$, where $c$ is another universal constant, the resulting estimate is close to the normalized vector $x / \|x\|_2$ for all $x \in \mathbb{R}^n$ such that $\|x\|_1 / \|x\|_2 \leq \sqrt{s}$. Although this is the first proposed convex algorithm to recover $x$, the number of measurements $m$ is $O(\delta^{-5})$. From (3) we see that if we are able to carry out the $\ell_0$-norm minimization, then $m$ is $O(\delta^{-1})$. It is still an open question whether or not a practical algorithm can achieve this optimal dependence on $\delta$ in (3).
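This is not the algorithm of [18], but a closely related back-of-the-envelope estimator illustrates why one-bit measurements carry directional information: for standard Gaussian $a$ and unit $x$, $E[a \cdot \mathrm{sign}(\langle a, x \rangle)] = \sqrt{2/\pi}\, x$, so averaging $y_i a_i$ and hard-thresholding to the $k$ largest entries gives a crude sparse estimate of the direction. A sketch, with parameters of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 20, 3, 500
x = np.zeros(n)
x[:k] = [3.0, -2.0, 1.0]
x /= np.linalg.norm(x)              # only the direction x/||x||_2 is identifiable

A = rng.standard_normal((m, n))
y = np.sign(A @ x)

# Correlation estimate: for Gaussian a, E[a * sign(<a, x>)] = sqrt(2/pi) * x,
# so the empirical average points (approximately) along the unknown direction.
xhat = A.T @ y / m

# Enforce sparsity by keeping the k largest entries, then renormalize.
idx = np.argsort(-np.abs(xhat))[:k]
xk = np.zeros(n)
xk[idx] = xhat[idx]
xk /= np.linalg.norm(xk)

cosine = float(xk @ x)              # alignment with the true direction
```

This heuristic has no optimality guarantees; it merely shows that the signs alone already determine the direction of $x$ up to statistical error of order $m^{-1/2}$.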
In [19] the theory is extended to non-Gaussian noise signals that are sub-Gaussian. In [20], it is assumed that the measurements could be noisy, with $E[y_i \mid a_i] = \theta(\langle a_i, x \rangle)$, where $\theta : \mathbb{R} \to [-1, 1]$ is an unknown function. If $\theta(\alpha) = \tanh(\alpha/2)$, then the problem is one of logistic regression, whereas if $\theta(\alpha) = \mathrm{sign}(\alpha)$, then the problem becomes OBCS. A probabilistic approach is proposed, which has the advantage that the resulting optimization problem is convex. However, the disadvantage is that the number of measurements $m$ is $O(\delta^{-6})$, where $\delta$ is the probability that the algorithm may fail. The large negative exponent of $\delta$ makes the algorithm somewhat impractical. In all of the papers discussed until now, the measurement $y_i$ equals $\mathrm{sign}(\langle a_i, x \rangle)$ for suitably generated random vectors $a_i$. As mentioned above, with such a set of measurements one can at best aspire to recover only the normalized unknown vector $x/\|x\|_2$. In [21], it is proposed to overcome this limitation by changing the linear measurements to affine measurements. Specifically, the measurements in [21] are of the form $y_i = \mathrm{sign}(\langle a_i, x \rangle + b_i)$, where $a_{ij} \sim \mathcal{N}(0,1)$ and $b_i \sim \mathcal{N}(0, \tau^2)$, where $\tau$ is some specified constant. If a prior upper bound $R$ for $\|x\|_2$ is available, then it is possible to choose $\tau = R$. Then the optimization problem in (2) is modified to (4). It is evident that the above formulation is similar to the formulation in [18] applied to the augmented vector $(x, v) \in \mathbb{R}^{n+1}$. The following result is shown in [21, Theorem 4]: for all vectors $x \in \mathbb{R}^n$ with $\|x\|_1 / \|x\|_2 \leq \sqrt{s}$, the solution $(\hat{x}, \hat{v})$ to the optimization problem in (4) satisfies an error bound, with a probability exceeding a threshold involving universal constants $C$ and $c$. If $R$ is a known prior upper bound for $\|x\|_2$, then one can choose $\tau = R$ in the above, in which case the bound simplifies to $\|(\hat{x}/\hat{v}) - x\|_2 \leq 4\sqrt{2}\,\alpha$.

Preliminaries
In this section we present some preliminary results; the main results are presented in the next section. As shown below, the one-bit compressed sensing (OBCS) problem can be naturally formulated as a problem in probably approximately correct (PAC) learning. In fact, several of the approaches proposed thus far for solving the OBCS problem are similar to existing methods in PAC learning, but do not take full advantage of the power and generality of PAC learning theory. Some of the things that "come for free" in PAC learning theory are: explicit estimates for the number of measurements $m$; ready extension to the case where successive measurement vectors $a_i$ are not independent but form a $\beta$-mixing process; and ready extension to the case of noisy measurements. However, the PAC learning approach does not readily lend itself to the formulation of efficiently computable algorithms. This issue is addressed in Section 6.

Brief Introduction to the PAC Learning Problem
In this subsection, we give a brief introduction to PAC learning theory. Probably approximately correct (PAC) learning theory can be said to have originated with the paper [22]. By now the fundamentals of PAC learning theory are well developed, and several book-length treatments are available, including [23,24,25,26]. The theory encompasses a wide variety of learning situations. However, OBCS is aligned closely with the most basic version of PAC learning, known as concept learning, which is formally described next.
The concept learning problem formulation includes the following "ingredients":
• An underlying set $X$.
• A σ-algebra $\mathcal{S}$ of subsets of $X$.
• A collection $\mathcal{C} \subseteq \mathcal{S}$, known as the "concept class."
• A family of probability measures $\mathcal{P}$ on $X$.
Usually $X$ is a metric space and $\mathcal{S}$ is the Borel σ-algebra on $X$. The family of probability measures $\mathcal{P}$ can range from a singleton $\{P\}$ to $\mathcal{P}^*$, the set of all probability measures on the set $X$. If $\mathcal{P}$ is a singleton $\{P\}$, then the problem is known as "fixed-distribution" learning, whereas if $\mathcal{P} = \mathcal{P}^*$, then the problem is known as "distribution-free" learning.
Learning takes place as follows: A fixed but unknown set $T \in \mathcal{C}$, known as the "target concept," is chosen. If $\mathcal{P}$ consists of more than one probability measure, a fixed but unknown probability measure $P \in \mathcal{P}$ is also chosen. Then random samples $\{c_1, c_2, \ldots\}$ are generated independently in accordance with the chosen distribution $P$. This is the basic version of PAC learning studied in [23,24,25]. The case where the sample sequence can exhibit dependence, for example if the samples come from a Markov process with stationary distribution $P$, is studied in [26]. Using the sample $c_i$, an "oracle" generates a "label" $y_i \in \{0, 1\}$. In the case of noise-free measurements, $y_i = I_T(c_i)$, where $T$ is the fixed but unknown target concept, and $I_T(\cdot)$ denotes the indicator function of the set $T$. Thus the oracle informs the learner whether or not the training sample $c_i$ belongs to the unknown target concept $T$. After $m$ such samples are drawn and labelled, the learner makes use of the set of labelled samples $\{(c_i, y_i)\}_{i=1}^m \in (X \times \{0,1\})^m$ to generate a "hypothesis," or an approximation to the unknown target concept $T$. The case where the label $y_i$ is a noisy version of $I_T(c_i)$ is studied in Section 5.
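The sampling-and-oracle protocol just described can be sketched as follows, with a half-space target concept and a Gaussian distribution $P$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 200

# Fixed but unknown target concept: the half-space T = {c : <c, x_star> >= 0}.
x_star = rng.standard_normal(n)

# Samples drawn i.i.d. from the chosen distribution P (here: standard Gaussian).
c = rng.standard_normal((m, n))

# The oracle returns labels y_i = I_T(c_i) in {0, 1}.
y = (c @ x_star >= 0).astype(int)

# Any candidate hypothesis G is another half-space; its disagreement rate on a
# fresh test set estimates the generalization error J(T, G) = P(T symdiff G).
x_hyp = x_star + 0.1 * rng.standard_normal(n)
c_test = rng.standard_normal((1000, n))
err = np.mean((c_test @ x_star >= 0) != (c_test @ x_hyp >= 0))
```

The variable names (`x_star`, `x_hyp`) are ours; the point is only that the learner sees the pairs $(c_i, y_i)$, never $T$ itself.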
In statistical learning theory, an "algorithm" is any indexed collection of maps $\{A_m\}_{m \geq 1}$, where $A_m : (X \times \{0,1\})^m \to \mathcal{C}$. In other words, an algorithm is any systematic procedure for taking a finite sequence of labelled samples and returning an element of the concept class $\mathcal{C}$. The issue of whether $A_m$ is efficiently computable is ignored in statistical learning theory. The concept $G_m(T; \mathbf{c}) := A_m((c_1, y_1), \ldots, (c_m, y_m))$ is called the "hypothesis" generated by the first $m$ samples when the sample sequence is $\mathbf{c} = (c_1, \ldots, c_m)$ and the label sequence is $\mathbf{y} = (y_1, \ldots, y_m)$. In the interests of reducing clutter, we will use $G_m$ in place of $G_m(T; \mathbf{c})$ unless the full form is needed for clarity. Note that $A_m$ is a deterministic map, but $G_m$ is random because it depends on the random learning sequence $\{c_i\}$.
To measure how well the hypothesis $G_m$ approximates the unknown target concept $T$, we use the generalization error defined by
$$J(T, G_m) := E_{x \sim P}\left[ \, |I_T(x) - I_{G_m}(x)| \, \right]. \quad (5)$$
Thus $J(T, G_m)$ is the expected value of the difference between the indicator function $I_T(\cdot)$ and the indicator function $I_{G_m}(\cdot)$ of the hypothesis. Note that both $I_T$ and $I_{G_m}$ assume values in $\{0, 1\}$. Hence $J(T, G_m)$ is also the probability that, when a random test sample $x \in X$ is generated in accordance with the probability distribution $P$, the sample is misclassified by the hypothesis $G_m$, in the sense that $I_T(x) \neq I_{G_m}(x)$.
The key quantity in PAC learning theory is the learning rate, defined by
$$r(m, \epsilon) := \sup_{P \in \mathcal{P}} \, \sup_{T \in \mathcal{C}} \, P^m \{ \mathbf{c} \in X^m : J(T, G_m(T; \mathbf{c})) > \epsilon \}.$$
Therefore $r(m, \epsilon)$ is the worst-case measure, over all probability distributions in $\mathcal{P}$ and all target concepts in $\mathcal{C}$, of the set of "bad" samples $\mathbf{c} = (c_1, \ldots, c_m)$ for which the corresponding hypothesis $G_m$ has a generalization error larger than a prespecified threshold $\epsilon$. Thus, after $m$ samples are generated together with their labels, and the hypothesis $G_m$ is generated using the algorithm, it can be asserted with confidence $1 - r(m, \epsilon)$ that $G_m$ will correctly classify the next randomly generated test sample with probability at least $1 - \epsilon$.
Definition 1 An algorithm $\{A_m\}$ is said to be probably approximately correct (PAC) if $r(m, \epsilon) \to 0$ as $m \to \infty$ for every fixed $\epsilon > 0$. The concept class $\mathcal{C}$ is said to be PAC learnable under the family of probability measures $\mathcal{P}$ if there exists a PAC algorithm.
The objective of statistical learning theory is to determine conditions under which there exists a PAC algorithm for a given concept class.

OBCS as a Problem in PAC Learning
In order to embed the problem of one-bit compressed sensing into the framework of concept learning, we proceed as follows. We begin with the case where the measurements are of the form $\mathrm{sign}(\langle a_i, x \rangle)$, where the $a_i$ are chosen at random according to some arbitrary probability distribution, which need not be Gaussian. Observe now that the closed half-space
$$H(x) := \{ a \in \mathbb{R}^n : \langle a, x \rangle \geq 0 \}$$
determines the vector $x$ uniquely to within a positive scalar multiple. Conversely, the vector $x$ uniquely determines the corresponding half-space $H(x)$, which remains invariant if $x$ is replaced by a positive multiple of $x$. Thus the OBCS problem can be posed as that of determining the half-space $H(x)$ given the measurements $\mathrm{sign}(\langle a_i, x \rangle)$, $i = 1, \ldots, m$, where the $a_i$ are selected at random in accordance with some probability measure $P$. Moreover, the one-bit measurement $\mathrm{sign}(\langle a_i, x \rangle)$ equals $2 I_{H(x)}(a_i) - 1$, where $I_{H(x)}(\cdot)$ denotes the indicator function of the half-space $H(x)$. Therefore, to within a simple affine transformation, the OBCS problem becomes that of determining an unknown half-space $H(x)$ from labelled samples $(a_i, I_{H(x)}(a_i))$, $i = 1, \ldots, m$, where the samples $a_i$ are generated at random according to some prespecified probability measure. This is a PAC learning problem where the various entities are as follows:
• The underlying space $X$ would be $\mathbb{R}^n$.
• The σ-algebra S would be the Borel σ-algebra on R n .
• The concept class $\mathcal{C}$ would be the collection of all half-spaces $H^n_k(x) := \{ a \in \mathbb{R}^n : \langle a, x \rangle \geq 0 \}$ (7) as $x$ varies over $\Sigma_k$, the set of $k$-sparse vectors in $\mathbb{R}^n$.
• The family of probability measures $\mathcal{P}$ can either be a singleton $\{P\}$, where $P$ is specified a priori, or $\mathcal{P}^*$, the family of all probability measures on $\mathbb{R}^n$, or anything in between.
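The affine correspondence $\mathrm{sign}(\langle a_i, x \rangle) = 2 I_{H(x)}(a_i) - 1$ underlying this embedding is easy to check numerically, treating $\mathrm{sign}(0)$ as $+1$ to match the closed half-space convention:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 100
x = np.zeros(n)
x[[1, 4]] = [1.0, -2.0]                      # a 2-sparse vector
a = rng.standard_normal((m, n))

# Membership labels for the half-space H(x) = {a : <a, x> >= 0} ...
labels = (a @ x >= 0).astype(int)

# ... and the corresponding one-bit measurements sign(<a_i, x>).
signs = np.where(a @ x >= 0, 1, -1)

# The two are related by the affine map sign = 2 * I_{H(x)} - 1.
assert np.array_equal(signs, 2 * labels - 1)
```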
When measurements are of the type $\mathrm{sign}(\langle a_i, x \rangle)$, it is inherently impossible to determine the unknown vector $x$, except to within a positive scalar multiple. This is addressed by changing the measurements to be of the form $\mathrm{sign}(\langle a_i, x \rangle + b_i)$, as suggested in [21]. Some slight modifications are required to address this modified formulation. In this case the various entities are as follows:
• The underlying space $X$ would equal $\mathbb{R}^{n+1}$.
• The σ-algebra S would be the Borel σ-algebra on R n+1 .
• The concept class $\mathcal{C}$ would be the collection of all half-spaces $H^{n+1}_k(x) := \{ (a, b) \in \mathbb{R}^{n+1} : \langle a, x \rangle + b \geq 0 \}$ as $x$ varies over $\Sigma_k$.
• The family of probability measures $\mathcal{P}$ can either be a singleton $\{P\}$, where $P$ is specified a priori, or $\mathcal{P}^*$, the family of all probability measures on $\mathbb{R}^{n+1}$, or anything in between.
For a given $x \in \Sigma_k$, if $a \in \mathbb{R}^n$ belongs to the half-space $H^n_k(x)$, then the vector $(a, 0) \in \mathbb{R}^{n+1}$ belongs to $H^{n+1}_k(x)$. However, the half-space $H^{n+1}_k(x)$ can also contain vectors of the form $(a, b)$ with $b \neq 0$.

Interpretation of the Generalization Error
In the traditional PAC learning problem formulation, the quantity of interest is the generalization error $J(T, G_m)$ defined in (5). Given two sets $A, B \in \mathcal{S}$, let us define their symmetric difference by
$$A \Delta B := (A \cap B^c) \cup (A^c \cap B),$$
where $A^c$ denotes the complement of $A$. Thus $A \Delta B$ consists of the points that belong to precisely one of the two sets but not the other. Now let us define the quantity $d_P(A, B) := P(A \Delta B)$.
Then $d_P$ is a pseudometric on $\mathcal{S}$, in that $d_P$ satisfies all the axioms of a metric, except that $d_P(A, B) = 0$ does not necessarily imply that $A = B$.
In particular, if $A \neq B$ but $A \Delta B$ has zero measure under $P$, then $A$ and $B$ are indistinguishable under $P$. Let us further define a binary relation $\sim_P$ on $\mathcal{S}$ by $A \sim_P B \iff d_P(A, B) = 0$. Then it is easy to verify that $\sim_P$ is an equivalence relation on $\mathcal{S}$. Also, it is easy to see that an alternate expression for the generalization error is $J(T, G_m) = d_P(T, G_m)$. Therefore if the hypothesis $G_m$ differs from the target concept $T$ by a set of measure zero (under the chosen probability measure $P$), then the generalization error would be zero, even though $G_m$ may not equal $T$. To put it another way, once the probability measure $P$ is specified, PAC learning tries to identify an element in the equivalence class of $T$ in the quotient space $\mathcal{C}/\!\sim_P$, and not $T$ itself.
Figure 1: The half-planes $H_{x_1}$, $H_{x_2}$ and their symmetric difference.
The above discussion explains the limitations of one-bit compressed sensing as described in [20]. In their case, they choose two vectors in $\mathbb{R}^2$, namely $x_1 = [1 \;\; 0]^T$ and $x_2 = [1 \;\; 0.5]^T$. Their choice for $P$ is the Bernoulli distribution on $\mathbb{R}^2$, which is purely atomic and assigns a weight of 0.25 to each of the four points $(1,1), (1,-1), (-1,1), (-1,-1)$. Now let us plot the half-planes $H_{x_1}$, $H_{x_2}$ and their symmetric difference, which is the shaded region shown in Figure 1. Because none of the four points $(1,1), (1,-1), (-1,1), (-1,-1)$ (shown as red circles) belongs to the symmetric difference, $x_1$ and $x_2$ are indistinguishable in OBCS under this probability measure. In [18], the authors conclude that OBCS cannot always recover an unknown vector $x$, depending on what $P$ is. Indeed, $x_1$ and $x_2$ would be indistinguishable under any probability measure that assigns a value of zero to $H_{x_1} \Delta H_{x_2}$. Therefore, one must be careful to draw the right conclusion: When subsequent theorems in this paper show that OBCS is possible under all probability measures on $\mathbb{R}^n$, including all purely atomic probability measures, what this means is that if $x \in \Sigma_k$, then OBCS will return a vector $\hat{x}$ such that $P(H_x \Delta H_{\hat{x}}) = 0$ whatever be $P$, and not that $\hat{x} = x$. Now we examine the relationship of the generalization error $J(x, \hat{x})$ to a couple of other quantities that are widely used in OBCS as error measures.
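This indistinguishability claim is easy to verify directly: none of the four atoms lies in $H_{x_1} \Delta H_{x_2}$, so $d_P(H_{x_1}, H_{x_2}) = 0$ under the four-point measure.

```python
import numpy as np

x1 = np.array([1.0, 0.0])
x2 = np.array([1.0, 0.5])
atoms = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # each carries probability 1/4

# A point lies in the symmetric difference H_{x1} symdiff H_{x2} exactly when
# the two half-plane indicators disagree at that point.
in_symdiff = [(np.dot(a, x1) >= 0) != (np.dot(a, x2) >= 0) for a in atoms]

d_P = sum(in_symdiff) / 4.0     # d_P(H_{x1}, H_{x2}) under this atomic measure
```

Since `d_P` comes out to zero, no amount of one-bit data drawn from this measure can separate $x_1$ from $x_2$.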
First, define
$$\rho(x, \hat{x}) := \frac{1}{2} \, E\left[ \, |\mathrm{sign}(\langle a, x \rangle) - \mathrm{sign}(\langle a, \hat{x} \rangle)| \, \right],$$
where $x$ is the true vector and $\hat{x}$ is its estimate. Note that $|\mathrm{sign}(\langle a, x \rangle) - \mathrm{sign}(\langle a, \hat{x} \rangle)|$ equals 0 or 2.
Next, we examine the relationship of $\rho(x, \hat{x})$ to $\|x - \hat{x}\|_2$. Without loss of generality it can be assumed that both $x$ and $\hat{x}$ have unit Euclidean norm. This can be achieved using some results from [27]. Define $\theta(x, \hat{x}) := \arccos \langle x, \hat{x} \rangle$, the angle between the two unit vectors. Then we have the following results.
Lemma 1 Let $P$ be any radially invariant probability measure on $\mathbb{R}^n$, and suppose $a \in \mathbb{R}^n$ is drawn at random according to $P$. Suppose $\|x\|_2 = \|\hat{x}\|_2 = 1$, and let $J(x, \hat{x})$ denote the generalization error defined in (5). Then $J(x, \hat{x})$ is bounded above and below by constant multiples of $\|x - \hat{x}\|_2$.

Proof: We make use of a couple of results from [27]. First, [27, Lemma 3.2] states that, for a radially invariant measure $P$, the probability that $\mathrm{sign}(\langle a, x \rangle) \neq \mathrm{sign}(\langle a, \hat{x} \rangle)$ equals $\theta(x, \hat{x})/\pi$; therefore $J(x, \hat{x}) = \theta(x, \hat{x})/\pi$. Second, [27, Lemma 3.4] bounds the angle $\theta(x, \hat{x})$ in terms of the chord length $\|x - \hat{x}\|_2$. Now note that, when both $x$ and $\hat{x}$ are unit vectors, we have $\|x - \hat{x}\|_2 = 2 \sin(\theta(x, \hat{x})/2)$, and combining the two estimates completes the proof.
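The key identity used here, that for a radially invariant distribution the probability of sign disagreement equals the angle between the two vectors divided by $\pi$, can be checked by Monte Carlo simulation (Gaussian samples, an angle of 1 radian, and the sample size are our choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 3, 200_000

x = np.array([1.0, 0.0, 0.0])
xhat = np.array([np.cos(1.0), np.sin(1.0), 0.0])   # unit vector 1 radian from x

# For radially invariant P (e.g. standard Gaussian), the probability that the
# two half-spaces disagree on a random sample equals theta(x, xhat) / pi.
a = rng.standard_normal((m, n))
disagree = np.mean(np.sign(a @ x) != np.sign(a @ xhat))
predicted = np.arccos(np.clip(x @ xhat, -1, 1)) / np.pi
```

With $2 \times 10^5$ samples, the empirical disagreement rate matches the prediction $1/\pi \approx 0.318$ to within Monte Carlo error.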

PAC Learning via the Vapnik-Chervonenkis (VC) Dimension
One of the most useful concepts in PAC learning theory is defined next. A finite set $U \subseteq X$ is said to be shattered by the concept class $\mathcal{C}$ if, for every subset $B \subseteq U$, there exists a concept $T \in \mathcal{C}$ such that $T \cap U = B$. Therefore a concept class $\mathcal{C}$ has VC-dimension $d$ if two statements hold: (i) there exists a set of cardinality $d$ that is shattered by $\mathcal{C}$, and (ii) no set of cardinality larger than $d$ is shattered by $\mathcal{C}$. If there exist sets of arbitrarily large cardinality that are shattered by $\mathcal{C}$, then its VC-dimension is defined to be infinite.
If a concept class has finite VC-dimension, then it is PAC learnable under every probability distribution on $X$. An algorithm is said to be consistent if it always produces a hypothesis that classifies all the training samples correctly. In other words, an algorithm is consistent if the hypothesis $G_m$ produced by applying the algorithm to the sequence $\{(c_i, I_T(c_i))\}_{i \geq 1}$ has the property that $I_T(c_i) = I_{G_m}(c_i)$ for all $i \leq m$ and all $m$. Note that a consistent algorithm always exists if the axiom of choice is assumed. However, in some situations, it is NP-hard or NP-complete to compute a consistent hypothesis.
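For a finite concept class, a consistent algorithm can always be realized by exhaustive search, since the target itself is consistent with its own labels. A toy sketch over the $2n$ half-spaces generated by one-sparse vectors $\pm e_j$ (the class and all parameters are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(6)
n, m = 6, 50

# Finite concept class: half-spaces H(x) for the 2n one-sparse vectors x = +-e_j.
concepts = [(s, j) for s, j in product((+1, -1), range(n))]

def member(concept, c):
    """Indicator of the half-space H(s * e_j) at the point c."""
    s, j = concept
    return s * c[j] >= 0

# Hidden target concept and i.i.d. samples labelled by the oracle.
target = (+1, 2)
samples = rng.standard_normal((m, n))
labels = [member(target, c) for c in samples]

# A brute-force consistent algorithm: return any concept that classifies every
# training sample correctly. The target itself guarantees that one exists.
hypothesis = next(g for g in concepts
                  if all(member(g, c) == y for c, y in zip(samples, labels)))
```

Exhaustive search is of course exponential for the class $\mathcal{H}^n_k$ of interest in this paper, which is precisely why Section 6 resorts to a heuristic.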
With these notions in place, we have the following very fundamental result.
Theorem 1 ([28]; see also [26, Theorem 7.6]) Suppose a concept class $\mathcal{C}$ has finite VC-dimension. Then $\mathcal{C}$ is PAC learnable for every probability measure on $X$. Suppose that $d$ is an upper bound for $\mathrm{VC\text{-}dim}(\mathcal{C})$, and let $\{A_m\}$ be any consistent algorithm. Then, no matter what the underlying probability measure is, the learning rate is bounded by
$$r(m, \epsilon) \leq 2 \left( \frac{2em}{d} \right)^d 2^{-\epsilon m / 2},$$
where $e$ denotes the base of the natural logarithm. Therefore $r(m, \epsilon) \to 0$ as $m \to \infty$ for every fixed $\epsilon > 0$, no matter how the samples are chosen.
Note that the number of samples required to achieve an accuracy of $\epsilon$ with confidence $1 - \delta$ is $O((d/\epsilon) \ln(1/\epsilon) + (1/\epsilon) \ln(1/\delta))$. However, the main challenge in applying this result is in finding a consistent algorithm.
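For concreteness, a bound of the form $r(m, \epsilon) \leq 2 (2em/d)^d \, 2^{-\epsilon m / 2}$ (the Blumer et al. bound for consistent algorithms; we assume this form here) can be inverted numerically to find how many samples suffice for a given $(\epsilon, \delta)$. Working in $\log_2$ avoids floating-point overflow:

```python
import math

def lg_rate_bound(m, eps, d):
    # log2 of the assumed bound  r(m, eps) <= 2 * (2 e m / d)^d * 2^(-eps m / 2)
    return 1.0 + d * math.log2(2 * math.e * m / d) - eps * m / 2

def samples_needed(eps, delta, d):
    # Smallest m (starting from m = d) at which the bound drops to delta.
    m = d
    while lg_rate_bound(m, eps, d) > math.log2(delta):
        m += 1
    return m

# Example: the VC-dimension estimate d ~ 2k lg(n/k) for n = 1024, k = 16.
d = 2 * 16 * round(math.log2(1024 / 16))   # = 192
m_needed = samples_needed(eps=0.1, delta=0.05, d=d)
```

The resulting `m_needed` grows roughly like $(d/\epsilon)\ln(1/\epsilon)$, in line with the sample-complexity estimate above; the constants are those of this particular bound, not sharp ones.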
Theorem 1 shows that the finiteness of the VC-dimension of a concept class is a sufficient condition for PAC learnability. The next result shows that the condition is also necessary.
Theorem 2 ([28,29]; see also [26, Theorem 7.7]) Suppose a concept class $\mathcal{C}$ has VC-dimension $d \geq 2$. Then there exist probability measures on $X$ such that any algorithm requires at least $\Omega\!\left( d/\epsilon + (1/\epsilon) \ln(1/\delta) \right)$ samples in order to learn to accuracy $\epsilon$ and confidence $\delta$.

Estimates of the VC-Dimension
The main enabler of the PAC approach to OBCS is an explicit estimate of the VC-dimension of half-spaces generated by k-sparse vectors.
Theorem 3 Let $\mathcal{H}^n_k$ denote the set of half-spaces $H^n_k(x)$ in $\mathbb{R}^n$ generated by $k$-sparse vectors, as defined in (7). Then
$$k(\lfloor \lg(n/k) \rfloor + 1) \leq \mathrm{VC\text{-}dim}(\mathcal{H}^n_k) \leq \lfloor 2k \lg(en) \rfloor. \quad (14)$$
Proof: We begin with the upper bound in (14). It is shown that if a set $U = \{u_1, \ldots, u_l\} \subseteq \mathbb{R}^n$ is shattered by the collection of half-spaces $\mathcal{H}^n_k$, then $l \leq \lfloor 2k \lg(en) \rfloor$. The proof combines a few ideas that are standard in PAC learning theory, which are stated next.
The first result needed is [30, Theorem 7.2]. It states the following: Suppose that $\mathcal{F}$ is a collection of functions mapping a given set $Z$ into $\mathbb{R}$, such that $\mathcal{F}$ is a $k$-dimensional real vector space over the field $\mathbb{R}$. Define the associated collection of subsets of $Z$ by
$$\mathrm{Pos}(f) := \{ z \in Z : f(z) \geq 0 \}, \quad \mathrm{Pos}(\mathcal{F}) := \{ \mathrm{Pos}(f) : f \in \mathcal{F} \}.$$
Then $\mathrm{VC\text{-}dim}(\mathrm{Pos}(\mathcal{F})) = k$. To apply the above theorem to this particular instance, we fix the integer $k$ as well as a support set $S \subseteq \{1, \ldots, n\}$ with $|S| = k$, and choose $\mathcal{F}$ to be the set of functions $\{ f(z) = \langle z, x \rangle : \mathrm{supp}(x) \subseteq S \}$. This family of functions is clearly a $k$-dimensional linear space, because the adjustable parameter here is the $k$-sparse vector $x$ with support in the fixed set $S$. Therefore it follows that, if we define $\mathcal{H}_S := \{ H(x) : \mathrm{supp}(x) \subseteq S \}$, then $\mathrm{VC\text{-}dim}(\mathcal{H}_S) = k$. The next result needed is Sauer's lemma [31], which states the following: Suppose $\mathcal{C}$ is a collection of subsets of $X$ with finite VC-dimension $d$, and that $U \subseteq X$ has cardinality $l \geq d$. Then
$$|\{ U \cap T : T \in \mathcal{C} \}| \leq \sum_{i=0}^{d} \binom{l}{i} \leq \left( \frac{el}{d} \right)^d,$$
where $e$ denotes the base of the natural logarithm. Strictly speaking, Sauer's lemma is the first inequality, which states that the number of subsets of $U$ that can be generated by taking intersections with sets in the collection $\mathcal{C}$ is bounded by the summation shown. The second bound is derived in [28]. By applying Sauer's lemma to the problem at hand, it can be seen that, for a fixed support set $S$, the number of subsets of $U$ that can be generated by intersecting with the collection of half-spaces $\mathcal{H}_S$ is bounded by $(el/k)^k$, because $\mathcal{H}_S$ has VC-dimension $k$. Note that a similar bound is derived in [16], but without any reference to Sauer's lemma. Also, the result in [16] is specifically for collections of half-spaces, whereas Sauer's lemma is for completely general collections of sets. Now observe that $\mathcal{H}^n_k$ is just the union of the collections $\mathcal{H}_S$ as $S$ ranges over all subsets of $\{1, \ldots, n\}$ with $|S| = k$. The number of such sets $S$ is the binomial coefficient $\binom{n}{k}$, which is bounded by $n^k$. Moreover, for each fixed support set $S$, the collection of subsets $\{ H \cap U : H \in \mathcal{H}_S \}$ has cardinality no larger than $(el/k)^k$, as shown above. Therefore
$$|\{ H \cap U : H \in \mathcal{H}^n_k \}| \leq n^k \left( \frac{el}{k} \right)^k = \left( \frac{nel}{k} \right)^k.$$
The final step in the proof comes from [26, Lemma 4.6], which states the following (see specifically item 2 of this lemma): Suppose $\alpha, \beta > 0$, $\alpha\beta > 4$ and $l \geq 1$. Then
$$l \leq \alpha \lg(\beta l) \implies l < 2\alpha \lg(\alpha\beta). \quad (15)$$
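Both inequalities in Sauer's lemma are straightforward to sanity-check numerically over a range of $(l, d)$ pairs:

```python
import math

def sauer_sum(l, d):
    # Number of subsets of an l-point set realizable by a class of
    # VC-dimension d is at most sum_{i=0}^{d} C(l, i) (Sauer's lemma) ...
    return sum(math.comb(l, i) for i in range(d + 1))

def sauer_upper(l, d):
    # ... which in turn is at most (e l / d)^d for l >= d.
    return (math.e * l / d) ** d

checks = [sauer_sum(l, d) <= sauer_upper(l, d)
          for d in range(1, 8) for l in range(d, 50)]
```

This is only a numerical confirmation over small parameters, not a proof; the proof of the second inequality is the one cited from [28].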
In the present instance, the collection of sets H n k ∩ U has cardinality no larger (nel/k) k , whereas U has 2 l subsets in all.Therefore, if U is shattered by the collection H n k , then we must have or, after taking binary logarithms, which is of the form ( 15) with α = k, β = ne/k.Substituting these values into (15) leads to the conclusion that l ≤ 2k lg(ne), provided αβ = ne ≥ 4, which holds if n ≥ 2. Because l is an integer, we can replace the right side by its integer part, which leads to the upper bound in (14).Now we turn our attention to the lower bound.First we consider the simple case where k = 1.Given n, define l = ⌊lg n⌋ + 1, so that n ≥ 2 l−1 .The first step is to show that the set of half-spaces H n 1 generated by "onesparse vectors" has VC-dimension l.Let s = l − 1 = ⌊lg n⌋, and enumerate the 2 s bipolar row vectors in {−1, 1} s in some order, call them v 1 , . . .v 2 s .Now define the n × l matrix In other words, the matrix M has a first column of ones, and then the 2 s bipolar vectors in {−1, 1} s in some order, padded by a block of zeros in case n > 2 l−1 .As we shall see below, the "padding" is not used.Now define U = {u 0 , u 1 , . . ., u s } denote the s + 1 = l columns of the matrix M .Note that for notational convenience we start numbering the columns with 0 rather than 1.It is claimed that the collection of half-spaces H n 1 shatters this set U , thus showing that VC-dim(H n 1 ) ≥ l.To show that the set U is shattered, let B ⊆ U be an arbitrary subset.Thus B consists of some columns of the matrix M .We examine two cases separately.First, suppose u 0 ∈ B. Then we associate a unique integer r between 1 and 2 s as follows.Define a bipolar vector i B ∈ {−1, 1} s by i j = 1 if u j ∈ B, and i j = −1 if u j ∈ B. This bipolar vector i B must be one of the vectors v 1 , . . 
Let $r$ be the unique integer such that $i_B = v_r$. Define the vector $x \in \mathbb{R}^n$ such that $x_r = 1$ and the remaining elements of $x$ are all zero, and note that $x \in \Sigma_1$. Then $\langle u_0, x \rangle = 1$, while $\langle u_j, x \rangle = 1$ if $u_j \in B$ and $\langle u_j, x \rangle = -1$ if $u_j \notin B$. Therefore the associated half-space $H(x)$ contains precisely the elements of the specified set $B$. Next, suppose $u_0 \notin B$; in this case we basically flip the signs. Thus the bipolar vector $i_B \in \{-1,1\}^s$ is chosen such that $i_j = -1$ if $u_j \in B$, and $i_j = 1$ if $u_j \notin B$. If this bipolar vector corresponds to $v_r$ in the enumeration of $\{-1,1\}^s$, we choose $x \in \Sigma_1$ to have a $-1$ in row $r$ and zeros elsewhere. This argument shows that the set $\mathcal{H}^n_1$ generated by all one-sparse vectors $x$ has VC-dimension at least $\lfloor \lg n \rfloor + 1$, which is consistent with the left side of (14) when $k = 1$.
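The shattering construction for $k = 1$ can be verified by brute force on small instances. The sketch below (the helper `shattered_by_one_sparse` is our own illustration, not from the paper) builds the matrix $M$, takes its columns as the set $U$, and checks that every subset of $U$ is cut out by a half-space generated by some one-sparse $\pm 1$ vector:

```python
import itertools
import numpy as np

def shattered_by_one_sparse(n: int) -> bool:
    # Build M from the proof: a column of ones followed by all bipolar
    # vectors of length s = floor(lg n), padded with zero rows if needed.
    s = n.bit_length() - 1          # floor(lg n)
    l = s + 1
    rows = [[1] + list(v) for v in itertools.product([-1, 1], repeat=s)]
    M = np.zeros((n, l))
    M[:2 ** s, :] = rows
    U = [M[:, j] for j in range(l)]  # the l points to be shattered

    # One-sparse candidates: +/- the standard basis vectors of R^n.
    candidates = [sign * np.eye(n)[r] for r in range(n) for sign in (1, -1)]

    # Every subset of U (encoded as a label pattern) must be cut out by
    # the half-space H(x) = {u : <u, x> > 0} for some candidate x.
    for labels in itertools.product([False, True], repeat=l):
        ok = any(all((u @ x > 0) == want for u, want in zip(U, labels))
                 for x in candidates)
        if not ok:
            return False
    return True
```

For instance, `shattered_by_one_sparse(8)` confirms that a set of $\lfloor \lg 8 \rfloor + 1 = 4$ points is shattered by $\mathcal{H}^8_1$.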
To extend the above argument to general values of $k$, suppose $n$ and $k$ are specified, and define $l = \lfloor \lg(n/k) \rfloor + 1$ and $s = l - 1 = \lfloor \lg(n/k) \rfloor$.
Then $n/k \geq 2^s$, or equivalently, $n \geq k 2^s$. Define matrices $M_1, \ldots, M_k \in \{-1,1\}^{2^s \times l}$ in analogy with (16). Then define a matrix $M \in \{-1, 0, 1\}^{n \times kl}$ as a block-diagonal matrix containing $M_1, \ldots, M_k$ on the diagonal blocks, padded by an appropriate number of zero rows so that the number of rows equals $n$. In other words, $M$ has the form
$$M = \begin{bmatrix} M_1 & & \\ & \ddots & \\ & & M_k \\ \mathbf{0} & \cdots & \mathbf{0} \end{bmatrix}.$$
Define $U$ to be the set of columns of the matrix $M$, and note that $|U| = kl = k(\lfloor \lg(n/k) \rfloor + 1)$. It is now shown that the set $U$ is shattered by the collection $\mathcal{H}^n_k$ of half-spaces generated by $k$-sparse vectors. Partition $U$ as $U_1 \cup \cdots \cup U_k$, where each $U_i$ consists of $l$ column vectors. Then any specified subset $B \subseteq U$ can be expressed as a union $B_1 \cup \cdots \cup B_k$, where $B_i \subseteq U_i$ for each $i$. Now it is possible to mimic the arguments of the previous paragraph to show that the set $U$ is shattered by the collection of half-spaces $\mathcal{H}^n_k$. For each subset $B_i$, identify an integer $r_i$ between $1$ and $2^s$ such that the bipolar vector $i_{B_i}$ is the $r_i$-th in the enumeration of $\{-1,1\}^s$. For each index $i$ between $1$ and $k$, let $x^i \in \mathbb{R}^{2^s}$ contain a $\pm 1$ in row $r_i$ (with the sign chosen as in the case $k = 1$) and zeros elsewhere. Define $x \in \mathbb{R}^n$ by stacking $x^1$ through $x^k$, followed by $n - k2^s$ zeros. This shows that it is possible to shatter a set of cardinality $k(1 + \lfloor \lg(n/k) \rfloor)$, which establishes the lower bound (the left inequality) in (14).
Theorem 3 is applicable to the case where measurements are of the form $\mathrm{sign}(\langle a_i, x \rangle)$. Such measurements can at best lead to the recovery of the direction of a $k$-sparse vector $x$, but not its magnitude. In situations where it is desired to recover a sparse vector in its entirety, the measurements are changed to
$$y_i = \mathrm{sign}(\langle a_i, x \rangle + \theta), \qquad (17)$$
where $x$ varies over $\Sigma_k \subseteq \mathbb{R}^n$ and $\theta \in \mathbb{R}$. The concept class in this case is given by the collection of affine half-spaces $H(x, \theta) = \{a \in \mathbb{R}^n : \langle a, x \rangle + \theta > 0\}$. By inspecting the equation (17), we deduce that
$$\langle a, x \rangle + \theta = \langle [a \;\; 1], [x \;\; \theta]^\top \rangle, \qquad (19)$$
where $x \in \Sigma_k$ and $\theta \in \mathbb{R}$, so that the vector $[x \;\; \theta]^\top$ is $(k+1)$-sparse. We are now ready to state the first result of this section.
Theorem 4 Let $\mathcal{H}^{n+1}_{k+1}$ denote the set of half-spaces in $\mathbb{R}^{n+1}$ corresponding, via (19), to the affine half-spaces defined in (17). Then the bounds of Theorem 3 hold with $n$ replaced by $n+1$ and $k$ replaced by $k+1$.

Proof: From (19) we conclude that every half-space in this collection is generated by a $(k+1)$-sparse vector in $\mathbb{R}^{n+1}$; the desired result then follows from Theorem 3.

OBCS with Noisy Measurements
In this section we study the one-bit compressed sensing problem when the information available to the learner is a noisy version of the true output $\mathrm{sign}(\langle a_i, x \rangle)$ or $\mathrm{sign}(\langle a_i, x \rangle + b_i)$, where $x$ is an unknown $k$-sparse vector.
Specifically, the label $y_i$ equals this sign with probability $1 - \alpha$, and is flipped with probability $\alpha$, where $\alpha \in (0, 0.5)$. In the PAC learning literature, the problem of concept learning with mislabelling has been studied in several papers; rather than cite these individually, we refer the reader to a recent paper [32] and the references therein. In [32], which represents the state of the art, it is assumed that the error rate $\alpha$ is known. By adopting a different approach, we are able to show that minimizing empirical risk leads to provably near-optimal estimates, even without knowing $\alpha$. Therefore the results given here are of independent interest. Note that in [32] it is not assumed that the two error probabilities (namely, a one becoming a zero and vice versa) are equal. This assumption is made here solely in the interest of convenience, and can be dispensed with at the expense of more elaborate notation.
To make the problem formulation precise, we use the notation in Section 3.1, whereby $X$ is a set, $\mathcal{S}$ is a $\sigma$-algebra of subsets of $X$, and $P$ is a probability measure on $X$. To incorporate the randomness, we enlarge $X$ by defining $X_N = X \times \{0, 1\}$ as the sample space; define $\mathcal{S}_N$ to be the $\sigma$-algebra of subsets of $X_N$ generated by cylinder sets of the form $S \times \{0\}$ and $S \times \{1\}$ for all $S \in \mathcal{S}$; and define a probability measure $P_N$ on $X_N$ by
$$P_N(S \times \{0\}) = (1 - \alpha) P(S), \quad P_N(S \times \{1\}) = \alpha P(S), \quad \forall S \in \mathcal{S}.$$
Let $(c, L)$ denote a typical element of the sample space $X_N$. Here the event $L = 0$ corresponds to the label not being flipped, while the event $L = 1$ corresponds to the label being flipped. It is clear that $P_N$ is a product measure, so that the flipping of labels is independent of the generation of training samples. Learning takes place as follows: Independent samples $\{(c_i, L_i)\}_{i \geq 1}$ are generated in accordance with the above probability measure $P_N$. Let $T$ be a fixed but unknown target concept. Then for each $i$, a label $y_i$ is generated as
$$y_i = \begin{cases} I_T(c_i), & L_i = 0, \\ 1 - I_T(c_i), & L_i = 1. \end{cases}$$
This is equivalent to saying that $y_i = I_T(c_i)$ with probability $1 - \alpha$ and $y_i = 1 - I_T(c_i)$ with probability $\alpha$. As before, an algorithm is an indexed family of maps $A_m : (X \times \{0,1\})^m \to \mathcal{C}$ for each $m \geq 1$. The algorithm $A_m$ is applied to the set of labelled samples $\{(c_i, y_i)\}_{i=1}^m$, giving rise to a hypothesis $G_m$.
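The sampling model above is straightforward to simulate. In the sketch below (a toy instance of our own choosing: $X = [0, 1)$ with $P$ uniform and target concept $T = [0.5, 1)$), labels are flipped independently with probability $\alpha$, and the empirical flip rate concentrates near $\alpha$:

```python
import numpy as np

def noisy_labels(indicator, samples, alpha, rng):
    """Return labels equal to I_T(c_i) with probability 1 - alpha and to
    1 - I_T(c_i) with probability alpha, flips independent of the c_i."""
    clean = np.array([indicator(c) for c in samples])
    flips = rng.random(len(samples)) < alpha      # the event L_i = 1
    return np.where(flips, 1 - clean, clean)

# Toy instance: X = [0, 1) with P uniform, target concept T = [0.5, 1).
rng = np.random.default_rng(0)
indicator_T = lambda c: int(c >= 0.5)
cs = rng.random(100_000)
ys = noisy_labels(indicator_T, cs, alpha=0.1, rng=rng)

# The empirical flip rate concentrates near alpha = 0.1.
clean = (cs >= 0.5).astype(int)
flip_rate = float(np.mean(ys != clean))
```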
To assess how well a hypothesis $F$ (however it is derived) approximates the unknown target concept $T$, we generate a random test input $x \in X$ according to $P$, and then predict that the oracle output on $x$ will be $I_F(x)$. The error criterion therefore equals
$$J_N(T, F) = E_{P_N}[\,|y - I_F(x)|\,],$$
where $y$ is the noisy label corresponding to $I_T(x)$, and $I_F$ is the indicator function of $F$. The premise in the above definition is that, while the oracle output is noisy, our prediction is not noisy.
The main difference from the case of noise-free labelling is that, even if $F$ were to equal $T$, the error $J_N(T, F)$ would not equal zero, due to the noisy labelling. Note that, for a given $x \in X$ with $I_T(x) = I_F(x)$, the quantity $|y - I_F(x)|$ equals $1$ precisely when the label is flipped, which happens with probability $\alpha$. In the case of noise-free labelling, the minimum achievable value of the error measure $J$ (as defined in (5)) is $0$, which is achieved by any hypothesis $F$ such that $P(T \Delta F) = 0$. In contrast, the minimum achievable value of the modified error measure $J_N$ is $\alpha$, which is again achieved by any hypothesis $F$ such that $P(T \Delta F) = 0$. Therefore, to measure the performance of an algorithm with noise-corrupted labels, one should compare the error $J_N$ with its minimum achievable value $\alpha$. This motivates the next definition. Let $G_m$ denote the hypothesis generated by the algorithm, and set
$$r_N(m, \epsilon) = \sup_{T \in \mathcal{C}} \Pr\{J_N(T, G_m) > \alpha + \epsilon\}.$$
Note that if $\alpha = 0$, so that the measurements are noise-free, then $J_N$ reduces to $J$ as defined in (5), and $r_N(m, \epsilon)$ reduces to $r(m, \epsilon)$ as defined in (6).
Definition 3 An algorithm $\{A_m\}$ is said to be probably approximately correct (PAC) with noise-corrupted measurements if $r_N(m, \epsilon) \to 0$ as $m \to \infty$ for every fixed $\epsilon > 0$. The concept class $\mathcal{C}$ is said to be PAC learnable with noise-corrupted measurements if there exists a PAC algorithm.
In the case of noise-free measurements, the results on learnability were stated in terms of a consistent algorithm, which always exists if one were to assume the axiom of choice. In contrast, in the case where the labels are noisy, it might not be possible to construct a hypothesis that is consistent. Therefore the notion of consistency is replaced by the notion of minimizing empirical risk. Suppose we are given a labelled sample sequence $\{(c_i, y_i) \in X \times \{0,1\}\}_{i \geq 1}$, and suppose $F \in \mathcal{C}$ is a hypothesis. Then the empirical risk of the hypothesis with respect to this labelled sequence, after $m$ samples, is defined as
$$\hat{J}_m(F) = \frac{1}{m} \sum_{i=1}^m |y_i - I_F(c_i)|.$$
Definition 4 An algorithm $\{A_m\}_{m \geq 1}$ is said to minimize empirical risk, or to be a MER algorithm, if for all sample sequences $\{(c_i, y_i) \in X \times \{0,1\}\}_{i \geq 1}$ and all integers $m$, it is the case that
$$\hat{J}_m(G_m) \leq \hat{J}_m(F), \quad \forall F \in \mathcal{C},$$
where $G_m = A_m(\{(c_i, y_i)\}_{i=1}^m)$ is the output of the algorithm after $m$ samples.
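For a finite concept class, a MER algorithm can be implemented by exhaustive search. The sketch below (threshold concepts on $[0,1)$, a toy class of our own choosing with VC-dimension 1; all names are illustrative) computes the empirical risk $\hat{J}_m(F)$ and returns its minimizer:

```python
import numpy as np

def empirical_risk(indicator_F, cs, ys):
    # \hat{J}_m(F) = (1/m) * sum_i |y_i - I_F(c_i)|
    return float(np.mean([abs(y - indicator_F(c)) for c, y in zip(cs, ys)]))

def mer(concepts, cs, ys):
    # A MER "algorithm" for a finite concept class: return a hypothesis
    # minimizing the empirical risk over the class.
    return min(concepts, key=lambda F: empirical_risk(F, cs, ys))

# Toy class of threshold concepts [t, 1) on X = [0, 1); VC-dimension 1.
thresholds = np.linspace(0.0, 1.0, 21)
concepts = [lambda c, t=t: int(c >= t) for t in thresholds]

rng = np.random.default_rng(1)
cs = rng.random(2000)
clean = (cs >= 0.5).astype(int)                  # target T = [0.5, 1)
ys = np.where(rng.random(2000) < 0.1, 1 - clean, clean)   # alpha = 0.1

G = mer(concepts, cs, ys)
```

For the half-space classes of interest in OBCS the class is infinite and this search is NP-hard, as discussed later; the toy class only illustrates the definition.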
Note that if the labels are noise-free, then $y_i = I_T(c_i)$, and
$$\hat{J}_m(F) = \frac{1}{m} \sum_{i=1}^m |I_T(c_i) - I_F(c_i)|$$
is the empirical estimate of the distance $D_P(T, F)$. In this case, a MER algorithm becomes a consistent algorithm. Now we state the main result regarding PAC learning with noisy labels.
Theorem 5 Suppose $P \in \mathcal{P}^*$ is an arbitrary probability measure on $X$, and suppose that VC-dim$(\mathcal{C}) \leq d$. Let $T \in \mathcal{C}$ be any fixed but unknown target concept, and let $G_m$ be the output of a MER algorithm after $m$ labelled samples. Then
$$\Pr\{J_N(T, G_m) > \alpha + \epsilon\} \leq \left[ 1 + 4\left( \frac{0.2\,em}{d} \right)^{10d} \right] \exp(-0.08 m \epsilon^2).$$
As before, let $\mathcal{C}_0 \cup \mathcal{C}_1^c$ denote the collection of sets that generate a label $y_i$ of one. Similarly, $\mathcal{C}_N$ is the collection of sets that generate a label $I_F$ of one. Given concept classes $\mathcal{A}_N, \mathcal{B}_N \subseteq \mathcal{S}_N$, define
$$\mathcal{A}_N \Delta \mathcal{B}_N = \{A \Delta B : A \in \mathcal{A}_N, B \in \mathcal{B}_N\}.$$
Then it follows from [26, Theorem 4.5], applied with $k = 2$, that VC-dim$(\mathcal{A}_N \Delta \mathcal{B}_N) \leq 10 \max\{$VC-dim$(\mathcal{A}_N)$, VC-dim$(\mathcal{B}_N)\}$.
Then $\hat{J}$ as defined in (26) is the empirical distance between two sets, one belonging to $\mathcal{C}_0 \cup \mathcal{C}_1^c$ and the other belonging to $\mathcal{C}_N$.

Proof: We begin by establishing an elementary result. Suppose $X_1, X_2$ are random variables, not necessarily independent, and that $\epsilon_1, \epsilon_2$ are thresholds. Then
$$\Pr\{X_1 + X_2 > \epsilon_1 + \epsilon_2\} \leq \Pr\{X_1 > \epsilon_1\} + \Pr\{X_2 > \epsilon_2\}.$$
To see this, note that
$$X_1 \leq \epsilon_1 \text{ and } X_2 \leq \epsilon_2 \implies X_1 + X_2 \leq \epsilon_1 + \epsilon_2.$$
Taking the contrapositive shows that
$$X_1 + X_2 > \epsilon_1 + \epsilon_2 \implies X_1 > \epsilon_1 \text{ or } X_2 > \epsilon_2.$$
Returning to the theorem, suppose $T$ is the target concept and $G_m$ is the hypothesis produced by a MER algorithm. Because $G_m$ is the output of a MER algorithm, we have that
$$\hat{J}(T, G_m) \leq \hat{J}(T, T), \qquad (32)$$
where $\hat{J}(T, T)$ denotes the empirical distance
$$\hat{J}(T, T) = \frac{1}{m} \sum_{i=1}^m |y_i - I_T(c_i)|.$$
Note that the right side is a random variable with an expected value of $\alpha$ (the probability of the label $I_T(c_i)$ being flipped). Therefore, by the additive form of the Chernoff bound (see for example [26, p. 24]), it follows that
$$\Pr\{\hat{J}(T, T) > \alpha + \epsilon_1\} \leq \exp(-2m\epsilon_1^2), \quad \forall \epsilon_1 > 0. \qquad (33)$$
Combining (32) and (33) shows that
$$\Pr\{\hat{J}(T, G_m) > \alpha + \epsilon_1\} \leq \exp(-2m\epsilon_1^2).$$
Next, due to the uniform convergence property of empirical means to their true values, it follows that
$$\Pr\{|J_N(T, G_m) - \hat{J}(T, G_m)| > \epsilon_2\} \leq 4\left( \frac{0.2\,em}{d} \right)^{10d} \exp(-m\epsilon_2^2/8).$$
Now, given a threshold $\epsilon$, we can choose any $\epsilon_1, \epsilon_2$ such that $\epsilon = \epsilon_1 + \epsilon_2$, and apply the above bounds. We choose $\epsilon_1 = 0.2\epsilon$ and $\epsilon_2 = 0.8\epsilon$, so that the two exponents match: $2\epsilon_1^2 = \epsilon_2^2/8 = 0.08\epsilon^2$. This leads to
$$\Pr\{J_N(T, G_m) > \alpha + \epsilon\} \leq \left[ 1 + 4\left( \frac{0.2\,em}{d} \right)^{10d} \right] \exp(-0.08 m \epsilon^2),$$
which completes the proof.
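The choice $\epsilon_1 = 0.2\epsilon$, $\epsilon_2 = 0.8\epsilon$ is exactly the split that equalizes the two decay rates, which can be confirmed in a couple of lines:

```python
# Chernoff tail decays as exp(-2 m eps1^2); the uniform-convergence tail
# decays as exp(-m eps2^2 / 8).  With eps1 = 0.2*eps and eps2 = 0.8*eps
# both exponents equal 0.08 * m * eps^2.
def rates(eps):
    eps1, eps2 = 0.2 * eps, 0.8 * eps
    return 2 * eps1 ** 2, eps2 ** 2 / 8

r1, r2 = rates(0.37)                # any eps > 0 gives the same match
assert abs(r1 - r2) < 1e-15
assert abs(r1 - 0.08 * 0.37 ** 2) < 1e-15
```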

Algorithm for One-Bit Compressed Sensing
By combining Theorem 2 with Theorem 3 on the lower bound on the VC-dimension of half-spaces generated by $k$-sparse $n$-vectors, we can prove the following result:

Theorem 7 There exists a probability measure $P$ on $\mathbb{R}^n$ such that any algorithm that leads to a uniform error estimate of the form $J(x, \hat{x}) \leq \epsilon$ for all $k$-sparse vectors $x \in \mathbb{R}^n$ requires at least $\Omega(k \lg(n/k))$ samples.
While this theorem might be only of theoretical interest, it does show the intrinsic difficulty of OBCS: a sample-complexity barrier that no algorithm can cross. Now let us study how to solve the OBCS problem. The results in Theorems 1 and 3 can be combined to produce the following "conceptual" algorithm.
Theorem 8 Let integers $n, k$ with $k \ll n$ be specified, and suppose $x \in \mathbb{R}^n$ is $k$-sparse. Let $P$ be an arbitrary probability distribution on $\mathbb{R}^n$, and suppose $\{a_i\}_{i \geq 1}$ are generated independently at random according to $P$. Let $y_i = \mathrm{sign}(\langle a_i, x \rangle)$ for all $i$. With these conventions, any algorithm that generates a $k$-sparse estimate $\hat{x}$ such that $y_i = \mathrm{sign}(\langle a_i, \hat{x} \rangle)$ for all $i$ is probably approximately correct. In particular, given an accuracy $\epsilon$ and a confidence $\delta$, let $d = \lfloor 2k \lg(ne) \rfloor$, and choose at least $m$ samples, where
$$m \geq \max\left\{ \frac{4}{\epsilon} \lg \frac{2}{\delta}, \; \frac{8d}{\epsilon} \lg \frac{13}{\epsilon} \right\}.$$
Then it can be guaranteed with confidence of at least $1 - \delta$ that $J(x, \hat{x}) \leq \epsilon$ and $\rho(x, \hat{x}) \leq 2\epsilon$. Moreover, if $P$ is a radially invariant probability distribution on $\mathbb{R}^n$, then the analogous guarantee holds with confidence of at least $1 - \delta$, with the constant $\lambda$ as defined in (10).
In the case where the labels $y_i$ are noisy versions of $I_T(c_i)$, the above theorem can be modified to say that any MER algorithm is PAC, using Theorem 5.
Let us return to the conceptual algorithm outlined above. Suppose we are given randomly generated labelled samples $\{(a_i, y_i)\}_{i=1}^m$, where $a_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$, and $y_i$ is a possibly noise-corrupted measurement of $\mathrm{sign}(\langle a_i, x \rangle)$. Define $M = \{1, \ldots, m\}$, $M_+ = \{i \in M : y_i = 1\}$, and $M_- = \{i \in M : y_i = -1\}$. Ideally, in the case where measurements are noise-free, we would like to find a $k$-sparse vector $\hat{x}$ such that
$$\langle a_i, \hat{x} \rangle > 0 \;\; \forall i \in M_+, \qquad \langle a_i, \hat{x} \rangle < 0 \;\; \forall i \in M_-.$$
This would lead to a "consistent" hypothesis. If there exists a vector $\hat{x} \in \mathbb{R}^n$ that satisfies the above inequalities, the data is said to be linearly separable. However, finding a $k$-sparse separating vector $\hat{x}$ may not be easy. In the case of noisy measurements, constructing a MER algorithm would require us to find an $\hat{x}$ ($k$-sparse or otherwise) such that the number of violations of the above inequalities is minimized. However, this problem is NP-hard, as shown in [17]. Therefore we need to look for alternate approaches that do not strictly conform to the theory.
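Counting constraint violations, the quantity a MER algorithm must minimize, is itself easy; the hardness lies in minimizing it over $\hat{x}$. A small sketch (all names are illustrative) showing that the true $k$-sparse vector, and any positive multiple of it, is consistent with noise-free one-bit data:

```python
import numpy as np

def sign_violations(A, y, x):
    """Number of one-bit measurements whose sign disagrees with candidate x;
    this is m times the empirical risk that a MER algorithm must minimize."""
    return int(np.sum(y * (A @ x) <= 0))

rng = np.random.default_rng(2)
n, k, m = 50, 3, 200
x_true = np.zeros(n)
x_true[:k] = [1.0, -2.0, 0.5]                   # a k-sparse ground truth
A = rng.standard_normal((m, n))
y = np.sign(A @ x_true)                          # noise-free one-bit data
```

Note that `sign_violations(A, y, x_true)` is zero, and remains zero for any positive multiple of `x_true`, reflecting the scale ambiguity discussed in the introduction.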
It is proposed here to use the $\ell_1$-norm support vector machine (SVM) formulation introduced in [33], which is a modification of the widely used $\ell_2$-norm SVM formalism introduced in [34]. This formulation of the $\ell_1$-norm SVM has two important advantages over the standard $\ell_2$-norm SVM formulation in [34]. First, the weight vector generated by the $\ell_1$-norm SVM is sparse, unlike with the $\ell_2$-norm SVM. Second, the particular formulation suggested in [33] works even when the data is not linearly separable. So we begin by describing the $\ell_1$-norm SVM, before proceeding to our proposed algorithm.
Then the modified $\ell_1$-norm SVM formulation of [33] can be stated as follows:
$$\min_{z, \alpha, \beta} \; \lambda \|z\|_1 + (1 - \lambda) \left[ \sum_{i \in M_+} \alpha_i + \sum_{i \in M_-} \beta_i \right] \;\; \text{s.t.} \;\; \langle a_i, z \rangle \geq 1 - \alpha_i, \; \alpha_i \geq 0, \; i \in M_+; \quad \langle a_i, z \rangle \leq -1 + \beta_i, \; \beta_i \geq 0, \; i \in M_-. \qquad (38)$$
There are a few points to note here. First, the constraints have "slack" variables $\alpha_i, \beta_i$, so that the problem formulation makes sense even when the data is not linearly separable. This is in contrast to the standard SVM formulation
$$\min_{z} \; \|z\|_1 \;\; \text{s.t.} \;\; \langle a_i, z \rangle \geq 1, \; i \in M_+; \quad \langle a_i, z \rangle \leq -1, \; i \in M_-, \qquad (39)$$
which is feasible only for linearly separable data. Note that the formulation in (39) is equivalent to that in [18], because [18] uses the normalization $\sum_{i=1}^m |\langle a_i, z \rangle| = m$, while in (39) the normalization is that the minimum "gap" in satisfying the constraints equals $1$. If the measurements are of the type $y_i = \mathrm{sign}(\langle a_i, x \rangle + b_i)$, then the problem formulation is modified to
$$\min_{z, \alpha, \beta} \; \lambda \|z\|_1 + (1 - \lambda) \left[ \sum_{i \in M_+} \alpha_i + \sum_{i \in M_-} \beta_i \right] \;\; \text{s.t.} \;\; \langle a_i, z \rangle + b_i \geq 1 - \alpha_i, \; \alpha_i \geq 0, \; i \in M_+; \quad \langle a_i, z \rangle + b_i \leq -1 + \beta_i, \; \beta_i \geq 0, \; i \in M_-. \qquad (40)$$
This is different from the formulation suggested in [21], as described here in (4), in exactly the same manner that the formulation in [18] differs from (38).
The major advantage of (38) over the formulation in [18], or of (40) over the formulation in [21], is this: in case the labels $y_i$ are corrupted by measurement noise, both the formulations of [18] and [21] would be infeasible, whereas the formulations in (38) and (40) continue to be meaningful. The second point follows upon the first. The objective function is a weighted sum of the slack variables and the $\ell_1$-norm of $z$. Including the $\ell_1$-norm of $z$ in the objective function forces the solution $z$ to be sparse. The constant $\lambda \in (0, 1)$ provides a tradeoff between minimizing $\|z\|_1$ and violating the constraints. By choosing the weight $\lambda$ close to, but not equal to, zero, one can ensure that the optimization attempts to violate the constraints by as little as possible. The above formulation does not lead to an $\hat{x}$ that is $k$-sparse. The final step is therefore to truncate the solution by retaining only the $k$ largest components by magnitude and discarding the rest.
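The resulting optimization is a linear program, so the whole heuristic (solve the $\ell_1$-norm SVM, then keep the $k$ dominant components) can be sketched as below. This is an illustrative implementation, not the authors' code: the weighting $\lambda \|z\|_1 + (1 - \lambda)\sum(\text{slacks})$, all names, and the default $\lambda$ are our assumptions, and the affine case (40) is handled by stacking $b$ as an extra column and recovering a $(k+1)$-sparse vector, in the spirit of Theorem 4:

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_sparse(A, y, k, lam=0.05):
    """One-bit recovery heuristic (a sketch): solve an l1-norm SVM as an LP,
    then hard-threshold to the k largest-magnitude components."""
    m, n = A.shape
    # Split z = p - q with p, q >= 0 so that ||z||_1 is linear in (p, q).
    c = np.concatenate([lam * np.ones(2 * n), (1 - lam) * np.ones(m)])
    Ya = y[:, None] * A
    # y_i <a_i, z> >= 1 - xi_i   <=>   -Ya p + Ya q - xi <= -1
    A_ub = np.hstack([-Ya, Ya, -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * n + m), method="highs")
    z = res.x[:n] - res.x[n:2 * n]
    zk = np.zeros(n)
    keep = np.argsort(np.abs(z))[-k:]
    zk[keep] = z[keep]                 # retain only the k dominant components
    return zk

# Affine measurements sign(<a_i, x> + b_i): augment with b as a column and
# recover a (k+1)-sparse vector in R^{n+1}.
rng = np.random.default_rng(3)
n, k, m = 60, 3, 400
x = np.zeros(n); x[[5, 17, 40]] = [2.0, -1.5, 1.0]
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
y = np.sign(A @ x + b)
z = l1_svm_sparse(np.hstack([A, b[:, None]]), y, k + 1)
x_hat = z[:n] / z[n] if z[n] != 0 else z[:n]   # rescale by the b-coordinate
```

With the offsets $b_i$ absorbed as an extra column, the last coordinate of the recovered vector plays the role of the scale, so dividing by it (when nonzero) restores the magnitude of $x$.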

A Numerical Example
We chose $n = 1000$, $k = 20$, and generated 30 vectors $x \in \Sigma_k$ at random. Then we generated $m$ measurements of the form $\mathrm{sign}(\langle a_i, x \rangle + b_i)$, where $a_i, b_i$ are normally distributed. Then we applied the algorithm proposed in the previous section, that is, carrying out the minimization in (40) and then truncating the resulting $\hat{x}$ to the $k$ dominant components by magnitude. Figure 2 shows a comparison of the error for our method versus that for the method proposed in [21], for one of the randomly generated $k$-sparse vectors. It is evident that our method outperforms the latter. This can perhaps be attributed to replacing the conventional SVM as in [21] with the modification in (40). Figure 3 shows the number of correctly recovered components as a function of $m$, for one of the randomly generated $k$-sparse vectors. Figure 4 shows the box plot of the mean square error as a function of $m$ over all 30 random repetitions of this experiment. The small horizontal lines show the maximum and minimum error, while the boxes display the 25th and 75th percentiles. Note that, in OBCS, it is meaningful to have $m > n$.

Discussion
In this paper, the problem of one-bit compressed sensing (OBCS) has been formulated as a problem in probably approximately correct (PAC) learning theory. In particular, it has been shown that the VC-dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is, to within round-off terms, between $k \lg(n/k)$ and $2k \lg(n/k)$. Therefore, in principle at least, the OBCS problem can be solved using only $O(k \lg(n/k))$ samples. This is possible in principle even when the measurements are corrupted by noise. However, in general, it is NP-hard to find a consistent algorithm when measurements are free from noise, and to find an algorithm that minimizes empirical risk when measurements are noisy.
We proposed a modification of the $\ell_1$-norm support vector machine as a feasible alternative, and illustrated that our approach outperforms earlier algorithms in the literature. One of the main advantages of formulating OBCS as a problem in PAC learning is that extending these results to the case where the samples $\{a_i\}$ (or $\{(a_i, b_i)\}$, as the case may be) are not i.i.d. essentially "comes for free." It is now known that, if a concept class has finite VC-dimension, then empirical means converge to their true values not only for i.i.d. samples, but also when the sample sequence forms an ergodic process; see [35], which builds on an earlier result in [36] for $\beta$-mixing processes. However, in order to be useful in OBCS, it is not enough to know this. One must also have explicit estimates of the rate at which empirical means converge to their true values, or what is called the learning rate here. Such estimates are provided in [37]. As it is fairly straightforward to adapt the various theorems given here to the case of $\beta$-mixing processes using the above-mentioned results, the details are omitted.

Definition 2 A set $S \subseteq X$ of finite cardinality is said to be shattered by the concept class $\mathcal{C}$ if, for every subset $B \subseteq S$, there exists a concept $A \in \mathcal{C}$ such that $S \cap A = B$. The Vapnik-Chervonenkis (VC-) dimension of the concept class $\mathcal{C}$ is the largest integer $d$ such that there exists a set $S$ of cardinality $d$ that is shattered by $\mathcal{C}$.
Suppose $S = \{c_1, \ldots, c_s\} \subseteq X$. If $S_N$ is of the form $S \times \{0\}$, then it can only be shattered by $\mathcal{C}_0$, which implies that $S$ itself is shattered by $\mathcal{C}$; therefore $|S| \leq d$. By entirely analogous reasoning, if $S_N$ is of the form $S \times \{1\}$, then it can only be shattered by $\mathcal{C}_1^c$, which implies that $S$ itself is shattered by $\mathcal{C}^c$. Now we make use of the easily proved fact that $\mathcal{C}$ and $\mathcal{C}^c$ shatter exactly the same sets; therefore once again $|S| \leq d$. In either case $|S_N| \leq d$. This leads to the first conclusion. The proof that VC-dim$(\mathcal{C}_N)$ = VC-dim$(\mathcal{C})$ is similar and is omitted.

Theorem 6 Suppose VC-dim$(\mathcal{C}) \leq d$, and let $T, F \in \mathcal{C}$ be arbitrary. Also let $P \in \mathcal{P}^*$ be an arbitrary probability measure. Then
$$\sup_{T, F \in \mathcal{C}} \Pr\{|J(T, F) - \hat{J}_m(T, F)| > \epsilon\} \leq 4\left( \frac{0.2\,em}{d} \right)^{10d} \exp(-m\epsilon^2/8). \qquad (30)$$

The other set belongs to $\mathcal{C}_N$. Now VC-dim$(\mathcal{C}) \leq d$, and in turn this implies that VC-dim$(\mathcal{C}_0 \cup \mathcal{C}_1^c) \leq d$ by Lemma 2. Therefore VC-dim$[(\mathcal{C}_0 \cup \mathcal{C}_1^c) \Delta \mathcal{C}_N] \leq 10d$. Now observe that $J(T, F)$ is the expected value of $|y - I_F|$, while $\hat{J}_m$ is an empirical mean based on $m$ samples. Therefore (30) follows from [26, Theorem 7.4]. Now we give the proof of Theorem 5.

Figure 2: Mean square error between $\hat{x}$ and $x$ as a function of $m$ for our algorithm and that of [21].

Figure 3: Number of correctly recovered components (out of 20) as a function of $m$.

Figure 4: Box plot of the mean square error as a function of $m$ over the 30 random trials.