Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach

Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.


Introduction
The widespread adoption of algorithmic decision-making in high-stakes domains has drawn increasing attention to the underlying algorithms and their impact on people, particularly on sensitive (or "protected") groups. Typically, sensitive groups are subpopulations determined by salient social and demographic factors, such as race or sex. The unfair treatment of such groups is not only unethical, but also ruled out by anti-discrimination laws, and is thus studied by a growing community of algorithmic fairness researchers. Important works in this area have addressed the unfair treatment of subpopulations that may arise in the judicial system (Berk et al., 2021), healthcare (Gervasi et al., 2022; Obermeyer et al., 2019; Ricci Lara et al., 2022), search engines (Ekstrand et al., 2022; Fabris et al., 2020; Geyik et al., 2019), insurance (Angwin et al., 2017; Donahue and Barocas, 2021; Fabris et al., 2021), and computer vision (Buolamwini and Gebru, 2018; Goyal et al., 2022; Raji and Buolamwini, 2019), just to name a few domains that may be affected. One common trait of these research works is their attention to a careful definition (and subsequent measurement) of what it means for a model to be fair to the subgroups involved (group fairness), which is typically viewed in terms of differences, across the salient subpopulations, in quantities of interest such as accuracy, recall, or acceptance rate. According to popular definitions of fairness, sizeable differences of this kind (e.g., between women and men) correspond to low fairness on the part of the algorithm (Barocas et al., 2019; Dwork et al., 2012; Pedreshi et al., 2008).
Unfortunately, sensitive demographic data, such as the race or sex of subjects, are often not available, since practitioners find several barriers to obtaining these data, both during model development and after deployment. Among these barriers, legislation plays a major role, prohibiting the collection of sensitive attributes in some domains (Bogen et al., 2020). Even in the absence of explicit prohibition, privacy-by-design standards and a data minimization ethos often push companies in the direction of avoiding the collection of sensitive data from their customers. Similarly, the prospect of negative media coverage is a clear concern, so companies often err on the side of caution and inaction (Andrus et al., 2021). The unavailability of these data thus makes the measurement of model fairness nontrivial, even for the company that is developing and/or deploying the model. For these reasons, in a recent survey of industry professionals, most of the respondents stated that the availability of tools that support fairness auditing in the absence of individual-level demographics would be very useful (Holstein et al., 2019). In other words, the problem of measuring group fairness when the values of the sensitive attributes are unknown (fairness under unawareness) is pressing and requires ad hoc solutions.
In the literature on algorithmic fairness, much work has been done to propose techniques directly aimed at improving the fairness of a model (Donini et al., 2018;Hardt et al., 2016;Hashimoto et al., 2018;He et al., 2020;S. Sankar et al., 2021;Zafar et al., 2017). However, relatively little attention has been paid to the problem of reliably measuring fairness. This represents an important, but rather overlooked, preliminary step to enforce fairness and make algorithms more equitable across groups. More recent works have studied non-ideal conditions, such as missing data (Goel et al., 2021), noisy or missing group labels (Awasthi et al., 2020;Chen et al., 2019), and non-iid samples (Rezaei et al., 2021;Singh et al., 2021), and showed that naïve fairness-enhancing algorithms may actually make a model less fair under noisy demographic information (Ghosh et al., 2021a;Mehrotra and Celis, 2021).
In this work, we propose a novel solution to the problem of measuring classifier fairness under unawareness by using techniques from quantification (Esuli et al., 2023; González et al., 2017), a supervised learning task concerned with estimating, rather than the class labels of individual data points, the class prevalence values for samples of such data points, i.e., group-level quantities, such as the percentage of women in a given sample. Quantification methods address two pressing facets of the fairness under unawareness problem: (1) their estimates are robust to distribution shift (i.e., to the fact that the distribution of the labels in the unlabeled data may significantly differ from the analogous distribution in the training data), which is often inevitable since populations evolve, and demographic data are unlikely to be representative of every condition encountered at deployment time; (2) they allow the estimation of group-level quantities but do not allow the inference of sensitive attributes at the individual level, which is beneficial since the latter might lead to the inappropriate and nonconsensual utilization of this sensitive information, reducing individuals' agency over data (Andrus and Villeneuve, 2022). Quantification methods achieve these goals by directly targeting group-level prevalence estimates. They do so through a variety of approaches, including, e.g., dedicated loss functions, task-specific adjustments, and ad hoc model selection procedures.
Overall, we make the following contributions: • Quantifying fairness under unawareness. We show that measuring fairness under unawareness can be cast as a quantification problem and solved with approaches of proven consistency established in the quantification literature (Section 4). We propose and demonstrate several high-accuracy fairness estimators for both vanilla and fairness-aware classifiers.
• Experimental protocols for five major challenges. Drawing from the algorithmic fairness literature, we identify five important challenges that arise in estimating fairness under unawareness. These challenges are encountered in real-world applications, and include the nonstationarity of the processes generating the data and the variable cardinality of the available samples. For each such challenge, we define and formalise a precise experimental protocol, through which we compare the performance of quantifiers (i.e., group-level prevalence estimators) generated by six different quantification methods (Sections 5.3-5.7).
• Decoupling group-level and individual-level inferences. We consider the problem of potential model misuse to maliciously infer demographic characteristics at an individual level, which represents a concern for proxy methods, i.e., methods that measure model fairness based on proxy attributes. Proxy methods are estimators of sensitive attributes which exploit the correlation between available attributes (e.g., ZIP code) and the sensitive attributes (e.g., race) in order to infer the values of the latter. Through a set of experiments, we demonstrate two methods that yield precise estimates of demographic disparity but poor classification performance, thus decoupling the (desirable) objective of group-level prevalence estimation from the (undesirable) objective of individual-level class label prediction (Section 5.9).
It is worth noting from the outset some intrinsic limitations of proxy methods and measures of group fairness. In essence, proxy methods exploit the co-occurrence of membership in a group and display of a given trait, potentially learning, encoding, and reinforcing stereotypical associations (Lipton et al., 2018). More generally, even when labels for sensitive attributes are available, these are not all equivalent. Self-reported labels are preferable to avoid external assignment (i.e., inference of sensitive attributes), which can be harmful in itself (Keyes, 2018). In broader terms, approaches that define sensitive attributes as rigid and fixed categories are limited in that they impose a taxonomy onto people, erasing the needs and experiences of those who do not fit the envisioned prevalent categories (Namaste, 2000). Although we acknowledge these limitations, we hope that our work will help highlight, investigate, and mitigate unfavourable outcomes for disadvantaged groups caused by different automated decision-making systems.

The outline of this work is as follows. Section 2 summarizes the notation and background for this article. Section 3 discusses related work. After giving a primer on quantification, with emphasis on the approaches we consider in this work, Section 4 shows how these approaches can be leveraged to measure fairness under unawareness of sensitive attributes. Section 5 presents our experiments, in which we tackle, one by one, each of the five major challenges mentioned above. We then summarize and discuss these results (Section 6) and present concluding remarks (Section 7), describing key limitations and directions for future work.

Notation
In this paper, we use the following notation, summarized in Table 1. By x we indicate a data point drawn from a domain X, represented via a set X of nonsensitive attributes (i.e., features). We use S to denote a sensitive attribute that takes values in S = {0, 1}, and by s ∈ S a value that S may take.¹ By Y we indicate a class (representing the target of a prediction task) taking values in a binary domain Y = {⊖, ⊕}, and by y ∈ Y a value that Y can take. The symbol σ denotes a sample, i.e., a non-empty set of data points drawn from X. By p_σ(s) we indicate the true prevalence of an attribute value s in the sample σ, while by p̂_σ^q(s) we indicate the estimate of this prevalence obtained by means of a quantifier q, which we define as a function q : 2^X → [0, 1]. Since 0 ≤ p_σ(s) ≤ 1 and 0 ≤ p̂_σ^q(s) ≤ 1 for all s ∈ S, and since Σ_{s∈S} p_σ(s) = Σ_{s∈S} p̂_σ^q(s) = 1, the p_σ(s)'s and the p̂_σ^q(s)'s form two probability distributions over S. We also introduce the random variable Ŷ, which denotes a predicted label. By Pr(V = v) we indicate, as usual, the probability that a random variable V takes value v; since X, S, Y can also be seen as random variables, we shorten this to Pr(v) when V is clear from context. By h : X → Y we indicate a binary classifier that assigns classes in Y to data points in X; by k : X → S we instead indicate a binary classifier that assigns sensitive attribute values in S to data points (e.g., that predicts the sensitive attribute value of a certain data item x). It is worth re-emphasizing that both h and k only use nonsensitive attributes X as input variables. For ease of use, we will interchangeably write h(x) = y or h_y(x) = 1, and k(x) = s or k_s(x) = 1.

Background
Several criteria for group fairness have been proposed in the machine learning literature, typically requiring the equalization, across groups, of some conditional or marginal property of the joint distribution of the sensitive variable S, the ground truth Y, and the classifier prediction Ŷ (Dwork et al., 2012; Hardt et al., 2016; Narayanan, 2018).

Table 1: Main notational conventions used in this work.

  x ∈ X         a data point, i.e., a vector of nonsensitive attribute values
  s ∈ S         a value for the sensitive attribute S, with S = {0, 1}
  y ∈ Y         a class from the target domain Y = {⊖, ⊕}
  X, S, Y, Ŷ    random variables for data points, sensitive attribute values, classes, and class predictions
  h(x)          a classifier h : X → Y issuing predictions in Y for data points in X
  k(x)          a classifier k : X → S issuing predictions in S for data points in X
  σ             a sample, i.e., a non-empty set of data points drawn from X
  p_σ(s)        true prevalence of sensitive attribute value s in sample σ
  p̂_σ(s)        estimate of the prevalence of sensitive attribute value s in sample σ
  p̂_σ^q(s)      estimate p̂_σ(s) obtained via quantifier q
  q(σ)          a quantifier q : 2^X → [0, 1] estimating the prevalence of the positive class of sensitive attribute S in a sample
  D             set of points x_i ∈ X to which h(x) and q(σ) are to be applied
  D₁, D₂, D₃    sets derived from D according to an experimental protocol among those detailed in Sections 5.3-5.7

The main criteria of observational group fairness (Barocas et al., 2019), i.e., the ones computed directly from groupwise confusion matrices, are defined as follows.

Definition 1. Given a classifier h : X → Y issuing predictions ŷ = h(x), and given the respective ground truth labels y, the following groupwise disparities with respect to attribute S can be defined:

  demographic disparity:                δ_h^{S,DD}   = Pr(Ŷ = ⊕ | S = 1) − Pr(Ŷ = ⊕ | S = 0)
  true positive rate disparity:         δ_h^{S,TPRD} = Pr(Ŷ = ⊕ | Y = ⊕, S = 1) − Pr(Ŷ = ⊕ | Y = ⊕, S = 0)
  true negative rate disparity:         δ_h^{S,TNRD} = Pr(Ŷ = ⊖ | Y = ⊖, S = 1) − Pr(Ŷ = ⊖ | Y = ⊖, S = 0)
  positive predictive value disparity:  δ_h^{S,PPVD} = Pr(Y = ⊕ | Ŷ = ⊕, S = 1) − Pr(Y = ⊕ | Ŷ = ⊕, S = 0)
  negative predictive value disparity:  δ_h^{S,NPVD} = Pr(Y = ⊖ | Ŷ = ⊖, S = 1) − Pr(Y = ⊖ | Ŷ = ⊖, S = 0)

Demographic disparity, for example, measures whether the prevalence of the positive class is the same across the subpopulations identified by the sensitive attribute S; a value δ_h^{S,DD} = 0 indicates maximum fairness, while values of δ_h^{S,DD} = −1 or δ_h^{S,DD} = +1 indicate minimum fairness, i.e., maximum advantage for S = 0 over S = 1 or vice versa. We illustrate the problem of measuring fairness under unawareness using an example focused on demographic disparity.
Example 1. Assume that S stands for "race", S = 1 for "African-American" and S = 0 for "White",² and that the classifier, deployed by a bank, is responsible for recommending loan applicants for acceptance, classifying them as "grant" (⊕) or "deny" (⊖). For simplicity, let us assume that the outcome of the classifier will be translated directly into a decision without human supervision. The bank might want to check that the fraction of loan recipients out of the total number of applicants is approximately the same in the African-American and White subpopulations. In other words, the bank might want δ_h^{S,DD} to be close to 0. Of course, if the bank is aware of the race of each applicant, this constraint is very easy to check and, potentially, enforce. If the bank is unaware of the applicants' race, the problem is not trivial, and can be addressed by the method we propose in this paper.
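When the sensitive attribute is observed, checking this constraint is indeed straightforward. The following minimal sketch (toy data; the function name is ours, and we adopt the convention that negative disparity values indicate an advantage for S = 0) computes the acceptance-rate gap directly:

```python
import numpy as np

def demographic_disparity(y_pred, s):
    """Acceptance-rate gap mu(1) - mu(0) between groups S=1 and S=0.

    y_pred: classifier outputs, 1 for "grant" (⊕) and 0 for "deny" (⊖);
    s: observed sensitive-attribute values in {0, 1}.
    """
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    mu1 = y_pred[s == 1].mean()  # acceptance rate for S=1
    mu0 = y_pred[s == 0].mean()  # acceptance rate for S=0
    return mu1 - mu0

# Toy audit: the bank grants 2/3 of White (S=0) applications
# but only 1/3 of African-American (S=1) applications.
y_pred = [1, 1, 0, 1, 0, 0]
s      = [0, 0, 0, 1, 1, 1]
print(demographic_disparity(y_pred, s))  # -1/3: S=0 is advantaged
```

Under unawareness, the array s above is precisely what is missing, which is what the rest of this paper addresses.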

Fairness Under Unawareness
Unavailability of sensitive attribute values poses a major challenge for internal and external fairness audits. When these values are unknown, it is sometimes possible to seek expert advice to obtain them (Buolamwini and Gebru, 2018). Alternatively, disclosure procedures have been proposed for subjects to provide their sensitive attributes to a trusted third party (Veale and Binns, 2017) or to share them encrypted (Kilbertus et al., 2018). Another line of research studies the problem of reliably estimating measures of group fairness, in classification (Awasthi et al., 2021; Chen et al., 2019; Kallus et al., 2020) and ranking (Ghazimatin et al., 2022; Kırnap et al., 2021), without access to sensitive attributes, via proxy variables.

Chen et al. (2019) is the work most closely related to ours. The authors study the problem of estimating the demographic disparity of a classifier, exploiting the values of nonsensitive attributes X as proxies to infer the value of the sensitive variable S. Starting from a naïve approach, dubbed threshold estimator (TE), which estimates µ(s) = Pr(Ŷ = ⊕ | S = s) as

  µ̂_TE(s) = Σ_{x_i} 1[h(x_i) = ⊕] · k_s(x_i) / Σ_{x_i} k_s(x_i)    (1)

i.e., by using a hard classifier k_s : X → {0, 1} (which outputs Boolean decisions regarding membership in a sensitive group S = s), they propose a weighted estimator (WE) with better convergence properties:

  µ̂_WE(s) = Σ_{x_i} 1[h(x_i) = ⊕] · π_s(x_i) / Σ_{x_i} π_s(x_i)    (2)

² While acknowledging its limitations (Strmic-Pawl et al., 2018), we follow the race categorization adopted by the US Census Bureau wherever possible.
WE exploits a soft classifier π_s : X → [0, 1] that outputs posterior probabilities Pr(s|x_i), i.e., the probability that the classifier attributes to the fact that x_i belongs to the subpopulation with sensitive attribute S = s. The authors argue that the naïve estimator of Equation (1) has a tendency to exaggerate disparities, and show that WE mitigates this problem under the hypothesis that π_s(x_i) outputs well-calibrated posterior probabilities. A contribution of our paper is to show that TE and WE are just instances of a broad family of estimators (Proposition 2). Moreover, we consider alternative methods from the same family, and show them to outperform both TE and WE in an extensive suite of experiments (Section 5).

Kallus et al. (2020) study the problem of measuring a classifier's demographic disparity, true positive rate disparity, and true negative rate disparity in a setting with access to a primary dataset involving (Ŷ, Z) and an auxiliary dataset involving (S, Z), where Z is a generic set of proxy variables, potentially disjoint from X. They show that reliably estimating the demographic disparity of a classifier issuing predictions Ŷ is infeasible when Z is not highly informative with respect to Ŷ or S. Moreover, they provide upper and lower bounds for the true value of the estimand in a setting where the primary and auxiliary datasets are drawn from marginalisations of a common joint distribution. Our work departs from this setting in two important ways, in order to focus on realistic conditions for internal fairness audits. Firstly, we take into account the nonstationarity of the processes generating the data, and do not assume the primary and auxiliary datasets to be marginalisations of the same joint distribution. Rather, we identify different sources of distribution shift, and formalize them into protocols that test the performance of different estimators in more realistic settings (Sections 5.3-5.7).
Secondly, we hypothesize that, from within the company deploying a classifier h(x), the available proxy variables Z comprise X, and are thus highly informative with respect to Ŷ.

Awasthi et al. (2021) characterize the structure of the best estimator for sensitive attributes when the final estimand is a classifier's disparity in true positive rates across protected groups. They show that the test accuracy of the attribute classifier and its performance as an estimator of the true positive rate disparity are not necessarily correlated. We contribute to this line of research by demonstrating the possibility of decoupling the classification performance of a model when deployed for sensitive attribute inference at the individual level, which constitutes a privacy infringement, from its quantification performance in applications where it is used for group-level estimates (Section 5.9). This line of work opens the possibility of developing estimators that reliably measure group fairness under unawareness of sensitive attributes, while guaranteeing privacy at the individual level.
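As a concrete reference point for the proxy estimators discussed above, TE and WE (Equations 1 and 2) admit one-line implementations. The sketch below is ours (function names are illustrative):

```python
import numpy as np

def threshold_estimator(y_pred, k_s):
    """TE: acceptance rate among points that the hard proxy
    classifier assigns to group S = s (Equation 1)."""
    y_pred, k_s = np.asarray(y_pred), np.asarray(k_s)
    return (y_pred * k_s).sum() / k_s.sum()

def weighted_estimator(y_pred, pi_s):
    """WE: the same quantity, but each point is weighted by the
    posterior pi_s(x_i) = Pr(S = s | x_i) of a soft proxy
    classifier (Equation 2)."""
    y_pred, pi_s = np.asarray(y_pred), np.asarray(pi_s)
    return (y_pred * pi_s).sum() / pi_s.sum()
```

With 0/1 posteriors, WE reduces exactly to TE; it is the soft, well-calibrated weights that mitigate TE's tendency to exaggerate disparities.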

Quantification and Fairness
The application of quantification methods in algorithmic fairness research is not entirely new. Biswas and Mukherjee (2021) study the problem of enforcing fair classification under distribution shift, which potentially affects different demographic groups at different rates. They define a notion of fairness, called proportional equality (PE), based on the proportionality between the prevalence of positives in a protected group S = s, i.e., Pr(Y = ⊕ | S = s), and the group-specific acceptance rate Pr(Ŷ = ⊕ | S = s) of a classifier issuing predictions Ŷ; PE is computed on a test set D, and low values of PE correspond to fairer predictions Ŷ. In the presence of distribution shift between training and testing conditions, the true group-specific prevalences Pr(Y = ⊕ | S = 1) and Pr(Y = ⊕ | S = 0) are unknown. The authors use an approach from the quantification literature to estimate these prevalence values, integrating it in a wider system aimed at optimizing PE. In other words, prior work applying quantification to problems of algorithmic fairness concentrates on enforcing classifier fairness under unawareness of target labels. Our work, on the other hand, aims at measuring classifier fairness under unawareness of sensitive attributes.

Measuring Fairness Under Unawareness: A Quantification-based Method
In this section, we first present a primer on quantification (Section 4.1), and then show how to measure fairness under unawareness with quantification (Section 4.2), discussing the properties of the resulting estimators.

Learning to Quantify
Quantification (also known as supervised prevalence estimation, or learning to quantify) is the task of training, by means of supervised learning, a predictor that estimates the relative frequency (also known as prevalence, or prior probability) of the classes of interest in a sample of unlabelled data points, where the data used to train the predictor are a set of labelled data points; see González et al. (2017) for a survey of quantification research.
Definition 2. Given a sample σ of data points x ∈ X with unknown target labels in domain S, a quantifier q is an estimator q : 2^X → [0, 1] that predicts the prevalence of class s in the sample σ as p̂_σ^q(s) = q(σ).
Remark 1. The above definition is deliberately broad to include the trivial classify and count baseline introduced below. In practice, a method is truly quantification-based when explicitly targeting prevalence estimates, rather than simply treating them as a by-product of classification. This includes methods that make use of dedicated loss functions, task-specific adjustments, and ad hoc model selection procedures. Typically, the prevalence estimates issued by these methods display desirable properties of unbiasedness and convergence.
Quantification can be trivially solved via classification, i.e., by classifying all the unlabelled data points by means of a standard classifier, counting, for each class, the data points that have been assigned to the class, and normalizing. However, it has unequivocally been shown (see, among many others, Fernandes Vaz et al. (2019); Forman (2008); González et al. (2017); González-Castro et al. (2013); Moreo and Sebastiani (2022)) that solving quantification by means of this classify and count (CC) method is suboptimal, and that more accurate quantification methods exist. The key reason is that many application scenarios suffer from distribution shift, whereby the class prevalence values in the training set may substantially differ from the class prevalence values in the unlabelled data that the classifier issues predictions for (Moreno-Torres et al., 2012). The presence of distribution shift means that the well-known IID assumption, on which most learning algorithms for training classifiers are based, does not hold; in turn, this means that CC will perform suboptimally in scenarios that exhibit distribution shift, and that the higher the amount of shift, the worse we can expect CC to perform.

A wide variety of quantification methods have been defined in the literature. In the experiments presented in this paper, we compare six such methods, which we briefly present in this section. One of them is the trivial CC baseline; we have chosen the other five methods over other contenders because they are simple and proven, and because some of them (especially the ACC, PACC, SLD, and HDy methods; see below) have shown top-notch performance in recent comparative tests run in other domains (Moreo and Sebastiani, 2021, 2022). We briefly describe them here, with direct reference to the application we are interested in, i.e., estimating the prevalence of a protected subgroup.
As mentioned above, an obvious way to solve quantification (used, among others, in Equation 1) is by aggregating the predictions of a "hard" classifier, i.e., a classifier k_s : X → {0, 1} that outputs Boolean decisions regarding membership in a sensitive group (defined by constraint S = s). The (trivial) classify and count (CC) quantifier then comes down to computing

  p̂_σ^CC(s) = (1 / |σ|) Σ_{x_i ∈ σ} k_s(x_i)    (3)

Alternatively, quantification methods can use a "soft" classifier π_s : X → [0, 1] that produces posterior probabilities Pr(s|x_i). The resulting probabilistic classify and count (PCC) quantifier (Bella et al., 2010) is defined by the equation

  p̂_σ^PCC(s) = (1 / |σ|) Σ_{x_i ∈ σ} π_s(x_i)    (4)

It should be noted that PCC and CC are clearly related to WE and TE, summarized by Equations (1) and (2), as shown later in Proposition 2.

A different and popular quantification method consists of applying an adjustment to the prevalence p̂_σ^CC(s) estimated through classify and count. It is easy to check that, in the binary case, the true prevalence p_σ(s) and the estimated prevalence p̂_σ^CC(s) are such that

  p̂_σ^CC(s) = tpr_{k_s} · p_σ(s) + fpr_{k_s} · (1 − p_σ(s))    (5)

where tpr_{k_s} and fpr_{k_s} stand for the true positive rate and false positive rate of the classifier k_s used to obtain p̂_σ^CC(s). The values of tpr_{k_s} and fpr_{k_s} are unknown, but can be estimated via k-fold cross-validation on the training data. This boils down to using the results k_s(x_i) obtained in the k-fold cross-validation (i.e., x_i ranges on the training items) in the equations

  t̂pr_{k_s} = Σ_{{x_i | s_i = s}} k_s(x_i) / |{x_i | s_i = s}|
  f̂pr_{k_s} = Σ_{{x_i | s_i ≠ s}} k_s(x_i) / |{x_i | s_i ≠ s}|    (6)

Solving Equation 5 for p_σ(s), and replacing tpr_{k_s} and fpr_{k_s} with the estimates of Equation 6, yields the adjusted classify and count method (ACC) (Forman, 2008), i.e.,

  p̂_σ^ACC(s) = (p̂_σ^CC(s) − f̂pr_{k_s}) / (t̂pr_{k_s} − f̂pr_{k_s})    (7)

If the soft classifier π_s(x_i) is used in place of k_s(x_i), analogues of t̂pr_{k_s} and f̂pr_{k_s} from Equation 6 can be defined as

  t̂pr_{π_s} = Σ_{{x_i | s_i = s}} π_s(x_i) / |{x_i | s_i = s}|
  f̂pr_{π_s} = Σ_{{x_i | s_i ≠ s}} π_s(x_i) / |{x_i | s_i ≠ s}|    (8)

Replacing all factors on the right-hand side of Equation 7 with their "soft" counterparts from Equations 4 and 8 yields the probabilistic adjusted classify and count method (PACC) (Bella et al., 2010), i.e.,

  p̂_σ^PACC(s) = (p̂_σ^PCC(s) − f̂pr_{π_s}) / (t̂pr_{π_s} − f̂pr_{π_s})    (9)

A further method is the one proposed by Saerens et al. (2002) (which we here call SLD, from the names of its proposers), which consists of training a probabilistic classifier and then using the Expectation-Maximization (EM) algorithm (i) to update (in an iterative, mutually recursive way) the posterior probabilities that the classifier returns, and (ii) to re-estimate the class prevalence values of the test set, until convergence. This makes the method robust to distribution shift, since the iterative process allows the estimates of the prevalence values to become increasingly attuned to the changed conditions found in the unlabelled set. Pseudocode describing the SLD algorithm can be found in Appendix A.

We also consider HDy (González-Castro et al., 2013), a probabilistic binary quantification method that views quantification as the problem of minimizing the divergence (measured in terms of the Hellinger Distance) between two cumulative distributions of posterior probabilities returned by the classifier, one coming from the unlabelled examples and the other coming from a validation set.
HDy looks for the mixture parameter α that best fits the validation distribution (consisting of a mixture of a "positive" and a "negative" distribution) to the unlabelled distribution, and returns α as the estimated prevalence of the positive class. Here, robustness to distribution shift is achieved by the analysis of the distribution of the posterior probabilities in the unlabelled set, which reveals how conditions have changed with respect to the training data. A more detailed description of HDy can be found in Appendix B.
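A compact sketch of the core HDy search follows. This is our simplification, using a single plain histogram of posteriors and a grid search over α, rather than the full range of binnings explored by the original method:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdy(val_pos, val_neg, test_post, bins=10):
    """Return the mixture weight alpha for which
    alpha * hist(pos) + (1 - alpha) * hist(neg) best matches the
    histogram of posteriors on the unlabelled sample; alpha is then
    the estimated prevalence of the positive class (here, S = 1).

    val_pos / val_neg: posteriors pi_s(x) on validation points with
    s = 1 and s = 0 respectively; test_post: posteriors on the
    unlabelled sample.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    hist = lambda z: np.histogram(z, bins=edges)[0] / len(z)
    h_pos, h_neg, h_test = hist(val_pos), hist(val_neg), hist(test_post)
    alphas = np.linspace(0.0, 1.0, 101)  # grid search over alpha
    dists = [hellinger(a * h_pos + (1 - a) * h_neg, h_test)
             for a in alphas]
    return float(alphas[int(np.argmin(dists))])

# Well-separated posteriors; the test sample mixes them 30/70.
val_pos, val_neg = [0.85] * 50, [0.15] * 50
test = [0.85] * 30 + [0.15] * 70
print(hdy(val_pos, val_neg, test))  # ≈ 0.3
```

The distribution-matching step is where the robustness to shift comes from: no per-point decision is thresholded, only the shape of the posterior distribution is used.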
Lastly, we consider the Maximum Likelihood Prevalence Estimator (MLPE), a dummy method that assumes there is no shift, and always returns, as the estimate for any test sample, the class prevalence observed in the training data. MLPE is not a serious contender, since it makes no real attempt to address the problem; notwithstanding this, it will generate very low error values in all protocols in which the test prevalence is kept fixed.
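To make the aggregative methods above concrete, the sketch below implements CC, PCC, ACC, and PACC in a few lines each. This is our illustration, not the paper's reference implementation; clipping the adjusted estimates to [0, 1] is a common practical safeguard, since the adjustment can otherwise leave the unit interval:

```python
import numpy as np

def cc(k_s_outputs):
    """Classify and count: fraction of points assigned to group s."""
    return float(np.mean(k_s_outputs))

def pcc(pi_s_outputs):
    """Probabilistic classify and count: mean posterior Pr(s|x)."""
    return float(np.mean(pi_s_outputs))

def acc(k_s_outputs, tpr, fpr):
    """Adjusted classify and count: invert the binary mixture
    p_cc = tpr * p + fpr * (1 - p) to recover the true prevalence p;
    tpr/fpr are cross-validated estimates from the training data."""
    return float(np.clip((cc(k_s_outputs) - fpr) / (tpr - fpr), 0.0, 1.0))

def pacc(pi_s_outputs, soft_tpr, soft_fpr):
    """Probabilistic ACC: the same adjustment applied to the PCC
    estimate, with tpr/fpr computed from posteriors instead of
    hard decisions."""
    return float(np.clip((pcc(pi_s_outputs) - soft_fpr)
                         / (soft_tpr - soft_fpr), 0.0, 1.0))

# Example: a classifier with tpr=0.8, fpr=0.1 reports 45% positives;
# the adjustment recovers the true prevalence 0.5.
hard = np.array([1] * 45 + [0] * 55)
print(round(acc(hard, tpr=0.8, fpr=0.1), 6))  # 0.5
```

Note how CC and PCC use only the classifier's outputs on the sample, while ACC and PACC additionally correct for the classifier's estimated error rates.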

Using Quantification to Measure Fairness Under Unawareness
We assume the existence, in the operational setup, of three separate sets of data points:

• A training set D₁ = {(x_i, y_i) | x_i ∈ X, y_i ∈ Y}, used to train the classifier h whose fairness we want to measure. Given the difficulties inherent in demographic data procurement mentioned in the introduction, we assume that the sensitive attribute S is not part of the vectorial representation X.
• A small auxiliary set D₂ = {(x_i, s_i) | x_i ∈ X, s_i ∈ S}, containing demographic data, employed to train quantifiers for the sensitive attribute.
• A set D₃ = {x_i | x_i ∈ X} of unlabelled data points, which are the data to which classifier h is to be applied, representing the deployment conditions. Alternatively, D₃ could also be a labelled held-out set available at a company that has acted proactively rather than reactively, for pre-deployment audits (Raji et al., 2020). In our experiments we will use labelled data and call D₃ the test set, on which the fairness of the classifier h is to be measured.
It is worth re-emphasizing that, from the perspective of the estimation task at hand, i.e., estimating the fairness of the classifier h, D₂ plays the role of the quantifier's training set, while D₃ is its test set.
Proposition 1. Observational measures of algorithmic fairness, such as the ones introduced in Definition 1, can be computed, under unawareness of sensitive attributes, by estimating the prevalence of the sensitive attribute in specific subsets of the test set.
Proof. We prove this statement for TPRD in Definition 1, which we recall below:

  δ_h^{S,TPRD} = Pr(Ŷ = ⊕ | Y = ⊕, S = 1) − Pr(Ŷ = ⊕ | Y = ⊕, S = 0)

Both terms in the above equation can be written as

  Pr(Ŷ = ⊕ | Y = ⊕, S = s) = |{x_i ∈ D₃ : ŷ_i = ⊕, y_i = ⊕, s_i = s}| / |{x_i ∈ D₃ : y_i = ⊕, s_i = s}|
                            = (p_{D₃^TP}(s) · |D₃^TP|) / (p_{D₃^{Y=⊕}}(s) · |D₃^{Y=⊕}|)

where D₃^TP = {x_i ∈ D₃ : ŷ_i = ⊕, y_i = ⊕} is the set of true positives and D₃^{Y=⊕} = {x_i ∈ D₃ : y_i = ⊕} is the set of ground-truth positives. In other words, TPRD can be calculated by estimating the prevalence of the sensitive attribute among the positives and the true positives in D₃. Analogous results can be proven for other measures of observational fairness, under the assumption that Y and Ŷ are known.
Remark 2. This proposition is important for two reasons. First, it shows that inference of sensitive attributes at the individual level is not necessary to measure fairness under unawareness; rather, prevalence estimates in given subsets are sufficient. Second, it suggests that methods directly targeting prevalence estimates (i.e., quantifiers) are especially suited in this setting.
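The identity behind the proposition is easy to check numerically. The sketch below builds a synthetic audit sample (all data and names are illustrative; labels are encoded as 1 = ⊕) and verifies that the groupwise TPRs computed directly coincide with those recovered from the prevalence of each group among the ground-truth positives and the true positives:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
s = rng.integers(0, 2, n)                        # sensitive attribute
y = rng.integers(0, 2, n)                        # ground truth (1 = ⊕)
y_hat = np.where(rng.random(n) < 0.8, y, 1 - y)  # noisy predictions

def tpr(group):
    """Groupwise TPR computed directly from individual-level labels."""
    mask = (y == 1) & (s == group)
    return y_hat[mask].mean()

P = (y == 1)                  # ground-truth positives
TP = (y == 1) & (y_hat == 1)  # true positives

def tpr_via_prevalence(group):
    """Same quantity via prevalences of the group among P and TP."""
    p_tp = (s[TP] == group).mean()  # prevalence of the group among TP
    p_p = (s[P] == group).mean()    # prevalence of the group among P
    return (p_tp * TP.sum()) / (p_p * P.sum())

tprd_direct = tpr(1) - tpr(0)
tprd_prev = tpr_via_prevalence(1) - tpr_via_prevalence(0)
print(abs(tprd_direct - tprd_prev) < 1e-9)  # True
```

Of course, in the setting of this paper the prevalences inside `tpr_via_prevalence` are not computed from observed group labels but estimated by a quantifier; the point of the check is only that the reduction itself is exact.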
Notice that, for the purposes of a fairness audit, it is common to assume that the ground truth variable Y is available in D₃. In the banking scenario of Example 1, this is only partially realistic: the outcomes for accepted applicants are eventually observed, but the outcomes for rejected applicants remain unknown, leaving us with a problem of sample selection bias (Banasik et al., 2003). This is an instance of a general estimation problem, common to all fairness criteria that require knowledge of the ground truth variable Y, such as TPRD, TNRD, PPVD, and NPVD in Definition 1. It represents an open research problem (Sabato and Yom-Tov, 2020; Wang et al., 2021b) which is beyond the scope of this work and demands additional caution in the estimation and interpretation of these fairness measures.
In the remainder of this article, we focus on a detailed study of demographic disparity (DD). This allows us to thoroughly characterize and discuss DD estimators while avoiding the pitfalls and complexity of uncertain ground truth information. We leave additional measures of observational fairness for future work.
Following (Chen et al., 2019), we write DD as

δ S h = µ(1) − µ(0)    (10)

where

µ(s) = Pr D 3 (h(x) = ⊕ | S = s)    (11)

is the acceptance rate of individuals in the group S = s. To estimate the demographic disparity of a classifier h(x) in the test set D 3 , we can use any quantification approach from Section 4.1. Applying Bayes' theorem to Equation (11), we obtain

µ(s) = p D ⊕ 3 (s) · p D 3 (⊕) / p D 3 (s)    (12)

where we use p D 3 (⊕) as a shorthand for p D 3 (h(x) = ⊕), and where we have defined p D ⊕ 3 (s) and p D ⊖ 3 (s) as the prevalence of s among the items of D 3 labelled with class ⊕ and with class ⊖, respectively. Since p D 3 (⊕) is known (it is the fraction of items in D 3 that have been assigned class ⊕ by the classifier h), in order to compute µ(s) through Equation (12), for s ∈ {0, 1}, we only need to estimate the prevalence values p̂ D ⊕ 3 (s) and p̂ D ⊖ 3 (s); the latter is needed to estimate the denominator of Equation (12), i.e., the prevalence p D 3 (s) of the sensitive attribute value s in the entire test set D 3 , since

p D 3 (s) = p D ⊕ 3 (s) · p D 3 (⊕) + p D ⊖ 3 (s) · (1 − p D 3 (⊕))    (13)

In order to compute p D ⊕ 3 (s) and p D ⊖ 3 (s) we can use a quantification-based approach, which can be easily integrated into existing machine learning workflows, as summarized by the method below.
Method. Quantification-Based Estimate of Demographic Disparity.
1. The classifier h : X → Y is trained on D 1 and ready for deployment, e.g., to estimate the creditworthiness of individuals. The assumption that, at this training stage, we are unaware of the sensitive attribute S is due to the inherent difficulties in demographic data procurement already mentioned in Section 1.
2. We use the classifier h to classify the auxiliary set D 2 , thus inducing a partition of D 2 into D ⊕ 2 and D ⊖ 2 , the subsets of D 2 that h labels with class ⊕ and with class ⊖, respectively.
3. We use D ⊕ 2 as the training set for the quantifier q ⊕ (s), whose task will be to estimate the prevalence of value s (e.g., African-American applicants) on sets of data points labelled with class ⊕ (e.g., creditworthy applicants). Likewise, we use D ⊖ 2 as the training set for a quantifier q ⊖ (s) whose task will be to estimate the prevalence of s on sets of data points labelled with ⊖. Intuitively, separate quantifiers specialized on different subpopulations (of positively and negatively classified individuals) should perform better than a single quantifier. The ablation study in Section 5.10 supports this hypothesis.
4. The classifier h is deployed, classifying the test set D 3 , thus inducing a partition of D 3 into D ⊕ 3 and D ⊖ 3 .
5. We apply the quantifier q ⊕ to D ⊕ 3 to obtain an estimate p̂ D ⊕ 3 (s) of the prevalence of s among the positively classified items, and the quantifier q ⊖ to D ⊖ 3 to obtain an estimate p̂ D ⊖ 3 (s). Recall from Section 2.1 that p̂ q σ (s) denotes the prevalence of an attribute value s in a set σ as estimated via quantification method q.
6. To avoid numerical instability in the denominator of Equation (15) below, we apply Laplace smoothing to the estimated prevalence values p̂ D ⊕ 3 (s) and p̂ D ⊖ 3 (s). We use the variant that uses known incidence rates, using D ⊖ 2 and D ⊕ 2 as the control populations, and assume a pseudocount α = 1/2. We thus compute the smoothed estimator

p̃ D ⊕ 3 (s) = (|D ⊕ 3 | · p̂ D ⊕ 3 (s) + α · p D ⊕ 2 (s)) / (|D ⊕ 3 | + α)

and, analogously, p̃ D ⊖ 3 (s).
7. Finally, we estimate the demographic disparity of h, defined in Equation (10), as

δ̂ S h = µ̂(1) − µ̂(0)    (14)

where, as from Equations (12) and (13),

µ̂(s) = p̃ D ⊕ 3 (s) · p D 3 (⊕) / (p̃ D ⊕ 3 (s) · p D 3 (⊕) + p̃ D ⊖ 3 (s) · (1 − p D 3 (⊕)))    (15)

Remark 3. Therefore, prevalence estimates p̃ D ⊕ 3 (s) and p̃ D ⊖ 3 (s), obtained with a quantification method of the type introduced in Section 4.1, can be translated into estimates of a classifier's demographic disparity using Equations (14) and (15). Importantly, the bias and variance of said estimates depend on the properties of the underlying quantification method, which have been characterized in the quantification literature. For example, SLD, ACC, and PACC have been shown to be Fisher-consistent, that is, unbiased, under prior probability shift (Fernandes Vaz et al., 2019; Tasche, 2017). In other words, we expect Equation (14), instantiated with SLD, ACC, and PACC, to provide unbiased estimates when D 2 and D 3 are linked by prior probability shift. We verify this property in Sections 5.3 and 5.4.
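The seven steps above can be sketched end to end as follows. This is a minimal illustration that instantiates the two quantifiers with PCC (mean posterior of a sensitive-attribute classifier) for brevity, whereas the paper favours adjusted methods such as ACC, PACC, or SLD; the function names, and the exact form we use for the smoothing in step 6, are our own assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def smooth(prev_hat, n, prev_control, alpha=0.5):
    """Additive (Laplace) smoothing with pseudocount alpha anchored at a
    known incidence rate measured on the control population (step 6).
    The exact form of the smoothed estimator is our assumption."""
    return (n * prev_hat + alpha * prev_control) / (n + alpha)

def estimate_dd(h, X2, s2, X3):
    """Steps 2-7 of the method: partition D2 and D3 by h's predictions,
    train one PCC quantifier per partition, smooth, and combine."""
    yhat2, yhat3 = h.predict(X2), h.predict(X3)
    prev = {}
    for c in (0, 1):                                    # c=1: class +, c=0: class -
        q = LogisticRegression(max_iter=1000).fit(X2[yhat2 == c], s2[yhat2 == c])
        part = X3[yhat3 == c]
        prev_hat = q.predict_proba(part)[:, 1].mean()   # PCC estimate (step 5)
        prev[c] = smooth(prev_hat, len(part), s2[yhat2 == c].mean())  # step 6
    p_plus = (yhat3 == 1).mean()                        # p_D3(+), known exactly
    def mu(p_pos, p_neg):                               # Eqs. (12)-(13), i.e., Eq. (15)
        return p_pos * p_plus / (p_pos * p_plus + p_neg * (1 - p_plus))
    return mu(prev[1], prev[0]) - mu(1 - prev[1], 1 - prev[0])  # Eq. (14)
```

On synthetic data with a group-dependent acceptance rate, the returned value tracks the classifier's true demographic disparity.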
It is worth noting that the weighted estimator (WE) introduced in (Chen et al., 2019), summarized by Equation (2), can be viewed as a special case of this approach, as shown by the proposition below.
Proposition 2. The weighted estimator of Equation (2) is a special case of quantification-based estimation of demographic disparity, instantiated with the PCC quantification method. Moreover, the threshold estimator of Equation (1) corresponds to CC.
Proof. See Appendix C.
Remark 4. The above proposition shows that PCC and WE are equivalent, and that the trivial CC quantifier is equivalent to TE. We treat these methods as prior art and refer to them as CC and PCC for consistency of exposition.
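For concreteness, the two baseline quantifiers just mentioned can be sketched as follows (function names are ours; k is any trained sensitive-attribute classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cc(k, X):
    """Classify and Count: the prevalence estimate is the fraction of hard
    positive predictions; per Proposition 2, the quantification counterpart
    of the threshold estimator (TE)."""
    return float((k.predict(X) == 1).mean())

def pcc(k, X):
    """Probabilistic Classify and Count: the prevalence estimate is the mean
    posterior probability of s=1; the counterpart of the weighted estimator
    (WE) of Chen et al. (2019)."""
    return float(k.predict_proba(X)[:, 1].mean())
```

Both are unadjusted, which is why they inherit the sensitivity to prior probability shift documented in the experiments below.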
This quantification-based method of addressing demographic disparity is suitable for internal fairness audits, since it allows unawareness of the sensitive attribute S (i) in the set D 1 used for training the classifier h to be audited, and (ii) in the set D 3 on which this classifier is going to be deployed; it only requires the availability of an auxiliary data set D 2 where the attribute S is labelled. Dataset D 2 may originate from a targeted effort, such as interviews (Baker et al., 2005), surveys sent to customers asking for voluntary disclosure of sensitive attributes (Andrus et al., 2021), or other optional means of sharing demographic information (Beutel et al., 2019a,b). Alternatively, it could derive from data acquisitions carried out for other purposes (Galdon Clavell et al., 2020). Finally, note that, in this paper, we assume the existence of a single binary sensitive attribute S only for ease of exposition; our approach can be straightforwardly used in more complex scenarios.
Remark 5. Our method can deal with multiple, non-binary sensitive attributes.
If multiple sensitive attributes are present at the same time, one can simply measure fairness with respect to each sensitive attribute separately, if interested in independent assessments, or jointly, if emphasizing intersectionality (Ghosh et al., 2021b). Our approach can also be extended to deal with categorical, non-binary attributes. In this case, one needs (1) to extend the notion of demographic disparity to the case of non-binary attributes. This can be done, e.g., by considering, instead of the simple difference between two acceptance rates µ(s) as in Equation (10), the variance of the acceptance rates across the possible values of S, or the difference between the highest and lowest acceptance rate max s∈S µ(s) − min s∈S µ(s); and (2) to use a single-label multiclass (rather than a binary) quantification system. Concerning this, note that all the methods discussed in Section 4.1 except HDy admit straightforward extensions from the binary case to the single-label multiclass case (see (Moreo and Sebastiani, 2022) for details). HDy is a method for binary quantification only, but it can be adapted to the single-label multiclass scenario by training a binary quantifier for each class in one-vs-all fashion, estimating the prevalence of each class independently of the others, and normalising the obtained prevalence values so that they sum to 1.
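The two extensions of DD just described, and the normalisation step of the one-vs-all adaptation of HDy, can be sketched as follows (function names are ours):

```python
import numpy as np

def dd_range(mu):
    """Highest minus lowest acceptance rate across the values of S,
    one of the two extensions mentioned above."""
    mu = np.asarray(mu, dtype=float)
    return float(mu.max() - mu.min())

def dd_variance(mu):
    """Alternative extension: variance of the acceptance rates across groups."""
    return float(np.var(np.asarray(mu, dtype=float)))

def normalise(prevs):
    """One-vs-all adaptation of a binary-only quantifier such as HDy:
    estimate each class prevalence independently, then rescale to sum to 1."""
    prevs = np.asarray(prevs, dtype=float)
    return prevs / prevs.sum()
```

Both extensions reduce to |µ(1) − µ(0)|-style gaps in the binary case, so the binary DD is recovered as a special case.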

General Setup
In this section, we carry out an evaluation of different estimators of demographic disparity. We propose five experimental protocols (Sections 5.3-5.7), summarized in Table 2. Each protocol addresses a major challenge that may arise in estimating fairness under unawareness, and does so by varying the size and the mutual distribution shift of the training, auxiliary, and test sets. Protocol names are in the form action-characteristic-dataset, as they act on datasets (D 1 , D 2 or D 3 ), modifying their characteristics (size or class prevalence) through one of two actions (sampling or flipping of labels). We investigate the performance of six estimators of demographic disparity in each of the five challenges/protocols, keeping the remaining factors constant. For every protocol, we perform an extensive empirical evaluation as follows:

• We compare the performance of each estimation technique on three datasets (Adult, COMPAS, and CreditCard). The datasets and the respective preprocessing are described in detail in Section 5.2. We focus our discussion, and present the corresponding plots, on the experiments carried out on the Adult dataset, while we summarise the results on COMPAS and CreditCard numerically (Tables 4-8), discussing them only when significant differences from Adult arise.
• We divide a given data set into three subsets D A , D B , D C of identical sizes and identical joint distribution over (S, Y ). We perform five random such splits; in order to test each estimator under the same conditions, these splits are the same for every method.
For each split, we permute the role of the stratified subsets D A , D B , D C , so that each subset alternatively serves as the training set (D 1 ), or auxiliary set (D 2 ), or test set (D 3 ). We test all (six) such permutations.
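The stratified three-way split and the role permutations described above can be sketched with scikit-learn (helper names are ours):

```python
import numpy as np
from itertools import permutations
from sklearn.model_selection import StratifiedKFold

def three_way_split(X, s, y, seed=0):
    """Split into three equal-sized subsets with (near-)identical joint
    distribution over (S, Y), by stratifying on the four (S, Y) pairs."""
    joint = 2 * np.asarray(s) + np.asarray(y)   # encode each (S, Y) combination
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return [fold for _, fold in skf.split(X, joint)]

def role_permutations(parts):
    """Each subset alternately serves as training set D1, auxiliary set D2,
    and test set D3: all six assignments."""
    return list(permutations(parts))
```

With class counts divisible by three, each fold receives exactly the same number of instances per (S, Y) combination, matching the "identical joint distribution" requirement.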
• Whenever an experimental protocol requires sampling from a set, for instance when artificially altering a class prevalence value, we perform 10 different samplings. To perform extensive experiments at a reasonable computational cost, every time an experimental protocol requires changing a dataset D into a version D̃ characterized by distribution shift, we also reduce its cardinality to |D̃| = 500. Further details on, and implications of, this choice for each experimental protocol are provided in the context of the protocol's setup (e.g., Section 5.6.1).
• Different learning approaches can be used to train the sensitive attribute classifier k s underlying the quantification methods. We test Logistic Regression (LR) and Support Vector Machines (SVMs). 3 Sections 5.3-5.7 report results of quantification algorithms wrapped around a classifier trained via LR. Analogous results obtained with SVMs are reported in Appendix D.
• We train the classifier h, whose demographic disparity we aim to estimate, using LR with balanced class weights (i.e., loss weights inversely proportional to class frequencies).
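In scikit-learn, the balanced-weights option mentioned above corresponds to `class_weight="balanced"`; the toy imbalanced data below is our own illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights each class by n_samples / (n_classes * n_c),
# i.e., with loss weights inversely proportional to class frequencies
h = LogisticRegression(class_weight="balanced", max_iter=1000)

# toy illustration: a 90/10 imbalanced binary problem (data is made up)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)), rng.normal(2.0, 1.0, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
h.fit(X, y)
```

Without reweighting, the minority class would be penalized nine times less in the loss; balancing removes that asymmetry.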
• To measure the performance of different quantifiers, we report the signed estimation error, derived from Equations (10) and (14) as

e = δ̂ S h − δ S h    (16)

We refer to |e| as the Absolute Error (AE), and evaluate the results of our experiments by Mean Absolute Error (MAE) and Mean Squared Error (MSE), defined as

MAE = (1/|E|) Σ i∈E |e i |    MSE = (1/|E|) Σ i∈E e i ²

3. Some among the quantification methods we test in this study require the classifier to output posterior probabilities (as is the case for classifiers trained via LR). If a classifier natively outputs classification scores that are not probabilities (as is the case for classifiers trained via SVM), we convert these scores into probabilities via Platt (2000)'s probability calibration method.
where the mean of the signed estimation errors e i is computed over multiple experiments E. Overall, our experiments consist of over 700,000 separate estimations of demographic disparity.
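The error measures just defined can be sketched as follows (function names are ours):

```python
import numpy as np

def signed_error(dd_hat, dd_true):
    """Signed estimation error e = dd_hat - dd_true (Equation 16)."""
    return dd_hat - dd_true

def mae(errors):
    """Mean Absolute Error of the signed errors over a set of experiments."""
    return float(np.mean(np.abs(errors)))

def mse(errors):
    """Mean Squared Error of the signed errors over a set of experiments."""
    return float(np.mean(np.square(errors)))
```

MAE weighs all errors linearly, while MSE penalizes large outliers more heavily, which is why the two can rank estimators differently.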
The remainder of this section is organized as follows. Section 5.2 presents the datasets that we have chosen and the pre-processing steps we apply. Sections 5.3-5.7 motivate and detail each of the five experimental protocols, reporting the performance of different demographic disparity estimators. Section 5.8 presents an experiment on fairness-aware methods, where the classifier whose fairness we aim to estimate has been trained to optimize that measure. Section 5.9 shows that reliable fairness auditing may be decoupled from undesirable misuse aimed at inferring the values of the sensitive attribute at an individual level. Finally, Section 5.10 describes an ablation study, aimed at investigating the benefits of training and maintaining multiple class-specific quantifiers.

Datasets
We perform our experiments on three datasets. We choose Adult and COMPAS, the two most popular datasets in algorithmic fairness research (Fabris et al., 2022), and Credit Card Default (hereafter: CreditCard), which serves as a representative use case for a bank performing a fairness audit of a prediction tool used internally. For each dataset, we standardize the selected features by subtracting the mean and scaling to unit variance.
Adult. 4 One of the most popular resources in the UCI Machine Learning Repository, the Adult dataset was curated to benchmark the performance of machine learning algorithms. It was extracted from the March 1994 US Current Population Survey and represents respondents along demographic and socioeconomic dimensions, reporting, e.g., their sex, race, educational attainment, and occupation. Each instance comes with a binary label, encoding whether their income exceeds $50,000, which is the target of the associated classification task. We consider "sex" the sensitive attribute S, with a binary categorization of respondents as "Female" or "Male". From the non-sensitive attributes X, we remove "education-num" (a redundant feature), "relationship" (where the values "husband" and "wife" are near-perfect predictors of "sex"), and "fnlwgt" (a variable released by the US Census Bureau to encode how representative each instance is of the overall population). Categorical variables are dummy-encoded and instances with missing values (7%) are removed.
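A sketch of this preprocessing with pandas, assuming the standard UCI column names (the helper name is ours, and "?" marks missing values in the raw data):

```python
import pandas as pd

def preprocess_adult(df):
    """Sketch of the preprocessing described above for the UCI Adult dataset."""
    df = df.replace("?", pd.NA).dropna()                  # drop rows with missing values (~7%)
    s = (df.pop("sex") == "Male").astype(int)             # sensitive attribute S
    y = (df.pop("income") == ">50K").astype(int)          # target: income over $50,000
    df = df.drop(columns=["education-num", "relationship", "fnlwgt"])
    X = pd.get_dummies(df).astype(float)                  # dummy-encode categoricals
    X = (X - X.mean()) / X.std()                          # standardize features
    return X, s, y
```

The same recipe (drop leaky or redundant columns, dummy-encode, standardize) applies with minor changes to COMPAS and CreditCard.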
COMPAS. 5 This dataset was curated to audit racial biases in the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) risk assessment tool, which estimates the likelihood of a defendant becoming a recidivist. The dataset represents defendants who were scored for risk of recidivism by COMPAS in Broward County, Florida, between 2013 and 2014, summarizing their demographics, criminal record, custody, and COMPAS scores. We consider the compas-scores-two-years subset published by ProPublica on GitHub, consisting of defendants who were observed for two years after screening, for whom a binary recidivism ground truth is available. We follow standard pre-processing to remove noisy instances (ProPublica, 2016). We focus on "race" as a protected attribute S, restricting the data to defendants labelled "African-American" or "Caucasian". Our attributes X are the age of the defendant ("age", an integer), the number of juvenile felonies, misdemeanours, and other convictions ("juv fel count", "juv misd count", "juv other count", all integers), the number of prior crimes ("priors count", an integer), and the degree of the current charge ("c charge degree", felony or misdemeanour, dummy-encoded).
CreditCard. 6 This resource was curated to study automated credit card default prediction, following a wave of defaults in Taiwan. The dataset summarizes the payment history of customers of an important Taiwanese bank, from April to October 2005. Demographics, marital status, and education of customers are also provided, along with the amount of credit given and a binary variable encoding the default on payment within the next month, which is the associated prediction task. We consider "sex" (binarily encoded) as the sensitive attribute S and keep every other variable in X, preprocessing categorical ones via dummy-encoding ("education", "marriage", "pay 0", "pay 2", "pay 3", "pay 4", "pay 5", "pay 6"). Differently from Adult, we keep marital status as its values are not trivial predictors of the sensitive attribute.
A summary of these datasets and related statistics is reported in Table 3.

Distribution Shift Affecting the Test Set: Protocol sample-prev-D 3

Motivation and Setup
The first experimental protocol models a setting in which the test set D 3 shows a significant distribution shift with respect to the sets D 1 and D 2 available during the training of h and k. In other words, in this protocol, D 1 and D 2 are marginalisations of the same joint distribution, while D 3 (more precisely, D̃ 3 ) is drawn from a different joint distribution. We consider two sub-protocols (sample-prev-D ⊖ 3 and sample-prev-D ⊕ 3 ) that model changes in the distribution of a sensitive variable S in D ⊖ 3 and D ⊕ 3 , the test subsets of either negatively or positively predicted instances. More in detail, we let Pr(s|⊖) (or its dual Pr(s|⊕)) in D̃ 3 range over eleven evenly spaced values between 0 and 1. For example, under sub-protocol sample-prev-D ⊖ 3 , we vary the distribution of sensitive attribute S in D̃ ⊖ 3 , so that Pr(s|⊖) ∈ {0.0, 0.1, . . . , 0.9, 1.0}, while keeping the distribution in D̃ ⊕ 3 fixed. For both sub-protocols, in each repetition we sample subsets of the test set D 3 such that |D̃ ⊖ 3 | = |D̃ ⊕ 3 | = 500. Pseudocode 1 describes the protocol when acting on D ⊖ 3 ; the case for D ⊕ 3 is analogous and consists of swapping the roles of D ⊖ 3 and D ⊕ 3 in Lines 18 and 19. The pale red region highlights the part of the experimental protocol that is specific to Protocol sample-prev-D 3 ; the rest is common to all the experimental protocols mentioned in this paper.
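The prevalence-controlled sampling at the core of this protocol can be sketched as follows (a simplified version; the function name and signature are ours):

```python
import numpy as np

def sample_at_prevalence(idx_s1, idx_s0, prev_s1, n, rng):
    """Draw n instances so that a fraction prev_s1 of them has S=1,
    sampling without replacement within each group."""
    n1 = int(round(prev_s1 * n))
    chosen1 = rng.choice(idx_s1, size=n1, replace=False)
    chosen0 = rng.choice(idx_s0, size=n - n1, replace=False)
    return np.concatenate([chosen1, chosen0])
```

Applying this to the indices of D ⊖ 3 while leaving D ⊕ 3 untouched (or vice versa) yields the two sub-protocols.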
This protocol accounts for the inevitable evolution of phenomena, especially those related to human behaviour. Indeed, it is common in real-world scenarios for data generation processes to be nonstationary and change across development and deployment, due, e.g., to seasonality, changes in the spatiotemporal application context, or any sort of unmodelled novelty and difference in populations (Ditzler et al., 2015;Malinin et al., 2021;Moreno-Torres et al., 2012). Given that most work on algorithmic fairness focuses on decisions or predictions about people, and given inevitable changes in human lives, values, and behaviour, the above considerations about non-stationarity seem particularly relevant. For example, data available from one population is often repurposed to train algorithms that will be deployed on a different population, requiring ad hoc fair learning approaches (Coston et al., 2019) and evoking the portability trap of fair machine learning (Selbst et al., 2019). In addition, agents can respond to novel technology in their social context and adapt their behaviour accordingly (Hu et al., 2019;Tsirtsis et al., 2019), causing ripple effects (Selbst et al., 2019) and feedback loops (Mansoury et al., 2020).
Finally, personalized pricing constitutes an increasingly possible practice with nontrivial fairness concerns (Kallus and Zhou, 2021) and inevitable shifts due to changing habits and environments (Sindreu, 2021).
In this protocol, quantifiers are tested on subsets D̃ ⊖ 3 , D̃ ⊕ 3 that exhibit a different prevalence of the sensitive attribute value s with respect to their counterparts D ⊖ 2 , D ⊕ 2 in the auxiliary set. More specifically, with this protocol we vary the joint distribution of (S, Ŷ ) to directly influence the demographic disparity of the classifier h in the test set D 3 , and move it away from the value δ S h of the same measure that we would obtain on the set D 2 . This is a fundamental evaluation protocol, as it makes our estimand different across D 2 and D 3 (or, more precisely, its modified version D̃ 3 ), which is typically expected in practice. If this were not the case, a practitioner could simply resort to an explicit calculation of the demographic disparity in the auxiliary set D 2 and consider it representative of any deployment condition, as in the MLPE trivial baseline. Given this reasoning, this protocol imposes sizeable variations in the demographic disparity of h between D 2 and D 3 , which act as the training set and the test set, respectively, for our quantifiers. For example, on Adult, δ S h is approximately equal to 0.3 in D 2 , while in D 3 we let it vary in the range [−0.7, 0.9]. Despite these sizeable variations, we expect methods such as SLD, ACC, and PACC to perform well, due to their proven unbiasedness in this setting (Remark 3).

Results
In Figure 1 we report the performance of CC, PCC, ACC, PACC, SLD, HDy, and MLPE on the Adult dataset under the sample-prev-D 3 experimental protocol. The estimation error (Equation 16) is reported on the y axis, as we vary the prevalence of the protected group in the test set, which is displayed on the x axis. Figure 1a concentrates on sub-protocol sample-prev-D ⊖ 3 , while Figure 1b concentrates on sub-protocol sample-prev-D ⊕ 3 . Similar trends emerge under both sub-protocols. CC, PCC, and MLPE display a clear trend along the x axis, vastly over- or underestimating the demographic disparity of h, and proving unreliable in settings where the prevalence values in the unlabelled (test) set shift away from the prevalence values of the training set. In sub-protocol sample-prev-D ⊕ 3 , summarised in Figure 1b, the prevalence of men (S = 1) in D̃ ⊕ 3 , used to test one of the quantifiers, is almost always lower than the prevalence in the respective training set D ⊕ 2 , reported with a vertical green line. As a result, quantifiers trained on D ⊕ 2 tend to systematically overestimate the prevalence of males in D ⊕ 3 , thus also overestimating µ(1) and δ S h , according to Equations (14) and (15). Similar considerations hold for sub-protocol sample-prev-D ⊖ 3 , with a sign flip. ACC, PACC, SLD and HDy, on the other hand, display low bias, even under sizeable prevalence shift. Their variance is higher than that of CC and PCC, but their estimation error is moderate overall. The condition Pr(S = 1|Ŷ = ⊖) = 1 (right-most point in Figure 1a) is particularly critical for every method, due to p D 3 (s = 0) dropping below 0.1, thus making small estimation errors in the denominator of Equation (15) especially impactful on µ̂(0).
The results of the COMPAS and CreditCard datasets are reported in Table 4, along with a summary of the results of the Adult dataset we have just discussed. The first and second columns indicate the MAE and MSE values (lower is better), while the third and fourth columns indicate the probability that the Absolute Error (AE) falls below 0.1 and 0.2 across the entire experimental protocol (higher is better). Boldface indicates the best method for a given dataset and metric. The superscripts † and ‡ denote the methods (if any) whose error scores (MAE, MSE) are not statistically significantly different from the best according to a paired sample, two-tailed t-test at different confidence levels. Symbol † indicates 0.001 < p-value < 0.05 while symbol ‡ indicates 0.05 ≤ p-value; the absence of any such symbol indicates p-value ≤ 0.001 (i.e., that the performance of the method is statistically significantly different from that of the best method). Overall, SLD strikes the best balance between bias and variance. PACC is the second-best approach, outperforming ACC and PCC, demonstrating the utility of combining posterior probabilities and adjustments when the latter can reliably be estimated. The trends we discussed also hold for COMPAS and CreditCard. Note that both datasets appear to provide a setting harder than Adult for the inference of the sensitive attribute S from the non-sensitive attributes X.

Distribution Shift Affecting the Auxiliary Set: Protocol sample-prev-D 2

Motivation and Setup
This protocol is analogous to protocol sample-prev-D 3 (Section 5.3), but for the fact that it focuses on shifts in the auxiliary set D 2 , while D 1 and D 3 remain at their natural prevalence. Similarly to Section 5.3, we assess the signed estimation error under shifts that affect D ⊖ 2 or D ⊕ 2 , that is, the subsets of D 2 labelled negatively or positively by the classifier h. Here too, we consider two experimental sub-protocols, describing variations in the prevalence of sensitive attribute s in either subset. More specifically, we let Pr(s|⊖) (or its dual Pr(s|⊕)) take 9 evenly spaced values between 0.1 and 0.9. Pseudocode 3 describes the protocol when acting on D ⊖ 2 ; the case for D ⊕ 2 is analogous, and comes down to swapping the roles of D ⊖ 2 and D ⊕ 2 in Lines 12 and 13. This protocol captures issues of representativity in demographic data, e.g., due to nonuniform response rates across subpopulations (Schouten et al., 2009, 2012). Given the importance of trust for the provision of one's sensitive attributes, in some domains this provision is considered akin to a data donation (Andrus et al., 2021). Individuals from groups that were historically served with worse quality or had lower acceptance rates for a service can be reluctant to disclose their membership in those groups, fearing that it may be used against them as grounds for rejection or discrimination (Hasnain-Wynia and Baker, 2006). This may be especially true for individuals who perceive themselves to be at high risk of rejection, and this can cause complex selection biases, jointly dependent on S and Y , or on S and Ŷ if individuals have some knowledge of the classification procedure. For example, health care providers may be advised to collect information about the race of patients to monitor the quality of services across subpopulations.
In a field study, 28% of patients reported discomfort in revealing their own race to a clerk, with African-American patients significantly less comfortable than white patients on average (Baker et al., 2005).

Results

Figure 2 shows the signed estimation error on the y axis, as we vary, on the x axis, the prevalence of the sensitive attribute in D ⊖ 2 (Figure 2a) and D ⊕ 2 (Figure 2b). MLPE, CC, PCC, and HDy prove to be fairly sensitive to shifts in their training set. In sub-protocol sample-prev-D ⊕ 2 , symmetrically to the sub-protocol sample-prev-D ⊕ 3 discussed in the previous section, the prevalence of males (S = 1) in subset D ⊕ 2 , used to train one of the quantifiers, is almost always lower than the prevalence in the respective test subset D ⊕ 3 , indicated with a vertical green line. As a result, quantifiers trained on D ⊕ 2 tend to systematically underestimate the prevalence of males in D ⊕ 3 and underestimate the (signed) demographic disparity of the classifier h.

ACC and PACC require splitting their training set to estimate the respective adjustments (Equations (6)-(9)), and suffer from the reduced cardinality |D 2 | = 1,000. Their performance worsens substantially with respect to protocol sample-prev-D 3 , where |D 2 | > 15,000. Indeed, these methods have been shown to be Fisher-consistent under prior probability shift (Fernandes Vaz et al., 2019; Tasche, 2017), that is, they are guaranteed to be accurate, thanks to the respective adjustments, if D 2 is large enough and linked to D 3 by prior probability shift. While the latter condition holds, the former is violated under this protocol; hence ACC and PACC are unbiased (in expectation), but display a large variance, due to unstable adjustments. SLD, on the other hand, shows moderate variance and bias. These effects are especially evident at the extremes of the x axis, which correspond to settings where few instances with either S = 0 or S = 1 are available for quantifier training. In turn, the few positives (negatives) make it particularly difficult to reliably estimate tpr ks (tnr ks ), as required by Equations (7) and (9). For example, in Figure 2a we see that the error of ACC ranges between −1.3 and 0.7. Given that the true demographic disparity of the classifier h is δ S h = 0.3, these are the worst possible errors, corresponding to the extreme estimates δ̂ S h = −1 and δ̂ S h = 1, respectively. Finally, it is worth noting that PACC outperforms ACC, thanks to its efficient use of posteriors π s (x i ) in place of binary decisions k s (x i ).
These trends also hold for COMPAS and CreditCard, as summarized in Table 5. Similarly to Table 4, we find that, under large shifts between the auxiliary and the test set, the estimation of demographic disparity is more difficult on COMPAS and CreditCard than on Adult. Overall, these experiments show that CC and PCC fare poorly under prior probability shift, and are outperformed by estimators with better theoretical guarantees.
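For reference, the adjustment that makes ACC consistent, and unstable when its rates are poorly estimated, is the standard adjusted-count correction, sketched below (the helper name is ours; the paper's Equations (6)-(9) are not reproduced here):

```python
import numpy as np

def acc_correct(cc_prev, tpr, fpr):
    """Adjusted Classify and Count: invert the mixture
    cc_prev = tpr * p + fpr * (1 - p) to recover the true prevalence p.
    tpr and fpr must themselves be estimated from held-out labelled data
    (here, D2), which is why ACC becomes unstable when D2 is small."""
    p = (cc_prev - fpr) / (tpr - fpr)
    return float(np.clip(p, 0.0, 1.0))
```

With exact rates the correction recovers the true prevalence under prior probability shift; with noisy rates, the raw estimate can fall outside [0, 1] and must be clipped, which is the variance blow-up observed above.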

Size of the Auxiliary Set: Protocol sample-size-D 2

Motivation and Setup
In this experimental protocol, we focus on the size of the auxiliary set D 2 , studying its influence on the estimation problem. Our goal is to understand how small this set can be before degrading the performance of our estimation techniques. We use subsets D̃ 2 of the auxiliary set, obtained by sampling instances uniformly without replacement from it. We let their cardinality |D̃ 2 | take five values evenly spaced on a logarithmic scale, between a minimum size |D̃ 2 | = 1,000 and a maximum size |D̃ 2 | = |D 2 |. In other words, we let the cardinality of the auxiliary set take five different values between 1,000 and |D 2 | in a geometric progression. This protocol is justified by the well-documented difficulties in the acquisition of demographic data for industry professionals, which vary depending on the domain, the company, and other factors of disparate nature (Andrus et al., 2021; Beutel et al., 2019b; Bogen et al., 2020; Galdon Clavell et al., 2020; Holstein et al., 2019). As an example, Galdon Clavell et al.
(2020) perform an internal fairness audit of a personalized wellness recommendation app, for which sensitive features are not collected during production, following the principles of data minimization. However, sensitive features were available in a previously obtained auxiliary set. Furthermore, in the US, the collection of sensitive attributes is highly industry dependent, ranging from mandatory to forbidden, depending on the fragmented regulation applicable in each domain (Bogen et al., 2020). High-quality auxiliary sets can be obtained through optional surveys (Wilson et al., 2021), for which response rates are highly dependent on trust, and can be improved by making the intended use of the data clearer (Andrus et al., 2021), directly impacting the cardinality of D 2 .
Therefore, the cardinality of the auxiliary set D 2 is an interesting variable in the context of fairness audits. The estimation methods that we consider have peculiar data requirements, such as the need to estimate true/false positive rates. For this reason, interesting patterns should emerge from this protocol. We expect key trends for the estimation error to vary monotonically with |D 2 |, which is why we let it vary according to a geometric progression.

Results
The signed estimation error on the Adult dataset under this experimental protocol is illustrated in Figure 3, as we vary the cardinality |D̃ 2 | along the x axis. Clearly, the variance of each approach decreases as the size of D̃ 2 increases. Additionally, slight biases may improve, as is the case with HDy, whose median error approaches zero as |D̃ 2 | increases. These trends are a direct confirmation of hints already obtained from the protocols discussed above. The most striking trend is the unreliability of ACC and PACC (and especially the former) in the small data regime.
Similar results are obtained for COMPAS and CreditCard, as reported in Table 6. Across the three datasets, PACC and ACC perform quite poorly, due to the difficulty of estimating tpr ks and fpr ks with the few labelled data points available from D̃ 2 . On the other hand, both SLD and HDy are fairly reliable. PCC and MLPE stand out as strong performers, with low bias and low variance. This is due to the fact that, under this experimental protocol, there is no shift between the auxiliary set D 2 , on which the quantifiers are trained, and the test set D 3 , on which they are tested. Since the current protocol focuses on the cardinality of the auxiliary set, D 2 and D 3 remain stratified subsets of the Adult dataset, with identical distributions over (S, Y ). In turn, this favours MLPE, which assumes no shift between D 2 and D 3 , and PCC, which relies on the fact that the posterior probabilities of its underlying classifier k are well-calibrated on D 3 . 7

7. Posterior probabilities Pr(s|x) are said to be well-calibrated when, given a sample σ drawn from X,

lim |σ|→∞ |{x ∈ σ : Pr(s|x) = α ∧ x ∈ s}| / |{x ∈ σ : Pr(s|x) = α}| = α

i.e., when, for big enough samples, α approximates the true proportion of data points belonging to class s among all data points for which Pr(s|x) = α.
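As an aside, the Platt calibration mentioned in footnote 3 corresponds, in scikit-learn, to sigmoid calibration of a score-based classifier; the toy data below is our own illustration:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# SVMs output uncalibrated decision scores; Platt's method fits a sigmoid on
# top of them (method="sigmoid") to obtain posterior probabilities
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.0, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
k = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3).fit(X, y)
probs = k.predict_proba(X)   # each row is a posterior distribution over classes
```

The calibrated posteriors are what probability-based quantifiers such as PCC and PACC consume.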

Motivation and Setup
With this protocol we evaluate the impact of shifts in the training set D 1 , by drawing different subsets D 1 as we vary Pr(Y = S). 8 More specifically, we vary Pr(Y = S) between 0 and 1 with a step of 0.1. In other words, we sample at random from D 1 a proportion p of instances (x i , s i , y i ) such that Y = S and a proportion (1 − p) such that Y ̸ = S, with p ∈ {0.0, 0.1, . . . , 0.9, 1.0}. We choose a limited cardinality |D 1 | = 500, which allows us to perform multiple repetitions at a reasonable computational cost, as described in Section 5.1. Although this may impact the quality of the classifier h, this aspect is not the central focus of the present work.
This experimental protocol aligns with biased data collection procedures, sometimes referred to as censored data (Kallus and Zhou, 2018). Indeed, it is common for the ground-truth variable to represent a mere proxy for the actual quantity of interest, with nontrivial sampling effects between the two. For example, the validity of arrest data as a proxy for offence has been brought into question (Fogliato et al., 2021). In fact, in this domain, different sources of sampling bias can be at play, such as the uneven allocation of police resources between jurisdictions and neighbourhoods (Holmes et al., 2008) and lower levels of cooperation in populations who feel oppressed by law enforcement (Xie and Lauritsen, 2012).
By varying Pr(Y = S) we impose a spurious correlation between Y and S, which may be picked up by the classifier h. In extreme situations, such as when Pr(Y = S) ≃ 1, a classifier h can confound the concepts behind S and Y . In turn, we expect this to unevenly affect the acceptance rates for the two demographic groups, effectively changing the demographic disparity of h, i.e., our estimand δ S h . Pseudocode 5 describes the main steps to implement Protocol sample-prev-D 1 .

8. While Y and S take values from different domains, by Y = S we mean (Y = ⊕ ∧ S = 1) ∨ (Y = ⊖ ∧ S = 0), i.e., a situation where positive outcomes are associated with group S = 1 and negative outcomes with group S = 0.
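The sampling step of this protocol can be sketched as follows. This is a simplified illustration with hypothetical names; the paper's Pseudocode 5 remains the reference:

```python
import random

def sample_with_agreement(D1, p, n=500, seed=0):
    """Draw a size-n subset of D1 in which a fraction p of the instances
    satisfies Y = S, i.e. (y=1, s=1) or (y=0, s=0), and a fraction 1-p
    does not. D1 is a list of (x, s, y) triples with s, y in {0, 1}."""
    rng = random.Random(seed)
    agree = [t for t in D1 if t[1] == t[2]]      # instances with Y = S
    disagree = [t for t in D1 if t[1] != t[2]]   # instances with Y != S
    n_agree = round(p * n)
    return rng.sample(agree, n_agree) + rng.sample(disagree, n - n_agree)

# Toy population: features are irrelevant to the sampling step, so x is an index.
random.seed(1)
population = [(i, random.randint(0, 1), random.randint(0, 1)) for i in range(10_000)]
subset = sample_with_agreement(population, p=0.8)
```

Sweeping p over {0.0, 0.1, . . . , 1.0} and retraining h on each subset reproduces the protocol's outer loop.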

Results
In Figure 4, the y axis depicts the estimation error (Equation 16), as we vary Pr(Y = S) along the x axis. Each quantification approach outperforms vanilla CC, which overestimates the demographic disparity of the classifier h, i.e., its estimate is larger than the ground truth value, so δ̂ S,CC h > δ S h . ACC, PCC, PACC, SLD, HDy, and MLPE display a negligible bias and a reliable estimate of demographic disparity. The absolute error of these techniques is always below 0.1, except for a few outliers.
Results for the COMPAS and CreditCard datasets are reported in Table 7. Confirming the results of previous protocols, these datasets provide a harder setting for the estimate of demographic disparity, as shown by higher MAE and MSE, which, for instance, increase by one order of magnitude for SLD and PACC moving from Adult to COMPAS. PCC is the best performer, for the same reasons discussed in Section 5.3, i.e., the absence of shift between D 2 and D 3 .

Motivation and Setup
Certain biases in the training set resulting from domain-specific practices, such as the use of arrest as a substitute for offence, can be modelled either as a selection bias (Fogliato et al., 2021) or as a label bias distorting the ground truth variable Y (Fogliato et al., 2020). With this experimental protocol, we impose the latter bias by actively flipping some ground truth labels Y in D 1 based on their sensitive attribute. Similarly to sample-prev-D 1 , this protocol achieves a given association between the target Y and the sensitive variable S in the training set D 1 . However, instead of sampling, it does so by flipping the Y label of some data points. More specifically, we impose Pr(Y = ⊖|S = 0) = Pr(Y = ⊕|S = 1) = p and let p take values across eleven evenly spaced values between 0 and 1. For every value of p, we first sample a random subset D 1 of the training set with cardinality 500. Next, we actively flip some Y labels in both demographic groups, until both Pr(Y = ⊖|S = 0) and Pr(Y = ⊕|S = 1) reach the desired value of p ∈ {0.0, 0.1, . . . , 0.9, 1.0}. Finally, we train a classifier h on the attributes X and modified ground truth Y of D 1 .

This experimental protocol is compatible with settings where the training data capture a distorted ground truth due to systematic biases and group-dependent annotation accuracy (Wang et al., 2021a). As an example, the quality of medical diagnoses can depend on race, sex, and socioeconomic status (Gianfrancesco et al., 2018). In addition, health care expenditures have been used as a proxy to train an algorithm deployed nationwide in the US to estimate patients' health care needs, resulting in a systematic underestimation of the needs of African-American patients (Obermeyer et al., 2019). In the hiring domain, employer response rates to resumes have been found to vary with the perceived ethnic origin of an applicant's name (Bertrand and Mullainathan, 2004).
These are all examples where the "ground truth" associated with a dataset is distorted to the disadvantage of a sensitive demographic group. Similarly to Section 5.6, we expect this experimental protocol to cause significant variations in the demographic disparity of the classifier h, due to the strong correlation we impose between S and Y by label flipping. The pseudocode that describes this protocol is essentially the same as Pseudocode 5, simply replacing the sampling in line 8 with the label flipping procedure described above; we therefore omit it.

Results
Figure 5 illustrates the key trends caused by this experimental protocol on the Adult dataset. A clear trend is visible along the x axis, which reports the true demographic disparity δ S h of the classifier h (Equation 10), quantized with a step of 0.1. We choose to depict the true demographic disparity on the x axis as it is the estimand, hence a quantity of interest by definition. The error incurred by CC displays a linear trend that goes from severe underestimation (for low values of the x axis) to severe overestimation (for large values of the x axis). In other words, the (signed) estimation error increases with the true demographic disparity of the classifier h, a phenomenon also noticed by Chen et al. (2019). All remaining approaches compensate for this weakness and display a good estimation error: PCC, ACC, PACC, SLD, HDy, and MLPE have low variance and a median estimation error close to zero across different values of the estimand. Table 8 summarizes similar results on COMPAS and CreditCard; PCC remains well-calibrated and very effective, while SLD and HDy also perform well.
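The label-flipping step described above can be sketched as follows (hypothetical names; a simplified stand-in for the procedure used in the experiments):

```python
import random

def flip_labels(D1, p, seed=0):
    """Return a copy of D1 in which Y labels have been flipped so that
    Pr(Y=0 | S=0) = Pr(Y=1 | S=1) = p. D1 is a list of (x, s, y) triples
    with s, y in {0, 1}."""
    rng = random.Random(seed)
    out = [list(t) for t in D1]
    for s_val, target_y in ((0, 0), (1, 1)):      # target label per group
        group = [t for t in out if t[1] == s_val]
        want = round(p * len(group))              # points that should carry target_y
        have = [t for t in group if t[2] == target_y]
        lack = [t for t in group if t[2] != target_y]
        if len(have) > want:                      # flip some labels away from target_y
            for t in rng.sample(have, len(have) - want):
                t[2] = 1 - target_y
        else:                                     # flip some labels towards target_y
            for t in rng.sample(lack, want - len(have)):
                t[2] = target_y
    return [tuple(t) for t in out]

random.seed(1)
D1 = [(i, random.randint(0, 1), random.randint(0, 1)) for i in range(500)]
flipped = flip_labels(D1, p=0.7)
```

Unlike the sampling protocol, the subset size stays fixed and only the labels change, which is what makes this a model of label bias rather than selection bias.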

Motivation and Setup
So far, we have considered classifiers h(x) trained purely to maximize accuracy. In practice, it is especially interesting to monitor the fairness of methods that explicitly optimize fairness during training. In fact, sensitive attributes may be available during training, allowing for their direct use in fairness optimization, but unavailable after deployment, complicating the fairness evaluation of live systems. In this section, we replace the vanilla LR classifier from the previous experiments with a fairness-aware method. We train a decision tree h T , jointly optimizing accuracy and demographic parity, with the cost-sensitive method of Agarwal et al. (2018). This method makes use of s during training to adjust the cost of positive and negative predictions according to group membership. This learning scheme leads to a classifier h T (x) which is fairness-aware but does not require access to sensitive attributes to issue predictions on D 3 .

Results
We focus our exposition on protocol sample-prev-D 3 ; analogous results are obtained on the remaining protocols. The fairness-aware decision tree improves demographic disparity by one order of magnitude, with an average δ S h T = 0.017, down from δ S h = 0.158 for LR. Figure 6, reporting the estimation error of different quantifiers, shows the same patterns as its counterpart in Figure 1. CC and PCC have a sizeable bias, while ACC, PACC, SLD, and HDy display low estimation error for all the tested prevalence values. This experiment confirms the suitability of our method for measuring fairness under unawareness, also for fairness-aware classifiers.

Motivation and Setup
The motivating use case for this work is internal audits of group fairness, aimed at characterizing a model and its potential to harm sensitive categories of users. Following Awasthi et al. (2021), we envision this as an important first step in empowering practitioners to argue for resources and, more broadly, to advocate for a deeper understanding and careful evaluation of models. Unfortunately, developing a tool to infer demographic information, even if motivated by good intentions and good faith, leaves open the possibility of misuse, especially at the individual level. Once a predictive tool capable of instance-level classification is available, it will be tempting for some actors to exploit it precisely for this purpose. For example, the Bayesian Improved Surname Geocoding (BISG) method was designed to estimate population-level disparities in health care (Elliott et al., 2009), but was later used to identify individuals potentially eligible for settlements related to discriminatory practices of auto lenders (Andriotis and Ensign, 2015; Koren, 2016). Automatic inference of the sensitive attributes of individuals is problematic for several reasons. Such a procedure exploits the co-occurrence of membership in a group and display of a given trait, running the risk of learning, encoding, and reinforcing stereotypical associations. Although this is also true of group-level estimates, the practice is particularly troublesome at the individual level, where it is likely to cause harm to people who do not fit the norm, resulting, for instance, in misgendering and the associated negative effects (McLemore, 2015). Even when "accurate", the mere act of externally assigning sensitive labels can be problematic. For example, gender assignment can be forceful and cause psychological harm to individuals (Keyes, 2018).
In this section, we aim to demonstrate that it is possible to decouple the objective of (group-level) quantification of sensitive attributes from that of (individual-level) classification. For each protocol in the previous sections, we compute the accuracy and F 1 score (defined below) of the sensitive attribute classifier k underlying the tested quantifiers, comparing them against the quantifiers' estimation error for the class prevalence of the same sensitive attribute.

Results
Figures 7 and 8 display the quantification performance (MAE, dashed) and classification performance (F 1 and accuracy, solid) of CC, SLD and PACC on the Adult dataset under protocols sample-prev-D 2 and sample-prev-D 3 , respectively. As usual, we describe the results for LR-based learners and report their SVM-based counterparts in the appendix (Figures 12 and 13). To evaluate the quantification performance of each approach, we simply report their MAE in estimating the prevalence p D ⊖ 3 (S = 1) or p D ⊕ 3 (S = 1) in either test subset, depending on the protocol at hand. To assess the performance of the sensitive attribute classifier k underlying each quantifier, we proceed as follows. For CC and PACC, we simply run k (LR) on either D ⊖ 3 or D ⊕ 3 , reporting its accuracy and F 1 score in inferring the sensitive attribute of individual instances. The classification performance scores of the classifiers underlying CC and PACC are equivalent, so we omit the latter from Figures 7 and 8 for readability. For SLD, we take the novel posteriors obtained by applying the EM algorithm to either test subset, and use them for classification with a threshold of 0.5.
Clearly, SLD improves both the quantification and classification performance of the classifier k. In terms of quantification, its MAE is consistently below that of CC, and in terms of classification, it displays better F 1 and accuracy. However, under large prevalence shifts between the auxiliary set D 2 and the test set D 3 , its classification performance becomes unreliable. In particular, under protocol sample-prev-D ⊖ 3 (resp. sample-prev-D ⊕ 3 ) in Figure 8a (resp. Figure 8b), for low values of the x axis, i.e., when the true prevalence value p D ⊖ 3 (S = 1) (resp. p D ⊕ 3 (S = 1)) becomes small, the SLD-based classifier starts acting as a trivial rejector with low recall, and hence low F 1 score. On the other hand, the quantification performance of SLD does not degrade in the same way: its MAE is low and flat across the entire x axis in Figures 8a and 8b. This is a first hint of the fact that classification and quantification performance can be decoupled.
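For reference, the EM prior correction underlying SLD (Saerens et al., 2002) can be sketched in a few lines. The variable names are ours, the sketch assumes a binary sensitive attribute, and the uncorrected baseline shown for comparison is the expected-count estimate used by PCC:

```python
import math
import random

def sld(posteriors, train_prev, n_iter=1000, tol=1e-6):
    """EM prior adjustment of Saerens et al. (2002). posteriors are
    Pr(S=1|x) on a test sample, issued by a classifier trained on data
    with class-1 prevalence train_prev. Returns the corrected prevalence
    estimate and the shift-adjusted posteriors."""
    prev = train_prev
    post = list(posteriors)
    for _ in range(n_iter):
        r1 = prev / train_prev                 # reweight by new vs old prior
        r0 = (1 - prev) / (1 - train_prev)
        post = [r1 * q / (r1 * q + r0 * (1 - q)) for q in posteriors]
        new_prev = sum(post) / len(post)       # M-step: mean of adjusted posteriors
        if abs(new_prev - prev) < tol:
            return new_prev, post
        prev = new_prev
    return prev, post

# Two Gaussian classes; the posteriors are calibrated for a balanced
# training set, but the test prevalence of S=1 is only 0.2.
rng = random.Random(0)
true_prev = 0.2
xs = [rng.gauss(1, 1) if rng.random() < true_prev else rng.gauss(-1, 1)
      for _ in range(20_000)]
posteriors = [1 / (1 + math.exp(-2 * x)) for x in xs]  # Bayes-optimal at prior 0.5
pcc_estimate = sum(posteriors) / len(posteriors)       # biased towards 0.5
sld_estimate, _ = sld(posteriors, train_prev=0.5)
```

Under this prior probability shift, the uncorrected expected count lands near 0.33, while the EM correction recovers a prevalence close to the true 0.2, mirroring the behaviour of SLD in the experiments above.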
PACC is another method that significantly outperforms CC in estimating the prevalence of sensitive attributes in both test subsets D ⊖ 3 , D ⊕ 3 . Indeed, its MAE is well aligned with that of SLD, displaying low quantification error under all protocols (Figures 7-8). On the other hand, its classification performance is aligned with the accuracy and F 1 score of CC, which is unsatisfactory and can even become worse than random. This fact shows that it is possible to build models which yield good prevalence estimates for the sensitive attribute within a sample, without providing reliable demographic estimates for single instances. Indeed, aggregative quantification methods (that is, those based on the output of a classifier, like all methods we use in this study) are suited to repair the initial prevalence estimate (computed by classifying and counting) without precise knowledge of which specific data points have been misclassified. In the context of models to measure fairness under unawareness of sensitive attributes, we highlight this as a positive result, decoupling a desirable ability to estimate group-level disparities from the potential for undesirable misuse at the individual level.

Figure 7: Performance of CC, SLD and PACC on the Adult dataset when used for quantification (MAE, lower is better, dashed) and classification (F 1 and accuracy, higher is better, solid) under protocol sample-prev-D 2 . The classification performance of PACC is equivalent to that of CC (both equal to the performance of the underlying LR), and we thus omit it for readability.

Figure 8: Performance of CC, SLD and PACC on the Adult dataset when used for quantification (MAE, lower is better, dashed) and classification (F 1 and accuracy, higher is better, solid) under protocol sample-prev-D 3 . The classification performance of PACC is equivalent to that of CC (both equal to the performance of the underlying LR), and we thus omit it for readability.
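The decoupling phenomenon can be reproduced with a toy example: an adjusted count recovers the group-level prevalence almost exactly even when the underlying classifier is nearly useless at the individual level. This sketch uses hypothetical names and assumes the classifier's tpr/fpr are known exactly, whereas ACC in the experiments must estimate them from held-out data:

```python
import random

def acc_estimate(predictions, tpr, fpr):
    """Adjusted Classify and Count: correct the raw positive-prediction
    rate using the classifier's true/false positive rates."""
    cc = sum(predictions) / len(predictions)
    return (cc - fpr) / (tpr - fpr)

rng = random.Random(0)
true_prev, tpr, fpr = 0.7, 0.6, 0.3   # a deliberately weak classifier
labels = [1 if rng.random() < true_prev else 0 for _ in range(100_000)]
preds = [1 if rng.random() < (tpr if y else fpr) else 0 for y in labels]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)  # ~0.63
cc = sum(preds) / len(preds)            # raw count, biased: ~0.51
acc = acc_estimate(preds, tpr, fpr)     # adjusted count: ~0.70
```

The adjustment repairs the aggregate estimate without ever identifying which individual predictions were wrong, which is precisely the property exploited above.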

Motivation and Setup
In the previous sections, we tested six approaches to estimate demographic disparity. For each approach, we used multiple quantifiers for the sensitive attribute S, namely one for each class in the codomain of the classifier h, as described in Step 3 of the method for quantification-based estimation of demographic disparity. In the binary setting adopted in this work, where Y = {⊖, ⊕}, we trained two quantifiers. One quantifier was trained on the set of positively-classified instances of the auxiliary set D ⊕ 2 = {(x i , s i ) ∈ D 2 | h(x i ) = ⊕} and deployed to quantify the prevalence of sensitive instances (such that S = s) within the test subset D ⊕ 3 . The other quantifier was trained on D ⊖ 2 and deployed on D ⊖ 3 . Training and maintaining multiple quantifiers is more expensive and cumbersome than having a single one. First, quantifiers that depend on the classification outcome ŷ = h(x) require retraining every time h is modified, e.g., due to a model update being rolled out. Second, the maintenance cost is multiplied by the number of classes |Y| that are possible for the outcome variable. To ensure that these downsides are compensated by performance improvements, we perform an ablation study and evaluate the performance of different estimators of demographic disparity supported by a single quantifier.
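The two per-class prevalence estimates are then combined into a disparity estimate. The sketch below is our own rendering of this combination step (the paper's Equations 14-15 are the reference), with hypothetical names: beta_pos and beta_neg denote the estimated prevalence of S = 1 among positively and negatively classified test instances:

```python
def dd_from_prevalences(beta_pos, beta_neg, n_pos, n_neg):
    """Estimated demographic disparity of h: the difference in acceptance
    rate Pr(h(X)=pos | S=s) between groups S=1 and S=0, computed from the
    estimated prevalence of S=1 in the positively (beta_pos) and negatively
    (beta_neg) classified test subsets, of sizes n_pos and n_neg."""
    s1_pos, s1_neg = beta_pos * n_pos, beta_neg * n_neg          # estimated S=1 counts
    s0_pos, s0_neg = (1 - beta_pos) * n_pos, (1 - beta_neg) * n_neg  # estimated S=0 counts
    rate_s1 = s1_pos / (s1_pos + s1_neg)   # acceptance rate for group S=1
    rate_s0 = s0_pos / (s0_pos + s0_neg)   # acceptance rate for group S=0
    return rate_s1 - rate_s0

# With perfect prevalence estimates, the true disparity is recovered exactly.
dd = dd_from_prevalences(beta_pos=0.5, beta_neg=0.25, n_pos=400, n_neg=600)
```

In the single-quantifier ablation, beta_pos and beta_neg are produced by the same quantifier q applied to the two test subsets, rather than by dedicated quantifiers.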
In this section we concentrate on three estimation approaches, namely PCC, SLD, and PACC. SLD and PACC are among the best overall performers, displaying low bias or variance across all protocols. PCC shows great performance in situations where its posteriors are well-calibrated on D 3 . We compare their performance in two settings. In the first setting, adopted so far, two separate quantifiers q ⊖ and q ⊕ are trained on D ⊖ 2 , D ⊕ 2 and deployed on D ⊖ 3 , D ⊕ 3 , respectively. In the second setting, we train a single quantifier q on D 2 and deploy it separately on D ⊖ 3 and D ⊕ 3 to estimate δ̂ S h using Equations (14) and (15), specialized so that q ⊖ and q ⊕ are the same quantifier. Figure 9 summarizes results for the Adult dataset under two protocols that are representative of the overall trends, namely sample-prev-D 2 (Figure 9a) and sample-prev-D 3 (Figure 9b). 9 The y axis depicts the estimation error of PCC, SLD, PACC, and their single-quantifier counterparts, denoted by the suffix "nosD2" to indicate that the auxiliary set D 2 is not split into D ⊖ 2 , D ⊕ 2 during training. The x axis depicts the quantity of interest varied under each protocol.

Results
Interestingly, PCC appears to be rather insensitive to the ablation, so that the estimation errors of PCC and PCC-nosD2 are well aligned. PCC-nosD2 performs slightly better under the protocol sample-prev-D 2 , where the auxiliary set is small, and splitting it to learn separate quantifiers may result in poor performance. The opposite holds for PACC, whose single-quantifier variant PACC-nosD2 shows a clear decline in performance. This is due to the fact that the estimates of tpr (and fpr) in D ⊕ 3 and D ⊖ 3 used for the adjustment (Equation 9) are more precise when issued by dedicated estimators rather than by a single one computed without splitting D 2 . SLD-nosD2 also shows a sizeable performance decay.
Under all protocols, the performance of SLD and PACC is compromised in the absence of class-specific quantifiers q ⊖ and q ⊕ . If a single quantifier is trained on the full auxiliary set D 2 , the corrections brought about by SLD and PACC can end up worsening, rather than improving, the prevalence estimates of vanilla CC. PCC is less sensitive to the ablation, showing small performance differences in both directions under the single quantifier setting. In general, it seems beneficial to partition the auxiliary set into subsets D ⊖ 2 and D ⊕ 2 according to the method in Section 4.2.
9. In the interest of brevity, the figures in this section refer to LR-based quantification on the Adult dataset under two protocols. Results for SVM-based quantifiers under every protocol are depicted in the Appendix (Figures 10 and 11). Analogous results hold on CreditCard and COMPAS.

Summary and Takeaway Message
Overall, our work shows that quantification approaches are suited to measuring demographic parity under unawareness of sensitive attributes if a small auxiliary dataset, containing both sensitive and non-sensitive attributes, is available. This is a common setting in real-world scenarios, where such datasets may originate from targeted efforts or voluntary disclosure. Despite an inevitable selection bias, these datasets still represent a valuable asset for fairness audits, if coupled with robust estimation approaches. Indeed, several quantification methods tested in this work provide precise estimates of demographic disparity despite the distribution shift between training and testing caused by selection bias, and the other distribution shifts that arise in the context of human processes. This is an important improvement over CC and PCC, previously studied in the algorithmic fairness literature as the threshold estimator and weighted estimator (Chen et al., 2019). SLD strikes the best balance in performance across all protocols; we suggest its adoption, especially when the distribution shift between development and deployment conditions has not been carefully characterized. Moreover, while the development of proxy methods typically comes with a potential for misuse on individuals (e.g., profiling), quantification approaches demonstrate the potential to circumvent this issue. In more detail, from the above experimental section we summarize the following trends concerning different approaches to measuring demographic parity under unawareness.
Fairness under unawareness can be measured using quantification, for both vanilla and fairness-aware classifiers. Group fairness under unawareness can be cast as a prevalence estimation problem and effectively solved by methods of proven consistency from the quantification literature. We demonstrate several estimators that outperform the previously proposed methods (Chen et al., 2019), corresponding to CC and PCC, i.e., two weak baselines in the quantification literature.
CC is suboptimal. Naïve Classify-and-Count represents the default approach for practitioners unaware of quantification. Ad hoc quantification methods outperform CC in most combinations of 5 protocols, 3 datasets, and 2 underlying learners.
PCC suffers under distribution shift. As long as the underlying posteriors are wellcalibrated, PCC is a strong performer. However, when its training set and test set have different prevalence values for the sensitive attribute S, a common situation in practice, PCC displays a systematic estimation bias, which increases sharply with the prior probability shift between training and test.
HDy, ACC and PACC deteriorate in the small data regime. These methods require splitting their training set (that is, the auxiliary set D 2 ), so their performance drops faster when its cardinality is small. PACC and ACC display good median performance but a large variance; the former method always outperforms the latter.
SLD strikes a good balance. This method was shown to be the best performer under (the inevitable) distribution shift between the auxiliary set D 2 and the test set D 3 , with a moderate performance decrease when |D 2 | becomes small. However, in situations where it is not possible to maintain separate quantifiers for positively and negatively predicted instances, its performance may drop substantially.
Decoupling is possible. Methods such as SLD and PACC fare much better than CC in estimating group-level quantities (such as demographic parity), while if misused for individual classification of sensitive attributes, the improvement is minor (SLD) or zero (PACC).

Conclusion
Measuring the differential impact of models on groups of individuals is important to understand their effects in the real world and their tendency to encode and reinforce divisions and privilege across sensitive attributes. Unfortunately, in practice, demographic attributes are often not available. In this work, we have taken the perspective of responsible practitioners, interested in internal fairness audits of production models. We have proposed a novel approach to measure group fairness under unawareness of sensitive attributes, utilizing methods from the quantification literature. These methods are specifically designed for group-level prevalence estimation rather than individual-level classification. Since practitioners who try to measure fairness under unawareness are precisely interested in prevalence estimates of sensitive attributes (Proposition 1), it is useful for the fairness and quantification communities to exchange lessons.
We have studied the problem of estimating a classifier's fairness under unawareness of sensitive attributes, with access to a disjoint auxiliary set of data for which demographic information is available. We have shown how this can be cast as a quantification problem, and solved with established approaches of proven consistency. We have conducted a detailed empirical evaluation of different methods and their properties focused on demographic parity. Drawing from the algorithmic fairness literature, we have identified five important factors for this problem, associating each of them with a formal evaluation protocol. We have tested several quantification-based approaches, which, under realistic assumptions for an internal fairness audit, outperform previously proposed estimators in the fairness literature. We have discussed their benefits and limitations, including the unbiasedness guarantees of some methods, and the potential for misuse at an individual level.
Future work includes a deeper study of the relation between classification and quantification performance, and of the extent to which these two objectives can be decoupled. It would be interesting to explicitly target decoupling through learners aimed at maximizing quantification performance subject to a low classification performance constraint. Ideally, decoupling should provide precise privacy guarantees to individuals while allowing for precise group-level estimates. Another important avenue for future work is the study of confidence intervals for fairness estimates provided by quantification methods. A reliable indication of confidence for estimates of group fairness may be invaluable for a practitioner arguing for resources and attention to the disparate effects of a model on different populations. Finally, the estimators presented in this work may be plugged into optimization procedures aimed at improving, rather than measuring, algorithmic fairness. Mixed loss functions, jointly optimizing accuracy and fairness, can be optimized, even under unawareness of sensitive attributes, with our methods providing fairness estimates at each iteration. It will be interesting to evaluate fairness estimators in this broader context and to extend them, e.g., to ranking problems and counterfactual settings.
Plugging them into the denominator yields the estimator δ̂ S h . The equivalence between CC and TE is straightforward.
In this appendix we report the results of experiments, analogous to the ones in Sections 5.6-5.9, where quantifiers are wrapped around an SVM classifier rather than an LR classifier. The experimental protocols are summarized in Tables 9-13. The ablation study is depicted in Figures 10 and 11. Experiments on decoupling the quantification performance of a model from its classification performance are reported in Figures 12 and 13.

Figure 12: Performance of SVM-based methods CC, SLD and PACC on the Adult dataset when used for quantification (MAE, lower is better) and classification (F 1 and accuracy, higher is better) under protocol sample-prev-D 2 . The classification performance of PACC is equivalent to that of CC (both equal to the performance of the underlying SVM), and we thus omit it for readability.

Figure 13: Performance of SVM-based methods CC, SLD and PACC on the Adult dataset when used for quantification (MAE, lower is better) and classification (F 1 and accuracy, higher is better) under protocol sample-prev-D 3 . The classification performance of PACC is equivalent to that of CC (both equal to the performance of the underlying SVM), and we thus omit it for readability.

Pseudocode 5: Protocol sample-prev-D 1 .