Loss Functions, Axioms, and Peer Review

It is common to see a handful of reviewers reject a highly novel paper, because they view, say, extensive experiments as far more important than novelty, whereas the community as a whole would have embraced the paper. More generally, the disparate mapping of criteria scores to final recommendations by different reviewers is a major source of inconsistency in peer review. In this paper we present a framework inspired by empirical risk minimization (ERM) for learning the community's aggregate mapping. The key challenge that arises is the specification of a loss function for ERM. We consider the class of $L(p, q)$ loss functions, which is a matrix-extension of the standard class of $L_p$ losses on vectors; here the choice of the loss function amounts to choosing the hyperparameters $p, q \in [1, \infty]$. To deal with the absence of ground truth in our problem, we instead draw on computational social choice to identify desirable values of the hyperparameters $p$ and $q$. Specifically, we characterize $p = q = 1$ as the only choice of these hyperparameters that satisfies three natural axiomatic properties. Finally, we implement and apply our approach to reviews from IJCAI 2017.


Introduction
The essence of science is the search for objective truth, yet scientific work is typically evaluated through peer review, a notoriously subjective process (Church, 2005; Lamont, 2009; Bakanic et al., 1987; Hojat et al., 2003; Mahoney, 1977; Kerr et al., 1977). One prominent source of subjectivity is the disparity across reviewers in terms of their emphasis on the various criteria used for the overall evaluation of a paper. Lee (2015) refers to this disparity as commensuration bias, and describes it as follows: "In peer review, reviewers, editors, and grant program officers must make interpretive decisions about how to weight the relative importance of qualitatively different peer review criteria - such as novelty, significance, and methodological soundness - in their assessments of a submission's final/overall value. Not all peer review criteria get equal weight; further, weightings can vary across reviewers and contexts even when reviewers are given identical instructions." Lee (2015) further argues that commensuration bias "illuminates how intellectual priorities in individual peer review judgments can collectively subvert the attainment of communitywide goals" and that it "permits and facilitates problematic patterns of publication and funding in science." There have been, however, very few attempts to address this problem.
A fascinating exception, which serves as a case in point, is the 27th AAAI Conference on Artificial Intelligence (AAAI 2013). Reviewers were asked to score papers, on a scale of 1-6, according to the following criteria: technical quality, experimental analysis, formal analysis, clarity/presentation, novelty of the question, novelty of the solution, breadth of interest, and potential impact. The admirable goal of the program chairs was to select "exciting but imperfect papers" over "safe but solid" papers, and, to this end, they provided detailed instructions on how to map the foregoing criteria to an overall recommendation. For example, the preimage of 'strong accept' is "a 5 or 6 in some category, no 1 in any category," that is, reviewers were instructed to strongly accept a paper that has a 5 or 6 in, say, clarity, but is below average according to each and every other criterion (i.e., a clearly boring paper). It turns out that the handcrafted mapping did not work well, and many of the reviewers chose to not follow these instructions. Indeed, handcrafting such a mapping requires specifying an 8-dimensional function, which is quite a non-trivial task. Consequently, in this paper we do away with a manual handcrafting approach to this problem.
Instead, we propose a data-driven approach based on ideas from machine learning, designed to learn a mapping from criteria scores to recommendations capturing the opinion of the entire (reviewer) community. From a machine learning perspective, the examples are reviews, each consisting of criteria scores (the input point) and an overall recommendation (the label). We make the assumption that each reviewer has a monotonic mapping in mind, in the sense that a paper whose scores are at least as high as those of another paper on every criterion would receive an overall recommendation that is at least as high; the reviews submitted by a particular reviewer can be seen as observations of that mapping. Given this data, our goal is to learn a single monotonic mapping that minimizes a loss function (which we discuss momentarily). We can then apply this mapping to the criteria scores associated with each review to obtain new overall recommendations (which can either replace the original ones or can be provided alongside the original ones as additional information for the program chairs).
Our approach to learning this mapping is inspired by empirical risk minimization (ERM). In more detail, for some loss function, our approach is to find a mapping that, among all monotonic mappings from criteria scores to the overall scores, minimizes the loss between its outputs and the overall scores given by reviewers across all reviews. However, the choice of loss function may significantly affect the final outcome, so that choice is a key issue. Specifically, we focus on the family of $L(p, q)$ loss functions, with hyperparameters $p, q \in [1, \infty]$, which is a matrix-extension of the more popular family of $L_p$ losses on vectors. Our question, then, is: What values of the hyperparameters $p \in [1, \infty]$ and $q \in [1, \infty]$ in the specification of the $L(p, q)$ loss function should be used?
A challenge we must address is the absence of any ground truth in peer review. To this end, we take the perspective of computational social choice (Brandt et al., 2016), since our framework aggregates individual opinions over mappings into a consensus mapping. From this viewpoint, it is natural to select the loss function so that the resulting aggregation method satisfies socially desirable properties, such as consensus (if all reviewers agree then the aggregate mapping should coincide with their recommendations), efficiency (if one paper dominates another then its overall recommendation should be at least as high), and strategyproofness (reviewers cannot pull the aggregate mapping closer to their own recommendations by misreporting them).
With this background, the main contributions of this paper are as follows. We first provide a principled framework for addressing the issue of subjectivity regarding the various criteria in peer review.
Our main theoretical result is a characterization theorem that gives a decisive answer to the question of choosing the loss function for ERM: the three aforementioned properties are satisfied if and only if the hyperparameters are set as $p = q = 1$. This result singles out an instantiation of our approach that we view as particularly attractive and well grounded.
We also provide empirical results, which analyze properties of our approach when applied to a dataset of 9197 reviews from IJCAI 2017. One vignette is that the papers selected by $L(1, 1)$ aggregation have a 79.2% overlap with the actual list of accepted papers, suggesting that our approach makes a significant difference compared to the status quo (arguably for the better).
Finally, we note that the approach taken in this paper may find other applications. Indeed, the problem of selecting a loss function is ubiquitous in machine learning (Rosasco et al., 2004; Masnadi-Shirazi and Vasconcelos, 2008; Mei et al., 2018), and the axiomatic approach provides a novel way of addressing it. Going beyond loss functions, machine learning researchers frequently face the difficulty of picking an appropriate hypothesis class or values for certain hyperparameters. Thus, in problem settings where such choices must be made, particularly in emerging applications of machine learning (such as peer review), the use of natural axioms can help guide these choices.

Our Framework
Suppose there are $n$ reviewers $R = \{1, 2, \ldots, n\}$, and a set $P$ of $m$ papers, denoted using letters such as $a, b, c$. Each reviewer $i$ reviews a subset of papers, denoted by $P(i) \subseteq P$. Conversely, let $R(a)$ denote the set of all reviewers who review paper $a$. Each reviewer assigns scores to each of their papers on $d$ different criteria, such as novelty, experimental analysis, and technical quality, and also gives an overall recommendation. We denote the criteria scores given by reviewer $i$ to paper $a$ by $x_{ia}$, and the corresponding overall recommendation by $y_{ia}$. Let $X_1, X_2, \ldots, X_d$ denote the domains of the $d$ criteria scores, and let $X = X_1 \times X_2 \times \cdots \times X_d$. Also, let $Y$ denote the domain of the overall recommendations. For concreteness, we assume that each $X_k$ as well as $Y$ is the real line. However, our results hold more generally, even if these domains are non-singleton intervals in $\mathbb{R}$, for instance.
We further assume that each reviewer has a monotonic function in mind that they use to compute the overall recommendation for a paper from its criteria scores. By a monotonic function, we mean that given any two score vectors $x$ and $x'$, if $x$ is greater than or equal to $x'$ on all coordinates, then the function's value on $x$ must be at least as high as its value on $x'$. Formally, for each reviewer $i$, there exists $g_i \in \mathcal{F}$ such that $y_{ia} = g_i(x_{ia})$ for all $a \in P(i)$, where

$$\mathcal{F} = \{\, f : X \to Y \mid f(x) \ge f(x') \text{ for all } x, x' \in X \text{ with } x \ge x' \,\}$$

is the set of all monotonic functions.

Loss Functions
Recall that our goal is to use all criteria scores, and their corresponding overall recommendations, to learn an aggregate function $\hat{f}$ that captures the opinions of all reviewers on how criteria scores should be mapped to recommendations. Inspired by empirical risk minimization, we do this by computing the function in $\mathcal{F}$ that minimizes the $L(p, q)$ loss on the data. In more detail, given hyperparameters $p, q \in [1, \infty]$, we compute

$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \left[ \sum_{i \in R} \left( \sum_{a \in P(i)} \left| y_{ia} - f(x_{ia}) \right|^{p} \right)^{q/p} \right]^{1/q}. \tag{1}$$

The $L(p, q)$ loss in Equation (1) is a matrix-extension of the more common $L_p$ losses on vectors, and represents a general and popular class as we discuss below. In words, the loss is computed by taking the $L_q$ norm over the loss associated with individual reviewers, where the loss associated to a reviewer is defined as the $L_p$ norm computed on the error of $f$ with respect to the reviewer's overall recommendations (with the usual maximum in place of the sum when $p$ or $q$ is $\infty$). We refer to aggregation by minimizing $L(p, q)$ loss as defined in Equation (1) as "$L(p, q)$ aggregation."

For a function $f$, the $L(p, q)$ loss is simply the $L(p, q)$ matrix norm of the difference between the matrix $[y_{ia}]_{i \in R, a \in P}$ and the matrix $[f(x_{ia})]_{i \in R, a \in P}$ (the entry of the matrices is set to zero if the reviewer does not review the paper). The class of $L(p, q)$ norms represents the standard "entrywise" class of matrix norms. It includes various popular matrix norms as special cases such as the Frobenius norm ($p = q = 2$), the max norm ($p = q = \infty$), and the 1-norm ($p = 1$, $q = \infty$). This class has had numerous applications in machine learning, statistics, and signal processing (Kowalski, 2009; Ding et al., 2006; Kong et al., 2011; Nie et al., 2010; Zhaoshui and Cichocki, 2008; Rahimpour et al., 2017; Kashlak and Kong, 2021; Cai et al., 2011). Moreover, unlike some other matrix norm classes (like Schatten or induced norm classes) the entrywise $L(p, q)$ class is quite interpretable; for instance, the $L(1, 1)$ loss simply sums up the absolute differences between the overall scores given by reviewers and those given by the function $f$.

Equation (1) does not specify how to break ties between multiple minimizers. For concreteness, we use the minimum $L_2$ norm for tie-breaking (although our results hold under any reasonable tie-breaking method, such as the minimum $L_k$ norm for any $k \in (1, \infty)$). Formally, letting

$$\hat{\mathcal{F}} = \operatorname*{argmin}_{f \in \mathcal{F}} \left[ \sum_{i \in R} \left( \sum_{a \in P(i)} \left| y_{ia} - f(x_{ia}) \right|^{p} \right)^{q/p} \right]^{1/q}$$

be the set of all $L(p, q)$ loss minimizers, we break ties by choosing

$$\hat{f} \in \operatorname*{argmin}_{f \in \hat{\mathcal{F}}} \sum_{i \in R} \sum_{a \in P(i)} f(x_{ia})^{2}. \tag{2}$$

Observe that since the $L(p, q)$ loss and constraint set are convex, $\hat{\mathcal{F}}$ is also a convex set. Hence, $\hat{f}$ as defined by Equation (2) is unique.
Once the function $\hat{f}$ has been computed, it can be applied to every review (for all reviewers $i$ and papers $a$) to obtain a new overall recommendation $\hat{f}(x_{ia})$. There is a separate, almost orthogonal, question of how to aggregate the overall recommendations of several reviewers on a paper into a single recommendation. In our theoretical results we are agnostic to how this additional aggregation step is performed, but we return to it in our experiments in Section 4.
We remark that an alternative approach would be to learn a monotonic function $g_i : X \to Y$ for each reviewer (which best captures their recommendations), and then aggregate these functions into a single function $\hat{f}$. We chose not to pursue this approach, because in practice there are very few examples per reviewer, so it is implausible that we would be able to accurately learn the reviewers' individual functions.
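The loss inside Equation (1) can be evaluated directly for any candidate mapping. The following is a minimal sketch, assuming the reviews are stored as a dictionary from reviewer id to a list of (criteria scores, overall recommendation) pairs; the function name `lpq_loss` and this data layout are illustrative assumptions, not part of the original paper or its released code. It only evaluates the loss; it does not perform the minimization over monotonic functions.

```python
import math


def lpq_loss(reviews, f, p=1.0, q=1.0):
    """L(p, q) loss of a candidate mapping f (hypothetical helper).

    reviews: dict mapping reviewer id -> list of (criteria_scores, overall) pairs.
    f: callable taking a criteria-score tuple and returning a number.
    p, q: hyperparameters in [1, inf]; math.inf selects the max norm.
    """
    def norm(values, r):
        values = [abs(v) for v in values]
        if not values:
            return 0.0
        if math.isinf(r):
            return max(values)
        return sum(v ** r for v in values) ** (1.0 / r)

    # Inner L_p norm: one number per reviewer, over that reviewer's papers.
    per_reviewer = [
        norm([y - f(x) for x, y in items], p)
        for items in reviews.values()
    ]
    # Outer L_q norm over reviewers.
    return norm(per_reviewer, q)


# Toy data (made up for illustration): two reviewers, two criteria.
toy_reviews = {
    "r1": [((3, 4), 5), ((1, 2), 2)],
    "r2": [((3, 4), 7)],
}
toy_f = lambda x: sum(x)  # an arbitrary monotonic mapping
print(lpq_loss(toy_reviews, toy_f, p=1, q=1))
```

With $p = q = 1$ the value is just the sum of absolute errors over all reviews, which is the interpretability property noted above.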

Axiomatic Properties
In social choice theory, the most common approach for comparing different aggregation methods, primarily attributed to Arrow (1951), is to determine which desirable axioms they satisfy. We take the same approach in order to determine the values of the hyperparameters $p$ and $q$ for the $L(p, q)$ aggregation in Equation (1).
We stress that axioms are defined for aggregation methods and not aggregate functions. Informally, an aggregation method is a function that takes as input all the reviews $\{(x_{ia}, y_{ia})\}_{i \in R, a \in P(i)}$, and outputs an aggregate function $\hat{f} : X \to Y$. We do not define an aggregation method formally to avoid introducing cumbersome notation that will largely be useless later. It is clear that for any choice of hyperparameters $p, q \in [1, \infty]$, $L(p, q)$ aggregation (with tie-breaking as defined by Equation 2) is an aggregation method.
Social choice theory essentially relies on counterfactual reasoning to identify scenarios where it is clear how an aggregation method should behave. To give one example, the Pareto efficiency property of voting rules states that if all voters prefer alternative $a$ to alternative $b$, then $b$ should not be elected; this situation is extremely unlikely to occur, yet Pareto efficiency is obviously a property that any reasonable voting rule must satisfy. With this principle in mind, we identify a setting in our problem where the requirements are very clear, and then define our axioms in that setting.
For all of our axioms, we restrict attention to scenarios where every reviewer reviews every paper, that is, $P(i) = P$ for every $i$. Moreover, we assume that the papers have 'objective' criteria scores, that is, the criteria scores given to a paper are the same across all reviewers, so the only source of disagreement is how the criteria scores should be mapped to an overall recommendation. We can then denote the criteria scores of a paper $a$ simply as $x_a$, as opposed to $x_{ia}$, since they do not depend on $i$. We stress that our framework does not require these assumptions to hold; they are only used in our axiomatic characterization, namely Theorem 1 in the next section.
An axiom is satisfied by an aggregation method if its statement holds for every possible number of reviewers $n$ and number of papers $m$, and for all possible criteria scores and overall recommendations. We start with the simplest axiom, consensus, which informally states that if there is a paper such that all reviewers give it the same overall recommendation, then $\hat{f}$ must agree with the reviewers; this axiom is closely related to the unanimity axiom in social choice.
Axiom 1 (Consensus). For any paper $a \in P$, if all reviewers report identical overall recommendations $y_{1a} = y_{2a} = \cdots = y_{na} = r$ for some $r \in Y$, then $\hat{f}(x_a) = r$.
Before presenting the next axiom, we require another definition: we say that paper $a \in P$ dominates paper $b \in P$ if there exists a bijection $\sigma : R \to R$ such that for all $i \in R$, $y_{ia} \ge y_{\sigma(i)b}$. Equivalently (and less formally), paper $a$ dominates paper $b$ if the sorted overall recommendations given to $a$ pointwise-dominate the sorted overall recommendations given to $b$. Intuitively, in this situation, $a$ should receive a (weakly) higher overall recommendation than $b$, which is exactly what the axiom requires; it is similar to the classic Pareto efficiency axiom mentioned above.
Axiom 2 (Efficiency). For any pair of papers $a, b \in P$, if $a$ dominates $b$, then $\hat{f}(x_a) \ge \hat{f}(x_b)$.
Our positive result, which will be presented shortly, satisfies this notion of efficiency. On the other hand, we also use this axiom to prove a negative result; an important note is that the negative result requires a condition that is significantly weaker than the aforementioned definition of efficiency. We revisit this point about requiring a much weaker condition for the negative result at the end of Section 3.2.1.
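The dominance relation used in the efficiency axiom is easy to test through its sorted-vector characterization. The sketch below assumes each paper's overall recommendations are given as a plain list with one entry per reviewer; the function name `dominates` is an illustrative choice.

```python
def dominates(recs_a, recs_b):
    """Return True if paper a dominates paper b.

    recs_a and recs_b hold the overall recommendations given to papers a and b
    by the same set of reviewers. A bijection sigma with
    y_ia >= y_sigma(i)b for all i exists exactly when the sorted
    recommendations of a are pointwise >= the sorted recommendations of b.
    """
    if len(recs_a) != len(recs_b):
        raise ValueError("both papers must be reviewed by the same reviewers")
    return all(ya >= yb for ya, yb in zip(sorted(recs_a), sorted(recs_b)))


# Toy check: the order in which reviewers appear does not matter.
print(dominates([0, 1, 0], [0, 0, 0.5]))
```

Sorting both lists replaces the search over all bijections, so the check costs only a sort.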
Our final axiom is strategyproofness, a game-theoretic property that plays a major role in social choice theory (Moulin, 1983). For the application of peer review, we consider strategyproofness motivated by the many instances of strategic behavior uncovered and studied recently in peer review (Balietti et al., 2016; Xu et al., 2019; Vijaykumar, 2020a,b; Jecmen et al., 2020; Stelmakh et al., 2021a). Intuitively, in our problem setting, strategyproofness means that reviewers have no incentive to misreport their overall recommendations: they cannot bring the aggregate recommendations (the community's consensus about the relative importance of various criteria) closer to their own through strategic manipulation.
Axiom 3 (Strategyproofness). For each reviewer $i \in R$, and all possible manipulated recommendations $y'_i \in Y^m$, if $y_i = (y_{i1}, y_{i2}, \ldots, y_{im})$ is replaced with $y'_i$, then

$$\left\| \left( \hat{f}(x_1), \ldots, \hat{f}(x_m) \right) - y_i \right\|_2 \le \left\| \left( \hat{g}(x_1), \ldots, \hat{g}(x_m) \right) - y_i \right\|_2, \tag{3}$$

where $\hat{f}$ and $\hat{g}$ are the aggregate functions obtained from the original and manipulated reviews, respectively.
The use of the $L_2$ norm in the definition (3) of the strategyproofness axiom is made only for concreteness, and all our results hold for any norm $L_k$, $k \in [1, \infty]$.

Main Result
In Section 2, we introduced $L(p, q)$ aggregation as a family of rules for aggregating individual opinions towards a consensus mapping from criteria scores to recommendations. But that definition, in and of itself, leaves open the question of how to choose the values of $p$ and $q$ in a way that leads to the most socially desirable outcomes. The axioms of Section 2.2 allow us to give a satisfying answer to this question. Specifically, our main theoretical result is a characterization of $L(p, q)$ aggregation in terms of the three axioms.
Theorem 1. $L(p, q)$ aggregation, where $p, q \in [1, \infty]$, satisfies consensus, efficiency, and strategyproofness if and only if $p = q = 1$.
We remark that for $p = q$, Equation (1) does not distinguish between different reviewers, that is, the aggregation method pools all reviews together. We find this interesting, because the $L(p, q)$ aggregation framework does have enough power to make that distinction, but the axioms guide us towards a specific solution, $L(1, 1)$, which does not.
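Turning to the proof of the theorem, we start from the easier 'if' direction.
Lemma 1. $L(p, q)$ aggregation with $p = q = 1$ satisfies consensus, efficiency, and strategyproofness.
Proof. The key idea of the proof lies in the form taken by the minimizer of $L(1, 1)$ loss.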
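When each reviewer reviews every paper and the papers have objective criteria scores, $L(1, 1)$ aggregation reduces to computing

$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i \in R} \sum_{a \in P} \left| y_{ia} - f(x_a) \right|, \tag{4}$$

where ties are broken by picking the minimizer with minimum $L_2$ norm. We claim that the aggregate function is given by

$$\hat{f}(x_a) = \text{left-med}\left( \{ y_{ia} \}_{i \in R} \right) \quad \text{for all } a \in P,$$

where $\text{left-med}(\cdot)$ of a set of points is their left median. We prove this claim by showing four parts: (i) $\hat{f}$ is a valid function, (ii) $\hat{f}$ is an unconstrained minimizer of the objective in (4), (iii) $\hat{f}$ satisfies the constraints of (4), i.e., $\hat{f} \in \mathcal{F}$, and (iv) $\hat{f}$ has the minimum $L_2$ norm among all minimizers of (4).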
We start by proving part (i). This part can only be violated if there are two papers $a$ and $b$ such that $x_a = x_b$, but $\text{left-med}(\{y_{ia}\}_{i \in R}) \ne \text{left-med}(\{y_{ib}\}_{i \in R})$, leading to $\hat{f}$ having two function values for the same $x$-value. However, we assumed that each reviewer $i$ has a function $g_i$ used to score the papers. So, for the two papers $a$ and $b$, we would have

$$y_{ia} = g_i(x_a) = g_i(x_b) = y_{ib} \quad \text{for every } i \in R,$$

so the two left medians coincide. Therefore, $\hat{f}$ is a valid function.
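For part (ii), consider the optimization problem (4) without any constraints. Denote the objective function as $G(f)$. Rearranging terms, we obtain

$$G(f) = \sum_{a \in P} \left[ \sum_{i \in R} \left| y_{ia} - f(x_a) \right| \right]. \tag{5}$$

Consider the inner summation $\sum_{i \in R} |y_{ia} - f(x_a)|$; it is well known that this quantity is minimized when $f(x_a)$ is any median of the $\{y_{ia}\}_{i \in R}$ values. Hence, we have

$$G(\hat{f}) = \sum_{a \in P} \sum_{i \in R} \left| y_{ia} - \text{left-med}\left( \{ y_{i'a} \}_{i' \in R} \right) \right| \le \sum_{a \in P} \sum_{i \in R} \left| y_{ia} - f(x_a) \right| = G(f), \tag{6}$$

where $f$ is an arbitrary function. Therefore, $\hat{f}$ minimizes the objective function even in the absence of any constraints, proving part (ii).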
Turning to part (iii), we show that $\hat{f}$ satisfies the monotonicity constraints, i.e., $\hat{f} \in \mathcal{F}$. Suppose $a, b \in P$ are such that $x_a \ge x_b$. Using the fact that each reviewer $i$ scores papers based on the function $g_i$, we have $y_{ia} = g_i(x_a)$ and $y_{ib} = g_i(x_b)$. And since $g_i \in \mathcal{F}$ obeys monotonicity constraints, we obtain $y_{ia} \ge y_{ib}$ for every $i$. This trivially implies that $\text{left-med}(\{y_{ia}\}_{i \in R}) \ge \text{left-med}(\{y_{ib}\}_{i \in R})$, that is, $\hat{f}(x_a) \ge \hat{f}(x_b)$.

Finally, we prove part (iv). Observe that Equation (6) is a strict inequality if there is a paper $a$ for which $f(x_a)$ is not a median of the $\{y_{ia}\}_{i \in R}$ values. In other words, the only functions $f$ that have the same objective function value as $\hat{f}$ are of the form

$$f(x_a) \in \text{med}\left( \{ y_{ia} \}_{i \in R} \right) \quad \text{for all } a \in P, \tag{7}$$

where $\text{med}(\cdot)$ of a collection of points is the set of all points between (and including) the left and right medians. Hence, all other minimizers of (4) must satisfy Equation (7). Observe that $\hat{f}$ is pointwise smaller than any of these functions, since it computes the left median at each of the $x$-values. Therefore, $\hat{f}$ has the minimum $L_2$ norm among all possible minimizers of (4), completing the proof of part (iv).

Combining all four parts proves that $\hat{f}$ is indeed the aggregate function chosen by $L(1, 1)$ aggregation. We use this to prove that $L(1, 1)$ aggregation satisfies consensus, efficiency and strategyproofness.
Consensus. Let $a \in P$ be a paper such that $y_{1a} = y_{2a} = \cdots = y_{na} = r$ for some $r$. Then, $\text{left-med}(\{y_{ia}\}_{i \in R}) = r$. Hence, $\hat{f}(x_a) = r$, satisfying consensus.
Efficiency. Let $a, b \in P$ be such that $a$ dominates $b$. In other words, the sorted overall recommendations given to $a$ pointwise-dominate the sorted overall recommendations given to $b$. So, by definition, $\text{left-med}(\{y_{ia}\}_{i \in R})$ is at least as large as $\text{left-med}(\{y_{ib}\}_{i \in R})$. That is, $\hat{f}(x_a) \ge \hat{f}(x_b)$, satisfying efficiency.
Strategyproofness. Let $i$ be an arbitrary reviewer. Observe that in this setting, the aggregate score $\hat{f}(x_a)$ of a paper $a$ depends only on the score $y_{ia}$ and not on other scores $\{y_{ib}\}_{b \ne a}$ given by reviewer $i$. In other words, the only way to manipulate $\hat{f}(x_a) = \text{left-med}(\{y_{i'a}\}_{i' \in R})$ is by changing $y_{ia}$. Consider three cases. Suppose $y_{ia} < \hat{f}(x_a)$. In this case, if reviewer $i$ reports $y'_{ia} \le \hat{f}(x_a)$, then there is no change in the aggregate score of $a$. On the other hand, if $y'_{ia} > \hat{f}(x_a)$, then either the aggregate score of $a$ remains the same or increases, making things only worse for reviewer $i$. The other case of $y_{ia} > \hat{f}(x_a)$ is symmetric to $y_{ia} < \hat{f}(x_a)$. Consider the third case, $y_{ia} = \hat{f}(x_a)$. In this case, manipulation can only make things worse since we already have $|y_{ia} - \hat{f}(x_a)| = 0$. In summary, reporting $y'_{ia}$ instead of $y_{ia}$ cannot help decrease $|y_{ia} - \hat{f}(x_a)|$. Also, recall that $y'_{ia}$ does not affect the aggregate scores of other papers, and hence manipulation of $y_{ia}$ does not help them either. Therefore, by manipulating any of the $y_{ia}$ scores, reviewer $i$ cannot bring the aggregate recommendations closer to her own, proving strategyproofness.
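The left-median form derived in the proof above can be exercised on a small example. The sketch below assumes the restricted setting of the axioms (every reviewer reviews every paper, objective criteria scores, reports consistent with monotonic reviewer functions); the helper names and the toy data are illustrative assumptions. It brute-forces one reviewer's possible misreports on a small integer grid and records how far the resulting aggregate lies from her true recommendations.

```python
def left_median(values):
    """Left (lower) median: the ceil(n/2)-th smallest value."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def l11_aggregate(recs):
    """recs[i][a] = overall recommendation of reviewer i for paper a.

    Returns the per-paper left median, i.e. the L(1, 1) aggregate in the
    restricted setting described in the lead-in.
    """
    n_papers = len(recs[0])
    return [left_median([row[a] for row in recs]) for a in range(n_papers)]


def l2_distance(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5


# Toy data: 4 reviewers, 2 papers; all reviewers agree on paper 1.
recs = [[1, 4], [3, 4], [5, 4], [7, 4]]
truthful = l11_aggregate(recs)
honest_gap = l2_distance(truthful, recs[0])

# Reviewer 0 tries every integer report in 0..10 for paper 0.
gaps = []
for lie in range(0, 11):
    manipulated = [[lie, recs[0][1]]] + recs[1:]
    gaps.append(l2_distance(l11_aggregate(manipulated), recs[0]))

print(truthful, honest_gap, min(gaps))
```

On this toy instance no misreport brings the aggregate closer to reviewer 0's true scores than honest reporting does, and the unanimous paper keeps its unanimous score, in line with the strategyproofness and consensus arguments.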

Violation of the Axioms When $(p, q) \ne (1, 1)$
We now tackle the harder 'only if' direction of Theorem 1. We do so in three steps: efficiency is violated by $p \in (1, \infty)$ and $q = 1$ (Lemma 2), strategyproofness is violated by $L(p, q)$ aggregation for all $q > 1$ (Lemma 3), and consensus is violated by $p = \infty$ and $q = 1$ (Lemma 4). Together, the three lemmas leave $p = q = 1$ as the only option. Below we state the lemmas and give some proof ideas; the theorem's full proof is relegated to Appendix A.
It is worth noting that, although we have presented the lemmas as components in the proof of Theorem 1, they also have standalone value (some more than others). For example, if one decided that only strategyproofness is important, then Lemma 3 below would give significant guidance on choosing an appropriate method.

Violation of Efficiency
In our view, the following lemma presents the most interesting and counter-intuitive result in the paper.
Lemma 2. $L(p, q)$ aggregation with $p \in (1, \infty)$ and $q = 1$ violates efficiency.
It is quite surprising that such reasonable loss functions violate the simple requirement of efficiency. In what follows we explain this phenomenon via a connection between our problem and the notion of the 'Fermat point' of a triangle (Spain, 1996). The explanation provided here demonstrates the negative result for $L(2, 1)$ aggregation. The complete proof of the lemma for general values of $p \in (1, \infty)$ is quite involved, as can be seen in Appendix A.
The construction of the negative result is illustrated in Figure 1 and described in more detail here. Consider a setting with 3 reviewers and 2 papers, where each reviewer reviews both papers. We let $x_1$ and $x_2$ denote the respective objective criteria scores of the two papers. Assume that no score in $\{x_1, x_2\}$ is pointwise greater than or equal to the other score in that set; an example is shown in Figure 1(a). Let the overall recommendations given by the reviewers be $y_{11} = 0$, $y_{21} = 1$, $y_{31} = 0$ to the first paper and $y_{12} = 0$, $y_{22} = 0$, and $y_{32} = z$ to the second paper. Under these scores, let $\hat{f}$ denote the aggregate function that minimizes the $L(2, 1)$ loss. We see that in this data, when $z < 1$, the first paper dominates the second, and hence the efficiency axiom mandates $\hat{f}(x_1) \ge \hat{f}(x_2)$.

Figure 1(b): $(\hat{f}(x_1), \hat{f}(x_2))$ is the Fermat point of the triangle with vertices $(y_{11}, y_{12})$, $(y_{21}, y_{22})$, $(y_{31}, y_{32})$. The Fermat point is (.21, .21) when $z = 1$ (black circle), but (.12, .15) when $z = 1/2$ (red triangle).

The outcome of the $L(2, 1)$ aggregation is related to the notion of the Fermat point of a triangle. The Fermat point of a triangle is a point such that the sum of its (Euclidean) distances from all three vertices is minimum. Consider a triangle in $\mathbb{R}^2$ with vertices $(y_{11}, y_{12})$, $(y_{21}, y_{22})$, $(y_{31}, y_{32})$; see Figure 1(b). Then by definition, the Fermat point of this triangle is exactly $(\hat{f}(x_1), \hat{f}(x_2))$. Intuitively, the $p = 2$ in the $L(p = 2, q = 1)$ loss relates to the Euclidean distances used in the Fermat point, and the $q = 1$ relates to summing the distances to all vertices. Hence $L(2, 1)$ aggregation with $z = 1/2$ yields $\hat{f}(x_1) = .12 < .15 = \hat{f}(x_2)$ even though the first paper dominates the second, violating efficiency.
As a final but important remark, the proof of Lemma 2 only requires a significantly weaker notion of efficiency. In this weaker notion, we first consider two papers 1 and 2 such that their reviews are symmetric: formally, switching the labels "1" and "2" and switching the labels of some reviewers and criteria leaves the data unchanged. The weaker version of efficiency says that reducing the review scores of paper 2 mandates $\hat{f}(x_1) \ge \hat{f}(x_2)$. In the example of Figure 1(a), when $z = 1$, switching the labels of the two papers, the labels of reviewers 2 and 3, and the labels of the two criteria yields data identical to the original. In the example above, reducing $z$ to $z < 1$ breaks the symmetry and makes paper 2 inferior to paper 1 in this data. The axiom requires $\hat{f}(x_1) \ge \hat{f}(x_2)$ in this case.
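The Fermat-point construction above can be reproduced numerically. The sketch below uses Weiszfeld's fixed-point iteration, a standard method for minimizing a sum of Euclidean distances; this choice is an assumption made for illustration and is not necessarily how the authors computed Figure 1(b). It relies on the fact that $x_1$ and $x_2$ are incomparable, so the monotonicity constraint does not bind and the $L(2, 1)$ minimizer is the unconstrained Fermat point. The function names are hypothetical.

```python
def fermat_point(vertices, iterations=10000):
    """Approximate the point minimizing the sum of Euclidean distances to
    `vertices` with Weiszfeld's iteration, starting from the centroid."""
    x = sum(v[0] for v in vertices) / len(vertices)
    y = sum(v[1] for v in vertices) / len(vertices)
    for _ in range(iterations):
        wx = wy = wsum = 0.0
        for vx, vy in vertices:
            dist = ((x - vx) ** 2 + (y - vy) ** 2) ** 0.5
            if dist < 1e-12:
                # Iterate sits on a vertex; return it to avoid dividing by zero.
                return vx, vy
            wx += vx / dist
            wy += vy / dist
            wsum += 1.0 / dist
        x, y = wx / wsum, wy / wsum
    return x, y


def l21_aggregate(z):
    """(f(x_1), f(x_2)) for the 3-reviewer, 2-paper construction.

    Vertex i is (y_i1, y_i2): reviewer 1 gives (0, 0), reviewer 2 gives
    (1, 0), and reviewer 3 gives (0, z).
    """
    return fermat_point([(0.0, 0.0), (1.0, 0.0), (0.0, z)])


sym = l21_aggregate(1.0)
asym = l21_aggregate(0.5)
print(sym, asym)
```

For $z = 1$ the two coordinates should coincide, and for $z = 1/2$ the first coordinate should come out smaller than the second, close to the values reported in Figure 1(b); that ordering is exactly the efficiency violation.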

Violation of Strategyproofness
Lemma 3. $L(p, q)$ aggregation with $q \in (1, \infty]$ violates strategyproofness.
We prove the lemma via a simple construction with just one paper and two reviewers, who give the paper overall recommendations of 1 and 0, respectively. For $q \in (1, \infty)$, the aggregate score is

$$\hat{f} = \operatorname*{argmin}_{f \in \mathbb{R}} \left( |1 - f|^{q} + |f|^{q} \right)^{1/q},$$

and for $q = \infty$, it is

$$\hat{f} = \operatorname*{argmin}_{f \in \mathbb{R}} \max\left\{ |1 - f|, |f| \right\}.$$

Either way, the unique minimum is obtained at an aggregate score of 0.5. If reviewer 1 reported an overall recommendation of 2, however, the aggregate score would be 1, which matches her 'true' recommendation, thereby violating strategyproofness. See Appendix A.2 for the complete proof.
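The one-paper, two-reviewer construction above is small enough to check by brute force. The following sketch minimizes the loss over a fine grid of candidate aggregate scores; the grid search, its bounds, and the function name are assumptions chosen for simplicity, not the method of the paper.

```python
def aggregate_one_paper(reports, q, grid_steps=4000, lo=-1.0, hi=3.0):
    """Grid-search minimizer of the L_q norm of (reports - f) for one paper.

    With a single paper per reviewer the inner L_p norm is just |y_i - f|,
    so p plays no role. q may be float('inf').
    """
    def loss(f):
        errs = [abs(y - f) for y in reports]
        if q == float("inf"):
            return max(errs)
        return sum(e ** q for e in errs) ** (1.0 / q)

    candidates = [lo + (hi - lo) * k / grid_steps for k in range(grid_steps + 1)]
    return min(candidates, key=loss)


honest = aggregate_one_paper([1, 0], q=2)        # truthful reports
manipulated = aggregate_one_paper([2, 0], q=2)   # reviewer 1 reports 2
print(honest, manipulated)
```

Under truthful reports the minimizer sits midway between the two recommendations, while the exaggerated report pulls it onto reviewer 1's true recommendation, which is the manipulation described in the text.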

Violation of Consensus
Lemma 4. $L(p, q)$ aggregation with $p = \infty$ and $q = 1$ violates consensus.
Lemma 4 is established via another simple construction: two papers, two reviewers, and a matrix of overall recommendations $[y_{ia}]$, where $y_{ia}$ denotes the overall recommendation given by reviewer $i$ to paper $a$. Crucially, the two reviewers agree on an overall recommendation of 1 for paper 2, hence the aggregate score of this paper must also be 1. But we show that $L(\infty, 1)$ aggregation would not return an aggregate score of 1 for paper 2. The formal proof appears in Appendix A.3.

Implementation and Experimental Results
In this section, we provide an empirical analysis of a few aspects of peer review through the approach of this paper. We employ a dataset of reviews from the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), which was made available to us by the program chair. To our knowledge, we are the first to use this dataset. At submission time, authors were asked if review data for their paper could be included in an anonymized dataset, and, similarly, reviewers were asked whether their reviews could be included; the dataset provided to us consists of all reviews for which permission was given. Each review is tagged with a reviewer ID and paper ID, which are anonymized for privacy reasons. The criteria used in the conference are 'originality', 'relevance', 'significance', 'quality of writing' (which we call 'writing'), and 'technical quality' (which we call 'technical'), and each is rated on a scale from 1 to 10. Overall recommendations are also on a scale from 1 to 10. In addition, information about which papers were accepted and which were rejected is included in the dataset.
The number of papers in the dataset is 2380, of which 649 were accepted, which amounts to 27.27%. This is a large subset of the 2540 submissions to the conference, of which 660 were accepted, for an actual acceptance rate of 25.98%. The number of reviewers in the dataset is 1725, and the number of reviews is 9197. All but nine papers in the dataset have three reviews (485 papers), four reviews (1734 papers), or five reviews (152 papers). Table 1 shows the distribution of the number of papers reviewed by reviewers.
We apply $L(1, 1)$ aggregation (i.e., $p = q = 1$), as given in Equation (1), to this dataset to learn the aggregate function. Let us denote that function by $\hat{f}$. (Code is available at https://github.com/ritesh-noothigattu/choosing-how-to-choose-papers.) The optimization problem in Equation (1) is convex, and standard optimization packages can efficiently compute the minimizer. Hence, importantly, computational complexity is a nonissue in terms of implementing our approach.
Once we compute the aggregate function $\hat{f}$, we calculate the aggregate overall recommendation of each paper $a$ by taking the median of the aggregate reviewer scores for that paper obtained by applying $\hat{f}$ to the objective scores:

$$y_{\hat{f}}(a) = \operatorname{median}\left( \{ \hat{f}(x_{ia}) \}_{i \in R(a)} \right). \tag{8}$$

In case of multiple medians in (8), we took the mean of all medians. Recalling that 27.27% of the papers in the dataset were actually accepted to the conference, in our experiments we define the set of papers accepted by the aggregate function $\hat{f}$ as the top 27.27% of papers according to their respective $y_{\hat{f}}$ values. We now present the specific experiments we ran, and their results.
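The per-paper aggregation step and the top-fraction selection described above can be sketched as follows. The review tuple layout, the function names, and the tie rule used when ranking papers are assumptions for illustration; the learned mapping is passed in as an arbitrary callable `f_hat` rather than computed here.

```python
from statistics import median


def paper_scores(reviews, f_hat):
    """reviews: iterable of (reviewer_id, paper_id, criteria_scores).

    Returns paper_id -> median of f_hat over that paper's reviews.
    statistics.median averages the two middle values for an even count,
    which matches taking the mean of all medians.
    """
    by_paper = {}
    for _reviewer, paper, x in reviews:
        by_paper.setdefault(paper, []).append(f_hat(x))
    return {paper: median(vals) for paper, vals in by_paper.items()}


def top_fraction(scores, fraction):
    """Ids of the top `fraction` of papers by score.

    Ties are broken by paper id; this tie rule is an assumption, since the
    text does not specify one.
    """
    k = round(fraction * len(scores))
    ranked = sorted(scores, key=lambda paper: (-scores[paper], paper))
    return set(ranked[:k])


# Toy data with a single criterion and a stand-in mapping.
toy_reviews = [
    ("r1", "A", (8,)), ("r2", "A", (6,)),
    ("r1", "B", (3,)), ("r3", "B", (5,)),
    ("r2", "C", (9,)),
]
toy_scores = paper_scores(toy_reviews, lambda x: x[0])
print(toy_scores, top_fraction(toy_scores, 0.34))
```

In the experiments the fraction would be 0.2727, the acceptance rate within the dataset.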

Varying Number of Reviewers
In our first experiment, for each value of a parameter $k \in \{1, \ldots, 5\}$, we subsampled $k$ distinct reviews for each paper uniformly at random from the set of all reviews for that paper (if the paper had fewer than $k$ to begin with then we retained all the reviews). We then computed an aggregate function, $\hat{f}_k$, via $L(1, 1)$ aggregation applied only to these subsampled reviews. Next, we found the set of top 27.27% papers as given by $\hat{f}_k$ applied to the subsampled reviews. Finally, we compared the overlap of this set of top papers for every value of $k$ with the set of top 27.27% papers as dictated by the overall aggregate function $\hat{f}$. The results from this experiment are plotted in Figure 2, and lead to several observations. First, the incremental overlap from $k = 4$ to 5 is very small because there are very few papers that had 5 or more reviews. Second, we see that the amount of overlap monotonically increases with the number of reviewers per paper $k$, thereby serving as a sanity check on the data as well as our methods. Third, we observe the overlap to be quite high (≈ 60%) even with a single reviewer per paper.
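The subsampling and overlap steps of this experiment can be sketched as below. The data layout (a dictionary from paper id to its list of reviews), the fixed random seed, and the function names are illustrative assumptions; fitting $\hat{f}_k$ on the subsample is left out.

```python
import random


def subsample_reviews(reviews_by_paper, k, seed=0):
    """Keep k reviews per paper, drawn uniformly at random without
    replacement; papers with at most k reviews keep all of them."""
    rng = random.Random(seed)
    return {
        paper: list(revs) if len(revs) <= k else rng.sample(list(revs), k)
        for paper, revs in reviews_by_paper.items()
    }


def overlap(selected, reference):
    """Fraction of the reference set that also appears in `selected`.

    Both sets have the same size in the experiment (the top 27.27% of
    papers), so the choice of denominator does not matter there.
    """
    return len(set(selected) & set(reference)) / len(reference)


toy = {"A": [1, 2, 3, 4], "B": [5, 6]}
sub = subsample_reviews(toy, k=3)
print(sub, overlap({"A", "B"}, {"A", "C"}))
```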

Loss Per Reviewer
Next, we look at the loss of different reviewers, under $\hat{f}$ (obtained by $L(1, 1)$ aggregation). In order for the losses to be on the same scale, we normalize each reviewer's loss by the number of papers reviewed by them. Formally, the normalized loss of reviewer $i$ (for $p = 1$) is

$$\frac{1}{|P(i)|} \sum_{a \in P(i)} \left| y_{ia} - \hat{f}(x_{ia}) \right|.$$

The normalized loss averaged across reviewers is found to be 0.470, and the standard deviation is 0.382. Figure 3 shows the distribution of the normalized loss of all the reviewers. Note that the normalized loss of a reviewer can fall in the range [0, 9]. These results thus indicate that the function $\hat{f}$ is indeed at least a reasonable representation of the mapping of the broader community.
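The normalized loss above is a per-reviewer mean absolute error. A minimal sketch, assuming reviews are grouped by reviewer as (criteria scores, overall recommendation) pairs and that `f_hat` is any callable standing in for the learned mapping; the names are hypothetical.

```python
def normalized_losses(reviews, f_hat):
    """reviews: dict reviewer_id -> list of (criteria_scores, overall).

    Returns reviewer_id -> mean absolute difference between the reviewer's
    overall recommendations and f_hat on the same criteria scores.
    Reviewers with no reviews are skipped.
    """
    return {
        reviewer: sum(abs(y - f_hat(x)) for x, y in items) / len(items)
        for reviewer, items in reviews.items()
        if items
    }


toy_reviews = {
    "r1": [((5,), 6), ((2,), 2)],
    "r2": [((7,), 4)],
}
toy_losses = normalized_losses(toy_reviews, lambda x: x[0])
print(toy_losses)
```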

Overlap of Accepted Papers
We also compute the overlap between the set of top 27.27% papers selected by L(1, 1) aggregation f and the actual 27.27% accepted papers. It is important to emphasize that we believe the set of papers selected by our method is better than any hand-crafted or rule-based decision using the scores, since this aggregate represents the opinion of the community. Hence, to be clear, we do not have a goal of maximizing the overlap. Nevertheless, a very small overlap would mean that our approach is drastically different from standard practice, which would potentially be disturbing. We find that the overlap is 79.2%, which we think is quite fascinating: our approach does make a significant difference, but the difference is not so drastic as to be disconcerting.
Out of intellectual curiosity, we also computed the pairwise overlaps of the papers accepted by L(p, q) aggregation, for p, q ∈ {1, 2, 3}. We find that the choice of the reviewer-norm hyperparameter q has more influence than the paper-norm hyperparameter p; we refer the reader to Appendix B.1 for details.

A Visualization of the Learnt Mapping
In Appendix B.2 we present visualizations and interpretations of the L(1, 1) aggregate mapping learnt from the IJCAI 2017 data, which provide insights into the preferences of the community. We present here the key takeaways based on visual inspection of the visualizations, and refer the reader to the appendix for more detail. First, we observe that writing and relevance do not have a significant influence on the overall recommendations: really bad writing or relevance is a significant downside, excellent writing or relevance is appreciated, but everything else in between is irrelevant. Second, technical quality and significance exert a high and approximately linear influence. Third, if modeling this mapping, linear models are partially applicable: for some criteria one may indeed assume a linear model, but not for all.

Limitations, Discussion, and Open Problems
We address the problem of subjectivity in peer review by combining approaches from machine learning and social choice theory. A key challenge in the setting of peer review (e.g., when choosing a loss function) is the absence of ground truth, and we overcome this challenge via a principled, axiomatic approach.
Our work also contributes to recent endeavors in understanding the peer review process (Lawrence and Cortes, 2014; Shah et al., 2018; Tomkins et al., 2017; Stelmakh et al., 2021b, 2020, 2021c). Specifically, the mapping learnt via L(1, 1) aggregation can be used to understand the community's aggregate preferences over various criteria. We illustrate this via an empirical analysis in Section 4 and in Appendix B.
A critical aspect of peer review is confidentiality or privacy (Ding et al., 2020) with respect to who reviewed which paper. There are other values of p and q where the aggregate mapping could potentially reveal some information about individual reviewers (e.g., that two specific reviews were written by the same person). On the other hand, L(1, 1) aggregation performs the optimization (1) by simply pooling all reviews together, and does not use any association of who reviewed which paper. It thus guards against this issue, and can even be executed on publicly available data for conferences following open review (i.e., where all reviews are public but reviewer identities are private).
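The pooling property can be illustrated with a deliberately simplified sketch. It is not the paper's optimization (1): it drops any constraint on the function class (such as monotonicity) and minimizes the pooled L1 loss independently at each distinct criteria vector, where any median of the overall scores is a minimizer. The data and the function name are hypothetical. The point is only that the input carries no reviewer or paper identifiers.

```python
from collections import defaultdict
from statistics import median_low

def pooled_pointwise_median(reviews):
    """Minimize the pooled sum of absolute errors at each criteria vector.

    `reviews` is an iterable of (criteria_scores, overall_score) pairs,
    with no reviewer or paper identifiers. median_low returns the
    smallest median when the number of scores at a point is even.
    """
    by_point = defaultdict(list)
    for criteria, overall in reviews:
        by_point[tuple(criteria)].append(overall)
    return {x: median_low(ys) for x, ys in by_point.items()}

pooled = [((5, 6), 4), ((5, 6), 6), ((5, 6), 7), ((8, 8), 9), ((8, 8), 7)]
f_hat = pooled_pointwise_median(pooled)
```

Shuffling the order of `pooled` leaves `f_hat` unchanged, which reflects the fact that the pooled objective does not depend on who wrote which review.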
One can think of the theoretical results of Section 3 as supporting L(1, 1) aggregation using the tools of social choice theory, whereas the empirical results of Section 4 focus on studying its behavior on real data. Understanding this helps clear up a possible source of confusion: are we not overfitting by training on a set of reviews, and then applying the aggregate function to the same reviews? The answer is negative, because the process of learning the function f amounts to an aggregation of opinions about how criteria scores should be mapped to overall recommendations. Applying it to the data yields recommendations in Y, whereas this function from X to Y lives in a different space.
We now conclude with a discussion of the limitations of our work and relevant open problems.
• Our framework assumes that the set of criteria listed by the program chairs encapsulates the criteria used by any reviewer for evaluating a paper. To address situations where this is violated, the program chairs may solicit information on the insufficiency of the criteria from reviewers directly, and this information can also help improve the choice of criteria used in subsequent conferences. On a technical front, this leads to an open problem of designing statistical methods to detect the insufficiency of given criteria in conference peer review (see also Shah et al., 2018, Section 3.9).
• It is of interest to understand the statistical aspects of estimating the community's consensus mapping function, assuming the existence of a ground truth. In more detail, suppose each reviewer's true function $g_i$ is a noisy version of some underlying function f that represents the community's beliefs. Then how can one recover f in a statistically efficient manner, perhaps via L(1, 1) aggregation or otherwise? Conceptually this non-parametric estimation problem is related to isotonic regression (Shah et al., 2017; Gao and Wellner, 2007; Chatterjee et al., 2018). The key difference is that the observations in our setting consist of evaluations of multiple functions, where each such function is a noisy version of the original monotonic function. In contrast, isotonic regression is primarily concerned with noisy evaluations of a common function. Nevertheless, insights from isotonic regression suggest that the monotonicity assumption of our setting can yield attractive, and sometimes near-parametric (Shah et al., 2017, 2019, 2020), rates of estimation.
• It is of interest to further incorporate additional information from reviews such as self-reported confidence (MacKay et al., 2017) or self-reported expertise of reviewers (by, e.g., reweighting the terms in the L(1, 1) aggregation accordingly) or even the review text (Hua et al., 2019; Manzoor and Shah, 2021).
• There are various other problems in peer review such as miscalibration (Ge et al., 2013; Roos et al., 2011; Wang and Shah, 2019), noise (Stelmakh et al., 2019a), fraud (Vijaykumar, 2020a,b; Jecmen et al., 2020), and biases (Tomkins et al., 2017; Stelmakh et al., 2019b; Nielsen et al., 2021). These problems have been treated independently of each other in the literature, and addressing them jointly along with the problem of subjectivity is a challenging and important open problem.
• Our framework assumes that reviewers use criteria to come up with an overall score. In practice, some reviewers may first arrive at an overall judgment and tailor criteria scores to fit their overall judgment. The instructions for reviewing could be designed to mitigate this issue.
• Our work focuses on learning one representative aggregate mapping for the entire community of reviewers. Instead, the program chairs of a conference may wish to allow for multiple mappings that represent the aggregate opinions of different sub-communities (e.g., theoretical or applied researchers). In this case, one may modify our framework to also learn this (unknown) partition of reviewers and/or papers into multiple sub-communities with different mapping functions, and frame the problem in terms of learning a mixture model. The design of computationally efficient algorithms for L(p, q) aggregation under such a mixture model is a challenging open problem.
As a final remark, we see our work as an unusual synthesis between computational social choice and machine learning. We hope that our approach will inspire exploration of additional connections between these two fields of research, especially in terms of viewing choices made in machine learning, often in an ad hoc fashion, through the lens of computational social choice.
For the overall proof to be easier to follow, proofs of all claims are given at the end of this proof. Also, just to re-emphasize, the whole proof assumes z > 1.
Claim 1. $G_z$ is a strictly convex objective function.
Claim 1 states that $G_z$ is strictly convex, implying that it has a unique minimizer $f(z)$. Hence, there is no need to consider tie-breaking.
Claim 2. $f_1(z)$ and $f_2(z)$ are bounded. In particular, $f_1(z) \in [0, 1]$ and $f_2(z) \in [0, 1]$.
Claim 2 states that the aggregate score of both papers lies in the interval [0, 1] irrespective of the value of z. This allows us to restrict ourselves to the region $[0, 1]^2$ when computing the minimizer of (10). Hence, for the rest of the proof, we only consider the space $[0, 1]^2$. In this region, the optimization problem (9) can be rewritten as To start off, we analyze the objective function as we take the limit of z going to infinity. Later, we show that the observed property holds even for a sufficiently large finite z.
For the limit to exist, redefine the objective function as For any value of z, the function $H_z$ has the same minimizer as $G_z$, that is, Claim 3. For any (fixed) where The proof proceeds by analyzing some important properties of the limiting function H. where Observe that Claim 6 is the desired result, but for the limiting objective function H. The remainder of the proof proceeds to show that this result holds even for the objective function $H_z$, when the score z is large enough. Define $\Delta = v_2 - v_1 > 0$. We first show that (i) there exists z > 1 such that $\|f(z) - v\|_2 < \Delta/4$, and then (ii) show that in this case, we have $f_1(z) < f_2(z)$.
To prove part (i), we first analyze how functions $H_z$ and H relate to each other. Using Claim 3, for any fixed $f_1, f_2$, by definition of the limit, for any $\epsilon > 0$, there exists $z_\epsilon$ (which could be a function of $f_1, f_2$) such that, for all $z > z_\epsilon$, we have $|H_z(f_1, f_2) - H(f_1, f_2)| < \epsilon$ (14). For a given $f_1, f_2$, denote the corresponding value of $z_\epsilon$ by $z_\epsilon(f_1, f_2)$. And, let $Z_\epsilon(f_1, f_2)$ denote the set of all values of z > 1 for which Equation (14) holds for $(f_1, f_2)$.
Claim 7 says that if Equation (14) holds for a particular value of z for $f_1 = f_2 = 1$, then for the same value of z it holds for every other value of $(f_1, f_2) \in [0, 1]^2$ as well. So, define By definition, $z_\epsilon \in Z_\epsilon(1, 1)$. And by Claim 7, So, set $z = z_\epsilon$. Then, Equation (14) holds for all $(f_1, f_2) \in [0, 1]^2$ simultaneously. In other words, for all $(f_1, f_2) \in [0, 1]^2$, we simultaneously have $|H_z(f_1, f_2) - H(f_1, f_2)| < \epsilon$, i.e., $H_z$ is in an $\epsilon$-band around H throughout this region. And observe that this band gets smaller as $\epsilon$ is decreased (which is achieved at a larger value of z).
To bound the distance between v, the minimizer of H, and $f(z)$, the minimizer of $H_z$, we bound the distance between the objective function values at these points.
Although $f(z)$ does not minimize H, Claim 8 says that the objective value at $f(z)$ cannot be more than $2\epsilon$ larger than its minimum, $H(v)$. We use this to bound the distance between $f(z)$ and the minimizer v. Observe that $f(z)$ falls in the $[H(v) + 2\epsilon]$-level set of H. So, we next look at a specific level set of H. Define Observe that a minimum exists (infimum is not required) for the minimization in (17) because we are minimizing over the closed set {f ∈ For any fixed $p \in (1, \infty)$, Equation (13) shows that $v_1$ is bounded away from 0. Hence, Claim 4 shows that H is strictly convex at and in the region around v. Further, H is convex everywhere else. Coupling this with the fact that (17) minimizes along points not arbitrarily close to the minimizer v, we have $\tau > H(v)$.
Define the level set of H with respect to τ :

2
, and set $\epsilon = \epsilon_o$. Then, set $z = z_{\epsilon_o}$ as before. Applying Claim 8, we obtain In other words, $f(z_{\epsilon_o}) \in C_\tau$. And applying Claim 9, we obtain Using these properties, we have where the first inequality holds because of the first part of (18), the equality holds because $\Delta = v_2 - v_1$ and the second inequality holds because of the second part of (18). Therefore, for $z = z_{\epsilon_o} > 1$, the aggregate scores of the two papers are such that $f_1(z) < f_2(z)$, violating efficiency.
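The behaviour established by this proof can be checked numerically. The sketch below rests on an assumption: the displayed objective is not reproduced in the text above, so $G_z$ is taken to be the sum of the $L_p$ distances from $f = (f_1, f_2)$ to the three vectors named in the proof of Claim 1, namely (z, 0), (0, 1) and (0, 0). Under that assumption, setting the gradient of the limiting function to zero gives $v_2 = 1/2$ and $v_1 = \frac{1}{2}(2^{p/(p-1)} - 1)^{-1/p}$. The coordinate-descent minimizer and all function names are illustrative choices, not the paper's method.

```python
def lp_norm(u, v, p):
    """L_p norm of the two-dimensional vector (u, v)."""
    return (abs(u) ** p + abs(v) ** p) ** (1.0 / p)

def G(f1, f2, z, p):
    """Assumed objective: L_p distances from f to (z, 0), (0, 1), (0, 0)."""
    return (lp_norm(z - f1, f2, p)
            + lp_norm(f1, 1.0 - f2, p)
            + lp_norm(f1, f2, p))

def argmin_1d(func, lo=0.0, hi=1.0, iters=100):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if func(a) < func(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0

def minimize_G(z, p, sweeps=50):
    """Coordinate descent over [0, 1]^2, the region given by Claim 2."""
    f1, f2 = 0.5, 0.5
    for _ in range(sweeps):
        f1 = argmin_1d(lambda x: G(x, f2, z, p))
        f2 = argmin_1d(lambda y: G(f1, y, z, p))
    return f1, f2

p, z = 2.0, 1e4
f1, f2 = minimize_G(z, p)
# Limiting value of f1 under the assumed objective (see lead-in).
v1 = 0.5 * (2.0 ** (p / (p - 1.0)) - 1.0) ** (-1.0 / p)
```

For p = 2 and a large z, the numerical minimizer sits close to $(v_1, 1/2)$ with $f_1 < f_2$, which is the inequality the proof identifies as violating efficiency.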
Proof of Claim 1
Take arbitrary $f, g \in \mathbb{R}^2$ with $f \neq g$, and let $\theta \in (0, 1)$. We show that $G_z(\theta f + (1 - \theta) g) < \theta G_z(f) + (1 - \theta) G_z(g)$. For this, we will first show that either (i) the vector [(z, 0) − f] is not parallel to the vector [(z, 0) − g], (ii) the vector [(0, 1) − f] is not parallel to the vector [(0, 1) − g] or (iii) the vector f is not parallel to the vector g. For the sake of contradiction, assume that this is not true. That is, assume [(z, 0) − f] is parallel to [(z, 0) − g], [(0, 1) − f] is parallel to [(0, 1) − g], and f is parallel to g. This implies that $(z, 0) - f = r\,[(z, 0) - g]$, $(0, 1) - f = s\,[(0, 1) - g]$, and $f = t\,g$ for some $r, s, t \in \mathbb{R}$.⁶ Note that none of r, s, t can be 1 because $f \neq g$. The second equation tells us that $f_1 = s g_1$ and the third one tells us that $f_1 = t g_1$. If $g_1 = 0$, then $f_1 = 0$ as well, and the first component of the first equation reads $z = rz$; it says that r = 1, which is not possible. Therefore, $g_1 \neq 0$ and s = t. The third equation now tells us that $f_2 = t g_2 = s g_2$. But, the second equation gives us $1 - f_2 = s - s g_2$, which implies that s = 1. But again this is not possible, leading to a contradiction. Therefore, at least one of (i), (ii) and (iii) is true.
Further, since $p \in (1, \infty)$, the inequality in (19) is strict if x is not parallel to y. For our objective (in Equation (9)), Because of convexity of the $L_p$ norm, each of the three terms on the RHS of Equation (20) satisfies inequality (19). Further, because at least one of the pair of vectors in the three terms is not parallel (since either (i), (ii) or (iii) is true), at least one of them gives us a strict inequality. Therefore we obtain $G_z(\theta f + (1 - \theta) g) < \theta G_z(f) + (1 - \theta) G_z(g)$.

Proof of Claim 2
The claim has four parts: (i) $f_1(z) \geq 0$, (ii) $f_1(z) \leq 1$, (iii) $f_2(z) \geq 0$, and (iv) $f_2(z) \leq 1$. Observe that parts (i), (iii) and (iv) are more intuitive, since they show that the aggregate score of a paper is no higher than the maximum score given to it, and no lower than the minimum score given to it. Part (ii) on the other hand is stronger; even though paper 1 has a score of z > 1 given to it, this part shows that $f_1(z) \leq 1$ (which is much tighter than an upper bound of z, especially when z is large). We prove the simpler parts (i), (iii) and (iv) first.
Next, for the sake of contradiction assume that $f_2(z) > 1$. Then contradicting the fact that $(f_1(z), f_2(z))$ is optimal. Therefore, we also have $f_2(z) \leq 1$, completing the proof of (iv).
Finally, we prove the more non-intuitive part, (ii). Suppose for the sake of contradiction, $f_1(z) > 1$. Then, where the first inequality comes from the fact that the $L_p$ norm of each vector is at least as high as the absolute value of its first element, and the second inequality follows from the triangle inequality. Using the assumption that $f_1(z) > 1$, we obtain

Proof of Claim 3
Take any arbitrary $f_1 \in [0, 1]$ and $f_2 \in [0, 1]$. Subtracting Equations (11) and (12) we obtain Observe that since $f_2 \geq 0$, the RHS of Equation (21) is non-negative. Hence, the equation does not change on using an absolute value, i.e., To prove the required result, we take a small detour and define $\phi(x) = (x^p + f_2^p)^{1/p} - x$. We show that $\phi(x) \to 0$ as $x \to \infty$. For this, rewrite $\phi(x)$ as follows
$$\phi(x) = \frac{\left(1 + f_2^p / x^p\right)^{1/p} - 1}{1/x}.$$
Taking the limit of x to infinity, we have
$$\lim_{x \to \infty} \phi(x) = \lim_{x \to \infty} \frac{\left(1 + f_2^p / x^p\right)^{1/p} - 1}{1/x}. \qquad (23)$$
Observe that for both the numerator and denominator in the RHS of Equation (23), the limit is zero. Applying L'Hôpital's rule,
$$\lim_{x \to \infty} \phi(x) = \lim_{x \to \infty} \frac{-\frac{f_2^p}{x^{p+1}} \left(1 + \frac{f_2^p}{x^p}\right)^{\frac{1}{p} - 1}}{-\frac{1}{x^2}} = \lim_{x \to \infty} \frac{f_2^p}{x^{p-1}} \left(1 + \frac{f_2^p}{x^p}\right)^{\frac{1}{p} - 1} = 0,$$
where $\lim_{x \to \infty} f_2^p / x^{p-1} = 0$ because p > 1. Hence, we proved the required result, $\lim_{x \to \infty} \phi(x) = 0$. Going back to Equation (22), we rewrite it as Taking the limit of z to infinity, we obtain where the second step follows by setting $t = z - f_1$. Equation (24) implies that

Proof of Claim 4
In the region $[0, 1]^2$, using (12), the function H can be written as Observe that each term on the RHS of (25) is a convex function of f. Hence, their sum is also convex in f. The proof of strict convexity closely follows the proof of Claim 1. Take arbitrary $f, g \in (0, 1] \times [0, 1]$ with $f \neq g$, and let $\theta \in (0, 1)$. We show that $H(\theta f + (1 - \theta) g) < \theta H(f) + (1 - \theta) H(g)$. For this, we will first show that either (i) [(0, 1) − f] is not parallel to [(0, 1) − g] or (ii) f is not parallel to g. For the sake of contradiction, assume that this is not true. That is, assume [(0, 1) − f] is parallel to [(0, 1) − g], and f is parallel to g. This implies that $(0, 1) - f = r\,[(0, 1) - g]$ and $f = s\,g$, where $r, s \in \mathbb{R}$.
Note that neither r nor s can be 1 because $f \neq g$. The first equation tells us that $f_1 = r g_1$ and the second one tells us that $f_1 = s g_1$. And since $g_1 \neq 0$, this implies that r = s. The second part of the second equation now tells us that $f_2 = s g_2 = r g_2$. The second part of the first equation becomes $1 - f_2 = r - r g_2$, which implies that r = 1, leading to a contradiction. Therefore, at least one of (i) and (ii) is true.
Recall, the $L_p$ norm with $p \in (1, \infty)$ is a convex norm, i.e., for any $x, y \in \mathbb{R}^2$, $\|\theta x + (1 - \theta) y\|_p \leq \theta \|x\|_p + (1 - \theta) \|y\|_p$ (26). And since $p \in (1, \infty)$, the inequality in (26) is strict if x is not parallel to y. For H (using Equation (25)), Because of convexity of the $L_p$ norm, both the third and fourth term on the RHS of Equation (27) satisfy inequality (26). Further, because at least one of the pair of vectors in these two terms is not parallel (since either (i) or (ii) is true), at least one of them gives us a strict inequality. Therefore we obtain $H(\theta f + (1 - \theta) g) < \theta H(f) + (1 - \theta) H(g)$.

Proof of Claim 5
To compute the minimizer of H, we compute its gradients with respect to $f_1$ and $f_2$. Using Equation (12), the partial derivative with respect to $f_1$ is and with respect to $f_2$ is Observe that at $f_2 = 1/2$, irrespective of the value of $f_1$, the partial derivative (29) is zero. So, set $v_2 = 1/2$. Next, we find $v_1$ such that the other derivative (28) is also zero at $v = (v_1, v_2)$. Setting (28) to zero at v, we obtain

Proof of Claim 6
For any p > 1, we know $p/(p - 1) > 1$. This implies that $2^{p/(p-1)} - 1 > 1$ and hence $\frac{1}{2} \left(2^{p/(p-1)} - 1\right)^{-1/p} < \frac{1}{2}$. Finally, using the values from Claim 5, we obtain $v_1 < v_2$.

As in the proof of Claim 3, on subtracting Equations (11) and (12), and taking an absolute value, we obtain Equation (22), that is, Combining Equation (30) with the fact that $0 \leq f_2 \leq 1$, we obtain

A.2 Proof of Lemma 3
Consider L(p, q) aggregation with arbitrary $q \in (1, \infty]$. We show that strategyproofness is violated. The construction for this is as follows. Suppose there is one paper a and two reviewers. The first reviewer gives the paper an overall recommendation of 1 and the second reviewer gives it an overall recommendation of 0. Let $x_a$ be the (objective) criteria scores of this paper.
Let us first consider $q \in (1, \infty)$. For a function f : X → Y, all we care about in this example is its value at $x_a$. Hence, for simplicity, let $f_a$ denote the value of function f at $x_a$, i.e., $f_a := f(x_a)$. Then our aggregation becomes We claim that $f_a = 0.5$ is the unique minimizer. Observe that if $f_a = 0.5$, then the value of our objective is $0.5^q + 0.5^q < 1$ when $q \in (1, \infty)$. On the other hand, if $f_a \geq 1$ or if $f_a \leq 0$ then the value of our objective is at least 1. Hence $f_a \in (0, 1)$. By symmetry, we can restrict attention to the range [0.5, 1) since if there is a minimizer in (0, 0.5) then there must also be a minimizer in (0.5, 1). Consequently, we rewrite the optimization problem as Consider the function $h : [0.5, 1] \to \mathbb{R}$ defined by $h(x) = x^q$. This function is strictly convex (the second derivative is strictly positive in the domain) whenever $q \in (1, \infty)$. Hence from the definition of strict convexity, we have $0.5\,[(1 - f_a)^q + f_a^q] > [0.5\,(1 - f_a + f_a)]^q = 0.5^q$ whenever $f_a \in (0.5, 1)$. Consequently, the objective value of (34) is greater at $f_a \in (0.5, 1)$ than at $f_a = 0.5$. We conclude that $f_a = 0.5$ whenever $q \in (1, \infty)$. When q = ∞, we equivalently write the optimization problem as This objective has a value of 0.5 if $f_a = 0.5$ and strictly greater if $f_a \neq 0.5$. Hence, $f_a = 0.5$ for q = ∞ as well.
The true overall recommendation of reviewer 1 differs from the aggregate $f_a$ by 0.5 (in every L norm). However, if reviewer 1 reported an overall recommendation of 2, then an argument identical to that above shows that the minimizer is $g_a = 1$. Reviewer 1 has thus successfully brought down the difference between her own true overall recommendation and the aggregate $g_a$ to 0. We conclude that strategyproofness is violated whenever $q \in (1, \infty]$.
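The manipulation in this construction can be reproduced numerically. The sketch below is illustrative only: with a single paper, each reviewer's loss is the absolute error $|y_i - f_a|$, and the aggregate minimizes the $L_q$ norm of these errors across reviewers (for finite q this has the same minimizer as the sum of q-th powers used in the proof). The search interval, the ternary-search minimizer and the function names are assumptions made for the example.

```python
def objective(f_a, reports, q):
    """L_q norm across reviewers of the per-reviewer error |y_i - f_a|."""
    errors = [abs(y - f_a) for y in reports]
    if q == float("inf"):
        return max(errors)
    return sum(e ** q for e in errors) ** (1.0 / q)

def aggregate(reports, q, lo=-1.0, hi=3.0, iters=200):
    """Ternary search for the minimizer of the convex objective."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if objective(a, reports, q) < objective(b, reports, q):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0

for q in (1.5, 2.0, 3.0, float("inf")):
    truthful = aggregate([1.0, 0.0], q)   # reviewer 1 reports her true score 1
    strategic = aggregate([2.0, 0.0], q)  # reviewer 1 misreports 2
    print(q, round(truthful, 4), round(strategic, 4))
```

For each of these values of q the truthful aggregate lands at the midpoint of the two reports, while the misreport moves the aggregate onto reviewer 1's true score.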

A.3 Proof of Lemma 4
The construction showing that L(∞, 1) aggregation violates consensus is as follows. Suppose there are two papers, two reviewers and both reviewers review both papers. Assume that the

Figure 4: The shaded region depicts the set of all minimizers of (35). $f_1$ is on the x-axis and $f_2$ is on the y-axis; both axes range from 0 to 2.
Pictorially, this set is given by the shaded square in Figure 4. It is the square with vertices at (0, 1), (1, 0), (2, 1) and (1, 2). This shows that almost all minimizers violate consensus. For the specific tie-breaking considered, the minimizer chosen is the one with minimum $L_2$ norm, i.e., the projection of (0, 0) onto this square. This gives us (0.5, 0.5), violating consensus.
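The shape of this minimizer set can be checked on a grid. The reviewers' scores did not survive in the text above, so the values below are an assumption: reviewer 1 gives (2, 1) and reviewer 2 gives (0, 1) to papers 1 and 2, which is one assignment that yields exactly the square described (both reviewers agree that paper 2 deserves a score of 1). The grid resolution and variable names are likewise illustrative.

```python
def objective(f1, f2, reviews):
    """L(inf, 1): sum over reviewers of the largest per-paper error."""
    return sum(max(abs(y1 - f1), abs(y2 - f2)) for y1, y2 in reviews)

# Hypothetical scores, one (paper 1, paper 2) pair per reviewer.
reviews = [(2.0, 1.0), (0.0, 1.0)]

steps = [i / 4 for i in range(-4, 13)]  # grid over [-1, 3] in steps of 0.25
best = min(objective(a, b, reviews) for a in steps for b in steps)
minimizers = {(a, b) for a in steps for b in steps
              if objective(a, b, reviews) == best}

# Tie-breaking by minimum L2 norm, as in the text.
chosen = min(minimizers, key=lambda t: t[0] ** 2 + t[1] ** 2)
```

Under these assumed scores, every grid minimizer satisfies $|f_1 - 1| + |f_2 - 1| \leq 1$, and the minimum-norm choice assigns paper 2 a score different from the value both reviewers agreed on.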
Observe that tie-breaking using minimum $L_k$ norm, for $k \in (1, \infty]$, also chooses (0.5, 0.5) as the aggregate function, violating consensus. For k = 1, all points on the line segment $f_1 + f_2 = 1$ ($0 \leq f_1 \leq 1$) would be tied winners, almost all of which violate consensus. Further, even if one uses other reasonable tie-breaking schemes like maximum $L_k$ norm, they suffer from the same issue, i.e., there is a tied winner which violates consensus.

Our framework is not only useful for computing an aggregate mapping to help in acceptance decisions, but also for understanding the preferences of the community for use in subsequent modeling and research. We illustrate this application by providing some visualizations and interpretations of the aggregate function f obtained from L(1, 1) aggregation on the IJCAI review data.
The function f lives in a 5-dimensional space, making it hard to visualize the entire aggregate function. Instead, we fix the values of 3 criteria at a time and plot the function in terms of the remaining two criteria. In all of the visualizations below, the fixed criteria are set to their respective (marginal) modes: for 'quality of writing' the mode is 7 (715 reviews), for 'originality' it is 6 (826 reviews), for 'relevance' it is 8 (888 reviews), for 'significance' it is 5 (800 reviews), and for 'technical quality' it is 6 (702 reviews).
The key takeaways from this experiment are as follows. First, writing and relevance do not have a significant influence (Figure 5(e)). Really bad writing or relevance is a significant downside, excellent writing or relevance is appreciated, but everything else in between is irrelevant. Second, technical quality and significance exert a high influence (Figure 5(f)). Moreover, the influence is approximately linear. Third, linear models (i.e., models that are linear in the criteria) are quite popular in machine learning, and our empirical observations reveal that linear models are partially applicable for the mapping: for some criteria one may indeed assume a linear model, but not for all.

Figure 1: An example of aggregation under L(2, 1) loss, and violation of the efficiency axiom.

Figure 2: Fraction overlap as number of reviews per paper is restricted. Error bars depict 95% confidence intervals, but may be too small to be visible for k = 4, 5.

Table 1: Distribution of number of papers reviewed by a reviewer.

Table 2: Percentage of overlap (in selected papers) between different L(p, q) aggregation methods.

B.2 Visualizing the Community Aggregate Mapping