Fast Adaptive Non-Monotone Submodular Maximization Subject to a Knapsack Constraint

Constrained submodular maximization problems encompass a wide variety of applications, including personalized recommendation, team formation, and revenue maximization via viral marketing. The massive instances occurring in modern-day applications can render existing algorithms prohibitively slow, while, frequently, those instances are also inherently stochastic. Focusing on these challenges, we revisit the classic problem of maximizing a (possibly non-monotone) submodular function subject to a knapsack constraint. We present a simple randomized greedy algorithm that achieves a 5.83-approximation and runs in O(n log n) time.


Introduction
Constrained submodular maximization is a fundamental problem at the heart of discrete optimization. The reason is as simple as it is compelling: submodular functions capture the notion of diminishing returns present in a wide variety of real-world settings.
Owing to its striking importance and its NP-hardness (Feige, 1998), extensive research has been conducted on submodular maximization since the seventies (e.g., Edmonds, 1971; Nemhauser et al., 1978), with the focus lately shifting towards handling the massive datasets emerging in modern applications. With a wide variety of possible constraints, often regarding cardinality, independence in a matroid, or knapsack-type restrictions, the number of applications is vast. To name just a few, there are recent works on feature selection in machine learning (Kempe, 2008, 2018; Khanna et al., 2017), influence maximization in viral marketing (Babaei et al., 2013; Kempe et al., 2015), data summarization (Sipos et al., 2012; Mirzasoleiman et al., 2013; Tschiatschek et al., 2014; Dütting et al., 2022), and decision making under uncertainty (Shperberg and Shimony, 2017; Boodaghians et al., 2020). Many of these applications have non-monotone submodular objectives, meaning that adding an element to an existing set might actually decrease its value. Two such examples are discussed in detail in Section 5.
Modern-day applications increasingly force us to face two distinct, but often entangled, challenges. First, the massive size of occurring instances fuels a need for very fast algorithms. As the running time is dominated by the objective function evaluations (also known as value oracle calls), it is typically measured (as in this work) by their number. So, here the goal is to design algorithms requiring an almost linear number of such evaluations. There is extensive research focusing on this issue, be it in the standard algorithmic setting (Mirzasoleiman et al., 2016), in streaming (Badanidiyuru et al., 2014; Chekuri et al., 2015a; Alaluf et al., 2020), in distributed submodular maximization (da Ponte Barbosa et al., 2015), or in the adaptive complexity framework (Kuhnle, 2021; Amanatidis et al., 2021). The second challenge is the inherent uncertainty in problems like sensor placement or revenue maximization, where one does not learn the exact marginal value of an element until it is added to the solution (and thus "paid for"). This, too, has motivated several works on adaptive submodular maximization (Golovin and Krause, 2011; Gotovos et al., 2015; Gupta et al., 2017; Mitrovic et al., 2019) or on adaptivity gap techniques (Bradac et al., 2019; Boodaghians et al., 2020). Note that even estimating the expected value of a partially unknown objective function can be very costly, which makes reducing the number of such calls all the more important.
Knapsack constraints are one of the most natural types of restriction occurring in real-world problems, often modeling hard budget, time, or size limits. Other combinatorial constraints, like partition matroid constraints, on the other hand, model less stringent requirements, e.g., avoiding too many similar items in the solution. As soft versions of such constraints can often be hardwired into the objective itself (see the Video Recommendation application in Section 5), we do not deal with them directly here.
The nearly-linear time requirement, without large constants involved, leaves little room for sophisticated approaches like continuous greedy methods (Feldman et al., 2011) or enumeration of initial solutions (Sviridenko, 2004). To further highlight the delicate balance between function evaluations and approximation, it is worth mentioning that, even for the monotone case, the first result combining O(n log n) oracle calls with an approximation better than 2 is a very recent e/(e−1)-approximation algorithm. While this is a very elegant theoretical result, the huge constants involved render it unusable in practice.
At the same time, the strikingly simple 2.8-approximation adapted density greedy algorithm of Wolsey (1982) deals well with both issues in the monotone case: sort the items in decreasing order of their marginal-value-to-cost ratio and pick as many items as possible in that order without violating the constraint; finally, return the better of this solution and the best single item. When combined with lazy evaluations (Minoux, 1978), this algorithm requires only O(n log n) value oracle calls and can be adjusted to work equally well for adaptive submodular maximization (Golovin and Krause, 2011). For non-monotone objectives, however, the only practical algorithm is the (10 + ε)-approximation FANTOM algorithm of Mirzasoleiman et al. (2016), requiring O(n^2 log n) value oracle calls (see also Remark 1). Moreover, there is no known algorithm for the adaptive setting that can handle anything beyond a cardinality constraint (Gotovos et al., 2015).
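As a point of reference, the adapted density greedy rule just described can be sketched as follows (a sketch only: the function, cost, and element names are illustrative, and tie-breaking and stopping details vary across presentations):

```python
def adapted_density_greedy(elements, value, cost, budget):
    """Sketch of Wolsey's adapted density greedy: repeatedly take the feasible
    element with the largest marginal-value-to-cost ratio, then return the
    better of the greedy set and the best single item."""
    solution, spent = set(), 0.0
    remaining = set(elements)
    while remaining:
        # only elements that still fit in the budget are candidates
        feasible = [e for e in remaining if spent + cost[e] <= budget]
        if not feasible:
            break
        base = value(solution)
        e = max(feasible, key=lambda x: (value(solution | {x}) - base) / cost[x])
        remaining.remove(e)
        if value(solution | {e}) > base:   # only add items with positive marginal
            solution.add(e)
            spent += cost[e]
    # compare against the best feasible singleton
    i_star = max((e for e in elements if cost[e] <= budget),
                 key=lambda e: value({e}))
    return solution if value(solution) >= value({i_star}) else {i_star}
```

On a tiny modular (hence monotone submodular) instance this reduces to the familiar knapsack density heuristic plus the singleton fallback.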
We aim to tackle both aforementioned challenges for non-monotone submodular maximization under a knapsack constraint by revisiting the simple algorithmic principle of Wolsey's algorithm. Our approach is along the lines of recent results on random greedy combinatorial algorithms (Buchbinder et al., 2014; Feldman et al., 2017), which show that introducing randomness into greedy algorithms can extend their guarantees to the non-monotone case. Here we give the first such algorithm for a knapsack constraint.
To get some intuition about the need for (and use of) randomization, consider the following example, which also demonstrates that the adapted density greedy algorithm may produce arbitrarily poor solutions when the objective is non-monotone. Suppose there are n + 1 items, x_1, . . . , x_n and y. The value of a set S is its cardinality if y ∉ S, and 1 + ε otherwise (it is easy to check that this is a non-monotone submodular function). All items have weight 1/n and the knapsack size is 1. Note that both the adapted density greedy and the standard greedy algorithm would start by adding y, thus producing a final solution of value 1 + ε, whereas the optimal value is n. Now suppose that we allow adapted density greedy to avoid each item with some constant probability p: it avoids y with probability p in its first step, leading to a solution of expected value at least (1 − p) · (1 + ε) + p(1 − p) · n = Θ(n).
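This failure mode, and the randomized fix, can be replayed on a toy implementation (a sketch: the item names, skip rule, and ε value are illustrative):

```python
import random

def example_value(S, eps=0.01):
    """The non-monotone submodular function from the example:
    |S| if y is not in S, and 1 + eps otherwise."""
    return 1 + eps if "y" in S else len(S)

def density_greedy_with_skips(n, skip_prob, rng, eps=0.01):
    """Adapted density greedy over the n+1 items, where every considered
    item is independently skipped with probability skip_prob."""
    items = ["y"] + ["x%d" % k for k in range(n)]
    cost = {e: 1.0 / n for e in items}
    budget = 1.0
    chosen, spent = set(), 0.0
    # order by singleton density: y comes first (value 1+eps), then the x's
    for e in sorted(items, key=lambda e: example_value({e}, eps) / cost[e],
                    reverse=True):
        if spent + cost[e] > budget or rng.random() < skip_prob:
            continue
        if example_value(chosen | {e}, eps) > example_value(chosen, eps):
            chosen.add(e)
            spent += cost[e]
    return example_value(chosen, eps)
```

With skip_prob = 0 the run deterministically gets trapped at value 1 + ε, while a constant skip probability recovers Θ(n) on average.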

Contribution and Outline
In this work we show that introducing some randomization to the adapted density greedy algorithm leads to a simple algorithm, SampleGreedy, that outperforms existing algorithms both in theory and in practice. SampleGreedy flips a coin before greedily choosing any item in order to decide whether to include it in the solution or ignore it. The algorithmic simplicity of such an approach keeps SampleGreedy fast, easy to implement, and flexible enough to adjust to other related settings. At the same time, the added randomness prevents it from getting trapped in solutions of poor quality.
In Section 3 we show that SampleGreedy is a (3 + 2√2 + ε)-approximation algorithm using O(nε^-1 log(n/ε)) value oracle calls; in particular, this implies a 5.83-approximation algorithm that makes only O(n log n) calls. When all singletons have small value compared to an optimal solution, the approximation factor improves to almost 4. This is the first constant-factor approximation algorithm for the non-monotone case using this few queries. The only other algorithm fast enough to be suitable for large instances is the aforementioned FANTOM (Mirzasoleiman et al., 2016) which, for a knapsack constraint, achieves an approximation factor of 10 + ε with O(nrε^-1 log n) queries, where r is the size of the largest feasible set and can be as large as Θ(n). Even if we modify FANTOM to use lazy evaluations, we still improve the query complexity by a logarithmic factor (see also Remark 1).
Then we study the problem in the adaptive submodular maximization framework of Golovin and Krause (2011) and Gotovos et al. (2015), where the stochastic submodular objective is learned as we build the solution and its value depends only on the state of the elements in the evaluated set. For this adaptive variant, we show in Section 4 that a natural adaptation of our algorithm, AdaptiveGreedy, still guarantees a (9 + ε)-approximation to the best adaptive policy. This is not only a relatively small loss given the considerably stronger benchmark, but is in fact the first constant approximation known for the problem in this framework. Hence we fill a notable theoretical gap, given that models with incomplete prior information, or those capturing evolving settings, are becoming increasingly important in practice.
From a technical point of view, our algorithm combines the simple principle of always choosing a high-density item with maintaining a careful exploration-exploitation balance, as is the case in many stochastic learning problems. Greedily adding elements might result in poor local maxima, so we need randomization in order to spend some budget on exploring other elements. Given these features, this paper is directly related to the recent simple randomized greedy approaches for maximizing non-monotone submodular objectives subject to other (i.e., non-knapsack) constraints (Buchbinder et al., 2014; Chekuri et al., 2015a; Feldman et al., 2017). However, there are underlying technical difficulties that make the analysis for knapsack constraints significantly more challenging. Every single result in this line of work critically depends on making a random choice in each step, in such a way that "good progress" is consistently made. This is not possible under a knapsack constraint. Instead, we argue globally about the value of the SampleGreedy output via a comparison with a carefully maintained almost integral solution. When extending this approach to the adaptive non-monotone submodular maximization framework, we crucially use the fact that the algorithm builds the solution iteratively, committing in every step to all of its past choices. This is the main technical reason why it is not possible to adjust algorithms with multiple "parallel" runs, like FANTOM, to the adaptive setting.
Our algorithms provably handle the aforementioned emerging, modern-day challenges well, i.e., stochastically evolving objectives and rapidly growing real-world instances. In Section 5 we showcase the fact that our theoretical results indeed translate into applied performance. We focus on three applications that fit within the framework of non-monotone submodular maximization subject to a knapsack constraint, namely video recommendation, influence-and-exploit marketing, and influence maximization. We run experiments on real and synthetic data indicating that SampleGreedy consistently performs better than FANTOM while being much faster. For AdaptiveGreedy we highlight the fact that its adaptive behavior results in a significant improvement over non-adaptive alternatives.

Related Work
There is an extensive literature on submodular maximization subject to knapsack or other constraints, going back several decades (see, e.g., Nemhauser et al., 1978; Wolsey, 1982). For a monotone submodular objective subject to a knapsack constraint there is a deterministic e/(e−1)-approximation algorithm (Khuller et al., 1999; Sviridenko, 2004), which is tight unless P = NP (Feige, 1998). This algorithm has a running time of O(n^5), but there are other, much faster greedy approaches with weaker approximation guarantees, like Wolsey's 2.8-approximation algorithm (Wolsey, 1982), used as a starting point here, and the recent 2-approximation algorithm of Yaroslavtsev et al. (2020).
For non-monotone submodular functions, Lee et al. (2010) provided a 5-approximation algorithm for k knapsack constraints, the first constant-factor algorithm for the problem. Fadaei et al. (2011), building on the approach of Lee et al. (2010), reduced this factor to 4. One of the most interesting algorithms for a single knapsack constraint is the 6-approximation algorithm of Gupta et al. (2010). As this is a greedy combinatorial algorithm based on running Sviridenko's algorithm twice, it is often used as a subroutine by other algorithms in the literature, e.g., by da Ponte Barbosa et al. (2015), despite its running time of O(n^4). A number of continuous greedy approaches (Feldman et al., 2011; Kulik et al., 2013; Chekuri et al., 2014) led to the current best factor of e when a knapsack, or even a general downward-closed, constraint is involved. However, continuous greedy algorithms are impractical for most real-world applications. The fastest such algorithm for our setting is the (e + ε)-approximation algorithm of Chekuri et al. (2015b), requiring O(n^3 ε^-4 polylog(n)) function evaluations. Possibly the only algorithm directly comparable to our SampleGreedy in terms of running time is FANTOM by Mirzasoleiman et al. (2016), which achieves a (1 + ε)(p + 1)(2p + 2ℓ + 1)/p-approximation for ℓ knapsack constraints and a p-system constraint in time O(nrpε^-1 log n), where r is the size of the largest feasible solution.
As mentioned above, there are a number of recent results on randomizing simple greedy algorithms so that they work for non-monotone submodular objectives (Buchbinder et al., 2014; Chekuri et al., 2015a; Gotovos et al., 2015; Feldman et al., 2017; Feldman and Zenklusen, 2018). Our paper extends this line of work, as we are the first to successfully apply this approach to a knapsack constraint. Golovin and Krause (2011) introduced the notions of adaptive monotonicity and submodularity and showed that it is possible to achieve guarantees with respect to the optimal adaptive policy that are similar to those one gets in the standard algorithmic setting with respect to an optimal solution. Our Section 4 fits into this framework as generalized by Gotovos et al. (2015) to non-monotone objectives. Gotovos et al. (2015) showed that a variant of the random greedy algorithm of Buchbinder et al. (2014) achieves an e/(e−1)-approximation in the case of a cardinality constraint. Tang (2021a) recently presented a different analysis of the algorithm of Gotovos et al. (2015) that yields the same approximation factor when the objective is adaptive submodular but possibly non-pointwise submodular. Finally, in a very recent unpublished manuscript, Tang (2021b) claims that (an equivalent version of) SampleGreedy achieves a constant-factor approximation for adaptive submodular maximization subject to a knapsack constraint, even without resorting to the assumption of pointwise adaptive submodularity that is typically made in the literature (Krause et al., 2008; Gotovos et al., 2015; Amanatidis et al., 2020).
Implicitly related to our quest for few value oracle calls is the recent line of work on the adaptive complexity of submodular maximization, which measures the number of sequential rounds of independent value oracle calls needed to obtain a constant-factor approximation (see Balkanski and Singer, 2018; Balkanski et al., 2019; Fahrbach et al., 2019b,a; Kuhnle, 2021; Amanatidis et al., 2021, and references therein). For non-monotone functions and a knapsack constraint, in particular, an earlier continuous approach and Amanatidis et al. (2021) give O(1)-approximation algorithms that need O(log^2 n) and O(log n) rounds of independent value oracle calls, respectively. While the former requests O(n^2) value queries, Amanatidis et al. (2021) use only O(n log^3 n) value queries. Although the latter result is posterior to our work, we are still faster by a log^2 n factor, while retaining a better approximation factor.

Preliminaries
In this section we formally introduce the problem of submodular maximization with a knapsack constraint in both the standard and the adaptive setting.
We consider general (i.e., not necessarily monotone), normalized (i.e., v(∅) = 0), non-negative submodular valuation functions; a function v : 2^E → R is submodular if it exhibits diminishing returns, i.e., v(S ∪ {i}) − v(S) ≥ v(T ∪ {i}) − v(T) for all S ⊆ T ⊆ E and i ∉ T. Since marginal values are used extensively, we adopt the shortcut v(T | S) for the marginal value of a set T with respect to a set S, i.e., v(T | S) = v(T ∪ S) − v(S). If T = {i}, we simply write v(i | S). While this is the most standard definition of submodularity in this setting, there are alternative equivalent definitions that will be useful later.
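As a concrete illustration of these definitions, a small coverage function (a standard submodular example, not one taken from this paper) can be used to check the marginal-value shorthand and the diminishing-returns property:

```python
def coverage_value(S, cover):
    """v(S) = size of the union of the ground subsets indexed by S;
    coverage functions are normalized, monotone, and submodular."""
    return len(set().union(*(cover[i] for i in S))) if S else 0

def marginal(v, T, S):
    """The shorthand v(T | S) = v(T ∪ S) − v(S)."""
    return v(T | S) - v(S)
```

For instance, with cover = {0: {1, 2}, 1: {2, 3}, 2: {4}}, the marginal of element 1 shrinks from 2 (on the empty set) to 1 (on top of {0}), as diminishing returns dictates.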
Theorem 1 (Nemhauser et al. (1978)). Given a function v : 2^E → R, the following are equivalent:

Moreover, we restate a key result connecting random sampling and submodular maximization. The original version of the theorem is due to Feige et al. (2011), although here we use a variant from Buchbinder et al. (2014).
Lemma 1 (Lemma 2.2 of Buchbinder et al. (2014)). Let v : 2^E → R be a (possibly not normalized) submodular set function, let X ⊆ E, and let X(p) be a sampled subset, where each element of X appears with probability at most p (not necessarily independently). Then E[v(X(p))] ≥ (1 − p) v(∅).

We assume access to a value oracle that returns v(S) when given a set S as input.
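The conclusion E[v(X(p))] ≥ (1 − p) v(∅) can be verified exactly on a tiny non-monotone instance; below, a cut function is shifted by a fixed set so that it is not normalized (the graph and the independent sampling scheme are illustrative, and independent inclusion with probability p satisfies the lemma's "at most p" requirement):

```python
from itertools import combinations

def cut_value(S, edges):
    """Undirected cut function: edges with exactly one endpoint in S.
    Submodular, non-negative, and non-monotone."""
    return sum((u in S) != (w in S) for u, w in edges)

def expected_sampled_value(g, ground, p):
    """Exact E[g(X(p))], where each element of `ground` lands in X
    independently with probability p, by full enumeration."""
    total = 0.0
    for r in range(len(ground) + 1):
        for X in combinations(ground, r):
            prob = p ** len(X) * (1 - p) ** (len(ground) - len(X))
            total += prob * g(set(X))
    return total
```

On a triangle with g(T) = cut(T ∪ {0}), we have g(∅) = 2 > 0, and the bound holds for every p.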
Knapsack Constraint. We associate a positive cost c_i with each element i ∈ E and consider a given budget B. The goal is to find a subset of E of maximum value among the subsets whose total cost is at most B. Formally, we want some S* ∈ arg max{v(S) | S ⊆ E, Σ_{i∈S} c_i ≤ B}. In the following, we denote by OPT the value of S*. Without loss of generality, we may assume that c_i ≤ B for all i ∈ E, since any element with cost exceeding B is not contained in any feasible solution and can be discarded. Given the hardness of this optimization problem, in this work we focus on finding solutions that approximate an optimal one; in particular, given an algorithm, we denote by ALG the expected value of the set it outputs, and we say that it gives an α-approximation if OPT ≤ α · ALG.

Adaptive Submodularity. We now present the adaptive optimization framework (Krause et al., 2008). On a high level, in many applications (e.g., sensor placement, traffic control, and influence maximization) we do know how the world works and which situations occur with which probability; however, which of those we are actually dealing with is inferred over time from the bits of information we learn. To model such situations, along with the set E, we introduce a state space Ω endowed with some probability measure. By ω = (ω_i)_{i∈E} ∈ Ω we specify the state of each element of E. The adaptive valuation function v is then defined over 2^E × Ω; the value of a subset S ⊆ E depends on both the subset and ω. Due to the probability measure over Ω, v(S, ω) is a random variable. We define v(S) = E[v(S, ω)], the expectation being with respect to ω. As before, the costs c_i are deterministic and known in advance.
For each ω ∈ Ω and S ⊆ E, we define the partial realization of state ω on S as the pair (S, ω_{|S}), where ω_{|S} = (ω_i)_{i∈S}. It is natural to assume that the true value of a set S does not depend on the whole state but only on ω_{|S}, i.e., v(S, ω) = v(S, ψ) for all ω, ψ ∈ Ω such that ω_{|S} = ψ_{|S}. Therefore, we sometimes overload the notation and write v(S, ω_{|S}) instead of v(S, ω). There is a natural partial ordering on the set of all possible partial realizations: (S, ω_{|S}) precedes (T, ψ_{|T}) if S ⊆ T and ω_{|S} = ψ_{|S}. We are now ready to introduce the concepts of adaptive submodularity and monotonicity.
In Section 4 we assume access to a value oracle that given an element i and a partial realization returns the expected marginal value of i. Using the properties of conditional expectation, it is straightforward to show that if v(·, ·) is adaptive submodular, then its expected value v(·) is submodular. In analogy with Gotovos et al. (2015), we assume v to be pointwise submodular, i.e., v(·, ω) is a submodular set function for each ω ∈ Ω.
In this framework it is possible to define adaptive policies to maximize v. An adaptive policy π is a function which associates with every partial realization a distribution on the next element to be added to the solution. The optimal solution to the adaptive submodular maximization problem is an adaptive policy that maximizes the expected value while respecting the knapsack constraint (the expectation being taken over Ω and the randomness of the policy itself). Notice that the knapsack constraint has to be respected pointwise, i.e., for each realization of the randomness of the algorithm and of the state. Approximation is defined similarly to the non-adaptive case.
The following result of Gotovos et al. (2015) that holds even for non-monotone objectives will come handy later.
Lemma 2 (Lemma 1 of Gotovos et al. (2015)). If v is adaptive submodular, then, for any k ∈ N, any policy π that terminates after k steps, and any partial realization (S, ω_{|S}), it holds that E[v(S(π))] ≤ v(S, ω_{|S}) + Σ_{i∈M_k} E[v(i | S, ω_{|S})], where S(π) is the (random) set selected according to the policy π and M_k is the set containing the k elements with the highest marginal values given (S, ω_{|S}). The expectation on the left-hand side is with respect to the randomness of S(π).
We now present a brief example to help the reader familiarize themselves with the notion of adaptive submodularity. Consider the task of finding a maximum cut in a graph whose edge weights are drawn from some known distribution. Here a state corresponds to a realization of the edge weights on the whole graph, while partial realizations of a state are the restrictions of the realized edge weights to a specific subset of the edges. The distributions on different edges may be correlated, so every time the actual weight of an edge is revealed, the priors on the still-unknown edge weights are updated. Finally, the adaptive submodular objective function depends on the topology of the graph (which is known up front) and on the realized weights. More on this example can be found in the experimental section.
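Under the assumptions of this example, the expected cut value given a partial realization can be estimated by sampling the still-unknown weights from the conditional prior (a minimal sketch; the `sample_hidden` interface is hypothetical and stands in for whatever conditional prior the application provides):

```python
import random

def cut_weight(S, weights):
    """Cut value of S under one realization of the edge weights."""
    return sum(w for (u, v), w in weights.items() if (u in S) != (v in S))

def expected_cut(S, revealed, sample_hidden, trials=1000, rng=None):
    """Monte-Carlo estimate of E[v(S, ω)] given the partial realization
    `revealed` (weights already learned); `sample_hidden(revealed, rng)`
    draws the remaining edge weights from the conditional prior, which may
    encode correlations between edges."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        weights = dict(revealed)
        weights.update(sample_hidden(revealed, rng))
        total += cut_weight(S, weights)
    return total / trials
```

With a degenerate (deterministic) prior the estimate coincides with the true cut value, which gives a quick sanity check of the interface.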

A Simple Algorithm for the Standard Algorithmic Setting
We present and analyze SampleGreedy, a randomized 5.83-approximation algorithm for maximizing a submodular function subject to a knapsack constraint. As already mentioned, SampleGreedy is based on the adapted density greedy algorithm of Wolsey (1982). Since the latter may perform arbitrarily badly for non-monotone objectives, we add a sampling phase similar to that of the Sample Greedy algorithm of Feldman et al. (2017). First we present the algorithm and its analysis in detail, then we show that a lazy implementation suffices to achieve O(n log n) query complexity, and finally we show that better approximation guarantees are achievable in the large instance scenario.
The algorithm. SampleGreedy first selects a random subset E′ of E by independently picking each element with probability p. Then it runs Wolsey's algorithm only on E′. To formalize this second step, using v(i) as a shorthand for v({i}), let j_1, j_2, . . . be the elements of E′ in the greedy order, i.e., j_i maximizes the marginal density v(j | {j_1, . . . , j_{i−1}})/c_j among the remaining elements. If ℓ is the largest integer such that Σ_{i=1}^{ℓ} c_{j_i} ≤ B, then S = {j_1, . . . , j_ℓ}. In the end, the output is the one of larger value between S and a best single element from arg max_{i∈E} v(i).
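The coin-per-item form of SampleGreedy can be sketched compactly as follows (a sketch only: names are illustrative, p = √2 − 1 ≈ 0.41 is the value used in the analysis, and minor details such as the treatment of non-positive marginals follow the presentation above only loosely):

```python
import random

def sample_greedy(elements, v, cost, budget, p=0.41, rng=None):
    """Sketch of SampleGreedy: at each step find the feasible element of
    maximum density, toss a p-biased coin to decide whether to keep it,
    and finally return the better of the greedy set and the best singleton."""
    rng = rng or random.Random(0)
    solution, spent = set(), 0.0
    remaining = set(elements)
    while remaining:
        feasible = [e for e in remaining if spent + cost[e] <= budget]
        if not feasible:
            break
        base = v(solution)
        e = max(feasible, key=lambda x: (v(solution | {x}) - base) / cost[x])
        remaining.remove(e)                  # each item is considered once
        if rng.random() <= p and v(solution | {e}) > base:
            solution.add(e)
            spent += cost[e]
    i_star = max((e for e in elements if cost[e] <= budget),
                 key=lambda e: v({e}))
    return solution if v(solution) >= v({i_star}) else {i_star}
```

Setting p = 1 recovers the deterministic adapted density greedy, which is a convenient way to test the skeleton.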
We formally present the algorithm in pseudocode below. Notice that, to simplify the analysis, instead of selecting the entire set E′ at the start of the algorithm, we defer this decision and toss a coin with success probability p each time an item is considered for addition to the solution. The two versions of the algorithm are equivalent.
Proof. For the analysis of the algorithm we use an auxiliary set O, an extension of the set S that respects the knapsack constraint and uses feasible items from an optimal solution. In particular, let S* be an optimal solution and let s_1, s_2, . . . , s_r be its elements, sorted in increasing order of cost. Then O is a fuzzy (i.e., partially fractional) set that is initially equal to S* and is updated during each iteration of the while loop. If an item j was considered (in line 6) in some iteration of the while loop, then let S_j and O_j denote the sets S and O, respectively, at the beginning of that iteration; moreover, let O′_j denote O at the end of that iteration. If j was never considered, then S_j and O_j (or O′_j) denote the final versions of S and O, respectively. In fact, in what follows we exclusively use S and O for their final versions. It should be noted that, for all j ∈ E, S_j ⊆ O_j, and also that no item in O_j \ S_j has been considered in any of the previous iterations of the while loop.
Before stating the next lemma, let us introduce some notation for the sake of readability. Note that, by construction, O \ S is either empty or consists of a single fractional item î. In case O \ S = ∅, we denote by î the last item removed from O. For every j ∈ E, we define Q_j = O_j \ O′_j, i.e., the set of (possibly fractional) items removed from O during the iteration in which j was considered. Note that if j was never considered during the execution of the algorithm, then Q_j = ∅.

Lemma 3. For every realization of the Bernoulli random variables, it holds that
Proof of Lemma 3. Assume that the random bits r_1, r_2, . . . are fixed. Also, without loss of generality, assume the items are numbered according to the order in which they are considered by SampleGreedy, with the ones not considered by the algorithm numbered arbitrarily (but after the ones considered). That is, item j, if considered, is the item considered during the j-th iteration. Consider now any round j of the while loop of SampleGreedy. An item is removed from O_j in two cases. First, it could be item j itself, which was originally in S* but had r_j = 0 (and hence will never get back into O_k for any k > j). Second, it could be some other item that was in S* and is taken out to make room for the new item j. In the latter case, the only possibility for the removed item to return to O_k for some k > j is to be selected by the algorithm and inserted into S. We can hence conclude that Q_j ∩ Q_k = ∅ for all j ≠ k. In addition, by the definition of the Q_j's, everything that ever leaves O leaves during some iteration. Therefore, if items 1, 2, . . . , ℓ were all the items ever considered, then, using submodularity and the above facts, we obtain the chain of inequalities (1), where, in a slight abuse of notation, we take c_x to be the fractional (linear) cost if x ∈ Q_j is a fractional item. While the first three inequalities directly follow from the submodularity of v, for the last inequality we need to combine the optimality of v(j | S_j)/c_j at the step j was selected with the fact that every single item x appearing in the sum Σ_{j=1}^{ℓ} Σ_{x∈Q_j} v(x | S_j) was feasible (as a whole item) at that step. The latter is true because of the way we remove items from O: if x is removed, it is removed before (any part of) î is removed; thus, x is removed when the available budget is still at least c_î. Given that c_x ≤ c_î, we get that x is feasible until removed.
To conclude the proof of the lemma, it is sufficient to note that c(Q_j) = 0 for all items j that were not considered.
While the previous lemma holds for each realization of the random coin tosses of the algorithm, we next consider inequalities holding in expectation over the randomness of {r_i}_{i=1}^{|E|} in SampleGreedy. The indexing of the elements is hence to be considered deterministic and fixed in advance, unlike in the proof of Lemma 3.
Proof of Lemma 4. For all i ∈ E, we define G_i to be the random gain due to i at the time i is added to the solution (G_i = v(i | S_i) if i is added and 0 otherwise). Since v(S) = Σ_{i∈E} G_i, by linearity it suffices to show that inequality (2) holds in expectation over the coin tosses. To achieve that, following Feldman et al. (2017), let E_i be any event specifying the random choices of the algorithm up to the point i is considered (if i is never considered, E_i captures all the randomness). If E_i is an event that implies i is not considered, then Eq. (2) is trivially true, since G_i = 0 and Q_i = ∅. We focus now on the case where E_i implies that i is considered. Analyzing the algorithm, it is here that we use the fuzziness of O: without the fractional items it would be hopeless to bound c(Q_i) by c_i. At this point, we exploit the fact that E_i contains the information on S_i, i.e., S_i = S_i(E_i) deterministically. Recall that S_i is the solution set at the time item i is considered by the algorithm.
We can hence conclude the proof by using the law of total probability over the events E_i and the monotonicity of conditional expectation.

Proof of Lemma 5. Let S* be an optimal set for the constrained submodular maximization problem. We define g : 2^E → R_+ by g(T) = v(T ∪ S*). It is a simple exercise to see that this function is indeed submodular; moreover, g(∅) = v(S*). If we now apply Lemma 1 to g, observing that each element of the set S output by the algorithm is chosen with probability at most p, we conclude that E[v(S ∪ S*)] = E[g(S)] ≥ (1 − p) g(∅) = (1 − p) v(S*).

Combining Lemmata 3, 4 and 5, and substituting √2 − 1 for p, yields the claimed bound, where the second inequality follows from Jensen's inequality.
Lazy implementation of SampleGreedy. A naive implementation of SampleGreedy needs Θ(n^2) value oracle calls in the worst case: in each iteration all the remaining elements have their marginals updated, and for large enough B the greedy solution may contain a constant fraction of E. Applying lazy evaluations (Minoux, 1978), however, we can cut the number of queries down to O(nε^-1 log(n/ε)), losing only an additive ε in the approximation factor. To achieve this, instead of recomputing all the marginals at every step, we maintain the elements in a priority queue in decreasing order of their last known densities (i.e., their marginal-value-to-cost ratios), initialised with the ratios v(i)/c_i, and use it to find a sufficiently good element to add. At each step we pop the element at the top of the queue. If its density with respect to the current solution is within a 1 + ε factor of its old one, then it is picked by the algorithm; otherwise it is reinserted into the queue according to its new density and we pop the next element. Submodularity guarantees that the density of a picked element is at least 1/(1 + ε) times the best density available at that step. As soon as an element has been updated log(n/ε)/ε times, we discard it.
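One step of this lazy rule can be sketched with a binary heap (a sketch under the stated (1 + ε) acceptance rule; the function names, the update-count bookkeeping, and tie handling are simplified for illustration):

```python
import heapq

def lazy_pop_best(v, solution, heap, eps, updates, max_updates):
    """Pop elements by stale density; accept one as soon as its fresh density
    is within a (1 + eps) factor of its stale value, otherwise re-push it with
    the fresh density. Elements updated more than max_updates times are dropped.
    `heap` holds (-stale_density, element, cost) triples."""
    base = v(solution)
    while heap:
        neg_stale, e, c = heapq.heappop(heap)
        fresh = (v(solution | {e}) - base) / c
        if fresh >= -neg_stale / (1 + eps):
            return e, fresh                       # good enough: pick it
        updates[e] = updates.get(e, 0) + 1
        if updates[e] <= max_updates:
            heapq.heappush(heap, (-fresh, e, c))  # retry with the new density
    return None, 0.0
```

Because submodularity only ever decreases densities, a stale entry at the top of the heap upper-bounds every fresh density below it, which is what makes the early acceptance sound.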
Proof. For a given ε ∈ (0, 1), let ε′ = ε/6. We perform lazy evaluations using ε′, with log denoting the binary logarithm. It is straightforward to bound the number of value oracle calls: since the marginal value of each element is updated at most log(n/ε′)/ε′ times, we have a total of at most n log(n/ε′)/ε′ = O(n log(n/ε)/ε) function evaluations. The approximation ratio is also easy to show; there are two distinct sources of loss. We first bound the total value of the elements discarded due to too many updates. This value appears as an extra additive term in the first line of (1). Indeed, now besides Σ_{j=1}^{ℓ} v(Q_j | S ∪ ⋃_{r=j+1}^{ℓ} Q_r) we need to account for the elements of O that were ignored because of too many updates. Such elements, once they become "inactive", do not contribute to the cost of the current O and are never pushed out as new elements enter S. The definition of the Q_j's in the proof of Theorem 2 should be adjusted accordingly; that is, W_j denotes the elements of O that become inactive because they were updated too many times during iteration j. By noticing that for x ∈ (0, 1) it holds that x ≤ log(1 + x), the total contribution of these elements can be bounded. For the second source of loss, recall that the marginals only decrease, due to submodularity. So, if some item j is considered during iteration j (following the renaming of Lemma 3), then (1 + ε′) v(j | S_j)/c_j ≥ max_{k∈F} v(k | S_j)/c_k. The only difference this makes, compared to the proof of Theorem 2, is that the last inequality of (1) gains an extra factor of 1 + ε′.
Combining the above, we get the corresponding analog of Lemma 3, which carries over to inequality (3), while Lemmata 4 and 5 are not affected at all. It is then a matter of simple calculations to see that for p = √2 − 1 we still get v(S*) ≤ (3 + 2√2 + ε) max{E[v(S)], v(i*)}. The final passage is as before.
Large instance scenario. Additionally, our analysis implies that SampleGreedy performs significantly better in the large instance scenario, i.e., when the value of the optimal solution is much larger than the value of any single element. While exact knowledge of the factor δ in the following proposition cannot be expected, some estimate is often available. Especially for massive instances, it is reasonable to assume that δ is bounded by a very small constant.
Proof. Starting from inequality (3) and exploiting the large instance property, we get: Rearranging the terms and assuming p + δ < 1, we have: Optimizing for p ∈ (0, 1 − δ) we get the desired statement.

Adaptive Submodular Maximization
In this section we modify SampleGreedy to achieve a good approximation guarantee in the adaptive framework. Recall that the adaptive valuation function v(· , ·) depends on the state of the system which is discovered a bit at a time, in an adaptive fashion. Indeed, SampleGreedy is compatible with this framework and can be applied nearly as it is. We stick to the interpretation of SampleGreedy discussed right before Theorem 2. That is, there is no initial sampling phase. Instead, we directly begin to choose greedily with respect to the density (marginal value with respect to the current solution over cost). Each time we are about to pick an element of E, we throw a p-biased coin that determines whether we keep or discard the element.
Here the main difference with the greedy part of SampleGreedy is that the marginals are considered with respect to the partial realization of the current solution. Moreover, since it is not possible to return the larger of max_{i∈E} v(i) and the result of the greedy exploration, the choice between these two quantities has to be settled before starting the exploration. Formally, at the beginning of the algorithm a p_0-biased coin is tossed to decide between the two. The pseudo-code for the resulting algorithm, AdaptiveGreedy, is given below.
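The overall structure can be sketched as follows. The names and the oracle interface are ours: `observe_marginal(i, S)` stands in for the adaptive oracle, returning the marginal value of i given the current solution S and the partial realization observed so far (the caller tracks the hidden state).

```python
import random

def adaptive_greedy(elements, cost, budget, observe_marginal,
                    p=1/6, p0=1/3, rng=None):
    """Structural sketch of AdaptiveGreedy: a p0-biased coin settles the
    choice between the best single item and the greedy exploration; each
    considered element is kept or discarded by a p-biased coin."""
    rng = rng or random.Random(0)
    if rng.random() < p0:
        # Commit up front to the best single item (in expectation).
        return {max(elements, key=lambda i: observe_marginal(i, set()))}
    S, spent = set(), 0.0
    remaining = set(elements)
    while remaining:
        feasible = [i for i in remaining if spent + cost[i] <= budget]
        if not feasible:
            break
        # Greedy by density w.r.t. the partial realization of S.
        i = max(feasible, key=lambda i: observe_marginal(i, S) / cost[i])
        remaining.discard(i)
        if observe_marginal(i, S) <= 0:
            break
        if rng.random() < p:  # p-biased coin: keep or discard i
            S.add(i)
            spent += cost[i]
    return S
```

The defaults p_0 = 1/3 and p = 1/6 are the values from Theorem 5; in the experiments p_0 = 0 and a much larger p are used.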
Before proving that AdaptiveGreedy works as promised, we need some observations. Let us denote by S the output of a run of our algorithm, and by S* the output of a run of the optimal adaptive strategy. Fix a realization ω ∈ Ω. Using pointwise submodularity and directly applying Lemma 1, we obtain the corresponding inequality. Since ω (and therefore S*) is fixed, the only randomness is due to the coin flips in our algorithm. We stress that the union of S and S* is to be understood in the following sense: run our algorithm and, independently, also the optimal one, both for the same realization ω. The previous inequality is true for any ω, so, by the law of total probability, it also holds in expectation. For the next observation, assume our algorithm has picked (and therefore observed) exactly the set S; that is, we know only ω_{|S}. We number all items of E with positive marginal value with respect to (S, ω_{|S}) by decreasing ratio v(i | (S, ω_{|S}))/c_i, i.e., j_k = arg max_{i∈E\{j_1,…,j_{k−1}}} v(i | (S, ω_{|S}))/c_i for k = 1, …, |E|. Given this ordering, we can refer to the k best-looking items in F given S, for any specific k ≤ |E|. In particular, for k = min{i ∈ N | Σ_{l=1}^{i} c_{j_l} ≥ B}, we may apply Lemma 2 to get inequality (5). For the sake of the analysis, the last element j_k is considered fractionally, so that Σ_{l=1}^{k} c_{j_l} = B. Note that k may not be well defined, as there may not be enough elements with positive marginal value to fill the knapsack. If that is the case, just take k to be the number of elements with positive marginals. In what follows, we refer to this (possibly fractional) set of best-looking items given S fitting into the budget as D.
The point of inequality (5) is that, given (S, ω_{|S}) and the coin tosses realized so far, the set D is deterministic (as is its cardinality k), while S* is not, because it corresponds to the set selected by the best adaptive policy. Moreover, in the middle term, notice that the conditioning influences the valuation but not the policy, since we assume it runs obliviously. This is fundamental for the analysis.
Since this holds for any set S, we can again take expectations over all possible runs of the algorithm in all the terms of inequality (5), therefore obtaining inequality (6). We remark that k above is a random variable that depends on S. We use these observations to prove the approximation ratio of our algorithm.
Theorem 5. For p_0 = 1/3 and p = 1/6, AdaptiveGreedy yields a 9-approximation of opt_Ω, while its lazy version achieves a (9 + ε)-approximation using O(nε^{-1} log(n/ε)) value oracle calls. Moreover, when max_{i∈E} v(i) ≤ δ · opt_Ω for δ ∈ (0, 1/2), then for p_0 = 0 and p = (√(3 − 2δ) − 1)/2, AdaptiveGreedy yields a (4 + 2√3 + ε_δ)-approximation. Proof. For any run of the algorithm, i.e., a fixed set S, the corresponding partial realization ω_{|S} and the coin flips observed, define for convenience the set C as those items in D that were considered during the algorithm and then not added to S because of the coin flips. Define U = D \ C. Additionally, define C′ to be the set of all items that are considered, but not chosen, during the run of our algorithm and have positive expected marginal contribution to S. C captures the best-looking items given S fitting into the budget that we missed due to coin tosses. We have C ⊆ C′; indeed, C′ contains all the elements with positive marginal value with respect to S that were discarded by the coin tosses, while C contains only the best of those. We can then split the left-hand side of inequality (6) into two parts: the sum over C (upper bounded by the sum over C′) and the sum over U. We now control these terms separately using linear combinations of v(S) and v(i*).
Proof. Since C ⊆ C′ and C′ contains all considered elements with nonnegative expected contribution to S, it suffices to show E[v(S)] ≥ p · E[Σ_{i∈C′} v(i | (S, ω_{|S}))]. We proceed as in Lemma 4. Consider, for each i ∈ E, the events E_i capturing the history of a run of the algorithm up to the point where element i is considered (or the entire history, if it is never considered).
Let G_i be the marginal contribution of element i to the solution set S. If E_i corresponds to a history in which element i is never considered, then it contributes to neither side of the inequality we are trying to prove. Otherwise, let (S_i, ω_i) be the partial solution at the moment it is considered. The statement then follows from the law of total probability with respect to the E_i, and the pointwise submodularity of v.
Proof. Let us now turn to the items in U that were not considered by the algorithm. The intuition behind the claim is that if they were not considered, then they were not good enough, in expectation, to compare with S. The proof, though, has to deal with some probabilistic subtleties. We start by fixing a history of the algorithm, i.e., the coin tosses and (S, ω_{|S}), with S = {s_1, s_2, …, s_T} numbered according to insertion order, i.e., s_j is the j-th element added to S.
There are two cases. If throughout the algorithm the elements in U have ratio v(i | (S_j, ω_{|S_j}))/c_i smaller than that of the item that was considered instead, then one can easily argue, by adaptive submodularity, that the displayed inequality holds, where S_t = (s_1, …, s_{t−1}) and ω_t is the restriction of ω_{|S} to S_t. Note that the last element u_1 is added to account for the budget left unspent by the solution. The first inequality holds because our solution fills the whole budget (up to at most one item) with densities that are better than all the v(i | (S, ω_{|S}))/c_i for i ∈ U. We claim that the above inequality also holds in the case where there is an element in U whose marginal-over-cost ratio is greater than that of some element in S. Such an element can exist because of the budget constraint: during the algorithm it had a better marginal-over-cost ratio, but was discarded because there was not enough room for it. We observe that there can be at most one such element, due to the budget constraint, and since its value is upper bounded by u_1, the above formula still holds.
Once we know that, we can conclude the proof by applying the law of total probability. Combining the two Lemmata, we get the stated bound, and inequality (6) then yields the corresponding inequality. Also, with some rewriting and inequality (4), and denoting E[v(S*)] by OPT, we get a bound on OPT. Let ALG denote the expected value of the solution output by the algorithm. Since the algorithm chooses with a coin flip either the best expected single item or S, we obtain the corresponding guarantee. Picking p_0 = 3p/(3p + 1), the right-hand side is minimized for p = 1/6, concluding the proof of the first part of the statement.
The lazy version of AdaptiveGreedy is analogous to the non-adaptive setting, both for the algorithm and the analysis, so we omit repeating the proof.
In order to prove the last part of the statement, we start from inequality (7) and apply the large instance property. Rearranging terms and assuming p + δ < 1, and then optimizing over p ∈ (0, 1 − δ), we get the claimed result. Specifically, for p = (√(3 − 2δ) − 1)/2 the approximation factor is 4 + 2√3 + ε_δ, with ε_δ vanishing as δ → 0.

Experiments
Out of the numerous applications of submodular maximization subject to a knapsack constraint, we evaluate SampleGreedy and AdaptiveGreedy on two selected examples, using real and synthetic graph topologies. Variants of these have been studied in a similar context; see Mirzasoleiman et al. (2016). As our algorithms are randomized, but extremely fast, we use the best output out of 5 iterations. All plots presenting the experimental results contain error bars, indicating the standard deviation between the different runs of the experiments. This is usually insignificant due to the concentrating effect of the large size of the instances, despite the randomly initialized weights and the inherent randomness of the algorithms used. For all algorithms involved, we use lazy evaluations with ε = 0.01. A delicate point is tuning the probability of acceptance p (line 9 of AdaptiveGreedy) for improved performance. While the choices of p in Theorems 2 and 5 minimize our theoretical worst-case approximation factor, there are two factors suggesting that a value much closer to 1 works best in practice: the small value of any singleton solution, and the much better guarantee of Lemma 5 for most widely used non-monotone submodular objectives. We do not micro-optimize p but rather choose it uniformly at random from [0.9, 1].
Video Recommendation: Suppose we have a large collection E of videos from various categories (represented as possibly intersecting subsets C 1 , . . . , C k ⊆ E) and we want to design a recommendation system. When a user inputs a subset of categories and a target total length B, the system should return a set of videos from the selected categories of total duration at most B that maximizes an appropriate objective function. (Of course, instead of time here, we could use costs and a budget constraint.) Each video has a rating and there is some measure of similarity between any two videos. We use a weighted graph on E to model the latter: each edge {i, j} between two videos i and j has a weight w ij ∈ [0, 1] capturing the percentage of their similarity. To pave the way for our v(·), we start from the auxiliary objective f (S) = i∈S j∈E w ij − λ i∈S j∈S w ij , for some λ ≥ 1 (Lin and Bilmes, 2010;Mirzasoleiman et al., 2016). This is a maximal marginal relevance inspired objective (Carbinell and Goldstein, 2017) that rewards coverage, while penalizing similarity. For λ = 1, internal similarities are irrelevant and f becomes a cut function. However, one can penalize similarities even more severely as f is submodular for λ ≥ 1 (e.g., Lin and Bilmes (2010) use λ = 5).
In order to mimic the effect of a partition matroid constraint, i.e., the avoidance of many videos from the same category, we may use two parameters λ ≥ 1, µ ≥ 0. While λ is as above, µ puts extra weight on similarities between videos that belong to the same category. That leads to a more general auxiliary objective g(S) = Σ_{i∈S} Σ_{j∈E} w_ij − Σ_{i∈S} Σ_{j∈S} (λ + χ_ij µ)w_ij, where χ_ij is equal to 1 if there exists ℓ such that i, j ∈ C_ℓ, and 0 otherwise. To interpolate between choosing highly rated videos and videos that represent the whole collection well, here we use the submodular function v(S) = α Σ_{i∈S} ρ_i + βg(S) for α, β ≥ 0, where ρ_i is the rating of video i. We use λ = 3, µ = 7 and set the parameters α, β so that the two terms are of comparable size.
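A minimal sketch of the objective just defined, with a dictionary-based similarity representation of our own choosing (the α, β defaults are placeholders, not the values used in the experiments):

```python
def video_objective(S, E, w, ratings, categories,
                    alpha=1.0, beta=1.0, lam=3.0, mu=7.0):
    """v(S) = alpha * sum of ratings + beta * g(S), where g rewards
    coverage and penalizes similarity, with extra penalty mu for pairs
    in the same category. w[(i, j)] is a symmetric similarity in [0, 1]."""
    same_cat = lambda i, j: any(i in C and j in C for C in categories)
    coverage = sum(w.get((i, j), 0.0) for i in S for j in E)
    penalty = sum((lam + (mu if same_cat(i, j) else 0.0)) * w.get((i, j), 0.0)
                  for i in S for j in S)
    return alpha * sum(ratings[i] for i in S) + beta * (coverage - penalty)
```

Note that adding a second video from the same category triggers the extra µ-penalty on its internal similarities, which is exactly the non-monotone behaviour the model is after.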
We evaluate SampleGreedy on an instance based on the latest version of the Movie-Lens dataset (Harper and Konstan, 2015), which includes 62000 movies, 13816 of which have both user-generated tags and ratings.
We calculate the weights w_ij using these tags, while the costs are drawn independently from U(0, 1). The tag vectors are not normalized and have no additional structure, other than each coordinate being restricted to [0, 1]. We define the similarity w_ij between two movies i and j with tag vectors t_i and t_j as w_ij = ‖min(t_i, t_j)‖_2, i.e., the L2 norm of the coordinate-wise minimum of t_i and t_j. This metric was chosen so that if both movies have a high value in some tag, this counts as a much stronger similarity than one having a high value and the other a low one. For example, under an inner product metric, any movie with all tags set to 1 would be as similar as possible to all other movies, even though it would include many tags missing from the others. In particular, any movie would appear more similar to the all-1 movie than to itself! Choosing the minimum of both tags avoids this issue. Another possibility would be to normalize each tag vector before taking the inner product, to obtain the cosine similarity. Although this alleviates some of the issues, there is some information loss, as one movie could meaningfully have higher scores in all tags than another one; tags are not mutually exclusive. Ultimately, any sensible metric has advantages and disadvantages, and the exact choice has little bearing on our results. The similarity scores are then divided by their maximum as a final normalization step. We compare against the FANTOM algorithm of Mirzasoleiman et al. (2016), as it is the only other algorithm with a provable approximation guarantee that runs in reasonable time. Continuous greedy approaches (Feldman et al., 2011) and the repeated greedy of Gupta et al. (2010) are prohibitively slow. SampleGreedy consistently performs better than FANTOM for a wide range of budgets (Fig. 1a). Plotting the number of function evaluations against the budget, SampleGreedy is much faster (Fig. 1d), despite being run 5 times! The experiment was repeated 5 times.
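The (unnormalized) similarity metric can be written in a few lines. The test below also illustrates the point made about the inner product: the all-ones vector dominates a movie's inner-product self-similarity, whereas under the minimum-based metric it can never exceed it.

```python
import math

def tag_similarity(t_i, t_j):
    """Unnormalized similarity between two tag vectors: the L2 norm of
    their coordinate-wise minimum (each coordinate lies in [0, 1])."""
    return math.sqrt(sum(min(a, b) ** 2 for a, b in zip(t_i, t_j)))
```

Since min(a, 1) = a for a in [0, 1], the all-ones vector is exactly as similar to a movie as the movie is to itself, never more.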
The budget is represented as a fraction of the total cost starting at 1/100 and geometrically increasing to 1/10 in 10 steps. The total computation time was around 3 hours.
Remark 1. The running time of FANTOM for fixed ε is O(nr log n), where r is the cardinality of the largest feasible solution. For a knapsack constraint this translates to O(n 2 log n). To be as fair as possible, we implemented FANTOM using lazy evaluations, which improves the number of evaluations of the objective function to O(n log 2 n) and is indeed much faster in practice, for the knapsack sizes we consider. Even so, our SampleGreedy is faster by a factor of Ω(log n) which, including the improvement in the constants involved, still makes a huge difference. Note that in both Figs. 1d and 1e one can discern the superlinear increase of the function evaluations for FANTOM but not for SampleGreedy.
Remark 2. The role of our parameter p is different from that of the ε precision parameter in FANTOM. The role of ε in (Mirzasoleiman et al., 2016) is to control the density of the search space grid. The behaviour of the solution for varying ε is roughly monotone, i.e., the smaller the ε, the finer the grid, generally yielding better solutions and requiring more value queries. In our case, the instance-specific optimal p is unknown and reflects, to some extent and roughly speaking, "how non-monotone the function is". Picking smaller values for ε in our experiments would slightly improve the quality of the solution output by FANTOM, but would also increase its run time to the point where it would become impractical to run it on any large instance. Choosing ε = 1, we believe that we give a fair and reasonable comparison.
Figure 1: Performance of SampleGreedy and FANTOM on the video recommendation problem for the MovieLens dataset (a), (d) and on the maximum weighted cut problem on random graphs (b), (e). Since no ε ≤ 1 affected the performance of FANTOM noticeably before becoming too computationally expensive, we used ε = 1 to achieve the maximum possible speedup. The plots on the far right illustrate the performance of AdaptiveGreedy (ignoring single item solutions, i.e., p_0 = 0) on the influence-and-exploit problem for two distinct topologies: the YouTube graph (c) and random graphs (f). All budgets are shown as fractions of the total cost.

Influence-and-Exploit Marketing: Consider a seller of a single digital good (i.e., producing extra units of the good comes at no extra cost) and a social network on a set E of potential buyers. Suppose that the buyers influence each other and this is quantified by a weight w_ij on each edge {i, j} between buyers i and j. Each buyer's value for the good depends on who owns it within her immediate social circle and how they influence her. A possible revenue-maximizing strategy for the seller is to first give the item for free to a selected set S of influential buyers (influence phase) and then extract revenue by selling to each of the remaining buyers at a price matching their value for the item due to the influential nodes (exploit phase). Here we further assume, similarly to the adaptation of this model by Mirzasoleiman et al. (2016), that each buyer comes with a cost of convincing her to advertise the product to her friends. The seller has a budget B and the set S should be such that Σ_{i∈S} c_i ≤ B.
We adopt the generalization of the Concave Graph Model of Hartline et al. (2008) to non-monotone functions (Babaei et al., 2013). Each buyer i ∈ E is associated with a nonnegative concave function f_i. For any i ∈ E and any set S ⊆ E \ {i} of agents already owning the good, the value of i for it is v_i(S) = f_i(Σ_{j∈S∪{i}} w_ij). The total potential revenue v(S) = Σ_{i∈E\S} v_i(S) that we aim to maximize is a non-monotone submodular function. Besides the theoretical guarantees for influence-and-exploit marketing in the Bayesian setting (Hartline et al., 2008), there is strong experimental evidence of its performance in practice (Babaei et al., 2013). The problem generalizes naturally to different stochastic versions. We assume that the valuation function of each buyer i has the above form, parameterized by a coefficient a_i drawn independently from a Pareto Type II distribution with λ = 1, α = 2. We only learn the exact value of a buyer when we give the good for free to someone in her neighborhood. We evaluate AdaptiveGreedy on an instance based on the YouTube graph (Yang and Leskovec, 2015), containing 1,134,890 vertices. The (known) weights are drawn independently from U(0, 1), and the costs are proportional to the sum of the weights of the incident edges. As AdaptiveGreedy is the first adaptive algorithm for the problem, we compare with non-adaptive alternatives like Greedy and Density Greedy for different values of the budget. AdaptiveGreedy outperforms the alternatives by up to 20% (Fig. 1c).
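The deterministic revenue objective v(S) = Σ_{i∈E\S} f_i(Σ_{j∈S∪{i}} w_ij) can be sketched as follows. For simplicity we use a single concave function f for all buyers (the model allows a distinct f_i per buyer); the names and the dictionary-based edge representation are ours.

```python
import math

def revenue(S, E, w, f=math.sqrt):
    """Influence-and-exploit revenue: buyers in S received the good for
    free and generate no revenue; every other buyer i pays f applied to
    the total weight of her edges towards S (plus the zero self-weight)."""
    return sum(f(sum(w.get((i, j), 0.0) for j in S | {i}))
               for i in E if i not in S)
```

The objective is non-monotone: moving a buyer into S removes her contribution from the revenue side, so adding elements can decrease the value.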
We observe similar improvements for Erdős-Rényi random graphs of different sizes, with edge probability 5/√n and a fixed budget of 10% of the total cost (Fig. 1f). For the YouTube graph, the experiment was repeated 5 times for a budget starting at 1/100 of the total cost and geometrically increasing to 1/3 in 20 steps, leading to a total computation time of 7 hours. For the Erdős-Rényi graph with n vertices and edge probability 5/√n it was repeated 10 times, for n starting at 50 and geometrically increasing to 2500 in 20 steps, taking approximately 10 minutes.

Figure 2: Performance of AdaptiveGreedy (ignoring single item solutions, i.e., p_0 = 0) on the influence maximization problem for two distinct topologies: the Epinions graph (a) and random graphs (b). All budgets are shown as fractions of the total cost.

Influence Maximization: This setting is similar to influence-and-exploit marketing, focusing just on maximizing the number of people influenced. Consider a social network where each user could become an 'influencer' by posting about the product we are trying to promote. Their friends on the platform would see this and might favourably change their opinion. Similarly to Breuer et al. (2020), we assume that each user has an independent probability q of being influenced by any influencer adjacent to them. Formally, given the set of influencers S, the probability that a user i ∉ S is influenced is 1 − (1 − q)^{|N(i)∩S|}, where N(i) denotes the neighborhood of i. One difference from Breuer et al. (2020) is that influencers do not count towards the objective: this makes the instance non-monotone submodular. Moreover, as we did for influence-and-exploit marketing, we consider the adaptive version: whenever an influencer is added to the set S, any of their uninfluenced neighbours may independently become influenced with probability q, updating the expected marginals accordingly.
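Under the independence assumption above, the (non-adaptive) expected-influence objective can be sketched as follows; the function name and the adjacency-set input format are ours.

```python
def expected_influence(S, neighbors, q=0.2):
    """Expected number of influenced non-influencer users, assuming each
    user is independently influenced with probability q by every adjacent
    influencer: P[i influenced] = 1 - (1 - q)^{|N(i) & S|}.
    Influencers themselves are excluded, making the objective non-monotone."""
    total = 0.0
    for i, N in neighbors.items():
        if i in S:
            continue
        total += 1.0 - (1.0 - q) ** len(N & S)
    return total
```

In the adaptive version evaluated in the experiments, the realized influence of each added node is observed and the marginals of the remaining candidates are updated accordingly.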
In Figure 2 we evaluate AdaptiveGreedy on an instance based on the Epinions network graph (Richardson et al., 2003), containing 75,879 nodes and 508,837 edges. We have set q = 0.2 and the cost of each node is drawn from U [0, 1]. As before, we compare Adap-tiveGreedy to Greedy and Density Greedy, repeating the experiment 5 times for budgets ranging from 1/100 to 1/30 of the total cost. We observe that AdaptiveGreedy is tied with Density Greedy and performs significantly better than Greedy. The most likely reason for this decreased separation between the different algorithms is that this objective is more benign: especially for small budgets, the non-monotone aspect is muted and the knowledge obtained from adding an element to S is not as useful as it was for influence-and-exploit marketing. We also repeat the same experiment for the G(3000, 0.01) random graph.
Maximum Weighted Cut: Beyond the above applications, we would like to compare SampleGreedy to FANTOM with respect to both their performance and the number of value oracle calls as n grows. We turn to weighted cut functions-one of the most prominent subclasses of non-monotone submodular functions-on dense Erdős-Rényi random graphs with edge probability 0.2. The weights and the costs are drawn independently and uniformly from [0, 1] and the budget is fixed to 15% of the total cost. Again SampleGreedy consistently performs better than FANTOM, albeit by 5-15% (Fig. 1b). In terms of running time, there is a large difference in favor of SampleGreedy (even for multiple runs), while the superlinear increase for FANTOM is evident (Fig. 1e). The experiment was repeated 10 times for n starting at 10 and increasing geometrically to 300 in 20 steps, requiring approximately 5 minutes.
Since SampleGreedy and FANTOM are quite close to each other in terms of performance here, and Greedy would lie only slightly below the plot of SampleGreedy, we have removed Greedy from this comparison to improve the readability of Fig. 1b.
Remark 3. Based on the theoretical query complexities, one would expect the comparison between FANTOM and SampleGreedy in Figs. 1d and 1e to be qualitatively similar to n log 2 n vs. n log n. However, while FANTOM clearly exhibits a superlinear dependence of the query complexity on the input size, SampleGreedy does not. The reason for this is that, in practice, lazy evaluations often result in much less than log n evaluations per element. So, what we see in Figs. 1d and 1e is closer to n log n vs. n.
A roadmap for practitioners. Although ours is primarily a theoretical paper, SampleGreedy is the first constant factor approximation algorithm for non-monotone submodular maximization subject to a knapsack constraint that needs only O(n log n) value queries, and it can be applied in practice wherever Density Greedy can. Moreover, in the considered experiments it outperforms the state of the art, FANTOM. In some instances, the heuristic Density Greedy outputs marginally better solutions than SampleGreedy, albeit at the expense of offering no theoretical approximation guarantee (as we have argued in the Introduction). From a practical point of view, a reasonable strategy is to start with Density Greedy (SampleGreedy for p = 1) and then gradually reduce p, tuning it to the effective non-monotonicity of the instance at hand. Note that as soon as p is bounded away from 1 by a constant, SampleGreedy guarantees a constant factor approximation to the optimal solution.
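The tuning strategy just described can be sketched as a simple loop over decreasing values of p; `run(p)` is a hypothetical caller-supplied callback that runs SampleGreedy with sampling probability p and returns the value of the resulting solution.

```python
def tune_p(run, values=(1.0, 0.95, 0.9, 0.8, 0.7)):
    """Run SampleGreedy via `run(p)` for gradually decreasing p (starting
    from Density Greedy at p = 1) and keep the best value found."""
    best_p, best_val = None, float('-inf')
    for p in values:  # decreasing schedule, per the roadmap above
        val = run(p)
        if val > best_val:
            best_p, best_val = p, val
    return best_p, best_val
```

Since SampleGreedy is so fast, a handful of extra runs is cheap, and every run with p bounded away from 1 retains a constant factor guarantee.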

Conclusions
The proposed random greedy method yields a considerable improvement over state-of-the-art algorithms, especially, but not exclusively, regarding the handling of huge instances. With all the subtleties of our work confined to the analysis, the algorithm itself remains strikingly simple, and we are confident this will also contribute to its use in practice. At the same time, this very simplicity translates into a generality that can be employed to achieve comparably good results in a variety of settings; we demonstrated this in the case of the adaptive submodularity setting.
Specifically, we expect that our approach can be directly utilised to improve the performance and running time of algorithms that now use some variant of the algorithm of Gupta et al. (2010). Such examples include the distributed algorithm of da Ponte Barbosa et al. (2015) and the streaming algorithm of Mirzasoleiman et al. (2018) in the case of a knapsack constraint. We further suspect that the same algorithmic principle can be applied in the presence of incentives. This would largely improve the current state of the art in budget-feasible mechanism design for non-monotone objectives (Chen et al., 2011;Amanatidis et al., 2019).
A different direction would be to try other greedy algorithms for monotone objectives as a starting point. For instance, the 2-approximation algorithm of Yaroslavtsev et al. (2020) could potentially yield a better approximation ratio for the standard algorithmic setting. Unfortunately, it does not seem possible to translate more involved algorithms like that one to the adaptive setting, where one has to commit to all of their past choices.
Finally, a major question here is whether the same high level approach is valid even in the presence of additional combinatorial constraints. In particular, is it possible to achieve similar guarantees as FANTOM for a p-system and multiple knapsack constraints using only O(n log n) value queries?