
Contextual Combinatorial Cascading Bandits. Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research.

We propose the contextual combinatorial cascading bandits, a combinatorial online learning game, where at each time step a learning agent is given a set of contextual information, then selects a list of items, and observes stochastic outcomes of a prefix in the selected items by some stopping criterion. In online recommendation, the stopping criterion might be the first item a user selects; in network routing, it might be the first edge blocked in a path. We consider position discounts in the list order, so that the agent's reward is discounted depending on the position where the stopping criterion is met. We design a UCB-type algorithm, C$^3$-UCB, for this problem, prove an $n$-step regret bound $\tilde{O}(\sqrt{n})$ in the general setting, and give finer analysis for two special cases. Our work generalizes existing studies in several directions, including contextual information, position discounts, and a more general cascading bandit model. Experiments on synthetic and real datasets demonstrate the advantage of involving contextual information and position discounts.
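The following minimal Python sketch makes the feedback model concrete for one round in the recommendation setting. It assumes binary outcomes, where 1 means the user selects the item, and a user who scans the list top-down. When computing expectations, it further assumes that outcomes are independent across items. The helper names and the geometric discount schedule $\gamma_k = \gamma^{k-1}$ are illustrative assumptions, not definitions taken from the paper.

```python
from typing import List, Sequence, Tuple


def geometric_discounts(K: int, gamma: float) -> List[float]:
    """Hypothetical discount schedule: gamma_k = gamma ** (k - 1) for k = 1..K."""
    return [gamma ** k for k in range(K)]


def cascade_feedback(
    outcomes: Sequence[int], discounts: Sequence[float]
) -> Tuple[float, List[int]]:
    """One round of cascading feedback in the recommendation setting.

    outcomes[k] is 1 if the user would select the item at position k + 1.
    The user scans top-down and stops at the first selection, so only the
    prefix up to and including that position is observed.
    """
    if len(outcomes) != len(discounts):
        raise ValueError("need exactly one discount per list position")
    for k, w in enumerate(outcomes):
        if w == 1:
            # Stopping criterion met: the reward is this position's discount.
            return discounts[k], list(outcomes[: k + 1])
    # No selection: every outcome is observed and the reward is zero.
    return 0.0, list(outcomes)


def expected_reward(probs: Sequence[float], discounts: Sequence[float]) -> float:
    """Expected discounted reward, assuming independent outcomes.

    E[f] = sum_k gamma_k * p_k * prod_{i<k} (1 - p_i)
    """
    total, no_selection_yet = 0.0, 1.0
    for p, g in zip(probs, discounts):
        total += g * no_selection_yet * p
        no_selection_yet *= 1.0 - p
    return total
```

For example, take $\gamma = 0.5$ and outcomes $(0, 1, 0, 1)$. The stopping criterion is met at position 2, so the agent earns $0.5$ and observes only the first two outcomes.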

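The abstract describes C$^3$-UCB only as a UCB-type algorithm. The sketch below shows one common way to build such a learner, under the following assumptions:

- the expected outcome of an item is linear in its context vector;
- the estimate comes from ridge regression, with an optimism bonus added;
- lists are chosen by a simple top-$K$-by-UCB oracle;
- each observed position $k$ is weighted by $\gamma_k^2$.

The class name and the parameters `lam` (regularizer) and `beta` (exploration width) are hypothetical. This is not the paper's exact algorithm or confidence radius.

```python
import math
from typing import List, Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def _matvec(M: List[List[float]], x: Sequence[float]) -> List[float]:
    return [_dot(row, x) for row in M]


class CascadingLinUCB:
    """Hypothetical LinUCB-style learner for discounted cascading feedback."""

    def __init__(self, d: int, K: int, discounts: Sequence[float],
                 lam: float = 1.0, beta: float = 1.0) -> None:
        if len(discounts) < K:
            raise ValueError("need a discount for each of the K positions")
        self.K = K
        self.discounts = list(discounts)
        self.beta = beta
        # Inverse of the Gram matrix V = lam * I, kept explicitly.
        self.V_inv = [[1.0 / lam if i == j else 0.0 for j in range(d)]
                      for i in range(d)]
        self.b = [0.0] * d

    def select(self, contexts: Sequence[Sequence[float]]) -> List[int]:
        """Return the indices of K items, ordered by decreasing UCB."""
        theta_hat = _matvec(self.V_inv, self.b)  # ridge estimate V^{-1} b
        ucb = []
        for x in contexts:
            width = math.sqrt(max(_dot(x, _matvec(self.V_inv, x)), 0.0))
            # Expected outcomes are probabilities, so cap the index at 1.
            ucb.append(min(_dot(theta_hat, x) + self.beta * width, 1.0))
        # sorted() is stable, so ties keep their original index order.
        order = sorted(range(len(contexts)), key=lambda i: -ucb[i])
        return order[: self.K]

    def update(self, contexts: Sequence[Sequence[float]],
               chosen: Sequence[int], observed: Sequence[int]) -> None:
        """Update with the observed prefix; observed[k] is the outcome at position k + 1."""
        for k, (item, w) in enumerate(zip(chosen, observed)):
            x = contexts[item]
            g2 = self.discounts[k] ** 2  # assumed sample weight gamma_k^2
            # Sherman-Morrison rank-one update of V^{-1} for V += g2 * x x^T
            # (V^{-1} is symmetric, so V^{-1} x x^T V^{-1} = Vx Vx^T).
            Vx = _matvec(self.V_inv, x)
            denom = 1.0 + g2 * _dot(x, Vx)
            for i in range(len(Vx)):
                for j in range(len(Vx)):
                    self.V_inv[i][j] -= g2 * Vx[i] * Vx[j] / denom
            for i, x_i in enumerate(x):
                self.b[i] += g2 * w * x_i
```

The sketch maintains $V^{-1}$ with the Sherman-Morrison rank-one update, so it never re-inverts the matrix and needs no external dependencies. Suppose the learner repeatedly observes that the second-ranked item is selected while the first is not. It then moves that item to the top of the list.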
