Project Overview This notebook implements and compares two multi-armed bandit algorithms: Upper Confidence Bound (UCB) and Thompson Sampling (TS). These algorithms are designed to solve the 'exploration vs. exploitation' dilemma, common in scenarios like A/B testing for advertisements, where the goal is to maximize rewards by intelligently choosing between different options.
The dataset used simulates user interactions with 10 different advertisements over 15,000 rounds, with each entry indicating whether a user clicked on a specific ad (1) or not (0).
The UCB algorithm is a deterministic approach that balances exploration and exploitation by selecting the arm (ad) that maximizes an 'upper confidence bound'. This bound is calculated as the sum of the average reward for that arm and a confidence interval. The confidence interval increases for arms that have been explored less, encouraging the algorithm to try out less-known options.
How it works:
- Initialize: Keep track of the number of times each ad has been selected (
no_of_selections) and the sum of rewards received for each ad (sum_of_rewards). - Calculate Upper Bound: For each ad, calculate its average reward. Then, compute a confidence interval based on the number of times the ad has been selected and the total number of rounds. Ads with fewer selections will have a wider confidence interval.
- Select Ad: Choose the ad with the highest upper confidence bound.
- Update: Record the selection, update the counts and rewards for the selected ad, and repeat for the next round.
This ensures that over time, the algorithm converges towards the best-performing ad while still giving less-explored ads a chance.
Thompson Sampling is a probabilistic algorithm that uses Bayesian inference to balance exploration and exploitation. Instead of a deterministic upper bound, it samples from the posterior distribution of the reward probability for each arm and selects the arm with the highest sampled value.
How it works:
- Initialize: For each ad, maintain two counters:
num_of_rewards_1(number of times the ad resulted in a reward of 1) andnum_of_rewards_0(number of times the ad resulted in a reward of 0). - Sample from Beta Distribution: For each ad, draw a random sample from a Beta distribution. The parameters of the Beta distribution are derived from
num_of_rewards_1 + 1andnum_of_rewards_0 + 1(a common prior is Beta(1,1), representing a uniform distribution). - Select Ad: Choose the ad that yielded the highest random sample value.
- Update: Observe the reward. If it's 1, increment
num_of_rewards_1for that ad; if it's 0, incrementnum_of_rewards_0. Repeat for the next round.
Thompson Sampling inherently balances exploration (by sampling from the distributions) and exploitation (by choosing the ad with the highest sampled probability), and it tends to perform very well in practice.