Skip to content

Rohith80231/my-bandit-algorithm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

my-bandit-algorithm

Project Overview This notebook implements and compares two multi-armed bandit algorithms: Upper Confidence Bound (UCB) and Thompson Sampling (TS). These algorithms are designed to solve the 'exploration vs. exploitation' dilemma, common in scenarios like A/B testing for advertisements, where the goal is to maximize rewards by intelligently choosing between different options.

The dataset used simulates user interactions with 10 different advertisements over 15,000 rounds, with each entry indicating whether a user clicked on a specific ad (1) or not (0).

Upper Confidence Bound (UCB) Algorithm

The UCB algorithm is a deterministic approach that balances exploration and exploitation by selecting the arm (ad) that maximizes an 'upper confidence bound'. This bound is calculated as the sum of the average reward for that arm and a confidence interval. The confidence interval increases for arms that have been explored less, encouraging the algorithm to try out less-known options.

How it works:

  1. Initialize: Keep track of the number of times each ad has been selected (no_of_selections) and the sum of rewards received for each ad (sum_of_rewards).
  2. Calculate Upper Bound: For each ad, calculate its average reward. Then, compute a confidence interval based on the number of times the ad has been selected and the total number of rounds. Ads with fewer selections will have a wider confidence interval.
  3. Select Ad: Choose the ad with the highest upper confidence bound.
  4. Update: Record the selection, update the counts and rewards for the selected ad, and repeat for the next round.

This ensures that over time, the algorithm converges towards the best-performing ad while still giving less-explored ads a chance.

Thompson Sampling (TS) Algorithm

Thompson Sampling is a probabilistic algorithm that uses Bayesian inference to balance exploration and exploitation. Instead of a deterministic upper bound, it samples from the posterior distribution of the reward probability for each arm and selects the arm with the highest sampled value.

How it works:

  1. Initialize: For each ad, maintain two counters: num_of_rewards_1 (number of times the ad resulted in a reward of 1) and num_of_rewards_0 (number of times the ad resulted in a reward of 0).
  2. Sample from Beta Distribution: For each ad, draw a random sample from a Beta distribution. The parameters of the Beta distribution are derived from num_of_rewards_1 + 1 and num_of_rewards_0 + 1 (a common prior is Beta(1,1), representing a uniform distribution).
  3. Select Ad: Choose the ad that yielded the highest random sample value.
  4. Update: Observe the reward. If it's 1, increment num_of_rewards_1 for that ad; if it's 0, increment num_of_rewards_0. Repeat for the next round.

Thompson Sampling inherently balances exploration (by sampling from the distributions) and exploitation (by choosing the ad with the highest sampled probability), and it tends to perform very well in practice.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors