So the simulation's training iteration is 0? So how does the model be trained? And why does the plot show that regret is decreasing?
So the simulation's training iteration is 0? So how does the model be trained? And why does the plot show that regret is decreasing?