This simulates a single run of the multi-armed bandit problem on a 10-arm testbed, where each arm's reward is drawn from a Gaussian distribution with variance 1. The mean of each arm is drawn from a standard normal distribution. See the details of the testbed here. The python code for the simulation is here.
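As a rough sketch of the setup (not the linked code itself; names like `true_means` and `pull` are just illustrative), the testbed can be built like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for a reproducible single run

n_arms = 10
# True mean of each arm, drawn from a standard normal distribution.
true_means = rng.normal(loc=0.0, scale=1.0, size=n_arms)

def pull(arm):
    """Sample a reward from the chosen arm: Gaussian with unit variance."""
    return rng.normal(loc=true_means[arm], scale=1.0)
```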
Below we will use three different policies, namely greedy, ε-greedy, and upper confidence bound (UCB).
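For orientation, here is a minimal sketch of the three action-selection rules, assuming running value estimates `q_est` and pull counts `counts` per arm (the parameter values `eps=0.1` and `c=2.0` are assumptions, not taken from the linked code):

```python
import numpy as np

def greedy(q_est, counts, t, rng):
    """Pick the arm with the highest current value estimate."""
    return int(np.argmax(q_est))

def epsilon_greedy(q_est, counts, t, rng, eps=0.1):
    """With probability eps explore a random arm, otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(len(q_est)))
    return int(np.argmax(q_est))

def ucb(q_est, counts, t, rng, c=2.0):
    """UCB: value estimate plus an exploration bonus that shrinks with pulls."""
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
    bonus[counts == 0] = np.inf  # try every arm at least once
    return int(np.argmax(q_est + bonus))
```

Each rule takes the same inputs, so a single simulation loop can swap them in and out when comparing the policies on the testbed.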