Q1: It will converge to a single mapping, unless more than one mapping is optimal, in which case it may settle on any of the optimal ones.
Q2: Q-Learning has an exploration parameter that determines how often it takes random, potentially sub-optimal moves. Rewards will keep fluctuating for as long as this parameter is non-zero, even once the learned values have stopped changing.
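
To illustrate the Q2 point, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. Everything in it is an assumption made for the example, not something from your setup: the 5-state chain task, the -1 cost per move, the names `step` and `run_episodes`, and the values of `ALPHA`, `GAMMA` and `epsilon`. It trains with a non-zero exploration parameter, then keeps going with the parameter set to zero.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical toy task)
ACTIONS = (0, 1)      # 0 = left, 1 = right
ALPHA, GAMMA = 0.5, 1.0
MAX_STEPS = 200       # safety cap on episode length


def step(state, action):
    """Deterministic chain: every move costs -1; reaching the goal ends the episode."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return next_state, -1.0, next_state == N_STATES - 1


def run_episodes(q, epsilon, episodes, rng):
    """Epsilon-greedy Q-learning on table q (updated in place); returns per-episode totals."""
    returns = []
    for _ in range(episodes):
        state, total = 0, 0.0
        for _ in range(MAX_STEPS):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # random, possibly sub-optimal move
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])  # greedy move
            next_state, reward, done = step(state, action)
            # Standard Q-learning target: no bootstrap term on the terminal step.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state, total = next_state, total + reward
            if done:
                break
        returns.append(total)
    return returns


rng = random.Random(0)
q_table = [[0.0, 0.0] for _ in range(N_STATES)]
exploring = run_episodes(q_table, epsilon=0.3, episodes=500, rng=rng)
greedy = run_episodes(q_table, epsilon=0.0, episodes=100, rng=rng)
print("distinct returns, last 100 exploring episodes:", len(set(exploring[-100:])))
print("distinct returns, greedy episodes:", len(set(greedy)))
```

In this toy task the exploring phase should keep producing several different episode returns, while the greedy phase should settle on the shortest path (four moves). Note that this only holds because the sketch's environment is deterministic; with stochastic transitions or rewards, returns would vary even with the parameter at zero.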
Q3: Reward graphs, as in the link you provided. Check http://rl-community.org.
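
Because the raw per-episode rewards are noisy while exploration is on, reward graphs are often easier to read after smoothing. A minimal sketch, assuming you have the per-episode returns in a list; the function name `moving_average`, the window size and the sample numbers are made up for illustration:

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Made-up per-episode returns, for illustration only.
episode_returns = [-9.0, -7.0, -8.0, -5.0, -6.0, -4.0, -4.0, -4.0]
smoothed = moving_average(episode_returns, window=4)
print(smoothed)
```

Plotting `smoothed` against the episode index with whatever plotting tool you already use gives the kind of reward graph mentioned above.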