Question

I'm looking over a sample exam and there is a question on Q-learning; I have included it below. In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)? It appears the Q value to go back up to A2 would be 0.18, and the Q value to go right would be 0.09. So why wouldn't the agent go back to A2 instead of going to B3?

Maze & Q-Table

Solution

Edit: Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q value?

Edit2: Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point, does the agent just pick randomly? So for this question, should I just pick the best move, since it's possible the agent would pick it?

Edit3: Would it be true to say the agent doesn't return to the state it previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?


Solution

You seem to assume that you should look at the values of the state in the next time step. This is incorrect. The Q-function answers the question:

If I'm in state x, which action should I take?

In non-deterministic environments you don't even know what the next state will be, so it would be impossible to determine which action to take in your interpretation.

The learning part of Q-learning does act on two subsequent timesteps, but only after they are already known; they are used to update the values of the Q-function. This has nothing to do with how these samples (state, action, reinforcement, next state) are collected. In this case, the samples are collected by the agent interacting with the environment, and in the Q-learning setting the agent interacts with the environment according to a policy, which here is based on the current values of the Q-function. Conceptually, a policy works in terms of answering the question I quoted above.
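
To make the split between acting and learning concrete, here is a minimal sketch of one tabular Q-learning step in Python. It is not part of the exam: the action names, the dictionary-based Q-table, and the parameter values ALPHA, GAMMA and EPSILON are my own assumptions.

```python
import random

ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed learning parameters

def choose_action(Q, state):
    """Acting: look only at the Q-values of the *current* state (epsilon-greedy)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                               # explore
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))       # greedy

def q_update(Q, state, action, reward, next_state):
    """Learning: uses the (s, a, r, s') sample only after it has been observed."""
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

# In the interaction loop you would do something like:
#   a = choose_action(Q, s); s2, r = step_in_maze(s, a); q_update(Q, s, a, r, s2); s = s2
# where step_in_maze is a hypothetical environment function.
```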


In steps 1 and 2, the Q-function is modified only for states 1,A and 2,A. In step 3 the agent is in state 3,A, so that's the only part of the Q-function that's relevant.

In the 3rd step, how come the action taken is 'right' rather than 'up' (back to A2)?

In state 3,A the action that has the highest Q-value is "right" (0.2). All other actions have value 0.0.

Also, how come 2,C has a reward value of 2 for action 'right' even though there's a wall there and it's not possible to go right? Do we just assume that's not a possible move and ignore its Q value?

As I see it, there is no wall to the right of 2,C. Nevertheless, the Q-function is given, and it's irrelevant in this task whether such a Q-function could actually be reached using Q-learning. You can always start Q-learning from an arbitrary Q-function anyway.

In Q-learning your only knowledge is the Q-function, so you don't know anything about "walls" and other things: you act according to the Q-function, and that's the whole beauty of this algorithm.
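
As a small hedged illustration of that point: only the value 2.0 for 'right' below is taken from the exam table, the other numbers are placeholders I invented. The agent simply takes the action with the highest Q-value in its current state; whether that move is blocked is the environment's problem, not the agent's.

```python
# Hypothetical Q-values for state 2,C; only the 2.0 for 'right' comes from
# the exam table, the rest are placeholders.
q_row = {"up": 0.0, "down": 0.5, "left": 0.0, "right": 2.0}

# The agent consults nothing but the Q-function, so it picks 'right' here.
# If that move were actually blocked by a wall, the environment would just
# leave the agent where it is; the agent itself has no notion of walls.
greedy_action = max(q_row, key=q_row.get)
print(greedy_action)  # prints: right
```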

Then in step 6, the Q values for going 'down' and 'right' at state 1,C are equal. At that point, does the agent just pick randomly? So for this question, should I just pick the best move, since it's possible the agent would pick it?

Again, you should look at the values for the state the agent is currently in, so for 1,B "right" is optimal: it has 0.1 and the other actions are 0.0.

To answer the last question, even though it's not strictly needed here: yes, if the agent takes the greedy step and multiple actions look equally optimal, most common policies choose one of them at random.
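
A hedged sketch of that tie-breaking (the numbers below are invented, not from the exam): a greedy policy typically collects all actions tied for the maximum value and samples one uniformly.

```python
import random

# Invented Q-values with a tie between 'down' and 'right'.
q_row = {"up": 0.0, "down": 0.3, "left": 0.0, "right": 0.3}

best = max(q_row.values())
tied = [a for a, v in q_row.items() if v == best]  # ['down', 'right']
action = random.choice(tied)                       # break the tie at random
```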

Would it be true to say the agent doesn't return to the state it previously came from? Will an agent ever explore the same state more than once (not including starting a new instance of the maze)?

No. As I've stated above, the only guideline the agent uses in pure Q-learning is the Q-function. It's not aware that it has been in a particular state before.
