Q-Learning for Grid World Navigation
Q-learning is a powerful reinforcement learning algorithm used to train an agent to make optimal decisions in an environment. This challenge asks you to implement Q-learning to train an agent to navigate a simple grid world to reach a goal state while avoiding obstacles. This is a fundamental problem in reinforcement learning and demonstrates the core principles of learning through trial and error.
Problem Description
You are tasked with implementing the Q-learning algorithm to train an agent to navigate a 2D grid world. The grid world consists of states (cells) where the agent can move up, down, left, or right. Some states are designated as obstacles, and the agent cannot move into them. A single goal state exists, and the agent receives a positive reward upon reaching it. The agent starts at a designated starting state.
What needs to be achieved:
- Implement the Q-learning algorithm to update the Q-table iteratively.
- Implement an agent that selects actions based on the Q-table (using an epsilon-greedy policy).
- Train the agent to navigate from a starting state to a goal state in the grid world.
Key Requirements:
- Grid World Representation: The grid world should be represented as a 2D list (or similar data structure) where 0 represents a free cell, 1 represents an obstacle, and 2 represents the goal state.
- Q-Table Initialization: The Q-table should be initialized with zeros.
- Epsilon-Greedy Action Selection: The agent should use an epsilon-greedy policy to select actions. Epsilon represents the probability of taking a random action (exploration) versus choosing the action with the highest Q-value (exploitation).
- Q-Table Update: The Q-table should be updated using the Q-learning update rule.
- Episode Termination: An episode terminates when the agent reaches the goal state or a maximum number of steps is reached.
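The epsilon-greedy requirement above can be sketched as follows, assuming the Q-table is a NumPy array of shape (rows, cols, n_actions); the function name and table layout are illustrative:

```python
import numpy as np

def choose_action(q_table, state, epsilon, n_actions=4):
    """Epsilon-greedy selection: a random action with probability epsilon,
    otherwise the action with the highest Q-value for this state."""
    if np.random.random() < epsilon:
        return int(np.random.randint(n_actions))           # explore
    return int(np.argmax(q_table[state[0], state[1]]))     # exploit
```

Note that `np.argmax` breaks ties by returning the first maximal index, which is why an untrained all-zero Q-table always yields action 0 under pure exploitation.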
Expected Behavior:
After training, the agent should consistently navigate from the starting state to the goal state, avoiding obstacles. The Q-table should converge to values that reflect the optimal policy for navigating the grid world.
Edge Cases to Consider:
- No Path to Goal: If there is no path from the starting state to the goal state, the agent should explore the environment and eventually terminate the episode after reaching the maximum number of steps.
- Partially Blocked Paths: The agent should learn to avoid obstacles and find alternative routes; if obstacles block every route, this reduces to the no-path case above.
- Starting State is the Goal State: The agent should immediately terminate the episode with a positive reward.
Examples
Example 1:
Grid World:
[
[0, 0, 0, 0],
[0, 1, 1, 0],
[0, 0, 0, 2]
]
Starting State: (0, 0)
Goal State: (2, 3)
Epsilon: 0.1
Learning Rate (alpha): 0.1
Discount Factor (gamma): 0.9
Max Steps: 100
Output: (After sufficient training, the agent should consistently choose a path like (0,0) -> (1,0) -> (2,0) -> (2,1) -> (2,2) -> (2,3))
Explanation: The agent starts at (0,0) and learns to navigate around the obstacles at (1,1) and (1,2) to reach the goal at (2,3). The Q-table will be updated iteratively to reflect the optimal path.
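One possible end-to-end training sketch for this example, using the parameters given above (the episode count, random seed, and stay-in-place rule for blocked moves are assumptions, not part of the specification):

```python
import numpy as np

GRID = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 2]]
START, GOAL = (0, 0), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ALPHA, GAMMA, EPSILON, MAX_STEPS = 0.1, 0.9, 0.1, 100

rng = np.random.default_rng(0)                 # seed chosen for reproducibility
rows, cols = len(GRID), len(GRID[0])
q = np.zeros((rows, cols, len(ACTIONS)))       # Q-table initialized with zeros

for _ in range(5000):                          # episode count is an assumption
    state = START
    for _ in range(MAX_STEPS):
        if rng.random() < EPSILON:             # explore
            action = int(rng.integers(len(ACTIONS)))
        else:                                  # exploit
            action = int(np.argmax(q[state]))
        dr, dc = ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        if not (0 <= r < rows and 0 <= c < cols) or GRID[r][c] == 1:
            nxt, reward, done = state, 0, False        # blocked: stay in place
        else:
            nxt = (r, c)
            reward, done = (1, True) if nxt == GOAL else (0, False)
        # Q-learning update rule
        q[state][action] += ALPHA * (reward + GAMMA * np.max(q[nxt]) - q[state][action])
        state = nxt
        if done:
            break

# Greedy rollout: follow the highest-valued action from the start.
state, path = START, [START]
while state != GOAL and len(path) <= MAX_STEPS:
    dr, dc = ACTIONS[int(np.argmax(q[state]))]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < rows and 0 <= c < cols) or GRID[r][c] == 1:
        break                                  # policy not yet converged here
    state = (r, c)
    path.append(state)
print(path)                                    # the learned greedy path to the goal
```

Both the route along the bottom row and the route along the top row are optimal (five steps each), so which one the greedy rollout prints depends on how ties resolve during training.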
Example 2:
Grid World:
[
[0, 0, 0, 0],
[0, 1, 1, 0],
[0, 1, 1, 2]
]
Starting State: (0, 0)
Goal State: (2, 3)
Epsilon: 0.2
Learning Rate (alpha): 0.1
Discount Factor (gamma): 0.9
Max Steps: 100
Output: (After sufficient training, the agent should consistently choose the path (0,0) -> (0,1) -> (0,2) -> (0,3) -> (1,3) -> (2,3))
Explanation: The obstacles at (2,1) and (2,2) block the bottom route, so the agent must learn the only viable path: along the top row and down the right-hand column.
Example 3:
Grid World:
[
[2, 0, 0, 0],
[0, 1, 1, 0],
[0, 0, 0, 0]
]
Starting State: (0, 0)
Goal State: (0, 0)
Epsilon: 0.1
Learning Rate (alpha): 0.1
Discount Factor (gamma): 0.9
Max Steps: 100
Output: The episode terminates immediately with a reward of 1.
Explanation: The starting state is the goal state, so the agent receives the reward and the episode ends.
Constraints
- Grid Size: The grid world will be a 2D list with dimensions between 3x3 and 5x5.
- Number of Obstacles: The number of obstacles will be between 2 and 5.
- Epsilon: Epsilon should be between 0.01 and 0.3.
- Learning Rate (alpha): Alpha should be between 0.01 and 0.5.
- Discount Factor (gamma): Gamma should be between 0.8 and 0.99.
- Max Steps: The maximum number of steps per episode should be between 50 and 200.
- Input Format: The grid world will be provided as a 2D list of integers (0, 1, or 2). The starting and goal states will be specified as tuples (row, column).
Notes
- Consider using NumPy for efficient array operations.
- The Q-learning update rule is:
Q(s, a) = Q(s, a) + alpha * [R(s, a) + gamma * max_a' Q(s', a') - Q(s, a)]
where:
- Q(s, a) is the Q-value for state s and action a.
- alpha is the learning rate.
- R(s, a) is the reward for taking action a in state s.
- gamma is the discount factor.
- s' is the next state.
- a' ranges over the actions available in s'.
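In code, this rule is a single in-place assignment; a sketch assuming the Q-table is a NumPy array of shape (rows, cols, n_actions) and states are (row, column) tuples:

```python
import numpy as np

def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place."""
    r, c = state
    nr, nc = next_state
    td_target = reward + gamma * np.max(q[nr, nc])   # R(s, a) + gamma * max_a' Q(s', a')
    q[r, c, action] += alpha * (td_target - q[r, c, action])
```

Because the update bootstraps from max over the next state's Q-values rather than the action actually taken, Q-learning is off-policy: it learns the greedy policy's values even while exploring.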
- The reward function should be: +1 for reaching the goal state, 0 for all other states, and -1 for hitting an obstacle (optional, but can speed up learning).
- Focus on implementing the core Q-learning algorithm. Visualization is not required.
- Think about how to handle invalid actions (e.g., moving into an obstacle).