Reinforcement Learning: Policy Gradient Implementation
Policy gradient methods are a powerful class of reinforcement learning algorithms that directly optimize the policy without explicitly learning a value function. This challenge asks you to implement a basic policy gradient agent to solve a simple environment. Successfully completing this challenge will demonstrate your understanding of policy gradient concepts and your ability to translate them into Python code.
Problem Description
You are tasked with implementing a policy gradient agent to navigate a simple 1D environment. The environment consists of a single line, and the agent can move left or right. The goal is to reach a target location on the line. The agent receives a reward of +1 when it reaches the target and a reward of -0.1 for each step taken. The agent's policy is represented by a neural network that outputs probabilities for moving left or right, given the current state.
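To make the setup concrete, here is a minimal sketch of such an environment, using the `reset()`, `step(action)`, and `is_done()` method names required in the Constraints section. The `max_steps` cap, and the choice to award +1 (with no step penalty) on the target-reaching step, are assumptions for illustration rather than part of the specification:

```python
class LineEnv:
    """Minimal 1D line environment: agent starts at 0, target at `target`.

    Reward convention (one reasonable reading of the spec): +1.0 on the step
    that reaches the target, -0.1 for every other step.
    """

    def __init__(self, target=5, max_steps=50):
        self.target = target
        self.max_steps = max_steps  # assumed cap so episodes always terminate
        self.reset()

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.pos += 1 if action == 1 else -1
        self.steps += 1
        reward = 1.0 if self.pos == self.target else -0.1
        return self.pos, reward, self.is_done()

    def is_done(self):
        return self.pos == self.target or self.steps >= self.max_steps
```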
What needs to be achieved:
- Implement a policy gradient agent using a neural network to represent the policy.
- Train the agent to maximize the cumulative reward by adjusting the policy parameters.
- Evaluate the trained agent's performance by running it in the environment for a fixed number of episodes.
Key Requirements:
- Environment: A simple 1D environment with a target location. The environment should provide the current state and allow the agent to take actions (left or right).
- Policy Network: A neural network (e.g., using PyTorch or TensorFlow) that takes the current state as input and outputs probabilities for moving left or right.
- Policy Gradient Update: Implement the policy gradient update rule to adjust the policy parameters based on the observed rewards.
- Training Loop: A training loop that iterates over episodes, collects experience, and updates the policy.
- Evaluation: A function to evaluate the trained agent's performance by running it in the environment and calculating the average cumulative reward.
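As one concrete reading of the Policy Network requirement, a minimal PyTorch sketch follows. The hidden-layer width is an arbitrary assumption; the network maps the scalar position to a probability distribution over the two actions:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a scalar state (position) to probabilities over (left, right)."""

    def __init__(self, hidden=16):  # hidden size is an illustrative choice
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, state):
        # state: tensor of shape (batch, 1); output sums to 1 along the last dim
        return torch.softmax(self.net(state), dim=-1)
```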
Expected Behavior:
The agent should learn to move towards the target location and minimize the number of steps taken. The cumulative reward should increase over training episodes, and the trained agent should reach the target with high probability.
Edge Cases to Consider:
- Stuck States: The agent might oscillate in states where the policy assigns similar probabilities to both actions. Sampling actions from the policy's output distribution (rather than always taking the most probable action) provides exploration; an entropy bonus is another common remedy.
- Reward Shaping: The reward function is simple. More complex reward functions might be needed for more challenging environments.
- Hyperparameter Tuning: The learning rate and other hyperparameters will significantly impact the agent's performance. Experiment with different values to find the optimal settings.
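Tying the training-loop and policy-gradient-update requirements together, here is a hedged REINFORCE-style sketch. It assumes the environment terminates every episode (e.g., via a step cap) and that `policy` is a PyTorch module returning action probabilities; sampling from the output distribution, rather than always taking the most probable action, is what provides the exploration discussed above:

```python
import torch

def train(env, policy, episodes=1000, gamma=0.99, lr=0.01):
    """REINFORCE: minimize -sum_t log pi(a_t | s_t) * G_t per episode."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        log_probs, rewards = [], []
        state, done = env.reset(), False
        while not done:
            probs = policy(torch.tensor([[float(state)]]))
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()          # stochastic action = exploration
            log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())
            rewards.append(reward)
        # Discounted returns-to-go, computed backwards over the episode
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        loss = -(torch.cat(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Subtracting a baseline (e.g., the mean return) from `returns` before computing the loss is a common variance-reduction refinement, omitted here for brevity.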
Examples
Example 1:
Input: Environment: 1D line, target at position 5, agent starts at position 0, discount factor = 0.99, learning rate = 0.01, number of episodes = 1000
Output: Trained policy network that consistently moves the agent towards the target. Average cumulative reward after 1000 episodes: > -4.5
Explanation: The agent learns to move right repeatedly until it reaches the target, minimizing the accumulated per-step penalty along the way.

Example 2:
Input: Environment: 1D line, target at position 10, agent starts at position 0, discount factor = 0.95, learning rate = 0.005, number of episodes = 5000
Output: Trained policy network that efficiently navigates the environment. Average cumulative reward after 5000 episodes: > -9.0
Explanation: With a longer training duration and a smaller learning rate, the agent can fine-tune its policy to achieve a higher cumulative reward.
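The "average cumulative reward" figures quoted in the examples could be measured with a helper along these lines. Greedy (argmax) action selection during evaluation is an assumption; sampling from the policy is equally reasonable:

```python
import torch

def evaluate(env, policy, episodes=100):
    """Average undiscounted cumulative reward over `episodes` rollouts."""
    total = 0.0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            with torch.no_grad():
                probs = policy(torch.tensor([[float(state)]]))
            action = int(torch.argmax(probs))  # greedy at evaluation time
            state, reward, done = env.step(action)
            total += reward
    return total / episodes
```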
Constraints
- Environment: The environment should be implemented as a Python class with methods for reset(), step(action), and is_done().
- Neural Network: You can use any deep learning framework (e.g., PyTorch, TensorFlow) to implement the policy network.
- Training Time: The training loop should run for a reasonable number of episodes (e.g., 1000-5000).
- State Space: The state space is discrete and represents the agent's position on the 1D line (e.g., integers from 0 to 10).
- Action Space: The action space is discrete and consists of two actions: left (0) and right (1).
- Performance: The agent should demonstrate a clear improvement in performance over training episodes, as measured by the average cumulative reward.
Notes
- Start with a simple environment and gradually increase the complexity.
- Use a discount factor (gamma < 1) so that near-term rewards are weighted more heavily than distant ones.
- Experiment with different neural network architectures and hyperparameters.
- Ensure sufficient exploration (e.g., by sampling actions from the policy's output distribution, or by adding an entropy bonus) to prevent the agent from getting stuck in local optima.
- Visualize the agent's performance over training episodes to gain insights into its learning process.
- The core of the policy gradient algorithm involves calculating the gradient of the expected return with respect to the policy parameters and updating the parameters in the direction of the gradient. Remember to use the log probabilities of the actions taken to calculate this gradient.
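Concretely, the return G_t that weights each log-probability is the discounted sum of rewards from step t onward, and it can be computed with a single backward pass over the episode's rewards. A small self-contained sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """Returns-to-go: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# With gamma = 0.5: approximately [0.1, 0.4, 1.0]
discounted_returns([-0.1, -0.1, 1.0], gamma=0.5)
```

Each G_t then multiplies the corresponding log pi(a_t | s_t) term in the loss, so actions followed by high returns are made more likely.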