Implementing Backpropagation for a Simple Neural Network
Backpropagation is the cornerstone of training most modern neural networks. This challenge asks you to implement the backpropagation algorithm for a simple feedforward neural network with a single hidden layer. Successfully completing this challenge demonstrates a fundamental understanding of how neural networks learn.
Problem Description
You are tasked with implementing the backpropagation algorithm to train a neural network. The network will have:
- An input layer with a fixed number of neurons (determined by the input size).
- A single hidden layer with a configurable number of neurons.
- An output layer with a single neuron (for binary classification).
The network will use the sigmoid activation function for both the hidden and output layers. Your implementation should calculate the gradients of the loss function (binary cross-entropy) with respect to the weights and biases of the network, and return these gradients. The input data will be provided as a NumPy array, and the target values as another NumPy array.
Key Requirements:
- Implement the forward pass through the network.
- Calculate the binary cross-entropy loss.
- Implement the backward pass (backpropagation) to compute gradients for weights and biases in both layers.
- Return the calculated gradients in a dictionary format.
Expected Behavior:
The function should take the input data, target data, network architecture (input size, hidden layer size), and the current weights and biases as input. It should return a dictionary containing the gradients of the loss with respect to each weight matrix and bias vector. The gradients should be in the same shape as the corresponding weights and biases.
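As one illustration of the shape such a function could take, here is a sketch of a forward pass followed by backpropagation. The function name `backprop` and the exact argument order are assumptions (the problem only specifies what the inputs and outputs are, not the signature); gradients are summed over samples, following the convention stated in the Notes section of this problem.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def backprop(X, Y, input_size, hidden_size, W1, b1, W2, b2):
    """Sketch: compute gradients of summed binary cross-entropy
    for a 1-hidden-layer sigmoid network.

    X: (m, input_size), Y: (m, 1)
    W1: (input_size, hidden_size), b1: (1, hidden_size)
    W2: (hidden_size, 1), b2: (1, 1)
    """
    # Forward pass
    Z1 = X @ W1 + b1            # (m, hidden_size)
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2 + b2           # (m, 1)
    A2 = sigmoid(Z2)            # network output

    # Backward pass. For a sigmoid output trained with binary
    # cross-entropy, dL/dZ2 simplifies to (A2 - Y) per sample.
    dZ2 = A2 - Y                            # (m, 1)
    dW2 = A1.T @ dZ2                        # (hidden_size, 1)
    db2 = np.sum(dZ2, axis=0, keepdims=True)

    dA1 = dZ2 @ W2.T                        # (m, hidden_size)
    dZ1 = dA1 * A1 * (1.0 - A1)             # chain rule through sigmoid
    dW1 = X.T @ dZ1                         # (input_size, hidden_size)
    db1 = np.sum(dZ1, axis=0, keepdims=True)

    return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
```

Each returned gradient has the same shape as the parameter it corresponds to, as required.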
Edge Cases to Consider:
- Input data and target data should have compatible shapes.
- Handle potential numerical instability (e.g., the sigmoid output approaching 0 or 1, where log(A) or log(1 - A) diverges). You are not strictly required to resolve this, but be mindful of it.
- Ensure the gradients are calculated correctly for different network architectures and input data.
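One common way to hedge against the instability mentioned above (an optional technique, not required by the problem) is to clip the network output away from exactly 0 and 1 before taking logarithms. The helper name `safe_bce` is illustrative, and the loss is summed over samples to match the convention in the Notes section:

```python
import numpy as np

def safe_bce(Y, A, eps=1e-12):
    """Binary cross-entropy, summed over samples, with clipping
    so that np.log never sees exactly 0 or 1."""
    A = np.clip(A, eps, 1.0 - eps)
    return float(-np.sum(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A)))
```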
Examples
Example 1:
Input: X = np.array([[0, 0]]), Y = np.array([[1]]), input_size = 2, hidden_size = 4, W1 = np.random.rand(2, 4), b1 = np.zeros((1, 4)), W2 = np.random.rand(4, 1), b2 = np.zeros((1, 1))
Output: {'dW1': ..., 'db1': ..., 'dW2': ..., 'db2': ...}
Explanation: The function should calculate the gradients for the given input, target, and network parameters. The exact values of the gradients will depend on the random initialization of the weights and biases.
Example 2:
Input: X = np.array([[0, 1], [1, 0]]), Y = np.array([[0], [1]]), input_size = 2, hidden_size = 3, W1 = np.random.rand(2, 3), b1 = np.zeros((1, 3)), W2 = np.random.rand(3, 1), b2 = np.zeros((1, 1))
Output: {'dW1': ..., 'db1': ..., 'dW2': ..., 'db2': ...}
Explanation: This example uses a slightly larger dataset and hidden layer size. The gradients should still be calculated correctly.
Example 3: (Edge Case - Single Input)
Input: X = np.array([[1]]), Y = np.array([[0]]), input_size = 1, hidden_size = 2, W1 = np.random.rand(1, 2), b1 = np.zeros((1, 2)), W2 = np.random.rand(2, 1), b2 = np.zeros((1, 1))
Output: {'dW1': ..., 'db1': ..., 'dW2': ..., 'db2': ...}
Explanation: Demonstrates the algorithm's ability to handle a single input sample.
Constraints
- Input Shape: `X` will be a NumPy array of shape (m, input_size), where `m` is the number of samples. `Y` will be a NumPy array of shape (m, 1).
- Network Architecture: `input_size` and `hidden_size` will be positive integers.
- Weight and Bias Initialization: `W1` will be a NumPy array of shape (input_size, hidden_size). `b1` will be a NumPy array of shape (1, hidden_size). `W2` will be a NumPy array of shape (hidden_size, 1). `b2` will be a NumPy array of shape (1, 1).
- Performance: The function should complete within a reasonable time (e.g., less than 1 second) for moderately sized inputs (e.g., m = 100, input_size = 10, hidden_size = 20).
- NumPy: You are required to use NumPy for all array operations.
Notes
- Remember to apply the chain rule when calculating gradients.
- The sigmoid function is defined as `sigmoid(x) = 1 / (1 + np.exp(-x))`. Its derivative is `sigmoid(x) * (1 - sigmoid(x))`.
- Binary cross-entropy loss is defined as `-[Y * log(A) + (1 - Y) * log(1 - A)]`, where `A` is the output of the network.
- The gradients should be calculated for each sample individually and then summed across all samples.
- Consider using intermediate variables to improve code readability.
- Debugging can be challenging; start with small, simple examples and gradually increase the complexity.
- The output dictionary should contain the keys 'dW1', 'db1', 'dW2', and 'db2', corresponding to the gradients of the weights and biases for the first and second layers, respectively.
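When debugging, a finite-difference gradient check is a standard sanity test: perturb each parameter entry slightly, re-evaluate the loss, and compare the resulting numerical estimate against your analytic gradient. The sketch below is a generic helper (the name `numerical_grad` and the callback style are illustrative, not part of the problem):

```python
import numpy as np

def numerical_grad(loss_fn, W, eps=1e-6):
    """Central-difference estimate of dLoss/dW, entry by entry.
    loss_fn is a zero-argument callable that reads the current W."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + eps
        loss_plus = loss_fn()
        W[idx] = orig - eps
        loss_minus = loss_fn()
        W[idx] = orig                      # restore the parameter
        grad[idx] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad
```

If `np.max(np.abs(analytic - numerical))` is small (e.g., below 1e-5), your backward pass is very likely correct.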