How to Manually Compute a Neural Network’s Predictions and Backpropagation Step by Step
A complete mathematical walkthrough for the 2–2–1 network with sigmoid activations and mean squared error.
1. Motivation: Why Compute a Neural Network by Hand?
Before we dive into the math, let’s clarify why this exercise matters.
- When you compute a forward and backward pass manually, you see exactly what each weight and bias is doing.
- You demystify the “black box” — it’s just linear algebra plus calculus.
- You understand how loss functions and activation derivatives link together.
By the end of this article, you’ll be able to write down every quantity in a small neural network, compute the loss, derive gradients, and perform one full parameter update by hand.
2. Setup and Notation
We’ll work with a tiny feed-forward neural network with two inputs, one hidden layer of two neurons, and one output neuron.
That’s a 2–2–1 architecture.
Network overview
Input (x1, x2)
↓
Hidden layer (2 neurons, sigmoid activation)
↓
Output layer (1 neuron, sigmoid activation)
↓
Loss (MSE)
3. Entities and Dependencies
Before jumping into derivatives, let’s list everything that exists in this system — the entities and their algebraic relationships.
This is the full dependency structure we’ll use later for the chain rule:
$$ \text{Input:}\quad x \in \mathbb{R}^2 $$
$$ \text{Hidden layer:}\quad z^{(1)} = W^{(1)}x + b^{(1)} $$
$$ \quad a^{(1)} = \sigma\big(z^{(1)}\big) $$
$$ \text{Output layer:}\quad z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} $$
$$ \quad y = \sigma\big(z^{(2)}\big) $$
$$ \text{Loss:}\quad L = \dfrac{1}{2}\big(y - t\big)^2 $$
Sigmoid activation
$$ \sigma(u) = \frac{1}{1 + e^{-u}}, \quad \sigma'(u) = \sigma(u)(1 - \sigma(u)) $$
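If you want to follow along in code, here is a minimal Python/NumPy sketch of the sigmoid and its derivative. The helper names `sigmoid` and `sigmoid_prime` are our own and are reused in the later snippets:

```python
import numpy as np

def sigmoid(u):
    """Elementwise logistic sigmoid: 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_prime(u):
    """Derivative of the sigmoid: sigma(u) * (1 - sigma(u))."""
    s = sigmoid(u)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```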
Formal objective (mathematical statement)
We want to find parameters $$ \Theta = \{W^{(1)},\; b^{(1)},\; W^{(2)},\; b^{(2)}\} $$ that minimize the loss function:
$$ J(\Theta) = \frac{1}{2}(y - t)^2 = \frac{1}{2}\left[\sigma\big(W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)}\big) - t\right]^2 $$
To minimize $J(\Theta)$, we need the gradients of $J$ with respect to every parameter in $\Theta$:
$$ \frac{\partial J}{\partial W^{(2)}},\quad \frac{\partial J}{\partial b^{(2)}},\quad \frac{\partial J}{\partial W^{(1)}},\quad \frac{\partial J}{\partial b^{(1)}} $$
Each of these will be derived through the chain rule, following the dependency structure defined above.
That’s what backpropagation really is: a structured application of the chain rule across this computation graph.
4. Network Parameters and Example Values
To make this concrete, we fix specific values for all quantities.
| Symbol | Meaning | Value |
|---|---|---|
| $x$ | Input vector | $\begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}$ |
| $W^{(1)}$ | Hidden-layer weights | $\begin{bmatrix} 0.2 & -0.4 \\ 0.7 & 0.1 \end{bmatrix}$ |
| $b^{(1)}$ | Hidden-layer biases | $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$ |
| $W^{(2)}$ | Output-layer weights | $\begin{bmatrix} 0.3 & -0.2 \end{bmatrix}$ |
| $b^{(2)}$ | Output-layer bias | $0.1$ |
| $t$ | Target | $1.0$ |
| $\eta$ | Learning rate | $0.1$ |
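Here is the same setup as NumPy arrays, a small sketch we will reuse in the later snippets (the variable names are just mnemonics for the symbols in the table):

```python
import numpy as np

x  = np.array([1.0, 0.5])          # input vector
W1 = np.array([[0.2, -0.4],
               [0.7,  0.1]])       # hidden-layer weights
b1 = np.array([0.0, 0.0])          # hidden-layer biases
W2 = np.array([[0.3, -0.2]])       # output-layer weights, shape (1, 2)
b2 = np.array([0.1])               # output-layer bias
t   = 1.0                          # target
eta = 0.1                          # learning rate
```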
5. Forward Pass (Prediction)
We compute the network’s output step by step, boxing each major result for clarity.
Step 1 – Hidden layer pre-activation
The hidden layer computes a weighted sum of the inputs plus bias:
$$ z^{(1)} = W^{(1)}x + b^{(1)} $$
Substitute numbers:
$$ z^{(1)} = \begin{bmatrix} 0.2 & -0.4 \\ 0.7 & 0.1 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} $$
Calculate each component:
$$ \begin{aligned} z^{(1)}_1 &= 0.2 \cdot 1.0 + (-0.4) \cdot 0.5 + 0 = 0.2 - 0.2 = 0.0 \\ z^{(1)}_2 &= 0.7 \cdot 1.0 + 0.1 \cdot 0.5 + 0 = 0.7 + 0.05 = 0.75 \end{aligned} $$
Boxed result:
$$ \boxed{ z^{(1)} = \begin{bmatrix} 0.0 \\ 0.75 \end{bmatrix} } $$
Step 2 – Hidden layer activation
Apply the sigmoid function to each component:
$$ a^{(1)} = \sigma(z^{(1)}) = \begin{bmatrix} \sigma(0.0) \\ \sigma(0.75) \end{bmatrix} $$
Calculate each value:
$$ \begin{aligned} \sigma(0.0) &= \frac{1}{1 + e^{0}} = 0.5 \\ \sigma(0.75) &= \frac{1}{1 + e^{-0.75}} \approx 0.6791787 \end{aligned} $$
Boxed result:
$$ \boxed{ a^{(1)} = \begin{bmatrix} 0.5 \\ 0.6791787 \end{bmatrix} } $$
Step 3 – Output pre-activation
The output neuron computes:
$$ z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} $$
Substitute values:
$$ z^{(2)} = [0.3, -0.2] \begin{bmatrix} 0.5 \\ 0.6791787 \end{bmatrix} + 0.1 $$
Calculate:
$$ 0.3 \cdot 0.5 + (-0.2) \cdot 0.6791787 + 0.1 = 0.15 - 0.1358357 + 0.1 = 0.1141643 $$
Boxed result:
$$ \boxed{z^{(2)} = 0.1141643} $$
Step 4 – Output activation
Apply the sigmoid function:
$$ y = \sigma(z^{(2)}) = \frac{1}{1 + e^{-0.1141643}} \approx 0.5285101 $$
Boxed result:
$$ \boxed{y \approx 0.5285} $$
Step 5 – Loss value
Compute the mean squared error:
$$ L = \frac{1}{2}(y - t)^2 = \frac{1}{2}(0.5285 - 1)^2 = 0.11115 $$
Boxed result:
$$ \boxed{L \approx 0.11115} $$
The network predicts 0.5285 when the target is 1.0, so the error is about 0.47 and the loss ≈ 0.11.
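The entire forward pass is only a few lines of code. A sketch reusing the arrays and the `sigmoid` helper defined above reproduces the boxed values (up to rounding):

```python
# Forward pass for the 2-2-1 network.
z1 = W1 @ x + b1            # [0.0, 0.75]
a1 = sigmoid(z1)            # [0.5, 0.6791787]
z2 = W2 @ a1 + b2           # [0.1141643]
y  = sigmoid(z2)            # [0.5285101]
L  = 0.5 * (y - t) ** 2     # [0.111151]

print(z1, a1, z2, y, L)
```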
6. Before Backpropagation: What Are We Minimizing?
Let’s pause before diving into gradients.
We have a scalar loss function $L(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$.
All the internal variables $z^{(1)}$, $a^{(1)}$, $z^{(2)}$, $y$ depend on the parameters.
Our goal is to find the direction in parameter space that decreases $L$ the fastest.
Mathematically:
$$ \nabla_\Theta L = \begin{bmatrix} \frac{\partial L}{\partial W^{(1)}}, & \frac{\partial L}{\partial b^{(1)}}, & \frac{\partial L}{\partial W^{(2)}}, & \frac{\partial L}{\partial b^{(2)}} \end{bmatrix} $$
We’ll compute these by applying the chain rule systematically across our dependency chain from Section 3.
7. Backpropagation — General Idea
We start from the scalar loss $L$ and move backward through the network graph.
The core chain-rule logic is:
$$ \frac{\partial L}{\partial z^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} $$
and
$$ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial z^{(l)}}\, (a^{(l-1)})^T, \qquad \frac{\partial L}{\partial b^{(l)}} = \frac{\partial L}{\partial z^{(l)}}. $$
This pattern repeats for every layer $l$.
The term $\frac{\partial L}{\partial z^{(l)}}$ is often called the delta of that layer:
$$ \delta^{(l)} = \frac{\partial L}{\partial z^{(l)}} $$
So “backpropagation” literally means computing each $\delta^{(l)}$ layer by layer, backward.
8. Backpropagation in the 2–2–1 Network (Numerical)
Now, we’ll compute all gradients explicitly using the example values.
Step 1 — Loss derivative wrt output
The derivative of the loss with respect to the output is:
$$ \frac{\partial L}{\partial y} = y - t = 0.5285101 - 1 = -0.4714899 $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial y} = -0.4714899} $$
Step 2 — Derivative of sigmoid at output
The derivative of the sigmoid at the output is:
$$ \frac{dy}{dz^{(2)}} = y(1 - y) = 0.5285101(1 - 0.5285101) = 0.2491872 $$
Boxed result:
$$ \boxed{\frac{dy}{dz^{(2)}} = 0.2491872} $$
Step 3 — Delta of output layer
The error signal (delta) for the output layer is:
$$ \delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial y} \cdot \frac{dy}{dz^{(2)}} = (-0.4714899)(0.2491872) = -0.1174892 $$
Boxed result:
$$ \boxed{\delta^{(2)} = -0.1174892} $$
Step 4 — Gradients for output layer
Recall:
$$ \frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} (a^{(1)})^T, \qquad \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} $$
Compute:
$$ \frac{\partial L}{\partial W^{(2)}} = -0.1174892 \begin{bmatrix}0.5 & 0.6791787\end{bmatrix} = \begin{bmatrix}-0.0587446 & -0.0797962\end{bmatrix} $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial W^{(2)}} = \begin{bmatrix}-0.0587446 & -0.0797962\end{bmatrix}} $$
and
$$ \frac{\partial L}{\partial b^{(2)}} = -0.1174892 $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial b^{(2)}} = -0.1174892} $$
Step 5 — Backpropagate to hidden layer
First, propagate through the weights:
$$ \frac{\partial L}{\partial a^{(1)}} = (W^{(2)})^T \delta^{(2)} = \begin{bmatrix}0.3 \\ -0.2\end{bmatrix}(-0.1174892) = \begin{bmatrix}-0.0352468 \\ 0.0234978\end{bmatrix} $$
Then multiply elementwise by the derivative of the hidden sigmoid:
$$ \frac{da^{(1)}}{dz^{(1)}} = a^{(1)} \odot (1 - a^{(1)}) = \begin{bmatrix} 0.5(1-0.5) \\ 0.6791787(1-0.6791787) \end{bmatrix} = \begin{bmatrix} 0.25 \\ 0.217895 \end{bmatrix} $$
So:
$$ \delta^{(1)} = \frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \odot \frac{da^{(1)}}{dz^{(1)}} = \begin{bmatrix} -0.0352468(0.25) \\ 0.0234978(0.217895) \end{bmatrix} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} $$
Boxed result:
$$ \boxed{\delta^{(1)} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix}} $$
Step 6 — Gradients for hidden layer
$$ \frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^T, \quad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)} $$
Compute:
$$ \frac{\partial L}{\partial W^{(1)}} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} \begin{bmatrix}1 & 0.5\end{bmatrix} = \begin{bmatrix} -0.0088117 & -0.0044058 \\ 0.0051201 & 0.0025600 \end{bmatrix} $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial W^{(1)}} = \begin{bmatrix} -0.0088117 & -0.0044058 \\ 0.0051201 & 0.0025600 \end{bmatrix}} $$
and
$$ \frac{\partial L}{\partial b^{(1)}} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial b^{(1)}} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix}} $$
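The same numbers drop out of a short backward-pass sketch, continuing with the variables from the forward-pass snippet above:

```python
# Backward pass: compute all gradients via the chain rule.
dL_dy   = y - t                     # -0.4714899
dy_dz2  = y * (1.0 - y)             #  0.2491872
delta2  = dL_dy * dy_dz2            # -0.1174892

dL_dW2  = np.outer(delta2, a1)      # [[-0.0587446, -0.0797962]]
dL_db2  = delta2                    # -0.1174892

dL_da1  = W2.T @ delta2             # [-0.0352468, 0.0234978]
da1_dz1 = a1 * (1.0 - a1)           # [0.25, 0.217895]
delta1  = dL_da1 * da1_dz1          # [-0.0088117, 0.0051201]

dL_dW1  = np.outer(delta1, x)       # 2x2 gradient matrix
dL_db1  = delta1
```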
9. Parameter Update
We apply gradient descent:
$$ \theta_{\text{new}} = \theta_{\text{old}} - \eta \frac{\partial L}{\partial \theta} $$
Output layer updates
Update the output layer weights and bias using gradient descent:
$$ W^{(2)}_{\text{new}} = [0.3, -0.2] - 0.1[-0.0587446, -0.0797962] = [0.3058745, -0.1920204] $$
Boxed result:
$$ \boxed{W^{(2)}_{\text{new}} = [0.3058745, -0.1920204]} $$
$$ b^{(2)}_{\text{new}} = 0.1 - 0.1(-0.1174892) = 0.1117489 $$
Boxed result:
$$ \boxed{b^{(2)}_{\text{new}} = 0.1117489} $$
Hidden layer updates
Update the hidden layer weights and biases (each operation shown on its own line for clarity):
$$ \begin{aligned} W^{(1)}_{\text{new}} &= \begin{bmatrix} 0.2 & -0.4 \\ 0.7 & 0.1 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0088117 & -0.0044058 \\ 0.0051201 & 0.0025600 \end{bmatrix} \\ &= \begin{bmatrix} 0.2008812 & -0.3995594 \\ 0.6994880 & 0.0997440 \end{bmatrix} \end{aligned} $$
Boxed result:
$$ \boxed{W^{(1)}_{\text{new}} = \begin{bmatrix} 0.2008812 & -0.3995594 \\ 0.6994880 & 0.0997440 \end{bmatrix}} $$
For the biases:
$$ \begin{aligned} b^{(1)}_{\text{new}} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} \\ &= \begin{bmatrix} 0.0008812 \\ -0.0005120 \end{bmatrix} \end{aligned} $$
Boxed result:
$$ \boxed{b^{(1)}_{\text{new}} = \begin{bmatrix} 0.0008812 \\ -0.0005120 \end{bmatrix}} $$
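In code, the whole update is one line per parameter; a sketch continuing from the gradient snippet above:

```python
# One gradient-descent step for every parameter.
W2_new = W2 - eta * dL_dW2      # [[0.3058745, -0.1920204]]
b2_new = b2 - eta * dL_db2      # [0.1117489]
W1_new = W1 - eta * dL_dW1      # [[0.2008812, -0.3995594],
                                #  [0.6994880,  0.0997440]]
b1_new = b1 - eta * dL_db1      # [0.0008812, -0.0005120]
```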
10. New Forward Pass After Update
Let’s verify that the loss decreases.
Hidden layer
The new hidden layer pre-activation is:
$$ \boxed{ z^{(1)}_\text{new} = \begin{bmatrix} 0.00198 \\ 0.74885 \end{bmatrix} } $$
The new hidden layer activation is:
$$ \boxed{ a^{(1)}_\text{new} = \begin{bmatrix} 0.5005 \\ 0.6789 \end{bmatrix} } $$
Output
$$ z^{(2)}_{\text{new}} = [0.3058745, -0.1920204] \begin{bmatrix} 0.5005 \\ 0.6789 \end{bmatrix} + 0.1117489 = 0.13447 $$
$$ y_{\text{new}} = \sigma(0.13447) = 0.53357 $$
Loss
After the update, the new loss is:
$$ L_{\text{new}} = \tfrac12(0.53357 - 1)^2 = 0.10878 $$
Boxed result:
$$ \boxed{L_{\text{new}} \approx 0.10878} $$
✅ Loss decreased from 0.11115 → 0.10878 after one update.
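You can confirm this in code by rerunning the forward pass with the updated parameters (again reusing the variables from the earlier snippets):

```python
# Verify that the loss decreases after one update.
a1_new = sigmoid(W1_new @ x + b1_new)       # [0.50050, 0.67893]
y_new  = sigmoid(W2_new @ a1_new + b2_new)  # ~0.53357
L_new  = 0.5 * (y_new - t) ** 2             # ~0.10878 < 0.11115
```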
That’s the whole training process — on a microscopic scale.
11. Gradient Interpretation
- Since $y < t$, the output delta $\delta^{(2)}$ is negative, so the output-layer gradients are negative; gradient descent therefore pushes $W^{(2)}$ and $b^{(2)}$ upward, increasing $y$ on the next pass.
- Hidden-layer deltas $\delta^{(1)}$ are smaller in magnitude, since they’re “modulated” by both $W^{(2)}$ and the sigmoid derivative $a^{(1)}(1 - a^{(1)})$.
- Gradients for weights scale with the input $x$; if $x_i = 0$, that feature doesn’t influence that weight’s update.
12. The Algebraic Backbone of Backpropagation
Let’s make the core relationships explicit — these are the same identities we used above, just grouped together:
$$ z^{(1)} = W^{(1)}x + b^{(1)} $$
$$ a^{(1)} = \sigma(z^{(1)}) $$
$$ z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} $$
$$ y = \sigma(z^{(2)}) $$
$$ L = \frac{1}{2}(y - t)^2 $$
And their derivatives:
$$ \frac{\partial L}{\partial y} = y - t $$
$$ \frac{dy}{dz^{(2)}} = y(1-y) $$
$$ \frac{\partial L}{\partial z^{(2)}} = (y - t)y(1 - y) $$
$$ \frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial z^{(2)}} (a^{(1)})^T $$
$$ \frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial z^{(2)}} $$
$$ \frac{\partial L}{\partial a^{(1)}} = (W^{(2)})^T \frac{\partial L}{\partial z^{(2)}} $$
$$ \frac{da^{(1)}}{dz^{(1)}} = a^{(1)}(1 - a^{(1)}) $$
$$ \frac{\partial L}{\partial z^{(1)}} = \left[(W^{(2)})^T \frac{\partial L}{\partial z^{(2)}}\right] \odot a^{(1)}(1 - a^{(1)}) $$
$$ \frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(1)}} x^T $$
$$ \frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial z^{(1)}} $$
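Grouped the same way, the full forward-and-backward computation fits in one short function. This is a minimal sketch for the 2-2-1 case, reusing the `sigmoid` helper from the earlier snippets; the function name is our own:

```python
def forward_backward(x, t, W1, b1, W2, b2):
    """One forward and backward pass for the 2-2-1 sigmoid/MSE network."""
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y  = sigmoid(z2)
    L  = 0.5 * np.sum((y - t) ** 2)

    # Backward pass (the identities listed above)
    delta2 = (y - t) * y * (1.0 - y)
    dW2 = np.outer(delta2, a1)
    db2 = delta2
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return L, dW1, db1, dW2, db2
```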
13. Extending to Multiple Layers and Batches
For multiple layers
If you stack more layers, the same pattern applies recursively:
$$ \delta^{(l)} = \big((W^{(l+1)})^T \delta^{(l+1)}\big) \odot f'(z^{(l)}), $$
and
$$ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}(a^{(l-1)})^T, \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}. $$
For mini-batches
Let $X \in \mathbb{R}^{d \times N}$ (each column an input sample), and $T \in \mathbb{R}^{m \times N}$ (targets).
Then:
$$ Z = WX + b\mathbf{1}^T, \qquad Y = f(Z), $$
and
$$ \frac{\partial L}{\partial W} = \frac{1}{N}\,\frac{\partial L}{\partial Z}\, X^T, \qquad \frac{\partial L}{\partial b} = \frac{1}{N}\,\frac{\partial L}{\partial Z}\, \mathbf{1}. $$
It’s the same math, just summed (or averaged) over $N$ samples.
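Here is a sketch of those batched layer gradients, assuming `dL_dZ` already holds the computed $\partial L/\partial Z$ for the layer, with one column per sample:

```python
def layer_gradients(dL_dZ, A_prev):
    """Batched gradients for one affine layer Z = W A_prev + b 1^T.

    dL_dZ  : (m, N) gradient of the loss w.r.t. the pre-activations Z
    A_prev : (d, N) activations feeding this layer (or the inputs X)
    """
    N = dL_dZ.shape[1]              # number of samples in the batch
    dW = (dL_dZ @ A_prev.T) / N     # (m, d), same shape as W
    db = dL_dZ.sum(axis=1) / N      # (m,), one entry per bias
    return dW, db
```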
14. Common Mistakes and Sanity Checks
- Shape mismatches — Always verify that $\frac{\partial L}{\partial W}$ has the same shape as $W$.
- Sigmoid derivative misuse — Reuse the cached activation, $y(1 - y)$, instead of recomputing $\sigma'(z)$ from scratch.
- Forgetting the bias gradient — Always sum over batch dimensions.
- Sign errors — If the loss increases, check whether you’re subtracting or adding the gradient; a numerical gradient check (sketched after this list) catches most of these.
- Exploding/vanishing gradients — Common with deep sigmoids. ReLU or tanh activations can help.
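A numerical gradient check catches most of these problems: nudge one parameter, recompute the loss, and compare the finite-difference slope with the analytic gradient. Here is a minimal sketch, reusing the `forward_backward` function from the earlier snippet:

```python
def loss_only(x, t, W1, b1, W2, b2):
    """Convenience wrapper: return only the scalar loss."""
    return forward_backward(x, t, W1, b1, W2, b2)[0]

def check_gradient(x, t, W1, b1, W2, b2, eps=1e-6):
    """Compare analytic dL/dW1[0,0] to a central finite difference."""
    _, dW1, *_ = forward_backward(x, t, W1, b1, W2, b2)

    W1_plus, W1_minus = W1.copy(), W1.copy()
    W1_plus[0, 0]  += eps
    W1_minus[0, 0] -= eps
    numeric = (loss_only(x, t, W1_plus, b1, W2, b2)
               - loss_only(x, t, W1_minus, b1, W2, b2)) / (2 * eps)
    print(f"analytic={dW1[0, 0]:.7f}  numeric={numeric:.7f}")

check_gradient(x, t, W1, b1, W2, b2)   # both should print about -0.0088117
```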
15. The Big Picture
From a high-level algebraic perspective:
- Each layer applies an affine transformation $Wx + b$.
- Each activation applies a nonlinear scalar function $f$.
- Each loss measures distance between output and target.
- Backpropagation is just systematically applying the chain rule through these compositions.
So, conceptually:
$$ L = \ell(f_2(W_2 f_1(W_1 x + b_1) + b_2), t) $$
and each component of $\nabla_\Theta L$ is a product of local derivatives along the path from the loss back to that parameter, for example:
$$ \frac{\partial L}{\partial W_1} = \frac{\partial \ell}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial W_1} $$
Everything else — whether for deep networks or transformers — builds on this same principle.
16. Final Recap and Takeaways
- We defined the dependencies explicitly: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to y \to L$.
- We derived the mathematical objective $J(\Theta) = \frac{1}{2}(y - t)^2$.
- We computed each derivative via the chain rule step by step.
- We ran one full iteration, confirmed weight updates, and showed that the loss decreased.
- We generalized the pattern to any number of layers and to batch processing.
If you’ve followed this walkthrough, you’ve effectively implemented the core of backpropagation — by hand, line by line. That’s the foundation of every modern deep learning library, from TensorFlow to PyTorch.
Written for learners who want to see every symbol move, every dependency formalized, and every derivative unfold — one layer at a time.
