How to Manually Compute a Neural Network’s Predictions and Backpropagation Step by Step
A complete mathematical walkthrough for the 2–2–1 network with sigmoid activations and mean squared error.
1. Motivation: Why Compute a Neural Network by Hand?
Before we dive into the math, let’s clarify why this exercise matters.
- When you compute a forward and backward pass manually, you see exactly what each weight and bias is doing.
- You demystify the “black box” — it’s just linear algebra plus calculus.
- You understand how loss functions and activation derivatives link together.
By the end of this article, you’ll be able to write down every quantity in a small neural network, compute the loss, derive gradients, and perform one full parameter update by hand.
2. Setup and Notation
We’ll work with a tiny feed-forward neural network with two inputs, one hidden layer of two neurons, and one output neuron.
That’s a 2–2–1 architecture.
Network overview
Input (x1, x2)
↓
Hidden layer (2 neurons, sigmoid activation)
↓
Output layer (1 neuron, sigmoid activation)
↓
Loss (MSE)
3. Entities and Dependencies
Before jumping into derivatives, let’s list everything that exists in this system — the entities and their algebraic relationships.
This is the full dependency structure we’ll use later for the chain rule:
$$ \text{Input:}\quad x \in \mathbb{R}^2 $$
$$ \text{Hidden layer:}\quad z^{(1)} = W^{(1)}x + b^{(1)} $$
$$ \quad a^{(1)} = \sigma\big(z^{(1)}\big) $$
$$ \text{Output layer:}\quad z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} $$
$$ \quad y = \sigma\big(z^{(2)}\big) $$
$$ \text{Loss:}\quad L = \dfrac{1}{2}\big(y - t\big)^2 $$
Sigmoid activation
$$ \sigma(u) = \frac{1}{1 + e^{-u}}, \quad \sigma'(u) = \sigma(u)(1 - \sigma(u)) $$
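If you want to follow along in code, here is a minimal Python/NumPy sketch of the sigmoid and its derivative. The helper names `sigmoid` and `sigmoid_prime` are our own and are reused in the later snippets:

```python
import numpy as np

def sigmoid(u):
    """Elementwise logistic sigmoid: 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_prime(u):
    """Derivative of the sigmoid: sigma(u) * (1 - sigma(u))."""
    s = sigmoid(u)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```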
Formal objective (mathematical statement)
We want to find parameters $$ \Theta = \{W^{(1)},\; b^{(1)},\; W^{(2)},\; b^{(2)}\} $$ that minimize the loss function:
$$ J(\Theta) = \frac{1}{2}(y - t)^2 = \frac{1}{2}\left[\sigma\big(W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)}\big) - t\right]^2 $$
To minimize $J(\Theta)$, we need the gradients of $J$ with respect to every parameter in $\Theta$:
$$ \frac{\partial J}{\partial W^{(2)}},\quad \frac{\partial J}{\partial b^{(2)}},\quad \frac{\partial J}{\partial W^{(1)}},\quad \frac{\partial J}{\partial b^{(1)}} $$
Each of these will be derived through the chain rule, following the dependency structure defined above.
That’s what backpropagation really is: a structured application of the chain rule across this computation graph.
4. Network Parameters and Example Values
To make this concrete, we fix specific values for all quantities.
| Symbol | Meaning | Value |
|---|---|---|
| $x$ | Input vector | $\begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}$ |
| $W^{(1)}$ | Hidden-layer weights | $\begin{bmatrix} 0.2 & -0.4 \\ 0.7 & 0.1 \end{bmatrix}$ |
| $b^{(1)}$ | Hidden-layer biases | $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$ |
| $W^{(2)}$ | Output-layer weights | $\begin{bmatrix} 0.3 & -0.2 \end{bmatrix}$ |
| $b^{(2)}$ | Output-layer bias | $0.1$ |
| $t$ | Target | $1.0$ |
| $\eta$ | Learning rate | $0.1$ |
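Here is the same setup as NumPy arrays, a small sketch we will reuse in the later snippets (the variable names are just mnemonics for the symbols in the table):

```python
import numpy as np

x  = np.array([1.0, 0.5])          # input vector
W1 = np.array([[0.2, -0.4],
               [0.7,  0.1]])       # hidden-layer weights
b1 = np.array([0.0, 0.0])          # hidden-layer biases
W2 = np.array([[0.3, -0.2]])       # output-layer weights, shape (1, 2)
b2 = np.array([0.1])               # output-layer bias
t   = 1.0                          # target
eta = 0.1                          # learning rate
```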
5. Forward Pass (Prediction)
We compute the network’s output step by step, boxing each major result for clarity.
Step 1 – Hidden layer pre-activation
The hidden layer computes a weighted sum of the inputs plus bias:
$$ z^{(1)} = W^{(1)}x + b^{(1)} $$
Substitute numbers:
$$ z^{(1)} = \begin{bmatrix} 0.2 & -0.4 \\ 0.7 & 0.1 \end{bmatrix} \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} $$
Calculate each component:
$$ \begin{aligned} z^{(1)}_1 &= 0.2 \cdot 1.0 + (-0.4) \cdot 0.5 + 0 = 0.2 - 0.2 = 0.0 \\ z^{(1)}_2 &= 0.7 \cdot 1.0 + 0.1 \cdot 0.5 + 0 = 0.7 + 0.05 = 0.75 \end{aligned} $$
Boxed result:
$$ \boxed{ z^{(1)} = \begin{bmatrix} 0.0 \\ 0.75 \end{bmatrix} } $$
Step 2 – Hidden layer activation
Apply the sigmoid function to each component:
$$ a^{(1)} = \sigma(z^{(1)}) = \begin{bmatrix} \sigma(0.0) \\ \sigma(0.75) \end{bmatrix} $$
Calculate each value:
$$ \begin{aligned} \sigma(0.0) &= \frac{1}{1 + e^{0}} = 0.5 \\ \sigma(0.75) &= \frac{1}{1 + e^{-0.75}} \approx 0.6791787 \end{aligned} $$
Boxed result:
$$ \boxed{ a^{(1)} = \begin{bmatrix} 0.5 \\ 0.6791787 \end{bmatrix} } $$
Step 3 – Output pre-activation
The output neuron computes:
$$ z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} $$
Substitute values:
$$ z^{(2)} = [0.3, -0.2] \begin{bmatrix} 0.5 \\ 0.6791787 \end{bmatrix} + 0.1 $$
Calculate:
$$ 0.3 \cdot 0.5 + (-0.2) \cdot 0.6791787 + 0.1 = 0.15 - 0.1358357 + 0.1 = 0.1141643 $$
Boxed result:
$$ \boxed{z^{(2)} = 0.1141643} $$
Step 4 – Output activation
Apply the sigmoid function:
$$ y = \sigma(z^{(2)}) = \frac{1}{1 + e^{-0.1141643}} \approx 0.5285101 $$
Boxed result:
$$ \boxed{y \approx 0.5285} $$
Step 5 – Loss value
Compute the mean squared error:
$$ L = \frac{1}{2}(y - t)^2 = \frac{1}{2}(0.5285 - 1)^2 = 0.11115 $$
Boxed result:
$$ \boxed{L \approx 0.11115} $$
The network predicts 0.5285 when the target is 1.0, so the error is about 0.47 and the loss ≈ 0.11.
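The entire forward pass is only a few lines of code. A sketch reusing the arrays and the `sigmoid` helper defined above reproduces the boxed values (up to rounding):

```python
# Forward pass for the 2-2-1 network.
z1 = W1 @ x + b1            # [0.0, 0.75]
a1 = sigmoid(z1)            # [0.5, 0.6791787]
z2 = W2 @ a1 + b2           # [0.1141643]
y  = sigmoid(z2)            # [0.5285101]
L  = 0.5 * (y - t) ** 2     # [0.111151]

print(z1, a1, z2, y, L)
```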
6. Before Backpropagation: What Are We Minimizing?
Let’s pause before diving into gradients.
We have a scalar loss function $L(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$.
All the internal variables $z^{(1)}$, $a^{(1)}$, $z^{(2)}$, $y$ depend on the parameters.
Our goal is to find the direction in parameter space that decreases $L$ the fastest.
Mathematically:
$$ \nabla_\Theta L = \begin{bmatrix} \frac{\partial L}{\partial W^{(1)}}, & \frac{\partial L}{\partial b^{(1)}}, & \frac{\partial L}{\partial W^{(2)}}, & \frac{\partial L}{\partial b^{(2)}} \end{bmatrix} $$
We’ll compute these by applying the chain rule systematically across our dependency chain from Section 3.
7. Backpropagation — General Idea
We start from the scalar loss $L$ and move backward through the network graph.
The core chain-rule logic is:
$$ \frac{\partial L}{\partial z^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} $$
and
$$ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial z^{(l)}}\, (a^{(l-1)})^T, \qquad \frac{\partial L}{\partial b^{(l)}} = \frac{\partial L}{\partial z^{(l)}}. $$
This pattern repeats for every layer $l$.
The term $\frac{\partial L}{\partial z^{(l)}}$ is often called the delta of that layer:
$$ \delta^{(l)} = \frac{\partial L}{\partial z^{(l)}} $$
So “backpropagation” literally means computing each $\delta^{(l)}$ layer by layer, backward.
8. Backpropagation in the 2–2–1 Network (Numerical)
Now, we’ll compute all gradients explicitly using the example values.
Step 1 — Loss derivative wrt output
The derivative of the loss with respect to the output is:
$$ \frac{\partial L}{\partial y} = y - t = 0.5285101 - 1 = -0.4714899 $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial y} = -0.4714899} $$
Step 2 — Derivative of sigmoid at output
The derivative of the sigmoid at the output is:
$$ \frac{dy}{dz^{(2)}} = y(1 - y) = 0.5285101(1 - 0.5285101) = 0.2491872 $$
Boxed result:
$$ \boxed{\frac{dy}{dz^{(2)}} = 0.2491872} $$
Step 3 — Delta of output layer
The error signal (delta) for the output layer is:
$$ \delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial y} \cdot \frac{dy}{dz^{(2)}} = (-0.4714899)(0.2491872) = -0.1174892 $$
Boxed result:
$$ \boxed{\delta^{(2)} = -0.1174892} $$
Step 4 — Gradients for output layer
Recall:
$$ \frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} (a^{(1)})^T, \qquad \frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} $$
Compute:
$$ \frac{\partial L}{\partial W^{(2)}} = -0.1174892 \begin{bmatrix}0.5 & 0.6791787\end{bmatrix} = \begin{bmatrix}-0.0587446 & -0.0797962\end{bmatrix} $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial W^{(2)}} = \begin{bmatrix}-0.0587446 & -0.0797962\end{bmatrix}} $$
and
$$ \frac{\partial L}{\partial b^{(2)}} = -0.1174892 $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial b^{(2)}} = -0.1174892} $$
Step 5 — Backpropagate to hidden layer
First, propagate through the weights:
$$ \frac{\partial L}{\partial a^{(1)}} = (W^{(2)})^T \delta^{(2)} = \begin{bmatrix}0.3 \\ -0.2\end{bmatrix}(-0.1174892) = \begin{bmatrix}-0.0352468 \\ 0.0234978\end{bmatrix} $$
Then multiply elementwise by the derivative of the hidden sigmoid:
$$ \frac{da^{(1)}}{dz^{(1)}} = a^{(1)} \odot (1 - a^{(1)}) = \begin{bmatrix} 0.5(1-0.5) \\ 0.6791787(1-0.6791787) \end{bmatrix} = \begin{bmatrix} 0.25 \\ 0.217895 \end{bmatrix} $$
So:
$$ \delta^{(1)} = \frac{\partial L}{\partial z^{(1)}} = \frac{\partial L}{\partial a^{(1)}} \odot \frac{da^{(1)}}{dz^{(1)}} = \begin{bmatrix} -0.0352468(0.25) \\ 0.0234978(0.217895) \end{bmatrix} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} $$
Boxed result:
$$ \boxed{\delta^{(1)} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix}} $$
Step 6 — Gradients for hidden layer
$$ \frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^T, \quad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)} $$
Compute:
$$ \frac{\partial L}{\partial W^{(1)}} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} \begin{bmatrix}1 & 0.5\end{bmatrix} = \begin{bmatrix} -0.0088117 & -0.0044058 \\ 0.0051201 & 0.0025600 \end{bmatrix} $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial W^{(1)}} = \begin{bmatrix} -0.0088117 & -0.0044058 \\ 0.0051201 & 0.0025600 \end{bmatrix}} $$
and
$$ \frac{\partial L}{\partial b^{(1)}} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} $$
Boxed result:
$$ \boxed{\frac{\partial L}{\partial b^{(1)}} = \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix}} $$
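The same numbers drop out of a short backward-pass sketch, continuing with the variables from the forward-pass snippet above:

```python
# Backward pass: compute all gradients via the chain rule.
dL_dy   = y - t                     # -0.4714899
dy_dz2  = y * (1.0 - y)             #  0.2491872
delta2  = dL_dy * dy_dz2            # -0.1174892

dL_dW2  = np.outer(delta2, a1)      # [[-0.0587446, -0.0797962]]
dL_db2  = delta2                    # -0.1174892

dL_da1  = W2.T @ delta2             # [-0.0352468, 0.0234978]
da1_dz1 = a1 * (1.0 - a1)           # [0.25, 0.217895]
delta1  = dL_da1 * da1_dz1          # [-0.0088117, 0.0051201]

dL_dW1  = np.outer(delta1, x)       # 2x2 gradient matrix
dL_db1  = delta1
```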
9. Parameter Update
We apply gradient descent:
$$ \theta_{\text{new}} = \theta_{\text{old}} - \eta \frac{\partial L}{\partial \theta} $$
Output layer updates
Update the output layer weights and bias using gradient descent:
$$ W^{(2)}_{\text{new}} = [0.3, -0.2] - 0.1[-0.0587446, -0.0797962] = [0.3058745, -0.1920204] $$
Boxed result:
$$ \boxed{W^{(2)}_{\text{new}} = [0.3058745, -0.1920204]} $$
$$ b^{(2)}_{\text{new}} = 0.1 - 0.1(-0.1174892) = 0.1117489 $$
Boxed result:
$$ \boxed{b^{(2)}_{\text{new}} = 0.1117489} $$
Hidden layer updates
Update the hidden layer weights and biases (each operation shown on its own line for clarity):
$$ \begin{aligned} W^{(1)}_{\text{new}} &= \begin{bmatrix} 0.2 & -0.4 \\ 0.7 & 0.1 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0088117 & -0.0044058 \\ 0.0051201 & 0.0025600 \end{bmatrix} \\ &= \begin{bmatrix} 0.2008812 & -0.3995594 \\ 0.6994880 & 0.0997440 \end{bmatrix} \end{aligned} $$
Boxed result:
$$ \boxed{W^{(1)}_{\text{new}} = \begin{bmatrix} 0.2008812 & -0.3995594 \\ 0.6994880 & 0.0997440 \end{bmatrix}} $$
For the biases:
$$ \begin{aligned} b^{(1)}_{\text{new}} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} - 0.1 \begin{bmatrix} -0.0088117 \\ 0.0051201 \end{bmatrix} \\ &= \begin{bmatrix} 0.0008812 \\ -0.0005120 \end{bmatrix} \end{aligned} $$
Boxed result:
$$ \boxed{b^{(1)}_{\text{new}} = \begin{bmatrix} 0.0008812 \\ -0.0005120 \end{bmatrix}} $$
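In code, the whole update is one line per parameter; a sketch continuing from the gradient snippet above:

```python
# One gradient-descent step for every parameter.
W2_new = W2 - eta * dL_dW2      # [[0.3058745, -0.1920204]]
b2_new = b2 - eta * dL_db2      # [0.1117489]
W1_new = W1 - eta * dL_dW1      # [[0.2008812, -0.3995594],
                                #  [0.6994880,  0.0997440]]
b1_new = b1 - eta * dL_db1      # [0.0008812, -0.0005120]
```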
10. New Forward Pass After Update
Let’s verify that the loss decreases.
Hidden layer
The new hidden layer pre-activation is:
$$ \boxed{ z^{(1)}_\text{new} = \begin{bmatrix} 0.00198 \\ 0.74885 \end{bmatrix} } $$
The new hidden layer activation is:
$$ \boxed{ a^{(1)}_\text{new} = \begin{bmatrix} 0.5005 \\ 0.6789 \end{bmatrix} } $$
Output
$$ z^{(2)}_{\text{new}} = [0.3058745, -0.1920204] \begin{bmatrix} 0.5005 \\ 0.6789 \end{bmatrix} + 0.1117489 = 0.13447 $$
$$ y_{\text{new}} = \sigma(0.13447) = 0.53357 $$
Loss
After the update, the new loss is:
$$ L_{\text{new}} = \tfrac12(0.53357 - 1)^2 = 0.10878 $$
Boxed result:
$$ \boxed{L_{\text{new}} \approx 0.10878} $$
✅ Loss decreased from 0.11115 → 0.10878 after one update.
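You can confirm this in code by rerunning the forward pass with the updated parameters (again reusing the variables from the earlier snippets):

```python
# Verify that the loss decreases after one update.
a1_new = sigmoid(W1_new @ x + b1_new)       # [0.50050, 0.67893]
y_new  = sigmoid(W2_new @ a1_new + b2_new)  # ~0.53357
L_new  = 0.5 * (y_new - t) ** 2             # ~0.10878 < 0.11115
```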
That’s the whole training process — on a microscopic scale.
11. Gradient Interpretation
- Since $y < t$, the output delta $\delta^{(2)}$ is negative, so the output-layer gradients are negative; gradient descent therefore pushes $W^{(2)}$ and $b^{(2)}$ upward, increasing $y$ on the next pass.
- Hidden-layer deltas $\delta^{(1)}$ are smaller in magnitude, since they’re “modulated” by both $W^{(2)}$ and the sigmoid derivative $a^{(1)}(1 - a^{(1)})$.
- Gradients for weights scale with the input $x$; if $x_i = 0$, that feature doesn’t influence that weight’s update.
12. The Algebraic Backbone of Backpropagation
Let’s make the core relationships explicit — these are the same identities we used above, just grouped together:
$$ z^{(1)} = W^{(1)}x + b^{(1)} $$
$$ a^{(1)} = \sigma(z^{(1)}) $$
$$ z^{(2)} = W^{(2)}a^{(1)} + b^{(2)} $$
$$ y = \sigma(z^{(2)}) $$
$$ L = \frac{1}{2}(y - t)^2 $$
And their derivatives:
$$ \frac{\partial L}{\partial y} = y - t $$
$$ \frac{dy}{dz^{(2)}} = y(1-y) $$
$$ \frac{\partial L}{\partial z^{(2)}} = (y - t)y(1 - y) $$
$$ \frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial z^{(2)}} (a^{(1)})^T $$
$$ \frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial z^{(2)}} $$
$$ \frac{\partial L}{\partial a^{(1)}} = (W^{(2)})^T \frac{\partial L}{\partial z^{(2)}} $$
$$ \frac{da^{(1)}}{dz^{(1)}} = a^{(1)}(1 - a^{(1)}) $$
$$ \frac{\partial L}{\partial z^{(1)}} = \left[(W^{(2)})^T \frac{\partial L}{\partial z^{(2)}}\right] \odot a^{(1)}(1 - a^{(1)}) $$
$$ \frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial z^{(1)}} x^T $$
$$ \frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial z^{(1)}} $$
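Grouped the same way, the full forward-and-backward computation fits in one short function. This is a minimal sketch for the 2-2-1 case, reusing the `sigmoid` helper from the earlier snippets; the function name is our own:

```python
def forward_backward(x, t, W1, b1, W2, b2):
    """One forward and backward pass for the 2-2-1 sigmoid/MSE network."""
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y  = sigmoid(z2)
    L  = 0.5 * np.sum((y - t) ** 2)

    # Backward pass (the identities listed above)
    delta2 = (y - t) * y * (1.0 - y)
    dW2 = np.outer(delta2, a1)
    db2 = delta2
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return L, dW1, db1, dW2, db2
```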
13. Extending to Multiple Layers and Batches
For multiple layers
If you stack more layers, the same pattern applies recursively:
$$ \delta^{(l)} = \big((W^{(l+1)})^T \delta^{(l+1)}\big) \odot f'(z^{(l)}), $$
and
$$ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}(a^{(l-1)})^T, \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}. $$
For mini-batches
Let $X \in \mathbb{R}^{d \times N}$ (each column an input sample), and $T \in \mathbb{R}^{m \times N}$ (targets).
Then:
$$ Z = WX + b\mathbf{1}^T, \qquad Y = f(Z), $$
and
$$ \frac{\partial L}{\partial W} = \frac{1}{N}\,\frac{\partial L}{\partial Z}\, X^T, \qquad \frac{\partial L}{\partial b} = \frac{1}{N}\,\frac{\partial L}{\partial Z}\, \mathbf{1}. $$
It’s the same math, just summed (or averaged) over $N$ samples.
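Here is a sketch of those batched layer gradients, assuming `dL_dZ` already holds the computed $\partial L/\partial Z$ for the layer, with one column per sample:

```python
def layer_gradients(dL_dZ, A_prev):
    """Batched gradients for one affine layer Z = W A_prev + b 1^T.

    dL_dZ  : (m, N) gradient of the loss w.r.t. the pre-activations Z
    A_prev : (d, N) activations feeding this layer (or the inputs X)
    """
    N = dL_dZ.shape[1]              # number of samples in the batch
    dW = (dL_dZ @ A_prev.T) / N     # (m, d), same shape as W
    db = dL_dZ.sum(axis=1) / N      # (m,), one entry per bias
    return dW, db
```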
14. Common Mistakes and Sanity Checks
- Shape mismatches — Always verify that $\frac{\partial L}{\partial W}$ has the same shape as $W$.
- Sigmoid derivative misuse — Reuse the cached activation, $y(1 - y)$, instead of recomputing $\sigma'(z)$ from scratch.
- Forgetting the bias gradient — Always sum over batch dimensions.
- Sign errors — If the loss increases, check whether you’re subtracting or adding the gradient; a numerical gradient check (sketched after this list) catches most of these.
- Exploding/vanishing gradients — Common with deep sigmoids. ReLU or tanh activations can help.
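A numerical gradient check catches most of these problems: nudge one parameter, recompute the loss, and compare the finite-difference slope with the analytic gradient. Here is a minimal sketch, reusing the `forward_backward` function from the earlier snippet:

```python
def loss_only(x, t, W1, b1, W2, b2):
    """Convenience wrapper: return only the scalar loss."""
    return forward_backward(x, t, W1, b1, W2, b2)[0]

def check_gradient(x, t, W1, b1, W2, b2, eps=1e-6):
    """Compare analytic dL/dW1[0,0] to a central finite difference."""
    _, dW1, *_ = forward_backward(x, t, W1, b1, W2, b2)

    W1_plus, W1_minus = W1.copy(), W1.copy()
    W1_plus[0, 0]  += eps
    W1_minus[0, 0] -= eps
    numeric = (loss_only(x, t, W1_plus, b1, W2, b2)
               - loss_only(x, t, W1_minus, b1, W2, b2)) / (2 * eps)
    print(f"analytic={dW1[0, 0]:.7f}  numeric={numeric:.7f}")

check_gradient(x, t, W1, b1, W2, b2)   # both should print about -0.0088117
```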
15. The Big Picture
From a high-level algebraic perspective:
- Each layer applies an affine transformation $Wx + b$.
- Each activation applies a nonlinear scalar function $f$.
- Each loss measures distance between output and target.
- Backpropagation is just systematically applying the chain rule through these compositions.
So, conceptually:
$$ L = \ell(f_2(W_2 f_1(W_1 x + b_1) + b_2), t) $$
and each component of $\nabla_\Theta L$ is a product of local derivatives along the path from the loss back to that parameter, for example:
$$ \frac{\partial L}{\partial W_1} = \frac{\partial \ell}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial W_1} $$
Everything else — whether for deep networks or transformers — builds on this same principle.
16. Final Recap and Takeaways
- We defined the dependencies explicitly: $x \to z^{(1)} \to a^{(1)} \to z^{(2)} \to y \to L$.
- We derived the mathematical objective $J(\Theta) = \frac{1}{2}(y - t)^2$.
- We computed each derivative via the chain rule step by step.
- We ran one full iteration, confirmed weight updates, and showed that the loss decreased.
- We generalized the pattern to any number of layers and to batch processing.
If you’ve followed this walkthrough, you’ve effectively implemented the core of backpropagation — by hand, line by line. That’s the foundation of every modern deep learning library, from TensorFlow to PyTorch.
Written for learners who want to see every symbol move, every dependency formalized, and every derivative unfold — one layer at a time.
