Neural networks are the engines powering the modern world, from the algorithms that recommend your next movie to the complex models that can diagnose diseases. They are often portrayed as mysterious “black boxes” or artificial brains, but at their core, they are elegant mathematical structures built upon a handful of fundamental principles from algebra, statistics, and calculus.
This article is a journey from those first principles to a functioning neural network. We will strip away the hype and the jargon to reveal the surprisingly simple mechanics that allow these systems to “learn.” We will start with a single digital “neuron,” equip ourselves with the necessary mathematical tools, assemble those neurons into a network, and finally, uncover the secret to how that network can learn from its mistakes using an algorithm called backpropagation.
Our goal is not to become mathematicians overnight, but to build a strong, intuitive understanding of the concepts that make this revolutionary technology possible.
Chapter 1: The Single Neuron – A Simple Decision-Maker
Before we can build a brain, we need to understand the neuron. A biological neuron receives signals through its dendrites, and if the combined signal is strong enough, it fires, sending a signal down its axon. An artificial neuron, or perceptron, is a mathematical model of this process.
Think of a single neuron as a simple decision-maker. It takes in several pieces of evidence, weighs their importance, and makes a call. This process has three steps:
- Receiving Inputs: The neuron receives one or more input values ($x_1, x_2, …$). These are the pieces of evidence.
- Calculating a Weighted Sum: Each input is multiplied by a weight ($w_1, w_2, …$), which represents its importance. A higher weight means the neuron considers that input more significant. All these weighted inputs are summed together, and a bias ($b$) is added. The bias acts as an offset, allowing the neuron to be more or less likely to fire, independent of its inputs.
- Applying an Activation Function: The final sum is passed through an activation function. This function squashes the result into a standardized output, often a “yes” or “no” (1 or 0) or a probability (a value between 0 and 1). It’s the final step that determines whether the neuron “fires.”
This simple model is the bedrock upon which all neural networks are built. But to truly understand it, we need to speak its native language: mathematics.
Chapter 2: The Language – Core Concepts from Algebra
At its heart, a neural network is an algebraic structure. Understanding a few key concepts from linear algebra is essential, as they provide the language to describe and compute the operations of a network efficiently.
The Foundation: Linear Equations
You may remember the equation of a line from school: $y = mx + b$. This simple formula is the DNA of a neural neuron’s calculation. The weighted sum a neuron computes is a direct extension of this.
For a neuron with just one input ($x_1$) and one weight ($w_1$), the sum is: $$ z = w_1x_1 + b $$ This is exactly the same form as $y = mx + b$. The weight $w_1$ is the slope ($m$), and the bias $b$ is the y-intercept. This means a single neuron is, at its core, a simple linear model.
Scaling Up: Vectors and the Dot Product
What happens when we have multiple inputs? For a neuron with three inputs ($x_1, x_2, x_3$), the weighted sum becomes: $$ z = (w_1x_1) + (w_2x_2) + (w_3x_3) + b $$ Writing this out can become cumbersome. This is where vectors come in. A vector is simply an ordered list of numbers. We can represent our inputs and weights as vectors:
- Input vector: $\vec{x} = [x_1, x_2, x_3]$
- Weight vector: $\vec{w} = [w_1, w_2, w_3]$
With vectors, we can express the weighted sum compactly using the dot product, which is the sum of the element-wise products of two vectors.
$$ z = \vec{w} \cdot \vec{x} + b $$ This is a much cleaner way to represent the same calculation and is fundamental to how these operations are performed in software.
Building Layers: Matrices
A single neuron is limited. The power of neural networks comes from connecting many neurons together in layers. Imagine we have a layer with three neurons, each receiving the same three inputs. Each neuron will have its own unique set of weights and its own bias.
- Neuron 1 uses weights $\vec{w_1} = [w_{11}, w_{12}, w_{13}]$ and bias $b_1$.
- Neuron 2 uses weights $\vec{w_2} = [w_{21}, w_{22}, w_{23}]$ and bias $b_2$.
- Neuron 3 uses weights $\vec{w_3} = [w_{31}, w_{32}, w_{33}]$ and bias $b_3$.
We can organize all these weight vectors into a single matrix, which is a grid of numbers. Each row of the matrix corresponds to the weight vector of one neuron.
$$ W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \ w_{21} & w_{22} & w_{23} \ w_{31} & w_{32} & w_{33} \end{bmatrix} $$
Now, the entire calculation for a full layer of neurons can be expressed in a single, elegant equation using matrix-vector multiplication:
$$ \vec{z} = W \vec{x} + \vec{b} $$ Here, $\vec{z}$ is a vector containing the output sum for each neuron, and $\vec{b}$ is a vector of the biases. This single line of code can perform millions of calculations at once on modern hardware like GPUs. Linear algebra is the language that makes deep learning computationally feasible.
Chapter 3: The Logic – Core Concepts from Statistics
If algebra gives a neural network its structure, statistics gives it its “mind.” Statistical concepts allow the network to make decisions, quantify its certainty, and, most importantly, measure its own mistakes.
Activation Functions: Introducing Non-Linearity
As we saw, the weighted sum is a linear calculation. If we only stacked linear calculations on top of each other, the entire network would just be one big linear function, which is not powerful enough to learn complex patterns in the real world (like the difference between a cat and a dog).
Activation functions introduce non-linearity, allowing the network to learn far more complex relationships. They take the linear sum $z$ and transform it into the neuron’s final output.
- Sigmoid: This function squashes any input value into a range between 0 and 1. It’s often used in the final layer of a network for binary classification, as its output can be interpreted as a probability. $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
- ReLU (Rectified Linear Unit): This is the most common activation function in modern neural networks. It’s incredibly simple: if the input is positive, it returns the input; otherwise, it returns zero. It’s computationally efficient and helps mitigate certain problems during training. $$ \text{ReLU}(z) = \max(0, z) $$
Loss Functions: Quantifying Error
To learn, a network needs to know how wrong its predictions are. A loss function (or cost function) is a formula that calculates a single number representing the “error” of the network’s prediction compared to the true, known answer. The goal of training is to adjust the network’s weights and biases to minimize this loss value.
- Mean Squared Error (MSE): This is commonly used for regression tasks, where the goal is to predict a continuous number (e.g., the price of a house). It calculates the average of the squared differences between the true values ($y$) and the predicted values ($\hat{y}$). $$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$ We square the error so that positive and negative errors don’t cancel each other out, and to penalize larger errors more heavily.
Chapter 4: The Forward Pass – From Input to Prediction
Let’s put all these concepts together and walk a piece of data through a simple neural network. This process is called the forward pass.
Our Task: We will train a network to classify a fruit as either an Apple (output 0) or an Orange (output 1) based on two features: color (0 for red, 1 for orange) and roundness (0 for not round, 1 for very round).
Our Network: A simple network with 2 input neurons, a hidden layer of 2 neurons (using ReLU activation), and 1 output neuron (using Sigmoid activation).
Initial State: When a network is created, its weights and biases are initialized with small random numbers.
- Weights from input to hidden layer: $W_1 = \begin{bmatrix} 0.5 & 0.2 \ 0.3 & 0.8 \end{bmatrix}$
- Biases for hidden layer: $\vec{b_1} = [0.1, 0.1]$
- Weights from hidden to output layer: $W_2 = [0.4, 0.7]$
- Bias for output layer: $b_2 = 0.2$
Our Input: We see a fruit that is very orange (1) and very round (1).
- Input vector: $\vec{x} = [1, 1]$
Step 1: Calculate Hidden Layer Sums Step 1: Calculate Hidden Layer Sums We use our matrix multiplication formula: $\vec{z_1} = W_1 \vec{x} + \vec{b_1}$
First, we multiply the weight matrix by the input vector: $$ W_1 \vec{x} = \begin{bmatrix} 0.5 & 0.2 \ 0.3 & 0.8 \end{bmatrix} \begin{bmatrix} 1 \ 1 \end{bmatrix} = \begin{bmatrix} (0.5 \cdot 1) + (0.2 \cdot 1) \ (0.3 \cdot 1) + (0.8 \cdot 1) \end{bmatrix} = \begin{bmatrix} 0.7 \ 1.1 \end{bmatrix} $$
Then we add the bias vector: $$ \vec{z_1} = \begin{bmatrix} 0.7 \ 1.1 \end{bmatrix} + \begin{bmatrix} 0.1 \ 0.1 \end{bmatrix} = \begin{bmatrix} 0.8 \ 1.2 \end{bmatrix} $$
Step 2: Apply Hidden Layer Activation (ReLU) We apply the ReLU function to each element of our sum vector $\vec{z_1}$ to get our hidden layer’s output, $\vec{h_1}$. $$ \vec{h_1} = [\max(0, 0.8), \max(0, 1.2)] = [0.8, 1.2] $$
Step 3: Calculate Output Layer Sum Now, the output of the hidden layer, $\vec{h_1}$, becomes the input to the output layer. $$ z_2 = W_2 \cdot \vec{h_1} + b_2 = [0.4, 0.7] \cdot [0.8, 1.2] + 0.2 $$ $$ z_2 = (0.4 \cdot 0.8) + (0.7 \cdot 1.2) + 0.2 = 0.32 + 0.84 + 0.2 = 1.36 $$
Step 4: Apply Output Layer Activation (Sigmoid) We apply the Sigmoid function to get our final prediction, $\hat{y}$. $$ \hat{y} = \sigma(1.36) = \frac{1}{1 + e^{-1.36}} \approx \frac{1}{1 + 0.256} \approx 0.796 $$
The Verdict: Our network predicts 0.796. Since we defined Orange as 1, the network is about 80% confident that this fruit is an orange. Given that the true answer is indeed Orange (1), its initial random guess isn’t bad, but it’s not perfect. The loss is non-zero. Now, how do we make it better?
Chapter 5: The Learning Algorithm – Backpropagation Explained
The network made a prediction and it was slightly off. The process of learning is to take this error and use it to make tiny adjustments to every single weight and bias in the network, nudging them in a direction that would make the error smaller. This process is called backpropagation, and it’s powered by a concept from calculus called the gradient.
The Intuition: Hiking Down a Mountain
Imagine the loss function is a vast, hilly landscape, and the current state of our network’s weights puts us somewhere on that landscape. Our goal is to find the lowest point—the valley where the error is at a minimum.
To get down the mountain, you need to know which direction is downhill. In calculus, the derivative (or gradient for multiple variables) gives you the slope of the landscape at your current position. The gradient of the loss function with respect to a weight ($\frac{\partial L}{\partial w}$) tells us how the loss will change if we make a tiny change to that weight. It points in the direction of the steepest ascent. To go downhill, we just take a small step in the opposite direction of the gradient. This process is called Gradient Descent.
The core update rule is: $$ w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w_{\text{old}}} $$ Where $\eta$ (eta) is the learning rate—a small number that controls how big of a step we take.
The Mechanism: The Chain Rule
The challenge is that weights deep inside the network (like $W_1$) don’t affect the final loss directly. They affect the hidden layer, which affects the output, which then affects the loss. Backpropagation uses a fundamental concept from calculus called the chain rule to calculate these indirect effects.
The chain rule allows us to calculate the gradient of a nested function. It lets us “propagate” the error from the end of the network all the way back to the beginning, calculating the gradient for each weight along the way.
Backpropagation in Action (Simplified)
Let’s continue our fruit example. Our network predicted $\hat{y} = 0.796$ and the true answer was $y = 1$.
Calculate the Error: The initial error is the difference between the true value and the prediction.
Calculate Gradients for the Output Layer ($W_2$): Backpropagation first calculates how much the output layer’s weights contributed to this error. This is a direct calculation, as these weights are closest to the final loss.
Propagate Error to the Hidden Layer: The error is then passed backward to the hidden layer. Each neuron in the hidden layer receives a portion of the “blame” for the final error, proportional to the strength of its connection (its weight) to the output neuron.
Calculate Gradients for the Hidden Layer ($W_1$): Now, using the propagated error, the algorithm calculates the gradients for the weights in the first layer ($W_1$). It determines how much each of these initial weights contributed to the error that was passed back to it.
Update All Weights: Finally, the network uses the gradient descent rule to update all its weights and biases, taking a small step “downhill.” $$ W_{1, \text{new}} = W_{1, \text{old}} - \eta \frac{\partial L}{\partial W_{1, \text{old}}} $$ $$ W_{2, \text{new}} = W_{2, \text{old}} - \eta \frac{\partial L}{\partial W_{2, \text{old}}} $$ And so on for all biases.
After this single step of backpropagation, if we were to run the same input ($\vec{x} = [1, 1]$) through the network again, its prediction would be slightly closer to the correct answer of 1.
Conclusion: From Theory to a Learning Machine
This entire process—a forward pass followed by a backward pass (backpropagation)—is repeated thousands or millions of times with many different examples (an epoch is one full pass through the entire training dataset). Each time, the weights are nudged in a direction that makes the network slightly better at its task.
We have journeyed from the simplest algebraic equation to a complete, albeit small, learning machine. We’ve seen that a neural network is not magic; it’s a powerful combination of fundamental mathematical ideas:
- Linear Algebra provides the structure and computational efficiency to build networks of neurons.
- Statistics provides the logic for decision-making (activation functions) and for measuring error (loss functions).
- Calculus provides the learning algorithm (gradient descent and backpropagation) that allows the network to improve.
By understanding these core foundations, we move from seeing a neural network as a black box to appreciating it as an elegant and interpretable system, capable of finding patterns in data and learning from its own mistakes.
