Tutorial on how to build a Neural Network

If you’re just getting started in the world of Artificial Intelligence (AI), Python is a great language to learn because most tools are built utilizing it. Deep Learning is a data-driven prediction technique that heavily relies on Neural Networks. Today, you’ll learn how to create a neural network from scratch.

Instead of building your own neural network, you would use a Deep Learning framework such as TensorFlow or PyTorch in a production setting. However, understanding how Neural Networks work is beneficial because you can use it to better architect your Deep Learning models.

In this tutorial, you will learn:

  1. What exactly is Artificial Intelligence?
  2. How Machine Learning and Deep Learning contribute to AI
  3. Internal workings of a Neural Network
  4. How to Create a Neural Network in Python from Scratch

Overview of Artificial Intelligence

The goal of using AI is to teach computers to think like humans. This may appear to be a new field, but it dates back to the 1950s.

Assume you need to create a Python programme that uses AI to solve a sudoku puzzle. Writing conditional statements and checking the constraints to see if you can place a number in each position is one way to accomplish this. Because you programmed a computer to solve a problem, this Python script is already an application of AI!
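The rule-based approach can be sketched as a constraint check. This is only an illustration, assuming the board is a 9×9 list of lists with 0 marking an empty cell. The function name can_place and the sample board are hypothetical.

```python
def can_place(board, row, col, number):
    """Return True if `number` can go at (row, col) without breaking sudoku rules."""
    # The number must not already appear in the same row.
    if number in board[row]:
        return False
    # The number must not already appear in the same column.
    if any(board[r][col] == number for r in range(9)):
        return False
    # The number must not already appear in the same 3x3 box.
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    for r in range(box_row, box_row + 3):
        for c in range(box_col, box_col + 3):
            if board[r][c] == number:
                return False
    return True


# Hypothetical, mostly empty board: 0 marks an empty cell.
board = [[0] * 9 for _ in range(9)]
board[0][0] = 5

print(can_place(board, 0, 8, 5))  # same row as the existing 5
print(can_place(board, 4, 4, 5))  # no conflict with the existing 5
```

A full solver would call a check like this inside a search loop. The point here is only that every rule is written by hand.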

Machine learning (ML) and Deep Learning (DL) are two other approaches to problem solving. The only difference between these techniques and a Python script is that ML and DL use training data rather than hard-coded rules, but they can all be used to solve problems using AI. You’ll learn more about what distinguishes these two techniques in the following sections.

Machine Learning

Machine learning is a technique that involves training a system to solve a problem rather than explicitly programming the rules. To return to the sudoku example from earlier, in order to solve the problem using machine learning, you would collect data from solved sudoku games and train a statistical model. Statistical models are mathematically formalised methods of approximating a phenomenon’s behaviour.

Supervised learning is a common Machine Learning task in which you have a dataset with known inputs and known outputs. The task is to train a model that predicts the correct outputs based on the inputs using this dataset. The workflow for training a model using supervised learning is depicted in the image below:

The model is created by combining the training data with the Machine Learning algorithm. The model can then be used to make predictions for new data.

Predictions for new, unseen data are the goal of supervised learning tasks. To do so, you assume that the unseen data has a probability distribution similar to the training dataset’s distribution. If this distribution changes in the future, you must retrain your model with the new training dataset.

Feature Engineering

When different types of data are used as inputs, prediction problems become more difficult. Because you’re dealing with numbers, the sudoku puzzle is relatively simple. What if you want to train a model to predict a sentence’s sentiment? What if you have an image and want to know if it shows a cat?

Input data are also known as features, and feature engineering is the process of extracting features from raw data. When working with various types of data, you must figure out how to represent it in order to extract meaningful information from it.

Lemmatization is an example of a feature engineering technique in which you remove the inflection from words in a sentence. Inflected forms of the verb “watch,” such as “watches,” “watching,” and “watched,” would, for example, be reduced to their lemma, or base form: “watch.”

If you use arrays to store each word in a corpus, applying lemmatization results in a less-sparse matrix. This has the potential to improve the performance of some machine learning algorithms. The image below depicts the lemmatization and representation process using a bag-of-words model:

First, each word’s inflected form is reduced to its lemma. The number of occurrences of that word is then calculated. The result is an array containing the number of times each word appears in the text.
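The two steps can be sketched in a few lines of Python. This is a simplified illustration: the LEMMAS lookup table is a hypothetical stand-in for a real lemmatizer, and the sample sentence is made up.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"watches": "watch", "watching": "watch", "watched": "watch"}


def bag_of_words(text):
    """Lowercase, split on whitespace, map each word to its lemma, and count."""
    words = text.lower().split()
    lemmas = [LEMMAS.get(word, word) for word in words]
    return Counter(lemmas)


counts = bag_of_words("She watches what he watched while watching")

# A fixed word order turns the counts into an array-like representation.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Without the lemma step, "watches", "watched", and "watching" would each take a separate slot in the vector, which is what makes the representation sparser.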

Deep Learning

Instead of using feature engineering techniques, Deep Learning allows the neural network to figure out which features are important on its own. This means that deep learning can avoid the feature engineering process.

It is preferable to avoid feature engineering because the process becomes more difficult as datasets become more complex. For instance, how would you extract data to predict a person’s mood based on a picture of her face? You don’t have to worry about it with Neural Networks because the networks can learn the features on their own. In the following sections, you’ll learn all about Neural Networks and how they work.

Main Concepts of Neural Networks

A Neural network is a system that learns to make predictions by performing the following steps:

  1. Considering the input data
  2. Making a forecast
  3. Taking the prediction and comparing it to the desired output
  4. Changing its internal state in order to predict correctly the next time

Neural networks’ building blocks include vectors, layers, and linear regression. The data is stored as vectors, which are then stored in arrays in Python. The data from the previous layer is transformed by each layer. Each layer can be thought of as a feature engineering step, because each layer extracts some representation of the data that came before it.

The fact that Neural Network layers can extract information from any type of data is fascinating. This means that it makes no difference whether you’re using image data or text data. For both scenarios, the process of extracting meaningful information and training the Deep Learning model is the same.

An example of a network architecture with two layers is shown in the image below:

Each layer applies mathematical operations to the data that came from the previous layer.

The Training of a Neural Network

The process of training a Neural Network is similar to trial and error. Assume you’re trying out darts for the first time. You try to hit the centre of the dartboard with your first throw. The first shot is usually just to get a sense of how the height and speed of your hand affect the outcome. If you notice that the dart is higher than the central point, you adjust your hand to throw it lower, and so on.

The following are the steps for attempting to hit the centre of a dartboard:

Take note of how you continue to assess the error by observing where the dart landed (step 2). You continue until you reach the centre of the dartboard.

The process is very similar with Neural Networks: you start with some random weights and bias vectors, make a prediction, compare it to the desired output, and then adjust the vectors to predict more accurately the next time. The process is repeated until the difference between the prediction and the correct targets is as small as possible.

Knowing when to stop training and what accuracy target to set is critical in Neural Network training, owing to overfitting and underfitting scenarios.

Vectors and Weights

Working with Neural Networks entails performing vector operations. The vectors are represented as multidimensional arrays. Vectors are useful in Deep Learning because of one operation in particular: the dot product. The dot product of two vectors indicates their similarity in direction and is scaled by the magnitude of the two vectors.

The weights and bias vectors are the primary vectors in a Neural Network. In general, you want your Neural Network to check if an input is similar to other inputs it has seen before. If the new input is similar to previous inputs, the outputs will be similar as well. That’s how you get a prediction’s outcome.

The Linear Regression Model

When you need to estimate the relationship between a dependent variable and two or more independent variables, you use regression. Linear regression is a method for approximating the relationship between variables as linear. The method, which dates back to the nineteenth century, is the most widely used regression method.

You can express the dependent variable as a weighted sum of the independent variables by modelling the relationship between the variables as linear. As a result, each independent variable will be multiplied by a vector known as weight. Aside from the weights and independent variables, you also include a bias vector. When all of the other independent variables are equal to zero, it determines the outcome.

As an example of how to build a linear regression model in practice, suppose you want to train a model to predict the price of houses based on their area and age. You decide to use linear regression to model this relationship. The following code block demonstrates how to write a linear regression model in pseudocode for the stated problem:
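A minimal sketch of such a model, written as runnable Python rather than pseudocode. The parameter values are made up for illustration and predict_price is a hypothetical helper name. Only weights_area, weights_age, and the bias come from the text.

```python
# Hypothetical parameter values; in practice these are learned during training.
weights_area = 1500.0   # price change per unit of area
weights_age = -800.0    # price change per year of age
bias = 50000.0          # predicted price when area and age are both zero


def predict_price(area, age):
    """Weighted sum of the independent variables plus the bias."""
    return (weights_area * area) + (weights_age * age) + bias


price = predict_price(area=100.0, age=10.0)
print(price)
```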

There are two weights in the preceding example: weights_area and weights_age. The training process involves adjusting the weights and bias so that the model can predict the correct price value. To do so, you must compute the prediction error and update the weights accordingly.

These are the fundamentals of how the Neural Network mechanism works. It’s now time to put these ideas into practice with Python.

Python AI: Creating Your First Neural Network

Creating an output from input data is the first step in building a Neural Network. You’ll accomplish this by calculating a weighted sum of the variables. The first step is to use Python and NumPy to represent the inputs.

Wrapping the Neural Network’s Inputs With NumPy

NumPy will be used to represent the network’s input vectors as arrays. However, before using NumPy, it’s a good idea to experiment with vectors in pure Python to better understand what’s going on.

You have an input vector and two weight vectors in this first example. The goal is to determine which of the weights is closest to the input in terms of direction and magnitude. When the vectors are plotted, they look like this:

Weights_2 is more similar to the input vector because it points in the same direction and has a similar magnitude. So, how do you use Python to determine which vectors are similar?

First, you create three vectors: one for the input and two for the weights. Then you compute the similarity of input_vector and weights_1. You will use the dot product to accomplish this. Because all of the vectors are two-dimensional, the steps are as follows:

  1. Multiply the first index of input_vector by the first index of weights_1.
  2. Multiply the second index of input_vector by the second index of weights_1.
  3. Sum the results of both multiplications.

To follow along, use an IPython console or a Jupyter Notebook. It’s best practice to create a new virtual environment whenever you start a new Python project, so do that first. The venv module is included with Python versions 3.3 and higher and is useful for creating a virtual environment:

$ python -m venv ~/.my-env
$ source ~/.my-env/bin/activate

The virtual environment is created and then activated using the commands listed above. It’s now time to use pip to install the IPython console. Because you’ll also need NumPy and Matplotlib, you should install them as well:

(my-env) $ python -m pip install ipython numpy matplotlib
(my-env) $ ipython

You are now ready to begin coding. The following code computes the dot product of input_vector and weights_1:

In [1]: input_vector = [1.72, 1.23]
In [2]: weights_1 = [1.26, 0]
In [3]: weights_2 = [2.17, 0.32]

In [4]: # Computing the dot product of input_vector and weights_1
In [5]: first_indexes_mult = input_vector[0] * weights_1[0]
In [6]: second_indexes_mult = input_vector[1] * weights_1[1]
In [7]: dot_product_1 = first_indexes_mult + second_indexes_mult

In [8]: print(f"The dot product is: {dot_product_1}")
Out[8]: The dot product is: 2.1672

2.1672 is the result of the dot product. Now that you know how to compute the dot product, you can use NumPy’s np.dot(). Here’s how to compute dot_product_1 using np.dot():

In [9]: import numpy as np

In [10]: dot_product_1 = np.dot(input_vector, weights_1)

In [11]: print(f"The dot product is: {dot_product_1}")
Out[11]: The dot product is: 2.1672

np.dot() performs the same computation as before, but now you only need to pass the two arrays as arguments. Let us now calculate the dot product of input_vector and weights_2:

In [10]: dot_product_2 = np.dot(input_vector, weights_2)

In [11]: print(f"The dot product is: {dot_product_2}")
Out[11]: The dot product is: 4.1259

The answer this time is 4.1259. You can think of the similarity between the vector coordinates as an on-off switch when considering the dot product. If the multiplication result is zero, the coordinates are said to be unrelated. If the result is greater than zero, you can say they are similar.

You can think of the dot product as a loose measurement of vector similarity. When the multiplication result is 0, the final dot product will be lower. Returning to the example vectors, because the dot product of input_vector and weights_2 is 4.1259, and 4.1259 is greater than 2.1672, input_vector is more similar to weights_2. This same mechanism will be used in your neural network.

In this tutorial, you will train a model to predict outcomes with only two possible outcomes. The outcome can be either 0 or 1. This is a classification problem, a subset of supervised learning problems in which you have inputs and known targets in a dataset. The dataset’s inputs and outputs are as follows:

Input Vector    Target
[1.66, 1.56]    1
[2, 1.5]        0


The variable you want to predict is the target. In this example, you’re working with a dataset of numbers. This is unusual in a real-world production scenario. When a Deep Learning model is required, the data is typically presented in files such as images or text.

Making Your First Prediction

Because this is your first neural network, you’ll keep things simple and construct a network with only two layers. So far, you’ve seen that the neural network’s only two operations were the dot product and the sum. They are both linear operations.

If you continue to use only linear operations, adding more layers has no effect because each layer will always have some correlation with the input of the previous layer. This implies that for every network with multiple layers, there is always a network with fewer layers that predicts the same results.
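This collapse can be checked numerically. The sketch below assumes two bias-free linear layers with arbitrary random weights. The names weights_a, weights_b, and combined_weights are hypothetical and not part of the network built in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two purely linear layers with arbitrary (hypothetical) weights.
weights_a = rng.normal(size=(3, 2))   # maps 2 inputs to 3 hidden values
weights_b = rng.normal(size=(1, 3))   # maps 3 hidden values to 1 output
input_vector = np.array([1.66, 1.56])

# Passing the input through both layers in sequence.
two_layer_output = weights_b @ (weights_a @ input_vector)

# A single layer whose weights are the product of the two matrices.
combined_weights = weights_b @ weights_a
one_layer_output = combined_weights @ input_vector

print(np.allclose(two_layer_output, one_layer_output))
```

Because matrix multiplication is associative, the two outputs agree up to floating-point rounding, so the extra linear layer adds nothing.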

What you’re looking for is an operation that causes the middle layers to sometimes correlate with an input and sometimes not.

Nonlinear functions can be used to achieve this behaviour. These nonlinear functions are referred to as activation functions, and there are numerous kinds. For example, the ReLU (rectified linear unit) function converts all negative numbers to zero. This means that whenever a weighted sum is negative, the network can “turn off” that value, adding nonlinearity.
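As a small illustration, ReLU can be written with NumPy’s np.maximum(). The function name relu and the sample values are chosen here for the example only.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative values become 0, others pass through."""
    return np.maximum(0, x)


values = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(values))
```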

The sigmoid activation function will be used in the network you’re creating. It will be used in the last layer, layer_2. The dataset has only two possible outputs, 0 and 1, and the sigmoid function limits the output to a range between 0 and 1. The following is the formula for expressing the sigmoid function:

S(x) = 1 / (1 + e⁻ˣ)

Here e is a mathematical constant known as Euler’s number, and eˣ can be calculated using np.exp(x).

Probability functions calculate the likelihood of occurrence for various event outcomes. The dataset has only two possible outcomes: 0 and 1, and the Bernoulli distribution has two possible outcomes as well. If your problem follows the Bernoulli distribution, the sigmoid function is a good choice, which is why you’re using it in the final layer of your neural network.

Because the function’s output is limited to a range of 0 to 1, you’ll use it to predict probabilities. If the output is greater than 0.5, the prediction is one. If it is less than 0.5, the prediction is zero. This is the computation flow inside the network you’re building:

The functions are represented by yellow hexagons, and the intermediate results are represented by blue rectangles. It’s now time to put all of this knowledge into code. You must also wrap the vectors in NumPy arrays. This is the code that performs the functions shown in the image above:

In [12]: # Wrapping the vectors in NumPy arrays
In [13]: input_vector = np.array([1.66, 1.56])
In [14]: weights_1 = np.array([1.45, -0.66])
In [15]: bias = np.array([0.0])

In [16]: def sigmoid(x):
   ...:     return 1 / (1 + np.exp(-x))

In [17]: def make_prediction(input_vector, weights, bias):
   ...:      layer_1 = np.dot(input_vector, weights) + bias
   ...:      layer_2 = sigmoid(layer_1)
   ...:      return layer_2

In [18]: prediction = make_prediction(input_vector, weights_1, bias)

In [19]: print(f"The prediction result is: {prediction}")
Out[19]: The prediction result is: [0.7985731]

The raw prediction result is 0.79, which is greater than 0.5, so the output is 1. The network predicted correctly. Now try it with another input vector, np.array([2, 1.5]). The correct answer for this input is 0. You only need to change the input_vector variable because the other parameters remain unchanged:

In [20]: # Changing the value of input_vector
In [21]: input_vector = np.array([2, 1.5])

In [22]: prediction = make_prediction(input_vector, weights_1, bias)

In [23]: print(f"The prediction result is: {prediction}")
Out[23]: The prediction result is: [0.87101915]

This time, the network predicted incorrectly. Because the target for this input is 0, the result should be less than 0.5, but the raw result was 0.87. It made an incorrect guess, but how serious was the error? The next step is to figure out how to evaluate that.

Develop Your First Neural Network

During the training process, you first assess the error and then adjust the weights accordingly. The gradient descent and backpropagation algorithms will be used to adjust the weights. Gradient descent is used to determine the direction and rate at which the parameters should be updated.

You must compute the error before making any changes to the network. That is what you will do in the following section.

Prediction Error Calculation

To comprehend the magnitude of the error, you must first select a method of measurement. The cost function, also known as the loss function, is the function used to calculate the error. The mean squared error (MSE) will be used as your cost function in this tutorial. The MSE is calculated in two steps:

  1. Calculate the difference between the prediction and the target.
  2. Multiply the result by itself.

The network may make an error by returning a value that is either higher or lower than the correct value. Because the MSE is the squared difference between the prediction and the correct result, you’ll always get a positive value with this metric.

This is the full expression for calculating the error for the previous prediction:

In [24]: target = 0

In [25]: mse = np.square(prediction - target)

In [26]: print(f"Prediction: {prediction}; Error: {mse}")
Out[26]: Prediction: [0.87101915]; Error: [0.7586743596667225]

In the preceding example, the error is 0.75. One implication of multiplying the difference by itself is that larger errors have an even greater impact, and smaller errors continue to shrink as they decrease.

Understanding Error Reduction Techniques

The goal is to change the weights and bias variables so that the error is reduced. To understand how this works, you’ll only change the weights variable and leave the bias alone for the time being. You can also remove the sigmoid function and only use the layer_1 result. All that remains is to figure out how to change the weights so that the error decreases.

The MSE is calculated as error = np.square(prediction - target). If you consider (prediction - target) to be a single variable x, then error = np.square(x) is a quadratic function. If you plot the function, it looks like this:

The y-axis represents the error. If you are at point A and want to reduce the error to zero, you must decrease the x value. If, on the other hand, you are at point B and want to reduce the error, you must increase the x value. The derivative will tell you which direction to take to reduce the error. A derivative describes how a function’s output changes as its input changes.

The derivative is also known as the gradient. Gradient descent is the name of the algorithm used to determine the direction and rate at which network parameters are updated.

You will not focus on the theory of derivatives in this tutorial; instead, you will simply apply the derivative rules to each function you encounter. According to the power rule, the derivative of xⁿ is nx⁽ⁿ⁻¹⁾. So np.square(x) has a derivative of 2 * x, and x has a derivative of 1.
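The power-rule result can be sanity-checked numerically. The sketch below compares 2 * x with a central-difference estimate. The helper numerical_derivative and the sample point 0.87 are assumptions made for this illustration.

```python
import numpy as np


def numerical_derivative(f, x, step=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + step) - f(x - step)) / (2 * step)


x = 0.87
estimate = numerical_derivative(np.square, x)
power_rule = 2 * x

# The two values should agree up to a small numerical error.
print(np.isclose(estimate, power_rule))
```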

Remember that error = np.square(prediction - target) is the error expression. When (prediction - target) is treated as a single variable x, the error derivative is 2 * x. By taking the derivative of this function, you want to know which way to change x to bring the error result to zero, thereby reducing the error.

When it comes to neural networks, the derivative will tell you which way to update the weights variable. If it’s a positive number, you predicted too high and should reduce the weights. If it’s a negative number, you predicted too low and should increase the weights.

It’s now time to write the code that will figure out how to update weights_1 for the previous incorrect prediction. Should you increase or decrease the weights if the mean squared error is 0.75? Because the derivative is 2 * x, the difference between the prediction and the target is simply multiplied by 2:

In [27]: derivative = 2 * (prediction - target)

In [28]: print(f"The derivative is: {derivative}")
Out[28]: The derivative is: [1.7420383]

Because the result is 1.74, a positive number, you must reduce the weights. You do this by subtracting the derivative result from the weights vector. You can now adjust weights_1 and predict again to see how it affects the prediction result:

In [29]: # Updating the weights
In [30]: weights_1 = weights_1 - derivative

In [31]: prediction = make_prediction(input_vector, weights_1, bias)

In [32]: error = (prediction - target) ** 2

In [33]: print(f"Prediction: {prediction}; Error: {error}")
Out[33]: Prediction: [0.01496248]; Error: [0.00022388]

The error has nearly disappeared! The derivative result was small in this example, but there are some cases where the derivative result is too high. As an example, consider the quadratic function image. High increments aren’t ideal because you could go from point A to point B indefinitely, never getting close to zero. To account for this, you update the weights by a fraction of the derivative result.

To define a fraction for updating the weights, you use the alpha parameter, also called the learning rate. When the learning rate is reduced, the increments become smaller. When you increase it, the steps become larger. How do you determine the best learning rate value? By making an educated guess and experimenting with it.
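The effect of the learning rate can be seen on the quadratic error alone. The sketch below minimises error = x ** 2 directly, without a network. The function descend, the starting point, and the two learning rates are chosen only for illustration.

```python
def descend(x, learning_rate, steps):
    """Repeatedly step x against the derivative of error = x ** 2."""
    for _ in range(steps):
        derivative = 2 * x
        x = x - learning_rate * derivative
    return x


start = 1.0

# A full-size step overshoots: x flips between 1.0 and -1.0 and never settles.
print(descend(start, learning_rate=1.0, steps=5))

# A fractional step shrinks x toward 0 a little on every iteration.
print(descend(start, learning_rate=0.1, steps=5))
```

With a learning rate of 1.0 each update lands on the mirror-image point, which is the A-to-B jumping described above. With 0.1 each update keeps 80% of the previous value.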

If you use the new weights to make a prediction with the first input vector, you’ll notice that it now makes an incorrect prediction for that one. If your neural network correctly predicts every instance in your training set, you probably have an overfitted model, which simply remembers how to classify the examples rather than learning to notice features in the data.
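The claim about the first input vector can be checked directly. This is a minimal sketch that rebuilds the updated weights from the starting weights [1.45, -0.66] and the printed derivative 1.7420383. The names updated_weights and first_input are introduced only for this illustration.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Weights after the single update: the starting weights minus the derivative.
updated_weights = np.array([1.45, -0.66]) - 1.7420383
bias = np.array([0.0])

first_input = np.array([1.66, 1.56])   # the target for this input is 1
prediction = sigmoid(np.dot(first_input, updated_weights) + bias)

# A value below 0.5 means the network now predicts 0 for this input.
print(prediction < 0.5)
```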

Now that you understand how to compute the error and adjust the weights accordingly, it’s time to get back to work on your neural network.

Using the Chain Rule

You must update both the weights and the bias vectors in your neural network. The function you’re using to calculate error is determined by two independent variables: weights and bias. Because the weights and bias are independent variables, you can change and adjust them to achieve the desired result.

You’re constructing a network with two layers, and because each layer has its own set of functions, you’re dealing with a function composition. This means that the error function remains np.square(x), but x is now the result of a different function.

To recap, you now want to know how to change weights_1 and bias to reduce error. You already saw that derivatives can be used for this, but instead of a function that only contains a sum, you now have a function that produces its result using other functions.

Since you now have this function composition, you can use the chain rule from calculus to take the derivative of the error with respect to the parameters. Using the chain rule, you take the partial derivatives of each function, evaluate them, and multiply all of them to get the desired derivative.

You can now begin updating the weights. You want to know how to adjust the weights to reduce error. This implies that you must compute the error’s derivative with respect to weights. Because the error is calculated by combining different functions, the partial derivatives of these functions must be taken.

Here’s an example of how to use the chain rule to find the derivative of the error with respect to the weights:

The bold red arrow points to the desired derivative, derror_dweights. Starting with the red hexagon, you’ll follow the inverse path of making a prediction and computing the partial derivatives for each function.

The yellow hexagons in the image above represent each function, and the grey arrows on the left represent the partial derivatives. Using the chain rule, the value of derror_dweights is as follows:

derror_dweights = (
    derror_dprediction * dprediction_dlayer1 * dlayer1_dweights
)

To find the derivative, multiply all of the partial derivatives that follow the path from the error hexagon (the red one) to the hexagon where the weights are found (the leftmost green one).

The derivative of y = f(x) is defined as the derivative of f with respect to x. Using this nomenclature, for derror_dprediction, you want to know the derivative of the function that computes the error with respect to the prediction value.

This reverse path is referred to as a backward pass. You compute the partial derivatives of each function in each backward pass, substitute the variables by their values, and finally multiply everything.

This is how you apply the chain rule: “take the partial derivatives, evaluate, and multiply.” Backpropagation is the algorithm used to update the Neural Network parameters.

Adjusting the Parameters With Backpropagation

This section will take you step by step through the backpropagation process, beginning with how to update the bias. You want to take the error function’s derivative with respect to the bias, derror_dbias. Then, going backward, you’ll take partial derivatives until you find the bias variable.

Because you are working backward from the end, you must first calculate the partial derivative of the error with respect to the prediction. In the image below, that is the derror_dprediction:

The error-producing function is a square function, and its derivative, as previously stated, is 2 * x. You used the first partial derivative (derror_dprediction) and still didn’t get to the bias, so go back and take the prediction’s derivative with respect to the previous layer, dprediction_dlayer1.

The sigmoid function produces the prediction. You can calculate the sigmoid function’s derivative by multiplying sigmoid(x) by 1 - sigmoid(x). This derivative formula is very useful because it allows you to compute the derivative of an already computed sigmoid result. Then you take this partial derivative and go backward.

Take the derivative of layer_1 with respect to the bias now. You finally got around to it! Because the bias variable is an independent variable, the power rule result is 1. Now that you’ve completed this backward pass, you can combine everything and compute derror_dbias:

In [36]: def sigmoid_deriv(x):
   ...:     return sigmoid(x) * (1-sigmoid(x))

In [37]: derror_dprediction = 2 * (prediction - target)
In [38]: layer_1 = np.dot(input_vector, weights_1) + bias
In [39]: dprediction_dlayer1 = sigmoid_deriv(layer_1)
In [40]: dlayer1_dbias = 1

In [41]: derror_dbias = (
   ...:     derror_dprediction * dprediction_dlayer1 * dlayer1_dbias
   ...: )

To update the weights, repeat the previous steps, going backward and taking partial derivatives until you reach the weights variable. You only need to compute dlayer1_dweights because you’ve already computed some of the partial derivatives. The derivative of the dot product is the derivative of the first vector multiplied by the second vector, plus the derivative of the second vector multiplied by the first vector.
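The backward pass for the weights can be sketched as follows. For illustration it assumes the starting weights [1.45, -0.66], the input vector [2, 1.5], and a target of 0, and it repeats the sigmoid helpers so the snippet runs on its own.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_deriv(x):
    return sigmoid(x) * (1 - sigmoid(x))


# Values assumed for illustration (the starting weights, not the updated ones).
input_vector = np.array([2, 1.5])
weights_1 = np.array([1.45, -0.66])
bias = np.array([0.0])
target = 0

# Forward pass.
layer_1 = np.dot(input_vector, weights_1) + bias
prediction = sigmoid(layer_1)

# Backward pass: one partial derivative per function on the path.
derror_dprediction = 2 * (prediction - target)
dprediction_dlayer1 = sigmoid_deriv(layer_1)
# Product rule on the dot product: the input is treated as a constant
# (derivative 0) and the weights have derivative 1 with respect to themselves.
dlayer1_dweights = (0 * weights_1) + (1 * input_vector)

derror_dweights = derror_dprediction * dprediction_dlayer1 * dlayer1_dweights
print(derror_dweights)
```

The result has one entry per weight, so it can be subtracted from the weights vector after scaling it by the learning rate.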

Creating the Neural Network Class

You now understand how to write the expressions that will update both the weights and the bias. It’s time to create a neural network class. Classes are the foundation of object-oriented programming (OOP). For the weights and bias variables, the NeuralNetwork class generates random start values.

When creating a NeuralNetwork object, you must include the learning_rate parameter. You’ll use predict() to make a prediction. The methods _compute_gradients() and _update_parameters() have the computations you learned in this section. This is the final NeuralNetwork class:

class NeuralNetwork:
    def __init__(self, learning_rate):
        self.weights = np.array([np.random.randn(), np.random.randn()])
        self.bias = np.random.randn()
        self.learning_rate = learning_rate

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def _sigmoid_deriv(self, x):
        return self._sigmoid(x) * (1 - self._sigmoid(x))

    def predict(self, input_vector):
        layer_1 = np.dot(input_vector, self.weights) + self.bias
        layer_2 = self._sigmoid(layer_1)
        prediction = layer_2
        return prediction

    def _compute_gradients(self, input_vector, target):
        layer_1 = np.dot(input_vector, self.weights) + self.bias
        layer_2 = self._sigmoid(layer_1)
        prediction = layer_2

        derror_dprediction = 2 * (prediction - target)
        dprediction_dlayer1 = self._sigmoid_deriv(layer_1)
        dlayer1_dbias = 1
        dlayer1_dweights = (0 * self.weights) + (1 * input_vector)

        derror_dbias = (
            derror_dprediction * dprediction_dlayer1 * dlayer1_dbias
        )
        derror_dweights = (
            derror_dprediction * dprediction_dlayer1 * dlayer1_dweights
        )

        return derror_dbias, derror_dweights

    def _update_parameters(self, derror_dbias, derror_dweights):
        self.bias = self.bias - (derror_dbias * self.learning_rate)
        self.weights = self.weights - (
            derror_dweights * self.learning_rate
        )

That concludes the code for your first neural network. This code simply connects all of the previous pieces. To make a prediction, first create an instance of NeuralNetwork, and then call .predict():

In [42]: learning_rate = 0.1

In [43]: neural_network = NeuralNetwork(learning_rate)

In [44]: neural_network.predict(input_vector)
Out[44]: array([0.79412963])

The preceding code makes a prediction, but you must now learn how to train the network. The goal is for the network to generalise beyond the training dataset. This means you want it to adapt to new, previously unseen data with the same probability distribution as the training dataset. That is what you will do in the following section.

Training the Network With More Data

You’ve already tweaked the weights and bias for one data instance, but the goal is for the network to generalise across the entire dataset. Stochastic gradient descent is a technique in which the model makes a prediction based on a randomly chosen piece of training data, calculates the error, and updates the parameters at each iteration.

Now it’s time to create the train() method of your NeuralNetwork class. Every 100 iterations, you’ll save the error over all data points because you want to plot a chart showing how this metric changes as the number of iterations increases.

This is the neural network’s final train() method:

class NeuralNetwork:
    # ...

    def train(self, input_vectors, targets, iterations):
        cumulative_errors = []
        for current_iteration in range(iterations):
            # Pick a data instance at random
            random_data_index = np.random.randint(len(input_vectors))

            input_vector = input_vectors[random_data_index]
            target = targets[random_data_index]

            # Compute the gradients and update the weights
            derror_dbias, derror_dweights = self._compute_gradients(
                input_vector, target
            )

            self._update_parameters(derror_dbias, derror_dweights)

            # Measure the cumulative error for all the instances
            if current_iteration % 100 == 0:
                cumulative_error = 0
                # Loop through all the instances to measure the error
                for data_instance_index in range(len(input_vectors)):
                    data_point = input_vectors[data_instance_index]
                    target = targets[data_instance_index]

                    prediction = self.predict(data_point)
                    error = np.square(prediction - target)

                    cumulative_error = cumulative_error + error
                cumulative_errors.append(cumulative_error)

        return cumulative_errors

The above code block contains a lot of information, so here’s a line-by-line breakdown:

  • Line 8 selects an instance at random from the dataset.
  • Lines 14 to 16 compute partial derivatives and return the bias and weight derivatives. They use _compute_gradients(), which you defined earlier.
  • Line 18 updates the bias and the weights using _update_parameters(), which you defined in the previous code block.
  • Line 21 determines whether or not the current iteration index is a multiple of 100. This is done to see how the error changes after every 100 iterations.
  • Line 24 begins the loop that traverses all of the data instances.
  • Line 28 calculates the prediction outcome.
  • Line 29 calculates the error for each instance.
  • Line 31 is where you accumulate the sum of the errors using the cumulative_error variable. You do this because you want to plot a point with the error for all the data instances. Then, on line 32, you append the error to cumulative_errors, the list that stores the errors. You’ll use this list to plot the graph.

To summarise, you select a random instance from the dataset, compute the gradients, and update the weights and bias. Every 100 iterations, you also compute the cumulative error and save the results in an array. This array will be plotted to show how the error changes during the training process.

To keep things less complicated, you’ll use a dataset with just eight instances, the input_vectors array. You can now call train() and plot the cumulative error for each iteration with Matplotlib:

In [45]: # Paste the NeuralNetwork class code here
   ...: # (and don't forget to add the train method to the class)

In [46]: import matplotlib.pyplot as plt

In [47]: input_vectors = np.array(
   ...:     [
   ...:         [3, 1.5],
   ...:         [2, 1],
   ...:         [4, 1.5],
   ...:         [3, 4],
   ...:         [3.5, 0.5],
   ...:         [2, 0.5],
   ...:         [5.5, 1],
   ...:         [1, 1],
   ...:     ]
   ...: )

In [48]: targets = np.array([0, 1, 0, 1, 0, 1, 1, 0])

In [49]: learning_rate = 0.1

In [50]: neural_network = NeuralNetwork(learning_rate)

In [51]: training_error = neural_network.train(input_vectors, targets, 10000)

In [52]: plt.plot(training_error)
In [53]: plt.xlabel("Iterations")
In [54]: plt.ylabel("Error for all training instances")
In [55]: plt.savefig("cumulative_error.png")

You instantiate the NeuralNetwork class again and call train() using the input_vectors and the target values. You instruct it to run 10000 times. This is a graph of the error for a neural network instance:

The image is saved in the directory where IPython is running. The overall error decreases, which is what you want. After the largest drop, the error keeps rising and falling quickly from one iteration to the next. This happens because the dataset is random and very small, so it’s hard for the neural network to extract any features.

However, using this metric to evaluate performance is not a good idea because you are measuring it on data instances that the network has already seen. A low training error can hide overfitting, which occurs when the model fits the training dataset so well that it does not generalise to new data.
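One common way to check for this is to hold some instances back and measure the error on them separately. The sketch below is an illustration under stated assumptions, not part of the NeuralNetwork class: it reuses the eight instances from this tutorial, holds out the last two (an arbitrary choice), and uses a compact function-based version of the same gradient update. The fixed random seed and the helper names predict() and total_squared_error() are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed for repeatability

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(x, weights, bias):
    return sigmoid(np.dot(x, weights) + bias)

def total_squared_error(xs, ys, weights, bias):
    # Sum of squared errors over every instance in xs
    return float(np.sum((predict(xs, weights, bias) - ys) ** 2))

input_vectors = np.array(
    [[3, 1.5], [2, 1], [4, 1.5], [3, 4], [3.5, 0.5], [2, 0.5], [5.5, 1], [1, 1]]
)
targets = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Hold out the last two instances (arbitrary split, for illustration only)
train_x, test_x = input_vectors[:6], input_vectors[6:]
train_y, test_y = targets[:6], targets[6:]

weights = rng.standard_normal(2)
bias = rng.standard_normal()
learning_rate = 0.1

for _ in range(10000):
    i = rng.integers(len(train_x))  # pick a training instance at random
    prediction = predict(train_x[i], weights, bias)
    # Chain rule: derror/dprediction * dprediction/dlayer1
    common = 2 * (prediction - train_y[i]) * prediction * (1 - prediction)
    weights = weights - learning_rate * common * train_x[i]
    bias = bias - learning_rate * common

train_error = total_squared_error(train_x, train_y, weights, bias)
test_error = total_squared_error(test_x, test_y, weights, bias)
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

With a dataset this small and random, the two numbers mainly illustrate the mechanics; a test error much larger than the training error is the usual sign of overfitting.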

Adding More Layers to the Neural Network

For learning purposes, the dataset in this tutorial was kept small. Deep learning models typically require a large amount of data because the datasets are more complex and have many nuances.

Because these datasets contain more complex information, using only one or two layers is insufficient. That is why deep learning models are referred to as “deep.” They typically have numerous layers.

You can increase the expressive power of the network and make very high-level predictions by adding more layers and using activation functions. Face recognition is an example of this type of prediction; for example, when you take a photo of your face with your phone, the phone unlocks if it recognises the image as you.
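To make the idea of stacking layers concrete, here is a minimal sketch of a forward pass through two layers. The layer sizes (two inputs, three hidden units, one output), the random seed, and the names weights_2, bias_1, bias_2, and predict_two_layers() are assumptions for illustration; training such a network would also require extending the gradient computation, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, assumed for repeatability

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical sizes: 2 inputs -> 3 hidden units -> 1 output
weights_1 = rng.standard_normal((2, 3))
bias_1 = rng.standard_normal(3)
weights_2 = rng.standard_normal(3)
bias_2 = rng.standard_normal()

def predict_two_layers(input_vector):
    # First layer: linear step followed by the activation function
    hidden = sigmoid(np.dot(input_vector, weights_1) + bias_1)
    # Second layer: takes the hidden activations as its input
    return sigmoid(np.dot(hidden, weights_2) + bias_2)

prediction = predict_two_layers(np.array([3, 1.5]))
print(prediction)
```

The activation between the two layers matters: without it, two stacked linear steps would collapse into a single linear step and add no expressive power.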


You created a neural network from scratch using NumPy today. You are now prepared to delve deeper into the world of artificial intelligence in Python.

You learned the following in this tutorial:

  1. What deep learning is and how it differs from machine learning
  2. How to represent vectors with NumPy
  3. What activation functions are and why they’re used in neural networks
  4. What the backpropagation algorithm is and how it works
  5. How to train a neural network and make predictions

The process of training a neural network consists mainly of applying operations to vectors. Today, you did it from scratch with NumPy as the only dependency. This is not recommended in a production setting because the whole process can be inefficient and error-prone. That’s one of the reasons why deep learning frameworks like Keras, PyTorch, and TensorFlow are so popular.
