Neurons For Coders Part 2 – Back propagation

In this series I’m working through some of the basics from Michael Nielsen’s excellent book on neural networks, and intend to proceed to some more advanced bits afterwards. Along the way, I’m going to attempt to translate some of the more math-heavy aspects of neural networks into a form more targeted at engineers.

The GitHub repo for this series is here: https://github.com/chriscummings100/neuralfun

So, back propagation – the magic of neural networks, and the process that makes deep neural networks (i.e. lots of layers) practical. Michael refers to the equations involved as quite beautiful, and I wholeheartedly agree, but even without the actual maths, the elegant simplicity in the process of training a neural network is something quite special!

As with my previous post, before going any further, here’s a screenshot to get you incredibly excited!

The Only Bit Of Maths

There is one mathematical concept that you really need to be a little comfortable with in the world of back propagation, and that is the concept of a gradient, otherwise known (roughly) as a derivative, or rate of change. I don’t want to dig too deep into explaining math principles here – there are plenty of books about them – but this is so important that I’ll at least have a go!

A gradient (or derivative) of a function can be thought of in lots of different ways – the steepness of a graph, its rate of change, a tangent or a slope. The way that matters to back propagation is that it defines how the parameters of a function are affecting its output at a given point. For example, with the function y=2x, a change in x always results in double the change in y.

With the multi-variable function z=y+2x we can say that a change in y always results in the same change in z, but a change in x always results in double the change in z. As such, we’ve got 2 different ‘gradient values’, one for x and one for y. Mathematically these are known as the partial derivatives. In the case of that simple function:

  • The partial derivative of z=y+2x with respect to y is 1.
  • The partial derivative of z=y+2x with respect to x is 2.

In these 2 examples the partial derivatives are fairly intuitive and also constant – they are the same regardless of the values of x or y. However, in most cases it’s more complex:

  • The partial derivative of z=y²+2x with respect to y is 2y.
  • The partial derivative of z=y²+2x with respect to x is 2.

Proofs for the above reside in basic calculus, but we don’t need to worry about that here. What matters to us is that we can still work out constant values at specific points:

  • The partial derivative of z=y²+2x with respect to y when y=2 is 4.
  • The partial derivative of z=y²+2x with respect to x when x=anything is 2.

There is an important point to note here. In the case of y in the above equation, the gradient is continually changing. When y=2, the derivative of z with respect to y is 4, suggesting a change in y results in 4 times the change in z. Whilst this is clearly not universally true (just try adding a big number to y and see what it does to the result), calculus tells us that for tiny changes in y, the gradient is a good approximation for calculating the corresponding change in z.
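If you want to convince yourself of that without any calculus, a quick numerical check does the job. Here’s a tiny sketch (the function and the nudge size are just illustrative):

//numerically estimate the partial derivative of z = y*y + 2*x
//with respect to y at y=2, by nudging y a tiny amount
double Z(double x, double y) => y * y + 2 * x;

double eps = 0.000001;
double gradient_estimate = (Z(0, 2 + eps) - Z(0, 2)) / eps;
//gradient_estimate comes out at ~4, matching the 2y rule above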

A neural network can actually be thought of as a function – one where every weight and bias value is a parameter. However, where you might have been happy picturing slopes or triangles in the above examples, with a neural network that has roughly 24,000 weights and 40 biases (our earlier network) you’re into spaces with tens of thousands of dimensions. My advice is don’t try to picture that – it’ll hurt your head and it won’t get you anywhere. However, you can still ask, for a given set of inputs, how much any specific bias or weight anywhere in the network affected the output.

The reason this question is important is because of a technique known as gradient descent. The idea is that if we know how a given parameter (e.g. weight 7 of neuron 2 in layer 1) is affecting the output of the network, and we know how wrong the output is, we can adjust the parameter to make the output less wrong! This is what happens with back propagation. We fire loads of different inputs into the network for which we know the desired output. For each one we calculate how wrong the output is, work out the gradients (or derivatives) with respect to every weight and bias, and then nudge them slightly in the opposite direction, resulting in an output that is less wrong. Rinse/repeat a few million times and you have a trained neural network!
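To make the idea concrete before we get to real network code, here’s a minimal sketch of gradient descent on a single-parameter function (the function, learning rate and iteration count are all just illustrative):

//minimise cost = (x - 3)^2 by repeatedly stepping x in the
//opposite direction to the gradient, which is 2 * (x - 3)
double x = 0;
double learning_rate = 0.1;
for (int i = 0; i < 100; i++)
{
    double gradient = 2 * (x - 3);
    x -= learning_rate * gradient;
}
//x is now very close to 3, the value that minimises the cost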

Now obviously there’s a load of maths behind the above – especially if you want to prove it. The principles can be written as a few fairly simple equations, but in my experience, an extremely skilled but not-math-trained brain is easily thrown off by a new symbol, or a T in an unexpected place, or a new version of multiplication that’s named after somebody… Every game developer has been there at some point! Fortunately, the rest is fairly intuitive, and can be well understood with good old code. So let’s begin…

The Training Loop

I think this is all clearest starting from the outside and working in to the meaty bit, so we’ll start with a function called ‘DoTrainEpoch’, which does a single pass (or epoch) of training the network.

public IEnumerator DoTrainEpoch(Dataset dataset)
{
    m_epoch_progress = 0;
    
    //get randomly shuffled training data
    var training_data = GetShuffledTrainingData(dataset);
    
    //divide into batches, and train network off each batch 1 at a time
    int first_idx = 0;
    int batch_size = 10;
    while (first_idx < training_data.Length)
    {
        //clamp the final batch so it can't run past the end of the data
        int count = Math.Min(batch_size, training_data.Length - first_idx);
        var batch = new ArraySegment<Dataset.Data>(
            training_data, first_idx, count);
        TrainBatch(batch);
        first_idx += count;
        m_epoch_progress = first_idx / (float)training_data.Length;
        yield return null;
    }
    
    //evaluate network after all batches done
    Evaluate(dataset);
}

This calls a function to get the training data from the dataset in a randomly shuffled array, then simply walks over the data in batches of 10. Each batch is passed to the TrainBatch function, which modifies the network to get better at identifying that batch. The reason the data is shuffled is so that if we run another ‘epoch’, the batches passed in are different, so the network doesn’t just get very good at specific batches.

Note: the reason the above is in an enumerator with a yield is simply so we can get a pretty progress bar in Unity using a coroutine, and the call to Evaluate simply counts and logs how many records in the test data the network gets correct.
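Neither GetShuffledTrainingData nor Evaluate is shown here (both are in the repo), but minimal sketches of them might look like the following – the Fisher-Yates shuffle and the exact member names used are my assumptions:

//return a copy of the training data, shuffled with Fisher-Yates
Dataset.Data[] GetShuffledTrainingData(Dataset dataset)
{
    var data = (Dataset.Data[])dataset.m_training_data.Clone();
    var rng = new System.Random();
    for (int i = data.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        (data[i], data[j]) = (data[j], data[i]);
    }
    return data;
}

//count and log how many test records the network classifies correctly
void Evaluate(Dataset dataset)
{
    int correct = 0;
    foreach (var data in dataset.m_test_data)
    {
        if (MaxIndex(FeedForward(data)) == data.m_output)
            correct++;
    }
    Debug.Log($"{correct}/{dataset.m_test_data.Length} correct");
}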

Whilst our app does have a button to train a single ‘epoch’, to wrap up the process of doing multiple ones, we also have another coroutine enumerator to do a full ‘training pass’:

public IEnumerator DoTrain(Dataset dataset, int epochs)
{
    m_train_progress = 0;
    for (int i = 0; i < epochs; i++)
    {
        yield return DoTrainEpoch(dataset);
        m_train_progress = (i+1) / (float)(epochs);
    }
}

As you can probably see, all we’re doing here is calling the training function multiple times.

Training a single batch

TrainBatch is the gradient descent bit I mentioned earlier. It doesn’t actually calculate the gradients (we’ll get to those next). Instead, its job is as follows:

  • Calculate the weight and bias gradients for every entry in the batch and add them all up.
  • The summed gradients are then scaled by a magic number (the learning rate) and divided by the batch size.
  • Finally, the scaled gradients are subtracted from their corresponding weights and biases.

We’ll cover TrainBatch in a few chunks as it’s simple but big!

//these are gradients of the cost function with respect to the weights and biases
//they all start out at 0, and are accumulated by the back propagation code below
var dweight_by_dcost = new double[NumLayers][][];
var dbias_by_dcost = new double[NumLayers][];
for (int layer_idx = 1; layer_idx < NumLayers; layer_idx++)
{
    int layersize = LayerSize(layer_idx);
    dbias_by_dcost[layer_idx] = new double[layersize];
    dweight_by_dcost[layer_idx] = new double[layersize][];
    int numweights = LayerSize(layer_idx - 1);
    for(int neuron = 0; neuron < layersize; neuron++)
        dweight_by_dcost[layer_idx][neuron] = new double[numweights];
}

This first bit is nice and simple! All we’re doing here is creating some arrays to contain the summed up weight and bias gradients, all initialized to 0.

Next is the meat of the function: summing up the gradients.

foreach (var datapoint in batch)
{
    //run back prop for this data point, which gives us
    //a dweight_by_dcost and dbias_by_dcost just for this data point
    CalculateBackPropGradients(datapoint, 
        out double[][][] datapoint_dweight_by_dcost, 
        out double[][] datapoint_dbias_by_dcost);
    
    //now add the gradients for this data point to the running total
    for (int layer_idx = 1; layer_idx < NumLayers; layer_idx++)
    {
        for(int neuron = 0; neuron < LayerSize(layer_idx); neuron++)
        {
            dbias_by_dcost[layer_idx][neuron] += 
                datapoint_dbias_by_dcost[layer_idx][neuron];
            
            for (int weight_idx = 0; 
                weight_idx < dweight_by_dcost[layer_idx][neuron].Length; 
                weight_idx++)
            {
                dweight_by_dcost[layer_idx][neuron][weight_idx] += 
                    datapoint_dweight_by_dcost[layer_idx][neuron][weight_idx];
            }
        }
    }
}

So long as you trust my CalculateBackPropGradients function, this code is doing a pretty simple job. It’s getting the gradients for a specific data point, then simply adding them to the end result.

By the time the above is done, we’ll have all our weight and bias gradients summed up. For example, dbias_by_dcost[1][7] will contain the sum of all layer 1, neuron 7 bias gradients for each datapoint in the batch.

And finally, the weights and biases for the whole network are adjusted accordingly (the gradient descent):

//finally, we can apply 'gradient descent' to the
//weights and biases - i.e. adjust them in the 
//opposite direction to the gradient, scaled
//by constant learning rate over batch size
{
    double batch_learning_rate = 3.0 / (double) batch.Count;
    
    //subtract the scaled gradient sums from every weight and bias
    for (int layer_idx = 1; layer_idx < NumLayers; layer_idx++)
    {
        for (int neuron = 0; neuron < LayerSize(layer_idx); neuron++)
        {
            m_biases[layer_idx][neuron] -= 
                batch_learning_rate * dbias_by_dcost[layer_idx][neuron];
            
            for (int weight_idx = 0; 
                weight_idx < m_weights[layer_idx][neuron].Length; 
                weight_idx++)
            {
                m_weights[layer_idx][neuron][weight_idx] -= 
                    batch_learning_rate * dweight_by_dcost[layer_idx][neuron][weight_idx];
            }
        }
    }
}

Here, once again we’re looping over all neuron biases and weights, this time subtracting the calculated gradients, scaled by a learning rate (a magic number) and divided by the batch size. Why subtract? The gradients are derivatives of the cost with respect to the weights and biases – in other words, they tell us how to adjust the weights and biases if we wanted the network to get worse! We want to go in the opposite direction, so we subtract.

Modifying The Feed Forward

Before diving into the actual back propagation function, we need to modify the FeedForward process from part 1. Back propagation works by running the neural network forward, evaluating the result, and then using it to tune the network’s parameters. However, to do so, it needs to know about the state of all the neurons in the network.

In this modified version, we’re outputting 2 lists – one is the activations of all neurons in each layer, and another is a new concept – the weighted inputs:

public void FeedForward(
    Dataset.Data data, 
    out List<double[]> weighted_inputs, 
    out List<double[]> activations)
{
    weighted_inputs = new List<double[]>();
    activations = new List<double[]>();
    
    double[] a = data.m_inputs;
    
    //add activations of input layer, and a null set of weighted
    //inputs, as it had none!
    weighted_inputs.Add(null);
    activations.Add(a);
    
    //feed forward through the network, outputting weighted
    //inputs and activations at each step
    for (int layer = 1; layer < m_sizes.Length; layer++)
    {
        double[] weighted_input = new double[m_sizes[layer]];
        double[] biases = m_biases[layer];
        double[][] weights = m_weights[layer];
        for (int neuron = 0; neuron < m_sizes[layer]; neuron++)
        {
            weighted_input[neuron] = biases[neuron] + Dot(weights[neuron], a);
        }
        a = Sigmoid(weighted_input);
        
        weighted_inputs.Add(weighted_input);
        activations.Add(a);
    }
}

This isn’t too different from the original, however now we have added:

  • A new concept – the weighted inputs, which are simply the values generated by combining a neuron’s inputs, weights and bias (the dot product plus the bias) before being fed into the sigmoid activation function
  • At every layer in the network, we’re outputting the full set of weighted inputs and activations

This version literally just outputs more info than the old one. In fact, to help maintain the old functionality, I’ve wrapped the above in a version with the same interface as the original:

public double[] FeedForward(Dataset.Data data)
{
    FeedForward(data, out _, out List<double[]> activations);
    return activations[activations.Count - 1];
}

All we’re doing is running the complicated one and returning the activations from the final layer.

Back Propagating!

Ok! It’s time for the back propagation function, whose job it is to calculate gradients for all the weights and biases in the network. To do so, first a nice easy bit – we’re going to run the feed forward process. I’ve included the function signature here too so you can see it:

public void CalculateBackPropGradients(Dataset.Data data, out double[][][] dweight_by_dcost, out double[][] dbias_by_dcost)
{
    //these are gradients of the cost function with respect to the weights and biases
    dweight_by_dcost = new double[NumLayers][][];
    dbias_by_dcost = new double[NumLayers][];
    
    //do feed forward step and record all zs and activations
    FeedForward(data, out List<double[]> weighted_inputs, out List<double[]> activations);

At this point we now have 2 lists – the weighted inputs for each layer, and the activations for each layer.

If you read about training neural networks, a term you’ll hear a lot is the cost function. This is ultimately a single measure of how right or wrong the output of your network was. As in Michael’s book, we’re going to start with what’s known as the ‘quadratic error’ function – the sum of the squares of the errors of each output neuron.
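In code terms, for a single data point, it might look something like this sketch – I’ve included the conventional factor of a half (as in Michael’s book), which makes the derivative come out cleanly later:

//quadratic cost for one data point: half the sum of squared
//differences between the actual and desired outputs
public static double QuadraticCost(double[] actual, double[] desired)
{
    double sum = 0;
    for (int i = 0; i < actual.Length; i++)
    {
        double diff = actual[i] - desired[i];
        sum += diff * diff;
    }
    return 0.5 * sum;
}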

Up until now I’ve talked vaguely about gradients for weights and biases etc. But what I’m really referring to when I say the gradient for the bias of neuron 3 in layer 2 is the derivative of the cost function with respect to the bias of neuron 3 in layer 2. In other words, if the cost function is a measure of how wrong our network was, we’re looking to calculate how a given parameter contributed to that wrongness.

Important note: saying ‘derivative with respect to’ repeatedly is annoying and very mathy, so from here on I’m just going to refer to weight gradients or activation gradients etc. When I say weight gradient, I mean the derivative of the cost function with respect to a weight value.

We’ve also only talked about gradients for weights and biases, but we can ask ‘how much did it affect the result?’ for any aspect of the network. How much did the activation of neuron 2 in the output layer affect the result? How much did the weighted input of neuron 4 in layer 1 affect the result? With back propagation we start by asking this question about the activations of the output layer, then use some equations rooted in calculus to work the results backwards through the network.

So let’s start working backwards!

int layer_idx = NumLayers - 1;
double[] weighted_input_gradient;
{
    double[] activation_gradient = Sub(activations[layer_idx], 
                                       data.m_vectorized_output);

This first step calculates the activation gradient for each of the output neurons. It’s at this point that our choice of the ‘quadratic error’ cost function comes in handy – a little calculus shows that the derivative of the cost function with respect to a given output activation (aka the activation gradient) is just the actual result minus the desired result. More complex cost functions result in slightly more complex math, but the code wouldn’t change a great deal.

Next, we’re going to take that activation gradient, and use it to calculate the weighted input gradient:


    weighted_input_gradient = Mul(activation_gradient, 
                                SigmoidDerivative(weighted_inputs[layer_idx]));

This is the first taste of how the elegant process of back propagation works. We started out by answering the question ‘how does a given output activation affect the cost function?’. We are now going to use the result to answer another question: ‘how does the corresponding weighted input affect the cost function?’.

The answer that calculus provides is relatively intuitive – the weighted input gradient is equal to the corresponding activation gradient multiplied by the sigmoid function’s gradient. Why is this intuitive? Remember the shape of the sigmoid function:

The sigmoid function converts weighted inputs to activations. For weighted inputs with a very large magnitude (values of x in the above graph), the sigmoid function is very flat, so a change in the weighted input will have a minimal effect on the output activation (f(x) in the above graph). However, for small weighted inputs, the sigmoid function is very steep, so a change in weighted input will have a significant effect on the activation. So, by calculating how the activation affects the cost, then scaling it according to how flat the sigmoid function is, we get how the corresponding weighted input affects the cost.
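As an aside, Sub, Mul and SigmoidDerivative are small vector helpers I haven’t shown in this post. Minimal sketches of them might look like this, assuming element-wise semantics plus a scalar overload of Mul (which gets used shortly):

//element-wise subtraction: res[i] = a[i] - b[i]
public static double[] Sub(double[] a, double[] b)
{
    double[] res = new double[a.Length];
    for (int i = 0; i < a.Length; i++)
        res[i] = a[i] - b[i];
    return res;
}

//element-wise multiplication: res[i] = a[i] * b[i]
public static double[] Mul(double[] a, double[] b)
{
    double[] res = new double[a.Length];
    for (int i = 0; i < a.Length; i++)
        res[i] = a[i] * b[i];
    return res;
}

//scale every element of a vector by a constant
public static double[] Mul(double[] a, double b)
{
    double[] res = new double[a.Length];
    for (int i = 0; i < a.Length; i++)
        res[i] = a[i] * b;
    return res;
}

//element-wise sigmoid derivative: s * (1 - s), where s = Sigmoid(z)
public static double[] SigmoidDerivative(double[] z)
{
    double[] res = new double[z.Length];
    for (int i = 0; i < z.Length; i++)
    {
        double s = Sigmoid(z[i]);
        res[i] = s * (1.0 - s);
    }
    return res;
}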

The bias for the neuron is next, and it’s pretty easy:

    //bias gradient is equal to the weighted input error 
    dbias_by_dcost[layer_idx] = weighted_input_gradient;

This might look odd at first, but consider that the bias is just a constant we add on to the sum of the inputs from previous neurons. i.e. a change of 10 in the bias of the neuron corresponds to a change of 10 in the weighted input. So if we know the weighted input has a given effect on the cost function, we can say the bias has exactly the same effect.

The weights are only slightly more complex:

    //each weight gradient for each neuron is the activation of the corresponding
    //connected neuron in the previous layer times the weighted input error
    dweight_by_dcost[layer_idx] = new double[LayerSize(layer_idx)][];
    for (int neuron = 0; neuron < LayerSize(layer_idx); neuron++)
    {
        dweight_by_dcost[layer_idx][neuron] = 
            Mul(activations[layer_idx - 1], weighted_input_gradient[neuron]);
    }

Again, if you think about it, this isn’t too strange. The effect of a given weight on the weighted input depends on the activation of the neuron it links to. If the neuron it links to had a huge activation, then the weight of the link will greatly affect the weighted input (and hence the activation, and hence the cost…). Conversely, if the neuron it links to had an activation of 0, then changing the weight would have no effect whatsoever!

So where are we so far? We’ve managed to work out:

  • The activation gradient for each output neuron
  • From that, the weighted input gradient for each output neuron
  • From that, the bias gradient for each output neuron
  • And all the weight gradients for each output neuron

Now, having reduced layer_idx by 1, comes the final, beautiful bit of magic:

//step back from the output layer to the last hidden layer
layer_idx--;

while (layer_idx >= 1)
{
    //this is the most special bit! here we're taking the weighted input gradients from the next layer and
    //using them to calculate the weighted input gradients for this layer. aka back propagation!
    {
        double[] this_layer_weighted_input_gradient = new double[LayerSize(layer_idx)];

        for (int neuron = 0; neuron < LayerSize(layer_idx); neuron++)
        {
            //calculate the weighted sum of the weighted input gradients we've
            //already calculated for the connected neurons in the next layer
            double activation_gradient = 0;
            for (int recv_neuron = 0; 
                 recv_neuron < LayerSize(layer_idx + 1); 
                 recv_neuron++)
            {
                activation_gradient += 
                    m_weights[layer_idx + 1][recv_neuron][neuron] * 
                    weighted_input_gradient[recv_neuron];
            }

            //with the sum, we can now calculate the weighted input
            //gradient in this layer for this neuron
            this_layer_weighted_input_gradient[neuron] = 
                activation_gradient * 
                SigmoidDerivative(weighted_inputs[layer_idx])[neuron];
        }

        weighted_input_gradient = this_layer_weighted_input_gradient;
    }

We’re entering a loop now, working back through the layers. At the start of each iteration, we do the beautiful maths above. A simple sum uses the weights that link each neuron in this layer to those of the layer after it (i.e. the ones we just did the work for) to calculate the activation gradient for each neuron in this layer.

Once again, when you think about it, this makes sense. If we know the weighted input for a connected neuron in the next layer had a huge effect on the cost, and the weight of the link connecting this neuron to it is large, then it stands to reason the activation for this neuron will have had a huge effect on the cost as well. Conversely, if the weight of the link is 0, then the activation of this neuron is irrelevant. Similarly, if the connected neuron’s weighted input had no effect on the output, then the activation of this neuron must also have had no effect.

Having made that final leap, exactly the same math as we used earlier is applied to convert this new activation gradient to a new weighted input gradient.

And finally, all we have to do is repeat the previous process for the weights and biases in this layer.

    //repeat the process of calculating the bias and weight gradients
    //for this layer just as we did for the first layer
    dbias_by_dcost[layer_idx] = weighted_input_gradient;
    dweight_by_dcost[layer_idx] = new double[LayerSize(layer_idx)][];
    for (int neuron = 0; neuron < LayerSize(layer_idx); neuron++)
    {
        dweight_by_dcost[layer_idx][neuron] = 
            Mul(activations[layer_idx - 1], weighted_input_gradient[neuron]);
    }
    layer_idx--;
}

That’s it! Basic back propagation in a nutshell. No matter how deep the network, this loop can continue to step back, layer by layer, repeating the same calculations to produce values for the weight and bias gradients. The core of training a neural network comes down to this wonderful combination of the following tools:

  • The math to calculate the output layer’s activation gradients
  • The ability to convert any activation gradient into a weighted input gradient
  • The ability to convert any weighted input gradient into corresponding bias and weight gradients
  • And finally, the ability to step back, and calculate the previous layer’s activation gradients from the weighted input gradients we already have

If you do care about the maths, it’s some low-to-medium-level calculus – only just above high school level – involving a bit of practiced manipulation of equations and a mathematical tool known as the chain rule.
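For the curious, the four steps above map directly onto the four back propagation equations from chapter 2 of Michael’s book. In his notation, where δ is the vector of weighted input gradients for a layer, C is the cost, and ⊙ is element-wise multiplication (exactly what our Mul helper does):

\delta^L = \nabla_a C \odot \sigma'(z^L)
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)
\partial C / \partial b^l_j = \delta^l_j
\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j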

Putting it together

Ok! Enough text. Here’s the updated Unity scene in action. I’ve added a little boilerplate code to the project to support making some progress bars etc., but the vast majority is in the ‘Part2’ scene and script files.

Hopefully that’s pretty clear – after loading up the app, I randomize the network and it becomes demonstrably useless, clearly failing to recognise digits and producing an accuracy of 12% (so, basically luck – it scores roughly 1/10 in a game where the odds of getting it right by chance are 1/10!). However, even after a small amount of training, it begins to match characters well, and after a single epoch hits 93%. As the code is worse than unoptimized, it takes a while, but I can confirm that after 30 epochs it sits at around 96%.

Well, I hope that’s helped somebody grasp the innards of network training – an aspect of AI that I think is a bit of a black box for many. In my next post, I’m going to try to switch to something a little more fun and do some experiments with it!

Neurons For Coders Part 1 – A Basic Neural Network

I’ve recently decided it’s time to get my understanding of neural networks from theoretical to solid, and have decided the best way to do that is to write some neural stuff from first principles, starting with help from the digital book by Michael Nielsen here: http://neuralnetworksanddeeplearning.com/, and implementing things in a simple Unity app. Plus, whilst I am a long way from an expert in AI, I am pretty good at translating math speak into programmer speak. So, with a little luck, I might be able to give a little clarity to some engineers wanting to take their first steps into AI along the way!

My goal here is to get a neural network running, and then eventually training, initially using the principles from the book (without pure copy-paste-porting), and then perhaps expanding them using some more advanced tools such as the modern Adam optimizer. Alongside that, I want to put a bit of game developer in and maybe make some cool visualisations or something.

The GitHub repo for this series is here: https://github.com/chriscummings100/neuralfun

A GitHub repo containing a Python 3 port of the code from Michael Nielsen’s book is here: https://github.com/MichalDanielDobrzanski/DeepLearningPython

Dumping out the book data

Whilst I want to eventually move well past any need for the code from the book’s git hub, it’ll help to get started if we have a simple pretrained network and easy access to the MNIST database. As such, my first step is simply modifying the Python files from the repo above to dump the network and training data out to simple binary files that I can parse in C#.

Adding this to the network.py file lets me dump out the network layer sizes, biases and weights (more on this later).

        #write the num layers, sizes, biases and weights to a binary file
        with open('network.bin', 'wb') as f:
            f.write(self.num_layers.to_bytes(4, byteorder='little'))
            for x in self.sizes:
                f.write(x.to_bytes(4, byteorder='little'))
            for x in self.biases:
                f.write(x.tobytes())
            for x in self.weights:
                f.write(x.tobytes())

The MNIST database, which both the book and this blog will work with, contains a huge set of 28×28 pixel grayscale handwritten digits, each correctly labelled. I found it a little tricky to download in a form that wasn’t very Python-oriented, so I modified mnist_loader.py to dump things out in binary:

    training_data, validation_data, test_data = load_data()
    with open('mnist.bin', 'wb') as f:

        datalen = len(training_data[0])
        f.write(datalen.to_bytes(4, byteorder='little'))
        for idx in range(0,datalen):
            f.write(training_data[0][idx].tobytes())
            f.write(training_data[1][idx].astype(np.uint8).tobytes())

        <same stuff for validation + test data here>

With that done, I zipped up each file so they can be checked into GitHub without using a gazillion bytes, and am ready to start in C#!

Into C# and Unity!

To get you super excited, this is target 1! A simple UI that lets me flick through the different images in the training set, see them along with their label, and also see what the trained neural network outputs.

So the meat of the work is going to be one component, with a few internal classes. The first key one, which will store (and eventually run/train) the neural network itself, is this:

public class Network
{
    public int[] m_sizes;
    public double[][][] m_weights;
    public double[][] m_biases;

That is:

  • The size of each layer
  • The weights for each layer (not including a set for the first layer, as it’s the input)
  • The biases for each layer (ditto)

More on these later!

Our Dataset is a little more complex. It consists of a core ‘Data’ structure which represents a single known input/output:

public class Data
{
    //data from MNIST dataset
    public double[] m_inputs;
    public byte m_output;
    
    //10-element vector with 1.0 in the position of the output label
    public double[] m_vectorized_output;

Each record in the MNIST database is an array of doubles (a value for each pixel in the 28×28 grid), and a single byte to indicate which digit it should be (0-9). However, we want to set up a network that has 10 outputs, where each represents the probability of a given digit, i.e. if output 3 is the highest, we assume it’s a 3. So if we know the expected character is 3, the vectorized output would be [0,0,0,1,0,0,0,0,0,0].
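Building that vector from the label is simple; something like this sketch inside the data loading code:

//build the 10-element vector form of the label: all zeros
//except a 1.0 at the index of the correct digit
m_vectorized_output = new double[10];
m_vectorized_output[m_output] = 1.0;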

public class Dataset
{
    public Data[] m_training_data;
    public Data[] m_validation_data;
    public Data[] m_test_data;

I won’t post the code that actually loads the above arrays – it’s just some boring walking through a buffer, but you can read it in the GitHub repo.
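That said, a minimal sketch of what the buffer-walking might look like for the mnist.bin records is below – the field sizes come from the Python dump above, but the exact helper is my own illustration:

//read 'count' records: each is 784 doubles of pixel data
//followed by a single byte label
static Dataset.Data[] ReadRecords(System.IO.BinaryReader reader, int count)
{
    var records = new Dataset.Data[count];
    for (int i = 0; i < count; i++)
    {
        var data = new Dataset.Data();
        data.m_inputs = new double[28 * 28];
        for (int p = 0; p < data.m_inputs.Length; p++)
            data.m_inputs[p] = reader.ReadDouble();
        data.m_output = reader.ReadByte();
        records[i] = data;
    }
    return records;
}

//usage: the dump writes a 4-byte record count first
//int datalen = reader.ReadInt32();
//m_training_data = ReadRecords(reader, datalen);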

Drawing the data

I’m pretty happy with my maths, but nonetheless I like to visualize, plus pretty pictures are nice. So let’s write a simple function in the Data class to populate a grayscale texture so characters can be shown on screen:

public Texture2D CreateTexture()
{
    var tex = new Texture2D(28, 28, TextureFormat.RGB24, false);
    for (int y = 0; y < 28; y++)
    {
        for (int x = 0; x < 28; x++)
        {
            byte v = (byte)(m_inputs[y * 28 + x] * 255);
            tex.SetPixel(x, 27-y, new Color32(v, v, v, 255));
        }
    }
    tex.Apply();
    return tex;
}

Putting all the above together with a couple of buttons and a text box to show the label, we can now build ourselves a little Unity app to step through the different values in the database.

No actual neural network running yet, but we’ve got everything loaded and a simple UI that lets us view and verify the data. As expected, when stepping through the database, the character and label match. That said, you’d be forgiven for thinking there was a bug given some people’s handwriting…

Running the network

Whilst I’m not an expert, I’ll try to give my programmer’s-eye view of getting neural networks running (and eventually, training).

The idea is that for each layer we have a list of neurons that are linked to neurons in the previous layer. The structure of the network dumped out from the python file is as follows:

Network structure, http://neuralnetworksanddeeplearning.com/ chapter 1

After the input layer, each neuron requires 2 pieces of information:

  • A weight value per neuron from the previous layer
  • A single bias value

So in the above diagram, neuron 3 in layer 1 (the hidden layer) will contain 784 weights, plus a single bias value.

A given neuron’s ‘activation’ (the value it outputs) is then calculated using the following pseudo-code:

//add up weight*activation from all links to 
//neurons in previous layer
sum_of_inputs = 0
for(int i = 0; i < num_previous_layer_neurons; i++)
	sum_of_inputs += weight[i] * previous_layer_activation[i]

//neuron activation is the summed up inputs plus bias fed
//into activation function
activation = ActivationFunc(sum_of_inputs + bias)

This can be written much more simply however, if we think of the weights and activation arrays as really big vectors. If so, that first loop simply boils down to a dot product between 2 vectors, so we end up with:

activation = ActivationFunc(Dot(weights, prev_activations) 
                            + bias)

The activations for the first layer are the inputs to the network (our 28×28 grid of grayscale pixel values). We run this process for each neuron in each layer in sequence, and end up with activations for the final layer, which are its outputs.

The activation function is most commonly the ‘Sigmoid’ function, which when plotted on a graph looks like this:

Ultimately it’s a ‘soft’ version of outputting 1 when the input is positive, and 0 when the input is negative. In code, it’s pretty simple:

public static double Sigmoid(double z)
{
    return 1.0 / (1.0 + Math.Exp(-z));
}

We’re almost ready to start neural networking, but need one extra helper function first. Earlier I mentioned everything is simpler if we can think of the weight and activation arrays as vectors. To do so, we’ll write a dot product function (the sum of the products of corresponding components):

public static double Dot(double[] a, double[] b)
{
    double res = 0;
    for (int i = 0; i < a.Length; i++)
        res += a[i]*b[i];
    return res;
}

Ok! All the tools ready, let’s write the feed forward function that runs the network:

public double[] FeedForward(Dataset.Data data)
{
    //start with activations from the inputs
    double[] current_activations = data.m_inputs;
    
    //feed forward through the network
    for (int layer = 0; layer < m_sizes.Length - 1; layer++)
    {
        //get biases and weights for this layer
        double[] biases = m_biases[layer];
        double[][] weights = m_weights[layer];
        
        //calculate the new activations
        int layer_size = m_sizes[layer + 1];
        double[] new_activations = new double[layer_size];
        for (int neuron = 0; neuron < layer_size; neuron++)
        {
            new_activations[neuron] = Sigmoid(
               biases[neuron] + 
               Dot(weights[neuron], current_activations)
            );
        }

        //store them
        current_activations = new_activations;
    }

    return current_activations;
}

This function is pretty simple, and is really just running the earlier per-neuron pseudo-code for every neuron in every layer sequentially. It starts by taking the inputs from the data as the activations for the first layer. These are fed into the 2nd layer, then the 3rd (and this could continue if we had lots more layers).

Note: a point of confusion here is that m_weights[0] contains the weights for layer 1 and m_biases[0] contains the biases for layer 1. This is because the input layer has none.

That’s actually it for the neurons! The feed forward function takes the inputs, runs them through the network and generates the outputs. In the case of our network, it takes 784 inputs, runs them through the network, and generates 10 outputs – one for each possible digit.

Showing some results

The above is great, but not exactly readable. To get it on screen, we just need one final step: converting those 10 outputs into a single digit by finding the biggest one. If output 3 is biggest, the digit is a 3. Simple!

    public static int MaxIndex(double[] a)
    {
        double max_val = -double.MaxValue;
        int max_idx = -1;
        for(int i = 0; i < a.Length; i++)
        {
            if (a[i] > max_val)
            {
                max_val = a[i];
                max_idx = i;
            }
        }
        return max_idx;
    }

Armed with that simple function we can now display which digit the neural network decided on. My full function that updates the image texture, label and result is called ‘SetData’ and looks like this:

    void SetData(Dataset.Data data)
    {
        if(m_current_texture)
            Destroy(m_current_texture);
        m_current_texture = data.CreateTexture();
        m_data_image.texture = m_current_texture;
        m_data_label.text = data.m_output.ToString();

        double[] results = m_network.FeedForward(data);
        m_result_label.text = MaxIndex(results).ToString();
    }

Job done!

Pretty impressive for the network generated in chapter 1!

All the code for this series can be found at https://github.com/chriscummings100/neuralfun, and the majority of the work for this post is in the Part1 folder.

Next time I’ll probably get training in so we’re no longer reliant on a network trained via the Python scripts provided by Michael’s book, though I may get side tracked by making pretty visualizations or something!

Proceed to my next post, Neurons For Coders Part 2 – Back propagation, for an introduction to the training process of neural networks.