If you are a developer, an entrepreneur, or just a plain curious human being (somewhat like me) with some interest in technology, then unless you are living under a rock you must have heard the buzz around Machine Learning and Deep Learning. Every other day one of the Big Four (Google, Facebook, Apple, Microsoft) trumpets state-of-the-art performance on some problem using Deep Learning, a result that was “unthinkable” for an artificially intelligent system even a few years ago. Needless to say, the “Bot” ecosystem is buzzing too, with ambitious start-ups releasing new bots, fresh and full of promise, almost every day. And pure AI initiatives like Google DeepMind or OpenAI are inviting people to team up with them to try to solve the puzzle of intelligence and make the world a better place.

A quick glance at the infographic, generated using Google Trends, shows how interest in Deep Learning has skyrocketed in the last few years. Amazing, isn’t it?

We admit that the media attention, the money flowing in, and the blogs, news, and conference talks can be overwhelming, and if you are trying to make a start in this field they can really confuse you. That is why we planned this series of articles, starting with the present one: to give you a hands-on, detailed insight into the fascinating world of Deep Learning, and to show how to use the state-of-the-art technologies available today to apply Deep Learning to Natural Language Processing, using a gentle, step-by-step approach.

The idea that machines can be fabricated as intelligent entities has been around for many years. As the famous saying goes, it was probably born out of “an ancient wish to forge the gods.” But it was with the advent of modern Mathematics, Computer Science, and Electronics that it started to take shape in reality. In the last century AI saw a few ups and downs, yet it bounced back with new ideas and fresh research every time there seemed to be a dead end. In the dark days of Neural Network research in the 1970s and 80s, people like David Rumelhart, Geoffrey Hinton, Ronald Williams, David Parker, and Yann LeCun worked hard and produced the research thanks to which the AI Summer has arrived again.

Deep Learning, the way we know it today, can simply be described as a Neural Network with more than one hidden layer. Exactly how an increased number of layers helps the system learn more sophisticated features of the problem space is not completely understood, and theoretical research in this area is ongoing. Still, it is safe to say that a lot of previously untackled problems are being solved with deep networks as we speak.

To conclude this section, I would like to mention that deep neural nets are just one of the many methods that people have applied to the problem of intelligence. They are so hot right now because, on many problems, they produce results unsurpassed by any other method we have seen so far.

A Neural Network is loosely modeled on the workings of the human brain. A neuron in our brain has three main parts, namely the Dendrites, the Cell Body (which contains the Nucleus), and the Axon. They receive a signal, do some processing on it, and transmit the result to the next neuron, respectively.

This is, in essence, what an Artificial Neural Net does. It has one or more inputs, where we feed in the input data. It does some processing on them and hands the result over to another set of neurons, which in turn do another level of processing and hand the result over to the next layer, and so on, until the data reaches what we call the output layer. So, a Neural Net has three components: the Input Layer, one (or more) Hidden Layers, and the Output Layer.

So, as you can see, it is just layers of neurons (computing units), fully connected with each other via data flow. The network takes a *feature vector* (in other words, an array of real numbers) as input and starts with a weight assigned to each connection. Each layer multiplies its inputs by those weights (a matrix operation) and passes the weighted sums through a function that we call the *activation function* of that layer. It then feeds the result to every neuron of the next layer.
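To make the data flow concrete, here is a minimal NumPy sketch of a forward pass. The layer sizes, the random weights, and the choice of `tanh` as the activation function are all assumptions made for illustration; the helper name `forward` is hypothetical too.

```python
import numpy as np

def forward(feature_vector, layers):
    """Push one feature vector through a list of (weights, biases) layers."""
    activation = feature_vector
    for weights, biases in layers:
        # Weighted sum of the previous layer's outputs, then the activation function.
        activation = np.tanh(np.dot(weights, activation) + biases)
    return activation

rng = np.random.RandomState(0)
# Hypothetical sizes: 3 inputs -> 4 hidden neurons -> 2 output neurons.
layers = [
    (rng.randn(4, 3), np.zeros(4)),
    (rng.randn(2, 4), np.zeros(2)),
]
output = forward(np.array([0.5, -1.0, 2.0]), layers)
print(output.shape)
```

Each layer is nothing more than a matrix of weights, so adding another hidden layer only means appending one more `(weights, biases)` pair to the list.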

This process continues throughout the entire *architecture* until it reaches the output layer. At the output layer we compare the result we obtained with the desired result and measure the amount of error that our network has made.

We transfer that information to the layer just behind it, and there we use error *minimization* to tune the weights of that layer so that the error shrinks. We continue this backward movement until we reach the first hidden layer. Once we have done this for the whole training set, we say that one *epoch* of training is over. Then, we start the whole process over again.

So, in essence, with each passing epoch the network *learns* to reduce the error at each layer, and slowly it *trains* itself to generalize to the problem at hand.
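The forward and backward movement described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data, the single hidden layer of three `tanh` neurons, the learning rate, and the number of epochs are all made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
# Toy data (hypothetical): learn y = x1 - x2 from 20 random points.
inputs = rng.randn(20, 2)
targets = (inputs[:, 0] - inputs[:, 1]).reshape(-1, 1)

# One hidden layer with 3 neurons and one output neuron.
w_hidden = rng.randn(2, 3) * 0.5
w_output = rng.randn(3, 1) * 0.5
learning_rate = 0.05
losses = []

for epoch in range(200):
    # Forward pass: input -> hidden layer -> output layer.
    hidden = np.tanh(inputs.dot(w_hidden))
    predicted = hidden.dot(w_output)

    # Measure the error at the output layer (half the mean squared error).
    error = predicted - targets
    losses.append(float(0.5 * np.mean(error ** 2)))

    # Backward pass: hand the error back one layer at a time.
    grad_output = hidden.T.dot(error) / len(inputs)
    error_hidden = error.dot(w_output.T) * (1.0 - hidden ** 2)
    grad_hidden = inputs.T.dot(error_hidden) / len(inputs)

    # Tune the weights of each layer against its own gradient.
    w_output -= learning_rate * grad_output
    w_hidden -= learning_rate * grad_hidden

print(losses[0], losses[-1])
```

The factor `(1.0 - hidden ** 2)` is the derivative of `tanh`; it is what lets the error at the output be translated into an error for the hidden layer.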

This is a (very) simplified version of what backpropagation is doing

The forward movement of data is what gives a Feed Forward Neural Network its name, and the backward movement is called Backpropagation. That is essentially all a neural network does. Those who are familiar with Mathematics and Statistics may recognize this as a function minimization problem, which is completely right!

The idea is not new. It has been around for a long time (since around 1943), but two things have changed in recent years:

- Availability of massive amounts of data (thanks to the huge boom in user-produced data on the internet).
- Availability of reasonably affordable hardware to work with that data.

It is worth mentioning that, along with these two defining factors, Data Science and Computer Science researchers across the world have been working tirelessly on different models and architectures of Neural Nets through the past decades, and their pathbreaking ideas also helped to advance and shape Deep Learning as we know it today. We will mention them when possible and give curious readers pointers to learn more, but unfortunately a detailed discussion is out of the scope of this series.

What a Deep Neural Net *sees*.

We will now develop a very basic neural network from scratch using TensorFlow. Before we start, we assume the following:

- Python (we will use 2.7.x)
- numpy
- Creation of a virtual environment in Python
- Basic idea of Matrix operations
- An available Ubuntu 14.04 (64 bit) machine (or one you can create for the time being with a cloud provider like AWS, to follow along with the code)
- Knowledge of basic Machine Learning ideas (mainly *linear regression*, as we will use it here)

If you want to produce the graphs, you have to install matplotlib in the environment. Finally, this tutorial was developed in a Jupyter notebook, so some of the features used here are only available in a similar environment.

So, let’s start –

```
$ mkdir dnn_ex
$ cd dnn_ex
$ virtualenv venv
$ . ./venv/bin/activate
(venv)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.12.0rc0-cp27-none-linux_x86_64.whl
(venv)$ pip install --upgrade $TF_BINARY_URL
(venv)$ pip install --upgrade numpy
(venv)$ pip install scikit-learn pyreadline Pillow
```

One word of caution: the flavour of TensorFlow we are using here runs only on the CPU. Numerical computations can be very demanding, so for a more real-world scenario with a large amount of training data you may prefer a machine with high-end GPUs and the flavour of TensorFlow that can leverage their power.

TensorFlow is an open source library that uses Data Flow Graphs to perform numerical computations. (Sounds familiar? Just like what Neural Nets are doing? You guessed it right. That is why it is so popular for Deep Learning.)

It was originally developed by the Google Brain team and later open sourced. It supports CPU-based, GPU-based, and distributed computation. It also comes with a utility to serve trained models (TensorFlow Serving). There are other frameworks out there to implement deep neural nets, like Theano or Torch, but we will restrict ourselves to TensorFlow for now.

One last thing about TensorFlow that a lot of beginners (including me, when I started) find difficult to understand at first: when you create the graph, that is, when you declare the variables, constants, and the operations on them, TensorFlow does not execute them right away as you would expect from normal Python code. Instead, you have to create a *session*, put the entire computing graph in the session, and run it from there. TensorFlow will optimize your computing graph internally in many ways, then run it and give you back the result.
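If the "declare first, run later" idea still feels abstract, the following pure-Python analogy may help. It is not TensorFlow code and says nothing about how TensorFlow is implemented; the `Node` class and the `constant`, `add`, and `multiply` helpers are hypothetical names invented for this sketch.

```python
class Node(object):
    """A hypothetical graph node: it stores an operation, not a result."""

    def __init__(self, operation, *parents):
        self.operation = operation
        self.parents = parents

    def run(self):
        # Evaluate the parent nodes first, then apply this node's operation.
        return self.operation(*[parent.run() for parent in self.parents])


def constant(value):
    return Node(lambda: value)


def add(left, right):
    return Node(lambda a, b: a + b, left, right)


def multiply(left, right):
    return Node(lambda a, b: a * b, left, right)


# Declaring the graph computes nothing yet; we only get a Node back.
graph = multiply(add(constant(2), constant(3)), constant(4))
# Only an explicit run() call produces a value.
result = graph.run()
print(result)  # 20
```

Because the whole computation is known before anything runs, a real library has room to rearrange and optimize it, which is the point made above about sessions.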

Enough talk. Now coding –

```
## Let's import the libraries
from __future__ import print_function
import numpy as np
import os
import tensorflow as tf
import matplotlib.pyplot as plt
## If you are not using an environment like Jupyter then the following line is not needed.
%matplotlib inline
# Let us define the total number of training examples
total_training_examples = 100
# Let us create a 2 (rows) x 100 (columns) matrix using numpy
# We use the linspace method from numpy to equally distribute total_training_examples points between two given numbers
init_data = np.array([np.linspace(-4, 2, total_training_examples), np.linspace(-6, 6, total_training_examples)])
# Let us see what this data looks like
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(init_data[0])
ax2.plot(init_data[1])
plt.show()
```

Notice the difference in range along the Y axis.

```
#What is the shape of our input matrix? So that we are really clear on this point.
print(np.shape(init_data))
```

```
(2, 100)
```

```
# Let us add some noise in the data. It is not exciting to work with perfectly linear data anyway!
# We create a 2 x 100 matrix of random numbers from a normal distribution with mean 0 and variance 1, and add them with the linear data
init_data += np.random.randn(2, total_training_examples)
# Let us visualize again
#Let us try to see how this data looks like
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(init_data[0])
ax2.plot(init_data[1])
plt.show()
```

Perfect! a lot of “nice” noise 🙂

```
# Now let us see how they look when plotted against one another
plt.figure(figsize=(4,4))
plt.scatter(init_data[0], init_data[1])
plt.show()
```

We assume that we are modelling a linear regression between two variables using a perceptron. A perceptron is just a single neuron.

What is our aim? We want the machine to learn the best-fitting straight line through the data, that is, the line for which the sum of the squared distances from the data points to the line is at its minimum. The mathematical function that we want to learn is very simple – y = WX + b

Yes, you got it right! It is the equation of a straight line where **W** is the slope (gradient) and **b** is the intercept. The only thing you should keep in mind is that, due to the vectorized nature of the function, it is super simple to extend it over many dimensions. You just have to create the right matrix for the weights, and voila! It will work. However, for the sake of nicely visualizing things, we will limit ourselves to a function of a single variable.
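Here is a small NumPy sketch of what "extending over many dimensions" means in practice. The helper name `predict` and all the numbers are assumptions chosen for illustration; the point is only that the same expression works for one input variable or for several.

```python
import numpy as np

def predict(X, W, b):
    """Evaluate y = XW + b for every row of X at once."""
    return np.dot(X, W) + b

# One input variable: X has a single column.
X_single = np.array([[0.0], [1.0], [2.0]])
y_single = predict(X_single, np.array([2.0]), 2.0)
print(y_single)  # [2. 4. 6.]

# Three input variables: same function, only the shapes change.
X_multi = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
y_multi = predict(X_multi, np.array([2.0, -1.0, 0.5]), 2.0)
print(y_multi)  # [5.  1.5]
```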

```
# Let us separate out the X's and y's
train_x, train_y = init_data
# And print the dimension of both so that we know what we did
print(np.shape(train_x))
print(np.shape(train_y))
```

```
(100,)
(100,)
```

```
# We add a bias term to the X's. Our original function was looking like - WX + b where b is the bias or intercept, so, we need the bias term.
# Usually it is 1
train_x_with_bias = np.array([(1., a) for a in train_x]).astype(np.float32)
# Some essential variables that we are going to use soon.
loss_array = []
epochs = 1000
learning_rate = 0.002 # The learning rate should be low, because that helps gradient descent to converge properly.
```
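Why does a column of ones take care of the intercept? The sketch below, with made-up values for `x` and for the weights, shows that multiplying the `(1, x)` rows by a weight vector `[b, W]` gives exactly `W * x + b`.

```python
import numpy as np

x = np.array([-1.0, 0.0, 3.0])
# Same construction as train_x_with_bias: a constant 1 next to every x.
x_with_bias = np.array([(1., a) for a in x]).astype(np.float32)

# Hypothetical weights: the first entry plays the role of b, the second of W.
weights = np.array([[2.0], [0.5]], dtype=np.float32)

folded = np.dot(x_with_bias, weights)[:, 0]
explicit = 0.5 * x + 2.0
print(np.allclose(folded, explicit))  # True
```

This is why the model further down needs only a single matrix multiplication and no separate bias variable.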

```
with tf.Session() as sess:
    # NOTE: tf_input_x is declared as a tf.Variable, so minimize() below treats it as
    # trainable and adjusts it together with the weights. Declare it with tf.constant
    # to train the weights alone.
    tf_input_x = tf.Variable(train_x_with_bias)
    tf_target = tf.constant(np.transpose([train_y]).astype(np.float32))
    # Time to create the weights. We choose random weights for both the bias term and the X term
    weights = tf.Variable(tf.random_normal([2, 1], 0, 0.1))
    # Let us initialize the variables
    tf.initialize_all_variables().run()
    # We define our model here, that is, the function that predicts a Y given an X
    ypredicted = tf.matmul(tf_input_x, weights)
    # Yes, simple matrix multiplication!
    # What error did the Neural Net make?
    error = tf.sub(ypredicted, tf_target)
    # Let us define a cost function, which calculates the final error.
    # We use the built-in l2_loss method of tf, which returns half the sum of the squared errors.
    # We will use it in future for many things.
    cost = tf.nn.l2_loss(error)
    # We define the training step: one gradient descent update that reduces the cost
    update_weights = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    # Now run this operation for all epochs
    for i in range(epochs):
        update_weights.run()
        # Keep a record of the diminishing loss
        loss_array.append(cost.eval())
    # The training is over. We should have the weights now
    w_final = weights.eval()
    print ("The final weight matrix is - ", w_final)
    print ("Final cost is - ", cost.eval())
    y_final = ypredicted.eval()

# Plot some nice graphs (only numpy arrays are needed from here on)
fig, (ax1, ax2) = plt.subplots(1, 2)
plt.subplots_adjust(wspace=.5)
fig.set_size_inches(12, 6)
ax1.scatter(train_x, train_y, alpha=.5)
ax1.scatter(train_x, np.transpose(y_final)[0], c="g", alpha=.6)
line_x_range = (-6, 4)
ax1.plot(line_x_range, [w_final[0] + a * w_final[1] for a in line_x_range], "g", alpha=0.6)
ax2.plot(range(0, epochs), loss_array)
ax2.set_ylabel("Loss")
ax2.set_xlabel("Epochs")
plt.show()
```

```
The final weight matrix is - [[ 2.17945075]
[ 1.84547126]]
Final cost is - 2.51648e-09
```
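TensorFlow derives the gradient of the cost for us, but for this model it is short enough to write by hand. The NumPy sketch below is not the code that produced the output above: the data is regenerated with an assumed random seed, and the inputs are kept fixed so that only the two weights are tuned. With fixed inputs the cost settles at the residual left by the noise instead of going to zero.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic data in the same spirit as init_data (the seed is an assumption).
x = np.linspace(-4, 2, 100) + rng.randn(100)
y = np.linspace(-6, 6, 100) + rng.randn(100)
X = np.column_stack([np.ones_like(x), x])
targets = y.reshape(-1, 1)

weights = rng.randn(2, 1) * 0.1
learning_rate = 0.002
losses = []

for epoch in range(1000):
    error = X.dot(weights) - targets
    # Same quantity as tf.nn.l2_loss(error): sum(error ** 2) / 2.
    losses.append(float(np.sum(error ** 2) / 2.0))
    # Gradient of that cost with respect to the weights.
    gradient = X.T.dot(error)
    weights -= learning_rate * gradient

# Reference answer from the normal equations of least squares.
reference = np.linalg.solve(X.T.dot(X), X.T.dot(targets))
print(weights.ravel(), reference.ravel())
```

Comparing against the closed-form least squares answer is a cheap sanity check that the loop has really converged.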

```
# A comparison of the actual Y's and the predicted Y's
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(train_y)
ax2.plot(y_final)
plt.show()
```

Very, very similar, aren’t they?

In this article, we learned how to construct a very basic computing unit, or Neuron. We fed it with data, performed operations on that data, and predicted a final result. What we did here is the basis of all Neural Nets everywhere.

We also observed how one can plot the error as a function of epochs and actually check if the error is diminishing as the training process advances. This is often a very good practice when you are trying out models and architectures for a new project.

In the next few articles we will take this basic knowledge to the next level. We will learn how to add multiple layers of neurons to the architecture, how to model Natural Languages, how to analyze the sentiment of any given text, and finally how to make our work available to everyone. So, stay tuned!

Shubhadeep Roychowdhury was born in 1979 in a small, secluded village in Eastern India. He has always been a curious soul and studied Physics and Computer Science at the Bachelor's and Master's level. He has been working as a professional software developer for the last 11 years and has designed and developed a number of successful and highly scalable backend systems across India, the USA, and Europe. He has recently been working on serious deep learning projects, although NLP has always been a passion of his. Living in Paris with his wife, Shubhadeep can often be found trying to write, or travelling with his wife, when he is not working.

Linkedin profile


- Inside Deep Learning, an Intro to NLP - April 25, 2017
- A walk in the park : Getting hands on with Deep Learning and Tensorflow - March 16, 2017