
Supervised Learning

What do you think of when the word “supervised” comes to mind? Perhaps opening a door for someone because their hands are full of boxes? How about helping your grandparents cross the street? Helping your sibling stand up after he or she tripped during a game you were playing? This general idea of “receiving help” is a good way to understand how supervised learning works.

This “help” in supervised learning comes in the form of knowing the answers beforehand. Let us go back to our math test example from before. It does not make sense to study for an upcoming math test and never verify the right answers. If we do not, we might end up learning bad habits or imagining that a certain technique works when it does not. What we do instead is work through a practice exam and then cross-check our answers against the answer key. This way, we can set aside the techniques and concepts we have already mastered and focus on the things we still lack for the test.

For machines, this means giving them learning materials we call “training data.” This training data is a set that contains thousands, if not millions or billions, of examples to learn from. The dataset also contains the answers, of course, because as the A.I. tries to learn from the data, it compares its own answers against the actual answers to properly tune its parameters. Think of it like memorizing something with flashcards. You answer what the front side of the flashcard is asking to the best of your ability, then flip it over to check whether your answer was right. The A.I. does the same.

After learning from the training dataset, the A.I. is exposed to a testing dataset. This dataset contains examples the A.I. has never seen before. This is done to test whether the A.I. actually learned or whether it just memorized the patterns of the training dataset. In a real-life scenario, it is just like the math example we had earlier. Did we really learn from the practice test, did we not pay enough attention, or did we just memorize the solutions? If an A.I. does not perform well, then it either “underfitted” or “overfitted” its parameters to the training dataset. Underfitting means it did not learn enough from the training set, while overfitting means it essentially memorized the training set. Both scenarios cause an A.I. to perform poorly on new data.
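To make the train/test idea concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, the 80/20 split, and the logistic regression model are all illustrative assumptions, not a fixed recipe:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real labelled dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% of the data as the "never seen before" testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A big gap between these two scores hints at overfitting;
# low scores on both hint at underfitting.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```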

As you can see, supervised learning requires a lot of examples to work with. Practical implementations often train on tens or hundreds of thousands of instances so that real-world predictions are as accurate as possible. Furthermore, this type of learning assumes a well-defined outcome in the end. What if we are just given learning materials and are not told what to do with them?

As we have discussed above, supervised learning is a “with help” type of learning. Our A.I. will learn the task at hand provided we give it the answers in the end. Again, this is like studying for a math test with practice tests and their corresponding answer keys: we cross-check how far or close our answers are from the solutions. When a supervised learning model is trained in real life, the data usually needs to be cleaned and organized first.

The Titanic Dataset

The dataset above is the famous Titanic dataset. It contains valuable information about the passengers of the Titanic, like their name, age, sex, where they embarked, and so on. More importantly, the dataset also contains their survival status: 0 if they died and 1 if they survived.

This is what we call a labelled dataset. It is called labelled because it contains the column that holds our objective. If our objective is to predict each passenger's survival status, and that survival information is present in the data, then we call the dataset labelled. Labelled datasets are important because they contain the very thing our model is trying to predict.
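In code, a labelled dataset is simply a table where one column is the target. Here is a tiny, made-up slice of Titanic-style data using pandas; the rows are invented for illustration and only the column names mirror the real dataset:

```python
import pandas as pd

# A made-up slice of a labelled dataset: "Survived" is the label we want
# the model to predict; every other column is a feature.
df = pd.DataFrame({
    "Name":     ["Passenger A", "Passenger B"],
    "Sex":      ["female", "male"],
    "Age":      [29.0, 22.0],
    "Embarked": ["S", "Q"],
    "Survived": [1, 0],          # 0 = died, 1 = survived
})

features = df.drop(columns=["Survived"])   # what the model sees
labels   = df["Survived"]                  # what the model learns to predict
```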

The only difference between labelled and unlabelled data is whether there is a pre-defined target in mind: labelled datasets contain the information that we want our A.I. to predict. In unlabelled datasets, our targets are not decided yet, meaning we either do not know what we want our A.I. to predict or we simply want to explore the contents and structure of the data first.

The nature of the dataset also reflects which learning method you will be using: labelled for supervised and unlabelled for unsupervised. More on this later.

If we were to predict whether a person survived the Titanic or not, that would be considered a classification problem. We are telling the A.I. to use all the other features to figure out what kind of passenger qualifies as a survivor. Since we are only concerned with whether a passenger died or survived, the A.I. will try to figure out a decision boundary that separates what a survivor looks like from what a non-survivor looks like.
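A rough sketch of that classification setup might look like the following. It assumes a local copy of the dataset saved as "titanic.csv" (a hypothetical path) with the standard column names, and it uses a decision tree purely as an example classifier:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("titanic.csv")            # hypothetical local copy of the dataset

# A handful of simple features; encode sex as 0/1 so the model can use it.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
X = df[["Pclass", "Sex", "Fare"]]
y = df["Survived"]                         # the label: 0 = died, 1 = survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```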

Classification vs. Regression

In the dataset, around 20% of the ages are missing. A technique called imputation can be used to fill in these missing values. A common method is to fill them in with either the mean or the median. Some go further and first group passengers by where they embarked, then use the mean or median age of each group to fill in that group's missing values. If the average or median age of passengers who embarked from Queenstown is X, then every passenger with a missing age who embarked from Queenstown is assigned age X.
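In pandas, both the overall-median approach and the group-by-port approach are short one-liners. This sketch again assumes a hypothetical "titanic.csv" with the usual Age and Embarked columns:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")            # hypothetical local copy of the dataset

# Simple imputation: replace every missing age with the overall median age.
df["AgeMedian"] = df["Age"].fillna(df["Age"].median())

# Group-based imputation: replace missing ages with the median age of the
# passengers who embarked at the same port.
df["AgeByPort"] = df["Age"].fillna(
    df.groupby("Embarked")["Age"].transform("median")
)
```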

However, another technique that can be used is regression. Assuming all the other variables have non-empty values, a data scientist can build a model that imputes the missing age values from those other variables. This model predicts a person's age based on trends present in the rest of the data. Unlike the imputation method presented above, this approach takes all the other variables into consideration when determining the age of the person.
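A sketch of regression-based imputation could look like this, again assuming the hypothetical "titanic.csv" and using a plain linear regression (any regressor would do) on a few columns chosen for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("titanic.csv")            # hypothetical local copy of the dataset
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

predictors = ["Pclass", "Sex", "SibSp", "Parch", "Fare"]
known = df[df["Age"].notna()].dropna(subset=predictors)
missing = df[df["Age"].isna()].dropna(subset=predictors)

# Fit a regressor on the rows where the age is known...
reg = LinearRegression().fit(known[predictors], known["Age"])

# ...and use it to predict (and fill in) the rows where the age is missing.
df.loc[missing.index, "Age"] = reg.predict(missing[predictors])
```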

Regression can also be used for other things, such as predicting market trends or the likelihood of a bad movie review, among many other applications. The same goes for classification. It is up to your imagination how creative an application can be.

Throughout your adventure in learning A.I., three supervised learning models will repeatedly show up: multilayer perceptrons, convolutional neural networks, and recurrent neural networks. Let us jump right in!

Multilayer Perceptron

Imagine the structure of your brain. You have millions, if not billions, of neurons firing every second as you take in information in real time. Reading the newspaper? Neurons fire to bring back relevant information like similar occurrences or on-the-spot reactions and analyses. Practising how to play the piano? The relevant neurons fire and strengthen their connections with practice over time. This is the inspiration behind multilayer perceptrons (MLPs): neurons whose connections grow stronger with more practice.

Multilayer Perceptron (MLP)

MLPs have three structures in them, namely the input layer, the hidden layers, and the output layer. The input layer contains all the relevant “features” that you need, the output layer is your goal, and the hidden layers are nonlinear combinations of your inputs that handle deeper representation and abstraction. Each connection between nodes has a weight that corresponds to it. This weight determines how much one node influences another node. Much like in our own brains, if one node's function does not directly relate to another node's function, then there is no need to strengthen the bond between them.

These weights are important because, as the A.I. learns, it tweaks their values to create the proper pathways for whatever it is the user wants to model. In our Titanic example, more refined paths will more clearly define what makes a Titanic survivor. This weight update is done through backpropagation, which is essentially a repeated application of the chain rule. The simple idea behind backpropagation is that the network measures its errors at the output and sends those errors back through the network so it can make the necessary weight adjustments. The more iterations of forward propagating the information and then backpropagating the error, the better the A.I. becomes.
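Libraries hide the backpropagation bookkeeping for you. As a minimal sketch, scikit-learn's MLPClassifier runs forward propagation and the backpropagation weight updates inside fit(); the synthetic data and the two-hidden-layer shape are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a cleaned-up, labelled dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Two hidden layers of 32 and 16 nodes; fit() repeatedly forward-propagates
# the data and backpropagates the error to adjust the weights.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

print("test accuracy:", mlp.score(X_test, y_test))
```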

In the Titanic dataset, the input layer holds the features you think define a survivor and the output layer is the Survived column. The hidden layers become the many different pools of ideas and combinations that build up to the idea of what a Titanic survivor looks like. The beauty of the MLP is that its decision boundary can become much more complex than those of simpler models. In the image earlier, we saw that the decision boundary was just a line. In the real world, data can be far messier than that, and a simple line will not suffice. Just look at the image below.

Spiral Neural Network Decision Boundary

Spiral decision boundaries are formed in response to the spiral nature of the dataset. A linear decision boundary cannot hope to accomplish the same level of elegance and complexity that an MLP can. This flexibility makes MLPs versatile and powerful, and it is what makes them the building block of the A.I. models to come.

Convolutional Neural Network

Image classification has been around for a long time. Programming computers to tell A from B has always been an interesting problem to solve. In a first attempt, many programmers would probably distinguish A from B based on hand-coded physical features. For example, in programming a ‘cat or dog’ classifier, there would be functions that return whether or not the image has whiskers, paws, claws, a tail, and so on. This would be tedious, as there is an endless stream of cat or dog features you would need to program. An A.I., however, can simply learn what it means to be a cat or a dog from the data you feed it and make more accurate predictions from there.

The problem with just using MLPs is that they do not abstract deeply enough into what an image is. Furthermore, using MLPs is computationally expensive because the number of features grows with the dimensions of the image. If an image is 18×18 pixels with 3 channels (red, green, and blue), that already makes 18 × 18 × 3 = 972 features in total. On top of that, every node in one layer is connected to every node in the next layer. Imagine having 972 nodes in both the input and hidden layers: each of the 972 input nodes connects to all 972 hidden nodes, which is 972 × 972 = 944,784 weights. That is a lot. There should be a more efficient method.
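The arithmetic is worth seeing spelled out; this tiny snippet just reproduces the counts above:

```python
# Back-of-the-envelope count of how quickly fully connected layers blow up.
height, width, channels = 18, 18, 3
features = height * width * channels        # 972 input features

hidden_nodes = 972
weights = features * hidden_nodes           # every input node connects to every hidden node
print(features, weights)                    # 972 944784
```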

CNNs, as they are otherwise called, reduce the dimensions of an image while keeping its important features intact. Imagine using a flashlight to illuminate a painting. You do not light up the entire thing, just a small portion of the image at a time, moving from left to right and from top to bottom. As you light up parts of the image, you pick out only the most important features you see and transfer them somewhere else, creating a new, smaller image in the process. This is called max pooling.

Max Pooling (Supervised Learning)

Above we see an example of what max pooling does. The left square matrix is the numerical representation of our image, while our flashlight's beam is 2×2. As we move this 2×2 window across the image, we keep only the most important feature, which in this case corresponds to the biggest number. This spits out the new matrix on the right, which we can either max-pool again or feed into an MLP.
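Here is a small NumPy sketch of 2×2 max pooling with a stride of 2. The numbers are made up for illustration and are not the ones in the figure:

```python
import numpy as np

# A tiny 4x4 "image"; the values are illustrative only.
image = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 1, 8],
])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # Keep only the biggest number inside each 2x2 patch.
        pooled[i, j] = image[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()

print(pooled)
# [[6. 4.]
#  [7. 9.]]
```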

The MLP in this case receives the max-pooled input, which contains only the most relevant information. That is all the MLP needs: a representation that keeps the most important features, from which it then builds its connections. In a sense, it is like the Pareto principle of focusing on the 20% of inputs that produce 80% of the results. The convolution and max-pooling stages figure out what that 20% of features is, while the MLP exploits those features and learns from them as much as it can.

Recurrent Neural Network

Notice that since we started talking about supervised learning, we have mostly been discussing sequence-independent examples. Image classification is not sequence-dependent. The same goes for classifying surviving passengers or imputing missing age values with classification and regression respectively. But what if we want to predict the movement of stock prices or build a chatbot? Recurrent neural networks, or RNNs as they are often called, are our gems here.

Recurrent Neural Network (Supervised Learning)

Notice anything eerily familiar? Unlike the MLP, the hidden layer feeds its output back into itself at the next time step. This is the time component at work. The network does this for however many time steps the dataset defines and then spits out a prediction in the output layer for your target time. Once you know how MLPs work, you pretty much know how RNNs work too. The one big problem with RNNs is the vanishing/exploding gradient problem. The longer the time period, the further backpropagation has to reach back through time to make weight corrections. If the error values being sent back are too big, the weight corrections explode. On the other hand, if the error values sent back are too small, only the most recent time steps benefit from the change, while the earlier ones get little to no benefit at all.

Another model, called the long short-term memory (LSTM) network, was created to address this problem. It is another way of implementing an RNN that largely sidesteps the vanishing gradient problem by using gates to control what gets remembered and what gets forgotten over time. All you need to know for now is that, as the name implies, it keeps track of both long- and short-term information, which lets it far exceed plain RNN implementations on long sequences.
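As a sketch of what a recurrent model looks like in code, here is a tiny LSTM that learns to predict the next value of a noisy sine wave. It assumes TensorFlow/Keras is installed, and the data, window length, and layer sizes are arbitrary illustrative choices; swapping the LSTM layer for SimpleRNN gives the plain recurrent version:

```python
import numpy as np
import tensorflow as tf

# Synthetic sequence task: predict the next value of a noisy sine wave
# from the previous 20 time steps.
t = np.arange(0, 200, 0.1)
series = np.sin(t) + 0.1 * np.random.randn(len(t))

window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                      # shape: (samples, time steps, features)

model = tf.keras.Sequential([
    # Swap LSTM for tf.keras.layers.SimpleRNN to see the plain recurrent version.
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print("final training loss:", model.evaluate(X, y, verbose=0))
```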

Possibilities

If you think about it, the possibilities with each of the discussed models can be extended depending on how far you stretch them. MLPs, for example, are used to help classify whether a patient is likely to have cancer or not. Meanwhile, CNNs are used by Facebook for its facial recognition system. RNNs and LSTMs are used for automated trading and for generating text.

One smart example of using image recognition is MTailor, a Silicon Valley startup backed by Y Combinator, the startup accelerator once led by Sam Altman. You take a picture of yourself with the application and it determines the proper clothing sizes for you from that picture alone. This is an example of stretching the boundaries of imagination to find clever ways for A.I. to solve problems.

Another possibility for CNNs in the future may be house valuation. Just by photographing key areas of a house, you could combine a CNN with an MLP to build a price approximator. This way, when you are scouting for a new home, you can compare what the agent is offering with what the A.I. estimates the price to be. Extending the idea a bit further, consider a similar application for car sales: an app that approximates the price of a second-hand car might help people negotiate or scout faster.

One last important application of RNNs and MLPs is fake news detection. By tweaking the necessary models under these two and combining them with natural language processing techniques, one could create a fake news detector. You would need a labelled dataset for this, so designing a proper scale on which to rank the “realness” or “fakeness” of a news article will be a challenge.
