Machine Learning: Basic concepts
After finishing my master’s course in Machine Learning, I am going to write a series of stories describing some interesting topics in the most understandable way I can. Let’s start with an introduction to what machine learning is and some basic issues. My background is mathematics, and this series is for people with little or no experience.
A simple definition of Machine Learning: ML is the technology of using known information to create a system that can generalize and find unknown information. In fact, ML consists of many families of algorithms, and each time we have to find the most suitable one for our problem. Examples of such algorithms are neural networks, linear regression and many others with fancy names that you may have heard.
So, let’s define the problem: Suppose we have a dataset D = {(x1, t1), (x2, t2), ..., (xN, tN)}.
Each pair (xi, ti) of the dataset is called a data point/pattern/example.
xi: a vector of one or more dimensions; the components of the vector are called the features of the data point (xi, ti).
ti: the target value; it is the thing we want to predict.
For example, consider a dataset D consisting of 5 data points/patterns. Each one has 2 features, so x = (date, event), and 1 target value, t = Price.
We want to build a system that uses this known information and can generalize to find the Price of any other event on any date that is not in the dataset (and so is unknown). In other words, we want a system that takes as input a vector x = (date, event) and predicts the price with high accuracy.
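To make the vocabulary concrete, here is a minimal sketch of how such a dataset could be stored in Python. All the values below are invented for illustration (they are not the original example data), and the names `dataset`, `features` and `targets` are just my own choices.

```python
# Hypothetical toy dataset: every value here is made up for illustration.
# Each pattern is a pair (x, t): x = (date, event) holds the features,
# t = price is the target value we want to predict.
dataset = [
    (("2024-05-01", "concert"), 40.0),
    (("2024-05-03", "theatre"), 25.0),
    (("2024-05-10", "concert"), 45.0),
    (("2024-06-02", "festival"), 60.0),
    (("2024-06-15", "theatre"), 30.0),
]

features = [x for x, t in dataset]  # the inputs x
targets = [t for x, t in dataset]   # the targets t

print(len(dataset), "patterns,", len(features[0]), "features each")
```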
Types of problems: There are three big categories in ML.
1. Supervised Learning: “I have the target available in the dataset”.
a) The target can be a label. For example, we can have as features x = (age, weight, ..) and as target t = stroke/no stroke. In this case we want a system that takes as input the info of patients who are not in the dataset and predicts their label/category (whether they will have a stroke or not). We can have 2 categories (binary problem) or more than 2 categories (multi-class problem). This problem is called Classification.
b) The target can be a real number, as in the dataset with events, where the target value was the Price. This problem is called Regression.
2. Unsupervised Learning: “I don’t have the target available in the dataset”. As we can imagine, there are plenty of problems where the target value is not available, because in many cases it is expensive to collect the data and, of course, their labels. It is very easy to collect images of cats and dogs from the web, but difficult to label each one of them. More details about this kind of learning will be given in another story.
3. Reinforcement Learning: “we don’t have the target value, but the environment is a kind of supervisor for the system”. The system makes a decision without knowing whether it is good or not, and the supervisor gives feedback on how good the decision was. This kind will be part of another story too.
Describing the solution: First, we have to keep in mind that we will split the dataset into two subsets: the train set and the test set.
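A split like this could be sketched as follows. This is only one simple way to do it, assuming a random shuffle and an 80/20 cut; the function name `train_test_split`, the `test_fraction` parameter and the toy data are my own assumptions, not something fixed by the theory.

```python
import random

def train_test_split(dataset, test_fraction=0.2, seed=0):
    """Shuffle a copy of the dataset and cut it into (train_set, test_set)."""
    shuffled = list(dataset)               # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy dataset of (x, t) pairs, invented for illustration: t = 2x + 1
dataset = [(x, 2.0 * x + 1.0) for x in range(10)]
train_set, test_set = train_test_split(dataset)
print(len(train_set), "train examples,", len(test_set), "test examples")
```

Shuffling before cutting matters when the data are stored in some order (for example by date), otherwise the test set would only contain the last examples.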
The process is this:
1. First, we make an assumption about a model f with parameters θ: we pick a model with random parameters. For example, for a regression problem we can use the linear regression model y = f(x) = wx + w0 with random values of w and w0. We then fit it to the data of the train set (we keep the data of the test set out of the process!). The purpose is to find the best parameters θ (optimize f) so that the model fits the train set well. In other words, for every input x of the train set with target ti, we want a model f such that f(x) = yi is very close to ti.
2. Then we use this model f on unknown inputs x to find the predicted target value yi.
“The ability of model f to predict the target value on unknown data is called the Generalization of the model. We want models with good generalization! We care about models that behave well on unknown data.”
And how do we know if the model f is good enough? How do we find the parameters θ that give the optimized model f with f(x) = yi very close to the real target ti?
The 3 basic steps:
- Choose the model family (define the problem and choose one algorithm of the family; for example, choose fθ as linear regression with parameters θ = (w, w0) for a regression problem).
- Define the Loss Function (this will help us evaluate whether the model fθ is good enough).
- Define a way to find the optimized parameters θ (the optimization task).
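The three steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: one input feature, mean squared error as the loss, and plain gradient descent as the optimizer (one possible choice among many; the learning rate `lr`, the number of `steps` and the toy data are arbitrary).

```python
# Step 1: model family, here linear regression y = f(x) = w*x + w0
def f(x, w, w0):
    return w * x + w0

# Step 2: loss function, here the mean squared error over a dataset
def mse(data, w, w0):
    return sum((t - f(x, w, w0)) ** 2 for x, t in data) / len(data)

# Step 3: optimization, here plain gradient descent on the loss
def fit(data, lr=0.05, steps=2000):
    w, w0 = 0.0, 0.0  # arbitrary starting parameters
    n = len(data)
    for _ in range(steps):
        # Derivatives of the mean squared error with respect to w and w0
        grad_w = sum(-2 * (t - f(x, w, w0)) * x for x, t in data) / n
        grad_w0 = sum(-2 * (t - f(x, w, w0)) for x, t in data) / n
        w -= lr * grad_w    # move each parameter a small step
        w0 -= lr * grad_w0  # in the direction that reduces the loss
    return w, w0

# Toy train set invented for illustration: the targets follow t = 2x + 1
train_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, w0 = fit(train_set)
print(f"w = {w:.3f}, w0 = {w0:.3f}, train error = {mse(train_set, w, w0):.6f}")
```

On this noiseless toy data the loop should end up very close to w = 2 and w0 = 1, the rule used to generate the targets.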
Loss Function: The most common are the Mean Squared Error (1) for regression and the misclassification error (2) for classification.
(1) Mean Squared Error: E = (1/N) Σ (ti − yi)², summed over the N data points of the train set.
Here ti is the real target (a real number) and yi = f(xi) is the value predicted by our model. Each time the error is not good enough, we change the parameters θ of f and calculate the error again, and again, and again... The purpose is to get the smallest error! Remember, this error is measured on the train set, while we want models with good generalization (aka good behavior on unknown data).
(2) Misclassification error: L(t, y) = 1 if y ≠ t, and 0 otherwise.
Here the real target t is a label (for example stroke/no stroke), and we compare it with the predicted label y = f(x). Every time the model f doesn’t find the true label t for an input x (so f(x) is not equal to t), the loss function L takes the value 1; otherwise it takes 0.
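Both loss functions are easy to compute by hand or in code. Here is a small sketch with made-up targets and predictions; the function names are my own, and for classification I average the 0/1 losses over the examples, which gives the fraction of wrong predictions.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - y) ** 2 for t, y in zip(targets, predictions)) / len(targets)

def misclassification_error(targets, predictions):
    """Average 0/1 loss: the fraction of examples with a wrong predicted label."""
    return sum(1 for t, y in zip(targets, predictions) if y != t) / len(targets)

# Regression: the differences are 1, 0 and -2, so the error is (1 + 0 + 4) / 3
regression_error = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])

# Classification: 1 wrong label out of 4, so the error is 1 / 4
classification_error = misclassification_error(
    ["stroke", "no stroke", "no stroke", "stroke"],
    ["stroke", "no stroke", "stroke", "stroke"],
)

print(regression_error, classification_error)
```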
Empirical and generalization error:
The generalization error is the error of model f when predicting the target of unknown data. It’s as if we wanted to predict whether every person on earth will have a stroke or not! We cannot have all these data, so we cannot calculate the generalization error. It would be a blessing if we could!
We only have a specific dataset D in hand, so we can calculate the error of model f using only these data. This is called the empirical error.
Of course we want a very small empirical error. If nothing bad happens (aka overfitting!), a small empirical error will give us a small generalization error as well!
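To see why a small empirical error is not a guarantee by itself, here is a deliberately extreme caricature of overfitting: a "model" that simply memorizes the train set. Everything in it (the data, the `memorizer` function, the fallback guess of 0.0) is invented for illustration.

```python
# Invented data: both lists follow the same rule t = 2x + 1
train_set = [(1, 3.0), (2, 5.0), (3, 7.0)]
unseen = [(4, 9.0), (5, 11.0)]  # never shown to the model

memory = dict(train_set)  # lookup table: x -> t

def memorizer(x):
    # Perfect on inputs it has seen, blind guess (0.0) on anything else
    return memory.get(x, 0.0)

def mse(data, model):
    return sum((t - model(x)) ** 2 for x, t in data) / len(data)

empirical_error = mse(train_set, memorizer)  # zero: every train target is recalled
unseen_error = mse(unseen, memorizer)        # large: (81 + 121) / 2
print(empirical_error, unseen_error)
```

The empirical error is exactly zero, yet the model has learned nothing about the rule behind the data, so it fails on new inputs.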
TRAINING: The process in which we try to reduce the empirical error on a train set, in order to find a model with good generalization (aka good prediction on unknown data).
And how do we test the model’s generalization?
Well, do you remember when I said KEEP THE TEST SET OUT OF THE PROCESS? If you did, then the test set is unknown data for the system we created. We know the target values, but the system doesn’t! So finally, we can calculate the error/accuracy on the test set and see how the model behaves. We want f(x) = y to be very close to t for every data point (x, t) of the test set. This test error is our estimate of the generalization error that I was talking about above!
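Putting the pieces together, a sketch of the whole evaluation could look like this. The data are synthetic (generated from t = 2x + 1 plus random noise), and instead of an iterative optimizer I use the textbook closed-form least-squares solution for a line with one feature; both are my own assumptions to keep the example short.

```python
import random

def fit_line(data):
    """Closed-form least squares for y = w*x + w0 with a single feature."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_t = sum(t for _, t in data) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in data)
    var = sum((x - mean_x) ** 2 for x, _ in data)
    w = cov / var
    return w, mean_t - w * mean_x

def mse(data, w, w0):
    return sum((t - (w * x + w0)) ** 2 for x, t in data) / len(data)

rng = random.Random(0)
# Synthetic dataset invented for illustration: t = 2x + 1 plus Gaussian noise
dataset = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0.0, 0.5)) for i in range(50)]
rng.shuffle(dataset)
train_set, test_set = dataset[:40], dataset[40:]  # the test set stays out of training

w, w0 = fit_line(train_set)           # training uses the train set only
train_error = mse(train_set, w, w0)   # empirical error on the data used for fitting
test_error = mse(test_set, w, w0)     # estimate of the generalization error
print(f"train MSE = {train_error:.3f}, test MSE = {test_error:.3f}")
```

If the two numbers are of similar size, the model behaves on unseen data roughly as it does on the train set; a test error much larger than the train error is the usual warning sign of overfitting.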
- Just a note: there are many Loss Functions; I only talked about two.
- We talked about Supervised Learning. This was an intro to machine learning concepts.
- The optimization part that we left out, about how to find the optimal θ during training, is quite mathy and is part of another story.
- Things to keep: the 3 basic steps & Empirical error vs Generalization error.
*The stories are written from personal understanding, combining information from many books, tutorials, and academic courses. Thanks to my prof. A.L.*