INTRODUCTION TO SUPERVISED MACHINE LEARNING

Linh Mai
7 min read · Dec 16, 2020

WHAT IS SUPERVISED LEARNING?

Nowadays, a major part of machine learning uses supervised learning, and anyone who is new to machine learning will most likely start with supervised algorithms. Supervised learning refers to a class of systems and algorithms that build a predictive model from data points with known outcomes. In other words, supervised learning applies when we have both features (the matrix X) and the target (the vector y): given labeled data, it learns to predict the target y from the features X. In contrast, unsupervised learning is used to find the inherent structure of unlabeled data. You can think of supervised learning as a class with a teacher: the correct answers are known, the algorithm iteratively makes predictions on the training data and is corrected by the teacher, and the learning process stops when the algorithm achieves an acceptable level of performance.
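The features-X / target-y workflow above can be sketched in a few lines. This is a minimal illustration using scikit-learn (an assumed library choice; the data is made up), but any fit/predict library follows the same pattern:

```python
# Minimal supervised-learning workflow: labeled data in, predictions out.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]  # features (the matrix X)
y = [2, 4, 6, 8]          # known outcomes (the target vector y)

model = LinearRegression()
model.fit(X, y)                  # learn from labeled examples
print(model.predict([[5]]))      # predict the target for an unseen input
```

The `fit` step is the "teacher-corrected" learning phase; `predict` then applies the learned model to new data.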

ADVANTAGES AND DISADVANTAGES OF SUPERVISED LEARNING

There are always pros and cons to everything, and supervised learning is no exception: it has its own advantages, but it also carries some limitations. Understanding these points clearly will help us know when to use supervised learning.

1. Advantages

  • Simplicity: supervised learning is a simpler method, compared to unsupervised learning. It is also easier to understand what is happening inside the machine, how it is learning, and so on.
  • High accuracy: because supervised models learn directly from labeled examples, they can reach high accuracy on the task they were trained for, and their performance criteria can be optimized against the known outcomes.
  • The number of classes: you know exactly which classes appear in the training data, and how many, before training begins.

2. Disadvantages

  • Not great with new information: supervised learning has trouble dealing with brand-new inputs. In other words, supervised learning cannot cluster or classify data by itself. For example, if a system was trained on categories of cars and trucks but is presented with motorcycles, it will incorrectly lump them into one category or the other.
  • A lot of computation time in training: because supervised learning needs the data to be labeled, a lot of time goes into retrieving and labeling data and tuning the performance of models.
  • Human errors: the labels are produced by humans, so supervised learning is more likely to contain labeling errors, which can lead to inaccurate learning.

REGRESSION and CLASSIFICATION

Supervised learning problems are further grouped into two main categories: Regression and Classification. For instance, predicting a house's market value from certain features is a regression problem, while predicting whether a customer will soon stop doing business with a company (churn or no-churn) is a classification problem. The main goal of both types of problems is to construct a concise model that can predict the value of the dependent variable from the independent variables. The main difference between the two tasks is that the dependent variable is numerical for Regression and categorical for Classification.

1. Regression

Regression is a part of supervised machine learning. Regression models investigate the relationship between a dependent variable (also referred to as the target) and the independent variables (also referred to as the features). A regression problem is one where the output variable is a real value, such as house price, weight, and so on.

2. Classification

Classification problems are problems in which our prediction space is discrete; that is, there is a finite number of values the output variable can take. A classification problem is one where the output variable is a category, such as churn or no-churn, red or yellow, disease or no-disease, and so on.

WIDELY-USED ALGORITHMS IN SUPERVISED MACHINE LEARNING

1. Linear Regression

Linear Regression establishes a linear relationship between a target and features. It predicts a numeric value, and its fit has the shape of a straight line. The case of one explanatory variable is called "Simple Linear Regression"; with more than one explanatory variable, the process is called "Multiple Linear Regression".

  • Simple linear regression: y = B0 + B1*x1
  • Multiple linear regression: y = B0 + B1*x1 + … + Bn*xn
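The simple linear regression formula above can be fitted directly with the ordinary least-squares estimates for B1 (slope) and B0 (intercept). A small sketch with made-up data (any points that lie exactly on y = 1 + 2*x would do):

```python
import numpy as np

# Fit y = B0 + B1*x1 by ordinary least squares.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 1 + 2*x

B1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope = cov(x, y) / var(x)
B0 = y.mean() - B1 * x.mean()                   # intercept from the means
print(B0, B1)  # B0 ≈ 1.0, B1 ≈ 2.0
```

In practice a library such as scikit-learn's `LinearRegression` does this (and the multiple-variable case) for you, but the closed-form version shows what the coefficients mean.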

2. Decision Tree

A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) is a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable. This is a type of supervised learning algorithm that is mostly used for classification problems. It works for both categorical and continuous variables. To train a decision tree, we first start at the root node. Then, for each variable X, we find the split set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X*, S*} that gives the minimum over all X and S. If a stopping criterion is reached, we stop; otherwise, this splitting step is applied to each child node in turn.
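A brief sketch of a decision tree classifier, using scikit-learn and a tiny hypothetical one-feature dataset (the `max_depth` limit plays the role of the stopping criterion described above):

```python
from sklearn.tree import DecisionTreeClassifier

# Two well-separated groups along a single feature (made-up data).
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# max_depth acts as a stopping criterion; random_state makes the fit repeatable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(tree.predict([[1.5], [11.5]]))
```

The tree finds the single impurity-minimizing split between the two groups, so inputs on either side of it are assigned to the corresponding class.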

3. Random Forest

Random Forest is a machine learning method for regression and classification which is composed of many decision trees. You can think of it as a rainforest with many trees contained in it. Random Forest belongs to a larger class of machine learning algorithms called ensemble methods (in other words, it combines several models to solve a single prediction problem). The number of trees in a random forest is controlled by the n_estimators parameter, and increasing the number of trees helps a random forest reduce overfitting. There is no fixed rule of thumb for choosing the number of trees; it is tuned to the data, typically by starting from a moderate value and increasing it until results stop improving. (A separate common heuristic is to consider roughly the square root of the number of features at each split.)
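A minimal random forest sketch with scikit-learn, reusing the same kind of tiny made-up dataset; `n_estimators` is the parameter discussed above:

```python
from sklearn.ensemble import RandomForestClassifier

X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# n_estimators sets the number of trees; more trees generally reduce the
# variance (overfitting) of the ensemble at the cost of training time.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
print(forest.predict([[1], [11]]))
```

Each tree votes, and the forest returns the majority class, which is why the ensemble is more stable than any single tree.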

4. K-Nearest Neighbor (KNN)

K-Nearest Neighbor, also known as KNN, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. This algorithm can be used in both Classification and Regression problems. It is a distance-based algorithm, meaning that it implicitly assumes that the smaller the distance between two points, the more similar they are. It therefore calculates the distance between data points and assigns a prediction based on the most frequent category among the nearest neighbors (classification) or their average (regression).
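A short KNN sketch with scikit-learn on a made-up two-feature dataset; with `n_neighbors=3`, each prediction is the most frequent class among the three closest training points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two clusters of points (hypothetical data), labeled by color.
X = [[0, 0], [1, 1], [9, 9], [10, 10]]
y = ["red", "red", "yellow", "yellow"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)  # KNN just stores the data; distances are computed at predict time
print(knn.predict([[0.5, 0.5]]))
```

For the query point (0.5, 0.5), the three nearest neighbors are two "red" points and one "yellow" point, so the majority vote is "red".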

5. MLP Classifier

An MLP (multilayer perceptron) classifier is a type of neural network. Neural networks have become popular in machine learning due to their strength in image recognition, speech recognition, and natural language processing.

A neural network trains on the data by mimicking the interconnectivity of the human brain through layers of nodes. To implement a neural network, we feed it the inputs (e.g., location, variety) and the output (e.g., price), and the features in the middle are figured out automatically by the network. This middle layer is called the hidden layer, with its nodes representing hidden units.
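A minimal sketch of scikit-learn's `MLPClassifier` on a tiny made-up dataset. The `hidden_layer_sizes` parameter sets the number of hidden units discussed above; the `lbfgs` solver is an assumption here because it tends to converge well on very small datasets:

```python
from sklearn.neural_network import MLPClassifier

# Tiny one-feature dataset (hypothetical), two well-separated classes.
X = [[0.0], [1.0], [9.0], [10.0]]
y = [0, 0, 1, 1]

# One hidden layer of 5 units; the network learns its own internal features.
mlp = MLPClassifier(hidden_layer_sizes=(5,), solver="lbfgs",
                    max_iter=1000, random_state=1)
mlp.fit(X, y)
print(mlp.predict([[0.5], [9.5]]))
```

Real neural network applications use far more data and often deep-learning frameworks, but the fit/predict interface is the same.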

6. Logistic Regression

Do not be fooled by the word "regression" in the term "Logistic Regression". Logistic regression is used for binary classification. You should use logistic regression when your target variable y takes only two values, for example, True and False, "spam" and "not spam", "churn" and "not churn", and so on. This type of target variable is said to be "binary" or "dichotomous". Logistic Regression is considered a generalized linear model because the outcome always depends on the sum of the inputs and parameters. In other words, the output cannot depend on the product of its parameters.
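A brief logistic regression sketch with scikit-learn on a made-up churn-style dataset (the feature and labels are hypothetical); note that besides the class label, the model also gives a probability for each class:

```python
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]  # a single hypothetical feature
y = [0, 0, 0, 1, 1, 1]                 # 0 = no-churn, 1 = churn

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2], [11]]))   # predicted class labels
print(clf.predict_proba([[2]]))   # [P(no-churn), P(churn)] for one input
```

The probability output is often as useful as the label itself, e.g. for ranking customers by churn risk.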

7. Support Vector Machine (SVM)

Support Vector Machine, also known as SVM, is an algorithm that can be used for both regression and classification. It is usually leveraged for classification problems, where it constructs a hyperplane such that the distance (margin) between the two classes of data points is at its maximum. This hyperplane is called the decision boundary; it separates the classes of data points on either side of the plane.
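A minimal SVM sketch with scikit-learn on a made-up two-feature dataset; a linear kernel finds the maximum-margin hyperplane (the decision boundary) between the two classes:

```python
from sklearn.svm import SVC

# Two linearly separable groups of points (hypothetical data).
X = [[0, 0], [1, 1], [5, 5], [6, 6]]
y = [0, 0, 1, 1]

svm = SVC(kernel="linear")  # other kernels handle non-linear boundaries
svm.fit(X, y)
print(svm.predict([[0.5, 0.5], [5.5, 5.5]]))
```

New points are classified by which side of the decision boundary they fall on.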

8. XGBoost

eXtreme Gradient Boosting, also known as XGBoost, is a decision-tree-based ensemble algorithm that uses a gradient boosting framework. Gradient boosting refers to a class of algorithms, rather than any single one. XGBoost is a stand-alone library that implements popular gradient boosting algorithms in a fast and performant way: many optimizations allow XGBoost to train more quickly than other library implementations of gradient boosting. As a result, XGBoost is among the fastest and highest-performing gradient boosting implementations, and it is a great choice for classification tasks.
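A gradient boosting sketch on a made-up dataset. This uses scikit-learn's built-in `GradientBoostingClassifier` as a stand-in so the example needs no extra installation; XGBoost's own `XGBClassifier` (from the separate `xgboost` package) exposes a very similar fit/predict interface:

```python
from sklearn.ensemble import GradientBoostingClassifier

X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# Boosting fits trees sequentially; each new tree corrects the errors of the
# current ensemble, scaled by the learning rate.
gbm = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 random_state=0)
gbm.fit(X, y)
print(gbm.predict([[1], [11]]))
```

Unlike a random forest, where trees are trained independently and vote, boosted trees are built one after another, which is what the gradient boosting framework refers to.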

CONCLUSION

This post introduced you to the world of supervised machine learning with definitions, advantages and disadvantages, and some of the most commonly used algorithms. In the next post, I will talk about the common metrics used to evaluate the performance of regression and classification models.

Linh Mai

Data Analyst with a professional background in Chemical Research