DATA CLEANING: A CRUCIAL STEP BEFORE BUILDING MACHINE LEARNING MODELS

Linh Mai
Nov 10, 2020 · 7 min read

Every new project gets most of us excited to dive in and analyze the patterns and information behind the dataset. However, almost no dataset is perfect. When a dataset contains a large amount of data, some of it is almost certainly messy and uncleaned. The cause is usually human error: incomplete entries, unnoticed duplicates, or typographical mistakes. Using that data directly, without cleaning, will do more harm than good and will lead to inaccurate conclusions. With that being said, the very next step after importing data files is to clean the data: deal with missing values, correct wrongly formatted values, and put aside any values that are unnecessary for building models later. Data cleaning is an important but tedious step; data scientists spend roughly 60% to 80% of their time on this task. In this post, I will walk you through multiple data cleaning tasks using Python and Pandas. Hopefully, it will help you shorten the time spent cleaning data and leave more time for exploring data, building machine learning models, and analyzing the final results. We will focus on three main tasks in data cleaning: missing values, duplicates, and outliers.

GENERAL REVIEW OF THE DATASET

In any problem, in order to target and solve the issues effectively, we need to have a general view of what we are working with. In this way, we can rank the priority of the problems and plan out the solution. Some of the questions that we can start with to have a general view of our dataset are:

· What is the size of our dataset?

· What are the features that we are working with?

· Are those features in the expected format (string, boolean, integer, float)?

There are two commands that we can use in Python to answer these questions:

1. View the first 5 rows of the dataframe:

df.head()

2. Summary of the dataframe:

 df.info()

From the df.info() summary, we know that there are 21 different features and more than 21,500 data points. At a glance, the ‘date’ column is supposed to be ‘datetime64’, not object or, in other words, string format. Similarly, the ‘sqft_basement’ column indicates the square footage of the basement, which should be in either integer or float format, not object format. This leads us to the very first task in data cleaning: correcting the format of columns.

DATA CLEANING

1. Re-format the types of columns:

To re-format the type of the desired columns in our dataset, we will use commands from the pandas library, which is imported as pd.

· Convert argument to datetime:

df['date'] = pd.to_datetime(df['date'])

· Convert argument from string/object to float64 or int64:

df['sqft_basement'] = df['sqft_basement'].astype(str).astype('float64')

Note: sometimes astype() will throw an error because the name ‘float64’ or ‘int64’ is not recognized. This happens when the dtype is passed as a bare name instead of a string; you can fix it by quoting the dtype (as above) or by importing float64 and int64 from the numpy library.
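If the column mixes numbers with non-numeric placeholder strings, astype() will also fail. Here is a minimal sketch of one workaround, assuming the column may contain entries that cannot be parsed as numbers; pd.to_numeric turns them into NaN instead of raising an error:

import pandas as pd

# Coerce anything that cannot be parsed as a number to NaN
df['sqft_basement'] = pd.to_numeric(df['sqft_basement'], errors='coerce')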

2. Deal with Missing Values:

Dealing with missing values in a dataset is a huge part of data cleaning. This task is harder than it seems. There are two types of missing values:

· Placeholder values are missing values denoted by a value that looks valid to pandas but does not actually exist.
For instance, a continuous ‘weight’ column may contain values such as 0 or 9999, which seem valid to pandas because they are indeed integers. However, they are pragmatically impossible, as no human being weighs 0 lb or 9999 lbs. With continuous data columns, we can check for placeholder values by using the command below and looking for any numerical value that shows up as statistically improbable:

df['weight'].value_counts()

With categorical data, we can use the command below to check whether any value does not match up with our expectations:

df['column_name'].unique()
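Once placeholder values are identified, one common fix (a sketch, assuming the ‘weight’ example above and that 0 and 9999 are the placeholders) is to convert them into real null values so they can be handled together with the NaNs described next:

import numpy as np

# Turn the impossible placeholder values into proper missing values (NaN)
df['weight'] = df['weight'].replace([0, 9999], np.nan)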

· NaN stands for “Not a Number” and is how pandas represents null values. We can get the total number of null values in each column of the dataset by using the command:

df.isna().sum()

There are several strategies for dealing with null values. You need to know exactly how many values are missing in order to choose the best solution, as you do not want to delete too many data points from the dataset or replace them incorrectly, which would lead to inaccurate results at the end.

a) REMOVE or DELETE

We only choose this solution if we know for sure that we have a very large dataset and the number of null values is small (up to 5%).

· Drop the entire column: use this as a last resort. However, if a feature does not add any predictive value to the machine learning model, dropping it has no cost.

df.drop('column_name', axis=1, inplace=True)

· Drop rows that contain missing values: use this when there are very few missing values.

df.dropna(subset=['column_name'], inplace=True)

b) REPLACE

This method will somewhat influence the distribution of the variable, so you need to consider the trade-off to decide whether replacing is worth it. For example, when we impute the mean, it reduces the variance.
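A quick illustration of that trade-off, using a toy series with made-up numbers rather than our dataset:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 8.0])
print(s.var())                    # variance of the observed values only
print(s.fillna(s.mean()).var())   # variance shrinks after mean imputation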

· Continuous Data: we can replace the missing values with the median value of that column

df['column_name'] = df['column_name'].fillna(df['column_name'].median())

· Categorical Data: if one categorical value is much more common than the others in the column, we can replace the missing values with that value.
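A minimal sketch of this, where ‘column_name’ stands in for your own categorical column:

# Fill missing values with the most frequent category (the mode)
most_common = df['column_name'].mode()[0]
df['column_name'] = df['column_name'].fillna(most_common)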

c) KEEP

· Continuous Data: we can use Coarse Classification or Binning, which means converting the entire column from a continuous column to a categorical column.

For instance, we can separate each person into age ranges: ages 0 to 10 become 1, ages 11 to 20 become 2, and so on (see the sketch below). After that, we can throw out the continuous version of that column and keep the missing values as one valid category inside the new categorical column. Alternatively, we could keep these missing values as NA, but that is not possible for many machine learning algorithms.
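Here is a minimal binning sketch with pd.cut, assuming a hypothetical ‘age’ column; the bin edges and labels are arbitrary choices for illustration:

import pandas as pd

bins = [0, 10, 20, 30, 40, 120]
labels = ['0-10', '11-20', '21-30', '31-40', '41+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)

# Keep the missing values as their own category instead of dropping them
df['age_group'] = df['age_group'].cat.add_categories('Missing').fillna('Missing')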

· Categorical Data: we treat the missing values as their own category by replacing them with the string “NaN” or any other string that makes it obvious that these values are missing.

df['column_name'] = df['column_name'].fillna('Not Defined')

So far, you may find it overwhelming to deal with missing values in continuous and categorical data columns. Here is a quick summary for reference when doing data cleaning:

· Remove: drop the column (as a last resort) or drop the affected rows, only when the dataset is large and few values are missing.

· Replace: fill continuous data with the median, and categorical data with the most common value.

· Keep: bin continuous data into categories, or treat missing categorical values as their own category.

3. Deal with Duplicates:

In order to check for duplicates, we need to know which column in our dataset contains unique values that distinguish each data point. For instance, it can be ‘ID’, ‘customer_name’, ‘ISBN-13’, and so on. To check for duplicate rows based on a dataframe column, we can use the command df.duplicated() in Python and combine it with len() to count the total number of duplicates. Obviously, the main goal is to eliminate these duplicated values, and the command df.drop_duplicates() will help us complete this job.

# Review the duplicated values
duplicates = df[df['id'].duplicated()]
duplicates.head()
# Count the total number of duplicates
len(df[df['id'].duplicated()])
# Eliminate the duplicated values
df.drop_duplicates(subset='id', inplace=True)

4. Deal with Outliers:

There are multiple ways to check for outliers in a dataset: by descriptive statistics and by visualization.

a) Descriptive Statistics

Pandas describe() is used to view basic statistical details such as the percentiles, mean, and standard deviation of a data frame or a series of numeric values. When this method is applied to a string column, the numeric statistics are not computed; instead, it returns a different set of statistics: the count of values, the number of unique values, the top (most frequent) value, and its frequency of occurrence.

df.describe()
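By default, describe() summarizes only the numeric columns. To see the string-column statistics mentioned above, a small sketch (assuming the dataframe has at least one object-typed column):

# Summarize only the string (object) columns: count, unique, top, freq
df.describe(include=['object'])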

b) Visualization

The type of plot that I really love to use for showing outliers clearly is Seaborn boxplot. According to a blog post by AJ Pryor, boxplots summarize numeric data over a set of categories. The data is divided into four groups called quartiles. A box is drawn connecting the innermost two quartiles, and a horizontal line is drawn at the position of the median (which always falls within the box). Usually, a second set of lines will be drawn some distance from the inner box denoting a “maximum” and “minimum” value for the data, and then values existing outside of these extrema are considered outliers and plotted as individual points.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set(style="darkgrid")
# Set the color and the type of marker for the outlier points
flierprops = dict(marker='o', markersize=5, markerfacecolor='red')
# Melt the dataframe so every column becomes a (variable, value) pair
sns.boxplot(x="variable", y="value", data=pd.melt(df3),
            linewidth=2.5, palette="Set3", flierprops=flierprops)
plt.show()

If there are only 1 or 2 outliers in the column, according to the descriptive statistics table and the boxplot, we can simply drop those observation points from the dataset.

df = df.drop(df.index[df['bedrooms'] == 33]).copy()

However, if there are multiple outlier points, we can limit the range of values in the column.

# Limit the ranges in columns to remove outliers
df = df.loc[(df['account length'] < 225)].copy()
df = df.loc[(df['total day minutes'] > 25) &
            (df['total day minutes'] < 340)].copy()

CONCLUSION

Facing a messy dataset and finding a solution for it can be a tedious task. However, if we break the problem down into smaller parts, as I showed above, it will save us time in the process of cleaning data. That leaves more time for us to explore the data, build more accurate machine learning models, and analyze the final results.
