Data Science: Data Preprocessing using Scikit Learn

Dhrumil Dalwadi
3 min read · Oct 21, 2021


What is Data Preprocessing?

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

There are many preprocessing methods, but we will focus mainly on the following:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Data Encoding

Encoding is the conversion of categorical features to numeric values, since machine learning models cannot handle text data directly. The performance of most machine learning algorithms varies with how the categorical data is encoded. Two popular techniques for converting categorical values to numeric values are:

  1. Label Encoding
  2. One Hot Encoding

Label Encoding

Label Encoding refers to converting the labels into numeric form so as to make them machine-readable. Machine learning algorithms can then decide in a better way how those labels should be operated on.

One-Hot Encoding

Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order. This ordering issue is addressed by a common alternative approach called 'One-Hot Encoding'. In this strategy, each category value is converted into a new column, and a 1 or 0 (notation for true/false) is assigned to that column.

Normalization

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When there are multiple attributes with values on different scales, data mining operations may produce poor models, so the attributes are normalized to bring them all onto the same scale.
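One common way to do this in scikit-learn is min-max scaling with MinMaxScaler, which rescales each column to the [0, 1] range by default (the feature values below are made-up example data):

```python
# Min-max normalization: (x - min) / (max - min) per column.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # values now lie between 0 and 1
```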

Standardization

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
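In scikit-learn this is done with StandardScaler, which subtracts each column's mean and divides by its standard deviation (the values below are made-up example data):

```python
# Standardization: z = (x - mean) / std, per column.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# The transformed column has mean 0 and unit standard deviation
print(X_std.mean(), X_std.std())
```

Unlike min-max normalization, standardization does not bound values to a fixed range, which makes it less sensitive to outliers in the data.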

Discretization

Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
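A sketch with scikit-learn's KBinsDiscretizer, bucketing a made-up column of ages into three equal-width bins encoded as ordinal values 0, 1, 2:

```python
# Discretization: continuous values -> a small set of ordered buckets.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[6.0], [12.0], [20.0], [28.0],
                 [35.0], [50.0], [65.0], [80.0]])

# Three equal-width buckets over the observed range [6, 80]
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = disc.fit_transform(ages)

print(binned.ravel())  # bucket index per row
```

The `strategy` parameter controls how bin edges are chosen; for example, `"quantile"` would instead put roughly the same number of samples in each bucket.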

Imputation of missing values

Missing data are values that are not recorded in a dataset. They can be a single value missing in a single cell, or an entire missing observation (row). Missing data can occur both in a continuous variable (e.g. height of students) and in a categorical variable (e.g. gender of a population).

We can handle missing values in two ways:

  1. Remove the rows that contain missing values.
  2. Fill in the missing values using a strategy (e.g. the mean or most frequent value) with an imputer.
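The second approach can be sketched with scikit-learn's SimpleImputer, here replacing each missing cell with its column mean (the feature matrix is made-up example data):

```python
# Mean imputation: replace each NaN with the mean of its column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)  # NaNs replaced by 4.0 and 3.0 (the column means)
```

Other `strategy` values such as `"median"`, `"most_frequent"`, or `"constant"` suit different variable types; `"most_frequent"`, for instance, also works for categorical columns.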




Written by Dhrumil Dalwadi

Blockchain and Cyber-security Enthusiast
