Data Science: Data Preprocessing using Scikit Learn
What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

There are many preprocessing methods, but we will focus mainly on the following:
(1) Encoding the Data
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
Data Encoding
Encoding is the conversion of categorical features to numeric values, since machine learning models cannot handle text data directly. The performance of many machine learning algorithms varies based on how the categorical data is encoded. Two popular techniques for converting categorical values to numeric values are:
- Label Encoding
- One Hot Encoding
Label Encoding
Label Encoding converts each category label into a numeric value so that the data becomes machine-readable. Machine learning algorithms can then operate on those numeric labels directly.
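A minimal sketch of label encoding with scikit-learn's LabelEncoder (the city names here are purely illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical data
cities = ["Paris", "Tokyo", "Paris", "Delhi"]

encoder = LabelEncoder()
# fit_transform learns the unique labels (sorted alphabetically)
# and maps each one to an integer
encoded = encoder.fit_transform(cities)

print(list(encoder.classes_))  # ['Delhi', 'Paris', 'Tokyo']
print(list(encoded))           # [1, 2, 1, 0]
```

Note that the integers are assigned in alphabetical order of the labels, which is what can mislead algorithms into assuming an ordering.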
One Hot Encoder
Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy/order in them. This ordering issue is addressed in another common alternative approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and the column is assigned a value of 1 or 0 (denoting true/false).
Normalization
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When a dataset has multiple attributes measured on different scales, data mining operations can produce poor models, so the attributes are normalized to bring them onto the same scale.
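A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler, using toy values chosen to show two attributes on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two attributes on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Default feature_range is (0, 1); each column is scaled independently
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)  # both columns now range from 0.0 to 1.0
```

After scaling, both attributes contribute on the same 0-to-1 scale regardless of their original units.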
Standardization
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
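A minimal sketch of standardization with scikit-learn's StandardScaler, again on illustrative toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-attribute data
X = np.array([[10.0], [20.0], [30.0]])

# Subtracts each column's mean and divides by its standard deviation
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean())  # approximately 0
print(X_std.std())   # approximately 1
```

Unlike min-max normalization, standardized values are not bounded to a fixed range; they are centered at zero with unit spread.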
Discretization
Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
Imputation of missing values
Missing data are values that are not recorded in a dataset. A single cell may be empty, or an entire observation (row) may be absent. Missing data can occur both in a continuous variable (e.g. height of students) and in a categorical variable (e.g. gender of a population).
We can handle missing values in two ways:
- Remove the data (whole row) which have missing values.
- Fill in the values using a strategy (e.g. mean, median, or most frequent) with an imputer such as scikit-learn's SimpleImputer.
You can check the code below.
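A minimal sketch of both approaches on toy data with missing entries, using NumPy for row removal and scikit-learn's SimpleImputer for mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with missing values (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Approach 1: remove every row that contains a missing value
X_dropped = X[~np.isnan(X).any(axis=1)]

# Approach 2: replace each missing value with its column's mean
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_dropped)  # only the complete rows remain
print(X_imputed)  # NaNs replaced by column means (4.0 and 3.0 here)
```

Dropping rows is simple but discards information; imputation keeps every observation at the cost of introducing estimated values.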