Data Science: Data Preprocessing using Scikit Learn
What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

There are many preprocessing methods, but we will focus mainly on the following:
(1) Encoding the Data
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
Data Encoding
Encoding is the conversion of categorical features to numeric values, since machine learning models cannot handle text data directly. The performance of many machine learning algorithms varies based on how the categorical data is encoded. Two popular techniques for converting categorical values to numeric values are:
- Label Encoding
- One Hot Encoding
Label Encoding
Label Encoding converts each category label into a numeric value so that the data becomes machine-readable. Machine learning algorithms can then operate on those numeric labels directly.
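A minimal sketch of label encoding with scikit-learn's LabelEncoder (the city names here are purely illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical data
cities = ["Paris", "Tokyo", "Paris", "Delhi"]

encoder = LabelEncoder()
# fit_transform learns the unique labels (sorted alphabetically)
# and maps each one to an integer
encoded = encoder.fit_transform(cities)

print(list(encoder.classes_))  # ['Delhi', 'Paris', 'Tokyo']
print(list(encoded))           # [1, 2, 1, 0]
```

Note that the integers are assigned in alphabetical order of the labels, which is what can mislead algorithms into assuming an ordering.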
One Hot Encoder
Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy/order in them. This ordering issue is addressed in another common alternative approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and the column is assigned a value of 1 or 0 (denoting true/false).
Normalization
Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When a dataset has multiple attributes measured on different scales, data mining operations can produce poor models, so the attributes are normalized to bring them onto the same scale.
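A minimal sketch of min-max normalization with scikit-learn's MinMaxScaler, using toy values chosen to show two attributes on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two attributes on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Default feature_range is (0, 1); each column is scaled independently
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)  # both columns now range from 0.0 to 1.0
```

After scaling, both attributes contribute on the same 0-to-1 scale regardless of their original units.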
Standardization
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
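A minimal sketch of standardization with scikit-learn's StandardScaler, again on illustrative toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-attribute data
X = np.array([[10.0], [20.0], [30.0]])

# Subtracts each column's mean and divides by its standard deviation
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean())  # approximately 0
print(X_std.std())   # approximately 1
```

Unlike min-max normalization, standardized values are not bounded to a fixed range; they are centered at zero with unit spread.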
Discretization
Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.
Imputation of missing values
Missing data are values that are not recorded in a dataset. A single cell may be empty, or an entire observation (row) may be absent. Missing data can occur both in a continuous variable (e.g. height of students) and in a categorical variable (e.g. gender of a population).
We can handle missing values in two ways:
- Remove the data (whole row) which have missing values.
- Fill in the values using a strategy (e.g. mean, median, or most frequent) with an imputer such as scikit-learn's SimpleImputer.
You can check the code below.
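A minimal sketch of both approaches on toy data with missing entries, using NumPy for row removal and scikit-learn's SimpleImputer for mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with missing values (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Approach 1: remove every row that contains a missing value
X_dropped = X[~np.isnan(X).any(axis=1)]

# Approach 2: replace each missing value with its column's mean
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_dropped)  # only the complete rows remain
print(X_imputed)  # NaNs replaced by column means (4.0 and 3.0 here)
```

Dropping rows is simple but discards information; imputation keeps every observation at the cost of introducing estimated values.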