Practical Exam 7IT1

Nov 18, 2021

Roll No: 18IT027

Task 1: Dataset Description using Orange Tool

Task Description

In this task, I was given the dataset risk_factors_cervical_cancer.csv and had to apply the necessary cleaning and preprocessing to improve its classification results.

Data Preprocessing

Data preprocessing is a technique used to transform raw data into a useful and efficient format. It is an important step in the data mining process.

Dataset Description

This dataset contains 858 instances and 36 attributes (columns), of which only 12 are numeric; the rest are categorical. The attributes list risk factors for cervical cancer leading to a biopsy examination.

Encoding

Encoding is a technique for transforming categorical data into a binary or numerical form, which machine learning algorithms can then operate on.
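To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. It is only an illustration, not what Orange does internally; the function name `one_hot_encode` and the sample column are hypothetical and not taken from the dataset.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 vector per input value."""
    # Sorted unique categories give a stable column order.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column matching this value
        rows.append(row)
    return categories, rows


# Hypothetical categorical column, used only for illustration.
smokes = ["yes", "no", "no", "yes"]
categories, encoded = one_hot_encode(smokes)
print(categories)  # ['no', 'yes']
print(encoded)     # [[0, 1], [1, 0], [1, 0], [0, 1]]
```

Each category becomes its own binary column, so no artificial ordering is implied between the categories.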

Normalization

Data normalization is a basic element of data mining. It means transforming the source data into another format, typically by rescaling numeric values to a common range, so that the data can be processed effectively.
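As a rough illustration, one common form of normalization is min-max scaling, which maps values to the interval [0, 1]. The sketch below assumes this variant; the function name `min_max_normalize` and the sample values are hypothetical, and the preprocessing in Orange may be configured differently.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly so they fall in [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical numeric column, used only for illustration.
ages = [18, 30, 42]
print(min_max_normalize(ages))  # [0.0, 0.5, 1.0]
```

Putting features on the same scale matters most for distance-based models such as kNN, where a feature with a large range would otherwise dominate.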

Missing Values Handling

Missing data are values that are not recorded in a dataset. A single value may be missing, or an entire row. We can handle missing values in two ways:

Remove the rows that have missing values.
Fill in (impute) the missing values using some strategy.
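The two approaches above can be sketched in plain Python as follows. This is an illustration under stated assumptions: missing entries are represented as `None`, the imputation strategy is the column mean, and the helper names `drop_missing` and `impute_mean` as well as the sample rows are hypothetical.

```python
from statistics import mean


def drop_missing(rows):
    """Keep only the rows that have no missing (None) entries."""
    return [row for row in rows if None not in row]


def impute_mean(rows, column):
    """Replace None in one column with the mean of its observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        [fill if i == column and cell is None else cell
         for i, cell in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows; None marks a value that was not recorded.
rows = [[25.0, 2.0], [31.0, None], [40.0, 4.0]]
print(drop_missing(rows))    # [[25.0, 2.0], [40.0, 4.0]]
print(impute_mean(rows, 1))  # [[25.0, 2.0], [31.0, 3.0], [40.0, 4.0]]
```

Dropping rows is simple but discards data, while imputation keeps every instance at the cost of introducing estimated values.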

The above image shows the workflow used for Task 1.

The above image shows the preprocessing steps performed.

I then had to compare the classification accuracy on the dataset before and after preprocessing. I set biopsy as the target variable and used several classification models (Random Forest, Naive Bayes, Neural Network, and kNN) to test and score the dataset.

Before (left) and after (right) preprocessing

As you can see, after preprocessing the dataset, the overall precision of the models increased.
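For readers unfamiliar with the metric, precision is the share of predicted positives that are actually positive. The sketch below computes it for a single positive class from confusion counts; the labels are made up for illustration and are not results from this experiment, and the helper names are hypothetical. Orange may report an average over classes rather than this single-class value.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision(actual, predicted, positive=1):
    """Precision = tp / (tp + fp); 0.0 when nothing was predicted positive."""
    tp, fp, _, _ = confusion_counts(actual, predicted, positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up labels, used only for illustration.
actual = [1, 0, 1, 0, 0]
predicted = [1, 1, 1, 0, 0]
print(confusion_counts(actual, predicted))  # (2, 1, 0, 2)
print(precision(actual, predicted))
```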

Below is the image of the confusion matrices from both tests.

Task 2: Generate a dashboard from the preprocessed dataset of Task 1. Find the maximum data insights by plotting a bar chart, box plot, pie chart, and stack plot in a PowerBI dashboard.

What is PowerBI?

PowerBI is a business analytics service provided by Microsoft. It provides interactive visualizations of data and is easy enough to use that end users can create dashboards and reports by themselves.

I have exported the preprocessed data from task 1 and used the same for visualization in PowerBI.