Data Science: Data Preprocessing with Orange Tool

Dhrumil Dalwadi
3 min read · Oct 21, 2021

Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted. Orange is a data mining tool in which analysis is done by stacking components into workflows. It can also be used as a Python library, which is the approach taken in this post.

Orange Tool With Python

Using Orange in Python is straightforward. First, install Orange3:

pip install Orange3

Discretization

Data discretization converts a large number of data values into a smaller number of groups so that evaluating and managing the data becomes easier. In other words, it maps the values of a continuous attribute onto a finite set of intervals with minimal loss of information. Orange's Discretize preprocessor performs discretization; the example below splits each continuous feature of the iris dataset into three equal-frequency intervals.

import Orange

# Load the iris dataset that ships with Orange
brown = Orange.data.Table("iris.tab")
# Split every continuous feature into 3 equal-frequency intervals
disc = Orange.preprocess.Discretize()
disc.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_brown = disc(brown)
print("Original dataset:")
for e in brown[:3]:
    print(e)
print("Discretized dataset:")
for e in d_brown[:3]:
    print(e)
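
To make the idea of equal-frequency intervals more concrete, here is a minimal plain-Python sketch. It is not Orange's implementation: the helper names equal_freq_cut_points and discretize, the sample values, and the choice of midpoints as thresholds are all assumptions made for illustration.

```python
def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds that split the values into bins of
    roughly equal size. Assumes there are at least n_bins values."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, n_bins):
        idx = k * len(ordered) // n_bins
        # Place the threshold halfway between the two neighbouring values
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def discretize(value, cuts):
    """Return the index of the interval that the value falls into."""
    return sum(value >= c for c in cuts)


# Hypothetical measurements, not taken from the iris dataset
sample = [5.1, 4.9, 4.7, 6.4, 6.9, 5.5, 6.3, 5.8, 7.1]
cuts = equal_freq_cut_points(sample, n_bins=3)
bins = [discretize(v, cuts) for v in sample]
print(bins)  # [0, 0, 0, 2, 2, 1, 1, 1, 2]
```

Each of the three intervals ends up with three of the nine values, which is what "equal frequency" means here. Equal-width binning would instead split the range between the minimum and maximum into intervals of the same length.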

Continuization

Continuization refers to the transformation of discrete (binary or multinomial) variables into continuous ones. Given a data table, the Continuize preprocessor returns a new table in which the discrete attributes are either replaced with continuous ones or removed:

  • continuous variables can be normalized or left unchanged;
  • discrete attributes with fewer than two possible values are removed;
  • binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables;
  • multinomial variables are treated according to the flag multinomial_treatment.

Continuize.Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one, and others will be zero. This is the default behavior.

For example, as shown in the code snippet below, the dataset “titanic” has a feature “status” with the values “crew”, “first”, “second” and “third”, in that order. Its value in the row at index 10 is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”.

titanic = Orange.data.Table("titanic")
# The default treatment replaces each discrete variable with indicator variables
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
print("Before continuization:", titanic.domain)
print("After continuization:", titanic1.domain)
# Index 10 is the 11th row, since indexing starts at 0
print("Row at index 10 before:", titanic[10])
print("Row at index 10 after:", titanic1[10])
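
The indicator treatment itself can be sketched in a few lines of plain Python. This is only an illustration of the encoding described above, not Orange's code; the helpers indicator_columns and to_indicators are hypothetical names, while the "status" values are the ones listed in the text.

```python
def indicator_columns(name, values):
    """Build one column name per value of the original variable."""
    return [f"{name}={v}" for v in values]


def to_indicators(value, values):
    """Return 1.0 in the position of the matching value and 0.0 elsewhere."""
    return [1.0 if value == v else 0.0 for v in values]


status_values = ["crew", "first", "second", "third"]
columns = indicator_columns("status", status_values)
encoded = to_indicators("first", status_values)

for column, flag in zip(columns, encoded):
    print(column, flag)
```

For a row whose status is "first", only the "status=first" column is one and the other three are zero.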

Normalization

Normalization scales the values of an attribute so that they fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are on different scales; otherwise, an equally important attribute on a smaller scale can lose its effect because other attributes have values on a larger scale. We use the Normalize preprocessor to perform normalization.

from Orange.preprocess import Normalize
# NormalizeBySpan rescales each feature using its minimum and maximum
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(brown)
print("Before normalization:", brown[2])
print("After normalization:", normalized_data[2])
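
As a rough illustration of what normalizing by span means, the following plain-Python sketch rescales a list of numbers using its minimum and maximum. The function normalize_by_span, its zero_based parameter, and the sample values are assumptions for this example and are not taken from Orange's source.

```python
def normalize_by_span(values, zero_based=True):
    """Rescale values to 0.0..1.0 (zero_based=True) or -1.0..1.0 (False)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no scale information; map it to 0.0
        return [0.0 for _ in values]
    if zero_based:
        return [(v - lo) / span for v in values]
    return [2 * (v - lo) / span - 1 for v in values]


lengths = [1.0, 2.0, 4.0, 5.0]  # hypothetical measurements
print(normalize_by_span(lengths))                    # [0.0, 0.25, 0.75, 1.0]
print(normalize_by_span(lengths, zero_based=False))  # [-1.0, -0.5, 0.5, 1.0]
```

The smallest value always maps to the lower bound and the largest to the upper bound, so attributes measured on very different scales become comparable.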

Randomization

Given a data table, the Randomize preprocessor returns a new table in which the data is shuffled. The example below passes Randomize.RandomizeClasses, so it is the class values that get shuffled.

from Orange.preprocess import Randomize
# Shuffle the class values of the iris table loaded earlier
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(brown)
print("Before randomization:", brown[2])
print("After randomization:", randomized_data[2])
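
Shuffling only the class values can be pictured with the small plain-Python sketch below. It is an illustration under stated assumptions rather than Orange's implementation: shuffle_classes, the feature rows, and the labels are all made up for the example.

```python
import random


def shuffle_classes(features, labels, seed=None):
    """Pair each feature row with a label drawn from a permutation of the
    labels, leaving the feature rows in their original order."""
    rng = random.Random(seed)
    shuffled = list(labels)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return list(zip(features, shuffled))


features = [[5.1, 3.5], [7.0, 3.2], [6.3, 3.3], [4.9, 3.0]]  # hypothetical rows
labels = ["setosa", "versicolor", "virginica", "setosa"]     # hypothetical classes

randomized = shuffle_classes(features, labels, seed=42)
for row, label in randomized:
    print(row, label)
```

The set of labels stays the same, but the link between a row and its label is broken. Because the shuffle is random, an individual row can still keep its original label by chance.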
