Parallel computation of information gain using Hadoop and MapReduce
Date Issued
2015-09-13
Author(s)
Abstract
Nowadays, companies collect data at such an increasingly high rate that traditional implementations
of algorithms cannot cope with it in reasonable time. On the
other hand, analysis of the available data is key to business
success. In a Big Data setting, tasks such as feature selection, finding
discretization thresholds for continuous data, and building decision
trees are especially difficult. In this paper we discuss
how a parallel implementation of the algorithm for computing
the information gain can address these issues. Our approach
is based on writing Pig Latin scripts that are compiled into
MapReduce jobs which then can be executed on Hadoop clusters.
To implement the algorithm, we first define a framework
for developing arbitrary algorithms and then apply it to
the task at hand. To analyze the impact of the
parallelization, we have processed the FedCSIS AAIA’14 dataset
with the proposed implementation of the information gain. During
the experiments we evaluate the speedup of the parallelization
compared to a one-node cluster. We also analyze how to optimally
determine the number of map and reduce tasks for a given cluster.
To demonstrate the portability of the implementation we present
results using both on-premises and Amazon AWS clusters. Finally,
we illustrate the scalability of the implementation by evaluating
it on a replicated version of the same dataset which is 80 times
larger than the original.
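The paper's implementation expresses the computation in Pig Latin scripts compiled to MapReduce jobs; as a minimal single-machine sketch of the quantity being parallelized, information gain of a categorical feature with respect to class labels can be computed as follows (function names here are illustrative, not taken from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one categorical feature X."""
    n = len(labels)
    # Partition the labels by feature value (the grouping that a
    # GROUP BY in Pig Latin would distribute across reducers).
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # Conditional entropy: weighted average of per-group entropies.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that perfectly predicts the label yields a gain equal to the label entropy, while an uninformative feature yields zero; in the parallel setting, the per-group counts are what the map and reduce tasks aggregate.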
Subjects
File(s)
Name
Parallel_computation_of_information_gain.pdf
Size
882.27 KB
Format
Adobe PDF
Checksum
(MD5):286352ba6fc181505eebd16a14596ad4
