Parallel computation of information gain using Hadoop and MapReduce

Zdravevski, Eftim; Lameski, Petre; Kulakov, Andrea; Filiposka, Sonja; Trajanov, Dimitar; Jakimovski, Boro

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/20784

DC Field	Value	Language
dc.contributor.author	Zdravevski, Eftim	en_US
dc.contributor.author	Lameski, Petre	en_US
dc.contributor.author	Kulakov, Andrea	en_US
dc.contributor.author	Filiposka, Sonja	en_US
dc.contributor.author	Trajanov, Dimitar	en_US
dc.contributor.author	Jakimovski, Boro	en_US
dc.date.accessioned	2022-07-15T09:08:43Z	-
dc.date.available	2022-07-15T09:08:43Z	-
dc.date.issued	2015-09-13	-
dc.identifier.uri	http://hdl.handle.net/20.500.12188/20784	-
dc.description.abstract	Nowadays, companies collect data at an increasingly high rate, to the extent that traditional implementations of algorithms cannot cope with it in a reasonable time. On the other hand, analysis of the available data is key to business success. In a Big Data setting, tasks like feature selection, finding discretization thresholds of continuous data, building decision trees, etc., are especially difficult. In this paper we discuss how a parallel implementation of the algorithm for computing the information gain can address these issues. Our approach is based on writing Pig Latin scripts that are compiled into MapReduce jobs, which can then be executed on Hadoop clusters. To implement the algorithm, we first define a framework for developing arbitrary algorithms and then apply it to the task at hand. To analyze the impact of the parallelization, we have processed the FedCSIS AAIA’14 dataset with the proposed implementation of the information gain. During the experiments we evaluate the speedup of the parallelization compared to a one-node cluster. We also analyze how to optimally determine the number of map and reduce tasks for a given cluster. To demonstrate the portability of the implementation, we present results using on-premises and Amazon AWS clusters. Finally, we illustrate the scalability of the implementation by evaluating it on a replicated version of the same dataset, which is 80 times larger than the original.	en_US
dc.publisher	IEEE	en_US
dc.subject	Hadoop, MapReduce, information gain, parallelization, feature ranking	en_US
dc.title	Parallel computation of information gain using Hadoop and MapReduce	en_US
dc.type	Proceeding article	en_US
dc.relation.conference	2015 Federated Conference on Computer Science and Information Systems (FedCSIS)	en_US
item.grantfulltext	open	-
item.fulltext	With Fulltext	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
Appears in Collections:	Faculty of Computer Science and Engineering: Conference papers

Files in This Item:

File	Description	Size	Format
Parallel_computation_of_information_gain.pdf		882.27 kB	Adobe PDF	View/Open

Page view(s): 51 (checked on 4.5.2025)
Download(s): 17 (checked on 4.5.2025)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

UKIM Repository of Papers
