Cluster-size optimization within a cloud-based ETL framework for Big Data

Zdravevski, Eftim; Lameski, Petre; Dimitrievski, Ace; Grzegorowski, Marek; Apanowicz, Cas

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/20979

DC Field	Value	Language
dc.contributor.author	Zdravevski, Eftim	en_US
dc.contributor.author	Lameski, Petre	en_US
dc.contributor.author	Dimitrievski, Ace	en_US
dc.contributor.author	Grzegorowski, Marek	en_US
dc.contributor.author	Apanowicz, Cas	en_US
dc.date.accessioned	2022-07-18T07:42:52Z	-
dc.date.available	2022-07-18T07:42:52Z	-
dc.date.issued	2019-12-09	-
dc.identifier.uri	http://hdl.handle.net/20.500.12188/20979	-
dc.description.abstract	The ability to analyze the available data is a valuable asset for any successful business, especially when the analysis yields meaningful knowledge. One of the key processes for acquiring such ability is the Extract-Transform-Load (ETL) process. For Big Data, ETL requires significant effort and is very challenging to perform in a cost-effective way. The literature contains quite a few examples that describe an architecture for cost-effective ETL, but none of them is complete enough, and they are usually evaluated in narrow problem domains. The ones that are more general require specific implementation details. In this paper we propose a cloud-based ETL framework that uses a general cluster-size optimization algorithm and is able to perform the required job within a predefined, and thus known, time; we also provide implementation details. We evaluated the algorithm by executing three scenarios regarding data aggregation during ETL: (i) ETL with no aggregation; (ii) aggregation based on predefined columns or time intervals; and (iii) aggregation within single user sessions spanning arbitrary time intervals. The execution of the three ETL scenarios in a production setting showed that the cluster size could be optimized so that it can process the required data volume within a predefined, and thus expected, latency. The scalability was evaluated on Amazon AWS Hadoop clusters by processing user logs collected with Kinesis streams, with datasets ranging from 30 GB to 2.6 TB.	en_US
dc.publisher	IEEE	en_US
dc.subject	Data streams; ETL; Business analytics; Hadoop; Spark; Cluster size optimization	en_US
dc.title	Cluster-size optimization within a cloud-based ETL framework for Big Data	en_US
dc.type	Proceeding article	en_US
dc.relation.conference	2019 IEEE International Conference on Big Data (Big Data)	en_US
item.fulltext	With Fulltext	-
item.grantfulltext	open	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
crisitem.author.dept	Faculty of Computer Science and Engineering	-
Appears in Collections:	Faculty of Computer Science and Engineering: Conference papers

Files in This Item:

File	Description	Size	Format
2019_SparkET_IEEE_BigData_Eftim_2019.pdf		467.35 kB	Adobe PDF	View/Open


Page view(s): 59 (checked on 4.5.2025)

Download(s): 118 (checked on 4.5.2025)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

UKIM Repository of Papers
