Cluster-size optimization within a cloud-based ETL framework for Big Data
Date Issued
2019-12-09
Author(s)
Dimitrievski, Ace
Grzegorowski, Marek
Apanowicz, Cas
Abstract
The ability to analyze the available data is a valuable asset for any successful business, especially when the analysis yields meaningful knowledge. One of the key processes for acquiring such ability is the Extract-Transform-Load (ETL) process. For Big Data, ETL requires significant effort and is very challenging to perform in a cost-effective way. There are quite a few examples in the literature that describe an architecture for cost-effective ETL, but none of the available examples is complete enough, and they are usually evaluated in narrow problem domains; the ones that are more general require specific implementation details. In this paper we propose a cloud-based ETL framework that uses a general cluster-size optimization algorithm, provides implementation details, and is able to perform the required job within a predefined, and thus known, time. We evaluated the algorithm by executing three scenarios regarding data aggregation during ETL: (i) ETL with no aggregation; (ii) aggregation based on predefined columns or time intervals; and (iii) aggregation within single user sessions spanning arbitrary time intervals. The execution of the three ETL scenarios in a production setting showed that the cluster size could be optimized to process the required data volume within a predefined, and thus expected, latency. The scalability was evaluated on Amazon AWS Hadoop clusters by processing user logs collected with Kinesis streams, with datasets ranging from 30 GB to 2.6 TB.
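The record does not reproduce the paper's optimization algorithm, but the core idea the abstract describes, choosing the smallest cluster that finishes a known data volume within a predefined deadline, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the linear throughput model, the fixed-overhead term, and the name estimate_cluster_size are hypothetical and not taken from the paper.

```python
import math

def estimate_cluster_size(data_gb: float,
                          deadline_minutes: float,
                          node_throughput_gb_per_min: float,
                          overhead_minutes: float = 5.0,
                          max_nodes: int = 100) -> int:
    """Return the smallest worker count expected to finish an ETL job of
    `data_gb` within `deadline_minutes`, assuming throughput scales
    roughly linearly with node count (a simplification; the paper's
    actual model may differ) and a fixed startup/scheduling overhead."""
    usable = deadline_minutes - overhead_minutes
    if usable <= 0:
        raise ValueError("deadline too tight for the fixed overhead")
    nodes = math.ceil(data_gb / (node_throughput_gb_per_min * usable))
    return min(max(nodes, 1), max_nodes)

# Example with the abstract's largest dataset: 2.6 TB within a
# 120-minute deadline at an assumed 2 GB/min per node -> 12 nodes.
print(estimate_cluster_size(2600, 120, 2.0))
```

In practice the per-node throughput figure would come from a calibration run on a sample of the workload, since it differs across the three aggregation scenarios the paper evaluates.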
File(s)
Name
2019_SparkET_IEEE_BigData_Eftim_2019.pdf
Size
467.35 KB
Format
Adobe PDF
Checksum (MD5)
7a1303d3fcd4b1fa8f3cc6b3172667d5
