Scalable Cloud-based ETL for Self-serving Analytics
Date Issued
2019
Author(s)
Apanowicz, Cas
Stencel, Krzysztof
Slezak, Dominik
Abstract
Nowadays, companies must inevitably analyze the available
data and extract meaningful knowledge. As an essential prerequisite,
Extract-Transform-Load (ETL) requires significant effort, especially for
Big Data. The existing solutions fail to formalize, integrate and evaluate
the ETL process for Big Data in a scalable and cost-effective way. In
this paper, we introduce a cloud-based architecture for data fusion and
aggregation from a variety of sources. We identify three scenarios that
generalize data aggregation during ETL. They are particularly valuable
in the context of machine learning, as they facilitate feature engineering
even in complex cases where data from an extended time period has
to be processed. In our experiments, we investigate user logs collected
with Kinesis streams on Amazon AWS Hadoop clusters and demonstrate
the scalability of our solution. The considered datasets range from 30
GB to 2.5 TB. The results were deployed in domains such as churn
prediction, fraud detection, service outage prediction and, more generally,
decision support and recommendation systems.
File(s)
Name
2019_07_ICDM_Cloud-basedscalableETL.pdf
Size
607.65 KB
Format
Adobe PDF
Checksum
(MD5): 57321cacb1151af4e532c9473acfb745
