From Big Data to business analytics: The case study of churn prediction
Journal
Applied Soft Computing
Date Issued
2020-05-01
Author(s)
Apanowicz, Cas
Ślȩzak, Dominik
Abstract
The success of companies hugely depends on how well they can analyze the available data and
extract meaningful knowledge. The Extract-Transform-Load (ETL) process is instrumental
in accomplishing these goals, but requires significant effort, especially for Big Data. Previous
works have failed to formalize, integrate, and evaluate the ETL process for Big Data problems
in a scalable and cost-effective way. In this paper, we propose a cloud-based ETL framework
for data fusion and aggregation from a variety of sources. Next, we define three scenarios
regarding data aggregation during ETL: (i) ETL with no aggregation; (ii) aggregation based
on predefined columns or time intervals; and (iii) aggregation within single user sessions
spanning over arbitrary time intervals. The third scenario is very valuable in the context
of feature engineering, making it possible to define features as “the time since the last
occurrence of event X”. The scalability was evaluated on Amazon AWS Hadoop clusters by
processing user logs collected with Kinesis streams with datasets ranging from 30 GB to
2.6 TB. The business value of the architecture was demonstrated with applications in churn
prediction, service-outage prediction, fraud detection, and more generally – decision support
and recommendation systems. In the churn prediction case, we showed that over 98% of
churners could be detected, while identifying the individual reason. This allowed support
and sales teams to perform targeted retention measures.
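The third aggregation scenario and the "time since the last occurrence of event X" feature can be illustrated with a minimal single-machine sketch. It is not the authors' framework, which runs on Hadoop clusters. The function names, the `(timestamp, event_name)` tuple layout, and the inactivity-gap rule for delimiting sessions are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def split_sessions(events, max_gap):
    """Group one user's events into sessions.

    Assumption: a new session starts whenever the gap to the previous
    event exceeds max_gap. The source does not state how sessions are
    delimited; this rule is only illustrative.
    """
    sessions = []
    for ts, name in sorted(events):
        if sessions and ts - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((ts, name))
        else:
            sessions.append([(ts, name)])
    return sessions


def time_since_last_event(events, target, now):
    """Return the time elapsed between `now` and the most recent
    occurrence of `target` at or before `now`, or None if the event
    never occurred."""
    last = None
    for ts, name in events:
        if name == target and ts <= now and (last is None or ts > last):
            last = ts
    return None if last is None else now - last
```

A feature such as "time since the last payment" would then be computed per user by calling `time_since_last_event` on that user's log with the prediction date as `now`.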
File(s)
Name
FromBigDatatobusinessanalytics-Thecasestudyofchurnprediction-accepted.pdf
Size
990.05 KB
Format
Adobe PDF
Checksum
(MD5):a7b8df034832c65d36f34e92d6c58450
