Subject: Data warehouses, Data streams, ETL, Business analytics

Year: 2019

Type: Proceedings

Title: Scalable Cloud-based ETL for Self-serving Analytics

Author: Zdravevski, Eftim
Author: Apanowicz, Cas
Author: Stencel, Krzysztof
Author: Slezak, Dominik

Abstract: Nowadays, companies must analyze the data available to them and extract meaningful knowledge from it. As an essential prerequisite, Extract-Transform-Load (ETL) requires significant effort, especially for Big Data. Existing solutions fail to formalize, integrate, and evaluate the ETL process for Big Data in a scalable and cost-effective way. In this paper, we introduce a cloud-based architecture for data fusion and aggregation from a variety of sources. We identify three scenarios that generalize data aggregation during ETL. They are particularly valuable in the context of machine learning, as they facilitate feature engineering even in complex cases where data from an extended time period has to be processed. In our experiments, we investigate user logs collected with Kinesis streams on Amazon AWS Hadoop clusters and demonstrate the scalability of our solution. The considered datasets range from 30 GB to 2.5 TB. The results were deployed in domains such as churn prediction, fraud detection, and service outage prediction, and more generally in decision support and recommendation systems.

Relation: ICDM

Identifier: oai:repository.ukim.mk:20.500.12188/22307
Identifier: http://hdl.handle.net/20.500.12188/22307
