All telecom companies in the group collect data about the transactions that subscribers perform when they make purchases on app stores such as the Google Play Store. This transaction data is processed and analysed to understand subscribers' interest in certain types of apps and thereby identify partnership opportunities with brands. A Spark-based ETL pipeline was built on a transient EMR cluster to process the transaction data and populate the data marts for reporting.
The solution ingests and processes transaction data from content platforms for all the telecom companies in the Singtel group. It was built as a transient ETL pipeline that processes the data and populates data marts for reporting. The pipeline runs on demand, processing raw transactions daily and refreshing the data marts.
Purpose of EMR:
The solution uses an EMR cluster to run Spark-based pipelines that process and transform the transactions. The pipelines read data from S3 and write the transformed records back to S3. Transaction volume was around 25 GB per day.
The solution uses transient clusters to run the Spark pipelines. On-Demand instances of type m4.4xlarge were used. Spot instances were considered, but On-Demand was chosen because the pipelines had to complete within an agreed OLA so that the dashboards were available on time. With Spot instances, an unexpected loss of nodes would mean re-running the pipeline. Application-level optimization was done through partitioning strategies and the choice of columnar storage formats. The data is stored on S3 in daily partitions. A MySQL RDS instance is used as the metadata layer.
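As a rough illustration of the transient, On-Demand cluster setup described above, the sketch below builds the request that could be passed to the EMR run_job_flow API through boto3. It is not the actual configuration used: the bucket, script path, cluster name, release label, node count, and IAM role names are all assumptions made for the example. Only the m4.4xlarge instance type, the On-Demand market, and the transient behaviour come from the description.

```python
from typing import Any, Dict


def build_transient_cluster_request(run_date: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The cluster runs one Spark step and then terminates (transient).
    """
    # Hypothetical Spark job location and arguments (not from the source).
    spark_step = {
        "Name": f"transform-transactions-{run_date}",
        # Shut the cluster down if the step fails, so nothing is left running.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://example-bucket/jobs/transform_transactions.py",
                "--run-date", run_date,
            ],
        },
    }

    def on_demand_group(name: str, role: str, count: int) -> Dict[str, Any]:
        # On-Demand rather than Spot, so node loss does not force a re-run.
        return {
            "Name": name,
            "InstanceRole": role,
            "Market": "ON_DEMAND",
            "InstanceType": "m4.4xlarge",
            "InstanceCount": count,
        }

    return {
        "Name": f"transactions-etl-{run_date}",   # assumed naming scheme
        "ReleaseLabel": "emr-5.36.0",             # assumed release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                on_demand_group("master", "MASTER", 1),
                on_demand_group("core", "CORE", 4),  # assumed node count
            ],
            # False makes the cluster transient: it terminates after the steps.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [spark_step],
        "JobFlowRole": "EMR_EC2_DefaultRole",     # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }
```

With AWS credentials in place, the request could be submitted with boto3.client("emr").run_job_flow(**build_transient_cluster_request("2020-01-15")). Keeping the request as a plain dictionary makes it easy to review and test without starting a cluster.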
The data is ingested into S3 using DMS and then processed by the ETL pipelines running on EMR. The transformed data is written back to S3.
Daily volume was approximately 25 GB. Data was written as CSV files in the landing area.
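To make the daily partitioning and volume figures more concrete, here is a small sketch of how a daily S3 prefix and an output partition count might be derived. The bucket name, the zone names, the Hive-style dt= key, and the 128 MB target file size are assumptions for illustration only; the source states just that data is partitioned daily and that volume is about 25 GB per day.

```python
import math
from datetime import date


def daily_prefix(base_uri: str, zone: str, run_date: date) -> str:
    """Return the S3 prefix for one day's data in a given zone."""
    # Hive-style dt=YYYY-MM-DD key (assumed naming convention).
    return f"{base_uri.rstrip('/')}/{zone}/dt={run_date.isoformat()}/"


def target_partition_count(daily_bytes: int,
                           target_file_bytes: int = 128 * 1024**2) -> int:
    """Estimate how many output files to write for one day's data."""
    # Round up so no file exceeds the target size; always write at least one.
    return max(1, math.ceil(daily_bytes / target_file_bytes))


run_date = date(2020, 1, 15)
landing = daily_prefix("s3://example-bucket", "landing", run_date)
# landing == "s3://example-bucket/landing/dt=2020-01-15/"

partitions = target_partition_count(25 * 1024**3)
# 25 GB split into 128 MB files gives 200 partitions
```

In a Spark job, a count like this could be passed to repartition before writing the columnar output, so that each daily partition holds a manageable number of evenly sized files.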