Jumia, a Nigeria-based e-commerce company, has migrated from its Hadoop distribution to an AWS serverless platform to build a next-generation data platform driven by metadata specification frameworks. The legacy setup suffered from rising costs, compute that did not scale, job queuing, difficulty adopting modern technologies, complex infrastructure automation, and no support for local development.
The metadata-driven frameworks offer a consistent and efficient approach to data orchestration, migration, ingestion, maintenance, and processing. They promote reusability and scalability, streamline the development workflow, and minimize the risk of errors. They also adhere to data protection requirements and enforce encryption across all services.
The architecture includes frameworks that focus on creating DAGs, dependencies, validations, and notifications. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) handles data orchestration: it supports dynamically created DAGs, integrates natively with non-AWS services, allows dependencies on past executions, and generates accessible metadata.
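As a rough illustration of the metadata-driven idea, the sketch below validates a small task specification and derives an execution order from its dependencies. The spec layout, the task names, and the `plan_tasks` helper are all hypothetical and are not taken from Jumia's framework; the sketch also uses only the Python standard library rather than the Airflow API, so in a real MWAA setup each entry would instead be turned into an Airflow task.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical metadata spec: task name -> list of upstream task names.
SPEC = {
    "ingest_orders": [],
    "ingest_customers": [],
    "build_sales_mart": ["ingest_orders", "ingest_customers"],
    "notify_team": ["build_sales_mart"],
}


def plan_tasks(spec):
    """Validate the spec and return task names in a valid execution order."""
    # Validation 1: every upstream task must itself be defined in the spec.
    unknown = {dep for deps in spec.values() for dep in deps} - set(spec)
    if unknown:
        raise ValueError(f"undefined upstream tasks: {sorted(unknown)}")
    # Validation 2: the dependencies must form an acyclic graph.
    try:
        return list(TopologicalSorter(spec).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = plan_tasks(SPEC)
print(order)
```

Keeping the validation separate from the orchestrator means a malformed specification can be rejected before any DAG is generated.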
Another framework migrates data from HDFS to Amazon S3 in the Apache Iceberg table format. Built in PySpark, it receives a configuration file and runs the migration tasks as an Amazon EMR Serverless job.
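One way such a configuration could drive the migration is to translate each entry into an Iceberg `CREATE TABLE ... AS SELECT` statement. The sketch below is an assumption about how this might look, not the framework's actual design: the config keys, the `glue_catalog` catalog name, the paths, and the `migration_statement` helper are hypothetical, and the target S3 location is assumed to come from the catalog's warehouse setting.

```python
# Hypothetical migration config; in practice this would be loaded from a file.
CONFIG = {
    "catalog": "glue_catalog",
    "tables": [
        {
            "source_path": "hdfs:///warehouse/sales/orders",
            "source_format": "parquet",
            "target": "sales.orders",
            "partition_by": ["order_date"],
        },
    ],
}


def migration_statement(catalog, table):
    """Build a Spark SQL CTAS statement that copies one source into Iceberg."""
    partition = ""
    if table.get("partition_by"):
        partition = f" PARTITIONED BY ({', '.join(table['partition_by'])})"
    return (
        f"CREATE TABLE {catalog}.{table['target']} USING iceberg{partition} "
        f"AS SELECT * FROM {table['source_format']}.`{table['source_path']}`"
    )


statements = [migration_statement(CONFIG["catalog"], t) for t in CONFIG["tables"]]
for statement in statements:
    # Inside the PySpark job, each statement would be passed to spark.sql().
    print(statement)
```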
The data ingestion phase uses a metadata-driven framework with batch and micro-batch modes. In batch mode, a PySpark framework extracts data from different sources (such as Oracle or PostgreSQL). In micro-batch mode, Spark Structured Streaming ingests data from a Kafka cluster and is also capable of running native streams.
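For batch mode, a metadata entry could be mapped to the options of Spark's JDBC data source. The following is a minimal sketch under that assumption: the `SOURCE` layout, host name, and `jdbc_options` helper are hypothetical, while the option keys (`url`, `dbtable`, `driver`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are the standard Spark JDBC ones. Credentials are deliberately left out and would normally come from a secrets store.

```python
DRIVERS = {
    "postgresql": "org.postgresql.Driver",
    "oracle": "oracle.jdbc.OracleDriver",
}
URL_TEMPLATES = {
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "oracle": "jdbc:oracle:thin:@//{host}:{port}/{database}",
}

# Hypothetical metadata entry describing one batch source.
SOURCE = {
    "engine": "postgresql",
    "host": "db.example.internal",
    "port": 5432,
    "database": "shop",
    "table": "public.orders",
    "split": {"column": "order_id", "lower": 1, "upper": 1000000, "partitions": 8},
}


def jdbc_options(source):
    """Translate a metadata entry into Spark JDBC reader options."""
    engine = source["engine"]
    if engine not in DRIVERS:
        raise ValueError(f"unsupported engine: {engine}")
    options = {
        "url": URL_TEMPLATES[engine].format(**source),
        "dbtable": source["table"],
        "driver": DRIVERS[engine],
    }
    split = source.get("split")
    if split:
        # Optional parallel read: Spark splits the column range into partitions.
        options.update(
            partitionColumn=split["column"],
            lowerBound=str(split["lower"]),
            upperBound=str(split["upper"]),
            numPartitions=str(split["partitions"]),
        )
    return options


options = jdbc_options(SOURCE)
print(options)
```

In a PySpark job, the resulting dictionary could be passed along as `spark.read.format("jdbc").options(**options).load()`.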
In the data processing phase, Iceberg serves as the data lake table format, and Spark Structured Streaming reads data from Amazon S3. The maintenance phase relies on a framework that performs maintenance tasks on tables within the data lake, including expiring snapshots and removing old metadata files.
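The two maintenance tasks named above map onto features Iceberg exposes through Spark SQL: the `expire_snapshots` procedure, and the `write.metadata.delete-after-commit.enabled` and `write.metadata.previous-versions-max` table properties. The sketch below only builds the statements; the `maintenance_statements` helper, its default values, and the catalog and table names are assumptions made for illustration, not the settings Jumia uses.

```python
from datetime import datetime, timedelta


def maintenance_statements(catalog, table, now, retention_days=7,
                           retain_last=5, max_metadata_versions=10):
    """Build Spark SQL statements for routine Iceberg table maintenance."""
    cutoff = (now - timedelta(days=retention_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Drop snapshots older than the cutoff, always keeping the newest few.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {retain_last})",
        # Let Iceberg delete old metadata files after each commit.
        f"ALTER TABLE {catalog}.{table} SET TBLPROPERTIES ("
        f"'write.metadata.delete-after-commit.enabled' = 'true', "
        f"'write.metadata.previous-versions-max' = '{max_metadata_versions}')",
    ]


# Hypothetical catalog and table names; a fixed timestamp keeps the output stable.
statements = maintenance_statements(
    "glue_catalog", "sales.orders", now=datetime(2024, 1, 8, 12, 0, 0)
)
for statement in statements:
    # Inside a Spark job, each statement would be passed to spark.sql().
    print(statement)
```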
The rearchitected data platform cut data lake costs by 50%, delivered faster insights, and reduced turnaround time to production. It standardized workflows and ways of working across data teams and created a more reliable source of truth for data assets. The AWS serverless platform also improved scalability, flexibility, integration, and cost efficiency.
Jumia's transformation was led by Hélder Russa, Head of Data Engineering at Jumia Group. Ramón Díez is a Senior Customer Delivery Architect at AWS, while Paula Marenco is a Data Architect at AWS. Pedro Gonçalves is a Principal Data Engineer at Jumia Group.