This project showcases the integration of Google Cloud services, specifically Dataflow, Cloud Functions, and Cloud Scheduler, to create a highly scalable, cost-effective, and easy-to-maintain data processing solution.
The project uses Google Dataflow, a fully managed service for stream and batch data processing that executes Apache Beam pipelines, to handle large-scale data processing tasks.
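For illustration, the kind of pipeline Dataflow runs might look like the minimal Apache Beam sketch below; the bucket, table, and parsing logic are placeholder assumptions, not this project's actual pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Placeholder project, bucket, and table names; replace with your own.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="your-project-id",
        region="us-central1",
        temp_location="gs://your-bucket/temp",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://your-bucket/input/*.csv")
            # Toy parser: split each CSV line into a two-field record.
            | "Parse" >> beam.Map(lambda line: dict(zip(["id", "value"], line.split(","))))
            # Appends to an existing table whose schema matches the records.
            | "Write" >> beam.io.WriteToBigQuery(
                "your-project-id:your_dataset.your_table",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```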
It also demonstrates the use of Cloud Functions, a serverless execution environment that allows you to run code in response to events.
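As a rough sketch of that pattern (assuming an HTTP-triggered function and a hypothetical template path; the real function's trigger and parameters may differ), the function could launch the Dataflow job from a classic template like this:

```python
from googleapiclient.discovery import build


def launch_dataflow(request):
    """HTTP-triggered Cloud Function that launches a Dataflow classic template."""
    project = "your-project-id"                              # placeholder
    region = "us-central1"                                   # placeholder
    template = "gs://your-bucket/templates/your-template"    # placeholder

    # Uses the function's default credentials to call the Dataflow API.
    dataflow = build("dataflow", "v1b3")
    response = (
        dataflow.projects()
        .locations()
        .templates()
        .launch(
            projectId=project,
            location=region,
            gcsPath=template,
            body={"jobName": "scheduled-dataflow-job", "parameters": {}},
        )
        .execute()
    )
    return response["job"]["id"], 200
```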
Google Cloud Scheduler is used to automate the execution of the Cloud Function, ensuring that Dataflow jobs run as needed without manual intervention.
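Creating such a schedule programmatically could look like the following sketch using the google-cloud-scheduler client; the cron expression, region, and function URL are illustrative assumptions.

```python
from google.cloud import scheduler_v1


def create_schedule():
    client = scheduler_v1.CloudSchedulerClient()
    parent = "projects/your-project-id/locations/us-central1"  # placeholder
    job = scheduler_v1.Job(
        name=f"{parent}/jobs/trigger-dataflow",
        schedule="0 2 * * *",   # e.g. daily at 02:00; placeholder cron
        time_zone="Etc/UTC",
        http_target=scheduler_v1.HttpTarget(
            # Placeholder URL of the Cloud Function that launches the job.
            uri="https://us-central1-your-project-id.cloudfunctions.net/launch_dataflow",
            http_method=scheduler_v1.HttpMethod.POST,
        ),
    )
    return client.create_job(parent=parent, job=job)
```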
The project implements a CI/CD pipeline with GitHub Actions for automated deployments, and includes comprehensive error handling and logging for reliable data processing.
Before getting started, ensure you have a Google Cloud account with billing enabled and a GitHub account.
The project requires the creation of a Google Cloud Storage bucket to store your data and a BigQuery dataset where the data will be ingested; because Dataflow is fully managed, no processing cluster needs to be provisioned in advance.
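A minimal provisioning sketch with the google-cloud-storage and google-cloud-bigquery clients, with placeholder names for the project, bucket, and dataset:

```python
from google.cloud import bigquery, storage


def provision(project_id: str = "your-project-id"):
    # Create the Cloud Storage bucket that will hold the input data.
    storage_client = storage.Client(project=project_id)
    storage_client.create_bucket("your-data-bucket", location="US")  # placeholder name

    # Create the BigQuery dataset that Dataflow writes into.
    bq_client = bigquery.Client(project=project_id)
    dataset = bigquery.Dataset(f"{project_id}.your_dataset")  # placeholder name
    dataset.location = "US"
    bq_client.create_dataset(dataset, exists_ok=True)
```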
After creating a service account, grant it storage access permissions, Dataflow permissions, permissions to create and manage Cloud Functions and Cloud Scheduler jobs, and permissions to manage service accounts.
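One way to grant these roles programmatically is through the Cloud Resource Manager API; this is a minimal sketch, and the role list below is an assumption about which predefined roles cover the permissions above, not taken from the project:

```python
from googleapiclient.discovery import build


def grant_roles(project_id: str, service_account_email: str) -> None:
    """Grant an assumed set of roles to the service account (sketch)."""
    member = f"serviceAccount:{service_account_email}"
    roles = [
        "roles/storage.admin",             # storage access
        "roles/dataflow.admin",            # Dataflow jobs
        "roles/cloudfunctions.developer",  # manage Cloud Functions
        "roles/cloudscheduler.admin",      # manage Cloud Scheduler
        "roles/iam.serviceAccountUser",    # act as the service account
    ]
    crm = build("cloudresourcemanager", "v1")
    policy = crm.projects().getIamPolicy(resource=project_id, body={}).execute()
    bindings = policy.setdefault("bindings", [])
    for role in roles:
        binding = next((b for b in bindings if b["role"] == role), None)
        if binding is None:
            bindings.append({"role": role, "members": [member]})
        elif member not in binding["members"]:
            binding["members"].append(member)
    crm.projects().setIamPolicy(resource=project_id, body={"policy": policy}).execute()
```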
Ensure that the required environment variables and secrets are set in the deployment configuration or in GitHub Secrets to configure bucket paths, process names, and steps.
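At runtime the function and pipeline code would then read these values from the environment; the variable names below are illustrative assumptions:

```python
import os

# Illustrative variable names; align them with whatever your workflow sets.
DATA_BUCKET = os.environ["DATA_BUCKET"]                    # e.g. "gs://your-data-bucket"
TEMPLATE_PATH = os.environ["TEMPLATE_PATH"]                # classic template location
PROCESS_NAME = os.environ.get("PROCESS_NAME", "default")   # optional, with a fallback
```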
GitHub Actions uses the service account credentials to authenticate with Google Cloud and run the workflow, which consists of five jobs: enable-services, deploy-buckets, build-dataflow-classic-template, deploy-cloud-function, and deploy-cloud-schedule.
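A skeleton for such a workflow might look like the sketch below; the job names mirror the list above, while the action versions, secret name, and commands are illustrative assumptions:

```yaml
name: deploy
on:
  push:
    branches: [main]

jobs:
  enable-services:
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}   # assumed secret name
      - uses: google-github-actions/setup-gcloud@v2
      - run: >
          gcloud services enable dataflow.googleapis.com
          cloudfunctions.googleapis.com cloudscheduler.googleapis.com

  deploy-buckets:
    needs: enable-services
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/setup-gcloud@v2
      - run: gsutil mb -l us-central1 gs://your-data-bucket || true  # placeholder bucket

  # build-dataflow-classic-template, deploy-cloud-function, and
  # deploy-cloud-schedule follow the same authentication pattern.
```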