Ed Crewe will be speaking at PyConLT 2025 about cloud pricing complexity and data pipelines for EDB's Postgres AI product.
Cloud pricing is complex: tracking it means managing nearly 5 million prices across AWS, Azure, and GCP, spanning each provider's services, types, tiers, and regions.
To estimate costs for customers, a data pipeline was built with Python, Airflow, and Postgres, replacing a third-party service.
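As a rough illustration of that architecture, a minimal Airflow 2.x DAG wiring together scrape, transform, and load steps might look like the sketch below; the DAG id, task names, and callables are all hypothetical, not EDB's actual code.

```python
# A minimal sketch of an Airflow DAG for a scrape -> transform -> load
# flow; task names and callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape_prices(**context):
    """Fetch raw price data from each cloud provider's pricing API."""
    ...


def transform_prices(**context):
    """Normalise raw prices into a common schema."""
    ...


def load_prices(**context):
    """Load the normalised prices into the target Postgres schema."""
    ...


with DAG(
    dag_id="cloud_price_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape", python_callable=scrape_prices)
    transform = PythonOperator(task_id="transform", python_callable=transform_prices)
    load = PythonOperator(task_id="load", python_callable=load_prices)

    scrape >> transform >> load
```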
The pipeline's Python code uses an abstract base class for scrapers, Psycopg for fast bulk database updates, and a Go-based embedded Postgres to run throwaway database instances.
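A minimal sketch of the scraper base-class idea, assuming psycopg 3 and a hypothetical `prices` table; the COPY protocol is what makes bulk loads much faster than row-by-row INSERTs.

```python
# Sketch of an abstract base scraper; the subclass, table, and columns
# are hypothetical, not EDB's actual class hierarchy.
from abc import ABC, abstractmethod

import psycopg


class PriceScraper(ABC):
    """Common interface each cloud provider scraper implements."""

    @abstractmethod
    def fetch(self) -> list[tuple]:
        """Return rows of (provider, service, region, sku, price)."""

    def save(self, conninfo: str) -> None:
        """Bulk-insert scraped rows via COPY, far faster than
        row-by-row INSERTs for millions of prices."""
        with psycopg.connect(conninfo) as conn, conn.cursor() as cur:
            with cur.copy(
                "COPY prices (provider, service, region, sku, price) FROM STDIN"
            ) as copy:
                for row in self.fetch():
                    copy.write_row(row)


class AWSScraper(PriceScraper):
    def fetch(self) -> list[tuple]:
        # Call the AWS Pricing API here; stubbed for the sketch.
        return [("aws", "ec2", "eu-west-1", "m5.large", 0.107)]
```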
Each step writes to its own temporary Postgres database, which keeps the steps' data handling independent while ensuring compatibility with the final target database.
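One way to implement that isolation, sketched here with psycopg 3 and hypothetical names; autocommit is needed because CREATE DATABASE and DROP DATABASE cannot run inside a transaction.

```python
# Sketch of giving each pipeline step its own throwaway database;
# the connection string and database names are hypothetical.
import contextlib

import psycopg


@contextlib.contextmanager
def step_database(admin_conninfo: str, name: str):
    with psycopg.connect(admin_conninfo, autocommit=True) as admin:
        admin.execute(f'CREATE DATABASE "{name}"')
    try:
        yield name
    finally:
        with psycopg.connect(admin_conninfo, autocommit=True) as admin:
            admin.execute(f'DROP DATABASE "{name}"')


# Usage: a failed step cannot corrupt another step's data.
# with step_database("dbname=postgres", "scrape_aws") as db:
#     ...
```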
A Click-based command-line interface helps with developing and testing the pipelines, including the ability to run an individual scrape in isolation for debugging.
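A minimal sketch of such a CLI, assuming a hypothetical registry mapping provider names to scraper objects:

```python
# Hypothetical Click CLI for running a single provider's scrape on its
# own, which is much quicker to debug than the whole pipeline.
import click

SCRAPERS = {"aws": ..., "azure": ..., "gcp": ...}  # provider -> scraper


@click.command()
@click.argument("provider", type=click.Choice(sorted(SCRAPERS)))
@click.option("--limit", default=None, type=int, help="Cap rows for a quick run.")
def scrape(provider: str, limit: int | None) -> None:
    """Run one provider's scrape in isolation."""
    click.echo(f"Scraping {provider} (limit={limit})")
    # SCRAPERS[provider].run(limit=limit)


if __name__ == "__main__":
    scrape()
```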
Unit testing is made easier by substituting mock response objects for the live pricing APIs, enabling functional tests of the whole scrape-and-create ETL cycle.
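A pytest-style sketch of that approach, assuming the scraper fetches JSON via requests; the payload shape and URL are hypothetical.

```python
# Sketch of unit-testing scraper parsing against a canned response
# instead of a live pricing API.
from unittest import mock

import requests


class FakeResponse:
    """Stands in for requests.Response with only what the scraper uses."""

    status_code = 200

    def json(self):
        return {"products": [{"sku": "m5.large", "pricePerUnit": "0.107"}]}


def test_scrape_parses_prices():
    # Patch the HTTP call so the test exercises only the parsing logic.
    with mock.patch("requests.get", return_value=FakeResponse()):
        data = requests.get("https://pricing.example/api").json()
        prices = [(p["sku"], float(p["pricePerUnit"])) for p in data["products"]]
    assert prices == [("m5.large", 0.107)]
```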
Pipeline frameworks such as Airflow and Dagster offer a local development mode and fast testing of individual DAG steps, which improves the development experience considerably.
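For example, Airflow 2.5+ can execute a whole DAG in-process with dag.test(), with no scheduler or executor involved; the module name below refers back to the hypothetical DAG sketched earlier.

```python
# Run the DAG end-to-end locally for a fast feedback loop;
# cloud_price_pipeline is the hypothetical module from the earlier sketch.
from cloud_price_pipeline import dag

if __name__ == "__main__":
    # Executes every task in dependency order, in-process; the first
    # task failure raises immediately.
    dag.test()
```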
Soda checks validate the correctness of the scraped data, confirming that the expected numbers of prices, tiered rates, and service ranges are all present.
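A sketch of running such checks programmatically with soda-core; the data source name, table, and thresholds here are hypothetical.

```python
# Run SodaCL checks from Python with soda-core; the check thresholds
# below are illustrative, not EDB's actual validation rules.
from soda.scan import Scan

checks = """
checks for prices:
  - row_count between 4500000 and 5500000
  - missing_count(price) = 0
"""

scan = Scan()
scan.set_data_source_name("pricing_db")
scan.add_configuration_yaml_file("configuration.yml")
scan.add_sodacl_yaml_str(checks)
scan.execute()
scan.assert_no_checks_fail()
```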
The final data artifacts are loaded into the price schema of a Postgres cluster running on CloudNativePG, which serves the prices as a microservice.
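One plausible way to publish such an artifact, assuming a pg_dump custom-format file and CloudNativePG's conventional `<cluster>-rw` read-write service name; the dump path, schema, and connection string are all hypothetical.

```python
# Sketch: restore a custom-format dump into the price schema of the
# CloudNativePG-hosted cluster; names and paths are hypothetical.
import subprocess

subprocess.run(
    [
        "pg_restore",
        "--no-owner",
        "--schema=price",
        "--dbname=postgresql://pricing@pricing-cluster-rw:5432/pricing",
        "prices.dump",
    ],
    check=True,
)
```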