Numida Reduces Fraud and Accelerates Underwriting with ML on AWS

Numida built a real-time ML fraud detection system on AWS, cutting underwriting time from days to hours and laying the groundwork for scalable operations.
Stream CDC data with Amazon Redshift streaming, Amazon MSK and Debezium Connector

Episode 3: Redshift Serverless Streaming Ingestion. In the previous episodes, I covered the overall architecture design for this project and Debezium connector configuration for our CDC streaming pipeline. Now I’ll complete the series by diving deep into Amazon Redshift Serverless streaming ingestion, the final piece that enables real-time analytics on your CDC data. This episode focuses on the practical […]
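To make the teaser concrete, here is a minimal sketch of the kind of streaming-ingestion DDL this episode covers, issued against a Redshift Serverless workgroup through the Redshift Data API with boto3. The workgroup name, database, IAM role, MSK cluster ARN, and topic name are illustrative placeholders, not values from the post.

```python
import boto3

# Sketch only: create an external schema over the MSK cluster, then a
# materialized view that ingests the CDC topic. All identifiers are placeholders.
redshift = boto3.client("redshift-data")

external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS msk_cdc
FROM MSK
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-msk-access'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:eu-west-1:111122223333:cluster/cdc-cluster/abcd1234'
"""

streaming_view = """
CREATE MATERIALIZED VIEW orders_stream AUTO REFRESH YES AS
SELECT kafka_partition,
       kafka_offset,
       refresh_time,
       JSON_PARSE(kafka_value) AS change_event
FROM msk_cdc."cdc.inventory.orders"
WHERE CAN_JSON_PARSE(kafka_value)
"""

for sql in (external_schema, streaming_view):
    # The Data API runs one statement per call; no JDBC endpoint is needed.
    redshift.execute_statement(
        WorkgroupName="analytics-serverless",  # Redshift Serverless workgroup
        Database="dev",
        Sql=sql,
    )
```

With AUTO REFRESH YES the materialized view keeps pulling new Kafka records on its own, so no external scheduler is required for the streaming side.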
Stream CDC data with Amazon Redshift streaming, Amazon MSK and Debezium Connector

Episode 2: Configuring Debezium Connector for Reliable CDC. In the previous episode, I covered the overall architecture and infrastructure setup for our CDC streaming pipeline. Now I’ll dive deep into Debezium, the open-source platform that captures row-level database changes in real time and streams them to Kafka topics through MSK Connect. This episode focuses on the […]
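For orientation, here is a hedged sketch of what such a connector configuration typically looks like. Property names follow the Debezium MySQL connector (2.x naming); the hostnames, credentials, and topic names are placeholders rather than the article’s actual settings, and on MSK Connect this map would be supplied as the connector configuration when the connector is created.

```python
# Illustrative Debezium MySQL connector configuration (not the article's values).
# In practice the password would come from a secrets/config provider.
debezium_config = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    # Source database: Aurora MySQL writer endpoint (placeholder).
    "database.hostname": "aurora-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "<secret>",
    "database.server.id": "184054",        # unique binlog client ID
    "database.include.list": "inventory",  # schemas to capture
    "topic.prefix": "cdc",                 # topics become cdc.<schema>.<table>
    # Debezium tracks table schema history in its own Kafka topic.
    "schema.history.internal.kafka.bootstrap.servers": "b-1.msk.example.com:9092",
    "schema.history.internal.kafka.topic": "schema-history.inventory",
    "include.schema.changes": "true",
}
```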
Stream CDC data with Amazon Redshift streaming, Amazon MSK and Debezium Connector

Episode 1: Designing the End-to-End CDC Architecture and IaC Setup. In today’s data-driven landscape, organizations need real-time insights from their operational databases to make informed decisions quickly. I developed a comprehensive Change Data Capture (CDC) streaming pipeline that captures database changes from Aurora MySQL and streams them in real time to an Amazon Redshift data warehouse for analytics. This solution […]
QuickSight Quickstart: From Blank AWS Account to Published Dashboards in One Command

The missing link for enterprise QuickSight development. The Problem: Automating QuickSight Is Still Awkward. QuickSight is pay-as-you-go BI, but its deployment story lags behind the rest of AWS. A single dashboard hides dozens of interlocked artefacts: data sources, datasets, analyses, themes, RDS/VPC plumbing, IAM policies, and invisible dependencies such as SPICE refresh schedules. […]
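As a taste of what the automation has to cover, here is a hedged boto3 sketch that creates just two of those interlocked artefacts, a data source and a SPICE dataset. The account ID, resource IDs, RDS details, and columns are invented for illustration and are not the article’s setup.

```python
import boto3

# Sketch: data source -> dataset, two links in the QuickSight dependency chain.
qs = boto3.client("quicksight")
ACCOUNT_ID = "111122223333"  # placeholder AWS account ID

qs.create_data_source(
    AwsAccountId=ACCOUNT_ID,
    DataSourceId="rds-mysql-src",
    Name="app-db",
    Type="MYSQL",
    DataSourceParameters={
        "RdsParameters": {"InstanceId": "app-db-instance", "Database": "app"}
    },
    Credentials={"CredentialPair": {"Username": "quicksight", "Password": "<secret>"}},
)

qs.create_data_set(
    AwsAccountId=ACCOUNT_ID,
    DataSetId="orders-dataset",
    Name="orders",
    ImportMode="SPICE",  # SPICE datasets also need refresh schedules, a hidden dependency
    PhysicalTableMap={
        "orders": {
            "RelationalTable": {
                "DataSourceArn": f"arn:aws:quicksight:eu-west-1:{ACCOUNT_ID}:datasource/rds-mysql-src",
                "Schema": "app",
                "Name": "orders",
                "InputColumns": [
                    {"Name": "order_id", "Type": "INTEGER"},
                    {"Name": "amount", "Type": "DECIMAL"},
                ],
            }
        }
    },
)
```

Analyses, themes, dashboards, permissions, and the VPC connection each need their own calls in the right order, which is exactly the dependency graph the quickstart aims to automate.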
Databricks Asset Bundles

A first look at the tool’s maturity, benefits, and current limitations. Databricks offers several ways to deploy resources like jobs, notebooks, and clusters, each suited to different levels of automation, complexity, and control. Whether it’s managing infrastructure, orchestrating jobs, or promoting code across environments, choosing the right deployment tool is […]
A Technical Comparison of Apache Parquet, ORC, and Arrow: Storage Formats for Big Data Workloads

In the world of Big Data, choosing the right storage format is critical for the performance, scalability, and efficiency of analytics and processing tasks. Apache Parquet, Apache ORC, and Apache Arrow are three popular formats commonly used for data storage and processing within the ecosystem. While each of these formats serves a distinct purpose […]
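A small pyarrow sketch (not from the article) shows how the three relate in practice: Arrow is the columnar in-memory representation, while Parquet and ORC are the columnar on-disk formats it serializes to. It assumes a pyarrow build with ORC support; the file names and sample data are placeholders.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

# The same in-memory Arrow table written to both on-disk formats.
table = pa.table(
    {
        "user_id": [1, 2, 3],
        "event": ["click", "view", "purchase"],
        "amount": [0.0, 0.0, 19.99],
    }
)

pq.write_table(table, "events.parquet", compression="snappy")
orc.write_table(table, "events.orc")

# Reading either format lands back in Arrow memory, ready for zero-copy
# hand-off to engines that speak Arrow.
assert pq.read_table("events.parquet").equals(table)
assert orc.read_table("events.orc").equals(table)
```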
The Art of Data Engineering: Applying Sun Tzu’s Principles

How Ancient Wisdom Can Transform Modern Data Practices. Sun Tzu’s “The Art of War” offers timeless wisdom that transcends the battlefield, providing insights applicable to various domains, including data engineering. In this field, success is not merely about coming up with innovative ideas or implementing initial solutions. True success is measured by how effectively these […]
Databricks First Look: August 2024 Release, Deep Dive into Lakehouse Federation

Exploring the Power of Seamless Data Integration and Enhanced Security with Databricks. In the fast-evolving landscape of data analytics, staying updated with the latest platform enhancements is crucial. The August 2024 release from Databricks brings a suite of impactful updates designed to boost security, compliance, and performance. Among these, Lakehouse Federation stands out, offering […]
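As a rough illustration of the feature, the following sketch sets up a federated PostgreSQL catalog. It assumes a Databricks notebook with Unity Catalog enabled and the predefined spark session; the connection name, host, secret scope, and catalog/database names are all placeholders rather than anything from the release notes.

```python
# Minimal Lakehouse Federation sketch, assumed to run in a Databricks notebook.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS pg_orders
  TYPE postgresql
  OPTIONS (
    host 'orders-db.example.com',
    port '5432',
    user 'federation_reader',
    password secret('federation', 'pg_password')
  )
""")

spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS orders_federated
  USING CONNECTION pg_orders
  OPTIONS (database 'orders')
""")

# The external tables are now queryable in place, with no ingestion pipeline.
spark.sql("SELECT count(*) FROM orders_federated.public.customers").show()
```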
Risk Management in Cloud Data Projects: Strategies for Success

Introduction to Risk Management in Cloud Data Projects. Risk management is a critical component of any cloud data project. As organizations increasingly rely on cloud technologies to store, process, and analyze data, understanding the unique risks associated with these projects becomes essential. Cloud data projects involve various stakeholders and technologies, which introduce complexities in data […]