Introduction In the rapidly evolving field of data engineering, maintaining high-quality, reliable, and efficient data pipelines is crucial for businesses to make informed decisions and stay competitive. One methodology that has been instrumental in achieving these objectives is Test-Driven Development (TDD). At its core, TDD involves a simple yet powerful cycle: write a failing test […]
Historically, data engineering, analytics, and data science efforts have focused on progressing from data to knowledge and wisdom. The emergence of LLMs allows knowledge and wisdom to be decomposed back down to data. This can enable novel discovery, integration with information systems, and automated processes. GenAI Categories Generation: Use Bedrock models to create code, text, or images […]
Introduction In today’s fast-paced development environment, the Agile methodology stands out for its emphasis on delivering functional features to users as early as possible. This approach challenges traditional, lengthy development cycles by advocating for the incremental release of a product’s most essential functionalities. By prioritizing early delivery, Agile aims to provide immediate value to users, […]
An Introduction to Data Modeling and Why it Matters for Development Teams Data modeling is a critical yet often underrated skill in technology development and within development teams. This article aims to teach you its basics and explain why it is important. It will introduce you to the concept of a data model, […]
Introduction In the realm of distributed computing and batch processing, operational challenges frequently arise that necessitate innovative solutions. A particular challenge we encountered involved a scenario where multiple jobs within our AWS environment were generating tens of thousands of files and storing them in an Amazon S3 bucket. Subsequently, a specific job was tasked with […]
Python Performance: Issue 2 – Feature Envy Previous Issue Recap In the previous issue we discussed the differences between the “Clean Code” version of calculating the cumulative area of a collection of shapes and “the old-fashioned way”. Robert Martin, aka “Uncle Bob”, advocates for a “clean” polymorphic approach to the problem, where each shape […]
Python Performance: Issue 1 – The Polymorphism Rule Welcome to Python Performance Welcome to the Python Performance blog series. In this series, I will be exploring various performance topics in Python, with the aim of creating a list of heuristics to help developers write more performant Python code before they ever start thinking about reaching […]
Introduction In the digital era, where data is king, the choice of table format for data storage and processing is crucial. Common file formats like CSV, Avro, and Parquet have long been the go-to solutions in various data handling scenarios. However, with the evolving needs of big data and cloud computing, newer and more efficient […]
Case Study: AWS re:Invent 2023 featured a lab session on building out a serverless architecture that used SNS, SQS, and Lambda. I found this lab particularly helpful because it helped me design a solution for a problem where data was being throttled through a single chokepoint. In this case, a large batch of data was being […]
Introduction Joining or starting data projects in large enterprise environments with many stakeholders can be stressful, not to mention a technical implementation nightmare. When the primary stakeholders can’t (or won’t) give the project team clear requirements, the onus falls to the technical implementation team to create order from the chaos and organize the delivery team […]