The Data Journey

Many organizations share similar challenges in growing their operational capabilities with data. I have given several talks on data lake design and avoiding the “swampiness” of a data lake; invariably, there are pockets of mess or a “junk drawer” where people hide little bits of critical information.

A complex data environment with myriad source systems and a growing web of consumers characterizes our organization. We boast a mature cloud infrastructure with numerous highly talented leaders, architects, and engineers. A stewardship program is in place to govern access and usage for specific applications, and everyone involved has the best intentions of following best practices for collective success. Despite these efforts, it remains a complicated web of one-off data deliveries, ad hoc versioning, and undocumented interdependencies. So, what can be done? Hindsight is always clearer, yet here we are, years into a migration and cloud adoption plan, facing a substantial refactoring. Change is expensive, in both time and mental anguish. Could this situation have been prevented, and what could have been done to mitigate it?

Even an internal survey turned up very mixed (and strong!) opinions on whether this situation could have been prevented or mitigated.

Discussed below are some of the themes that have contributed to the current situation, along with ways they might be addressed.

Strategy

What is the organizational structure of your company? Do you operate with multiple teams or a centralized data team? Is there effective communication and collaboration, or do silos hinder the flow of information? Is there a prevailing cross-functional team mindset, or do individuals tend to focus on specific tasks?

Beyond organizational structure, certain conditions must be in place for the successful development of data-centric systems. Teams should have a clear understanding of the input data they require and how it is used across different departments. It’s essential to ascertain if teams generate additional metadata or curate input datasets and whether the data is consistently utilized. Accessibility to needed data and the ability to seamlessly publish data back into the organization’s broader ecosystem are crucial aspects.

At present, this organization grapples with myriad complex source data systems across the enterprise. These systems feed an intermediate layer of data products designed for use in analytics applications. Some groups within the organization modify these data products to extract key statistics and values, which are vital to their functions and may also hold value for other teams. However, this process often results in duplicated effort and the potential for inconsistent data generation. For instance, two teams may each produce a filtered version of the same time-series dataset, but if they apply different filtering coefficients, the resulting values will quietly disagree.
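To make that discrepancy concrete, here is a minimal sketch, with made-up numbers and a simple exponential smoother, of two teams filtering the same raw series with different coefficients and each publishing the result as “the” filtered dataset:

```python
# Hypothetical illustration: two teams smooth the same raw series with
# different coefficients, and each publishes its output as "the" dataset.

def exp_smooth(series, alpha):
    """Exponentially weighted moving average with smoothing factor alpha."""
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

raw = [10.0, 12.0, 9.0, 14.0, 11.0]

team_a = exp_smooth(raw, alpha=0.2)  # Team A's coefficient choice
team_b = exp_smooth(raw, alpha=0.5)  # Team B's coefficient choice

print(team_a[-1])  # ~10.92
print(team_b[-1])  # ~11.50 -- same source data, two versions of "the truth"
```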

To enhance the internal use of data, the organization should consider fostering a more collaborative and communicative environment. Establishing a centralized data governance framework can help streamline data processes, minimize duplication of efforts, and ensure consistency in data generation. Encouraging cross-functional collaboration, implementing standardized data access protocols, and providing comprehensive training on data handling practices can further enhance the organization’s ability to leverage its data effectively. By addressing these aspects, the organization can optimize its data-centric systems for improved efficiency and accuracy across teams.

Planning

Planning is crucial to the successful execution of the data-enabled organization. Even groups that follow Agile principles must take care to honor the Agile spirit and not just conduct the ceremonies.

Start by clearly defining the requirements of the project. Understand the current needs and potential future requirements. This clarity will guide the development process and help in avoiding unnecessary features.

Are the business and stakeholder requirements clearly understood? Can the entire development team explain the system back to the stakeholders in a way they will agree with? Team members make small decisions almost daily that, like ripples in a pond, can have negative effects far into the future if the full system context is not understood.

Document the system architecture, design decisions, and rationale behind them. Foster a culture of knowledge sharing within the team. This ensures that team members understand the system’s intricacies and can contribute to its evolution without unnecessary complexity. Make sure system interfaces are extendable. Call out decisions that you are making today that will limit flexibility in the future.

This detailed system planning exercise will illustrate what data each application needs access to; from there, projects can be clustered and coordinated.

Governance and Access

Developing a robust data governance strategy is essential for ensuring controlled and secure access to data sets within an organization. Here are key points to consider in the planning process:

Access Control

Define and enforce access control policies to regulate who has access to what data. Establish roles and permissions based on job responsibilities, ensuring that individuals can only access the data necessary for their roles. Utilize tools like AWS Lake Formation to automate and manage access control.
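As a rough sketch of what a role-based, least-privilege grant looks like in practice, the snippet below uses the boto3 Lake Formation API to give a hypothetical analyst role read-only access to a single table; the role ARN, account ID, database, and table names are all placeholders:

```python
# Sketch: granting a role read-only access to one table via AWS Lake Formation.
# The role ARN, database, and table names below are hypothetical placeholders.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"
    },
    Resource={
        "Table": {
            "DatabaseName": "curated",
            "Name": "daily_metrics",
        }
    },
    Permissions=["SELECT"],        # read-only: no ALTER, DROP, or INSERT
    PermissionsWithGrantOption=[], # analysts cannot re-grant access
)
```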

Development Environment

Implement guidelines regarding development practices against production data. It’s essential to strike a balance between providing developers with the data they need for testing and preventing unauthorized or unintended changes to production data. Consider using data masking or synthetic data generation for non-production environments.
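A minimal masking sketch, assuming a record-at-a-time pipeline and illustrative column names: hashing with a salt keeps values irreversible while staying deterministic, so masked keys still join consistently across development datasets.

```python
# Sketch: deterministic masking of PII columns before data lands in a
# development environment. Column names are illustrative.
import hashlib

PII_COLUMNS = {"email", "full_name", "phone"}

def mask_value(value: str, salt: str = "dev-env-salt") -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"masked_{digest[:12]}"

def mask_record(record: dict) -> dict:
    """Mask PII fields; leave everything else untouched."""
    return {
        key: mask_value(str(val)) if key in PII_COLUMNS else val
        for key, val in record.items()
    }

row = {"full_name": "Ada Lovelace", "email": "ada@example.com", "plan": "pro"}
print(mask_record(row))  # PII replaced, 'plan' preserved
```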

Data Lineage and Metadata Management

Implement data lineage tracking to understand how data flows through the organization. This helps in identifying dependencies and impacts of changes to data sources. Utilize metadata management tools to catalog and document data assets, making it easier for users to discover and understand available data sets.
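Dedicated catalog tools handle this at scale, but the sketch below shows the minimum lineage metadata worth capturing per dataset, and how even that much makes impact analysis a simple query; the dataset names and fields are illustrative.

```python
# Sketch: the minimum lineage metadata worth capturing per dataset. In
# practice this lives in a catalog tool; the fields here are illustrative.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str                      # e.g. "curated.daily_metrics"
    owner: str                     # accountable team or steward
    description: str               # what the dataset means, in plain language
    upstream: list[str] = field(default_factory=list)  # direct inputs

daily_metrics = DatasetRecord(
    name="curated.daily_metrics",
    owner="analytics-platform",
    description="Daily KPIs aggregated from raw event streams.",
    upstream=["raw.events", "reference.calendar"],
)

def impacted_by(source: str, catalog: list[DatasetRecord]) -> list[str]:
    """List datasets that directly depend on a changed source."""
    return [d.name for d in catalog if source in d.upstream]

print(impacted_by("raw.events", [daily_metrics]))  # ['curated.daily_metrics']
```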

Data Quality Standards

Define data quality standards and implement checks to ensure data accuracy and reliability. This includes validating data at the point of entry and monitoring data quality over time. Establishing data quality standards is crucial for building trust in the data and supporting informed decision-making.
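As a sketch of point-of-entry validation, the checks below encode a few illustrative standards (no missing keys, no negative amounts, no duplicate ids) and return human-readable violations rather than silently dropping rows; real rules would come from the stewardship program.

```python
# Sketch: point-of-entry data quality checks. Rules and thresholds here are
# illustrative; real standards come from the stewardship program.

def check_batch(rows: list[dict]) -> list[str]:
    """Return a list of human-readable quality violations for a batch."""
    failures = []
    if not rows:
        failures.append("batch is empty")
    for i, row in enumerate(rows):
        if row.get("id") is None:
            failures.append(f"row {i}: missing primary key 'id'")
        amount = row.get("amount")
        if amount is not None and amount < 0:
            failures.append(f"row {i}: negative amount {amount}")
    seen, dupes = set(), set()
    for row in rows:
        if row.get("id") in seen:
            dupes.add(row["id"])
        seen.add(row.get("id"))
    if dupes:
        failures.append(f"duplicate ids: {sorted(dupes)}")
    return failures

batch = [{"id": 1, "amount": 9.5}, {"id": 1, "amount": -2.0}]
print(check_batch(batch))  # flags the duplicate id and the negative amount
```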

Collaboration and Communication

Foster collaboration and communication between data stewards, IT teams, and data consumers. Establish a governance council or team of data stewards responsible for overseeing and enforcing data governance policies. Regularly communicate updates, changes, and best practices related to data governance across the organization.

By addressing these considerations in your data governance strategy, your organization can establish a structured and effective approach to managing and controlling access to data sets. This, in turn, promotes data security, accuracy, and compliance with regulatory requirements.

Stability

Managing production and development environments effectively is critical for a well-functioning data management strategy. Let’s delve into two key use cases: Production and Development.

Use Case 1: Production

Production represents the live, operational environment where the organization’s core business processes run. It contains the most up-to-date and reliable data, ensuring that consumers have access to accurate information for their day-to-day operations. The first and foremost requirement here is data quality and reliability: implement data quality checks, monitor data accuracy, and address any issues promptly.

Implement strict access controls to govern who can access and modify data in the production environment. Only authorized personnel should have the necessary permissions, and access should be based on job roles and responsibilities. For internally facing datasets without sensitive information, however, it is better to err on the side of permissiveness to mitigate the spread of “backroom ETL” (covered below).

With an exception for PII masking, analytics application developers should consume production-quality outputs from upstream systems. This expectation extends to all producers of data that have consumer dependencies.

Use Case 2: Development

Development environments are dedicated spaces for testing and building new features, models, or applications. These environments are essential for innovation and ensuring that new developments don’t impact the stability of the production environment.

Development Data Availability

As mentioned above regarding PII, implement data masking or use synthetic data in development environments to protect sensitive information. This ensures that developers have realistic data for testing without exposing confidential data to unauthorized individuals.
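Where masking is not enough, for example when developers need volume rather than real records, synthetic generation is an alternative. Here is a small sketch with an illustrative schema, seeded so fixtures are reproducible:

```python
# Sketch: generating synthetic-but-realistic customer rows for a development
# environment, so developers never touch real PII. Shapes are illustrative.
import random
import string

def synthetic_customer(seed: int) -> dict:
    """Produce a fake customer record with a realistic shape."""
    rng = random.Random(seed)  # seeded for reproducible test fixtures
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "customer_id": seed,
        "full_name": name.title(),
        "email": f"{name}@example.com",
        "signup_year": rng.randint(2015, 2024),
        "plan": rng.choice(["free", "pro", "enterprise"]),
    }

dev_fixture = [synthetic_customer(i) for i in range(100)]
print(dev_fixture[0])
```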

Version Control

Utilize version control systems for code and configurations. Implement continuous integration and continuous deployment (CI/CD) pipelines to automate testing and deployment processes. This ensures that changes can be tested thoroughly before being introduced to the production environment.

Environment Isolation

Design and create isolated environments for different development streams, projects, or teams. This prevents conflicts between different development efforts and allows for parallel testing of new features without interfering with each other. This can take the form of multiple cloud accounts or isolation infrastructure. This is best done early on in the data platform development stages as it is very hard to retrofit to existing architectures.

Development Flow Architecture

Design a development flow that mirrors the production environment. This approach allows “production” to be treated as a special case of the standard development flow. It promotes consistency, reduces the risk of errors, and ensures that new features are thoroughly tested before deployment.
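One way to realize this, sketched below with hypothetical bucket names, is a single parameterized pipeline where environments differ only in configuration; production is then just another entry in the table rather than a separate code path.

```python
# Sketch: one parameterized pipeline definition; "production" is just another
# environment configuration, not a separate code path. Names are illustrative.

ENVIRONMENTS = {
    "dev": {
        "source_bucket": "s3://acme-data-dev/raw",
        "target_bucket": "s3://acme-data-dev/curated",
        "mask_pii": True,          # dev never sees real PII
        "alert_on_failure": False,
    },
    "prod": {
        "source_bucket": "s3://acme-data-prod/raw",
        "target_bucket": "s3://acme-data-prod/curated",
        "mask_pii": False,
        "alert_on_failure": True,  # only prod pages the on-call engineer
    },
}

def run_pipeline(env: str) -> None:
    cfg = ENVIRONMENTS[env]  # the only thing that differs between runs
    print(f"reading {cfg['source_bucket']}, writing {cfg['target_bucket']}")
    # ... identical extract/transform/load logic for every environment ...

run_pipeline("dev")   # exercised on every change
run_pipeline("prod")  # the same code, promoted after tests pass
```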

Preventing Backroom ETL

Discourage the practice of “backroom ETL”, where teams privately extract and transform shared data outside the sanctioned pipelines, by incorporating ETL processes into the standard development flow. This ensures that data transformations and integrations are treated as integral parts of the development process, reducing the risk of inconsistencies between development and production.
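In practice, this means promoting the private calculation into the shared, reviewed codebase. A sketch with an illustrative rule: once the transformation is a named, unit-tested function, every consumer gets the same answer and CI guards it.

```python
# Sketch: a "backroom" spreadsheet calculation promoted into the shared
# codebase as a named, unit-tested transformation. The rule is illustrative.

def flag_high_value_orders(orders: list[dict],
                           threshold: float = 500.0) -> list[dict]:
    """The team's previously private enrichment, now versioned and reviewable."""
    return [
        {**order, "high_value": order["amount"] >= threshold}
        for order in orders
    ]

def test_flag_high_value_orders():
    out = flag_high_value_orders([{"amount": 700.0}, {"amount": 10.0}])
    assert out[0]["high_value"] and not out[1]["high_value"]

test_flag_high_value_orders()  # runs in CI, so every consumer sees one rule
```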

By carefully managing production and development environments with these considerations in mind, organizations can strike a balance between stability and innovation, ensuring that data is both reliable for operational use and adaptable for ongoing development and improvement. Much of this is easier said than done and requires substantial investment in culture and organizational capabilities.

Tooling

Selecting the right tools and ensuring their functionality, flexibility, and support for automation is crucial for an effective data management strategy. Here are considerations related to tooling:

Ensure that the selected tools align with the specific requirements of data management, including data integration, storage, processing, and analysis. Different tools may be needed for various stages of the data lifecycle.

Assess whether the tools can be customized to meet specific organizational requirements. The ability to modify or extend functionality can be crucial as business needs evolve.

Leverage automation to streamline repetitive tasks, such as data ingestion, ETL processes, and deployment. Automation reduces the risk of human error and enhances efficiency.

Implement monitoring tools that provide insights into system performance, data quality, and potential issues. Troubleshooting tools should enable quick identification and resolution of problems to minimize downtime.

Tools should also allow for rapid iteration and deep inspection of the process. This can be particularly challenging where normal features like breakpoints are not available and even basics like logging can be painful to use.
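One mitigation, sketched below, is structured per-step logging: emitting one machine-parseable line per pipeline stage gives much of the inspection value of a debugger in environments where you cannot attach one. Step names and counts are illustrative.

```python
# Sketch: structured, per-step logging as a substitute for breakpoints in a
# pipeline you cannot attach a debugger to. Step names are illustrative.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")

def log_step(step: str, **fields) -> None:
    """Emit one machine-parseable line per pipeline step."""
    log.info(json.dumps({"step": step, "ts": time.time(), **fields}))

rows_in = 1_000
rows_out = 982
log_step("filter_invalid", rows_in=rows_in, rows_out=rows_out,
         dropped=rows_in - rows_out)
```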

Closing

To summarize, many components contribute to creating an effective data-enabled organization. Many of them are complex, with nuanced trade-offs that should be considered. Wherever you are in your cloud journey, talking with experts can help you work through some of these decisions and navigate your organization’s data journey to the cloud. We specialize in finding solutions for businesses of all shapes and sizes, so give us a call!