The Art of the Deployment Pipeline
For many months now I have been working with one of our clients designing and implementing deployment pipelines. Given that this topic is currently close to my heart and has been the source of many sleepless nights, it seems very apt to share some thoughts in a blog post!
In this blog, I'd like to discuss some of the concepts surrounding a deployment pipeline. The intention here is not to provide a one-size-fits-all solution but to help you consider how you can make your deployment pipelines more comprehensive. This is a massive topic and I really could dive deeply into many of the sections, but this is a blog post, so I'll try to keep it brief. Of course, we are available to help you design and implement pipelines!
I am carefully choosing to use the term Deployment Pipeline over the more common CI/CD (continuous integration and continuous delivery). To me, CI/CD implies a high degree of automation, typically seen in software engineering, where techniques for rollbacks or feature toggling are more straightforward. In contrast, our work is often with complex data engineering systems where such strategies aren’t as easily applicable.
Based on my recent experiences, we are going to explore these key areas:
- Security
- Pipeline Initiation
- Code Retrieval
- Quality Assurance
- Unit Testing
- Code Deployment
- Integration Testing
- Performance Testing
- User Acceptance Testing
- Release Finalisation
- Post Deployment Activities
- Logging
- Clean Up and Resource Management
- Disaster Recovery and Rollbacks
Security
We need to think carefully about security. The pipeline can be a very dangerous component: it often has access to very sensitive systems, and that access has the potential to be abused. Here are some important security considerations:
- Pipeline accessibility - the pipeline resources need to be secured, and the method used to invoke the pipeline needs to be protected from abuse
- Credentials - the pipeline needs some form of authorisation to access systems, whether that is certificates, service accounts, or trust relationships between roles. We need to protect these details and make sure they cannot be accessed. Going a step further, we can periodically rotate the credentials to reduce the risk of secrets being extracted and misused (a small credential-retrieval sketch follows this list)
- Networking - the pipeline communicates with sensitive systems, so we need to verify that hosts are authentic and that the communication cannot be intercepted. Perhaps we take steps to ensure all network traffic remains within a private network
- Vulnerabilities - the pipeline itself is built from various tools, and those tools need to be kept up to date so that we are protected against known vulnerabilities
- Permissions - the pipeline should only be granted the minimum set of permissions it needs. The principle of least privilege is a valuable rule that should be applied to pipelines to help protect against exploits
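As a concrete illustration of the credentials point, the sketch below pulls a secret at run time from a secrets manager rather than storing it in the pipeline itself. It assumes AWS Secrets Manager via boto3, and the secret name deploy/database is purely hypothetical; the same idea applies to any other secrets store.

```python
import json

import boto3  # assumes the pipeline runs with an IAM role allowed to read this secret


def get_deploy_credentials(secret_id: str = "deploy/database") -> dict:
    """Fetch credentials at run time instead of baking them into the pipeline."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)  # hypothetical secret name
    return json.loads(response["SecretString"])


credentials = get_deploy_credentials()
# Use credentials["username"] / credentials["password"] in the deployment step,
# and take care never to write them to logs or artefacts.
```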
Pipeline Initiation
This stage is all about setting in motion the events that lead to code deployment. In my experience, this often revolves around a cloud-hosted Git repository (repo). It's not just about having a repo, but about understanding how it is being used by the team and the implications of changes to each branch.
Take for example the Git flow methodology, which has three common branches: develop, release, and main. Commits to each of these branches should be handled very differently by the pipeline. For example, we might expect the following behaviour per branch:
- a develop commit flows through the pipeline with no approvals into a development environment, where developers can review quick, iterative changes
- a release commit might require an initial approval by a release manager before flowing through a chain of environments, with a variety of activities happening in each and key people approving the progression to the next
- a main commit might be ignored, as main represents the final state of production and is the final step in the pipeline
With this in mind, we face the question of whether to have a highly configurable pipeline that can make these decisions based on some form of configuration, or to build a pipeline per flow. The answer can only really be decided by considering the implications of each approach, and it is often a trade-off between the complexity of a single pipeline and the sheer number of pipelines.
In either case, you eventually face the prospect of hooking your git repo up to the pipeline. When a change is made to the repo we need to notify the pipeline to run, and that can often be achieved with a webhook. If the git repo is hosted in a cloud, this may well mean you need some public-facing host that can receive the webhook request and in turn invoke the pipeline with the relevant payload. Securing that webhook becomes a topic in its own right. Some methods that can be used:
- certificates to verify the identity of the hosts
- a continuously changing token that can be used to validate the request (one validation approach is sketched below)
- an IP allow list to reject unexpected requests
There are alternatives to webhooks; a common one is the built-in pipeline functionality hosted by your git repo provider. The right decisions need to be made based on your requirements and security considerations.
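To make the request-validation idea concrete, here is a minimal sketch that checks a GitHub-style X-Hub-Signature-256 header against a shared secret before the pipeline is triggered; the WEBHOOK_SECRET environment variable is an assumption about how you supply that secret.

```python
import hashlib
import hmac
import os

# Shared secret configured on both the webhook and the receiving host.
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"].encode()


def is_valid_signature(body: bytes, signature_header: str) -> bool:
    """Check a GitHub-style 'X-Hub-Signature-256' header against the raw request body."""
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature_header or "")
```

Only if the signature checks out should the receiver go on to invoke the pipeline with the payload.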
Code Retrieval
The pipeline has been triggered. However, the webhook only tells us that there was a change; the payload does not include a copy of the code. We need to make sure the correct commit is collected from the repo and passed along the different phases of the pipeline.
The first task of the pipeline is therefore to clone the repo, check out the correct commit, take a copy and store it as an artefact. The artefact should be stored in some form that can be accessed by all the following phases. It should ideally be secure and immutable. We need to be certain the code cannot be changed during the pipeline execution, or we open ourselves up to potential exploits and code being deployed into production that was not tested and approved.
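A minimal sketch of that retrieval step, using the git and tar command-line tools and recording a digest so later phases can verify the artefact; where the artefact is ultimately uploaded (object storage, an artefact repository, and so on) is left out and will depend on your platform.

```python
import hashlib
import subprocess
from pathlib import Path


def build_artefact(repo_url: str, commit_sha: str, workdir: Path) -> Path:
    """Clone the repo, check out the exact commit and package it as an immutable artefact."""
    checkout = workdir / "source"
    subprocess.run(["git", "clone", repo_url, str(checkout)], check=True)
    subprocess.run(["git", "-C", str(checkout), "checkout", commit_sha], check=True)

    artefact = workdir / f"{commit_sha}.tar.gz"
    subprocess.run(["tar", "-czf", str(artefact), "-C", str(checkout), "."], check=True)

    # Record a digest so later phases can prove the artefact has not been tampered with.
    digest = hashlib.sha256(artefact.read_bytes()).hexdigest()
    (workdir / f"{commit_sha}.sha256").write_text(digest)
    return artefact
```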
Quality Assurance
Before doing anything with the code we can take the opportunity to apply some static code analysis; there are many great tools available for this. Some of the tools we have been using include:
- checkov to scan terraform code for misconfigurations and potential security issues
- tflint to scan terraform code for best practice violations
- sqlfluff for scanning SQL files
- pylint to scan python files for conformity
- bandit to scan python files for common security issues
There are many excellent tools that can be used to scan the languages you are working with. Many of the tools can be extended and allow you to write your own rules. This gives you the opportunity to implement policy as code (PaC) and capture breaches of your development standards before the code is ever released.
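In the pipeline, this phase often boils down to running each tool against the artefact and failing fast on any violation. The sketch below shows that pattern; the directory names and exact arguments are placeholders for your own repo layout.

```python
import subprocess
import sys

# Commands and paths are illustrative; adjust them to your repository layout.
CHECKS = [
    ["checkov", "--directory", "infrastructure/"],
    ["tflint", "--chdir", "infrastructure/"],
    ["sqlfluff", "lint", "sql/"],
    ["pylint", "--recursive=y", "src/"],
    ["bandit", "-r", "src/"],
]

failures = []
for command in CHECKS:
    if subprocess.run(command).returncode != 0:
        failures.append(command[0])

if failures:
    print(f"Static analysis failed: {', '.join(failures)}")
    sys.exit(1)  # a non-zero exit code stops the pipeline at this phase
```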
Unit Testing
Once we know the code is of a good standard, it might be a great opportunity to run some unit tests that take our confidence to the next level. In this phase, we are really thinking about how we can test the code in isolation from our systems. The pipeline opens up opportunities to automate repeatable tasks that not only make sure our releases are of a high standard but also help guide the development team to think about code changes in a holistic way.
There are many testing frameworks across the many languages a pipeline might release, too many for us to cover in detail. Some of the more accessible frameworks exist for the major programming languages. For example, a python function can be tested using tests defined in pytest. However, when we start looking at SQL files it can become a little trickier. We might want to start exploring docker images that run a very lightweight mock-up of the target database, or perhaps use an on-demand instance cloned in some form from a target database.
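To illustrate, a unit test in this phase might look like the sketch below. The module and function under test (my_project.transformations.normalise_customer_name) are hypothetical; the point is that the test runs entirely in isolation, with no databases or cloud services involved.

```python
# test_transformations.py - collected and run by `pytest` during the unit-testing phase
import pytest

from my_project.transformations import normalise_customer_name  # hypothetical module


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("  ALICE smith ", "Alice Smith"),
        ("bob o'brien", "Bob O'Brien"),
    ],
)
def test_normalise_customer_name(raw, expected):
    # Pure function in, plain assertion out: nothing outside the code base is touched.
    assert normalise_customer_name(raw) == expected
```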
Code Deployment
At some point in the deployment pipeline we need to complete the obvious task of deploying that code!
We have the code that we want to deploy. We need to consider:
- how to deploy it
- where to deploy it to
- retaining deployment information between runs that is needed by the deployment
- capturing information that might be needed to decide if a deployment was successful
- capturing errors and making automated decisions on what action to take
Let's use a terraform code release as an example for these points. We might need to install a version of the terraform application or run a docker container that has it preinstalled. We need to have access to configuration information that tells terraform how to authenticate against a target and the connection information for the target. The terraform state information needs to be retained between deployments so terraform knows the resources it controls and what their last known state was. In the event of failure, we want to be able to at least inform people of what happened so action can be taken.
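A minimal sketch of that Terraform phase, assuming terraform is already available on the runner (or inside a container image) and that backend and variable files exist per environment; all of the paths and file names here are illustrative.

```python
import subprocess
from pathlib import Path


def deploy_terraform(code_dir: Path, environment: str) -> None:
    """Initialise Terraform with a remote backend and apply the configuration."""
    # Remote state retains the record of managed resources between pipeline runs.
    subprocess.run(
        ["terraform", "init", f"-backend-config=backends/{environment}.hcl"],
        cwd=code_dir,
        check=True,
    )
    result = subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var-file=environments/{environment}.tfvars"],
        cwd=code_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the failure so people (or a later rollback step) can act on it.
        raise RuntimeError(f"terraform apply failed:\n{result.stderr}")
```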
Integration Testing
After the code has been deployed we need to be confident it has landed correctly. We might implement a phase within the pipeline that runs some integration tests. These tests can use the system and validate the responses. It might be that we make API calls to services that we deployed, extract metadata to validate, or perhaps run queries that only work once our new code changes are in place.
We need to be careful that the integration tests are not making data changes that impact the intended usage of the system. Cancelling a mortgage in production might not be an acceptable form of integration testing!
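A couple of deliberately read-only checks might look like the sketch below. The service endpoints and the loyalty_tier column are made up; the shape of your checks will depend entirely on what you have just deployed.

```python
import os

import requests

# Hypothetical endpoints exposed by the service we have just deployed.
BASE_URL = os.environ["SERVICE_BASE_URL"]


def test_service_is_healthy():
    response = requests.get(f"{BASE_URL}/health", timeout=10)
    assert response.status_code == 200


def test_new_column_is_visible():
    # A read-only check: inspect metadata rather than changing any data.
    response = requests.get(f"{BASE_URL}/api/v1/schema/customers", timeout=10)
    response.raise_for_status()
    assert "loyalty_tier" in response.json()["columns"]
```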
Performance Testing
The pipeline has the potential to collect information on tests in all environments. At the very least we can analyse the metadata on these tests and monitor them for degradation. This can be a way of putting a finger in the air and getting a feel for the impact of our stacked series of changes. As we release more code we collect more test results; those results build up a history, and that history can help us gain some insight.
In a more direct approach, perhaps we have the luxury of a preproduction environment that contains data volumes comparable to production. We might decide performance testing is very important for our project and set some minimum expectations that we need to meet before the code finally gets deployed to production.
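Even a very simple comparison against a stored baseline can catch degradation before production. The sketch below assumes the pipeline records step durations to a JSON file between runs; the file location and the 20% tolerance are arbitrary choices for illustration.

```python
import json
import time
from pathlib import Path

BASELINE_FILE = Path("metrics/baseline_durations.json")  # illustrative location
TOLERANCE = 1.2  # flag anything more than 20% slower than its recorded baseline


def time_step(name: str, func) -> float:
    """Run one performance-test step and return how long it took in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def slower_than_baseline(durations: dict) -> list:
    """Return the names of steps that have degraded beyond the tolerance."""
    baseline = json.loads(BASELINE_FILE.read_text())
    return [
        name
        for name, duration in durations.items()
        if name in baseline and duration > baseline[name] * TOLERANCE
    ]
```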
User Acceptance Testing
Up until now, we haven't delved much into approval steps. Depending on your needs and confidence in the pipeline, these can be strategically placed after key phases. A prime candidate for manual approval is User Acceptance Testing (UAT). Here, the changes are deployed in a designated environment for our subject matter experts to scrutinise. While they're giving it a spin, our pipeline patiently awaits their verdict.
Remember, approvals don't have to be a mere 'Yes' or 'No' click. We can get creative and configure an approval stage that requires a mix of designated individuals or roles to greenlight the next phase of the pipeline.
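As a toy illustration of that idea, a gate might only open once every required role has signed off; the roles and the shape of the approval records here are entirely made up, and in practice this logic usually lives in your pipeline tool's approval feature.

```python
REQUIRED_ROLES = {"release_manager", "data_owner"}  # illustrative roles


def approval_gate_open(approvals: list) -> bool:
    """Return True once every required role has at least one approval."""
    approved_roles = {a["role"] for a in approvals if a.get("decision") == "approved"}
    return REQUIRED_ROLES.issubset(approved_roles)


# The pipeline polls its approval store and only proceeds when the gate opens.
print(approval_gate_open([
    {"user": "asha", "role": "release_manager", "decision": "approved"},
    {"user": "tom", "role": "data_owner", "decision": "approved"},
]))  # True
```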
Release Finalisation
Once the code is in production we really should, at the very least, merge the code into our main branch. If you are using Git flow then the main branch should be a reflection of the code that has been released to the production environment. Perhaps you want a similar branching approach to provide some form of insight into the other environments along the release path.
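A sketch of that finalisation step, driving git from the pipeline; the branch name, tag format, and remote are assumptions you would adapt to your own branching strategy.

```python
import subprocess


def git(repo_dir: str, *args: str) -> None:
    """Run a git command inside the repository used by the pipeline."""
    subprocess.run(["git", "-C", repo_dir, *args], check=True)


def finalise_release(repo_dir: str, release_branch: str, version: str) -> None:
    """Merge the released branch into main and tag it, so main mirrors production."""
    git(repo_dir, "checkout", "main")
    git(repo_dir, "merge", "--no-ff", release_branch, "-m", f"Release {version}")
    git(repo_dir, "tag", "-a", f"v{version}", "-m", f"Release {version}")
    git(repo_dir, "push", "origin", "main", f"v{version}")
```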
Post Deployment Activities
Think of this as the encore after the main show. What special tasks should the pipeline handle after the final release? Here's some food for thought:
- Update a Jira ticket's status or drop a comment (see the sketch after this list)
- Document your achievements, maybe pulling info straight from docstrings
- Keep a record in a change log
- Sprinkle a celebratory comment in the git log
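Taking the Jira item as an example, a post-deployment step might drop a comment on the ticket via the REST API. This sketch assumes a Jira Cloud instance with an API token; the base URL and ticket key are placeholders.

```python
import os

import requests

JIRA_BASE = "https://your-company.atlassian.net"  # placeholder instance
AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"])


def comment_on_ticket(ticket_key: str, message: str) -> None:
    """Add a comment to a Jira issue using the v2 REST API."""
    response = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue/{ticket_key}/comment",
        json={"body": message},
        auth=AUTH,
        timeout=10,
    )
    response.raise_for_status()


comment_on_ticket("DATA-123", "Release 1.4.2 deployed to production by the pipeline.")  # hypothetical ticket
```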
Logging
Our pipeline can sometimes feel like a labyrinth, so a solid logging strategy is crucial. Consolidating logs from various phases into a central tool, easily filtered by pipeline execution, can be a lifesaver when things get hairy.
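One simple way to make logs filterable by pipeline execution is to stamp every line with a correlation id. The sketch below assumes the pipeline exposes such an id through an environment variable (PIPELINE_RUN_ID is an invented name); the central tool you ship the logs to is a separate choice.

```python
import logging
import os
import sys

# A correlation id lets a central logging tool group lines from one pipeline execution.
RUN_ID = os.environ.get("PIPELINE_RUN_ID", "local")

logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format=f"%(asctime)s run={RUN_ID} phase=%(name)s %(levelname)s %(message)s",
)

logger = logging.getLogger("code_deployment")
logger.info("terraform apply completed successfully")
```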
Clean-up and Resource Management
With any pipeline, there will be a requirement to perform ongoing housekeeping. Let's consider some things we might want to keep an eye on:
- docker images as we move through different versions of tools
- persisted files that are used to store artefacts, logs, metadata, etc
- database clones spun up for unit testing
It really depends on the approach you have taken, but it is easy to dismiss something as inconsequential and find that within a year it is taking up a chunk of the storage budget. Some cloud vendors' object storage and container repositories have the ability to add lifecycle policies to objects, and this can be really useful.
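For example, if artefacts land in AWS S3, a lifecycle rule can expire them automatically; the bucket name, prefix, and retention period below are purely illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Expire pipeline artefacts automatically instead of letting them accumulate forever.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-pipeline-artefacts",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-artefacts",
                "Filter": {"Prefix": "artefacts/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},  # delete artefacts after 90 days
            }
        ]
    },
)
```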
Disaster Recovery and Rollbacks
The final thought for the pipeline is how it can be used in disaster recovery (DR) scenarios and even how it can be used to roll back a failed change. Both are complex topics and I'm not going to give you a solution here. It really depends on the systems and the approach. The pipeline can be used in both scenarios and in some cases can even be automated to recover a failed deployment through clever coding approaches.
Final Thoughts
This blog post was really an opportunity for me to provide you with some food for thought. I've avoided specifics simply to keep the post as short as possible. We have helped clients implement some really powerful pipelines that are highly configurable and provide solutions for all of these problems, at least for their use cases. I hope this post helps you to design a pipeline that provides your organisation with a comprehensive and robust solution.