
Optimizing infrastructure costs in AWS. Part Two: Alternative Environments

Martín Valdez
December 27, 2023

Once a software project reaches a certain level of maturity and stability, it is common to review infrastructure costs and usage to identify low-effort actions that can reduce expenses without affecting quality, timelines, or processes.

In this article we will look at some of the improvements we made in one of our projects as part of a systematic plan to reduce infrastructure costs without impacting operations. In this case, we will focus mainly on reducing idle infrastructure in the QA environment.

Analyzing non-productive environments

In most projects it is common to have a QA/staging/pre-production environment that the development team (and some stakeholders) use to deploy and test upcoming features and improvements. For simplicity, we will refer to such environments as "alternative" environments.

When setting up an alternative environment, it is natural to replicate the production environment as closely as possible. However, access patterns in those environments differ from production. Ideally, instead of being up 24/7 as in production, only about 40 hours per week would be necessary, corresponding to the average working hours of a full-time team.

Considering that different team members (QA and engineering staff, as well as other stakeholders) may be in different time zones, the first task was to identify the time slots in which the services were actually needed. This analysis showed that the services could be turned off for about 72 of the 168 hours in a week, a 42.8% reduction in usage.

This policy applies to all infrastructure used "on demand" by the different teams on a day-to-day basis: the ECS instances running the project's two backend services, the database, and an EC2 instance used for maintenance tasks and general queries.

Scheduled shutdown and startup

To perform this scheduled environment activation we tried two approaches, one using AWS CodeBuild and the other using AWS Lambda functions. In both cases the methodology was the same: a script using the AWS SDK for Python (the boto3 library) that turned the resources on or off in a certain order, depending on whether the input parameter indicated "on" or "off".
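As a minimal sketch, a Lambda handler of this kind could look like the following. The cluster, service, database, and instance identifiers here are illustrative placeholders, not the real names used in the project, and the ordering is just one reasonable choice (start the database before the services that depend on it, and shut them down in the reverse order).

```python
import boto3

# Hypothetical resource identifiers -- replace with the real ones for your environment.
ECS_CLUSTER = "qa-cluster"
ECS_SERVICES = ["backend-api", "backend-worker"]
RDS_INSTANCE = "qa-database"
EC2_INSTANCE = "i-0123456789abcdef0"  # maintenance instance

ecs = boto3.client("ecs")
rds = boto3.client("rds")
ec2 = boto3.client("ec2")


def lambda_handler(event, context):
    """Turn the QA environment on or off based on the 'action' input parameter."""
    action = event.get("action", "off")

    if action == "on":
        # Start the database first, then the compute resources that depend on it.
        rds.start_db_instance(DBInstanceIdentifier=RDS_INSTANCE)
        ec2.start_instances(InstanceIds=[EC2_INSTANCE])
        for service in ECS_SERVICES:
            ecs.update_service(cluster=ECS_CLUSTER, service=service, desiredCount=1)
    else:
        # Scale the ECS services to zero before stopping the EC2 instance and the database.
        for service in ECS_SERVICES:
            ecs.update_service(cluster=ECS_CLUSTER, service=service, desiredCount=0)
        ec2.stop_instances(InstanceIds=[EC2_INSTANCE])
        rds.stop_db_instance(DBInstanceIdentifier=RDS_INSTANCE)

    return {"status": f"environment turned {action}"}
```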

At the end of these empirical tests, we chose the Lambda function approach because it was more cost-effective.

Once we had a way to turn the environment on and off with a single action, the only thing left to do was to automate the process. To do so, we configured schedules in Amazon EventBridge that trigger the Lambda function to turn the environment on or off at the times agreed upon with all the teams.
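The sketch below shows one way to register such schedules with boto3; the cron expressions, rule names, and Lambda ARN are examples only and not the actual schedule agreed upon with the teams. Note that the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical ARN of the start/stop Lambda described above.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:qa-environment-switch"

# Times are in UTC; these cron expressions are illustrative, not our real schedule.
schedules = {
    "qa-environment-on":  ("cron(0 11 ? * MON-FRI *)", {"action": "on"}),
    "qa-environment-off": ("cron(0 23 ? * MON-FRI *)", {"action": "off"}),
}

for rule_name, (cron, payload) in schedules.items():
    events.put_rule(Name=rule_name, ScheduleExpression=cron, State="ENABLED")
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "qa-environment-switch",
            "Arn": LAMBDA_ARN,
            "Input": json.dumps(payload),  # delivered to the Lambda as its event
        }],
    )
```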

With that, we were able to reduce the infrastructure cost of the alternative environments for that customer by 42%. Some services, such as OpenSearch or Memcached, present a particular situation: they cannot easily be turned off and on without losing their prior state. Since stopping them would change their behavior compared to the production environment, we left them out of the optimization process.

Final adjustments and on-demand startup

Before ending the process, there were two steps left to cover. 

On the one hand, it was necessary to provide a quick way to turn the environment on, so that in an emergency any team could start it and get to work, for example to test a hotfix in the QA environment. To achieve this, we implemented an API Gateway endpoint that calls the Lambda function: in short, a URL that starts the environment immediately when needed. This URL was shared with the different teams that might need it.
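A minimal sketch of that wiring, using the API Gateway "quick create" for HTTP APIs, is shown below. The account ID, region, and function ARN are placeholders, and in practice the Lambda handler must treat an HTTP invocation as a start request (the proxy event from API Gateway does not carry the "action" parameter used by the EventBridge schedules).

```python
import boto3

apigw = boto3.client("apigatewayv2")
lam = boto3.client("lambda")

# Hypothetical ARN of the same start/stop Lambda; account and region are placeholders.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:qa-environment-switch"

# Quick-create an HTTP API whose default route proxies every request to the Lambda.
api = apigw.create_api(
    Name="qa-environment-start",
    ProtocolType="HTTP",
    Target=LAMBDA_ARN,
)

# Grant API Gateway permission to invoke the function.
lam.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId="allow-apigateway-invoke",
    Action="lambda:InvokeFunction",
    Principal="apigateway.amazonaws.com",
    SourceArn=f"arn:aws:execute-api:us-east-1:123456789012:{api['ApiId']}/*",
)

print("Share this URL with the teams:", api["ApiEndpoint"])
```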

On the other hand, some tasks and jobs that were meant to run periodically would no longer run while the environment was turned off. In our case, we analyzed whether their schedules could be changed in both the alternative and production environments. This turned out to be possible, so a couple of changes in the scheduling process were enough.

Conclusion

In this article we have shown how, by analyzing the usage patterns of different resources and performing some relatively simple tasks, it is possible to reduce costs in development environments that have idle periods, and either keep the savings or reinvest them in improvements to the final product. Although in this example we applied the optimization to AWS, it is easy to replicate on other platforms such as Microsoft Azure or Google Cloud Platform.
