Deploying JupyterLab on AWS with Terraform and Docker: a learning experience.
Recently, a fellow alumnus of Flatiron School's Data Science bootcamp mentioned to me that they had been tasked with deploying JupyterLab (a data science user interface for creating and interacting with notebooks of data science code) on Amazon Web Services. My last job title was Data Scientist, but my career has been trending towards DevOps recently, so I thought this task would be good practice in provisioning AWS infrastructure with Terraform and configuring it a bit along the way. Here are some notes on my process and what I learned; the not-quite-final code can be found on my GitHub here.
Today I am going to focus on three things: what Terraform is, with some DevOps context for those coming at this from an ML/DS perspective; my experience using it for this project, along with how the current code works; and where I'd go from here to improve the code.
So what's Terraform? Terraform is an IaC (Infrastructure as Code) tool, meaning that it manages infrastructure (in this case, AWS infrastructure such as the EC2 instances that run the computation behind JupyterLab) through code instead of manual tinkering. Specifically, Terraform is an infrastructure provisioning tool, meant to spin infrastructure up and tear it down using a declarative approach: you describe the state you want your infrastructure to be in and, provided the code is valid, Terraform works out the intermediate steps to get there. I find this declarative approach fun to work with. It is popular across DevOps, including in tools like Kubernetes that orchestrate very impressive infrastructure and processes. I chose Terraform because I had worked with it before and found it to be an impressive tool for rapid IaC work. The number of times I ran terraform apply and terraform destroy (two of Terraform's main commands, which provision towards the desired state and destroy the infrastructure, respectively) in quick succession was...considerable, but considerably fun as I figured out how best to set up a barebones deployment.
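To make the declarative idea concrete, here is a minimal sketch rather than the actual code from the repo. The provider version constraint, the default region, the instance type, the tag, and the variable names are all assumptions chosen for illustration.

```hcl
# Minimal sketch: declare the desired state and let Terraform converge on it.
# Every name and value below is an illustrative assumption, not the repo's code.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed version constraint
    }
  }
}

variable "region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1" # assumed default
}

variable "ami_id" {
  description = "AMI for the instance; no default, so it must be supplied"
  type        = string
}

provider "aws" {
  region = var.region
}

# Desired state: one EC2 instance of this type, with this tag, exists.
resource "aws_instance" "jupyterlab" {
  ami           = var.ami_id
  instance_type = "t3.medium" # assumed size

  tags = {
    Name = "jupyterlab-sketch"
  }
}
```

With a file like this in place, terraform init downloads the provider, terraform plan previews what would change, terraform apply creates the instance, and terraform destroy removes it again. Nothing in the file says how to create or delete anything; it only states what should exist.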
That barebones deployment was going to be on a less than barebones platform, so into the AWS mines we go! As the chosen deployment platform for this project, AWS is very powerful, but with that power come complexity and some user-friendliness issues that Terraform can help alleviate, given a grasp of basic to intermediate AWS concepts. I started by loading the AWS Provider (Terraform works through Providers, which talk to the APIs of a huge range of platforms such as AWS in order to operate on the given infrastructure). I then provisioned the EC2 instance that would host JupyterLab, along with an SSH key pair for accessing the instance and security groups to determine how traffic could flow in and out of it. Docker is installed on the instance, and JupyterLab is run as a Docker container, by a shell script that runs when the infrastructure is provisioned. That was the bulk of the work, but there was a lot of learning and troubleshooting along the way on how to improve it. Eventually I moved everything out of a single main.tf Terraform file into several files to separate concerns, added infrastructure features such as Elastic Block Store (EBS) storage so that the data from working in JupyterLab persists via Docker volumes, and implemented AWS and Terraform best practices such as avoiding hardcoded values where possible.
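The pieces described above could fit together roughly as follows. This is a hedged sketch, not the repo's code: it assumes the AWS provider is already configured, a default VPC, an Amazon Linux 2023 AMI (hence dnf), the quay.io/jupyter/base-notebook image, port 8888, and hypothetical resource and variable names throughout. Formatting and mounting the EBS volume inside the instance is deliberately left out, so the Docker volume shown here lives on the root disk.

```hcl
# Sketch only: all names, sizes, ports, and the container image are assumptions.

variable "ami_id" {
  description = "Amazon Linux 2023 AMI ID (assumed OS)"
  type        = string
}

variable "public_key_path" {
  description = "Path to the SSH public key to upload"
  type        = string
}

variable "allowed_cidr" {
  description = "CIDR block allowed to reach SSH and JupyterLab"
  type        = string
}

# SSH key pair used to log in to the instance.
resource "aws_key_pair" "jupyter" {
  key_name   = "jupyterlab-key"
  public_key = file(var.public_key_path)
}

# Security group: SSH and JupyterLab in, all traffic out.
resource "aws_security_group" "jupyter" {
  name        = "jupyterlab-sg"
  description = "SSH and JupyterLab access"

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.allowed_cidr]
  }

  ingress {
    description = "JupyterLab"
    from_port   = 8888
    to_port     = 8888
    protocol    = "tcp"
    cidr_blocks = [var.allowed_cidr]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 instance; the user_data shell script runs once on first boot.
resource "aws_instance" "jupyter" {
  ami                    = var.ami_id
  instance_type          = "t3.medium"
  key_name               = aws_key_pair.jupyter.key_name
  vpc_security_group_ids = [aws_security_group.jupyter.id]

  user_data = <<-EOF
    #!/bin/bash
    # Install and start Docker (package name assumes Amazon Linux 2023).
    dnf install -y docker
    systemctl enable --now docker
    # Run JupyterLab in a container, keeping notebooks in a named Docker volume.
    docker run -d --name jupyterlab --restart unless-stopped \
      -p 8888:8888 \
      -v jupyter-work:/home/jovyan/work \
      quay.io/jupyter/base-notebook
  EOF
}

# Extra EBS volume in the same availability zone as the instance.
resource "aws_ebs_volume" "notebooks" {
  availability_zone = aws_instance.jupyter.availability_zone
  size              = 20 # GiB, assumed
}

# Attach the volume; formatting and mounting it is not shown here.
resource "aws_volume_attachment" "notebooks" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.notebooks.id
  instance_id = aws_instance.jupyter.id
}
```

Passing the AMI, key path, and allowed CIDR in as variables is one small instance of the "avoid hardcoding" practice mentioned above. Splitting the variables, security group, and instance into separate .tf files in the same directory would not change the behavior, since Terraform loads every .tf file in the directory together.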
I've kept the project's code itself out of this blog, as I believe the GitHub repo, especially going through the commits to see how it was built up over time and reading my thought process in the README, is more instructive than finished code presented without the context of how I got there. With that said, I'll mention some improvements that I'd like to make to this starter project going forward: implementing security best practices, such as those around JupyterLab authentication; looking at autoscaling groups, or even Kubernetes on EKS (among other potential solutions), for moving this sort of infrastructure deployment towards production workloads; and including more input and output variables to make the Terraform files and shell script more flexible and usable.
Taking on this use case for Terraform has proven to be exciting and educational. I now have a barebones, imperfect, learning-experience Terraform setup for provisioning a suite of data science development tools on AWS. A lot of issues in developing data projects can be caused by environment mishaps and misconfiguration, and provisioning with declarative infrastructure allows an environment to be replicated, spun up, destroyed, and more, all rapidly. Along with using Docker, I think this is where data science development is going to trend more and more, most likely in commercial settings, with services such as Google Colab as an example of plug-and-play tools for coding notebooks.
One of the reasons I put work into this project to try and streamline setting up a data science development environment is that, as technologists, our time is precious, so it means a lot to me that you've taken the time to read this blog. Leave a comment if you have any thoughts (terra)forming in your head about this piece and the intersection of DevOps with data science. Feel free to reach out via the email and social media listed on this site if you need anything or would like to chat about DevOps or anything else that catches your eye in my web presence. Generally, don't be a stranger, and if you got something from this post, consider sharing it and definitely check out the code linked at the top for the full story!