Cron scripts are responsible for critical Slack functionality. They ensure that reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over time, both the number of cron scripts and the amount of data these scripts process have increased. While these cron scripts generally executed as expected, the reliability of their execution occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and build a better way to execute cron scripts reliably at scale.
Running cron scripts at Slack started in the way you might expect. There was one node with a copy of all the scripts to run and one crontab file with the schedules for all the scripts. The node was responsible for executing the scripts locally on their specified schedule. Over time, the number of scripts grew, and the amount of data each script processed also grew. For a while, we could keep moving to bigger nodes with more CPU and more RAM; that kept things running most of the time. But the setup still wasn’t that reliable: with one box running, any issue with provisioning, rotation, or configuration would bring the service to a halt, taking some key Slack functionality with it. After continuously adding more and more patches to the system, we decided it was time to build something new: a reliable and scalable cron execution service. This article details some key components and considerations of this new system.
When designing this new, more reliable service, we decided to leverage many existing services to decrease the amount we had to build, and thus the amount we have to maintain going forward. The new service consists of three main components:
- A new Golang service called the “Scheduled Job Conductor”, run on Bedrock, Slack’s wrapper around Kubernetes
- Slack’s Job Queue, an asynchronous compute platform that executes a high volume of work quickly and efficiently
- A Vitess table for job deduplication and monitoring, to create visibility around job runs and failures
Scheduled Job Conductor
The Golang service mimics cron functionality by leveraging a Golang cron library. The library we chose allowed us to keep the same cron string format that we used on the original cron box, which made migration simpler and less error prone. We used Bedrock, Slack’s wrapper around Kubernetes, so that we could scale up multiple pods easily. We don’t use all the pods to process jobs. Instead, we use Kubernetes Leader Election to designate one pod to do the scheduling and keep the other pods in standby mode so that one of them can quickly take over if needed. To make this transition between pods seamless, we implemented logic to prevent the node from going down at the top of a minute when possible, since, given the nature of cron, that is when scripts are most likely to need scheduling.

It might first appear that having more nodes processing work instead of just one would better solve our problems, since we wouldn’t have a single point of failure and we wouldn’t have one pod doing all the memory- and CPU-intensive work. However, we decided that synchronizing the nodes would be more of a headache than a help, for two reasons. First, the pods can switch leaders very quickly, making downtime unlikely in practice. Second, we could offload almost all of the memory- and CPU-intensive work of actually running the scripts to Slack’s Job Queue and use the pod just for the scheduling component. Thus, we have one pod scheduling and several other pods waiting in the wings.
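To make the “avoid the top of the minute” idea concrete, here is a minimal sketch in Python (not the Go code the service actually uses). The 5-second guard window and the function names `seconds_until_safe_step_down` and `tick` are assumptions made for illustration; the article does not describe how the real logic is implemented.

```python
from datetime import datetime

# Hypothetical guard window: seconds on either side of the minute boundary
# during which the leader should avoid stepping down.
GUARD_SECONDS = 5


def seconds_until_safe_step_down(now: datetime, guard: int = GUARD_SECONDS) -> float:
    """Return how long the leader should wait before giving up leadership."""
    second = now.second + now.microsecond / 1_000_000
    if second < guard:
        # Just past the top of the minute: wait until the guard window ends.
        return guard - second
    if second >= 60 - guard:
        # Approaching the top of the minute: wait through the boundary and
        # the guard window that follows it.
        return (60 - second) + guard
    return 0.0


def tick(is_leader: bool, due_scripts: list[str], enqueue) -> int:
    """Enqueue due scripts only on the leader; standby pods do nothing."""
    if not is_leader:
        return 0
    for name in due_scripts:
        enqueue(name)
    return len(due_scripts)
```

A pod asked to shut down at 12:00:58 would, under these assumptions, wait 7 seconds so that the jobs due at 12:01:00 are scheduled before a standby takes over.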
That brings us to Slack’s Job Queue. The Job Queue is an asynchronous compute platform that runs about 9 billion “jobs” (or pieces of work) per day. It consists of a set of logical “queues” that jobs flow through. In simple terms, these “queues” are a logical way to move jobs through Kafka (for durable storage should the system encounter a failure or get backed up) into Redis (for short-term storage that allows additional metadata about who is executing the job to be stored alongside the job) and then finally to a “job worker”, a node ready to execute the code, which actually runs the job. See this article for more detail. In our case, a job is a single script. Even though it’s an asynchronous compute platform, it can execute work very quickly if that work is isolated on its own “queue”, which is how we were able to take advantage of this system. Leveraging this platform allowed us to offload our compute and memory concerns onto an existing system that could already handle the load (and much, much more). Additionally, since this system already exists and is critical to how Slack works, we reduced our initial build time and our maintenance effort going forward, which is an excellent win!
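The flow described above can be pictured with a toy, in-memory model. This is only a sketch: the `ToyJobQueue` class and its method names are hypothetical, plain Python containers stand in for Kafka and Redis, and nothing here reflects the Job Queue’s real API. It shows why a job on its own queue does not wait behind unrelated work.

```python
from collections import deque


class ToyJobQueue:
    """In-memory stand-in with one durable log and one in-flight store per queue."""

    def __init__(self):
        self.durable = {}    # stands in for Kafka: queue name -> deque of job ids
        self.in_flight = {}  # stands in for Redis: queue name -> {job id: worker}

    def enqueue(self, queue, job_id):
        self.durable.setdefault(queue, deque()).append(job_id)

    def claim(self, queue, worker):
        """Move the oldest job on `queue` to the in-flight store; return its id."""
        jobs = self.durable.get(queue)
        if not jobs:
            return None
        job_id = jobs.popleft()
        # Record which worker is executing the job alongside the job itself.
        self.in_flight.setdefault(queue, {})[job_id] = worker
        return job_id

    def finish(self, queue, job_id):
        self.in_flight[queue].pop(job_id)


q = ToyJobQueue()
for i in range(1000):
    q.enqueue("default", f"job-{i}")   # a backlog of unrelated work
q.enqueue("cron", "cleanup_db")        # the cron script has its own queue
claimed = q.claim("cron", "worker-1")  # picked up without waiting on the backlog
```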
Vitess Database Table
Finally, to round out our service, we employed a Vitess table to handle deduplication and report job tracking to internal users (other Slack engineers). Our previous cron system used flock, a Linux utility for managing locks in scripts, to ensure that only one copy of a script was running at a time. Most scripts usually satisfy this only-one requirement on their own. However, a few scripts take longer than their recurrence interval, so two copies could end up running at the same time. In our new system, we record each job execution as a new row in a table and update the job’s state as it moves through the system (enqueued, in progress, done). Thus, when we want to kick off a new run of a job, we can check that there isn’t one running already by querying the table for active jobs. We use an index on script names to make this querying fast.
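The deduplication check can be sketched as follows, using SQLite as a stand-in for Vitess so the example is self-contained. The table name `job_runs`, its columns, the state strings, and the helper functions are all assumptions made for illustration; the article does not give the real schema.

```python
import sqlite3

# States in which a run still counts as active (hypothetical names).
ACTIVE_STATES = ("enqueued", "in_progress")


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE job_runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               script_name TEXT NOT NULL,
               state TEXT NOT NULL
           )"""
    )
    # Index on the script name so the active-run lookup stays fast.
    conn.execute("CREATE INDEX idx_job_runs_script ON job_runs (script_name)")
    return conn


def try_start_run(conn, script_name):
    """Insert an 'enqueued' row unless an active run exists; return its id or None."""
    row = conn.execute(
        "SELECT 1 FROM job_runs WHERE script_name = ? AND state IN (?, ?) LIMIT 1",
        (script_name, *ACTIVE_STATES),
    ).fetchone()
    if row is not None:
        return None  # a previous run is still enqueued or in progress
    cur = conn.execute(
        "INSERT INTO job_runs (script_name, state) VALUES (?, 'enqueued')",
        (script_name,),
    )
    conn.commit()
    return cur.lastrowid


def set_state(conn, run_id, state):
    """Update a run's state as it moves through the system."""
    conn.execute("UPDATE job_runs SET state = ? WHERE id = ?", (state, run_id))
    conn.commit()
```

The check-then-insert here is not atomic. That is acceptable in this sketch only because a single scheduler issues the calls; with several concurrent writers it would need a transaction or a uniqueness constraint.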
Furthermore, since we’re recording the job state in the table, the table also serves as the backing for a simple web page with cron script execution information, so that users can easily look up the state of their script runs and any errors they encountered. This page is especially useful because some scripts can take up to an hour to run, so users want to be able to verify that the script is still running and that the work they’re expecting hasn’t failed.
Overall, our new service for executing cron scripts has made the process more reliable, scalable, and user friendly. While having a crontab on a single cron box got us quite far, it started causing us a lot of pain and wasn’t keeping up with Slack’s scale. This new system will give Slack the room it needs to grow, both now and far into the future.
Want to help us work on systems like this? We’re hiring! Apply now