From SSH Sessions to Job IDs: Slack's EMR Orchestration Rewrite
The architectural contract moved from a shell on a host to an API call with a lifecycle, and that shift, not the platform itself, is the part other data and platform teams can copy.
Most platform teams running Airflow on Amazon EMR eventually inherit the same default: operators that SSH directly into a cluster's master node to submit work, and a fleet of long-lived SSH sessions with no clear owner. Slack's data platform team has spent the last year removing that default, and the replacement contract is the part worth studying.
The migration, reported by InfoQ, moved more than 700 Airflow operators and their jobs across 8 data regions off direct SSH access to EMR master nodes and onto an internal orchestration layer called Quarry. As part of the change, direct SSH access to production EMR clusters was removed altogether. The architectural shape is what makes it interesting: Airflow now submits jobs to Quarry over HTTP, Quarry assigns each job a server-side identifier, and the job's lifecycle, including submission, tracking, retry, and cancellation, lives on the server rather than the client.
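To make the submit-then-track shape concrete, here is a minimal client-side sketch in Python. It is not Quarry's actual API, which the InfoQ report does not spell out: the `/jobs` paths, the `job_id` and `state` fields, the state names, and the `run_job` and `fake_transport` helpers are all assumptions made for illustration. The transport is injected so the example runs without a network.

```python
import time
from typing import Callable, Dict

# Hypothetical transport: takes (method, path, body) and returns a JSON-like dict.
# In a real operator this would wrap an HTTP client; here it is injected.
Transport = Callable[[str, str, dict], dict]

# Assumed terminal state names; the real service may use different ones.
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_job(transport: Transport, spec: dict,
            poll_seconds: float = 0.0, max_polls: int = 100) -> str:
    """Submit a job, then poll its server-side status until it is terminal."""
    # The server assigns the identifier; the client only has to remember it.
    job_id = transport("POST", "/jobs", spec)["job_id"]
    for _ in range(max_polls):
        state = transport("GET", f"/jobs/{job_id}", {})["state"]
        if state in TERMINAL:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not terminal after {max_polls} polls")


def fake_transport() -> Transport:
    """In-memory stand-in for the service: RUNNING twice, then SUCCEEDED."""
    polls: Dict[str, int] = {}

    def call(method: str, path: str, body: dict) -> dict:
        if method == "POST" and path == "/jobs":
            job_id = f"job-{len(polls) + 1}"
            polls[job_id] = 0
            return {"job_id": job_id}
        job_id = path.rsplit("/", 1)[-1]
        polls[job_id] += 1
        return {"state": "SUCCEEDED" if polls[job_id] > 2 else "RUNNING"}

    return call
```

Each poll in this sketch is an independent request, so a dropped connection costs one status check rather than the job itself. Calling `run_job(fake_transport(), {"script": "etl.py"})` walks through two `RUNNING` polls and returns on the third.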
That last detail is the structural lesson. The old pattern assumed that whoever started a job was responsible for staying connected to it. When an SSH session dropped, the job might keep running, fail silently, or end up in a state the orchestration layer could not see. Key rotation across hundreds of operators became its own operational project, and the audit trail for what had actually run lived in the logs of every host that had ever accepted an SSH connection. Replacing all of that with REST calls and a job ID is not just a security move. It relocates the failure mode from the client to the server, where idempotent submission, server-tracked status, and a clean cancellation path become a contract rather than a hope.
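One way to picture idempotent submission and server-tracked status is a small in-memory registry. This is a sketch under stated assumptions, not a description of how Quarry is built: the `JobRegistry` and `Job` classes, the `idempotency_key` parameter, and the `PENDING` state name are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Job:
    job_id: str
    spec: dict
    state: str = "PENDING"  # assumed initial state name


@dataclass
class JobRegistry:
    """Hypothetical server-side store that owns job identity and status."""
    jobs: Dict[str, Job] = field(default_factory=dict)
    by_key: Dict[str, str] = field(default_factory=dict)  # idempotency key -> job_id

    def submit(self, idempotency_key: str, spec: dict) -> Job:
        # A repeated submission with the same key returns the existing job
        # instead of launching a duplicate.
        existing = self.by_key.get(idempotency_key)
        if existing is not None:
            return self.jobs[existing]
        job = Job(job_id=uuid.uuid4().hex, spec=spec)
        self.jobs[job.job_id] = job
        self.by_key[idempotency_key] = job.job_id
        return job

    def status(self, job_id: str) -> str:
        # Status is read from the server's record, not from a live session.
        return self.jobs[job_id].state


registry = JobRegistry()
first = registry.submit("daily_rollup:2024-01-01", {"script": "rollup.py"})
retry = registry.submit("daily_rollup:2024-01-01", {"script": "rollup.py"})
```

Here `first` and `retry` refer to the same job, so a client that times out and resubmits does not double-run the work. A natural key in an Airflow setting might combine the DAG, task, and run identifiers, though that choice is also an assumption.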
For teams running Airflow on EMR at scale, the takeaway is not "adopt Quarry." The takeaway is that the boundary between the orchestrator and the compute substrate can be an API rather than a shell. That change unlocks things the SSH pattern could not, including centralized audit logs of job submissions, predictable retry and cancellation semantics, and the ability for a job to outlive the connection that started it without becoming invisible.
The pattern is not free. A REST orchestration layer is only as good as its handling of EMR instability, retry storms, and the moments when a cluster disappears mid-job. As the InfoQ write-up notes, the server-side lifecycle is the core of the design, so that is where the failure handling has to live. Teams copying the pattern should expect to spend the most time on three cases: Quarry cannot reach the cluster, an EMR node dies with a job still attached, and a cancellation arrives after the work has already finished. The contract is stronger than the old one, but it is not magic.
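Two of those edge cases can be written down as explicit state transitions. The sketch below is one plausible set of rules, not Slack's: the `State` names, the `cancel` and `on_cluster_lost` functions, and the bounded-attempts policy are assumptions chosen to show how a late cancellation and a lost cluster might be resolved on the server.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Once a job reaches one of these, no later event changes its state.
TERMINAL = {State.SUCCEEDED, State.FAILED, State.CANCELLED}


def cancel(state: State) -> State:
    """A cancellation that arrives after the job finished is a no-op."""
    return state if state in TERMINAL else State.CANCELLED


def on_cluster_lost(state: State, attempts: int, max_attempts: int) -> State:
    """Requeue a running job whose cluster vanished, up to a fixed attempt cap.

    The cap is what keeps a flapping cluster from producing a retry storm.
    """
    if state is not State.RUNNING:
        return state
    return State.PENDING if attempts < max_attempts else State.FAILED
```

Under these rules `cancel(State.SUCCEEDED)` leaves the job as `SUCCEEDED`, while `on_cluster_lost(State.RUNNING, 3, 3)` gives up and marks it `FAILED` rather than retrying forever. Real systems usually add backoff between attempts, which this sketch leaves out.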
The lesson generalizes beyond EMR. Anywhere a data platform has grown up around operators that shell into compute hosts to start work, the same swap is available: replace the persistent remote session with a job submission API, give the server ownership of the lifecycle, and stop relying on the client to keep the connection alive. Slack's migration is the worked example. The pattern is what other teams should take home.