Encrypting a Live Redshift Cluster: What AWS Doesn’t Tell You About Timing
Dev took four hours. Prod took fifteen minutes. Here's the honest breakdown of why that might have happened, and what to measure before you schedule your maintenance window.
You go to the AWS Console, search for Redshift, find your cluster, open the configuration, flip the encryption setting on, wait a bit, and move on with your life.
Well, that’s not the version I lived.
The version I lived involved an unscheduled four-hour maintenance window for a dev cluster encryption, an entire fear-based timeline built around that single data point, that timeline communicated upward to stakeholders as the expected production impact, and then watching prod encrypt in fifteen minutes.
Fifteen. Minutes. Let that sink in.
I’m not exaggerating. And the moment it finished, my first instinct wasn’t relief. It was suspicion. I immediately looked to verify the encryption actually happened, because something that was supposed to take four hours finishing in fifteen minutes does not inspire confidence. It inspires the kind of paranoia that makes you check twice.
This is what AWS documentation won’t give you.
The mechanism is documented well enough, the real-world timing variance is not. That gap is where production surprises live, and I want to close it for you.
What’s Actually Happening Under the Hood
Before you can understand the timing, you have to understand what AWS is actually doing when you encrypt an existing Redshift cluster.
Because it is not a simple config change.
It is not flipping a flag.
It is not a background re-key operation that runs while your cluster stays online.
What AWS does is take a snapshot of your cluster, restore that snapshot into a new encrypted cluster, and copy the data across. The whole time this is happening, your cluster is in a modifying state, and it is completely unavailable for writes.
Read availability during this window depends on your setup, but you should plan for full unavailability and be pleasantly surprised if you get less.
The safest way to handle this in production is the snapshot-and-restore approach: you keep your existing unencrypted cluster live while the new encrypted cluster builds in the background from the snapshot, then you cut over.
It gives you a fallback if something goes wrong.
It’s more work to coordinate the cutover, especially if you have active ETL pipelines and connections you need to redirect, but the operational safety is worth it.
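As a sketch, the snapshot-and-restore path looks roughly like this with the AWS CLI. The cluster identifiers, snapshot name, and key ARN are placeholders for your environment, and you should confirm that your CLI version supports the `--encrypted` flag on restore before relying on it:

```shell
# Sketch only: identifiers and the key ARN are placeholders.

# 1. Snapshot the live, unencrypted cluster.
aws redshift create-cluster-snapshot \
  --cluster-identifier your-cluster \
  --snapshot-identifier pre-encryption-cutover

# 2. Block until the snapshot is ready.
aws redshift wait snapshot-available \
  --snapshot-identifier pre-encryption-cutover

# 3. Restore into a NEW encrypted cluster while the old one keeps serving traffic.
aws redshift restore-from-cluster-snapshot \
  --cluster-identifier your-cluster-encrypted \
  --snapshot-identifier pre-encryption-cutover \
  --encrypted \
  --kms-key-id "arn:aws:kms:us-east-1:123456789012:key/your-key-id"

# 4. Cut over connections and ETL once the new cluster is available, then retire the old one.
aws redshift wait cluster-available \
  --cluster-identifier your-cluster-encrypted
```

Anything written to the old cluster between the snapshot and the cutover won't be in the new one, so you still need a write freeze or a delta-sync step before you flip DNS and connection strings.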
If you’re using Terraform to manage your cluster, the resource change that triggers encryption looks straightforward:
resource "aws_redshift_cluster" "main" {
  cluster_identifier = "your-cluster"
  encrypted          = true
  kms_key_id         = aws_kms_key.redshift.arn
  # ...
}
What Terraform doesn’t communicate well is that this single boolean change initiates a full cluster replacement under the hood.
Plan your apply window accordingly.
The Real Variable Is Storage, Not Nodes
Here’s what I had wrong before this operation: I assumed that node count and cluster size were the primary indicators of how long encryption would take.
My best explanation for the timing difference is the storage footprint.
RA3 nodes decouple compute from storage, and the encryption operation works at the managed storage layer, not local disk. So the amount of data sitting in managed storage at encryption time is the variable most likely to drive duration.
The dev cluster had fewer nodes than prod.
The dev cluster also had four years of test data, sample loads, half-finished experiments, tables nobody had ever vacuumed, storage bloat from abandoned ETL attempts, and basically everything short of actual production data.
It was a little messier, as dev environments tend to be.
The prod cluster was clean.
Regular vacuuming.
Consistent analyze runs.
ETL hygiene that the team had actually maintained.
A smaller effective data footprint as a result.
That’s the most technically coherent explanation I have for why dev took four hours and prod took fifteen minutes.
But I can’t rule out other factors: AWS infrastructure prioritization, snapshot queue depth at the time, whatever else was running in the environment. None of those were controlled.
This was production experience, not a benchmark. I’d be careful about treating it as a universal law.
What I’d say with confidence: the storage footprint explanation is the most actionable one, because it’s the only variable you can actually influence before you encrypt.
All the CPU spike troubleshooting you do before the operation (the vacuum runs, the analyze jobs, the sort key hygiene work) directly reduces your effective storage footprint. Cleaning up your cluster before you encrypt it is not just good practice. It is a concrete lever on your maintenance window duration, and even if the timing is partly AWS infrastructure factors you can’t see, a smaller footprint can only help.
Before you schedule anything, run this:
select * from stv_node_storage_capacity;

The actual used capacity is your real timing indicator. Not node type. Not cluster size.
That number.
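To turn that into a single number, you can aggregate it yourself. This is a sketch: as I understand it, the used and capacity columns report per-node 1 MB blocks, so the sums give you a rough footprint in MB, but verify the units against the system table docs for your cluster version:

```sql
-- Rough total footprint across nodes (values assumed to be 1 MB blocks per node)
select
    sum(used)     as used_mb,
    sum(capacity) as capacity_mb,
    round(100.0 * sum(used) / sum(capacity), 1) as pct_used
from stv_node_storage_capacity;
```

The used_mb figure is the one to record before and after your cleanup pass, so you can see how much lever you actually pulled.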
Post-encryption, don’t just trust the console. Verify it yourself:
select "table", encoded, skew_rows
from svv_table_info
limit 20;

Check the encrypted column in the cluster properties via the AWS CLI as well:
aws redshift describe-clusters \
--cluster-identifier your-cluster \
--query 'Clusters[0].Encrypted'

If it returns true, it worked.
Don’t skip this step.
Finishing in fifteen minutes made me check three times.
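While you have the CLI open, it’s worth confirming which key the cluster was actually encrypted with, not just that encryption is on. Same placeholder cluster identifier as above:

```shell
# Should print the ARN of the KMS key you intended, not a surprise default.
aws redshift describe-clusters \
  --cluster-identifier your-cluster \
  --query 'Clusters[0].KmsKeyId' \
  --output text
```

That distinction matters for the next section, because getting the key wrong means doing the whole migration again.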
KMS Keys and a Decision You Can’t Easily Undo
KMS key selection is a decision that deserves more attention than it usually gets.
You have two options.
AWS-managed keys, which are simpler to set up and don’t require you to manage key rotation yourself.
Customer-managed CMKs, which give you more control and are required for certain compliance frameworks that specify key ownership explicitly.
AWS-managed keys are fine for many use cases.
Customer-managed CMKs are the right choice when you have cross-account access requirements, specific key rotation policies, or compliance requirements that mandate you hold the keys.
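If you go the customer-managed route in Terraform, a minimal key definition that matches the aws_kms_key.redshift reference in the earlier snippet might look like this. The description and alias name are placeholders:

```hcl
resource "aws_kms_key" "redshift" {
  description         = "CMK for Redshift at-rest encryption"
  enable_key_rotation = true # automatic annual rotation
}

resource "aws_kms_alias" "redshift" {
  name          = "alias/redshift-prod"
  target_key_id = aws_kms_key.redshift.key_id
}
```

Referencing the key through an alias in tooling and dashboards also makes a future key swap less painful to communicate, even though the swap itself still costs you a migration.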
What nobody emphasizes enough: once your cluster is encrypted, changing the KMS key triggers the exact same full snapshot-restore-copy operation again.
Another outage window.
Another timing unknown.
Another stakeholder communication.
So choose your key strategy before you encrypt, not after.
If you’re in a multi-account architecture, be aware that cross-account KMS adds complexity to the key policy.
The Redshift service principal in the cluster’s account needs explicit grant permissions on a CMK that lives in another account. This is solvable, but it’s not the default behavior and it will bite you if you don’t account for it in your key policy before you start.
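As a sketch of what that looks like in the key-owning account: a key policy statement that lets the cluster’s account (a placeholder ID here) use the key and manage the grants Redshift relies on. Treat the exact action list as an assumption to verify against your own security review:

```json
{
  "Sid": "AllowRedshiftAccountUseOfCMK",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey",
    "kms:CreateGrant",
    "kms:ListGrants",
    "kms:RevokeGrant"
  ],
  "Resource": "*"
}
```

The "Resource": "*" is normal in a key policy (it means "this key"), and delegating to the account root lets the cluster account scope access down further with its own IAM policies.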
Bundling Your Maintenance Windows
One thing I wish I’d thought about earlier, if you have other maintenance operations pending on your cluster, bundle them with the encryption.
A classic resize.
A node type migration.
Schema cleanup that requires a cluster restart.
Each of these is its own outage window if you do them separately.
If you can sequence them intelligently into a single window, you take one availability hit instead of three.
The planning overhead is worth it, and your stakeholders will thank you for minimizing disruption.
The CPU spike investigation I ran before encryption, chasing an unrelated performance issue, turned out to be the best pre-encryption prep I did.
So vacuum tables you haven’t touched in months.
Analyze across schemas that are running on stale statistics.
Identify and drop tables that aren’t being used.
By the time you schedule the encryption window, the cluster will be in better shape than it has been in a year.
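A sketch of that cleanup pass, with a placeholder schema and table; svv_table_info reports size in 1 MB blocks, which makes it a reasonable way to pick targets:

```sql
-- Find the largest candidates first (size is in 1 MB blocks)
select "table", size, tbl_rows, unsorted, stats_off
from svv_table_info
order by size desc
limit 20;

-- Then, per target table: reclaim deleted space and re-sort rows...
vacuum full analytics.events;

-- ...and refresh planner statistics
analyze analytics.events;
```

High unsorted or stats_off percentages on big tables are the ones that pay off fastest, both for the encryption footprint and for everyday query performance.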
Fifteen minutes.
What This Doesn’t Solve
Encryption at rest protects your data if someone walks out with the underlying storage.
It does not protect you from a misconfigured security group.
It does not protect you from overly permissive IAM policies.
It does not replace network-level controls, and it does not mean you can relax your access management.
Encryption is one layer. It is not the whole security story.
What I still don’t know is whether there’s a meaningful performance difference between AWS-managed and customer-managed key encryption for the ongoing operation of the cluster post-encryption.
The prod cluster performance after encryption looked normal, but I haven’t done a rigorous before-and-after benchmark.
The AWS documentation suggests the impact is negligible.
I’d want to measure that myself on a cluster under real query load before I’d call it settled.
The broader lesson here is not that prod will always be faster than dev. That’s a dangerous takeaway.
The lesson is that you have to measure your actual storage footprint before you can predict anything about Redshift operational timing.
Assumptions built on node count will fail you.
The stv_node_storage_capacity query is non-negotiable.
Plan for the four-hour window.
Run the query.
Communicate the conservative estimate upward.
And if you’re lucky, finish in fifteen minutes.
That’s not a failure. That’s under-promise, over-deliver applied to infrastructure. The stakeholder who expected four hours of downtime and got fifteen minutes will remember that.
Build that trust everywhere you can.
What’s your experience with Redshift encryption timing in production?
I’d love to hear how your numbers compared in the comments.
With Love and DevOps,
Maxine
Last Updated: March 2026
If this kind of production detail is useful, all of my learning resources go deeper on the AWS and infrastructure patterns that make operations like this make sense.
LLMs for Humans: From Prompts to Production is for engineers who want to understand how language model systems are actually built and deployed, not just how to write prompts.
The DevOps Career Switch Blueprint covers the AWS and IaC foundations for engineers building the career to go with the knowledge.



