The DevOps and Cloud Engineer Interview Questions That Actually Separate Candidates
The IaC, AWS, and incident response questions worth asking in a DevOps interview, and what the answers tell you about whether someone has actually been paged at 2am or just studied for the screen.
The stereotypical DevOps interview is the one where someone asks you to explain CI/CD, you say “it automates deployments,” they nod, and everyone moves on.
That interview exists. And it produces bad hires.
The interviews that actually find strong engineers look different.
They start with a concept and then follow the thread into production reality, into the failure modes, into the “what would you do at 2am when this breaks” territory.
The questions below are the ones I’ve been asked, the ones I’ve asked, and the ones I think about when I’m trying to figure out whether someone has actually run these systems or just read about them.
IaC and Terraform
1. “Walk me through what happens when you run terraform apply on a resource that already exists in AWS but isn’t in your state file.”
This one separates people immediately.
The wrong answer is “Terraform creates it.” Terraform doesn’t know it exists.
It will try to create a new resource with the same name or configuration, which either fails with a conflict error or succeeds and gives you a duplicate you now have to clean up.
The right answer involves importing the resource into state first:
terraform import aws_s3_bucket.my_bucket my-existing-bucket
And then understanding that import only pulls the resource into state. It doesn’t generate the configuration. You still have to write the HCL to match what’s actually deployed, then run a plan to verify the diff is clean.
Strong candidates will also mention that this is exactly the situation you end up in after someone clicks through the console instead of using Terraform. It happens. The question is whether you know how to recover.
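The recovery workflow can be sketched like this. The resource address and bucket name are the ones from the command above; the HCL is something you write by hand to match what is actually deployed, then verify with a plan. (Terraform 1.5+ also supports declarative `import` blocks and `terraform plan -generate-config-out` to draft the HCL for you, which is worth mentioning as a bonus answer.)

```hcl
# Step 1 (CLI, outside this file): pull the existing bucket into state.
#   terraform import aws_s3_bucket.my_bucket my-existing-bucket
#
# Step 2: write configuration that matches what is actually deployed.
# If it doesn't match, the next plan will show a diff you need to resolve
# before anyone runs apply.
resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-existing-bucket"
}

# Step 3 (CLI): verify the plan is clean.
#   terraform plan   # should report no changes
```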
2. “What’s the difference between terraform taint and terraform destroy on a single resource, and when would you use each?”
terraform taint marks a resource for replacement on the next apply without destroying it immediately. (Bonus points if the candidate knows that taint is deprecated on modern Terraform in favor of terraform apply -replace=ADDRESS, which folds the same intent and the plan review into one step.)
terraform destroy -target destroys it right now.
The practical difference matters in production: taint gives you a chance to review the plan before the replacement happens. Destroy is immediate.
The follow-up that matters: what happens when the resource you’re replacing has other resources depending on it?
If a security group is attached to an EC2 instance and you taint the security group, the plan will show you the cascade. If you don’t read it carefully, you end up replacing more than you intended.
I’ve seen this play out with ACM certificates attached to load balancers. Taint the cert, miss the dependency chain in the plan, and suddenly your apply is trying to replace the listener too.
Read your plans.
AWS Architecture
3. “Your Lambda function is hitting the 15-minute timeout. Walk me through how you’d redesign the architecture.”
The answer they’re looking for isn’t “increase the timeout.” You can’t: 15 minutes is a hard limit. The answer is decomposition.
The most common pattern is breaking the work into chunks and using SQS or Step Functions to orchestrate them.
SQS gives you natural parallelism and retry handling.
Step Functions gives you visibility into where in a long workflow something failed, with state management built in.
The architecture question underneath this is: what’s the unit of work? If your Lambda is doing one monolithic thing, you need to split it into stages.
If it’s processing records one at a time from a large dataset, you need to rethink whether Lambda is even the right compute for that workload. ECS Fargate tasks have no timeout ceiling and are often the right answer for long-running batch operations.
A strong candidate will ask clarifying questions before prescribing a solution.
What is the Lambda doing?
Is it I/O bound or CPU bound?
Is it processing events or running a batch job?
The architecture depends on the answers.
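One common shape of the SQS answer, sketched in Terraform. The resource names here are hypothetical, and the worker Lambda itself is assumed to be defined elsewhere; the point is the pattern: a producer enqueues chunks, and each worker invocation handles one batch well under the 15-minute ceiling.

```hcl
# Hypothetical fan-out: one queue of work chunks, consumed by a worker Lambda.
resource "aws_sqs_queue" "work_chunks" {
  name = "work-chunks"
  # Visibility timeout should exceed the worker Lambda's own timeout,
  # so a message isn't redelivered while an invocation is still running.
  visibility_timeout_seconds = 960
}

resource "aws_lambda_event_source_mapping" "worker" {
  event_source_arn = aws_sqs_queue.work_chunks.arn
  function_name    = aws_lambda_function.worker.arn # assumed defined elsewhere
  batch_size       = 10                             # records per invocation
}
```

SQS also gives you retries and a dead-letter queue for free, which is usually the follow-up discussion.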
4. “Explain the difference between a NAT Gateway and a NAT instance, and when the right answer is neither.”
NAT Gateway is managed, scales automatically, costs money per hour plus per GB processed.
NAT instance is an EC2 instance running NAT software, gives you more control, requires you to manage it, historically cheaper at scale but operationally heavier.
The “when the right answer is neither” part is what I’m listening for.
If your workload in a private subnet only needs to reach AWS services, the right answer is VPC endpoints. No NAT required, no data transfer charges, traffic stays on the AWS network. For S3 and DynamoDB you want Gateway endpoints, which are free. For most other AWS services you want Interface endpoints, which cost money but are still often cheaper than routing everything through a NAT Gateway.
A lot of engineers have NAT Gateways routing traffic to S3 that could be going through a Gateway endpoint for free.
Not a security win. Not a performance win. Just an unnecessary bill.
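The fix is a few lines of Terraform. This is a hedged sketch: the VPC, route table, and region (`us-east-1`) are assumptions you'd replace with your own.

```hcl
# Gateway endpoint for S3: traffic from private subnets reaches S3 over the
# AWS network with no NAT Gateway in the path and no per-GB NAT charges.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id # assumed to exist
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  # The endpoint adds routes for S3 to these route tables automatically.
  route_table_ids = [aws_route_table.private.id]
}
```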
5. “You’re getting intermittent 5xx errors from an ALB target group. The instances are healthy. Walk me through your investigation.”
Healthy instances with 5xx errors is one of the most common patterns that generates “the load balancer is broken” tickets.
It isn’t broken.
The instances are returning 5xx at the application layer. The ALB health check is passing because health checks are typically shallow: a /health endpoint returning 200 tells the ALB the instance is alive. It says nothing about whether the application can handle real traffic.
Investigation path:
Look at the ALB access logs first, not the instance logs. The ALB logs will show you the backend status code, the response time, and which target handled the request.
Look at the application logs on the targets that returned 5xx. Is it a specific endpoint? A specific request pattern? A downstream dependency timing out?
Common culprits:
Database connection pool exhaustion
Third-party API the application depends on
Memory pressure causing the application to degrade under load
A deployment that went partially wrong and left some instances running old code with a bug
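A quick triage over the ALB access logs can tell you whether the 5xx responses cluster on specific targets, which is exactly the partial-deployment signal from the last bullet. In the ALB access log format, field 5 is target:port and field 10 is the target status code. The sample lines below stand in for real downloaded logs; with actual logs you'd pipe `zcat` over the gzipped files from S3 instead.

```shell
# Stand-in for real ALB access logs pulled from the logging bucket.
cat > /tmp/alb-sample.log <<'EOF'
http 2026-03-01T00:00:00Z app/my-alb/123 10.0.0.1:54321 10.0.1.10:8080 0.001 0.050 0.000 502 502 0 0 "GET http://example.com:80/ HTTP/1.1"
http 2026-03-01T00:00:01Z app/my-alb/123 10.0.0.2:54322 10.0.1.11:8080 0.001 0.020 0.000 200 200 0 0 "GET http://example.com:80/ HTTP/1.1"
http 2026-03-01T00:00:02Z app/my-alb/123 10.0.0.3:54323 10.0.1.10:8080 0.001 0.060 0.000 500 500 0 0 "GET http://example.com:80/ HTTP/1.1"
EOF

# Count backend 5xx responses per target: errors concentrated on one target
# point at that instance, spread evenly they point at a shared dependency.
awk '$10 ~ /^5/ { count[$5]++ } END { for (t in count) print count[t], t }' /tmp/alb-sample.log
# → 2 10.0.1.10:8080
```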
Observability and Incidents
6. “What’s the difference between a metric, a log, and a trace, and when is each the right tool?”
Metrics tell you something is wrong.
Logs tell you what happened.
Traces tell you where the time went.
Translated into an incident: a high error rate on a metric gets you paged. The logs tell you it’s a NullPointerException in a specific service. The trace shows you that the request hit Service A, which called Service B, which called a database that took 8 seconds to respond, which caused Service B to time out, which caused Service A to throw the error.
You need all three. Engineers who’ve only worked with logs and metrics and haven’t used distributed tracing yet tend to dramatically underestimate how hard it is to debug latency issues in microservices architectures without it.
7. “Walk me through how you’d design an alerting strategy that doesn’t wake people up unnecessarily.”
Alert on symptoms, not causes.
Alert when users are affected, not when a metric crosses an arbitrary threshold that may or may not matter.
High CPU on an instance is not an alert.
High CPU that’s causing latency to exceed your SLO is an alert.
Everything else is a dashboard, a log entry, or an investigation you do during business hours. On-call fatigue from noisy alerts is a real organizational problem, and it’s almost always caused by alerting on causes instead of symptoms. Engineers who’ve been on a bad on-call rotation understand this immediately. Engineers who haven’t yet will.
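The symptom-versus-cause distinction shows up directly in how an alarm is written. A hedged Terraform sketch of the symptom side: page on p99 latency breaching the SLO, not on CPU. The load balancer name, threshold, and SNS topic are all placeholders.

```hcl
# Symptom-based page: ALB p99 latency over the SLO for five straight minutes.
# High CPU with latency still inside the SLO never fires this.
resource "aws_cloudwatch_metric_alarm" "latency_slo" {
  alarm_name          = "alb-p99-latency-over-slo"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p99"
  dimensions          = { LoadBalancer = "app/my-alb/1234567890abcdef" } # placeholder
  period              = 60
  evaluation_periods  = 5
  threshold           = 1.0 # seconds; substitute your SLO
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.oncall.arn] # assumed to exist
}
```

CPU, memory, and the rest stay on a dashboard where you look at them during the investigation, not in the pager.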
What These Questions Are Actually Testing
Every question above is testing the same underlying thing: does this person understand what systems actually do in production, or do they understand what systems are supposed to do in theory?
Theory is fine as a foundation.
Production is where the interesting problems live.
The candidate who says “I’d look at the CloudWatch logs” when you ask about the 5xx errors knows the tool exists. The candidate who says “I’d start with ALB access logs because they show backend response codes and I want to know if this is the application or the load balancer before I go anywhere else” has debugged this before.
That’s the difference.
What questions do you get in DevOps interviews that you think actually predict real-world performance?
I’d love to hear them in the comments.
With Love and DevOps,
Maxine
Last Updated: March 2026
If this kind of depth is useful to you, the DevOps Career Switch Blueprint covers the AWS and IaC foundations that make these interview answers feel natural rather than rehearsed.
And if you’re working with AI tooling in your infrastructure, LLMs for Humans: From Prompts to Production covers how these systems actually work under the hood.