Site Reliability Engineer (SRE) Interview Questions
A deep-dive guide for SREs and Platform Engineers, focusing on Error Budgets, SLIs/SLOs, Observability, Kubernetes orchestration, Incident Response, and system scalability.
Total Questions: 400
Difficulty Levels: Beginner, Intermediate, Advanced
Problem
Level
2.What is the difference between SRE and DevOps?
Medium
3.What is the difference between SRE and traditional operations?
Medium
4.What are the main responsibilities of an SRE?
Easy
5.What is the error budget?
Medium
6.How do you calculate error budget?
Medium
7.What happens when error budget is exhausted?
Medium
8.What is SLI (Service Level Indicator)?
Easy
9.What is SLO (Service Level Objective)?
Easy
10.What is SLA (Service Level Agreement)?
Easy
11.What is the difference between SLI, SLO, and SLA?
Medium
12.Give examples of good SLIs for a web service.
Easy
13.How do you set SLOs?
Medium
14.What is the recommended SLO target (e.g., 99.9% vs 99.99%)?
Medium
15.What is toil in SRE?
Easy
16.How do you measure toil?
Medium
17.What is the acceptable percentage of toil for SREs?
Easy
18.How do you reduce toil?
Medium
19.What is the 50% engineering time rule in SRE?
Medium
20.What is a blameless postmortem?
Easy
21.What should be included in a postmortem document?
Medium
22.What is the difference between incident and outage?
Easy
23.What is mean time to detect (MTTD)?
Easy
24.What is mean time to repair/recover (MTTR)?
Easy
25.What is mean time between failures (MTBF)?
Medium
26.What is the difference between monitoring and observability?
Medium
27.What are the three pillars of observability?
Easy
28.What is the difference between metrics, logs, and traces?
Medium
29.What are the four golden signals of monitoring?
Medium
30.Explain latency as a golden signal.
Easy
31.Explain traffic as a golden signal.
Easy
32.Explain errors as a golden signal.
Easy
33.Explain saturation as a golden signal.
Medium
34.What metrics would you monitor for a web application?
Easy
35.What metrics would you monitor for a database?
Medium
36.What metrics would you monitor for a Kubernetes cluster?
Medium
37.What is white-box monitoring vs black-box monitoring?
Medium
38.What is Prometheus and how does it work?
Easy
39.What is the Prometheus data model?
Medium
40.What is PromQL?
Medium
41.Write a PromQL query to calculate request rate.
Medium
42.Write a PromQL query to calculate error rate.
Hard
43.Write a PromQL query for p95 latency.
Hard
44.What is Grafana and how do you use it with Prometheus?
Easy
45.What is the Pushgateway in Prometheus?
Medium
46.What is service discovery in Prometheus?
Medium
47.What is the ELK/EFK stack?
Easy
48.What is structured logging vs unstructured logging?
Medium
49.What is distributed tracing?
Medium
50.What is Jaeger or Zipkin?
Easy
51.What are spans and traces in distributed tracing?
Medium
52.What is sampling in tracing?
Hard
53.How do you correlate logs, metrics, and traces?
Hard
54.What is cardinality in monitoring and why does it matter?
Hard
55.What is alert fatigue and how do you prevent it?
Easy
56.What makes a good alert?
Medium
57.What is the difference between alert and notification?
Easy
58.What are the alert severity levels you use?
Easy
59.How do you write actionable alerts?
Medium
60.What is alert routing?
Easy
61.What is escalation policy?
Medium
62.What is PagerDuty/OpsGenie/VictorOps?
Easy
63.How do you handle alert fatigue?
Medium
64.What is alert suppression vs alert inhibition?
Hard
65.What is flapping in alerting?
Medium
66.How do you reduce false positives?
Medium
67.What is on-call rotation?
Easy
68.What is a reasonable on-call schedule?
Medium
69.How do you handle burnout from on-call?
Medium
70.What do you do when you get paged at 3 AM?
Easy
71.What is runbook automation?
Medium
72.What should be in an on-call runbook?
Easy
73.What is the difference between page-worthy and non-page-worthy alerts?
Medium
74.How do you prioritize multiple simultaneous incidents?
Hard
75.What is alert aggregation?
Medium
76.What is maintenance window and how do you handle alerts during it?
Easy
77.How do you measure on-call quality?
Medium
78.What is time to acknowledge (TTA)?
Easy
79.What is time to resolve (TTR)?
Easy
80.How do you improve MTTR?
Medium
81.What is an incident?
Easy
82.What are incident severity levels?
Easy
83.What are SEV-1, SEV-2, and SEV-3 incidents?
Easy
84.What is the incident response process?
Medium
85.What are the roles in incident response (IC, scribe, communications)?
Medium
86.What is an Incident Commander (IC)?
Medium
87.What are the responsibilities of IC?
Medium
88.What is incident communication strategy?
Hard
89.How do you communicate during an outage?
Easy
90.What is a status page?
Easy
91.What is incident timeline?
Medium
92.How do you conduct an incident call/war room?
Medium
93.What is incident escalation?
Easy
94.When do you escalate an incident?
Medium
95.What is incident mitigation vs resolution?
Medium
96.What is rollback vs roll-forward during incident?
Medium
97.What is incident retrospective/postmortem?
Easy
98.What is blameless culture?
Easy
99.What is the 5 Whys technique?
Medium
100.What is root cause analysis (RCA)?
Easy
101.What is contributing factor vs root cause?
Hard
102.What should be in a postmortem document?
Easy
103.What are action items from a postmortem?
Medium
104.How do you track action items?
Medium
105.How do you prevent incident recurrence?
Medium
106.What is incident review meeting?
Easy
107.How do you learn from incidents?
Easy
108.What is chaos engineering and how does it relate to incidents?
Hard
109.What is game day exercise?
Medium
110.What is disaster recovery drill?
Hard
111.How do you design a highly available system?
Medium
112.What is the difference between high availability and fault tolerance?
Medium
113.What is redundancy?
Easy
114.What is active-active vs active-passive architecture?
Medium
115.What is load balancing?
Easy
116.What are load balancing algorithms?
Medium
117.What is health check in load balancing?
Easy
118.What is circuit breaker pattern?
Hard
119.What is retry logic and exponential backoff?
Medium
120.What is rate limiting?
Easy
121.What is throttling vs rate limiting?
Medium
122.What is caching strategy?
Medium
123.What is cache invalidation?
Medium
124.What is CDN and when to use it?
Easy
125.What is database replication?
Medium
126.What is master-slave vs multi-master replication?
Medium
127.What is database sharding?
Hard
128.What is horizontal vs vertical scaling?
Easy
129.What is stateless vs stateful application?
Medium
130.How do you design for failure?
Medium
131.What is graceful degradation?
Medium
132.What is bulkhead pattern?
Hard
133.What is timeout strategy?
Medium
134.What is idempotency and why is it important?
Hard
135.What is eventual consistency?
Hard
136.What is CAP theorem?
Hard
137.How do you handle single point of failure (SPOF)?
Easy
138.What is disaster recovery (DR)?
Easy
139.What is RTO (Recovery Time Objective)?
Medium
140.What is RPO (Recovery Point Objective)?
Medium
141.What is capacity planning?
Easy
142.How do you forecast capacity needs?
Medium
143.What is the difference between load and capacity?
Easy
144.What is headroom in capacity planning?
Medium
145.What is utilization vs saturation?
Hard
146.How do you measure system capacity?
Medium
147.What is performance testing?
Easy
148.What is load testing vs stress testing?
Medium
149.What is spike testing?
Medium
150.What is soak testing (endurance testing)?
Medium
151.What tools are used for load testing (JMeter, Gatling, Locust)?
Easy
152.What is latency budget?
Hard
153.What is throughput?
Easy
154.What is the difference between latency and throughput?
Easy
155.What is queueing theory in SRE?
Hard
156.What is Little's Law?
Hard
157.How do you optimize database performance?
Medium
158.How do you optimize API performance?
Medium
159.What is connection pooling?
Medium
160.What is database query optimization?
Medium
161.What is indexing strategy?
Medium
162.What is N+1 query problem?
Hard
163.How do you identify performance bottlenecks?
Medium
164.What is profiling?
Hard
165.What is the USE method (Utilization, Saturation, Errors)?
Hard
166.What is Kubernetes in SRE context?
Easy
167.What Kubernetes metrics do you monitor?
Medium
168.What is pod crash loop backoff?
Easy
169.What is OOMKilled error?
Medium
170.How do you troubleshoot pending pods?
Medium
171.What are resource requests vs limits?
Medium
172.What is HPA (Horizontal Pod Autoscaler)?
Easy
173.What is VPA (Vertical Pod Autoscaler)?
Hard
174.What is cluster autoscaler?
Medium
175.What is liveness probe vs readiness probe?
Medium
176.How do you set probe thresholds?
Medium
177.What is PodDisruptionBudget (PDB)?
Hard
178.What is node affinity and pod affinity?
Hard
179.What are taints and tolerations?
Hard
180.How do you perform rolling updates safely?
Medium
181.What is deployment strategy (RollingUpdate, Recreate)?
Easy
182.How do you rollback a deployment?
Easy
183.What is StatefulSet and when to use it?
Medium
184.What is DaemonSet use case?
Easy
185.What is resource quota and limit range?
Hard
186.How do you monitor Kubernetes cluster health?
Medium
187.What is kube-state-metrics?
Medium
188.What is node exporter?
Easy
189.What is the kubectl top command?
Easy
190.How do you debug a container?
Easy
191.What are ephemeral containers?
Hard
192.What are Kubernetes Events?
Medium
193.How do you handle persistent storage in K8s?
Medium
194.What is StorageClass?
Medium
195.What is CNI (Container Network Interface)?
Hard
196.How do you check system load average?
Easy
197.What do the 1-, 5-, and 15-minute load averages mean?
Medium
198.How do you identify a process with high CPU usage?
Easy
199.How do you identify high memory usage?
Easy
200.What is the difference between memory and swap?
Easy
201.What is OOM killer?
Medium
202.How do you troubleshoot disk space issues?
Medium
203.What is inode and how can you run out of inodes?
Hard
204.How do you find which process is using a file?
Easy
205.What is the lsof command?
Medium
206.How do you check network connections?
Easy
207.What is netstat/ss command?
Easy
208.How do you troubleshoot DNS issues?
Medium
209.How do you trace network packets (tcpdump, Wireshark)?
Medium
210.What is strace and when do you use it?
Hard
211.What are system calls?
Medium
212.How do you check process threads?
Medium
213.What is context switching?
Hard
214.What are soft vs hard limits (ulimit)?
Medium
215.How do you tune kernel parameters (sysctl)?
Hard
216.What is a file descriptor and how to increase limits?
Medium
217.What is the TCP TIME_WAIT state?
Hard
218.What are the TCP connection states?
Hard
219.How do you troubleshoot performance using perf?
Hard
220.What is eBPF and its use in SRE?
Hard
221.How do you automate toil?
Easy
222.What is configuration management (Ansible, Puppet, Chef)?
Easy
223.What is infrastructure as code (Terraform, CloudFormation)?
Easy
224.What is the difference between imperative and declarative IaC?
Medium
225.What is GitOps?
Medium
226.What is reconciliation loop?
Hard
227.How do you manage secrets in automation?
Medium
228.What is idempotency in automation?
Medium
229.How do you test infrastructure code?
Hard
230.What is policy as code?
Hard
231.What is continuous deployment vs continuous delivery?
Medium
232.How do you implement safe deployment practices?
Medium
233.What is canary deployment?
Easy
234.What is blue-green deployment?
Medium
235.What is feature flag?
Easy
236.How do you automate rollback?
Hard
237.What is progressive delivery?
Hard
238.What is deployment pipeline?
Easy
239.How do you automate incident response?
Hard
240.What is ChatOps?
Easy
241.How do you monitor database health?
Easy
242.What is connection pool exhaustion?
Medium
243.How do you troubleshoot slow queries?
Medium
244.What is query execution plan?
Hard
245.How do you handle database failover?
Medium
246.What is replication lag?
Easy
247.What is split-brain problem in databases?
Hard
248.How do you perform database backup and restore?
Medium
249.What is point-in-time recovery (PITR)?
Hard
250.How do you test database backups?
Medium
251.What cloud platforms have you worked with?
Easy
252.How do you ensure reliability in cloud?
Medium
253.What is multi-AZ deployment?
Easy
254.What is multi-region deployment?
Medium
255.How do you handle cloud provider outages?
Hard
256.What is AWS Auto Scaling?
Easy
257.What are ELB health checks?
Easy
258.How do you monitor cloud resources?
Medium
259.What is CloudWatch vs Prometheus for cloud monitoring?
Medium
260.What is a distributed system?
Easy
261.What are challenges in distributed systems?
Medium
262.What is network partition?
Medium
263.What is split-brain scenario?
Hard
264.How do you handle eventual consistency?
Hard
265.What is distributed consensus (Raft, Paxos)?
Hard
266.What is service mesh (Istio, Linkerd)?
Medium
267.What is sidecar pattern?
Medium
268.What is API gateway?
Easy
269.What is backpressure in distributed systems?
Hard
270.How do you handle cascading failures?
Hard
271.What is bulkhead pattern in microservices?
Hard
272.What is timeout propagation?
Hard
273.Why is distributed tracing important?
Medium
274.How do you debug distributed systems?
Medium
275.What is observability in microservices?
Medium
276.What is security in SRE role?
Easy
277.How do you implement least privilege principle?
Medium
278.What is secrets management?
Medium
279.What is certificate rotation?
Medium
280.How do you handle security incidents?
Medium
281.What is DDoS mitigation?
Medium
282.What is rate limiting for security?
Easy
283.How do you monitor for security threats?
Medium
284.What is audit logging?
Easy
285.What is compliance monitoring?
Medium
286.How do you handle vulnerability patching?
Medium
287.What is patch management strategy?
Medium
288.What is security scanning in CI/CD?
Medium
289.How do you ensure data encryption?
Easy
290.What is principle of defense in depth?
Medium
291.Website is slow - how do you troubleshoot?
Medium
292.Database is down - what are your steps?
Medium
293.CPU is at 100% - how do you investigate?
Medium
294.Memory is exhausted - what do you do?
Medium
295.Disk is full - how do you handle it?
Easy
296.Pods are crash looping - how do you debug?
Easy
297.Load balancer shows unhealthy targets - what do you check?
Medium
298.Latency suddenly increased - how do you investigate?
Medium
299.Error rate spiked - what are your steps?
Easy
300.Traffic dropped to zero - how do you troubleshoot?
Medium
301.Deployment failed - how do you rollback?
Easy
302.Database replication lag is high - what do you do?
Medium
303.You're paged for high memory usage - what do you do?
Medium
304.SSL certificate expired - how do you handle?
Easy
305.DNS resolution failing - how do you troubleshoot?
Medium
306.Application throwing 500 errors - how do you debug?
Easy
307.Kafka consumer lag increasing - what do you check?
Hard
308.Redis cache hit rate dropped - what do you investigate?
Medium
309.Network latency between services increased - how do you debug?
Hard
310.Cloud cost suddenly increased - how do you investigate?
Medium
311.How would you design monitoring for a new service?
Medium
312.How would you implement zero-downtime deployment?
Medium
313.How would you handle a complete datacenter failure?
Hard
314.How would you migrate a service with zero downtime?
Hard
315.How would you handle Black Friday traffic spike?
Medium
316.How would you implement disaster recovery?
Medium
317.How would you reduce deployment time from 30 mins to 5 mins?
Hard
318.How would you improve MTTR for your team?
Medium
319.How would you handle a runaway process consuming resources?
Medium
320.How would you debug intermittent timeout issues?
Hard
321.Service is healthy but receiving no traffic - what do you check?
Medium
322.Memory leak - how do you identify the cause?
Medium
323.Database connections maxed out - what are your steps?
Medium
324.How do you handle third-party API outage?
Medium
325.How would you design SLOs for a payment service?
Medium
326.How would you reduce toil in your current role?
Easy
327.How would you handle competing incidents simultaneously?
Medium
328.How would you onboard a new service to your monitoring stack?
Easy
329.How would you implement chaos engineering?
Hard
330.How would you improve observability of a legacy system?
Hard
331.Tell me about your most challenging production incident.
Medium
332.How do you prioritize during an outage?
Easy
333.Describe a time you prevented a major incident.
Medium
334.How do you handle stress during critical incidents?
Easy
335.How do you balance reliability and feature velocity?