Site Reliability Engineer (SRE) Interview Questions

A deep-dive guide for SREs and Platform Engineers, focusing on Error Budgets, SLIs/SLOs, Observability, Kubernetes orchestration, Incident Response, and system scalability.

Total Questions: 400

Difficulty Levels:

Beginner, Intermediate, Advanced

Problem

Level

1.What is Site Reliability Engineering?

Easy

2.What is the difference between SRE and DevOps?

Medium

3.What is the difference between SRE and traditional operations?

Medium

4.What are the main responsibilities of an SRE?

Easy

5.What is an error budget?

Medium

6.How do you calculate an error budget?

Medium

7.What happens when the error budget is exhausted?

Medium

8.What is an SLI (Service Level Indicator)?

Easy

9.What is an SLO (Service Level Objective)?

Easy

10.What is an SLA (Service Level Agreement)?

Easy

11.What is the difference between SLI, SLO, and SLA?

Medium

12.Give examples of good SLIs for a web service.

Easy

13.How do you set SLOs?

Medium

14.What is the recommended SLO target (e.g., 99.9% vs 99.99%)?

Medium

15.What is toil in SRE?

Easy

16.How do you measure toil?

Medium

17.What is the acceptable percentage of toil for SREs?

Easy

18.How do you reduce toil?

Medium

19.What is the 50% engineering time rule in SRE?

Medium

20.What is a blameless postmortem?

Easy

21.What should be included in a postmortem document?

Medium

22.What is the difference between an incident and an outage?

Easy

23.What is mean time to detect (MTTD)?

Easy

24.What is mean time to repair/recover (MTTR)?

Easy

25.What is mean time between failures (MTBF)?

Medium

26.What is the difference between monitoring and observability?

Medium

27.What are the three pillars of observability?

Easy

28.What is the difference between metrics, logs, and traces?

Medium

29.What are the four golden signals of monitoring?

Medium

30.Explain latency as a golden signal.

Easy

31.Explain traffic as a golden signal.

Easy

32.Explain errors as a golden signal.

Easy

33.Explain saturation as a golden signal.

Medium

34.What metrics would you monitor for a web application?

Easy

35.What metrics would you monitor for a database?

Medium

36.What metrics would you monitor for a Kubernetes cluster?

Medium

37.What is white-box monitoring vs black-box monitoring?

Medium

38.What is Prometheus and how does it work?

Easy

39.What is the Prometheus data model?

Medium

40.What is PromQL?

Medium

41.Write a PromQL query to calculate request rate.

Medium

42.Write a PromQL query to calculate error rate.

Hard

43.Write a PromQL query for p95 latency.

Hard

44.What is Grafana and how do you use it with Prometheus?

Easy

45.What is the Pushgateway in Prometheus?

Medium

46.What is service discovery in Prometheus?

Medium

47.What is the ELK/EFK stack?

Easy

48.What is structured logging vs unstructured logging?

Medium

49.What is distributed tracing?

Medium

50.What is Jaeger or Zipkin?

Easy

51.What are spans and traces in distributed tracing?

Medium

52.What is sampling in tracing?

Hard

53.How do you correlate logs, metrics, and traces?

Hard

54.What is cardinality in monitoring and why does it matter?

Hard

55.What is alert fatigue and how do you prevent it?

Easy

56.What makes a good alert?

Medium

57.What is the difference between an alert and a notification?

Easy

58.What are the alert severity levels you use?

Easy

59.How do you write actionable alerts?

Medium

60.What is alert routing?

Easy

61.What is an escalation policy?

Medium

62.What is PagerDuty/OpsGenie/VictorOps?

Easy

63.How do you handle alert fatigue?

Medium

64.What is alert suppression vs alert inhibition?

Hard

65.What is flapping in alerting?

Medium

66.How do you reduce false positives?

Medium

67.What is an on-call rotation?

Easy

68.What is a reasonable on-call schedule?

Medium

69.How do you handle burnout from on-call?

Medium

70.What do you do when you get paged at 3 AM?

Easy

71.What is runbook automation?

Medium

72.What should be in an on-call runbook?

Easy

73.What is the difference between page-worthy and non-page-worthy alerts?

Medium

74.How do you prioritize multiple simultaneous incidents?

Hard

75.What is alert aggregation?

Medium

76.What is a maintenance window and how do you handle alerts during it?

Easy

77.How do you measure on-call quality?

Medium

78.What is time to acknowledge (TTA)?

Easy

79.What is time to resolve (TTR)?

Easy

80.How do you improve MTTR?

Medium

81.What is an incident?

Easy

82.What are incident severity levels?

Easy

83.What are SEV-1, SEV-2, and SEV-3 incidents?

Easy

84.What is the incident response process?

Medium

85.What are the roles in incident response (IC, scribe, communications)?

Medium

86.What is an Incident Commander (IC)?

Medium

87.What are the responsibilities of the IC?

Medium

88.What is an incident communication strategy?

Hard

89.How do you communicate during an outage?

Easy

90.What is a status page?

Easy

91.What is an incident timeline?

Medium

92.How do you conduct an incident call/war room?

Medium

93.What is incident escalation?

Easy

94.When do you escalate an incident?

Medium

95.What is incident mitigation vs resolution?

Medium

96.What is rollback vs roll-forward during an incident?

Medium

97.What is an incident retrospective/postmortem?

Easy

98.What is blameless culture?

Easy

99.What is the 5 Whys technique?

Medium

100.What is root cause analysis (RCA)?

Easy

101.What is a contributing factor vs a root cause?

Hard

102.What should be in a postmortem document?

Easy

103.What are action items from a postmortem?

Medium

104.How do you track action items?

Medium

105.How do you prevent incident recurrence?

Medium

106.What is an incident review meeting?

Easy

107.How do you learn from incidents?

Easy

108.What is chaos engineering, and how does it relate to incidents?

Hard

109.What is a game day exercise?

Medium

110.What is a disaster recovery drill?

Hard

111.How do you design a highly available system?

Medium

112.What is the difference between high availability and fault tolerance?

Medium

113.What is redundancy?

Easy

114.What is active-active vs active-passive architecture?

Medium

115.What is load balancing?

Easy

116.What are load balancing algorithms?

Medium

117.What is a health check in load balancing?

Easy

118.What is the circuit breaker pattern?

Hard

119.What is retry logic and exponential backoff?

Medium

120.What is rate limiting?

Easy

121.What is throttling vs rate limiting?

Medium

122.What is a caching strategy?

Medium

123.What is cache invalidation?

Medium

124.What is a CDN and when should you use it?

Easy

125.What is database replication?

Medium

126.Master-slave vs multi-master replication?

Medium

127.What is database sharding?

Hard

128.What is horizontal vs vertical scaling?

Easy

129.What is a stateless vs stateful application?

Medium

130.How do you design for failure?

Medium

131.What is graceful degradation?

Medium

132.What is the bulkhead pattern?

Hard

133.What is a timeout strategy?

Medium

134.What is idempotency and why is it important?

Hard

135.What is eventual consistency?

Hard

136.What is the CAP theorem?

Hard

137.How do you handle a single point of failure (SPOF)?

Easy

138.What is disaster recovery (DR)?

Easy

139.What is RTO (Recovery Time Objective)?

Medium

140.What is RPO (Recovery Point Objective)?

Medium

141.What is capacity planning?

Easy

142.How do you forecast capacity needs?

Medium

143.What is the difference between load and capacity?

Easy

144.What is headroom in capacity planning?

Medium

145.What is utilization vs saturation?

Hard

146.How do you measure system capacity?

Medium

147.What is performance testing?

Easy

148.What is load testing vs stress testing?

Medium

149.What is spike testing?

Medium

150.What is soak testing (endurance testing)?

Medium

151.What tools do you use for load testing (JMeter, Gatling, Locust)?

Easy

152.What is a latency budget?

Hard

153.What is throughput?

Easy

154.What is the difference between latency and throughput?

Easy

155.What is queueing theory in SRE?

Hard

156.What is Little's Law?

Hard

157.How do you optimize database performance?

Medium

158.How do you optimize API performance?

Medium

159.What is connection pooling?

Medium

160.What is database query optimization?

Medium

161.What is an indexing strategy?

Medium

162.What is the N+1 query problem?

Hard

163.How do you identify performance bottlenecks?

Medium

164.What is profiling?

Hard

165.What is the USE method (Utilization, Saturation, Errors)?

Hard

166.What is Kubernetes in an SRE context?

Easy

167.What Kubernetes metrics do you monitor?

Medium

168.What is the pod CrashLoopBackOff state?

Easy

169.What is an OOMKilled error?

Medium

170.How do you troubleshoot pending pods?

Medium

171.What are resource requests vs limits?

Medium

172.What is HPA (Horizontal Pod Autoscaler)?

Easy

173.What is VPA (Vertical Pod Autoscaler)?

Hard

174.What is the cluster autoscaler?

Medium

175.What is a liveness probe vs a readiness probe?

Medium

176.How do you set probe thresholds?

Medium

177.What is a PodDisruptionBudget (PDB)?

Hard

178.What are node affinity and pod affinity?

Hard

179.What are taints and tolerations?

Hard

180.How do you perform rolling updates safely?

Medium

181.What is a deployment strategy (RollingUpdate, Recreate)?

Easy

182.How do you roll back a deployment?

Easy

183.What is a StatefulSet and when should you use it?

Medium

184.What is the use case for a DaemonSet?

Easy

185.What are resource quotas and limit ranges?

Hard

186.How do you monitor Kubernetes cluster health?

Medium

187.What is kube-state-metrics?

Medium

188.What is node exporter?

Easy

189.What is the kubectl top command?

Easy

190.How do you debug a container?

Easy

191.What are ephemeral containers?

Hard

192.What are Kubernetes Events?

Medium

193.How do you handle persistent storage in K8s?

Medium

194.What is a StorageClass?

Medium

195.What is CNI (Container Network Interface)?

Hard

196.How do you check system load average?

Easy

197.What do the 1-, 5-, and 15-minute load averages mean?

Medium

198.How do you identify a process with high CPU usage?

Easy

199.How do you identify high memory usage?

Easy

200.What is the difference between memory and swap?

Easy

201.What is the OOM killer?

Medium

202.How do you troubleshoot disk space issues?

Medium

203.What is an inode and how can you run out of inodes?

Hard

204.How do you find which process is using a file?

Easy

205.What is the lsof command?

Medium

206.How do you check network connections?

Easy

207.What is the netstat/ss command?

Easy

208.How do you troubleshoot DNS issues?

Medium

209.How do you trace network packets (tcpdump, Wireshark)?

Medium

210.What is strace and when do you use it?

Hard

211.What are system calls?

Medium

212.How do you check process threads?

Medium

213.What is context switching?

Hard

214.What are soft vs hard limits (ulimit)?

Medium

215.How do you tune kernel parameters (sysctl)?

Hard

216.What is a file descriptor and how do you increase limits?

Medium

217.What is the TCP TIME_WAIT state?

Hard

218.What are the TCP connection states?

Hard

219.How do you troubleshoot performance using perf?

Hard

220.What is eBPF and its use in SRE?

Hard

221.How do you automate toil?

Easy

222.What is configuration management (Ansible, Puppet, Chef)?

Easy

223.What is infrastructure as code (Terraform, CloudFormation)?

Easy

224.What is the difference between imperative and declarative IaC?

Medium

225.What is GitOps?

Medium

226.What is a reconciliation loop?

Hard

227.How do you manage secrets in automation?

Medium

228.What is idempotency in automation?

Medium

229.How do you test infrastructure code?

Hard

230.What is policy as code?

Hard

231.What is continuous deployment vs continuous delivery?

Medium

232.How do you implement safe deployment practices?

Medium

233.What is a canary deployment?

Easy

234.What is a blue-green deployment?

Medium

235.What is a feature flag?

Easy

236.How do you automate rollback?

Hard

237.What is progressive delivery?

Hard

238.What is a deployment pipeline?

Easy

239.How do you automate incident response?

Hard

240.What is ChatOps?

Easy

241.How do you monitor database health?

Easy

242.What is connection pool exhaustion?

Medium

243.How do you troubleshoot slow queries?

Medium

244.What is a query execution plan?

Hard

245.How do you handle database failover?

Medium

246.What is replication lag?

Easy

247.What is the split-brain problem in databases?

Hard

248.How do you perform database backup and restore?

Medium

249.What is point-in-time recovery (PITR)?

Hard

250.How do you test database backups?

Medium

251.What cloud platforms have you worked with?

Easy

252.How do you ensure reliability in the cloud?

Medium

253.What is a multi-AZ deployment?

Easy

254.What is a multi-region deployment?

Medium

255.How do you handle cloud provider outages?

Hard

256.What is AWS Auto Scaling?

Easy

257.What are ELB health checks?

Easy

258.How do you monitor cloud resources?

Medium

259.What is CloudWatch vs Prometheus for cloud monitoring?

Medium

260.What is a distributed system?

Easy

261.What are challenges in distributed systems?

Medium

262.What is a network partition?

Medium

263.What is a split-brain scenario?

Hard

264.How do you handle eventual consistency?

Hard

265.What is distributed consensus (Raft, Paxos)?

Hard

266.What is a service mesh (Istio, Linkerd)?

Medium

267.What is the sidecar pattern?

Medium

268.What is an API gateway?

Easy

269.What is backpressure in distributed systems?

Hard

270.How do you handle cascading failures?

Hard

271.What is the bulkhead pattern in microservices?

Hard

272.What is timeout propagation?

Hard

273.Why is distributed tracing important?

Medium

274.How do you debug distributed systems?

Medium

275.What is observability in microservices?

Medium

276.What does security involve in the SRE role?

Easy

277.How do you implement the principle of least privilege?

Medium

278.What is secrets management?

Medium

279.What is certificate rotation?

Medium

280.How do you handle security incidents?

Medium

281.What is DDoS mitigation?

Medium

282.What is rate limiting for security?

Easy

283.How do you monitor for security threats?

Medium

284.What is audit logging?

Easy

285.What is compliance monitoring?

Medium

286.How do you handle vulnerability patching?

Medium

287.What is a patch management strategy?

Medium

288.What is security scanning in CI/CD?

Medium

289.How do you ensure data encryption?

Easy

290.What is the principle of defense in depth?

Medium

291.Website is slow - how do you troubleshoot?

Medium

292.Database is down - what are your steps?

Medium

293.CPU is at 100% - how do you investigate?

Medium

294.Memory is exhausted - what do you do?

Medium

295.Disk is full - how do you handle it?

Easy

296.Pods are crash looping - how do you debug?

Easy

297.Load balancer shows unhealthy targets - what do you check?

Medium

298.Latency suddenly increased - how do you investigate?

Medium

299.Error rate spiked - what are your steps?

Easy

300.Traffic dropped to zero - how do you troubleshoot?

Medium

301.Deployment failed - how do you roll back?

Easy

302.Database replication lag is high - what do you do?

Medium

303.You're paged for high memory usage - what do you do?

Medium

304.SSL certificate expired - how do you handle it?

Easy

305.DNS resolution failing - how do you troubleshoot?

Medium

306.Application throwing 500 errors - how do you debug?

Easy

307.Kafka consumer lag increasing - what do you check?

Hard

308.Redis cache hit rate dropped - what do you investigate?

Medium

309.Network latency between services increased - how do you debug?

Hard

310.Cloud cost suddenly increased - how do you investigate?

Medium

311.How would you design monitoring for a new service?

Medium

312.How would you implement zero-downtime deployment?

Medium

313.How would you handle a complete datacenter failure?

Hard

314.How would you migrate a service with zero downtime?

Hard

315.How would you handle a Black Friday traffic spike?

Medium

316.How would you implement disaster recovery?

Medium

317.How would you reduce deployment time from 30 minutes to 5 minutes?

Hard

318.How would you improve MTTR for your team?

Medium

319.How would you handle a runaway process consuming resources?

Medium

320.How would you debug intermittent timeout issues?

Hard

321.Service is healthy but receiving no traffic - what do you check?

Medium

322.Memory leak - how do you identify the cause?

Medium

323.Database connections maxed out - what are your steps?

Medium

324.How do you handle a third-party API outage?

Medium

325.How would you design SLOs for a payment service?

Medium

326.How would you reduce toil in your current role?

Easy

327.How would you handle competing incidents simultaneously?

Medium

328.How would you onboard a new service to your monitoring stack?

Easy

329.How would you implement chaos engineering?

Hard

330.How would you improve observability of a legacy system?

Hard

331.Tell me about your most challenging production incident.

Medium

332.How do you prioritize during an outage?

Easy

333.Describe a time you prevented a major incident.

Medium

334.How do you handle stress during critical incidents?

Easy

335.How do you balance reliability and feature velocity?

Medium

400.Post-deploy performance regression - how do you identify the cause?

Hard