Site Reliability Engineer (SRE) Interview Questions

A deep-dive guide for SREs and Platform Engineers, focusing on Error Budgets, SLIs/SLOs, Observability, Kubernetes orchestration, Incident Response, and system scalability.

Total Questions: 400
Difficulty Levels: Beginner, Intermediate, Advanced

1.What is Site Reliability Engineering?

2.What is the difference between SRE and DevOps?

3.What is the difference between SRE and traditional operations?

4.What are the main responsibilities of an SRE?

5.What is an error budget?

6.How do you calculate an error budget?
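
As a rough illustration of the arithmetic behind this question, the sketch below treats the error budget as the complement of the SLO target applied to a measurement window. The function names, the 30-day default window, and the request counts are assumptions chosen for the example, not part of any standard tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests allowed to fail under the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A time-based budget suits availability SLOs, while the request-based form suits SLIs measured as good events over total events; which one applies depends on how the SLI is defined.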

7.What happens when the error budget is exhausted?

8.What is SLI (Service Level Indicator)?

9.What is SLO (Service Level Objective)?

10.What is SLA (Service Level Agreement)?

11.What is the difference between SLI, SLO, and SLA?

12.Give examples of good SLIs for a web service.

13.How do you set SLOs?

14.What is the recommended SLO target (e.g., 99.9% vs 99.99%)?

15.What is toil in SRE?

16.How do you measure toil?

17.What is the acceptable percentage of toil for SREs?

18.How do you reduce toil?

19.What is the 50% engineering time rule in SRE?

20.What is a blameless postmortem?

21.What should be included in a postmortem document?

22.What is the difference between an incident and an outage?

23.What is mean time to detect (MTTD)?

24.What is mean time to repair/recover (MTTR)?

25.What is mean time between failures (MTBF)?

26.What is the difference between monitoring and observability?

27.What are the three pillars of observability?

28.What is the difference between metrics, logs, and traces?

29.What are the four golden signals of monitoring?

30.Explain latency as a golden signal.

31.Explain traffic as a golden signal.

32.Explain errors as a golden signal.

33.Explain saturation as a golden signal.

34.What metrics would you monitor for a web application?

35.What metrics would you monitor for a database?

36.What metrics would you monitor for a Kubernetes cluster?

37.What is white-box monitoring vs black-box monitoring?

38.What is Prometheus and how does it work?

39.What is the Prometheus data model?

40.What is PromQL?

41.Write a PromQL query to calculate request rate.

42.Write a PromQL query to calculate error rate.

43.Write a PromQL query for p95 latency.

44.What is Grafana and how do you use it with Prometheus?

45.What is the Pushgateway in Prometheus?

46.What is service discovery in Prometheus?

47.What is the ELK/EFK stack?

48.What is structured logging vs unstructured logging?

49.What is distributed tracing?

50.What is Jaeger or Zipkin?

51.What are spans and traces in distributed tracing?

52.What is sampling in tracing?

53.How do you correlate logs, metrics, and traces?

54.What is cardinality in monitoring and why does it matter?

55.What is alert fatigue and how do you prevent it?

56.What makes a good alert?

57.What is the difference between an alert and a notification?

58.What are the alert severity levels you use?

59.How do you write actionable alerts?

60.What is alert routing?

61.What is an escalation policy?

62.What are PagerDuty, OpsGenie, and VictorOps?

63.How do you handle alert fatigue?

64.What is alert suppression vs alert inhibition?

65.What is flapping in alerting?

66.How do you reduce false positives?

67.What is an on-call rotation?

68.What is a reasonable on-call schedule?

69.How do you handle burnout from on-call?

70.What do you do when you get paged at 3 AM?

71.What is runbook automation?

72.What should be in an on-call runbook?

73.What is the difference between page-worthy and non-page-worthy alerts?

74.How do you prioritize multiple simultaneous incidents?

75.What is alert aggregation?

76.What is a maintenance window, and how do you handle alerts during it?

77.How do you measure on-call quality?

78.What is time to acknowledge (TTA)?

79.What is time to resolve (TTR)?

80.How do you improve MTTR?

81.What is an incident?

82.What are incident severity levels?

83.What are SEV-1, SEV-2, and SEV-3 incidents?

84.What is the incident response process?

85.What are the roles in incident response (IC, scribe, communications)?

86.What is an Incident Commander (IC)?

87.What are the responsibilities of the IC?

88.What is an incident communication strategy?

89.How do you communicate during an outage?

90.What is a status page?

91.What is an incident timeline?

92.How do you conduct an incident call/war room?

93.What is incident escalation?

94.When do you escalate an incident?

95.What is incident mitigation vs resolution?

96.What is rollback vs roll-forward during an incident?

97.What is an incident retrospective/postmortem?

98.What is a blameless culture?

99.What is the 5 Whys technique?

100.What is root cause analysis (RCA)?

101.What is a contributing factor vs a root cause?

102.What should be in a postmortem document?

103.What are action items from a postmortem?

104.How do you track action items?

105.How do you prevent incident recurrence?

106.What is an incident review meeting?

107.How do you learn from incidents?

108.What is chaos engineering, and how does it relate to incidents?

109.What is a game day exercise?

110.What is a disaster recovery drill?

111.How do you design a highly available system?

112.What is the difference between high availability and fault tolerance?

113.What is redundancy?

114.What is active-active vs active-passive architecture?

115.What is load balancing?

116.What are load balancing algorithms?

117.What is a health check in load balancing?

118.What is the circuit breaker pattern?

119.What is retry logic and exponential backoff?
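
One possible way to illustrate this question is a small retry helper with capped exponential backoff and full jitter. This is a minimal sketch: the names `backoff_ceilings` and `retry_with_backoff`, the default delays, and the injectable `sleep` parameter are assumptions made for the example, and real code would normally retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_ceilings(base: float, cap: float, attempts: int) -> List[float]:
    """Upper bound of the wait before each retry: base * 2^n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` calls have failed."""
    ceilings = backoff_ceilings(base, cap, max_attempts)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # assumption: every exception is treated as retryable
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients do not retry in lockstep after a shared failure.
            sleep(random.uniform(0.0, ceilings[attempt]))
    raise RuntimeError("max_attempts must be at least 1")
```

The cap keeps the worst-case wait bounded, and the jitter spreads retries out in time, which helps avoid a synchronized retry storm against a service that is already struggling.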

120.What is rate limiting?

121.What is throttling vs rate limiting?

122.What is a caching strategy?

123.What is cache invalidation?

124.What is a CDN, and when should you use it?

125.What is database replication?

126.What is master-slave vs multi-master replication?

127.What is database sharding?

128.What is horizontal vs vertical scaling?

129.What is a stateless vs a stateful application?

130.How do you design for failure?

131.What is graceful degradation?

132.What is the bulkhead pattern?

133.What is a timeout strategy?

134.What is idempotency and why is it important?

135.What is eventual consistency?

136.What is the CAP theorem?

137.How do you handle a single point of failure (SPOF)?

138.What is disaster recovery (DR)?

139.What is RTO (Recovery Time Objective)?

140.What is RPO (Recovery Point Objective)?

141.What is capacity planning?

142.How do you forecast capacity needs?

143.What is the difference between load and capacity?

144.What is headroom in capacity planning?

145.What is utilization vs saturation?

146.How do you measure system capacity?

147.What is performance testing?

148.What is load testing vs stress testing?

149.What is spike testing?

150.What is soak testing (endurance testing)?

151.What tools do you use for load testing (JMeter, Gatling, Locust)?

152.What is a latency budget?

153.What is throughput?

154.What is the difference between latency and throughput?

155.What is queueing theory in SRE?

156.What is Little's Law?
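
Little's Law relates the long-run averages of a stable system as L = λ × W: items in the system equal arrival rate times time spent in the system. The sketch below applies it to request concurrency. The function names and the sample numbers are assumptions for illustration only.

```python
def in_flight_requests(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in the system."""
    return arrival_rate_per_s * mean_latency_s


def sustainable_rate(concurrency_limit: float, mean_latency_s: float) -> float:
    """Rearranged as lambda = L / W: arrival rate a fixed concurrency can sustain."""
    if mean_latency_s <= 0:
        raise ValueError("mean_latency_s must be positive")
    return concurrency_limit / mean_latency_s


if __name__ == "__main__":
    # 200 requests/s at 50 ms mean latency keeps roughly 10 requests in flight.
    print(in_flight_requests(200, 0.05))
    # A pool of 20 connections with 50 ms queries sustains roughly 400 queries/s.
    print(sustainable_rate(20, 0.05))
```

The same rearrangement is a quick sanity check in capacity planning: if latency doubles while the concurrency limit stays fixed, the sustainable request rate halves.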

157.How do you optimize database performance?

158.How do you optimize API performance?

159.What is connection pooling?

160.What is database query optimization?

161.What is an indexing strategy?

162.What is the N+1 query problem?

163.How do you identify performance bottlenecks?

164.What is profiling?

165.What is the USE method (Utilization, Saturation, Errors)?

166.What is Kubernetes in an SRE context?

167.What Kubernetes metrics do you monitor?

168.What is a pod CrashLoopBackOff?

169.What is an OOMKilled error?

170.How do you troubleshoot pending pods?

171.What are resource requests vs limits?

172.What is HPA (Horizontal Pod Autoscaler)?

173.What is VPA (Vertical Pod Autoscaler)?

174.What is the Cluster Autoscaler?

175.What is a liveness probe vs a readiness probe?

176.How do you set probe thresholds?

177.What is a PodDisruptionBudget (PDB)?

178.What are node affinity and pod affinity?

179.What are taints and tolerations?

180.How do you perform rolling updates safely?

181.What is a deployment strategy (RollingUpdate, Recreate)?

182.How do you roll back a deployment?

183.What is a StatefulSet, and when do you use it?

184.What is the use case for a DaemonSet?

185.What are resource quotas and limit ranges?

186.How do you monitor Kubernetes cluster health?

187.What is kube-state-metrics?

188.What is the node exporter?

189.What is the kubectl top command?

190.How do you debug a container?

191.What are ephemeral containers?

192.What are Kubernetes Events?

193.How do you handle persistent storage in K8s?

194.What is a StorageClass?

195.What is CNI (Container Network Interface)?

196.How do you check system load average?

197.What do the 1-, 5-, and 15-minute load averages mean?

198.How do you identify a process with high CPU usage?

199.How do you identify high memory usage?

200.What is the difference between memory and swap?

201.What is OOM killer?

202.How do you troubleshoot disk space issues?

203.What is an inode, and how can you run out of inodes?

204.How do you find which process is using a file?

205.What is the lsof command?

206.How do you check network connections?

207.What is the netstat/ss command?

208.How do you troubleshoot DNS issues?

209.How do you trace network packets (tcpdump, Wireshark)?

210.What is strace and when do you use it?

211.What are system calls?

212.How do you check process threads?

213.What is context switching?

214.What are soft vs hard limits (ulimit)?

215.How do you tune kernel parameters (sysctl)?

216.What is a file descriptor, and how do you increase limits?

217.What is the TCP TIME_WAIT state?

218.What are TCP connection states?

219.How do you troubleshoot performance using perf?

220.What is eBPF and its use in SRE?

221.How do you automate toil?

222.What is configuration management (Ansible, Puppet, Chef)?

223.What is infrastructure as code (Terraform, CloudFormation)?

224.What is the difference between imperative and declarative IaC?

225.What is GitOps?

226.What is a reconciliation loop?

227.How do you manage secrets in automation?

228.What is idempotency in automation?

229.How do you test infrastructure code?

230.What is policy as code?

231.What is continuous deployment vs continuous delivery?

232.How do you implement safe deployment practices?

233.What is canary deployment?

234.What is blue-green deployment?

235.What is a feature flag?

236.How do you automate rollback?

237.What is progressive delivery?

238.What is a deployment pipeline?

239.How do you automate incident response?

240.What is ChatOps?

241.How do you monitor database health?

242.What is connection pool exhaustion?

243.How do you troubleshoot slow queries?

244.What is a query execution plan?

245.How do you handle database failover?

246.What is replication lag?

247.What is the split-brain problem in databases?

248.How do you perform database backup and restore?

249.What is point-in-time recovery (PITR)?

250.How do you test database backups?

251.What cloud platforms have you worked with?

252.How do you ensure reliability in cloud?

253.What is a multi-AZ deployment?

254.What is a multi-region deployment?

255.How do you handle cloud provider outages?

256.What is AWS Auto Scaling?

257.What are ELB health checks?

258.How do you monitor cloud resources?

259.What is CloudWatch vs Prometheus for cloud monitoring?

260.What is a distributed system?

261.What are challenges in distributed systems?

262.What is a network partition?

263.What is a split-brain scenario?

264.How do you handle eventual consistency?

265.What is distributed consensus (Raft, Paxos)?

266.What is a service mesh (Istio, Linkerd)?

267.What is the sidecar pattern?

268.What is an API gateway?

269.What is backpressure in distributed systems?

270.How do you handle cascading failures?

271.What is the bulkhead pattern in microservices?

272.What is timeout propagation?

273.Why is distributed tracing important?

274.How do you debug distributed systems?

275.What is observability in microservices?

276.What is security in the SRE role?

277.How do you implement the principle of least privilege?

278.What is secrets management?

279.What is certificate rotation?

280.How do you handle security incidents?

281.What is DDoS mitigation?

282.What is rate limiting for security?

283.How do you monitor for security threats?

284.What is audit logging?

285.What is compliance monitoring?

286.How do you handle vulnerability patching?

287.What is patch management strategy?

288.What is security scanning in CI/CD?

289.How do you ensure data encryption?

290.What is the principle of defense in depth?

291.Website is slow - how do you troubleshoot?

292.Database is down - what are your steps?

293.CPU is at 100% - how do you investigate?

294.Memory is exhausted - what do you do?

295.Disk is full - how do you handle it?

296.Pods are crash looping - how do you debug?

297.Load balancer shows unhealthy targets - what do you check?

298.Latency suddenly increased - how do you investigate?

299.Error rate spiked - what are your steps?

300.Traffic dropped to zero - how do you troubleshoot?

301.Deployment failed - how do you roll back?

302.Database replication lag is high - what do you do?

303.You're paged for high memory usage - what do you do?

304.SSL certificate expired - how do you handle it?

305.DNS resolution failing - how do you troubleshoot?

306.Application throwing 500 errors - how do you debug?

307.Kafka consumer lag increasing - what do you check?

308.Redis cache hit rate dropped - what do you investigate?

309.Network latency between services increased - how do you debug?

310.Cloud cost suddenly increased - how do you investigate?

311.How would you design monitoring for a new service?

312.How would you implement zero-downtime deployment?

313.How would you handle a complete datacenter failure?

314.How would you migrate a service with zero downtime?

315.How would you handle a Black Friday traffic spike?

316.How would you implement disaster recovery?

317.How would you reduce deployment time from 30 minutes to 5 minutes?

318.How would you improve MTTR for your team?

319.How would you handle a runaway process consuming resources?

320.How would you debug intermittent timeout issues?

321.Service is healthy but receiving no traffic - what do you check?

322.Memory leak - how do you identify the cause?

323.Database connections maxed out - what are your steps?

324.How do you handle a third-party API outage?

325.How would you design SLOs for a payment service?

326.How would you reduce toil in your current role?

327.How would you handle competing incidents simultaneously?

328.How would you onboard a new service to your monitoring stack?

329.How would you implement chaos engineering?

330.How would you improve observability of a legacy system?

331.Tell me about your most challenging production incident.

332.How do you prioritize during an outage?

333.Describe a time you prevented a major incident.

334.How do you handle stress during critical incidents?

335.How do you balance reliability and feature velocity?

400.Post-deploy performance regression - how do you identify the cause?