Data Scientist Interview Questions

Comprehensive guide covering Statistics, Machine Learning, Deep Learning, NLP, Computer Vision, Model Deployment, and Business Scenarios.

Total Questions: 450
Difficulty Levels: Beginner, Intermediate, Advanced

1.What is the difference between population and sample?

2.Explain the Central Limit Theorem and its significance.

3.What is the difference between Type I and Type II errors?

4.What is a p-value and how do you interpret it?

5.Explain confidence intervals.

6.What is the difference between a confidence interval and a prediction interval?

7.What is statistical significance and practical significance?

8.Explain the hypothesis testing process.

9.What is the difference between one-tailed and two-tailed tests?

10.What is the power of a statistical test?

11.Explain Bayes' Theorem with an example.

12.What is the difference between frequentist and Bayesian statistics?

13.What is maximum likelihood estimation (MLE)?

14.What is the difference between parametric and non-parametric tests?

15.Explain the t-test, chi-square test, and ANOVA.

16.What is correlation vs causation?

17.What is the difference between Pearson and Spearman correlation?

18.What is covariance and how does it differ from correlation?

19.Explain probability distributions (normal, binomial, Poisson, exponential).

20.What is the law of large numbers?

21.What is sampling, and what are the different sampling methods?

22.What is stratified sampling vs random sampling?

23.What is the difference between mean, median, and mode?

24.Explain standard deviation and variance.

25.What is a z-score, and what is standardization?

26.What are skewness and kurtosis?

27.What is the 68-95-99.7 rule?

28.What is conditional probability?

29.What is the birthday paradox?

30.What is Monte Carlo simulation?

31.What is bootstrapping?

32.What is a permutation test?

33.What is A/B testing and how do you design one?

34.How do you determine sample size for experiments?

35.What is the multiple hypothesis testing problem?

36.What is the Bonferroni correction?

37.What is the false discovery rate (FDR)?

38.What is survival analysis?

39.What are the basics of time series analysis?

40.What is stationary vs non-stationary time series?

41.What is the difference between supervised and unsupervised learning?

42.Explain semi-supervised and reinforcement learning.

43.What is the bias-variance tradeoff?

44.What is overfitting and underfitting?

45.How do you detect and prevent overfitting?

46.What is regularization and why is it important?

47.Explain L1 (Lasso) and L2 (Ridge) regularization.

48.What is Elastic Net?

49.What is cross-validation, and what are its different types?

50.What is k-fold cross-validation?

51.What is stratified cross-validation?

52.What is a train-test-validation split?

53.What is the curse of dimensionality?

54.What is feature engineering?

55.What is feature selection vs feature extraction?

56.What is dimensionality reduction?

57.What is the difference between PCA and t-SNE?

58.What is PCA and how does it work?

59.What are eigenvectors and eigenvalues in PCA?

60.What is LDA (Linear Discriminant Analysis)?

61.What is the difference between PCA and LDA?

62.What is batch learning vs online learning?

63.What is instance-based vs model-based learning?

64.What is ensemble learning?

65.What is bagging vs boosting?

66.What is stacking in ensemble methods?

67.What is the difference between classification and regression?

68.What is multi-class vs multi-label classification?

69.What is an imbalanced dataset and how do you handle it?

70.What is SMOTE?

71.What is cost-sensitive learning?

72.What is anomaly detection?

73.What is one-class classification?

74.What is active learning?

75.What is transfer learning?

76.What is data augmentation?

77.What is the no free lunch theorem?

78.What is inductive bias?

79.What is Occam's razor in ML?

80.What is model interpretability vs explainability?

81.Explain Linear Regression in detail.

82.What are the assumptions of Linear Regression?

83.How do you interpret regression coefficients?

84.What is multicollinearity and how do you detect it?

85.What is VIF (Variance Inflation Factor)?

86.What is heteroscedasticity?

87.What is R-squared and adjusted R-squared?

88.What is the difference between R-squared and RMSE?

89.What is Logistic Regression?

90.What is the sigmoid function?

91.What is the difference between Linear and Logistic Regression?

92.How do you interpret logistic regression coefficients?

93.What is an odds ratio?

94.What is the softmax function for multi-class classification?

95.What is a Decision Tree and how does it work?

96.What is entropy and information gain?

97.What is Gini impurity?

98.What is the difference between Gini and Entropy?

99.What is pruning in decision trees?

100.What are the advantages and disadvantages of Decision Trees?

101.What is Random Forest?

102.How does Random Forest reduce overfitting?

103.What is out-of-bag (OOB) error?

104.What is Gradient Boosting?

105.What is XGBoost and its advantages?

106.What is LightGBM?

107.What is CatBoost?

108.What is the difference between Random Forest and Gradient Boosting?

109.What is AdaBoost?

110.What is a Support Vector Machine (SVM)?

111.What is the kernel trick in SVM?

112.What are different kernel functions (linear, RBF, polynomial)?

113.What is the margin in SVM?

114.What is the C parameter in SVM?

115.What is K-Nearest Neighbors (KNN)?

116.How do you choose K in KNN?

117.What distance metrics are used in KNN (Euclidean, Manhattan, Minkowski)?

118.What are the advantages and disadvantages of KNN?

119.What is the Naive Bayes classifier?

120.What is the 'naive' assumption in Naive Bayes?

121.What is clustering?

122.What is K-Means clustering?

123.How do you choose the number of clusters (K)?

124.What is the elbow method?

125.What is the silhouette score?

126.What is Hierarchical clustering?

127.What is a dendrogram?

128.What is DBSCAN?

129.What is the difference between K-Means and DBSCAN?

130.What is a Gaussian Mixture Model (GMM)?

131.What is the difference between K-Means and GMM?

132.What is association rule learning?

133.What is the Apriori algorithm?

134.What are support, confidence, and lift in association rules?

135.What is collaborative filtering?

136.What is content-based filtering?

137.What is matrix factorization?

138.What are the main approaches to recommendation systems?

139.What is topic modeling?

140.What is Latent Dirichlet Allocation (LDA)?

141.What is a neural network?

142.What is a perceptron?

143.What is an activation function?

144.Explain different activation functions (ReLU, Sigmoid, Tanh, Leaky ReLU).

145.Why is ReLU preferred over Sigmoid?

146.What is the vanishing gradient problem in deep networks?

147.What is backpropagation?

148.What is gradient descent?

149.What is stochastic gradient descent (SGD)?

150.What is mini-batch gradient descent?

151.What is momentum in gradient descent?

152.What is the Adam optimizer?

153.What is the learning rate, and what is learning rate scheduling?

154.What is the difference between epoch, batch, and iteration?

155.What is weight initialization and why is it important?

156.What is Xavier/He initialization?

157.What is a Convolutional Neural Network (CNN)?

158.What is the convolution operation?

159.What is pooling (max pooling, average pooling)?

160.What are stride and padding?

161.What is a filter/kernel in CNN?

162.What are popular CNN architectures (VGG, ResNet, Inception)?

163.What is transfer learning in CNNs?

164.What is a Recurrent Neural Network (RNN)?

165.What is the vanishing gradient problem in RNN?

166.What is LSTM (Long Short-Term Memory)?

167.What is GRU (Gated Recurrent Unit)?

168.What is the difference between LSTM and GRU?

169.What is a bidirectional RNN?

170.What is a sequence-to-sequence model?

171.What is the attention mechanism?

172.What is the Transformer architecture?

173.What is self-attention?

174.What is BERT?

175.What is GPT?

176.What is the difference between BERT and GPT?

177.What is fine-tuning in NLP?

178.What are word embeddings (Word2Vec, GloVe, FastText)?

179.What is an autoencoder?

180.What is a Generative Adversarial Network (GAN)?

181.What is accuracy and when is it not a good metric?

182.What are precision and recall?

183.What is the F1-score?

184.What is the difference between micro and macro averaging?

185.What is a confusion matrix?

186.What is the ROC curve?

187.What is AUC-ROC?

188.What is the PR (Precision-Recall) curve?

189.When should you use the ROC curve vs the PR curve?

190.What is mean absolute error (MAE)?

191.What is mean squared error (MSE)?

192.What is root mean squared error (RMSE)?

193.What is mean absolute percentage error (MAPE)?

194.What is log loss?

195.What is Cohen's Kappa?

196.What is Matthews Correlation Coefficient (MCC)?

197.What are specificity and sensitivity?

198.What are the true positive rate and false positive rate?

199.What is balanced accuracy?

200.What is top-k accuracy?

201.What is perplexity in NLP?

202.What is the BLEU score?

203.What is mean reciprocal rank (MRR)?

204.What is NDCG (Normalized Discounted Cumulative Gain)?

205.How do you choose the right evaluation metric?

206.What is tokenization?

207.What is stemming vs lemmatization?

208.What is bag of words (BoW)?

209.What is TF-IDF?

210.What is an n-gram?

211.What is part-of-speech (POS) tagging?

212.What is named entity recognition (NER)?

213.What is sentiment analysis?

214.What is topic modeling?

215.What is word embedding?

216.What is Word2Vec (Skip-gram and CBOW)?

217.What is the difference between Word2Vec and GloVe?

218.What is contextual embedding?

219.What is BERT and how does it work?

220.What is text classification?

221.What is sequence labeling?

222.What is language modeling?

223.What is machine translation?

224.What is text summarization (extractive vs abstractive)?

225.What is a question answering system?

226.What is information extraction?

227.What is coreference resolution?

228.What is dependency parsing?

229.What is the attention mechanism in NLP?

230.What are the challenges in NLP?

231.What is image classification?

232.What is object detection?

233.What is the difference between classification and detection?

234.What is semantic segmentation?

235.What is instance segmentation?

236.What is image augmentation?

237.What is YOLO (You Only Look Once)?

238.What is R-CNN and Fast R-CNN?

239.What is the U-Net architecture?

240.What is face recognition vs face detection?

241.What is optical character recognition (OCR)?

242.What is image captioning?

243.What is style transfer?

244.What are ResNet and residual connections?

245.What are common preprocessing techniques for images?

246.What is time series data?

247.What is stationarity and why is it important?

248.How do you test for stationarity (ADF test)?

249.What is differencing in time series?

250.What is autocorrelation (ACF)?

251.What is partial autocorrelation (PACF)?

252.What is the ARIMA model?

253.What are AR, MA, and ARMA?

254.What is seasonal ARIMA (SARIMA)?

255.What is exponential smoothing?

256.What is the Holt-Winters method?

257.What are trend and seasonality?

258.How do you decompose a time series?

259.What is Prophet by Facebook?

260.How is LSTM used for time series forecasting?

261.What is walk-forward validation?

262.What is the rolling window approach?

263.What are lag features in time series?

264.What is change point detection?

265.What is anomaly detection in time series?

266.What is feature engineering and why is it important?

267.What is feature scaling (normalization vs standardization)?

268.When should you use normalization vs standardization?

269.What is one-hot encoding?

270.What is label encoding?

271.What is target encoding?

272.What is binning/discretization?

273.What are polynomial features?

274.What are interaction features?

275.What is feature hashing?

276.How do you handle missing values?

277.How do you handle categorical variables?

278.How do you handle date-time features?

279.What are feature selection methods?

280.What is recursive feature elimination (RFE)?

281.What Python libraries do you use for data science?

282.What is NumPy and its advantages?

283.What is a pandas DataFrame?

284.What is the difference between loc and iloc?

285.How do you handle missing data in pandas?

286.What is scikit-learn?

287.What is TensorFlow vs PyTorch?

288.What is Keras?

289.What is the difference between fit, transform, and fit_transform?

290.What is a pipeline in scikit-learn?

291.How do you save and load models?

292.What is pickle in Python?

293.What is a lambda function?

294.What is a list comprehension?

295.What is a generator in Python?

296.How do you optimize Python code?

297.What is vectorization in NumPy?

298.What is broadcasting in NumPy?

299.How do you parallelize code in Python?

300.What is a virtual environment?

301.How do you use SQL in data science projects?

302.What is the difference between JOIN types?

303.What is a window function in SQL?

304.How do you calculate running totals?

305.What are GROUP BY and HAVING?

306.How do you find duplicates in SQL?

307.What is subquery vs CTE?

308.How do you optimize SQL queries?

309.What is the difference between WHERE and HAVING?

310.How do you handle NULL values in SQL?

311.What is UNION vs UNION ALL?

312.How do you pivot data in SQL?

313.What is the difference between RANK and DENSE_RANK?

314.How do you calculate percentiles in SQL?

315.How do you sample data in SQL?

316.What is A/B testing?

317.How do you design an A/B test?

318.What is statistical power?

319.How do you calculate sample size for A/B test?

320.What is p-hacking?

321.What is the multiple testing problem?

322.How do you handle the novelty effect?

323.What is selection bias?

324.What is the Hawthorne effect?

325.What is multivariate testing?

326.What is sequential testing?

327.What is Bayesian A/B testing?

328.How long should you run an A/B test?

329.What is the difference between statistical and practical significance?

330.How do you analyze A/B test results?

331.How do you deploy a machine learning model?

332.What is model serving?

333.What is the difference between batch and real-time prediction?

334.What is a REST API for ML models?

335.How are Flask and FastAPI used for deployment?

336.What is Docker and containerization?

337.What is model versioning?

338.What is model monitoring in production?

339.What is data drift?

340.What is concept drift?

341.How do you detect model degradation?

342.What is A/B testing for models?

343.What is shadow deployment?

344.What is canary deployment?

345.What is a model retraining strategy?

346.What is a feature store?

347.What is MLOps?

348.What is CI/CD for ML?

349.How do you ensure model reproducibility?

350.What is model explainability in production?

351.What is Apache Spark for data science?

352.What is PySpark?

353.How do you train models on big data?

354.What is distributed machine learning?

355.What is Dask for parallel computing?

356.What is GPU computing for deep learning?

357.What is the difference between CPU and GPU training?

358.What is distributed training in deep learning?

359.What is data parallelism vs model parallelism?

360.How do you handle large datasets that don't fit in memory?

361.How do you translate business problems into data science problems?

362.How do you communicate technical results to non-technical stakeholders?

363.What KPIs have you worked with?

364.How do you measure ROI of a data science project?

365.How do you prioritize data science projects?

366.What is the data science project lifecycle?

367.How do you handle stakeholder expectations?

368.How do you present model results?

369.What is storytelling with data?

370.How do you justify model decisions?

371.How do you handle ethical considerations in ML?

372.What is bias in machine learning?

373.What is fairness in ML?

374.How do you ensure model fairness?

375.What are privacy concerns in data science?

376.How would you build a recommendation system for e-commerce?

377.Design a fraud detection system.

378.How would you predict customer churn?

379.Design a credit scoring model.

380.How would you build a spam detection system?

381.Design a sentiment analysis system for social media.

382.How would you forecast sales for a retail company?

383.Design an image classification system.

384.How would you build a chatbot?

385.Design a predictive maintenance system.

386.How would you detect anomalies in network traffic?

387.Design a personalized marketing campaign model.

388.How would you build a price optimization model?

389.Design a demand forecasting system.

390.How would you build a customer segmentation model?

391.Your model has 95% accuracy but performs poorly in production. Why?

392.How would you handle imbalanced data in fraud detection?

393.Your model is overfitting. What would you do?

394.How would you improve model performance?

395.How do you handle missing data in a dataset with 40% nulls?

396.How would you detect fake news?

397.Design a movie recommendation system.

398.How would you predict employee attrition?

399.Design a medical diagnosis system.

400.How would you build a speech recognition system?

401.Design a real-time bidding system.

402.How would you optimize delivery routes?

403.Design a face recognition system.

404.How would you predict stock prices?

405.Design a customer lifetime value model.

406.How would you build a question-answering system?

407.Design a document classification system.

408.How would you detect plagiarism?

409.Design an object detection system for autonomous vehicles.

410.How would you build a music recommendation system?

411.Tell me about your most challenging data science project.

412.How do you approach a new data science problem?

413.Describe a time when your model failed in production.

414.How do you stay updated with the latest ML trends?

415.Tell me about a time you had to explain a complex model to stakeholders.

416.How do you handle disagreements with team members?

417.Describe your experience with cross-functional collaboration.

418.How do you manage multiple projects?

419.Tell me about a time you had to make a trade-off between accuracy and interpretability.

420.How do you handle ambiguous requirements?

421.Describe a time you found an unexpected insight.

422.How do you approach debugging ML models?

423.Tell me about a time you had to learn a new technique quickly.

424.How do you ensure reproducibility in your work?

425.Describe your code review process.

426.How do you handle tight deadlines?

427.Tell me about a time you optimized a slow model.

428.How do you handle negative feedback?

429.Describe a time you failed and what you learned.

430.How do you mentor junior data scientists?

431.What's your approach to experimentation?

432.How do you balance exploration vs exploitation?

433.Tell me about a time you had to make a decision with incomplete data.

434.How do you prioritize feature requests?

435.Describe your experience with agile methodology.

436.How do you handle model bias issues?

437.Tell me about a time you automated a manual process.

438.How do you ensure data quality?

439.Describe your approach to documentation.

440.Why do you want to be a Data Scientist?