Comprehensive guide covering SQL & Database Fundamentals, Python for DE, ETL/ELT Pipelines, Data Warehousing, Big Data Technologies (Spark, Kafka, Airflow), Cloud Platforms, and System Design.
Total Questions: 300
Difficulty Levels: Beginner, Intermediate, Advanced
Problem
Level
2.Explain INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, and CROSS JOIN.
Easy
3.What are window functions, and what are some use cases?
Medium
4.How do you optimize a slow-running SQL query?
Hard
5.What is the difference between DELETE, TRUNCATE, and DROP?
Easy
6.Explain clustered vs non-clustered indexes.
Medium
7.What is a query execution plan and how do you read it?
Hard
8.What are CTEs and how do they differ from subqueries?
Medium
9.Explain partitioning in databases.
Medium
10.What is database sharding?
Hard
11.What is the difference between OLTP and OLAP?
Easy
12.Explain ACID properties in databases.
Medium
13.What is a materialized view?
Medium
14.How do you handle database deadlocks?
Hard
15.What is database replication and what types exist?
Medium
16.Explain the CAP theorem.
Hard
17.What is the difference between normalization and denormalization?
Easy
18.What are the different normal forms (1NF, 2NF, 3NF, BCNF)?
Hard
19.How do you design a database schema?
Medium
20.What is the difference between a star schema and a snowflake schema?
Medium
21.Explain fact tables and dimension tables.
Easy
22.What are slowly changing dimension (SCD) Types 1, 2, and 3?
Medium
23.How do you handle NULL values in SQL?
Easy
24.What is the difference between UNION and UNION ALL?
Easy
25.Explain database indexing strategies.
Medium
26.What are composite keys and surrogate keys?
Medium
27.How do you perform incremental data loads?
Medium
28.What is a transaction, and what are transaction isolation levels?
Hard
29.Explain optimistic vs pessimistic locking.
Hard
30.What is database connection pooling?
Medium
31.What Python libraries do you use for data engineering?
Easy
32.Explain the Global Interpreter Lock (GIL).
Hard
33.What is the difference between multiprocessing and multithreading?
Medium
34.How do you handle large files in Python?
Medium
35.What are generators and why are they useful?
Medium
36.Explain decorators in Python.
Hard
37.What is the difference between list comprehension and generator expression?
Easy
38.How do you implement error handling in Python?
Easy
39.What is a context manager and the 'with' statement?
Easy
40.Explain *args and **kwargs.
Medium
41.How do you optimize Python code for performance?
Hard
42.What is the difference between shallow copy and deep copy?
Medium
43.Explain Python's garbage collection.
Hard
44.What are async/await in Python?
Hard
45.How do you work with APIs in Python?
Easy
46.What is the difference between @staticmethod and @classmethod?
Medium
47.How do you implement logging in Python?
Easy
48.What is pickle in Python?
Medium
49.Explain lambda functions and map/filter/reduce.
Easy
50.How do you handle memory management in Python?
Hard
51.What are Python dataclasses?
Medium
52.How do you implement unit testing in Python?
Medium
53.What is a virtual environment and why use it?
Easy
54.Explain Python's type hints.
Medium
55.How do you profile Python code for bottlenecks?
Hard
56.What are ETL and ELT? What's the difference?
Easy
57.Explain the ETL process you follow.
Easy
58.What ETL tools have you worked with?
Easy
59.How do you handle incremental loads vs full loads?
Medium
60.What is data lineage?
Medium
61.How do you implement error handling in ETL pipelines?
Medium
62.What is idempotency and why is it important in data pipelines?
Hard
63.How do you handle slowly changing dimensions (SCD)?
Medium
64.What is CDC (Change Data Capture)?
Hard
65.Explain different CDC methods.
Hard
66.How do you ensure data quality in ETL processes?
Medium
67.What is data validation and where do you implement it?
Easy
68.How do you handle duplicate records in ETL?
Easy
69.What is data transformation? Give examples.
Easy
70.How do you optimize ETL performance?
Hard
71.What is parallel processing in ETL?
Medium
72.How do you handle schema changes in source systems?
Hard
73.What is data reconciliation?
Medium
74.How do you implement retry logic in data pipelines?
Medium
75.What is backfilling in data pipelines?
Medium
76.How do you monitor ETL jobs?
Easy
77.What is an SLA in data pipelines?
Easy
78.How do you handle timezone conversions in ETL?
Medium
79.What is metadata management in ETL?
Medium
80.How do you version control your ETL code?
Easy
81.What is a data warehouse?
Easy
82.Explain the difference between data warehouse, data lake, and data mart.
Easy
83.What is dimensional modeling?
Medium
84.What is the difference between the Kimball and Inmon methodologies?
Hard
85.Explain fact and dimension tables in detail.
Easy
86.What are factless fact tables?
Hard
87.What is a conformed dimension?
Hard
88.How do you handle late-arriving facts?
Hard
89.What is a bridge table?
Hard
90.Explain grain in dimensional modeling.
Medium
91.What is an aggregate table and when to use it?
Medium
92.How do you implement SCD Type 2?
Medium
93.What is a junk dimension?
Hard
94.What is a role-playing dimension?
Medium
95.How do you optimize data warehouse queries?
Hard
96.What is partition pruning?
Medium
97.Explain columnar storage vs row storage.
Medium
98.What is data vault modeling?
Hard
99.How do you handle historical data in a data warehouse?
Medium
100.What is the medallion architecture (Bronze, Silver, Gold)?
Easy
101.What is Apache Spark and its architecture?
Medium
102.Explain RDD, DataFrame, and Dataset in Spark.
Medium
103.What are Spark transformations vs actions?
Easy
104.What is lazy evaluation in Spark?
Medium
105.How do you optimize Spark jobs?
Hard
106.What is data skewness and how do you handle it?
Hard
107.Explain Spark partitioning and bucketing.
Hard
108.What is Apache Hadoop and HDFS?
Easy
109.Explain the MapReduce paradigm.
Medium
110.What is YARN in the Hadoop ecosystem?
Medium
111.What is Apache Hive and HiveQL?
Easy
112.What is the difference between Hive and traditional databases?
Medium
113.What is Apache Kafka and its use cases?
Easy
114.Explain Kafka producers, consumers, and brokers.
Easy
115.What are Kafka topics and partitions?
Medium
116.How does Kafka ensure message delivery?
Hard
117.What is Apache Airflow?
Easy
118.How do you design DAGs in Airflow?
Medium
119.What are Airflow operators, sensors, and hooks?
Medium
120.What is Apache Flink?
Hard
121.What is the difference between batch and stream processing?
Easy
122.What is Apache Beam?
Hard
123.Explain data lakehouse architecture.
Medium
124.What are Delta Lake, Apache Iceberg, and Apache Hudi?
Hard
125.How do you implement ACID transactions in data lakes?
Hard
126.What is the Parquet file format and what are its advantages?
Medium
127.What is the ORC file format?
Medium
128.What is Avro and when to use it?
Medium
129.How do you choose file formats for different scenarios?
Hard
130.What is a data partitioning strategy in big data?
Medium
131.What cloud platforms have you worked with (AWS/Azure/GCP)?
Easy
132.What is AWS S3 and its storage classes?
Easy
133.What is AWS Redshift?
Medium
134.Explain AWS Glue and its components.
Medium
135.What is AWS Lambda and serverless architecture?
Easy
136.What is AWS EMR?
Medium
137.What is AWS Kinesis?
Medium
138.What is Azure Data Factory?
Medium
139.What is Azure Databricks?
Medium
140.What is Azure Synapse Analytics?
Hard
141.What is Azure Data Lake Storage?
Medium
142.What is Google BigQuery?
Medium
143.What is Google Dataflow?
Hard
144.What is Google Cloud Storage?
Easy
145.What is Google Pub/Sub?
Medium
146.How do you implement data security in the cloud?
Hard
147.What is IAM and how do you manage permissions?
Medium
148.What is VPC and network security?
Hard
149.How do you optimize cloud costs?
Medium
150.What is a cloud data migration strategy?
Hard
151.What are managed vs self-hosted services?
Easy
152.How do you implement disaster recovery in the cloud?
Hard
153.What is multi-cloud vs hybrid cloud architecture?
Medium
154.How do you monitor cloud resources?
Easy
155.What is Infrastructure as Code (IaC)?
Medium
156.What is the difference between SQL and NoSQL databases?
Easy
157.When would you choose NoSQL over SQL?
Medium
158.What are different types of NoSQL databases?
Easy
159.What is MongoDB and document-based databases?
Easy
160.What is Cassandra and column-family stores?
Medium
161.What is Redis and key-value stores?
Easy
162.What is Neo4j and graph databases?
Medium
163.Explain eventual consistency.
Hard
164.What are BASE properties in NoSQL?
Hard
165.How do you model data in NoSQL?
Medium
166.What is denormalization in NoSQL?
Medium
167.How do you handle transactions in NoSQL?
Hard
168.What is sharding in NoSQL databases?
Medium
169.How do you choose between different NoSQL databases?
Hard
170.What are the trade-offs of using NoSQL?
Medium
171.How do you ensure data quality?
Easy
172.What are data quality dimensions?
Medium
173.How do you implement data validation?
Easy
174.What is data profiling?
Medium
175.How do you test data pipelines?
Medium
176.What is unit testing vs integration testing in DE?
Medium
177.How do you implement data reconciliation?
Hard
178.What are data quality metrics you track?
Medium
179.How do you handle data quality issues?
Medium
180.What is schema validation?
Easy
181.How do you implement data monitoring?
Medium
182.What is data observability?
Hard
183.How do you set up alerts for data issues?
Easy
184.What is regression testing for data pipelines?
Medium
185.How do you implement data quality frameworks?
Hard
186.What is CI/CD for data pipelines?
Easy
187.What version control systems have you used?
Easy
188.How do you implement Git workflow for data projects?
Medium
189.What is Docker and containerization?
Easy
190.What is Kubernetes and container orchestration?
Hard
191.How do you implement automated testing for pipelines?
Medium
192.What is infrastructure as code (Terraform, CloudFormation)?
Hard
193.How do you manage environment configurations?
Medium
194.What is blue-green deployment?
Hard
195.How do you implement logging and monitoring?
Medium
196.What are Jenkins, GitHub Actions, and GitLab CI?
Easy
197.How do you handle secrets management?
Hard
198.What is a deployment strategy for data pipelines?
Medium
199.How do you implement rollback mechanisms?
Hard
200.What is observability in data engineering?
Medium
201.How do you design a data model?
Medium
202.What are data modeling best practices?
Easy
203.What is an entity-relationship diagram (ERD)?
Easy
204.How do you handle many-to-many relationships?
Medium
205.What is data modeling for analytical workloads?
Medium
206.How do you design for scalability?
Hard
207.What is Lambda architecture?
Hard
208.What is Kappa architecture?
Hard
209.How do you design real-time data pipelines?
Medium
210.What is a microservices architecture for data?
Hard
211.How do you implement data governance?
Hard
212.What is master data management?
Hard
213.How do you handle data privacy (GDPR, CCPA)?
Hard
214.What is a data catalog, and what is metadata management?
Medium
215.How do you design for high availability?
Hard
216.How do you optimize database queries?
Easy
217.What are query optimization techniques?
Medium
218.How do you handle large-scale data processing?
Medium
219.What is data compression and when to use it?
Medium
220.How do you optimize Spark jobs?
Hard
221.What is partition pruning and predicate pushdown?
Hard
222.How do you optimize data pipeline performance?
Medium
223.What is a caching strategy in data engineering?
Medium
224.How do you handle memory optimization?
Hard
225.What is a broadcast join in Spark?
Hard
226.How do you optimize storage costs?
Medium
227.What is a data tiering strategy?
Hard
228.How do you implement parallel processing?
Medium
229.What is batch size optimization?
Medium
230.How do you profile and debug performance issues?
Hard
231.What is stream processing?
Easy
232.What is the difference between batch and streaming?
Easy
233.How does Kafka streaming work?
Medium
234.What is event-driven architecture?
Medium
235.How do you handle late-arriving data?
Hard
236.What is windowing in stream processing?
Medium
237.What is exactly-once processing?
Hard
238.How do you implement real-time analytics?
Hard
239.What is stateful vs stateless processing?
Hard
240.How do you handle backpressure?
Hard
241.What is a stream-table join?
Hard
242.How do you implement event sourcing?
Hard
243.What is the CQRS pattern?
Hard
244.How do you handle streaming data quality?
Medium
245.What is real-time data pipeline architecture?
Medium
246.Design a data pipeline for processing millions of events per day.
Hard
247.How would you migrate from on-premises to the cloud?
Hard
248.Design a real-time recommendation system data pipeline.
Hard
249.How would you handle a failed ETL job in production?
Medium
250.Design a data warehouse from scratch.
Medium
251.How would you optimize a slow-running Spark job?
Hard
252.Design a solution for handling duplicate data.
Medium
253.How would you implement a data lake architecture?
Medium
254.Design a CDC pipeline from MySQL to a data warehouse.
Hard
255.How would you handle schema evolution in a data pipeline?
Hard
256.Design a solution for real-time fraud detection.
Hard
257.How would you implement a data retention policy?
Medium
258.Design a multi-region data replication strategy.
Hard
259.How would you handle data quality issues in production?
Medium
260.Design a solution for processing streaming and batch data together.
Hard
261.How would you implement disaster recovery?
Medium
262.Design a solution for handling PII data.
Hard
263.How would you optimize cloud costs for data infrastructure?
Medium
264.Design a data pipeline for IoT sensor data.
Hard
265.How would you implement data versioning?
Medium
266.Design a solution for A/B testing analytics.
Medium
267.How would you handle timezone issues in global data?
Medium
268.Design a data pipeline monitoring system.
Medium
269.How would you implement incremental processing?
Medium
270.Design a solution for handling late-arriving dimensions.
Hard
271.How would you build a customer 360 view?
Hard
272.Design a clickstream data processing pipeline.
Medium
273.How would you handle data pipeline failures gracefully?
Medium
274.Design a solution for cross-region data compliance.
Hard
275.How would you implement data lineage tracking?
Medium
276.Tell me about the most complex data pipeline you built.
Hard
277.How do you handle production incidents?
Medium
278.Describe a time when you optimized a data pipeline.
Medium
279.How do you stay updated with data engineering trends?
Easy
280.How do you handle conflicting requirements from stakeholders?
Medium
281.Describe your experience with agile/scrum methodology.
Easy
282.How do you prioritize multiple projects?
Easy
283.Tell me about a time you made a critical mistake.
Medium
284.How do you document your data pipelines?
Easy
285.How do you mentor junior engineers?
Medium
286.Describe your code review process.
Easy
287.How do you handle technical debt?
Medium
288.Tell me about a time you disagreed with a technical decision.
Medium
289.How do you ensure data pipeline reliability?
Medium
290.Describe your testing strategy.
Medium
291.How do you handle on-call responsibilities?
Easy
292.What's your approach to learning new technologies?
Easy
293.How do you communicate technical concepts to non-technical stakeholders?
Easy
294.Describe a time when you improved data quality.
Medium
295.How do you handle tight deadlines?
Easy
296.What's your experience with cross-functional collaboration?