IBM Data Engineer Interview Questions and Answers
Updated 25 Oct 2024
Q1. How do you handle data skewness in Spark?
Ans.
Data skewness in Spark can be handled with partitioning, bucketing, or salting techniques.
- Partitioning the data on a key column can distribute it evenly across the nodes.
- Bucketing groups the data into buckets on a key column, which can improve join performance.
- Salting adds a random prefix to the key column so that a hot key is spread evenly across partitions.
- Broadcast joins for small tables can also reduce skew by avoiding a shuffle of the skewed side.
- Enabling dynamic allocation can also help.
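A minimal salting sketch in PySpark, assuming a hypothetical fact table whose join key is dominated by one hot value; the table contents, column names, and salt count N are illustrative, not from the original answer:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical skewed fact table: every row carries the same hot key.
fact = spark.range(1_000_000).withColumn("key", F.lit("hot"))
dim = spark.createDataFrame([("hot", "metadata")], ["key", "info"])

N = 8  # number of salt buckets (a tuning choice)

# Salt the skewed side with a random integer 0..N-1.
fact_salted = fact.withColumn("salt", (F.rand() * N).cast("int"))

# Replicate the small side once per salt value so every salt matches.
dim_salted = dim.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

# Joining on (key, salt) spreads the hot key over N partitions.
joined = fact_salted.join(dim_salted, ["key", "salt"]).drop("salt")
```

Because the small side is replicated once per salt value, this trades a little extra data for evenly sized join tasks.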
Q2. How do you create a Kafka topic with a replication factor of 2?
Ans.
To create a Kafka topic with a replication factor of 2, use the command-line tool or the Kafka admin API.
- Use the command-line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.
- Alternatively, create the topic programmatically through a Kafka admin client.
- Ensure that the number of brokers in the cluster is greater than or equal to the replication factor.
- Consider setting the 'min.insync.replicas' configuration property to 2 so that each write is acknowledged by at least two replicas.
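A sketch of both routes, assuming a broker reachable at localhost:9092 and the third-party kafka-python package for the programmatic path; the topic name and partition count are made up:

```python
# CLI route (the kafka-topics.sh tool the answer mentions):
#   kafka-topics.sh --create --topic events --partitions 3 \
#       --replication-factor 2 --bootstrap-server localhost:9092

# Programmatic route, assuming the kafka-python client is installed.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="events", num_partitions=3, replication_factor=2)
])
# This fails unless the cluster has at least 2 brokers, because each
# partition needs its 2 replicas on distinct brokers.
```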
Q3. How do you read JSON data using Spark?
Ans.
To read JSON data using Spark, use the SparkSession.read.json() method.
- Create a SparkSession object.
- Call read.json() with the path to a JSON file or a directory containing JSON files.
- The resulting DataFrame can then be manipulated using Spark's DataFrame API.
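A minimal sketch; the paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json-demo").getOrCreate()

# The default reader expects JSON Lines: one JSON object per line.
df = spark.read.json("/data/events/")  # a file or a directory of files

# For a single large, pretty-printed JSON document, enable multiLine.
df2 = spark.read.option("multiLine", "true").json("/data/nested.json")

df.printSchema()  # the schema is inferred from the data
df.show(5)
```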
Q4. What is the difference between partitioning and bucketing?
Ans.
Partitioning divides data into smaller chunks based on a column value; bucketing divides data into a fixed number of equal-sized buckets based on a hash of a column.
- Partitioning organizes data for efficient querying and processing and is done on a column value such as date or region.
- Bucketing distributes data evenly by hashing a key column into a fixed number of buckets.
- Partitioning can improve query performance by letting the engine scan only the relevant partitions.
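A sketch of both write paths in PySpark, using a tiny made-up DataFrame with hypothetical country and user_id columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "IN"), (2, "US"), (3, "IN")], ["user_id", "country"])

# Partitioning: one sub-directory per distinct country value, so queries
# that filter on country can skip whole directories.
df.write.partitionBy("country").parquet("/warehouse/users_by_country")

# Bucketing: rows are hashed on user_id into a fixed number of buckets.
# Note that bucketBy requires saveAsTable (a metastore table), not a path.
(df.write
   .bucketBy(16, "user_id")
   .sortBy("user_id")
   .saveAsTable("users_bucketed"))
```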
Q5. What is the difference between cache and persistent storage?
Ans.
Cache is temporary storage used to speed up access to frequently accessed data; persistent storage retains data permanently, even after power loss.
- Cache is faster but smaller than persistent storage.
- Cache is volatile: its contents are lost when power is lost.
- Persistent storage is non-volatile: data is retained across power cycles.
- Examples of cache include CPU cache, browser cache, and CDN cache.
- Examples of persistent storage include hard disk drives and solid-state drives.
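In a Spark context the same distinction shows up as cache()/persist() versus writing files; a sketch, using a stand-in dataset and a placeholder output path:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)      # stand-in for a real dataset

# Cache: kept on the executors for fast re-use, lost when the app ends.
df.cache()                       # Spark SQL default is MEMORY_AND_DISK
df.count()                       # an action materializes the cache
df.unpersist()

# persist() can target local disk, but is still scoped to the application.
df.persist(StorageLevel.DISK_ONLY)

# Persistent storage proper: written files survive application restarts.
df.write.mode("overwrite").parquet("/tmp/events")   # placeholder path
```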
Q6. What is the difference between UNION and UNION ALL?
Ans.
UNION combines result sets and removes duplicates; UNION ALL combines all rows, including duplicates.
- UNION merges two result sets and removes duplicate rows.
- UNION ALL merges two result sets and keeps duplicates.
- UNION is slower than UNION ALL because removing duplicates takes extra work.
- Syntax: SELECT column1, column2 FROM table1 UNION [ALL] SELECT column1, column2 FROM table2
- Example: SELECT name FROM table1 UNION SELECT name FROM table2
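A runnable illustration via Spark SQL, with made-up sample rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE OR REPLACE TEMP VIEW t1 AS "
          "SELECT * FROM VALUES ('alice'), ('bob') AS t(name)")
spark.sql("CREATE OR REPLACE TEMP VIEW t2 AS "
          "SELECT * FROM VALUES ('bob'), ('carol') AS t(name)")

spark.sql("SELECT name FROM t1 UNION SELECT name FROM t2").show()
# alice, bob, carol        -- duplicates removed (extra work, hence slower)

spark.sql("SELECT name FROM t1 UNION ALL SELECT name FROM t2").show()
# alice, bob, bob, carol   -- every row kept
```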
Q7. Which components are used in graphs to remove duplicates?
Ans.
Components such as a HashSet or HashMap can be used within graphs to remove duplicates.
- Use a HashSet to store unique elements.
- Use a HashMap to store key-value pairs with unique keys.
- Iterate through the graph, adding elements to the HashSet or HashMap and discarding any that are already present.
Q8. What do you know about Forms and Templates and their use in workflows and web reports?
Ans.
Forms and Templates standardize data input and presentation in workflows and web reports.
- Forms collect data in a structured manner, often with predefined fields and formats.
- Templates are pre-designed layouts for presenting data in a consistent way.
- Together they streamline processes, ensure data consistency, and improve reporting accuracy.
- In workflow management, Forms can be used to gather input from users at different stages of a process.
Q9. 1) Project architecture; 2) complex jobs handled in the project; 3) types of lookup; 4) SCD-2 implementation in DataStage; 5) SQL analytical functions and scenario-based questions; 6) Unix SED/GREP commands.
Ans.
These questions cover project architecture, complex job handling, lookup types, SCD-2 implementation, SQL analytical functions, and Unix commands.
- Project architecture involves designing the overall structure of a data project.
- Complex job handling refers to managing intricate data-processing tasks within a project.
- Lookup types include exact match, range match, and fuzzy match.
- SCD-2 implementation in DataStage involves capturing historical changes in data.
- SQL analytical (window) functions compute values over groups of related rows.
- SED and GREP are Unix utilities for stream editing and pattern searching, respectively.
Q10. Why are there two keys available for Azure resources?
Ans.
Azure resources such as Azure Storage and Azure Event Hubs expose two access keys, a primary and a secondary.
- Both keys grant the same access; either one can authenticate requests to the resource.
- Having two keys allows zero-downtime rotation: applications switch to the secondary key while the primary is regenerated, then back again.
- Because each key grants full access, both must be stored and rotated securely.
Q11. How would you implement SCD Type 2 in Informatica?
Ans.
Implementing SCD Type 2 in Informatica involves Slowly Changing Dimension logic and mapping variables.
- Use Slowly Changing Dimension (SCD) transformations in Informatica to track historical changes in data.
- Create mapping variables to keep track of effective start and end dates for each record.
- Use Update Strategy transformations to handle inserts, updates, and deletes in the target table.
- Implement Type 2 by inserting new records with the updated data and marking the superseded records as no longer current.
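Informatica's implementation lives in its GUI, but the underlying Type 2 pattern can be sketched in code. Here it is in PySpark, purely as an illustration; the table contents, column names, and the "9999-12-31" open-end sentinel are all assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
OPEN = "9999-12-31"                      # sentinel for "still current"

# Tiny made-up dimension table and incoming batch (dates kept as strings).
dim = spark.createDataFrame(
    [(1, "Alice", "2023-01-01", OPEN, True),
     (2, "Bob",   "2023-01-01", OPEN, True)],
    ["customer_id", "name", "start_date", "end_date", "is_current"])
incoming = spark.createDataFrame(
    [(1, "Alicia"), (2, "Bob")], ["customer_id", "name"])

current = dim.filter(F.col("is_current"))
history = dim.filter(~F.col("is_current"))

# Rows whose tracked attribute changed (customer 1 here).
changed = (incoming.alias("i")
           .join(current.alias("c"), "customer_id")
           .filter(F.col("i.name") != F.col("c.name")))
changed_keys = changed.select("customer_id")

# 1) Expire the superseded versions.
expired = (current.join(changed_keys, "customer_id")
           .withColumn("end_date", F.current_date().cast("string"))
           .withColumn("is_current", F.lit(False)))

# 2) Open new versions for the changed keys.
opened = (changed.select("customer_id", F.col("i.name").alias("name"))
          .withColumn("start_date", F.current_date().cast("string"))
          .withColumn("end_date", F.lit(OPEN))
          .withColumn("is_current", F.lit(True)))

# 3) Untouched current rows and prior history pass through unchanged.
untouched = current.join(changed_keys, "customer_id", "left_anti")

dim_v2 = history.unionByName(untouched).unionByName(expired).unionByName(opened)
dim_v2.show()
```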
Q12. What do you know about CS Workflows?
Ans.
CS Workflows refer to the processes and steps involved in managing and analyzing data in a computer-science context.
- Workflows define data sources and the transformations applied to them.
- They often include data cleaning, processing, and analysis steps.
- Tools like Apache Airflow and Luigi are commonly used for managing workflows.
- Workflows help automate data pipelines and ensure data quality and consistency.
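Since the answer names Airflow, here is a minimal DAG sketch; the DAG id, task names, schedule, and task bodies are invented placeholders, assuming Airflow 2.x:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")      # placeholder

def transform():
    print("clean and reshape the data")     # placeholder

def load():
    print("write to the warehouse")         # placeholder

with DAG(dag_id="daily_etl",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",        # Airflow 2.x style
         catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3                          # dependency order
```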
Q13. What is the difference between coalesce and repartition in PySpark?
Ans.
coalesce and repartition are both used to control the number of partitions of a PySpark DataFrame.
- coalesce reduces the number of partitions by merging existing ones; repartition shuffles the data to create new partitions.
- coalesce is a narrow transformation and does not trigger a full shuffle; repartition is a wide transformation and does.
- coalesce is useful when reducing the number of partitions; repartition is useful when increasing the number of partitions or rebalancing data.
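A quick way to see the difference; the partition counts are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)                # example data
print(df.rdd.getNumPartitions())           # whatever the default is

# repartition: full shuffle; can increase or decrease the count and
# rebalances rows evenly across the new partitions.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())          # 8

# coalesce: narrow transformation; merges existing partitions without a
# full shuffle, so it can only decrease the count.
df2 = df8.coalesce(2)
print(df2.rdd.getNumPartitions())          # 2
```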
Q14. Tell me about your overall IT experience.
Ans.
I have over 5 years of experience in IT, with a focus on data engineering and database management.
- Designed and implemented data pipelines to extract, transform, and load data from various sources.
- Managed and optimized databases for performance and scalability.
- Collaborated with cross-functional teams to develop data-driven solutions.
- Worked with tools such as SQL, Python, Hadoop, and Spark.
- Participated in data modeling and data-architecture design.
Q15. How many graphs have you built so far?
Ans.
I have built 10 graphs so far, including network graphs, bar graphs, and pie charts, using tools like matplotlib and seaborn.
Q16. What is the difference between DELETE, TRUNCATE, and DROP?
Ans.
DELETE, TRUNCATE, and DROP differ in scope.
- DELETE removes specific rows from a table and can be filtered with a WHERE clause.
- TRUNCATE removes all rows from a table but keeps the table itself.
- DROP removes the entire table from the database.
Q17. What is the difference between row_number and dense_rank?
Ans.
row_number assigns unique sequential integers to rows, while dense_rank assigns ranks with no gaps between them.
- row_number gives each row in the result set a unique sequential integer, breaking ties arbitrarily.
- dense_rank gives tied rows the same rank and leaves no gaps in the rank sequence.
- Example over the same ordering: row_number yields 1, 2, 3, 4; dense_rank yields 1, 2, 2, 3.
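A small PySpark demonstration reproducing the 1, 2, 3, 4 versus 1, 2, 2, 3 example; the sample scores are invented:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

scores = spark.createDataFrame(
    [("a", 100), ("b", 90), ("c", 90), ("d", 80)], ["name", "score"])

w = Window.orderBy(F.desc("score"))

(scores
 .withColumn("row_number", F.row_number().over(w))
 .withColumn("dense_rank", F.dense_rank().over(w))
 .show())
# row_number: 1, 2, 3, 4   (the tie on 90 is broken arbitrarily)
# dense_rank: 1, 2, 2, 3   (tied rows share a rank; no gap afterwards)
```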
Q18. How do you deal with escalations?
Ans.
I address escalations by identifying the root cause, communicating effectively, collaborating with stakeholders, and finding a resolution.
- Identify the root cause of the escalation to understand the issue thoroughly.
- Communicate with all parties involved to ensure clarity and transparency.
- Collaborate with stakeholders to gather the necessary information and work toward a resolution.
- Find a resolution that addresses the escalation and prevents similar issues in the future.
Q19. What is a broadcast variable?
Ans.
A broadcast variable is a read-only variable cached on each machine in the cluster instead of being shipped with every task.
- Broadcast variables efficiently distribute large read-only datasets to the worker nodes of a Spark application.
- They are cached in memory on each machine and can be reused across multiple stages of a job.
- They reduce the amount of data transferred over the network during task execution.
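A minimal sketch with an invented lookup table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Small lookup table, shipped once per executor rather than per task.
country_names = {"IN": "India", "US": "United States"}
bc = sc.broadcast(country_names)

codes = sc.parallelize(["IN", "US", "IN"])
print(codes.map(lambda c: bc.value.get(c, "unknown")).collect())
# ['India', 'United States', 'India']
```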
Q20. What are the advantages and disadvantages of Hive?
Ans.
Hive is a data-warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.
- Advantages: a SQL-like query language for querying large datasets; optimized for OLAP workloads; supports partitioning and bucketing for efficient queries.
- Disadvantages: slower than traditional databases for OLTP workloads; limited support for complex queries and transactions.
- Example: Hive can be used to analyze large volumes of log data.
Q21. What optimisations have you done in your code?
Ans.
Optimisation in code involves improving efficiency and performance.
- Use efficient data structures and algorithms.
- Minimise unnecessary computations.
- Reduce memory usage.
- Use parallel processing for faster execution.
- Profile the code to identify bottlenecks.
Q22. What Spark optimization techniques do you know?
Ans.
Spark optimization techniques improve the performance and efficiency of Spark jobs.
- Partition data to optimize parallelism.
- Cache data in memory to avoid recomputation.
- Use broadcast variables for small lookup tables.
- Avoid shuffles by preferring narrow transformations.
- Tune memory and executor settings for optimal performance.
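A sketch of the broadcast-join and caching points, with hypothetical orders and countries tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical large fact table and small lookup table.
orders = spark.createDataFrame(
    [(1, "IN"), (2, "US"), (3, "IN")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")],
    ["country_code", "country_name"])

# Broadcasting the small side avoids shuffling the large side.
fast = orders.join(broadcast(countries), "country_code")

# Caching avoids recomputing a reused intermediate result.
fast.cache()

fast.explain()   # plan should show BroadcastHashJoin, not SortMergeJoin
```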
Q23. Explain OOP concepts in Java.
Ans.
OOP concepts in Java are the object-oriented programming principles: inheritance, encapsulation, polymorphism, and abstraction.
- Inheritance: a class inherits properties and behavior from another class.
- Encapsulation: data and the methods that operate on it are bundled into a single unit.
- Polymorphism: a method does different things depending on the object it acts upon.
- Abstraction: implementation details are hidden, exposing only the necessary features.
Q24. What is the difference between the two roles (Data Engineer vs. Data Scientist)?
Ans.
The two roles differ in focus and tooling.
- A Data Engineer designs and maintains data pipelines and the infrastructure for data storage and processing, typically building and optimizing pipelines with tools like Apache Spark or Hadoop.
- A Data Scientist analyzes and interprets complex data to provide insights and support data-driven decisions, using statistical and machine-learning techniques.
Q25. What are your day-to-day tasks?
Ans.
Day-to-day tasks involve data collection, processing, analysis, and maintenance to ensure data quality and availability.
- Collecting and storing data from various sources.
- Cleaning and preprocessing data for analysis.
- Developing and maintaining data pipelines.
- Analyzing data to extract insights and trends.
- Collaborating with data scientists and analysts to support their work.
Interview process at IBM
Based on 18 interviews in the last year, the process typically consists of 2 rounds: Technical Round 1 and Technical Round 2.