Answers to interview questions for the Data Engineer position.
1. Partitioning is a way to divide a large dataset into smaller, more manageable parts based on a specific column or expression. Bucketing is a technique to further organize the data within each partition into smaller, equally-sized files based on a hash function.
2. UNION combines the result sets of two or more SELECT statements, removing duplicate rows. UNION ALL combines them without removing duplicates, so it is faster but can return the same row more than once.
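For illustration, a minimal PySpark sketch of partitioning and bucketing on write (the input path, table name, and columns are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical input path

# Partitioning: one directory per distinct value of event_date.
# Bucketing: rows hashed on user_id into 16 fixed buckets within each partition.
(df.write
   .partitionBy("event_date")
   .bucketBy(16, "user_id")
   .sortBy("user_id")
   .saveAsTable("events_bucketed"))  # bucketing requires saveAsTable
```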
1) Partitioning divides data into smaller parts based on a column, while bucketing divides the data within each partition into a fixed number of buckets based on a hash of a column.
2) Internal (managed) tables store data in a default location managed by Hive, which deletes the data on DROP; external tables store data in a user-defined location that Hive leaves in place.
3) Hive architecture consists of a metastore, driver, compiler, optimizer, and execution engine.
4) Cache stores data in memory for faster access on repeated reads.
Yes, I have experience working with Oracle, Python, PySpark, and SQL in my previous roles as a Data Engineer.
Worked extensively with Oracle databases for data storage and retrieval
Utilized Python for data manipulation, analysis, and automation tasks
Implemented data processing and analytics using PySpark
Proficient in writing and optimizing SQL queries for data extraction and transformation
A view is a virtual table created from a SQL query. Dense rank assigns a rank to each row in a result set, giving tied rows the same rank.
A view is a saved SQL query that can be used as a table
Dense rank gives tied rows the same rank, with no gaps between the rank values
Dense rank is used to rank rows based on a specific column or set of columns
Example: SELECT * FROM my_view WHERE column_name = 'value'
Example: SELECT column_name, DENSE_RANK() OVER (ORDER BY column_name DESC) AS rnk FROM my_table
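A self-contained PySpark illustration of both points, with data invented for the demo:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("view-dense-rank").getOrCreate()
spark.createDataFrame(
    [("Alice", 90), ("Bob", 90), ("Cara", 80)],
    ["name", "salary"],
).createOrReplaceTempView("employees")  # a view is just a saved query

# DENSE_RANK gives tied rows the same rank, with no gaps afterwards
spark.sql("""
    SELECT name, salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
""").show()
# Alice and Bob both get rank 1; Cara gets rank 2 (no gap)
```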
Queue & Stack Algorithm involves data structures for storing and retrieving data in a specific order.
Queue follows First In First Out (FIFO) principle, like a line at a grocery store.
Stack follows Last In First Out (LIFO) principle, like a stack of plates.
Examples: Queue - BFS algorithm in graph traversal. Stack - Undo feature in text editors.
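A small runnable Python sketch of both behaviors:

```python
from collections import deque

# Queue: FIFO - append at the right, remove from the left
queue = deque()
queue.append("first")
queue.append("second")
print(queue.popleft())  # "first" leaves before "second"

# Stack: LIFO - append and pop from the same end
stack = []
stack.append("plate 1")
stack.append("plate 2")
print(stack.pop())  # "plate 2", the last plate added
```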
Rotate an array of strings by a given number of positions.
Create a new array and copy elements from the original array based on the rotation index.
Handle cases where the rotation index is greater than the array length by using modulo operation.
Example: Original array ['a', 'b', 'c', 'd', 'e'], rotate by 2 positions -> ['c', 'd', 'e', 'a', 'b']
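One possible Python implementation along these lines (slicing is one of several ways to write it):

```python
def rotate_left(items, k):
    """Rotate a list left by k positions; modulo handles k > len(items)."""
    if not items:
        return items
    k = k % len(items)
    return items[k:] + items[:k]

print(rotate_left(["a", "b", "c", "d", "e"], 2))  # ['c', 'd', 'e', 'a', 'b']
print(rotate_left(["a", "b", "c", "d", "e"], 7))  # same result: 7 % 5 == 2
```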
Corrupt record handling in Spark involves identifying and handling data that does not conform to expected formats.
Use the DataFrameReader option("badRecordsPath", "path/to/bad/records") (a Databricks-specific option) to save corrupt records to a separate location for further analysis.
On open-source Spark, read with mode set to PERMISSIVE, DROPMALFORMED, or FAILFAST; in PERMISSIVE mode, malformed rows are captured in the column named by columnNameOfCorruptRecord, and DataFrame.na.drop() or DataFrame.na.fill() can then drop or fill the null values they leave behind.
Implement custom logic to identify and handle corrupt records that the built-in options do not cover, as in the sketch below.
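A sketch of the open-source Spark approach (the file path and schema are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("corrupt-records").getOrCreate()

# The schema must include the corrupt-record column for PERMISSIVE mode to fill it
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")  # alternatives: DROPMALFORMED, FAILFAST
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/data/input.json"))  # hypothetical path

df.cache()  # some Spark versions require caching before filtering on this column
bad = df.filter(df._corrupt_record.isNotNull())   # inspect or persist separately
good = df.filter(df._corrupt_record.isNull()).drop("_corrupt_record")
```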
SCD 1 overwrites old data with new data, while SCD 2 keeps track of historical changes.
SCD 1 updates existing records with new data, losing historical information.
SCD 2 creates new records for each change, preserving historical data.
SCD 1 is simpler and faster, but can lead to data loss.
SCD 2 is more complex and slower, but maintains a full history of changes.
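A toy Python illustration of the difference (the customer record and dates are invented; a real implementation would use a MERGE against the dimension table):

```python
from datetime import date

# Hypothetical dimension row for a customer whose city changed
current = {"customer_id": 1, "city": "Pune"}

# SCD Type 1: overwrite in place - the old value is lost
scd1 = dict(current)
scd1["city"] = "Mumbai"

# SCD Type 2: close the old row and append a new one - history preserved
scd2_rows = [
    {"customer_id": 1, "city": "Pune", "valid_from": date(2023, 1, 1),
     "valid_to": date(2024, 6, 1), "is_current": False},
    {"customer_id": 1, "city": "Mumbai", "valid_from": date(2024, 6, 1),
     "valid_to": None, "is_current": True},
]
```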
Optimizations in PySpark involve techniques to improve the performance and efficiency of data processing.
Use partitioning to distribute data evenly across nodes for parallel processing
Utilize caching to store intermediate results in memory for faster access
Avoid unnecessary shuffling of data by using appropriate join strategies
Optimize the execution plan by analyzing and adjusting the stages of the job
Use broadcast variables or broadcast joins to ship small lookup datasets to every executor instead of shuffling the large table (see the sketch below)
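A minimal PySpark sketch combining several of these ideas (the paths, column names, and partition count are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimizations").getOrCreate()
events = spark.read.parquet("/data/events")         # hypothetical large table
countries = spark.read.parquet("/data/countries")   # hypothetical small lookup

events = events.repartition(200, "country_code")  # spread data evenly by key
events.cache()                                    # reuse across multiple actions

# Broadcast the small side so the large table is not shuffled for the join
joined = events.join(broadcast(countries), "country_code")
joined.explain()  # inspect the plan for a BroadcastHashJoin
```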
I appeared for an interview in Apr 2025, where I was asked the following questions.
I applied via Walk-in
RANK and DENSE_RANK both give tied rows the same rank, but RANK leaves gaps in the sequence after a tie while DENSE_RANK does not. A left join includes all rows from the left table and the matching rows from the right table, while a left anti join includes only the rows from the left table that have no match in the right table.
Rank assigns ranks based on the specified order and skips values after ties (1, 1, 3), while dense_rank assigns consecutive ranks with no gaps (1, 1, 2)
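A small PySpark demonstration of both points (the data is invented for the demo):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rank, dense_rank, col
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("rank-and-anti-join").getOrCreate()
scores = spark.createDataFrame(
    [("a", 90), ("b", 90), ("c", 80)], ["id", "score"])

w = Window.orderBy(col("score").desc())
scores.select("id", "score",
              rank().over(w).alias("rank"),        # 1, 1, 3 - gap after the tie
              dense_rank().over(w).alias("dense")  # 1, 1, 2 - no gap
              ).show()

left = spark.createDataFrame([(1,), (2,), (3,)], ["k"])
right = spark.createDataFrame([(2,)], ["k"])
left.join(right, "k", "left").show()       # all left rows, matched or null
left.join(right, "k", "left_anti").show()  # only unmatched left rows: 1 and 3
```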
I applied via Recruitment Consultant and was interviewed in Aug 2024. There were 2 interview rounds.
Focus a bit more on quantitative maths and aptitude
I applied via LinkedIn and was interviewed in Oct 2024. There was 1 interview round.
Reverse strings in a Python list
Use list comprehension to iterate through the list and reverse each string
Use the slice notation [::-1] to reverse each string
Example: strings = ['hello', 'world'], reversed_strings = [s[::-1] for s in strings]
To find the 2nd highest salary in SQL, use a 'SELECT' statement with 'ORDER BY', 'LIMIT', and 'OFFSET' clauses.
Use the 'SELECT' statement to retrieve the salary column from the table.
Use the 'ORDER BY' clause to sort the salaries in descending order.
Use 'LIMIT 1 OFFSET 1' to skip the highest salary and return the second row (add DISTINCT to handle ties), as in the sketch below.
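A self-contained demonstration using Python's built-in sqlite3 (LIMIT/OFFSET syntax varies by database; this is one common form, and the table is invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 100), ("b", 200), ("c", 200), ("d", 150)])

# DISTINCT guards against ties; OFFSET 1 skips the single highest salary
row = conn.execute("""
    SELECT DISTINCT salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 1 OFFSET 1
""").fetchone()
print(row[0])  # 150
```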
I appeared for an interview in Sep 2024.
I applied via Approached by Company and was interviewed in Sep 2024. There was 1 interview round.
Object-oriented programming (OOP) is a programming paradigm based on the concept of objects, which can contain data in the form of fields and code in the form of procedures.
OOP focuses on creating objects that interact with each other to solve a problem
Key concepts include encapsulation, inheritance, polymorphism, and abstraction
Encapsulation involves bundling data and the methods that operate on that data into a single unit (a class)
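A compact Python sketch of these concepts (the class names are invented for illustration):

```python
class Account:
    """Encapsulation: the balance is managed only through methods."""
    def __init__(self, balance=0):
        self._balance = balance

    def deposit(self, amount):
        self._balance += amount

    def describe(self):
        return f"Account with balance {self._balance}"

class SavingsAccount(Account):
    """Inheritance: reuses Account; polymorphism: overrides describe()."""
    def describe(self):
        return f"Savings {super().describe().lower()}"

for acct in (Account(100), SavingsAccount(200)):
    print(acct.describe())  # same call, different behavior per class
```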
Data engineer life cycle involves collecting, storing, processing, and analyzing data using various tools.
Data collection: Gathering data from various sources such as databases, APIs, and logs.
Data storage: Storing data in databases, data lakes, or data warehouses.
Data processing: Cleaning, transforming, and enriching data using tools like Apache Spark or Hadoop.
Data analysis: Analyzing data to extract insights and make data-driven decisions.
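A minimal PySpark sketch of the processing and analysis steps of that life cycle (the paths and columns are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("mini-etl").getOrCreate()

raw = spark.read.json("/data/raw/orders")           # collect: ingest raw data
clean = raw.dropna(subset=["order_id"]).withColumn(
    "amount", col("amount").cast("double"))         # process: clean and transform
clean.write.mode("overwrite").parquet("/data/warehouse/orders")  # store
clean.groupBy("region").agg(avg("amount")).show()   # analyze: extract insights
```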
Spark join strategies include broadcast join, shuffle hash join, and shuffle sort merge join.
Broadcast join is used when one of the DataFrames is small enough to fit in memory on all nodes.
Shuffle hash join is used when joining two large DataFrames by partitioning and shuffling the data based on the join key.
Shuffle sort merge join is used when joining two large DataFrames by sorting and merging the data based on the join key.
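A small PySpark sketch showing how Spark 3.x join hints can request each strategy; explain() prints the physical plan Spark actually chose:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-strategies").getOrCreate()
big = spark.range(1_000_000).withColumnRenamed("id", "k")
small = spark.range(100).withColumnRenamed("id", "k")

# Hints request a strategy; without them Spark picks one from size statistics
big.join(small.hint("broadcast"), "k").explain()     # requests BroadcastHashJoin
big.join(small.hint("shuffle_hash"), "k").explain()  # requests ShuffledHashJoin
big.join(small.hint("merge"), "k").explain()         # requests SortMergeJoin
```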
Spark is a fast and general-purpose cluster computing system for big data processing.
Spark is popular for its speed and ease of use in processing large datasets.
It provides in-memory processing capabilities, making it faster than traditional disk-based processing systems.
Spark supports multiple programming languages like Java, Scala, Python, and R.
It offers a wide range of libraries for diverse tasks such as SQL, streaming, machine learning, and graph processing.
Clustering is the process of grouping similar data points together. Pods are groups of one or more containers, while nodes are individual machines in a cluster.
Clustering is a technique used in machine learning to group similar data points together based on certain features or characteristics.
Pods in a cluster are groups of one or more containers that share resources and are scheduled together on the same node.
Nodes are the individual machines (physical or virtual) in a cluster that supply the resources on which pods run.
The duration of the TCS Data Engineer interview process can vary, but it typically takes less than 2 weeks to complete (based on 101 interview experiences).
TCS salaries by role:
System Engineer (1.1L salaries): ₹1 L/yr - ₹9 L/yr
IT Analyst (65.6k salaries): ₹7.7 L/yr - ₹12.7 L/yr
AST Consultant (53.5k salaries): ₹12 L/yr - ₹20.6 L/yr
Assistant System Engineer (33.2k salaries): ₹2.5 L/yr - ₹6.4 L/yr
Associate Consultant (32.9k salaries): ₹16.2 L/yr - ₹28 L/yr