Tech Mahindra
I appeared for an interview in Jan 2025.
I applied via Recruitment Consultant and was interviewed in Aug 2024. There were 3 interview rounds.
The output after inner join of table 1 and table 2 will be 2,3,5.
Inner join only includes rows that have matching values in both tables.
Values 2, 3, and 5 are present in both tables, so they will be included in the output.
Null values are not considered as matching values in inner join.
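A minimal SQLite sketch of this behaviour; the table names and values are made up to mirror the question's table 1 and table 2, with a NULL in each to show it is excluded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical stand-ins for table 1 and table 2 from the question.
cur.execute("CREATE TABLE t1 (val INTEGER)")
cur.execute("CREATE TABLE t2 (val INTEGER)")
cur.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,), (5,), (None,)])
cur.executemany("INSERT INTO t2 VALUES (?)", [(2,), (3,), (4,), (5,), (None,)])
# Inner join keeps only matching rows; NULL never compares equal to NULL,
# so the NULL rows are dropped.
rows = cur.execute(
    "SELECT t1.val FROM t1 JOIN t2 ON t1.val = t2.val ORDER BY t1.val"
).fetchall()
print([r[0] for r in rows])  # [2, 3, 5]
```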
Query to find customer names with the maximum orders from Customers and Orders tables.
Use JOIN to combine Customers and Orders tables on CustomerID.
Group by CustomerID and count orders to find the maximum.
Use a subquery to filter customers with the maximum order count.
Example SQL: SELECT c.customerName FROM Customers c JOIN Orders o ON c.customerID = o.customerID GROUP BY c.customerID, c.customerName HAVING COUNT(o.orderID) = (SELECT MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM Orders GROUP BY customerID));
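The approach above can be run end to end in SQLite; the schema and sample data are assumptions, and the HAVING subquery handles ties by keeping every customer whose count equals the maximum:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Customers (customerID INTEGER, customerName TEXT)")
cur.execute("CREATE TABLE Orders (orderID INTEGER, customerID INTEGER)")
cur.executemany("INSERT INTO Customers VALUES (?, ?)",
                [(1, "Asha"), (2, "Ravi"), (3, "Meena")])
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(10, 1), (11, 1), (12, 2), (13, 1), (14, 2), (15, 3)])
# Customers whose order count equals the maximum order count.
query = """
SELECT c.customerName
FROM Customers c
JOIN Orders o ON c.customerID = o.customerID
GROUP BY c.customerID, c.customerName
HAVING COUNT(o.orderID) = (
    SELECT MAX(cnt) FROM (
        SELECT COUNT(*) AS cnt FROM Orders GROUP BY customerID
    )
)
"""
names = [r[0] for r in cur.execute(query)]
print(names)  # ['Asha'] — customer 1 has 3 orders, the maximum
```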
The project involves building a data pipeline to ingest, process, and analyze large volumes of data from various sources in Azure.
Utilizing Azure Data Factory for data ingestion and orchestration
Implementing Azure Databricks for data processing and transformation
Storing processed data in Azure Data Lake Storage
Using Azure Synapse Analytics for data warehousing and analytics
Leveraging Azure DevOps for CI/CD pipeline automation
Designing an effective ADF pipeline involves considering various metrics and factors.
Understand the data sources and destinations
Identify the dependencies between activities
Optimize data movement and processing for performance
Monitor and track pipeline execution for troubleshooting
Consider security and compliance requirements
Use parameterization and dynamic content for flexibility
Implement error handling and retries for robustness
Optimize data processing by partitioning, indexing, and using efficient storage formats.
Partitioning: Divide large datasets into smaller, manageable chunks. For example, partitioning a sales dataset by year.
Indexing: Create indexes on frequently queried columns to speed up data retrieval. For instance, indexing customer IDs in a transaction table.
Data Compression: Use compressed columnar formats like Parquet or ORC to reduce storage footprint and speed up reads.
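The indexing point can be demonstrated with SQLite (table and index names are illustrative): the query planner switches from a full scan to an index search once the frequently queried column is indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE transactions (id INTEGER, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(i, i % 100, i * 1.5) for i in range(1000)])
# Without an index this predicate requires a full table scan.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM transactions WHERE customer_id = 42"
).fetchall()
print(plan[0][-1])  # typically: SCAN transactions
# After indexing the frequently queried column, SQLite uses the index.
cur.execute("CREATE INDEX idx_customer ON transactions(customer_id)")
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM transactions WHERE customer_id = 42"
).fetchall()
print(plan[0][-1])  # typically: SEARCH transactions USING INDEX idx_customer
```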
Use Slowly Changing Dimensions (SCD) to preserve historical data while reconstructing a table.
Implement SCD Type 1 for overwriting old data without keeping history.
Use SCD Type 2 to create new records for changes, preserving history.
Example of SCD Type 2: If a customer's address changes, add a new record with the new address and mark the old record as inactive.
SCD Type 3 allows for limited history by adding new columns that store the previous value alongside the current one.
Databricks enhances data processing with advanced analytics, collaboration, and scalability beyond ADF's capabilities.
Databricks provides a collaborative environment for data scientists and engineers to work together using notebooks.
It supports advanced analytics and machine learning workflows, which ADF lacks natively.
Databricks can handle large-scale data processing with Apache Spark, making it more efficient for big data workloads.
I appeared for an interview in Dec 2024.
Optimization techniques in Spark improve performance and efficiency of data processing.
Partitioning data to distribute workload evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables
Avoiding shuffling operations whenever possible
Tuning configuration settings such as executor memory, cores, and shuffle parallelism
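The broadcast technique in particular can be illustrated without Spark: rather than shuffling both datasets by key, the small lookup table is copied to every worker and the join happens map-side. A plain-Python sketch of that pattern (all names and data are made up):

```python
# A small lookup table, analogous to what Spark would broadcast to executors.
country_lookup = {"IN": "India", "US": "United States", "DE": "Germany"}

# The large fact dataset, which in Spark would stay partitioned in place.
orders = [
    {"order_id": 1, "country_code": "IN"},
    {"order_id": 2, "country_code": "US"},
    {"order_id": 3, "country_code": "IN"},
]

# Map-side join: each record is enriched locally against the broadcast copy,
# so the large dataset is never shuffled by the join key.
enriched = [
    {**o, "country": country_lookup.get(o["country_code"])} for o in orders
]
print(enriched[0]["country"])  # India
```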
Methods to transfer data from on-premises storage to Azure Data Lake Storage Gen2
Use Azure Data Factory to create pipelines for data transfer
Utilize Azure Data Box for offline data transfer
Leverage Azure Storage Explorer for manual data transfer
Implement Azure Data Migration Service for large-scale data migration
Types of joins include inner, outer, left, right, and full joins in Spark queries.
Inner join: Returns rows that have matching values in both tables
Outer join: In Spark, 'outer' is a synonym for full outer join, returning all rows from both tables
Left join: Returns all rows from the left table and the matched rows from the right table
Right join: Returns all rows from the right table and the matched rows from the left table
Full join: Returns all rows from both tables, with nulls filled in where there is no match
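The left-join case (the others follow the same pattern) can be seen in SQLite with made-up tables: every row of the left table survives, and unmatched rows from the right come back as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Illustrative employee/department tables.
cur.execute("CREATE TABLE emp (name TEXT, dept_id INTEGER)")
cur.execute("CREATE TABLE dept (dept_id INTEGER, dept_name TEXT)")
cur.executemany("INSERT INTO emp VALUES (?, ?)", [("A", 1), ("B", 2), ("C", None)])
cur.executemany("INSERT INTO dept VALUES (?, ?)", [(1, "Sales"), (3, "HR")])
# Left join keeps every employee; departments with no match become NULL.
rows = cur.execute(
    "SELECT e.name, d.dept_name FROM emp e "
    "LEFT JOIN dept d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()
print(rows)  # [('A', 'Sales'), ('B', None), ('C', None)]
```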
I am willing to work in the firm's office environment.
I am comfortable working in a structured office setting
I value collaboration and communication with colleagues
I am adaptable to different office environments and cultures
I appeared for an interview in May 2025, where I was asked the following questions.
To find the 2nd highest salary by department, use SQL queries with ranking functions or subqueries.
Use the SQL 'ROW_NUMBER()' or 'RANK()' function to assign ranks to salaries within each department.
Example SQL query: SELECT dept, salary FROM (SELECT dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as rank FROM salary_table) as ranked WHERE rank = 2;
Alternatively, use a subquery to first find the highest salary per department, then select the maximum salary below it.
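The RANK() approach runs as-is in SQLite (3.25+); the department names and salaries below are invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE salary_table (dept TEXT, emp TEXT, salary INTEGER)")
cur.executemany("INSERT INTO salary_table VALUES (?, ?, ?)", [
    ("Eng", "A", 100), ("Eng", "B", 90), ("Eng", "C", 80),
    ("HR", "D", 70), ("HR", "E", 60),
])
# Rank salaries within each department, then keep rank 2.
rows = cur.execute("""
    SELECT dept, salary FROM (
        SELECT dept, salary,
               RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
        FROM salary_table
    ) WHERE rnk = 2 ORDER BY dept
""").fetchall()
print(rows)  # [('Eng', 90), ('HR', 60)]
```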
Implementing Slowly Changing Dimension (SCD) Type 2 in PySpark for data versioning.
SCD Type 2 tracks historical data by creating new records for changes.
Use a DataFrame to represent the current state of the dimension.
Identify changes by comparing the incoming data with existing records.
Set the 'end_date' of the old record and mark it as inactive.
Insert the new record with the updated values and an active status.
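The steps above can be sketched without Spark, using plain Python dicts to stand in for DataFrame rows; the column names (end_date, is_active) and the address-change scenario are assumptions:

```python
from datetime import date

# Current state of the dimension: one active row per customer.
dimension = [
    {"customer_id": 1, "address": "12 Park St", "start_date": date(2023, 1, 1),
     "end_date": None, "is_active": True},
]

def apply_scd2(dimension, incoming, today):
    """Close out changed rows and append new versions (SCD Type 2)."""
    for new_row in incoming:
        current = next(
            (r for r in dimension
             if r["customer_id"] == new_row["customer_id"] and r["is_active"]),
            None,
        )
        if current and current["address"] != new_row["address"]:
            # Expire the old record instead of overwriting it.
            current["end_date"] = today
            current["is_active"] = False
            # Insert the new version as the active record.
            dimension.append({**new_row, "start_date": today,
                              "end_date": None, "is_active": True})
    return dimension

apply_scd2(dimension, [{"customer_id": 1, "address": "98 Lake Rd"}], date(2024, 6, 1))
active = [r for r in dimension if r["is_active"]]
print(len(dimension), active[0]["address"])  # 2 98 Lake Rd
```

In PySpark the same logic is usually expressed as a Delta Lake MERGE, but the bookkeeping (expire old row, insert new active row) is identical.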
I applied via Naukri.com and was interviewed in May 2024. There were 2 interview rounds.
The project architecture includes Spark transformations for processing large volumes of data.
Spark transformations are used to manipulate data in distributed computing environments.
Examples of Spark transformations include map, filter, reduceByKey, join, etc.
Use window functions like ROW_NUMBER() to find highest sales from each city in SQL.
Use PARTITION BY clause in ROW_NUMBER() to partition data by city
Order the data by sales in descending order
Filter the results to only include rows with row number 1
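Those three steps translate directly into SQL, shown here against SQLite with an invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (city TEXT, store TEXT, amount INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Pune", "S1", 500), ("Pune", "S2", 700),
    ("Delhi", "S3", 300), ("Delhi", "S4", 250),
])
# ROW_NUMBER() partitioned by city, ordered by amount descending;
# row number 1 is the highest sale in each city.
rows = cur.execute("""
    SELECT city, store, amount FROM (
        SELECT city, store, amount,
               ROW_NUMBER() OVER (PARTITION BY city ORDER BY amount DESC) AS rn
        FROM sales
    ) WHERE rn = 1 ORDER BY city
""").fetchall()
print(rows)  # [('Delhi', 'S3', 300), ('Pune', 'S2', 700)]
```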
Storage is mounted in Databricks with the dbutils.fs.mount() utility, run from a notebook on a cluster.
Call dbutils.fs.mount() with the storage source URI (for ADLS Gen2, an abfss:// path), a mount point under /mnt, and the authentication configuration.
Note that the Databricks CLI's 'databricks fs' commands copy and list files but do not create mounts; mounting happens through dbutils inside a running workspace.
This round was a SQL question that required solving a problem hands-on.
Incremental load is the process of loading only new or updated data into a data warehouse, rather than reloading all data each time.
Incremental load helps in reducing the time and resources required for data processing.
It involves identifying new or updated data since the last load and merging it with the existing data.
Common techniques for incremental load include using timestamps or change data capture (CDC) mechanisms.
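The timestamp-watermark technique can be sketched in plain Python (column names and data are illustrative): only rows updated since the previous load are merged, and the watermark advances for the next run.

```python
from datetime import datetime

# Source rows with last-updated timestamps (illustrative data).
source = [
    {"id": 1, "value": "a", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2024, 3, 1)},
    {"id": 3, "value": "c", "updated_at": datetime(2024, 3, 2)},
]

def incremental_load(source, target, last_load):
    """Merge only rows changed since the previous load (timestamp watermark)."""
    for row in source:
        if row["updated_at"] > last_load:
            target[row["id"]] = row  # upsert: insert new rows, overwrite updated ones
    # The new watermark is the latest timestamp seen.
    return max([last_load] + [r["updated_at"] for r in source])

target = {1: source[0]}  # row 1 was already loaded in an earlier run
new_watermark = incremental_load(source, target, datetime(2024, 2, 1))
print(sorted(target), new_watermark)  # [1, 2, 3] 2024-03-02 00:00:00
```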
Optimizing Spark performance involves tuning configurations, data partitioning, and efficient resource management.
Use DataFrame API instead of RDDs for better optimization and performance.
Optimize data partitioning by using 'repartition' or 'coalesce' to balance workloads.
Leverage broadcast variables to reduce data shuffling in joins.
Cache intermediate results using 'persist()' to avoid recomputation.
Adjust Spark configuration settings such as executor memory, executor cores, and shuffle partitions.
I applied via Naukri.com and was interviewed before Aug 2020. There were 4 interview rounds.
Some of the top questions asked at the Tech Mahindra Azure Data Engineer interview, based on 8 interview experiences.
Software Engineer | 26.6k salaries | ₹3.7 L/yr - ₹9.2 L/yr
Senior Software Engineer | 22.2k salaries | ₹9.1 L/yr - ₹18.5 L/yr
Technical Lead | 12.5k salaries | ₹16.9 L/yr - ₹30 L/yr
Associate Software Engineer | 6.1k salaries | ₹1.9 L/yr - ₹5.7 L/yr
Team Lead | 5.4k salaries | ₹5.7 L/yr - ₹17.7 L/yr
Infosys
Cognizant
Accenture
Wipro