Identify duplicate rows in a table
Use a SQL query with GROUP BY and a HAVING clause to identify duplicate rows based on specific columns
Example: SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1
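The query above can be run end to end with Python's built-in sqlite3 module; the table name, columns, and data here are hypothetical, chosen only to illustrate the pattern:

```python
import sqlite3

# Hypothetical table and data, assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", "book"), ("bob", "pen"), ("alice", "book"), ("alice", "book")],
)

# Group on the columns that define a duplicate; HAVING keeps groups seen more than once.
rows = conn.execute(
    """
    SELECT customer, product, COUNT(*) AS cnt
    FROM orders
    GROUP BY customer, product
    HAVING COUNT(*) > 1
    """
).fetchall()
print(rows)  # [('alice', 'book', 3)]
```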
The question asks: in each state, which gender makes the most purchases?
Aggregate the data by state and gender to calculate the total purchases made by each gender in each state.
Identify the gender with the highest total purchases in each state.
Present the results in a table or chart for easy visualization.
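The aggregate-then-pick-the-top steps above can be sketched with sqlite3; the `purchases` table, its columns, and the sample values are assumptions made for the example:

```python
import sqlite3

# Hypothetical purchases table, assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (state TEXT, gender TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("KA", "F", 120.0), ("KA", "M", 80.0),
        ("MH", "F", 50.0), ("MH", "M", 90.0),
    ],
)

# Step 1: total purchases per state and gender.
totals = conn.execute(
    "SELECT state, gender, SUM(amount) FROM purchases GROUP BY state, gender"
).fetchall()

# Step 2: keep the gender with the highest total in each state.
best = {}
for state, gender, total in totals:
    if state not in best or total > best[state][1]:
        best[state] = (gender, total)
print(best)  # {'KA': ('F', 120.0), 'MH': ('M', 90.0)}
```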
Use SQL to calculate the difference in marks for each student ID across different years.
Use a self join on the table to compare marks for the same student ID across different years.
Calculate the difference in marks by subtracting the marks from different years.
Group the results by student ID to get the difference in marks for each student.
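The self-join approach can be demonstrated with sqlite3; the `marks` table layout and sample rows are hypothetical, with one row per student per year:

```python
import sqlite3

# Hypothetical marks table, one row per student per year, assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student_id INTEGER, year INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO marks VALUES (?, ?, ?)",
    [(1, 2022, 70), (1, 2023, 85), (2, 2022, 60), (2, 2023, 55)],
)

# Self join on student_id, pairing each year with the previous one,
# then subtract to get the year-over-year difference per student.
rows = conn.execute(
    """
    SELECT cur.student_id, cur.year, cur.score - prev.score AS diff
    FROM marks AS cur
    JOIN marks AS prev
      ON prev.student_id = cur.student_id
     AND prev.year = cur.year - 1
    ORDER BY cur.student_id, cur.year
    """
).fetchall()
print(rows)  # [(1, 2023, 15), (2, 2023, -5)]
```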
By default, Spark's FIFO scheduler runs one job at a time; with four worker nodes of four cores each, that job can run up to 16 tasks in parallel.
In Spark, each core runs one task at a time, so a node with four cores can run four tasks concurrently.
Across four worker nodes with four cores each, the cluster has 16 cores, so up to 16 tasks can run in parallel.
How many jobs run concurrently depends on the scheduler: FIFO (the default) runs them one after another, while the fair scheduler can interleave multiple jobs.
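The task-slot arithmetic can be sketched directly; the node and core counts are the ones assumed in the question:

```python
# Cluster assumed in the question: four worker nodes, four cores per node.
worker_nodes = 4
cores_per_node = 4

# Each core executes one task at a time, so cores bound task-level parallelism.
task_slots = worker_nodes * cores_per_node
print(task_slots)  # 16
```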
Spark handles fault tolerance through resilient distributed datasets (RDDs) and lineage tracking.
Spark achieves fault tolerance through RDDs, which are immutable distributed collections of objects that can be rebuilt if a partition is lost.
RDDs track the lineage of transformations applied to the data, allowing lost partitions to be recomputed based on the original data and transformations.
Spark can also replicate data partitions across nodes (e.g. via replicated storage levels such as MEMORY_ONLY_2) and checkpoint RDDs to truncate long lineage chains.
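A toy sketch of the lineage idea in plain Python (not Spark's actual API): a lost partition can be recomputed by replaying the recorded transformations against the source data.

```python
# Toy illustration of lineage-based recovery; functions and data are hypothetical.
source = [1, 2, 3, 4, 5, 6]

# Lineage: the ordered list of transformations applied to the source.
lineage = [
    lambda xs: [x * 10 for x in xs],       # map step
    lambda xs: [x for x in xs if x > 20],  # filter step
]

def compute(partition):
    """Replay the lineage over one source partition."""
    for transform in lineage:
        partition = transform(partition)
    return partition

partitions = [source[:3], source[3:]]
results = [compute(p) for p in partitions]

# Simulate losing partition 0 and recomputing it from source + lineage.
recovered = compute(partitions[0])
print(results, recovered)  # [[30], [40, 50, 60]] [30]
```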
Lineage refers to the history and origin of data, including its source, transformations, and dependencies.
Lineage helps in understanding how data is generated, processed, and transformed throughout its lifecycle.
It tracks the flow of data from its source to its destination, including any intermediate steps or transformations.
Lineage is important for data governance, data quality, and troubleshooting data issues.
Example: tracing a figure in a report back to the source tables and the transformations that produced it.
DAG stands for Directed Acyclic Graph, a data structure used to represent dependencies between tasks in a workflow.
DAG is a collection of nodes connected by edges, where each edge has a direction and there are no cycles.
It is commonly used in data engineering for representing data pipelines and workflows.
DAGs help in visualizing and optimizing the order of tasks to be executed in a workflow.
Popular tools like Apache Airflow model workflows as DAGs, and Spark builds a DAG of stages for each job.
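A small sketch with hypothetical task names, showing how a DAG's edges determine a valid execution order via topological sorting:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# static_order() yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load']
```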
I have used techniques like indexing, query optimization, and parallel processing in my projects.
Indexing: Used to improve the speed of data retrieval by creating indexes on columns frequently used in queries.
Query optimization: Rewriting queries to improve efficiency and reduce execution time.
Parallel processing: Distributing tasks across multiple processors to speed up data processing.
Caching: Storing frequently accessed results in memory so repeated reads avoid recomputation or extra round trips.
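As a minimal illustration of the caching technique (plain Python, hypothetical function), memoizing an expensive computation so repeat calls hit the cache:

```python
from functools import lru_cache

calls = 0  # counts how often the expensive body actually runs

@lru_cache(maxsize=128)
def expensive_lookup(key: str) -> str:
    """Stand-in for a slow query or computation."""
    global calls
    calls += 1
    return key.upper()

expensive_lookup("region")  # first call: computed
expensive_lookup("region")  # second call: served from cache
print(calls)  # 1
```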
ADF stands for Azure Data Factory, a cloud-based data integration service that allows you to create, schedule, and manage data pipelines.
ADF is used for building, scheduling, and monitoring data pipelines to move and transform data from various sources to destinations.
It supports data integration between various data stores such as Azure SQL Database, Azure Blob Storage, and on-premises data sources.
ADF provides a visual interface for authoring, scheduling, and monitoring pipelines with little or no code.
Group by is a SQL clause used to aggregate data based on one or more columns.
Used to group rows that have the same values in specified columns.
Commonly used with aggregate functions like COUNT, SUM, AVG.
Example: SELECT department, COUNT(*) FROM employees GROUP BY department;
Can include HAVING clause to filter groups based on aggregate values.
Example: SELECT department, AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 50000;
I applied via Naukri.com and was interviewed in Aug 2024. There was 1 interview round.
I am a Data Engineer with experience in designing and implementing project architectures. My day-to-day responsibilities include data processing, ETL tasks, and ensuring data quality.
Designing and implementing project architectures for data processing
Performing ETL tasks to extract, transform, and load data into the system
Ensuring data quality and integrity through data validation and cleansing
Collaborating with cross-functional teams to deliver reliable data solutions
I appeared for an interview before Oct 2022.
1. ETL Pipeline
2. PySpark Code
3. SQL
Some of the top questions asked at the Fragma Data Systems Data Engineer interview -
based on 3 interview experiences
| Role | Salaries reported | Salary range |
| Data Engineer | 53 | ₹8.3 L/yr - ₹16 L/yr |
| Software Engineer | 30 | ₹5.7 L/yr - ₹16.8 L/yr |
| Senior Software Engineer | 25 | ₹11.4 L/yr - ₹31.6 L/yr |
| Business Analyst | 24 | ₹3.5 L/yr - ₹6 L/yr |
| Data Analyst | 13 | ₹3.5 L/yr - ₹6.4 L/yr |
SE2 DIGITAL SERVICE LLP
Pragmasys Consulting LLP
Vowelweb
TantranZm Technologies