Infosys
ParDo in Dataflow refers to a parallel processing transform that optimizes performance and resource utilization.
ParDo stands for 'Parallel Do', enabling distributed processing of data across multiple nodes.
It allows for efficient handling of large datasets by breaking them into smaller bundles that are processed independently.
For example, in Apache Beam, ParDo can be used to apply a function to each element in a collection in parallel.
This model ...
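A minimal sketch of what this looks like in the Apache Beam Python SDK; the DoFn, element values, and pipeline here are illustrative, not part of the original answer:

import apache_beam as beam

class SplitWords(beam.DoFn):
    # The DoFn holds the per-element logic that ParDo runs in parallel.
    def process(self, element):
        for word in element.split():
            yield word  # a DoFn may emit zero, one, or many outputs per input

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create(["hello world", "parallel do"])  # small in-memory PCollection
     | beam.ParDo(SplitWords())                     # apply the DoFn to each element
     | beam.Map(print))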
Calculate trailing zeros in a factorial using Python by counting factors of 5 in the numbers leading to n.
Trailing zeros in n! are produced by factors of 10, which are made from pairs of 2 and 5.
Since there are usually more factors of 2 than 5, we only need to count the factors of 5.
The formula to calculate trailing zeros is: n // 5 + n // 25 + n // 125 + ... until n // 5^k is 0.
Example: For 100!, trailing zeros = 100 // 5 + 100 // 25 = 20 + 4 = 24.
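A short Python implementation of the counting approach described above:

def trailing_zeros(n: int) -> int:
    # Sum n // 5 + n // 25 + n // 125 + ... until the power of 5 exceeds n.
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count

print(trailing_zeros(100))  # 24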
Efficiently joining large and small datasets requires strategic approaches to optimize performance and resource usage.
Use a distributed computing framework like Apache Spark to handle large datasets efficiently.
Consider filtering the larger dataset before the join to reduce its size, e.g., using a WHERE clause.
Leverage indexing on the join keys to speed up the join operation.
Use a broadcast join if the smaller dataset fits in memory, so it can be shipped to every worker.
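A brief PySpark sketch of the broadcast-join idea; the paths, table names, and join key are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")                     # large dataset (illustrative path)
countries = spark.read.csv("/data/countries.csv", header=True)  # small lookup table

# broadcast() ships the small table to every executor, so the large
# table can be joined locally without shuffling it across the cluster.
joined = orders.join(broadcast(countries), on="country_code")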
Incremental data loading is a process of updating a database with only new or changed data since the last load.
Reduces data transfer time by only loading new or modified records.
Commonly used in ETL (Extract, Transform, Load) processes.
Example: Loading only new customer records added since the last update.
Helps in maintaining data consistency and reducing redundancy.
Can be implemented using timestamps or change data capture (CDC).
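A minimal timestamp-based sketch in Python; the table, column names, and watermark value are illustrative assumptions:

import sqlite3

conn = sqlite3.connect("source.db")

# In practice the watermark would be read from, and written back to,
# durable storage after each successful load.
last_load_time = "2024-01-01 00:00:00"

# Extract only rows created or modified since the previous load
# (assumes updated_at is stored as an ISO-formatted string).
new_rows = conn.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_load_time,),
).fetchall()

# new_rows would then be loaded into the target, and the watermark
# advanced to the max(updated_at) seen in this batch.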
A palindrome is a word, phrase, number, or other sequence of characters that reads the same forward and backward.
Check if the string is equal to its reverse to determine if it's a palindrome.
Ignore spaces and punctuation when checking for palindromes.
Convert the string to lowercase before checking for palindromes.
Examples: 'racecar', 'A man, a plan, a canal, Panama'
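A small Python helper that applies all three rules:

def is_palindrome(text: str) -> bool:
    # Drop spaces and punctuation, lowercase, then compare with the reverse.
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return cleaned == cleaned[::-1]

print(is_palindrome("racecar"))                         # True
print(is_palindrome("A man, a plan, a canal, Panama"))  # True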
Developed a data pipeline to ingest, process, and analyze customer feedback data for product improvement.
Designed and implemented ETL processes to extract data from various sources
Utilized Apache Spark for data processing and analysis
Built data visualizations to present insights to stakeholders
I appeared for an interview in May 2025, where I was asked the following questions.
Developed a data pipeline for processing and analyzing large datasets from various sources to support business intelligence.
Designed ETL processes to extract data from APIs and databases, ensuring data integrity.
Utilized Apache Spark for distributed data processing, improving performance by 30%.
Implemented data warehousing solutions using Amazon Redshift for efficient querying.
Created dashboards in Tableau for visualization.
SQL query to compare today's sales with yesterday's sales using aggregation and date functions.
Use a table with sales data that includes a date column.
Aggregate sales by date using SUM() function.
Use a Common Table Expression (CTE) or subquery to get sales for today and yesterday.
Calculate the difference between today's and yesterday's sales.
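One possible shape of the query, sketched here with Python's sqlite3; the sales table, its sale_date and amount columns, and ISO-formatted dates are assumptions:

import sqlite3

conn = sqlite3.connect("sales.db")

# The CTE aggregates sales per day; the outer query joins today's total
# to yesterday's and computes the difference.
query = """
WITH daily AS (
    SELECT sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date
)
SELECT t.total           AS today_sales,
       y.total           AS yesterday_sales,
       t.total - y.total AS difference
FROM daily AS t
JOIN daily AS y ON y.sale_date = DATE(t.sale_date, '-1 day')
WHERE t.sale_date = DATE('now')
"""
print(conn.execute(query).fetchone())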
I applied via Naukri.com
I appeared for an interview in May 2025, where I was asked the following questions.
Identify the city with the highest revenue by analyzing data from various regions.
Aggregate revenue data from all regions within the city.
Use SQL queries like 'SELECT city, SUM(revenue) FROM sales GROUP BY city ORDER BY SUM(revenue) DESC LIMIT 1;'
Consider factors like population, economic activity, and industry presence in each region.
Example: If Region A has $1M and Region B has $2M, the total for the city is $3M.
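The query from above, checked end-to-end on toy data with sqlite3 (the cities and figures are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, region TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Pune", "Region A", 1_000_000),
        ("Pune", "Region B", 2_000_000),
        ("Delhi", "Region C", 2_500_000),
    ],
)
top = conn.execute(
    "SELECT city, SUM(revenue) FROM sales "
    "GROUP BY city ORDER BY SUM(revenue) DESC LIMIT 1"
).fetchone()
print(top)  # ('Pune', 3000000) - regions A and B sum to 3M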
I applied via Naukri.com and was interviewed in Oct 2024. There was 1 interview round.
A DSA (data structures and algorithms) question was asked.
I appeared for an interview in Feb 2025, where I was asked the following questions.
DROP removes a table permanently; TRUNCATE deletes all rows but retains the table structure.
DROP TABLE table_name; - Completely removes the table and its data.
TRUNCATE TABLE table_name; - Deletes all rows but keeps the table structure.
Rollback behavior is DBMS-specific: MySQL and Oracle implicitly commit both DROP and TRUNCATE, so neither can be rolled back, while PostgreSQL and SQL Server allow both to be rolled back inside a transaction.
TRUNCATE is usually faster than DELETE because it doesn't log individual row deletions.
I appeared for an interview in Dec 2024, where I was asked the following questions.
Spark optimization techniques enhance performance and resource utilization in distributed data processing tasks.
Use DataFrames and Datasets for optimized execution plans.
Leverage Catalyst Optimizer for query optimization.
Apply Tungsten for memory management and code generation.
Utilize partitioning to minimize data shuffling, e.g., using 'repartition' or 'coalesce'.
Cache intermediate results with 'persist()' to avoid recomputation.
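A brief PySpark sketch touching a few of these techniques; the input path and column names are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-opt-demo").getOrCreate()

# DataFrame API: Catalyst optimizes the plan, Tungsten handles memory/codegen.
events = spark.read.parquet("/data/events")

daily = (
    events.groupBy("event_date")
          .agg(F.count("*").alias("n"))
          .coalesce(8)  # shrink partition count without a full shuffle
)

daily.persist()  # cache the reused intermediate result
daily.count()    # first action materializes the cache
daily.show()     # subsequent actions reuse it instead of recomputing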
ETL is a data integration process that involves extracting data, transforming it for analysis, and loading it into a target system.
Extract: Gather data from various sources like databases, APIs, or flat files. Example: Pulling customer data from a CRM system.
Transform: Clean and format the data to meet business requirements. Example: Converting date formats or aggregating sales data.
Load: Insert the transformed data into the target system, such as a data warehouse.
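A toy end-to-end ETL sketch in Python; the file name, columns, and date format are assumptions for illustration:

import csv
import sqlite3
from datetime import datetime

# Extract: read raw customer rows from a flat file
# (assumed to have 'name' and 'signup_date' columns).
with open("customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize dates from DD/MM/YYYY to ISO 8601.
for row in rows:
    row["signup_date"] = datetime.strptime(
        row["signup_date"], "%d/%m/%Y"
    ).date().isoformat()

# Load: insert the cleaned rows into the target table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO customers (name, signup_date) VALUES (:name, :signup_date)",
    rows,
)
conn.commit()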
I applied via Company Website and was interviewed in Mar 2024. There were 3 interview rounds.
It went fine and was interactive.
The duration of the Infosys Data Engineer interview process can vary, but it typically takes about 2-4 weeks to complete.
Role                        | Salaries reported | Salary range
Technology Analyst          | 55.1k             | ₹4.8 L/yr - ₹10 L/yr
Senior Systems Engineer     | 54.4k             | ₹2.5 L/yr - ₹6.3 L/yr
Technical Lead              | 35.4k             | ₹9.5 L/yr - ₹16.5 L/yr
System Engineer             | 32.6k             | ₹2.4 L/yr - ₹6 L/yr
Senior Associate Consultant | 32.4k             | ₹8.3 L/yr - ₹15 L/yr
TCS
Wipro
Cognizant
Accenture