Infosys: rated 3.6, based on 43.9k reviews

Infosys Data Scientist Interview Questions and Answers

Updated 4 Apr 2025

14 Interview questions

🔥 Asked by recruiter 5 times

A Data Scientist was asked 4mo ago

Q. What are the core concepts of Object-Oriented Programming (OOP)?

Ans.

OOP concepts include encapsulation, inheritance, polymorphism, and abstraction, which together organize code around objects.

Encapsulation: Bundling data and methods in a class. Example: A class 'Car' with attributes like 'speed' and methods like 'accelerate()'.
Inheritance: Deriving new classes from existing ones. Example: 'ElectricCar' inherits from 'Car', adding features like 'batteryCapacity'.
Polymorphism: Ability to t...
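
The first three concepts above can be sketched in a few lines of Python. This is only an illustration built on the 'Car' and 'ElectricCar' example from the answer; the `describe()` method, the default values, and the Python-style name `battery_capacity` (for 'batteryCapacity') are assumptions added for the sketch.

```python
class Car:
    """Encapsulation: speed lives inside the object and changes only via methods."""

    def __init__(self, speed=0):
        self._speed = speed  # leading underscore marks internal state

    @property
    def speed(self):
        return self._speed

    def accelerate(self, delta):
        self._speed += delta

    def describe(self):
        return f"Car at {self._speed} km/h"


class ElectricCar(Car):
    """Inheritance: reuses Car and adds a battery capacity."""

    def __init__(self, speed=0, battery_capacity=75):
        super().__init__(speed)
        self.battery_capacity = battery_capacity

    def describe(self):
        # Polymorphism: same method name, behaviour depends on the object's class.
        return f"Electric car at {self._speed} km/h, {self.battery_capacity} kWh battery"


cars = [Car(), ElectricCar()]
for car in cars:
    car.accelerate(10)  # the caller does not need to know the concrete class
descriptions = [car.describe() for car in cars]
```

The loop treats both objects the same way, yet each `describe()` call resolves to the version defined on the object's own class.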

A Data Scientist was asked

Q. With the XGBoost algorithm using 10-20 features, how are the splits decided, and on which feature will they be divided?

Ans.

XGBoost uses a greedy approach: at each node it picks the split with the highest gain.

For each feature, XGBoost evaluates candidate split points and computes a gain from the gradient statistics of the loss (first- and second-order derivatives), rather than from entropy-based information gain.
The feature and threshold with the highest gain are chosen for the split.
This process is repeated recursively for each node in the tree.
Numerical features are split on thresholds; categorical features are usually encoded first or handled through XGBoost's categorical support.
Example: If a feature like 'age' has t...
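
A rough sketch of how a greedy split search could look is given below. It uses the regularized gain from gradient boosting with second-order statistics, gain = 0.5 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - G^2/(H + lambda)] - gamma, and an exhaustive scan over thresholds. The function names, the tiny dataset, and the squared-error gradients are assumptions made for illustration; the real library also offers approximate and histogram-based searches.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a split from summed gradients (g) and hessians (h) on each side."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def best_split(rows, grads, hess, lam=1.0, gamma=0.0):
    """Exact greedy search over every feature and every threshold between sorted values."""
    best = (0.0, None, None)  # (gain, feature index, threshold)
    g_total, h_total = sum(grads), sum(hess)
    for f in range(len(rows[0])):
        order = sorted(range(len(rows)), key=lambda i: rows[i][f])
        g_left = h_left = 0.0
        for pos in range(len(order) - 1):
            i, nxt = order[pos], order[pos + 1]
            g_left += grads[i]
            h_left += hess[i]
            if rows[i][f] == rows[nxt][f]:
                continue  # no threshold fits between equal values
            gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left, lam, gamma)
            if gain > best[0]:
                best = (gain, f, (rows[i][f] + rows[nxt][f]) / 2)
    return best


# Hypothetical data: column 0 is 'age', column 1 is an uninformative feature.
rows = [[22, 5], [25, 1], [47, 4], [52, 2]]
targets = [0, 0, 1, 1]
prediction = 0.5                           # current model output for every row
grads = [prediction - y for y in targets]  # squared-error gradient
hess = [1.0] * len(targets)                # squared-error hessian
gain, feature, threshold = best_split(rows, grads, hess)
```

On this toy data the search selects feature 0 ('age') with a threshold of 36.0, because that split separates the positive and negative gradients most cleanly.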

A Data Scientist was asked

Q. Explain precision and recall, and when each is used.

Ans.

Precision and recall are metrics used in evaluating the performance of classification models.

Precision measures the accuracy of positive predictions, while recall measures the ability of the model to find all positive instances.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Precision is important when false positives are costly, while recall is important when false negatives are costly.
For example, in a spam ema...
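
The two formulas above can be checked on a small set of labels. The function name and the sample labels below are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Here TP = 3, FP = 1, and FN = 1, so both precision and recall come out at 0.75.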

A Data Scientist was asked

Q. What is data imbalance?

Ans.

Data imbalance refers to an unequal distribution of classes in a dataset, where one class has significantly more samples than the others.

Data imbalance can lead to biased models that favor the majority class.
It can result in poor performance for minority classes, as the model may struggle to accurately predict them.
Techniques like oversampling, undersampling, and using different evaluation metrics can help address data i...
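
A small sketch of why accuracy alone misleads on imbalanced data, with one common reweighting heuristic. The 95/5 label split is invented, and the weight formula n_samples / (n_classes * class_count) is one widely used "balanced" heuristic, shown here as an assumption rather than as the only option.

```python
from collections import Counter

labels = [0] * 95 + [1] * 5            # hypothetical 95/5 class split
counts = Counter(labels)

# A model that always predicts the majority class.
always_majority = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_majority)) / len(labels)
minority_recall = sum(t == 1 and p == 1 for t, p in zip(labels, always_majority)) / counts[1]

# Inverse-frequency class weights: rarer classes get larger weights.
class_weights = {c: len(labels) / (len(counts) * n) for c, n in counts.items()}
```

The majority-only model reaches 95% accuracy while finding none of the minority samples, which is why recall or similar metrics are preferred in this setting.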

A Data Scientist was asked

Q. Explain the XGBoost algorithm.

Ans.

XGBoost is a powerful machine learning algorithm known for its speed and performance in handling large datasets.

XGBoost stands for eXtreme Gradient Boosting, which is an implementation of gradient boosting machines.
It is widely used in machine learning competitions and is known for its speed and performance.
XGBoost uses a technique called boosting, where multiple weak learners are combined to create a strong learn...
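
The boosting idea described above can be sketched with plain gradient boosting on squared error, where each weak learner is a one-split stump fitted to the current residuals. This is not XGBoost itself (which adds regularization, second-order gradients, and many engineering optimizations); the function names, data, and hyperparameters are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold on a single feature, minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def boost(xs, ys, rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to what the ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
baseline = sum(ys) / len(ys)
model = boost(xs, ys)
baseline_error = mse(lambda x: baseline, xs, ys)
boosted_error = mse(model, xs, ys)
```

Each stump on its own is a poor predictor, but the training error of the combined model falls well below that of the constant baseline.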

A Data Scientist was asked

Q. What is L1 and L2 Regularization?

Ans.

L1 and L2 regularization are techniques used in machine learning to prevent overfitting by adding penalty terms to the cost function.

L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the cost function.
L2 regularization adds the sum of the squared values of the coefficients as a penalty term to the cost function.
L1 regularization can lead to sparse models by forcing some coefficients to be exactl...
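
A minimal sketch of the two penalties and of why L1 tends to produce sparse coefficients. The shrink functions are the standard one-dimensional proximal updates for each penalty; the function names, weights, and step size are illustrative assumptions.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)


def l1_shrink(w, step):
    """Soft-thresholding: small weights are set exactly to zero."""
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0


def l2_shrink(w, step):
    """Minimiser of 0.5*(x - w)**2 + step*x**2: scales toward zero, never reaches it."""
    return w / (1 + 2 * step)


weights = [3.0, -0.2, 0.05, -1.5]
after_l1 = [l1_shrink(w, 0.25) for w in weights]
after_l2 = [l2_shrink(w, 0.25) for w in weights]
```

With a step of 0.25, the L1 update zeroes out the two small weights, while the L2 update only scales every weight down.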

A Data Scientist was asked

Q. What is data science?

Ans.

Data science is the field of extracting insights and knowledge from data using various techniques and tools.

Data science involves collecting, cleaning, and analyzing data to extract insights.
It uses various techniques such as machine learning, statistical modeling, and data visualization.
Data science is used in various fields such as finance, healthcare, and marketing.
Examples of data science applications include ...


A Data Scientist was asked

Q. What is SMOTE? Do you have any experience working on Time Series? Code analysis of global variable?

Ans.

SMOTE stands for Synthetic Minority Over-sampling Technique, used to balance imbalanced datasets by generating synthetic samples.

SMOTE is commonly used in machine learning to address class imbalance by creating synthetic samples of the minority class.
It works by generating new instances of the minority class by interpolating between existing instances.
SMOTE is particularly useful in scenarios where the minority cl...
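
The interpolation step can be sketched as follows. This is a simplified, SMOTE-like illustration rather than a reference implementation; the function name, the sample points, and the choice of k are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create synthetic minority samples by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest neighbours of the base point by squared Euclidean distance
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_like(minority, n_new=6)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new data stays inside the region the minority class already occupies.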

A Data Scientist was asked

Q. What is entropy, information gain?

Ans.

Entropy is a measure of randomness or uncertainty in a dataset, while information gain is the reduction in entropy after splitting a dataset based on a feature.

Entropy is used in decision tree algorithms to determine the best feature to split on.
Information gain measures the effectiveness of a feature in classifying the data.
Higher information gain indicates that a feature is more useful for splitting the data.
Ent...
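
Both quantities are easy to compute directly. In the sketch below, entropy is H = -sum(p_i * log2(p_i)) over the class proportions, and information gain is the parent entropy minus the size-weighted entropy of the child groups. The labels and splits are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted


labels = ["yes"] * 5 + ["no"] * 5
partial_split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
perfect_split = [["yes"] * 5, ["no"] * 5]

parent_entropy = entropy(labels)
gain_partial = information_gain(labels, partial_split)
gain_perfect = information_gain(labels, perfect_split)
```

The evenly mixed parent has an entropy of 1 bit; the perfect split recovers all of it, while the partial split recovers only about 0.28 bits.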

A Data Scientist was asked

Q. What is activation function? Explain Naive Bayes? Confusion matrix? Hyperparameters in DL? Hypothesis testing

Ans.

An activation function is a mathematical function used in neural networks to introduce non-linearity.

An activation function is applied to the weighted sum of inputs in a neural network node.
It determines the output of the node, that is, the activation of the neuron.
Common activation functions include sigmoid, tanh, ReLU, and softmax.
Activation functions introduce non-linearity, allowing neural networks to learn complex p...
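
The four activation functions named above can be written with the standard library alone. This is a plain-Python sketch for scalar inputs (and a list for softmax); the max-subtraction in softmax is a common numerical-stability choice, not something stated in the answer.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def softmax(xs):
    shift = max(xs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

Sigmoid and tanh squash their input into a bounded range, ReLU zeroes out negative inputs, and softmax turns a list of scores into values that sum to 1.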

Infosys Data Scientist Interview Experiences

20 interviews found

Data Scientist Interview Questions & Answers

Anonymous

posted on 5 Mar 2025

Interview experience

Bad

Difficulty level

Moderate

Process Duration

Less than 2 weeks

Result

No response

I appeared for an interview in Feb 2025.

Round 1 - Technical

(2 Questions)

Q1. Deployment of RAG

Ans.

RAG (Retrieval-Augmented Generation) deployment enhances AI models by integrating external data sources for improved responses.

Integrate RAG with existing NLP models to enhance context understanding.
Utilize APIs to fetch real-time data, improving response accuracy.
Example: Using RAG in customer support to pull relevant FAQs from a database.
Implement caching mechanisms to optimize retrieval speed.
Monitor and evaluate mo...
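
The retrieval and caching points above can be sketched with a toy FAQ retriever. Everything here is an assumption for illustration: the FAQ texts, the bag-of-words "embedding" (a real system would use a learned embedding model and a vector store), and the prompt wording. No language model is called; the sketch stops at building the prompt.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical knowledge base for a customer-support assistant.
FAQS = (
    "To reset your password, open Settings and choose Reset Password.",
    "Refunds are processed within five business days.",
    "Contact support by email for billing questions.",
)


def embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@lru_cache(maxsize=128)  # caching: repeated questions skip the ranking step
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(FAQS, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return tuple(ranked[:k])


def build_prompt(question):
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "How do I reset my password?"
prompt = build_prompt(question)
```

The prompt that would be sent to the model now contains the password FAQ as context, and a second call with the same question is served from the cache.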

Answered by AI


Q2. Building of RAG

Ans.

Building a RAG (Retrieval-Augmented Generation) system means connecting a document retriever to a language model so that answers are grounded in retrieved text.

Split the source documents into chunks and convert each chunk into an embedding.
Store the embeddings in a vector index so that similar chunks can be found quickly.
At query time, embed the question, retrieve the most similar chunks, and add them to the prompt.
The language model then generates the answer from the question plus the retrieved context.

Answered by AI


Data Scientist Interview Questions & Answers

Anonymous

posted on 30 May 2024

Interview experience

Good

Difficulty level

Moderate

Process Duration

2-4 weeks

Result

Not Selected

I applied via Job Portal and was interviewed in Apr 2024. There was 1 interview round.

Round 1 - Technical

(9 Questions)

Q1. Explain the XGBoost algorithm.

Ans.

XGBoost is a powerful machine learning algorithm known for its speed and performance in handling large datasets.

XGBoost stands for eXtreme Gradient Boosting, which is an implementation of gradient boosting machines.
It is widely used in machine learning competitions and is known for its speed and performance.
XGBoost uses a technique called boosting, where multiple weak learners are combined to create a strong learner.
It...

Answered by AI


Q2. An XGBoost model has 10-20 features. How are the splits decided, and on which feature is each split made?

Ans.

XGBoost uses a greedy approach: at each node it picks the split with the highest gain.

For each feature, XGBoost evaluates candidate split points and computes a gain from the gradient statistics of the loss (first- and second-order derivatives), rather than from entropy-based information gain.
The feature and threshold with the highest gain are chosen for the split.
This process is repeated recursively for each node in the tree.
Numerical features are split on thresholds; categorical features are usually encoded first or handled through XGBoost's categorical support.
Example: If a feature like 'age' has the hi...

Answered by AI


Q3. Do you have any experience on cloud platform?


Q4. What is entropy, information gain?

Ans.

Entropy is a measure of randomness or uncertainty in a dataset, while information gain is the reduction in entropy after splitting a dataset based on a feature.

Entropy is used in decision tree algorithms to determine the best feature to split on.
Information gain measures the effectiveness of a feature in classifying the data.
Higher information gain indicates that a feature is more useful for splitting the data.
Entropy ...

Answered by AI


Q5. What is hypothesis testing?


Q6. Explain precision and recall. In which scenarios is each used?

Ans.

Precision and recall are metrics used in evaluating the performance of classification models.

Precision measures the accuracy of positive predictions, while recall measures the ability of the model to find all positive instances.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Precision is important when false positives are costly, while recall is important when false negatives are costly.
For example, in a spam email de...

Answered by AI


Q7. What is data imbalance?

Ans.

Data imbalance refers to an unequal distribution of classes in a dataset, where one class has significantly more samples than the others.

Data imbalance can lead to biased models that favor the majority class.
It can result in poor performance for minority classes, as the model may struggle to accurately predict them.
Techniques like oversampling, undersampling, and using different evaluation metrics can help address data imbala...

Answered by AI


Q8. What is SMOTE? Do you have any experience working on Time Series? Code analysis of global variable?

Ans.

SMOTE stands for Synthetic Minority Over-sampling Technique, used to balance imbalanced datasets by generating synthetic samples.

SMOTE is commonly used in machine learning to address class imbalance by creating synthetic samples of the minority class.
It works by generating new instances of the minority class by interpolating between existing instances.
SMOTE is particularly useful in scenarios where the minority class i...

Answered by AI


Q9. Find the 5th highest salary in every department. What are window functions? What is the difference between UNION and UNION ALL? What is the difference between DELETE and TRUNCATE?

Ans.

Find the 5th highest salary in each department using SQL queries and understand key SQL concepts.

Use the DENSE_RANK() window function to rank salaries within each department, so that tied salaries share a rank (ROW_NUMBER() would count ties as separate positions).
Example SQL: SELECT DISTINCT department, salary FROM (SELECT department, salary, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank FROM employees) AS ranked WHERE salary_rank = 5;
Window functions allow calculations across a set of table rows...
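
The query can be tried end to end with Python's built-in sqlite3 module. This sketch assumes an SQLite build with window-function support (3.25 or later) and uses made-up rows; the `name` column and the alias `salary_rank` are illustrative. It ranks with DENSE_RANK() so that tied salaries share a rank, which matters when "5th highest" means the 5th distinct salary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")

# Hypothetical data: Engineering pays 100, 90, ..., 50 and Operations 80, 75, ..., 55.
rows = [(f"eng{i}", "Engineering", 100 - 10 * i) for i in range(6)]
rows += [(f"ops{i}", "Operations", 80 - 5 * i) for i in range(6)]
rows.append(("eng_tie", "Engineering", 100))  # a tie at the top salary
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

query = """
SELECT DISTINCT department, salary
FROM (
    SELECT department, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
    FROM employees
) AS ranked
WHERE salary_rank = 5
ORDER BY department
"""
fifth_highest = conn.execute(query).fetchall()
```

With the tie at 100 in Engineering, DENSE_RANK() still reports 60 as the 5th highest distinct salary, whereas a row-numbering approach would land on 70.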

Answered by AI


Interview Preparation Tips

Interview preparation tips for other job seekers - Prepare the basics well. Go through the top questions asked for SQL, Python, and Data Science.
Be well versed in your resume projects and the concepts used in them.


Data Scientist Interview Questions & Answers

Anonymous

posted on 14 Aug 2024

Interview experience

Excellent

Difficulty level

Moderate

Process Duration

Less than 2 weeks

Result

Not Selected

I applied via Referral and was interviewed in Jul 2024. There were 2 interview rounds.

Round 1 - Coding Test

Basic DataFrame operations using pandas, plus SQL basics.

Round 2 - Technical