Data Scientist

1000+ Data Scientist Interview Questions and Answers

Updated 7 Jul 2025

Asked in GeekBull Consulting

4d ago

Q. Why do optimizers matter? What is their purpose, and what do they do beyond the weight updates performed by vanilla gradient descent and backpropagation?

Ans.

Optimizers control how gradients are turned into weight updates, which affects how quickly and how reliably training converges.

Optimizers search for a set of weights that minimizes the loss function.
They use techniques like momentum, learning rate decay, and adaptive per-parameter learning rates to speed up the training process.
Momentum and adaptive methods can help training move past saddle points and shallow local minima, although no optimizer guarantees a global minimum or better generalization to unseen data.
Examples of optimizers...
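
As a rough illustration of what momentum adds on top of a plain gradient step, the sketch below compares the two on a hypothetical ill-conditioned quadratic loss. The loss function, learning rate, and momentum coefficient are assumptions chosen for the demonstration, not values from the answer above.

```python
def grad(x, y):
    # Gradient of the assumed loss f(x, y) = 0.5 * (x**2 + 100 * y**2).
    return x, 100.0 * y


def loss(x, y):
    return 0.5 * (x ** 2 + 100.0 * y ** 2)


def run_gd(steps=200, lr=0.01):
    # Vanilla gradient descent: step directly along the negative gradient.
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return loss(x, y)


def run_momentum(steps=200, lr=0.01, beta=0.9):
    # Momentum: accumulate a velocity from past gradients, then step along it.
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return loss(x, y)


final_gd = run_gd()
final_momentum = run_momentum()
print("vanilla GD loss:", final_gd)
print("momentum loss:  ", final_momentum)
```

With the same learning rate and step count, the momentum run ends at a lower loss here because the accumulated velocity speeds up progress along the flat direction of the bowl.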

Asked in Citicorp

1w ago

Q. How do you check for outliers in a variable, and what treatment would you use for them?

Ans.

Outliers can be detected using box plots, z-scores, or the IQR rule. Treatment can be removal or transformation.

Use box plots to visualize outliers
Calculate z-scores and flag data points whose absolute z-score is greater than 3
Calculate the IQR and flag data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
Transform data using log or square root to reduce the impact of outliers
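
The z-score and IQR rules above can be sketched with the standard library alone. The sample data and the function names are made up for illustration; the thresholds of 3 and 1.5 are the conventional ones mentioned in the answer.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    # Flag points whose absolute z-score exceeds the threshold.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    # Flag points outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one obvious outlier.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

Note that a single extreme value inflates the mean and standard deviation, so on small samples the z-score rule can miss outliers that the IQR rule catches.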

Asked in C5i

2w ago

Q. Why did you choose the Data Science field?

Ans.

I chose the Data Science field because of its potential to solve complex problems and make a positive impact on society.

Fascination with data and its potential to drive insights
Desire to solve complex problems and make a positive impact on society
Opportunity to work with cutting-edge technology and tools
Ability to work in a variety of industries and domains
Examples: Predictive maintenance in manufacturing, fraud detection in finance, personalized medicine in healthcare

Asked in Affine

2d ago

Q. How do you perform manipulations more quickly in pandas?

Ans.

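Use vectorized operations, avoid row-wise loops, and optimize memory usage.

Use vectorized column operations (arithmetic on whole Series, NumPy functions, and the .str and .dt accessors) instead of loops. Note that apply(), map(), and applymap() still call a Python function per element or row, so they are usually slower than true vectorized operations.
Avoid iterrows(); if you must iterate, itertuples() is faster, but both are much slower than vectorized operations.
Optimize memory usage by using appropriate data types and dropping unnecessary columns.
Do not rely on inplace=True for speed; it often still creates a copy internally and rarely improves performance.
Use the pd.eval() function to perform arithmetic operations on large DataFr...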
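
A minimal sketch of the difference between a row-wise loop and a vectorized column operation, plus two common memory optimizations. The DataFrame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Slow pattern: a Python-level loop over rows.
slow_total = [row.price * row.qty for row in df.itertuples()]

# Vectorized pattern: one operation on whole columns.
df["total"] = df["price"] * df["qty"]

# Memory: store repeated strings as a categorical and downcast integers.
df["city"] = df["city"].astype("category")
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")

print(df.dtypes)
```

Both approaches give the same totals; the vectorized version pushes the loop into compiled code, which matters once the frame has many rows.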

Asked in Turing

2w ago

Q. Given a table of numbers, how would you find all numbers that appear at least three times consecutively? Return the result table in any order.

Ans.

Find the numbers that appear at least three times in a row.

Use a window function such as LAG or LEAD to compare each row with its neighbors in row order
Keep only the numbers whose run of identical consecutive values has length three or more
Return the distinct numbers in any order
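
The same run-detection logic can be sketched in plain Python instead of SQL. This is an analogue of the window-function approach, not the SQL itself: itertools.groupby groups adjacent equal values, which plays the role of comparing each row with its neighbors. The input list stands in for the table column, already in row order.

```python
from itertools import groupby


def consecutive_numbers(values, min_run=3):
    # Collect every value that occurs in a run of at least min_run equal neighbors.
    result = set()
    for value, run in groupby(values):
        if sum(1 for _ in run) >= min_run:
            result.add(value)
    return result


# Hypothetical column values in row order.
rows = [1, 1, 1, 2, 1, 2, 2, 2, 2, 3]
print(consecutive_numbers(rows))
```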

Asked in Great Learning

2w ago

Q. How is object detection done using CNN?

Ans.

Object detection using CNN involves training a neural network to identify and locate objects within an image.

CNNs use convolutional layers to extract features from images
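A detection head then predicts a class label and bounding box coordinates for each object: two-stage detectors such as Faster R-CNN first propose candidate regions and then classify and refine them, while single-stage detectors such as YOLO and SSD predict classes and boxes directly from the feature maps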
Common architectures for object detection include YOLO, SSD, and Faster R-CNN

Asked in Sigmoid

6d ago

Q. How would you determine the number of WhatsApp users worldwide?

Ans.

Estimating the number of WhatsApp users worldwide requires a combination of data sources and statistical methods.

Collect data from WhatsApp's official reports and announcements
Use third-party analytics tools to estimate user numbers
Analyze demographic and geographic trends to extrapolate global user numbers
Consider factors such as population growth and smartphone adoption rates
Compare with similar messaging apps to validate estimates

Asked in ION Group

6d ago

Q. Is there any correlation between algorithms and law?

Ans.

Algorithms and law can be correlated through the use of algorithms in legal processes and decision-making.

Algorithms can be used in legal research to analyze large amounts of data and identify patterns or trends.
Predictive algorithms can be used in legal cases to assess the likelihood of success or failure.
Algorithmic tools can help in legal document review and contract analysis.
However, there are concerns about bias in algorithms used in law, as they can reflect and perpetua...

Asked in Celebal Technologies

1w ago

Q. What is the purpose of a confusion matrix in data science?

Ans.

A confusion matrix is a table that is used to describe the performance of a classification model.

It shows the number of true positives, true negatives, false positives, and false negatives.
It helps in evaluating the performance of a machine learning model by providing insights into the model's accuracy, precision, recall, and F1 score.
It is particularly useful in scenarios where class imbalance exists or when different misclassification costs are involved.
Example: In a binary...
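
To make the four cells and the derived metrics concrete, here is a small sketch for the binary case. The labels and predictions are invented, and the function names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true positives, false positives, true negatives, false negatives.
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
tp, fp, tn, fn = counts
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("TP, FP, TN, FN:", counts)
print("precision, recall, F1:", precision, recall, f1)
```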

Asked in Zee Entertainment Enterprises

2d ago

Q. Explain how the LSTM model works and its architecture, referring to your sentiment analysis project.

Ans.

LSTM is a type of recurrent neural network designed to learn long-term dependencies in sequential data.

LSTM stands for Long Short-Term Memory, which is a special kind of RNN.
It uses memory cells to store information over long periods, addressing the vanishing gradient problem.
An LSTM cell consists of three gates: input gate, forget gate, and output gate.
The input gate controls how much new information to add to the cell state.
The forget gate decides what information to discar...

Asked in GeekBull Consulting

3d ago

Q. How are LSTMs better than vanilla RNNs? What makes them better, and how do they achieve it?

Ans.

LSTMs are better than RNNs due to their ability to handle long-term dependencies.

LSTMs have a memory cell that can store information for long periods of time.
They have gates that control the flow of information into and out of the cell.
This allows them to selectively remember or forget information.
Vanilla RNNs suffer from the vanishing gradient problem, which limits their ability to handle long-term dependencies.
LSTMs can be used in applications such as speech recognition, la...

Asked in kipi.ai

2w ago

Q. What are the details of your research topics, including aspects such as scalability and the reasoning behind choosing specific models?

Ans.

My research topics focus on developing scalable machine learning models for predictive analytics in finance.

I have researched and implemented various machine learning algorithms such as random forests, gradient boosting, and neural networks.
I have explored techniques for feature engineering and model optimization to improve scalability and performance.
I have chosen specific models based on their ability to handle large datasets and complex relationships within financial data...

Asked in Affine

2w ago

Q. Explain PCA briefly. What can it be used for, and what can it not be used for?

Ans.

PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space.

PCA can be used for feature extraction, data visualization, and noise reduction.
PCA cannot be used for causal inference or to handle missing data.
PCA assumes linear relationships between variables and may not work well with non-linear data.
PCA can be applied to various fields such as finance, image processing, and genetics.

Asked in EXL Service

1w ago

Q. What is the BLEU score in Regression?

Ans.

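BLEU is not a regression metric.

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated text and reference text, and is mainly used to evaluate machine translation
It is possible that the interviewer meant a regression metric such as R-squared or mean squared error, or BLUE (Best Linear Unbiased Estimator), which describes the OLS estimator under the Gauss-Markov assumptions and is not a score
Without further context, it is difficult to provide a more specific answer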

Asked in Citicorp

4d ago

Q. How do you check for multicollinearity in Logistic Regression?

Ans.

Multicollinearity in logistic regression can be checked using correlation matrix and variance inflation factor (VIF).

Calculate the correlation matrix of the independent variables and check for high correlation coefficients.
Calculate the VIF for each independent variable and check for values greater than 5 or 10.
Consider removing one of the highly correlated variables or variables with high VIF to address multicollinearity.
Example: If variables A and B have a correlation coeff...
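
For intuition, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below covers only the special case of exactly two predictors, where R_j^2 equals the squared Pearson correlation between them. The data and function names are hypothetical; with more predictors a full regression per variable is needed.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_predictors(xs, ys):
    # Two-predictor special case: VIF = 1 / (1 - r^2).
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: b is almost exactly 2 * a, c is only weakly related to a.
a = [1, 2, 3, 4, 5, 6]
b = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
c = [1, -1, 1, -1, 1, -1]

print("VIF for a with b:", vif_two_predictors(a, b))
print("VIF for a with c:", vif_two_predictors(a, c))
```

The first pair lands far above the usual 5 or 10 cutoff, while the second stays close to 1.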

Asked in C5i

2d ago

Q. What is your understanding of Linear Regression?

Ans.

Linear Regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables.

It assumes a linear relationship between the dependent and independent variables.
The equation of a simple linear regression is Y = a + bX + e, where Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope, and e is the error term.
Multiple linear regression extends this to multiple independent variable...
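
For the simple case Y = a + bX + e, the least-squares estimates have a closed form: b is the covariance of X and Y divided by the variance of X, and a is mean(Y) - b * mean(X). A minimal sketch with made-up data:

```python
def fit_simple_linear_regression(xs, ys):
    # Ordinary least squares for one predictor: returns (intercept a, slope b).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Hypothetical data generated from Y = 1 + 2X with no noise.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_simple_linear_regression(xs, ys)
print("intercept:", a, "slope:", b)
```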

Asked in EXL Service

5d ago

Q. Given a list of numbers, create a dictionary where the key is a unique value and the value of this key is the number of occurrences within the given list. Please share your screen and write Python code for this...

Ans.

Create a dictionary from a list where keys are unique numbers and values are their counts.

Use Python's built-in collections module, specifically Counter, to simplify counting occurrences.
Example: For the list [1, 2, 2, 3], the output should be {1: 1, 2: 2, 3: 1}.
Alternatively, use a loop to iterate through the list and build the dictionary manually.
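
Both approaches mentioned above, sketched side by side. The function names are illustrative.

```python
from collections import Counter


def count_with_counter(numbers):
    # Counter does the counting; dict() converts it to a plain dictionary.
    return dict(Counter(numbers))


def count_manually(numbers):
    # Build the dictionary by hand with a single pass over the list.
    counts = {}
    for n in numbers:
        counts[n] = counts.get(n, 0) + 1
    return counts


numbers = [1, 2, 2, 3]
print(count_with_counter(numbers))
print(count_manually(numbers))
```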

Asked in Rolls-Royce

2d ago

Q. How do you perform time series classification?

Ans.

Time series classification involves using machine learning algorithms to classify time series data based on patterns and trends.

Preprocess the time series data by removing noise and outliers
Extract features from the time series data using techniques such as Fourier transforms or wavelet transforms
Train a machine learning algorithm such as a decision tree or neural network on the extracted features
Evaluate the performance of the algorithm using metrics such as accuracy or F1 s...

Asked in ADA Group

3d ago

Q. What are joins? If we inner join two tables, does the result contain rows with NULL or blank values in the join column?

Ans.

An inner join combines rows from two tables based on a related column between them.

An inner join returns only the rows where the join columns match in both tables
Rows with NULL in the join column are excluded, because NULL = NULL does not evaluate to true in SQL
Blank (empty string) values are ordinary values: they match other blank values, and non-matching values are excluded
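
A quick way to check how an inner join treats NULL and blank keys is to try it on an in-memory SQLite database. The table and column names below are hypothetical, and other database engines may treat empty strings differently (Oracle, for example, stores them as NULL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k TEXT);
    CREATE TABLE b (k TEXT);
    INSERT INTO a VALUES ('x'), (NULL), (''), ('only_a');
    INSERT INTO b VALUES ('x'), (NULL), (''), ('only_b');
""")

# Inner join on the key column; NULL keys never satisfy a.k = b.k.
rows = conn.execute(
    "SELECT a.k FROM a INNER JOIN b ON a.k = b.k ORDER BY a.k"
).fetchall()
keys = [row[0] for row in rows]
conn.close()

print(keys)
```

Only the matching 'x' rows and the matching empty-string rows come back; the NULL rows and the unmatched keys are dropped.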

Asked in Bajaj Finserv

1w ago

Q. With 2 dependent and 6 independent variables available, which machine learning algorithm should we use?

Ans.

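Assuming both dependent variables are continuous, this is a multi-output regression problem: use a regression algorithm like linear regression or decision tree regression, fitted either jointly on both targets or separately for each one.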

Consider using linear regression if the relationship between variables is linear.
Decision tree regression can handle non-linear relationships between variables.
Evaluate the performance of different algorithms using cross-validation.
Consider the interpretability of the model when choosing an algorithm.

Asked in Fractal Analytics

4d ago

Q. 1. Describe one of your projects in detail. 2. Explain Random Forest and other ML models 3. Statistics

Ans.

Developed a predictive model for customer churn using Random Forest algorithm.

Used Python and scikit-learn library for model development
Performed data cleaning, feature engineering, and exploratory data analysis
Tuned hyperparameters using GridSearchCV and evaluated model performance using cross-validation
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions
Other ML models include logistic regression, support vector mac...

Asked in Caterpillar Inc

1w ago

Q. What is Logistic Regression and what are the assumptions of linear regression?

Ans.

Logistic Regression is a statistical method used to model the probability of a binary outcome.

Logistic Regression is used when the dependent variable is binary (e.g., 0 or 1, Yes or No).
It estimates the probability that a given input belongs to a certain category.
Assumptions of linear regression include linearity, independence of errors, homoscedasticity, and normality of errors.

Asked in Concentrix Catalyst

2d ago

Q. What is the difference between GROUP BY and window functions in SQL?

Ans.

GROUP BY collapses each group of rows into a single output row, while a window function computes a value over a window of related rows and keeps every input row.

GROUP BY aggregates data based on one or more columns
A window function performs its calculation over the window defined by the OVER clause (PARTITION BY, ORDER BY)
GROUP BY is used with aggregate functions like sum, count, avg, etc.
Window functions include analytical functions like rank, lead, lag, etc., and aggregates such as sum or avg can also be used as window functions with OVER
Group by creates a new table with aggregated data while window func...

Asked in MasterCard

2w ago

Q. How do you test for trend breaks in time series data?

Ans.

Trend breaks are tested with structural break tests such as the Chow test or the CUSUM test; the Augmented Dickey-Fuller (ADF) test checks stationarity rather than breaks, but it is a useful preliminary step.

The Augmented Dickey-Fuller test can be used to check whether a time series is stationary (it tests for a unit root).
If the time series is not stationary, we can use differencing to make it stationary.
After differencing, we can again perform the Augmented Dickey-Fuller test to check for stationarity.
If there is a significant change in the mean or variance of the time series, we can use change point detecti...

Asked in Coforge

3d ago

Q. 1) How does a decision tree work? 2) What parameters are used in OpenCV?

Ans.

A decision tree is a tree-like model used for classification and regression. Commonly tuned OpenCV parameters belong to its image processing and feature detection functions.

A decision tree is a supervised learning algorithm that recursively splits the data into subsets, choosing at each node the attribute and threshold that most reduce impurity (for example Gini impurity or entropy).
It is used for both classification and regression tasks.
OpenCV parameters include those of image processing techniques like smoothing, thresholding, and morphological operations.
Feature detection parameters include...

Asked in C5i

1w ago

Q. Can we use a confusion matrix in Linear Regression?

Ans.

No, confusion matrix is not used in Linear Regression.

Confusion matrix is used to evaluate classification models.
Linear Regression is a regression model, not a classification model.
Evaluation metrics for Linear Regression include R-squared, Mean Squared Error, etc.

Asked in Affine

2d ago

Q. Do we minimize or maximize the loss in logistic regression?

Ans.

We minimize the loss in logistic regression.

The goal of logistic regression is to minimize the loss function.
The loss function measures the difference between predicted probabilities and actual labels.
The optimization algorithm tries to find the values of coefficients that minimize the loss function.
Minimizing the log loss is equivalent to maximizing the likelihood of the observed labels.
The standard loss for logistic regression is log loss, also called binary cross-entropy; the two names refer to the same function.
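
The log loss (binary cross-entropy) for labels y and predicted probabilities p is the mean of -(y*log(p) + (1-y)*log(1-p)). A minimal sketch, with the clipping constant eps as an assumption to avoid log(0):

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    # Mean binary cross-entropy; probabilities are clipped away from 0 and 1.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels with confident and less confident predictions.
labels = [1, 0]
confident = log_loss(labels, [0.9, 0.1])
hesitant = log_loss(labels, [0.6, 0.4])
print("confident and correct:", confident)
print("less confident:       ", hesitant)
```

Predictions that put more probability on the correct class give a lower loss, which is why training minimizes this quantity.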

Asked in AB InBev India

2w ago

Q. What approach did you use and why?

Ans.

I used a combination of supervised and unsupervised learning approaches to analyze the data.

I used supervised learning to train models for classification and regression tasks.
I used unsupervised learning to identify patterns and relationships in the data.
I also used feature engineering to extract relevant features from the data.
I chose this approach because it allowed me to gain insights from the data and make predictions based on it.

Asked in Green Rider Technology

1w ago

Q. Given an array of k linked lists, each of which is sorted in ascending order, how can you merge all the linked lists into a single sorted linked list and return the result?

Ans.

Merge k sorted linked lists into one sorted linked list efficiently.

Use a min-heap (priority queue) to keep track of the smallest elements from each list.
Initialize the heap with the head nodes of all k linked lists.
Extract the smallest node from the heap, add it to the result list, and push the next node from the same list into the heap.
Repeat until all nodes are processed.
Time complexity is O(N log k), where N is the total number of nodes.
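
The heap-based steps above can be sketched as follows. ListNode, build, and to_list are helper definitions assumed for the example; the list index in each heap entry breaks ties so that nodes themselves are never compared.

```python
import heapq


class ListNode:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next


def merge_k_lists(lists):
    # Seed the heap with the head of every non-empty list.
    heap = []
    for i, node in enumerate(lists):
        if node is not None:
            heapq.heappush(heap, (node.val, i, node))

    dummy = tail = ListNode(0)
    while heap:
        # Pop the smallest node, append it, then push its successor.
        _, i, node = heapq.heappop(heap)
        tail.next = node
        tail = node
        if node.next is not None:
            heapq.heappush(heap, (node.next.val, i, node.next))
    return dummy.next


def build(values):
    # Helper: build a linked list from a Python list.
    head = None
    for v in reversed(values):
        head = ListNode(v, head)
    return head


def to_list(node):
    # Helper: read a linked list back into a Python list.
    out = []
    while node is not None:
        out.append(node.val)
        node = node.next
    return out


merged = merge_k_lists([build([1, 4, 5]), build([1, 3, 4]), build([2, 6])])
print(to_list(merged))
```

The heap never holds more than k entries, which is where the log k factor in the O(N log k) bound comes from.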

Asked in GeekBull Consulting

1w ago

Q. What are p-values? Explain them in plain English without mentioning machine learning.

Ans.

A p-value is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis (for example, "there is no real effect") is true.

P-values range from 0 to 1, with a smaller value indicating stronger evidence against the null hypothesis.
By convention, a p-value of 0.05 or less is typically considered statistically significant.
P-values are commonly used in hypothesis testing to determine if a result is statistically significant or not.
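
A plain-English example: suppose a coin lands heads 15 times in 20 flips and we ask how surprising that would be if the coin were fair. The sketch below computes the exact one-sided p-value from the binomial distribution; the numbers 15 and 20 are made up for the illustration.

```python
import math


def one_sided_binomial_p_value(heads, flips, p_fair=0.5):
    # Probability of getting at least `heads` heads in `flips` flips of a fair coin.
    return sum(
        math.comb(flips, k) * p_fair ** k * (1 - p_fair) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p_value = one_sided_binomial_p_value(15, 20)
print("p-value:", p_value)
print("significant at 0.05:", p_value <= 0.05)
```

A fair coin would produce 15 or more heads in only about 2% of such experiments, so the result counts as statistically significant at the 0.05 level.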