Data Science Interview Questions and Answers


Over the past few years, data has come to be viewed as a valuable asset, which makes data generation and collection a critical part of any business. Data science gives organizations the ability to process large volumes of data.

Every day, enormous volumes of data are generated worldwide. As a result, data science has become an essential capability, and recruiters are actively looking for data scientists. To prepare for these roles, you should be familiar with the commonly asked Data Science interview questions and answers.


  • Data Science can be defined as the fusion of multiple areas, including scientific methods, predictive analysis, algorithms, statistics, system tools, and machine learning principles. It is an interdisciplinary field that focuses on extracting knowledge from huge data sets, or big data.

  • YouTube uses a recommendation algorithm, built with data science techniques, to track the history of previously watched videos and create suggestions based on it, which are displayed in the play next section. This reduces the effort of manually searching for a related video.

  • In today's world, information is collected from various data sources, resulting in massive heterogeneous data. Simple business intelligence tools cannot process this kind of data; therefore, data science provides advanced analytics tools which use high-level algorithms for processing.

    1. Helps to build a connection with each customer personally so that their needs can be understood better.

      Allows the organization to know their target audience.

      Facilitates effective use of resources and provides the best possible solution.

  • Data scientists are expert analysts who gather and analyze huge structured and unstructured datasets. Their primary work involves transforming the available raw data into useful form and presenting it in such a way that is easy to understand.

  • Some of the responsibilities are:

      Collecting data from various sources and also cleaning it.

      Data analysis and processing

      Understanding business requirement

      Training and deployment of the model

      Documentation, Visualization, and Presentation of final results


  • Medical images such as X-rays, MRIs, and CT scans can be interpreted more easily with the help of data science. A data scientist can predict a patient's future health based on their medical history. Diseases such as cancer, schizophrenia, and Alzheimer's can be detected at early stages with the help of pattern matching and spectrum analysis. Data science also provides a greater understanding of how genetic factors influence the reaction to certain drugs or diseases.


  • Supervised learning:

      Known and labeled input and output data is used.

      It has a feedback mechanism.

      Its goal is to predict the outcome for new data.

      Eg: logistic regression, decision trees.

    Unsupervised learning:

      The data used is unlabeled.

      No feedback mechanism is present.

      Its goal is to find insights in large volumes of data.

      Eg: Kmeans, apriori.


  • Linear Regression is a supervised learning algorithm that provides a mathematical relationship between two or more variables, one dependent and others independent.

    Y = mx + c

    Here Y is the dependent variable and x is an independent variable; m and c are constants.

    In data science, this relationship helps predict the outcome of events.
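
    As a minimal sketch of how the constants m and c can be estimated, the ordinary least squares formulas can be applied to a small data set. The function name fit_line and the sample values are illustrative assumptions, not part of the original answer.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data that lies exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = fit_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")
```

    With real, noisy data the points will not lie exactly on the line, and m and c are the values that minimize the sum of squared errors.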

  • Overfitting is a modeling error in which a model fits a particular set of data too closely (or in some cases exactly), so it fails to fit other data or predict future outcomes reliably. Overfitted models learn the noise and details of the training data and therefore fail to generalize to new data.

  • NLP stands for Natural Language Processing. It is one of the branches of artificial intelligence that converts human language into a language that a machine understands.

  • NLP is used in Google Translate, chatbots, and various virtual voice assistants like Siri and Alexa. It is also applied in sentence correction, text completion, and word suggestion.

    1. Its usefulness is only restricted to linear relationships.

      Does not provide a descriptive relationship between variables.

      Is highly sensitive to noise and outliers.

      Assumes data is independent of each other.


  • Regression:

      Used for predicting continuous values like age and salary.

      It predicts ordered data.

      Performance is measured with an error metric such as root mean square error.

      Eg: Linear regression, decision tree regressor.

    Classification:

      Used for predicting discrete values like True or False.

      It can predict unordered data.

      Performance is measured with accuracy.

      Eg: Logistic regression, random forest classifier.


  • Root mean square error is a general-purpose error metric for numerical predictions. It is a standard way to measure errors in a quantitative data predicting model. When we square root the mean of squares of all the errors we obtain the value of root mean squared error.

  • Logistic regression is a type of supervised learning algorithm. It is a statistical model that makes use of a logistic function to predict the binary value of a dependent variable like 0 and 1 or True and False. It is similar to linear regression except for the fact that linear regression is used for regression problems and logistic regression is used for solving classification problems.
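
    A small sketch of the logistic function and how its output is turned into a binary prediction. The weight, bias, and threshold values are hypothetical and chosen only for illustration; a real model would learn the coefficients from training data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weight, bias, threshold=0.5):
    """Return (probability, class label) for a single feature value."""
    p = sigmoid(weight * x + bias)
    return p, int(p >= threshold)


# Hypothetical coefficients, not learned from any data set
p, label = predict(x=2.0, weight=1.5, bias=-2.0)
print(f"probability = {p:.3f}, predicted class = {label}")
```

    The linear part (weight * x + bias) is the same as in linear regression; the logistic function is what converts it into a probability for classification.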

    1. Assumes that the dependent variable is binary.

    2. Assumes little or no correlation (multicollinearity) between the independent variables.

    3. Requires a sufficiently large data set to produce reliable estimates.

  • Regression analysis is a type of predictive modeling technique that establishes a relation between a dependent variable and one or more independent variables. When there is only one independent variable, it is called simple regression. In the case of more than one independent variable, it is called multiple regression. It can be further classified into linear regression and logistic regression.

  • According to the number of categories, Logistic regression has the following types:

      Binomial: The dependent variable can have only 2 types of value like 0 or 1.

      Multinomial: Three or more values of the target variable are possible. These values are not ordered. Eg: sun, moon, and stars.

      Ordinal: Three or more ordered values of target variables are allowed. For example, very low, low, medium, high, very high.

  • Data sampling is a statistical technique in which we take a sample out of the whole data set and analyze it to find patterns in the original large data set. Sampling can be done in two ways: probability sampling and non-probability sampling.

    1. It helps in quick and easy analysis of the data set.

      More efficient results.

      Cost-effective models.

  • In probability sampling, every element of the population has a known, non-zero probability of being selected. Properties such as bias and sampling error can therefore usually be estimated. It can be further divided into:

      Simple random sampling: Subjects are randomly selected from the whole population.

      Stratified sampling: Based on a common factor, the data is divided into subsets, and samples are collected randomly from each subset.

      Cluster sampling: The data set is divided into clusters based on a defining factor, then a random cluster sample is analyzed.

      Multistage sampling: This method involves subsetting the larger population into a number of clusters. The subsets are further divided based on a secondary factor, and the obtained clusters are sampled and analyzed. This division of clusters continues till multiple subsets are identified. It is a more complicated version of cluster sampling.

      Systematic sampling: An interval is set at which a sample is created and data is extracted from the larger data set. Eg: If we select every 15th row in data containing 150 items, a sample size of 10 rows would be created.
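
    Three of the methods above can be sketched in a few lines of plain Python. The helper names and the fixed seeds are assumptions made for the example so that the results are reproducible.

```python
import random


def simple_random_sample(data, n, seed=0):
    """Pick n items at random from the whole population."""
    rng = random.Random(seed)
    return rng.sample(data, n)


def systematic_sample(data, step):
    """Take every `step`-th row, starting from the first row."""
    return data[::step]


def stratified_sample(data, key, n_per_group, seed=0):
    """Draw up to n_per_group random items from each subset defined by key(item)."""
    rng = random.Random(seed)
    groups = {}
    for item in data:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample


rows = list(range(150))
print(len(systematic_sample(rows, 15)))                       # 10 rows
print(len(simple_random_sample(rows, 5)))                     # 5 rows
print(len(stratified_sample(rows, lambda r: r % 3, 2)))       # 2 rows from each of 3 groups
```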

  • In non-probability sampling, the analyst defines the factor based on which the data would be sampled and extracted. It can be difficult to estimate if the sample accurately represents the larger population.

    Some of the non-probability data sampling methods are:

      Convenience sampling: Easily available and accessible groups are used to collect data.

      Consecutive sampling: Every subject that meets the criteria is selected until the sample size limit is reached.

      Purposive or judgmental sampling: A predefined criterion is used to select the data from the sample.

      Quota sampling: Equal representation is given to all subgroups within the sample population by the researcher.

  • An underfit model is a statistical model that fails to capture the relationship between the input and output data, so it predicts poorly even on the training data. Underfitting simply means that the model does not fit the data well. This usually happens when the model is too simple, is not trained long enough, or the data has too few features.

    1. By increasing the complexity of the model.

    2. Increasing the number of attributes.

    3. Removing noise and outliers from the data.

    4. Increasing the duration of training.

  • Bias is the error caused by the simplifying assumptions a model makes. A high-bias algorithm fails to find the relevant relationships between the features and the output.

    Variance is the error caused by the model's sensitivity to fluctuations in the training set. A high-variance model changes considerably with even very small changes in the data. The higher the fluctuation, the higher the variance.

  • This statement is false. Overfit models have high variance and low bias.

  • In a type 1 error, a true null hypothesis is rejected. It is also called a false positive.

    A type 2 error occurs when a false null hypothesis is not rejected. It is known as a false negative.

  • In sampling, there are three kinds of biases:

      Selection bias

      Under coverage bias

      Survivorship bias

    1. Beeping of metal detectors without the presence of any metal.

      Convicting an innocent person.

    1. Python

      R

      Javascript

      SQL

      Scala

  • Recommender System predicts the ratings of an item or a product which the users are likely to give. It is a subset of information filtering techniques.

  • A/B testing is a type of controlled experiment in which subjects are randomly assigned to two versions of a variable. The goal of this testing method is to find out which version works better under controlled conditions.

    1. Pandas

      Matplotlib

      SciPy

      Seaborn

      NumPy

      SciKit

  • Some methods to reduce overfitting:

      Increasing the amount of training data.

      Reducing the complexity of the model.

      Ridge and Lasso Regularization.

      Reducing the training time (early stopping).

  • Data analysis involves using statistical methods to collect, clean, analyze, and manipulate data in order to discover valuable information which can be used for better decision making.

  • Data containing only one variable is known as univariate data, and its analysis is called univariate analysis. Eg: boxplot.

    Bivariate data contains two variables. Bivariate analysis determines the relationship between the two variables.

  • In unsupervised learning, a model trains itself without any labels or predefined classes in the data and without supervision from the user. Eg: Kmeans.

  • Clustering is a technique of grouping objects into different sets or clusters in such a way that the objects belonging to the same cluster are more similar to each other than to the objects in other clusters.

    1. Density-Based Clustering

    2. Distribution-Based Clustering

    3. Partition-Based Clustering

    4. Hierarchical Clustering

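  • # Print 1 to 50: "Apple" for multiples of 3, "Pine" for multiples of 5,
    # and "Pineapple" for multiples of both.
    for num in range(1, 51):
        # Test the combined case first; otherwise it is never reached.
        if num % 3 == 0 and num % 5 == 0:
            print("Pineapple")
        elif num % 3 == 0:
            print("Apple")
        elif num % 5 == 0:
            print("Pine")
        else:
            print(num)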

  • For large data sets, the rows with missing values can be removed and the rest of the data can be used to predict the values.

    For small data sets, the missing values can be replaced with the mean of the corresponding column. This can be done with Python's pandas library, for example df.fillna(df.mean()).
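
    The two strategies can be sketched in plain Python, without pandas, so that the logic is visible. The helper names and the sample values are hypothetical, and missing values are represented by None as an assumption of this example.

```python
from statistics import mean


def drop_missing(rows):
    """Keep only the rows that contain no missing (None) values."""
    return [row for row in rows if None not in row]


def fill_with_mean(values):
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]


# Hypothetical column with two missing entries
ages = [20, None, 30, 40, None]
filled = fill_with_mean(ages)
print(filled)

# Hypothetical rows; the second one is dropped because of the missing value
rows = [(1, 2), (None, 3), (4, 5)]
print(drop_missing(rows))
```

    With pandas, df.dropna() and df.fillna(df.mean()) cover the same two cases on numeric columns.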

  • Kmeans is a type of unsupervised learning algorithm. It categorizes data into K groups or clusters on the basis of similarity. The similarity between data points is calculated using Euclidean distance.

  • The Kmeans algorithm works as follows:

      First, the number of clusters, k, is decided.

      An initial mean value (centroid) for each cluster is selected at random.

      Each data point is assigned to the cluster whose mean value is closest to it.

      The mean value of each cluster is updated to the average of the data points in that cluster.

      The assignment and update steps are repeated until the assignments stop changing or the maximum number of iterations is reached, which gives the final clusters.
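
    The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name, the seed, the choice of initial means from the data points, and the sample points are all assumptions. It needs Python 3.8 or later for math.dist.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Initial means are picked at random from the data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest mean.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clusters are stable
        assignments = new_assignments
        # Update step: each mean becomes the average of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old mean if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

    Because the initial means are random, different seeds can produce different cluster numbering, and on less cleanly separated data they can also produce different clusters.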

  • For two data points A(x, y) and B(x1, y1), the Euclidean distance is calculated as:
    sqrt( (x-x1)**2 + (y-y1)**2 )
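
    The formula translates directly into Python. The function name and the sample points A(3, 4) and B(5, 2) are chosen for illustration.

```python
import math


def euclidean_distance(a, b):
    """Distance between two points given as (x, y) tuples."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


d = euclidean_distance((3, 4), (5, 2))
print(round(d, 2))  # 2.83
```

    From Python 3.8 onward, math.dist(a, b) from the standard library computes the same value for points of any dimension.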

  • Statistics helps in summarizing the data quickly and provides various tools for analyzing it. Statistical concepts help data scientists gain valuable insights from the data and perform quantitative analysis on it. Statistical methods such as classification, regression, hypothesis testing, and time-series analysis are of great assistance to data scientists while experimenting on the data.

  • Data wrangling is the process of cleaning the data and organizing it so that it can be used for analysis.

  • A data analyst works on existing data while a data scientist finds new methods of manipulating, capturing, and analyzing the data for the use of data analysts.

  • There are 4 types of analysis:

      Predictive analysis

      Prescriptive analysis

      Descriptive analysis

      Diagnostic analysis

  • ED = sqrt( (3-5)^2 + (4-2)^2 ) = sqrt(8) ≈ 2.83

    1. Linear regression

      Random Forest

      KNN

      Logistic regression

  • The decision tree algorithm is a type of supervised learning algorithm that can be used to solve classification and regression problems. It has a tree-like structure where the internal nodes represent the attributes of a dataset, the branches represent the decision, and outcomes are represented by leaf nodes. It is a graphical representation of problems and their solutions according to the given conditions.

    The CART algorithm is used for building the tree. It stands for Classification and Regression Tree algorithm.

  • Dimensionality reduction is the process of reducing the size of a data set by removing or combining some of its attributes in such a way that as much as possible of the information it conveys is retained.

    1. They are easy to understand because they mimic the way humans think while making a decision.

    2. The tree-like structure makes the model easy to interpret.

  • The process of eliminating unwanted tree nodes to obtain an optimal tree is known as pruning. It is done to reduce overfitting and so preserve the accuracy of the decision tree on new data.

    1. The entire dataset is taken as input.

    2. Find a test or split such that the separation of the classes is maximum.

    3. The split is applied to the input data. This is known as the divide step.

    4. Apply steps 1 and 2 again to the divided data.

    5. Stop when the stopping criteria are met.

    6. The tree is cleaned up (pruned) if there are too many splits.

  • In ensemble learning various sets of learners are combined together in order to improve the model's stability and power of prediction. There are two types of Ensemble learning methods: Bagging and boosting.

  • Bagging trains the same type of learner on multiple small samples drawn from the data and combines their predictions, which reduces variance and brings the predictions closer to the actual values.

    Boosting helps build stronger models by reducing bias. It iterates and adjusts the weight of an observation based on the previous classification.

  • The term RMSE stands for "root mean square error". It is a measure of accuracy in regression: it expresses the overall magnitude of the errors produced by a regression model. The RMSE can be calculated as follows:

      First, calculate the errors in the predictions made by the regression model, that is, the differences between the actual and predicted values.

      Second, square those errors.

      Third, calculate the mean of the squared errors.

      Finally, take the square root of the mean of the squared errors.
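
    The four steps map directly onto a short function. The sample values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long lists of numbers."""
    errors = [a - p for a, p in zip(actual, predicted)]  # step 1: errors
    squared = [e ** 2 for e in errors]                    # step 2: square them
    mean_squared = sum(squared) / len(squared)            # step 3: mean
    return math.sqrt(mean_squared)                        # step 4: square root


# Hypothetical actual and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
value = rmse(actual, predicted)
print(round(value, 4))
```

    Here the errors are 1, 0, -1 and -2, so the mean of the squared errors is 6 / 4 = 1.5 and the RMSE is the square root of 1.5.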

    1. Compared to other algorithms, it requires less data cleaning.

      Follows the same decision-making approach as a human.

      Extremely useful in decision-related problems.

  • Data that is distributed across categories in a highly unequal manner is called imbalanced data. It biases a model toward the categories with the most samples, which affects the performance of the model on the smaller categories.

  • Random forest is a type of ensemble learning method which uses a supervised learning approach. It builds multiple decision trees on various subsets of the data and combines their outputs, taking the mean for regression or the majority vote for classification, for improved predictive accuracy.

    1. Provides high accuracy irrespective of the size of the dataset.

      Less training time.

      Accuracy is maintained in case of missing data as well.

      Can help in classification as well as regression.

    1. High complexity due to the presence of multiple layers.

      Can produce an overfit model.

      Computational complexity increases with an increase in the number of class labels.

  • Random forest helps in identifying the risk of a loan in the banking sector. In the healthcare sector, it helps in finding patterns of diseases and the risk they can cause.

  • Cross-validation helps in estimating the accuracy of a model. It is a statistical method in which a part of a data set, called validation data, is removed while training the model and later on used for testing the model. If positive results are received after testing, the model is approved.

  • LASSO is a regression analysis method that stands for Least Absolute Shrinkage and Selection Operator. It enhances the accuracy of a model by performing variable selection as well as regularization.

    1. Using fewer variables in the dataset.

      Making use of techniques like cross-validation.

      Using regularization techniques such as LASSO.

  • A hypothesis is a theory that describes the nature of a population. Hypothesis testing compares two mutually exclusive statements about a population and concludes the statement which best describes the sample data.

  • The p-value is a numerical value ranging from 0 to 1 which helps in determining the strength of your outcome in a hypothesis test. It is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.

  • When the p-value is <= 0.05 (the commonly used significance level), the result is considered statistically significant and the null hypothesis is rejected. A p-value greater than 0.05 means there is not enough evidence against the null hypothesis, so it is not rejected.

  • Machine learning is the ability of a machine to learn from data and automatically predict the outcomes of an event without being explicitly programmed by a developer.

  • Artificial neural networks are computational networks inspired by biological neural networks. They are designed to replicate the working of a human brain i.e., how the human brain processes and analyzes information. It adapts to the input to provide the best possible output.

    1. K- Fold Cross-Validation

      Leave p-out Cross-Validation

      Leave-one-out cross-validation.

      Holdout method
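
    K-fold cross-validation from the list above can be sketched as a function that yields the training and validation indices for each fold. The function name is hypothetical, and the data is split in its original order without shuffling, which is an assumption of this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

    Setting k equal to the number of samples gives leave-one-out cross-validation, since each fold then holds out exactly one sample.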

  • Deep learning belongs to the family of machine learning algorithms. It is based on artificial neural networks that contain 3 or more layers. Deep learning models take in data and learn from it automatically.

  • Python's pandas library contains high-level data analysis tools and data structures that are more suitable for text analysis.

  • Collaborative filtering is a technique used by recommender systems. Its algorithm automatically predicts (filters) the preferences of a user from the preferences of many other users and makes recommendations according to the user's interests.

  • The most popular e-commerce website, Amazon, makes use of collaborative filtering. If a buyer purchases items A and B it would recommend item C to the buyer based on previous buying histories.

    1. Amazon

      Youtube

      Netflix

      Spotify

      LinkedIn

    1. Memory Based: Recommendations are made based on the likeness of an item through user rating information.

      Model-Based: Data mining helps in creating models which find trends based on training data. Then predictions for actual data are made using these models.

      Hybrid: It is a combined approach of memory and model-based collaborative filtering.

  • Linear regression would be the best algorithm since it builds a relationship between events having multiple independent variables.

  • Regularization is the addition of a penalty term to a model's loss function to discourage overly complex models. It is used to address overfitting so that the model generalizes better to new data.

  • It is important to clean your data before using it because it increases the productivity of the model as it removes unwanted and duplicate values from the data. It eliminates the possibility of errors and inconsistencies in the model.

    1. Pytorch

      Tensorflow

      Keras

      Sonnet

      Chainer

  • Precision is a numerical value ranging from 0 to 1. It is the fraction of the results the algorithm classifies as positive that are actually relevant: true positives / (true positives + false positives).
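
    As a small illustration, precision can be computed from two lists of labels. The labels below are made up, and returning 0.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; convention chosen for this sketch
    return tp / (tp + fp)


# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = precision(y_true, y_pred)
print(score)  # 0.5
```

    Four items are predicted positive here and two of them are truly positive, so the precision is 2 / 4.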

  • The Law of large numbers states that with the increase in the number of trials, the mean or the average result comes in close range to the expected value.
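
    A quick simulation illustrates the idea with a fair six-sided die, whose expected value is 3.5. The function name and the fixed seed are assumptions made so the run is reproducible.

```python
import random


def mean_of_die_rolls(n_rolls, seed=42):
    """Average result of n_rolls simulated throws of a fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls


# The average drifts toward the expected value 3.5 as the number of rolls grows.
for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```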

  • The normal distribution is a bell-shaped curve that shows the distribution of continuous variables. It is a kind of probability distribution that shows the position of variables with respect to the mean of the data.

    1. If there is a change in the data source.

      For the evolution of data model through infrastructure.

      If the algorithm is not stationary.

  • The t-test helps in determining the similarity or differences between the means of two groups. It is often used in hypothesis testing to test the differences between the two populations.

  • The DBSCAN algorithm is a density-based clustering technique that uses an unsupervised learning approach. It divides the dataset into clusters based on the distance between data points and the number of points found in each neighborhood. It uses 2 primary parameters for clustering:

    Epsilon - the maximum distance between 2 data points for them to be considered neighbors.

    Min - the minimum number of data points that must lie within epsilon of a point for it to form a dense region.

    1. Self-driving cars

      Virtual assistants

      Chatbots

      Computer vision

      Image processing

    1. Core Point: It has at least min points within epsilon.

    2. Border Point: It lies in the neighborhood of a core point but has fewer than min points within epsilon.

    3. Noise or outlier: It is neither a core point nor a border point.
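
    The three point types can be illustrated with a short sketch. It assumes the common convention that a point counts as its own neighbor and that a core point has at least min_pts points within eps; the function name and the sample points are hypothetical. It only labels the points and does not build the clusters themselves.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Neighborhood of a point: all points within eps, including the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a small dense square, one point on its edge, one far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

    In this data set the four corner points of the square are core points, (2, 1) is a border point because it is within eps of the core point (1, 1), and (10, 10) is noise.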

    1. No need to set the number of clusters.

      Clusters can be arbitrarily shaped.

      Remains unaffected by outliers.

      Only 2 parameters are required.

      Is not sensitive to data ordering.

    1. Data points reachable from more than one cluster can be placed in any cluster.

      Data with large differences in density cannot be clustered properly.

      Choosing epsilon can be difficult if data is not well understood.

    1. dbscan

      fpc

      factoextra

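  • # Install and load the required packages
    install.packages(c("factoextra", "fpc", "dbscan"))
    library(factoextra)
    library(fpc)
    library(dbscan)

    # Load the sample data and keep the two coordinate columns
    data("multishapes", package = "factoextra")
    df <- multishapes[, 1:2]

    # Run DBSCAN and plot the resulting clusters
    db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
    plot(db, df, main = "DBSCAN", frame = FALSE)
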
  • The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. It assumes that the features are independent of each other. It can be combined with kernel functions to increase its accuracy.

  • The validation set is used for parameter selection so that overfitting can be avoided whereas the test set is used for testing the trained model performance.

  • It is the power of a binary hypothesis test: the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true.


  • Recurrent Neural Network (RNN):

      Used for sequential data.

      Variable dimension data can be used.

      Uses an internal memory mechanism.

      Used in time series and text classification.

    Convolutional Neural Network (CNN):

      Used in images and distributed data.

      Fixed-size input and output required.

      Type of a feed-forward neural network.

      Used in image processing.



  • Naive Bayes assumes that the variables are not correlated with one another, which is rarely true in practice; this is a major drawback.

    Decorrelating the features can resolve this issue so that the assumption becomes valid.

Conclusion

This article covers the commonly asked questions in Data Science interviews. We hope these questions and answers will help you through your interview process.

