## Top Machine Learning Questions Asked in Startup Interviews

**Explain prior probability, likelihood, and marginal likelihood in the context of the naive Bayes algorithm.**

**Answer:** The prior probability is the proportion of each class of the dependent variable in the data set. It is the best guess you can make about a class without any further information. For example, suppose the dependent variable is binary (1 and 0), where the proportion of 1 (spam) is 70% and of 0 (not spam) is 30%. Without looking at its contents, we would estimate a 70% chance that any new email is spam. The likelihood is the probability of observing a given feature value given the class. For example, the probability that the word ‘FREE’ appears in a message known to be spam is a likelihood. The marginal likelihood is the probability that the word ‘FREE’ appears in any message, regardless of class.
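To make the three quantities concrete, here is a minimal sketch that counts them from a toy data set. The ten emails and the variable names are hypothetical, chosen only so that the spam proportion matches the 70% in the example; the last line combines the three terms with Bayes' theorem.

```python
# Hypothetical toy data: (label, contains_free) for ten emails.
# label 1 = spam, label 0 = not spam.
emails = [
    (1, True), (1, True), (1, True), (1, True), (1, False),
    (1, False), (1, True), (0, False), (0, False), (0, True),
]

n = len(emails)
spam_flags = [free for label, free in emails if label == 1]

# Prior: proportion of spam in the data set, P(spam).
prior_spam = len(spam_flags) / n

# Likelihood: how often 'FREE' appears among spam messages, P(FREE | spam).
likelihood_free = sum(spam_flags) / len(spam_flags)

# Marginal likelihood: how often 'FREE' appears in any message, P(FREE).
marginal_free = sum(free for _, free in emails) / n

# Bayes' theorem: P(spam | FREE) = P(spam) * P(FREE | spam) / P(FREE).
posterior_spam = prior_spam * likelihood_free / marginal_free

print(prior_spam, likelihood_free, marginal_free, posterior_spam)
```

In this toy data the posterior equals the fraction of ‘FREE’ messages that are spam, which is a quick sanity check on the three estimates.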

**You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?**

**Answer:** Low bias means the model’s predicted values are close to the actual values on the training data. In other words, the model is flexible enough to mimic the training data distribution. While that sounds like a great achievement, a model this flexible with high variance tends to overfit and generalize poorly: when it is tested on unseen data, it gives disappointing results.

In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance. Bagging draws multiple subsets from the data set by repeated random sampling with replacement (bootstrap samples). Each sample is then used to train a model with the same learning algorithm. Finally, the model predictions are combined using voting (classification) or averaging (regression), which reduces the variance of the combined prediction.
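The bagging procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a random forest: the base learner here is a hypothetical one-dimensional threshold classifier (`fit_stump`), the toy data is made up, and a real random forest would use full decision trees with random feature subsets.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (feature value, class label).
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.0), bagging_predict(models, 5.0))
```

For regression, the voting step would be replaced by an average of the individual model outputs.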

**How is kNN different from k-means clustering?**

**Answer:** Don’t be misled by the ‘k’ in their names. The fundamental difference is that k-means is unsupervised and kNN is supervised: k-means is a clustering algorithm, while kNN is a classification (or regression) algorithm.

The k-means algorithm partitions a data set into k clusters such that each cluster is homogeneous, with its points close to each other, while keeping the clusters well separated from one another. Because the algorithm is unsupervised, the clusters have no labels.
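As a rough sketch of how this partitioning works, the following runs the usual assign-then-update loop (Lloyd's algorithm) on one-dimensional points. The data, the starting centroids, and the function name `kmeans_1d` are all hypothetical; note that no labels appear anywhere in the input.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 6.0])
print(centroids, clusters)
```

The result is only a grouping: the algorithm reports which points belong together, and it is up to the analyst to interpret what each cluster means.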

The kNN algorithm classifies an unlabeled observation based on the labels of its k nearest neighbors, where k is a positive integer chosen by the user. It is also known as a lazy learner because it involves minimal training: it simply stores the training data and builds no generalized model ahead of time, deferring all computation to prediction time.
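A minimal kNN sketch shows both the supervised and the lazy aspects: the training data carries labels, and nothing is computed until a query arrives. The points, labels, and the helper name `knn_predict` are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # There is no training step: the labeled points are simply sorted by
    # Euclidean distance to the query at prediction time.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical labeled training data: (point, label).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]

print(knn_predict(train, (1.0, 0.9)), knn_predict(train, (5.0, 5.1)))
```

For regression, the vote would be replaced by the mean of the neighbors' target values.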

**What is ‘Naive’ in Naive Bayes?**

The Naive Bayes method is a supervised learning algorithm. It is called naive because, in applying Bayes’ theorem, it assumes that all attributes are conditionally independent of each other given the class.

Bayes’ theorem states the following relationship, given class variable $y$ and feature vector $x_1$ through $x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naive conditional independence assumption that, for all $i$,

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$

this relationship is simplified to:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

Since $P(x_1, \dots, x_n)$ is constant given the input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

We can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.

The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of $P(x_i \mid y)$, which can be Bernoulli, binomial, Gaussian, and so on.
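The classification rule can be sketched directly: estimate the class priors and the per-feature conditional probabilities by relative frequency, then pick the class with the largest product. This sketch assumes binary (Bernoulli-style) features; the data set, the feature meanings, and the function names are hypothetical, and no smoothing is applied, so a probability estimated as 0 or 1 would zero out a class score.

```python
from collections import defaultdict

def fit_naive_bayes(rows, labels):
    """Estimate P(y) and P(x_i = 1 | y) by relative frequency."""
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    priors = {y: len(members) / len(rows) for y, members in by_class.items()}
    # For each class, the fraction of its rows in which feature i equals 1.
    cond = {
        y: [sum(col) / len(members) for col in zip(*members)]
        for y, members in by_class.items()
    }
    return priors, cond

def predict(priors, cond, x):
    """Return the class maximizing P(y) * prod_i P(x_i | y)."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for x_i, p in zip(x, cond[y]):
            score *= p if x_i else (1 - p)
        scores[y] = score
    return max(scores, key=scores.get)

# Hypothetical features: (contains 'free', contains 'meeting').
rows = [(1, 0), (1, 0), (1, 1), (0, 0), (0, 1), (0, 1), (1, 1), (0, 0)]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

priors, cond = fit_naive_bayes(rows, labels)
print(predict(priors, cond, (1, 0)), predict(priors, cond, (0, 1)))
```

The denominator of Bayes' theorem never appears in `predict`, because it is the same for every class and does not change which class wins.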

**Explain SVM Algorithm in Detail**

A Support Vector Machine (SVM) is a very powerful and versatile supervised machine learning model, capable of performing linear or non-linear classification, regression, and even outlier detection.

Suppose we are given some data points that each belong to one of two classes, and the goal is to learn from this set of examples how to separate the two classes.

In SVM, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a linear classifier.

There are many hyperplanes that might classify the data. A reasonable choice for the best one is the hyperplane that represents the largest separation, or margin, between the two classes.

If such a hyperplane exists, it is known as a maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier. The best hyperplane that divides the data is denoted H3.

We have data $(x_1, y_1), \dots, (x_n, y_n)$, where each $x_i$ has $p$ features $(x_{i1}, \dots, x_{ip})$ and $y_i$ is either 1 or -1.

The hyperplane H3 is the set of points $x$ satisfying:

$$w \cdot x - b = 0$$

where $w$ is the normal vector of the hyperplane. The parameter $\frac{b}{\|w\|}$ determines the offset of the hyperplane from the origin along the normal vector $w$.

So, for each $i$, $x_i$ lies on the side of the margin that matches its label $y_i$, either 1 or -1. That is, $x_i$ satisfies:

$$w \cdot x_i - b \ge 1 \ \text{ if } y_i = 1, \quad \text{or} \quad w \cdot x_i - b \le -1 \ \text{ if } y_i = -1$$
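The margin conditions can be checked numerically. The sketch below assumes the constraints read w · x_i - b ≥ 1 for y_i = 1 and w · x_i - b ≤ -1 for y_i = -1, which combine into y_i (w · x_i - b) ≥ 1. The points, labels, and the candidate values of w and b are hypothetical; the code only verifies a given hyperplane and does not train an SVM.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def satisfies_margin(w, b, points, labels):
    """True when every point satisfies y_i * (w . x_i - b) >= 1."""
    return all(y * (dot(w, x) - b) >= 1 for x, y in zip(points, labels))

def offset_from_origin(w, b):
    """Offset b / ||w|| of the hyperplane along the normal vector w."""
    return b / math.sqrt(dot(w, w))

# Hypothetical 2-D data with labels in {1, -1}.
points = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0), (0.0, 1.0)]
labels = [1, 1, -1, -1]

# Candidate hyperplane w . x - b = 0 (values chosen by hand, not learned).
w, b = (0.5, 0.5), 2.0

print(satisfies_margin(w, b, points, labels))
print(offset_from_origin(w, b))
```

Scaling w and b by the same factor describes the same hyperplane but changes whether the constraints hold, which is why the constraints pin down the scale of w.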