Conditional Probability and Bayes' Theorem

Understanding the probability of an event occurring is an important foundation for Machine Learning and Data Science.

Taking a simple example, say we need to figure out the probability that when two dice are rolled, their sum is greater than or equal to 10, i.e. the sum can be 10, 11, or 12. This can occur in the following ways:

{{4,6},{5,5},{5,6},{6,4},{6,5},{6,6}} =6

So we have 6 outcomes that fulfill this condition.
The total number of outcomes is

{{1,1},{1,2},…,{6,5},{6,6}} = 36

So the probability that the sum is greater than or equal to 10 is

desirable outcomes / total outcomes = 6/36 = 1/6
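To make this concrete, here is a minimal Python sketch that enumerates all 36 outcomes and counts the desirable ones:

```python
# Enumerate all 36 outcomes of rolling two dice and count those
# whose sum is greater than or equal to 10.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all (die1, die2) pairs
desirable = [o for o in outcomes if sum(o) >= 10]

print(desirable)                       # [(4, 6), (5, 5), (5, 6), (6, 4), (6, 5), (6, 6)]
print(len(desirable), len(outcomes))   # 6 36
print(len(desirable) / len(outcomes))  # 0.1666... i.e. 1/6
```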

After simple probability comes conditional probability. There may be a precondition for an event, which changes the probability of the event occurring. For example, the probability of "it will rain" increases if we already know "it is cloudy".

Going back to the previous example, let's say we add a condition: "figure out the probability that when two dice are rolled, the sum is greater than or equal to 10, given that the first die already shows a 5".

P(A|B) = P(A and B)/P(B)
A = the event whose probability we need, in this case the sum being greater than or equal to 10
B = the event that the first die shows 5
P(A and B) = the probability of both events occurring together
P(A|B) = the probability of A given that B has already occurred

We already know the total number of outcomes for two dice being rolled is 36.
For the probability of event B, i.e. the first die shows 5, we have {{5,1},{5,2},{5,3},{5,4},{5,5},{5,6}},
i.e. 6/36 or 1/6.
For the probability of A and B, i.e. the first die shows 5 and the sum is greater than or equal to 10, we have {{5,5},{5,6}},
i.e. 2/36 or 1/18.

Putting these values into the formula:

P(A|B) = P(A and B)/P(B)
P(A|B) = (1/18)/(1/6) = 6/18 = 1/3
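The same enumeration idea verifies this result; a small sketch:

```python
# Restrict the sample space to outcomes where the first die shows 5,
# then count how many of those have a sum >= 10.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
b = [o for o in outcomes if o[0] == 5]        # event B: first die shows 5
a_and_b = [o for o in b if sum(o) >= 10]      # event A and B together

p_b = len(b) / len(outcomes)                  # 6/36 = 1/6
p_a_and_b = len(a_and_b) / len(outcomes)      # 2/36 = 1/18
print(p_a_and_b / p_b)                        # 0.333... i.e. 1/3
```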

Finally, coming to Bayes' Theorem: it gives the relation between a conditional probability and its reverse.

P(A|B) = (P(B|A) * P(A)) / P(B)
or
P(A|B) * P(B) = P(B|A) * P(A)

This holds because both sides equal P(A and B).

Let's work through a simple example.

You have 2 bags, say X and Y. X has 6 Green and 6 Red balls. Y has 4 Red and 8 Green balls. One of the bags is chosen at random and a ball is picked at random. If the ball is Red, what is the probability that it was taken from bag X?

We need to find P(Bag X | Ball is Red).
As per Bayes' Theorem we have:
P(Bag X | Ball is Red) = (P(Ball is Red | Bag X) * P(Bag X)) / P(Ball is Red)

As the bag is chosen randomly, we know
P(Bag X) and P(Bag Y) are both 1/2.
P(Ball is Red | Bag X): Bag X has 12 balls, of which 6 are Red, so 6/12 = 1/2.
P(Ball is Red | Bag Y): Bag Y has 12 balls, of which 4 are Red, so 4/12 = 1/3.
P(Ball is Red) = a Red ball is chosen from bag X OR a Red ball is chosen from bag Y
= P(Ball is Red | Bag X) * P(Bag X) + P(Ball is Red | Bag Y) * P(Bag Y)
= 6/12 * 1/2 + 4/12 * 1/2
= 10/24 = 5/12

P(Bag X | Ball is Red) = (P(Ball is Red | Bag X) * P(Bag X)) / P(Ball is Red)
= (1/2 * 1/2) / (5/12) = (1/4) / (5/12) = 12/20 = 3/5
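As a quick check, here is a small sketch that reproduces the calculation with exact fractions:

```python
# Verify the bag example using Bayes' Theorem with exact fractions.
from fractions import Fraction

p_x = Fraction(1, 2)             # P(Bag X): a bag is chosen at random
p_y = Fraction(1, 2)             # P(Bag Y)
p_red_given_x = Fraction(6, 12)  # 6 of 12 balls in X are Red
p_red_given_y = Fraction(4, 12)  # 4 of 12 balls in Y are Red

# Total probability of drawing a Red ball
p_red = p_red_given_x * p_x + p_red_given_y * p_y
print(p_red)                        # 5/12

# Bayes: P(Bag X | Red) = P(Red | Bag X) * P(Bag X) / P(Red)
print(p_red_given_x * p_x / p_red)  # 3/5
```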

A few basic statistics for Machine Learning

Machine learning itself is a big area, but to get started one needs to be familiar with a few basic concepts of statistics and probability.

Here are a few basic concepts, given a set of numbers, say: 12, 5, 7, 17, 22, 5, 10, 2, 5, 17, 2, 11

Mean- Given a set of N numbers, the mean is their average, i.e. the sum divided by N. Here the sum is 115, so 115/12 = 9.58.

Median- Sort the list. For an even number of items, the median is the average of the middle two numbers; for an odd number of items, it is the middle number.
2, 2, 5, 5, 5, 7, 10, 11, 12, 17, 17, 22
(7 + 10)/2 = 17/2 = 8.5

Mode- The number that appears most often in the list. In this case it is 5.

Range- The maximum number minus the minimum number in the list: 22 - 2 = 20.

Variance- A measure of how a set of numbers varies from its mean. It is calculated by taking each number's difference from the mean, squaring it, and then averaging these squared differences.

Taking a simple example, say we have 3 numbers: 1, 2, 3. The mean is 6/3 = 2, so the variance is
((1-2)^2 + (2-2)^2 + (3-2)^2) / 3 = 2/3 = 0.67

Standard Deviation- The variance is computed from squared differences, so its unit is the square of the original unit. The standard deviation takes the square root of the variance to restore the unit of the original list. So where the variance was 0.67, the standard deviation is sqrt(0.67), or roughly 0.82.
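All of these can be computed with Python's standard statistics module; a quick sketch using the numbers above:

```python
# Basic statistics on the example list, using the standard library.
import statistics

data = [12, 5, 7, 17, 22, 5, 10, 2, 5, 17, 2, 11]

print(statistics.mean(data))    # 9.5833... (115/12)
print(statistics.median(data))  # 8.5
print(statistics.mode(data))    # 5
print(max(data) - min(data))    # 20 (range)

# Population variance and standard deviation for the 3-number example
small = [1, 2, 3]
print(statistics.pvariance(small))  # 0.666...
print(statistics.pstdev(small))     # 0.816...
```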

K-means clustering

Clustering, as the name suggests, is simply grouping elements. Say you are given points on a 2-D graph and need to figure out some kind of relationship or pattern among them. For example, you might have crime locations in a city, or an Age vs. Mobile Price graph capturing the purchasing habits of an online store's customers.

In Machine Learning, clustering is an important class of algorithms, as it helps find patterns in an unlabeled set of data and supports decisions based on the outcome. For example, say an online store needs to send a targeted marketing campaign for a newly launched mobile phone. A clustering algorithm can help figure out buying patterns and single out the customers most likely to buy the new phone.

There are many clustering algorithms, K-means being one of the simplest and most widely used. It works as follows:

1. Place K points randomly on the graph. These serve as the initial center points, or centroids, of the K clusters.
2. Assign each item (point) on the graph to one of the K clusters, based on its closeness (shortest distance) to that cluster's centroid.
3. When all items have been assigned a cluster, recalculate the centroid of each cluster (the point whose average distance to all items in the cluster is minimal).
4. Repeat steps 2 and 3 until there is no scope for improvement (no movement is seen in the centroids), as in the sketch below.
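Here is a minimal sketch of these steps in plain Python; the kmeans function, its parameters, and the stopping test are illustrative, not taken from any particular library:

```python
import random

def kmeans(points, k, iterations=100):
    # Step 1: pick K random points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Step 2: assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Example usage on a few made-up 2-D points:
points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```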

The above algorithm divides all the items into K clusters. A further challenge is to come up with an optimal value of K, i.e. how many clusters correctly or adequately represent the data. There are many ways to figure that out; one effective way is the elbow method.

In the elbow method, you plot the number of clusters on the X axis and the sum of squared errors (SSE) on the Y axis. For the SSE, take each point in a cluster, find its distance from the centroid, and square it (to eliminate negative distances); this is the error for that point, since ideally you want every point close to the centroid of its cluster. Sum up all these errors and plot the total. The resulting graph starts to look like an arm. We look for the elbow, that is, where the curve starts flattening out along the X axis; beyond that point, any additional cluster adds little to the information we already have.
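Using the kmeans sketch above, a rough version of the elbow method might look like this (the sample points are made up for illustration, and the SSE can vary between runs because of the random initialization):

```python
# Compute the sum of squared errors (SSE) for several values of K
# and look for the point where the curve starts to flatten.
def sse(centroids, clusters):
    total = 0.0
    for c, cluster in zip(centroids, clusters):
        for p in cluster:
            total += (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return total

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
for k in range(1, 6):
    centroids, clusters = kmeans(points, k)  # kmeans from the sketch above
    print(k, sse(centroids, clusters))       # SSE drops sharply, then flattens
```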