Joint Distributions¶

Consider two discrete random variables X and Y. The function f(x, y) = P(X = x, Y = y), defined for each pair of values (x, y) in the range of X and Y, is called the joint probability distribution of X and Y.

The joint probability mass function of the discrete random variables X and Y can be factored into a conditional and a marginal probability:

${\begin{aligned}\mathrm {P} (X=x\ \mathrm {and} \ Y=y)=\mathrm {P} (Y=y\mid X=x)\cdot \mathrm {P} (X=x)=\mathrm {P} (X=x\mid Y=y)\cdot \mathrm {P} (Y=y)\end{aligned}}$

Example¶

A coin is tossed twice. Let X denote the number of heads on the first toss and Y the total number of heads in the two tosses. Assume that the tosses are independent and that the coin is biased, with a 60% chance of heads on each toss:

X = Number of heads on the first toss (0 or 1)
Y = Number of heads in 2 tosses (0, 1 or 2)

Compute the joint probabilities and assign them to the variables p_xy given below, where x is the value of X and y is the value of Y.

In [ ]:

# Assign the joint probabilities to the variables p_xy below,
# where x is the value of X and y is the value of Y
p_h = 0.6
p_t = 1 - p_h
p_12 = 0
p_11 = 0
p_01 = 0
p_00 = 0

In [ ]:

p_12 = p_h * p_h
p_11 = p_h * p_t
p_01 = p_t * p_h
p_00 = p_t * p_t

print("p_12 %.2f, p_11 %.2f, p_01 %.2f, p_00 %.2f" % (p_12, p_11, p_01, p_00))

In [ ]:

ref_tmp_var = False

try:
    # A tight tolerance: with 0.1, swapped values such as 0.24 and 0.16 would also pass
    if (abs(p_12 - 0.36) < 1e-6) and (abs(p_11 - 0.24) < 1e-6) and (abs(p_01 - 0.24) < 1e-6) and (abs(p_00 - 0.16) < 1e-6):
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var

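The joint probability (JP) of each outcome is the product of the probabilities of the two tosses:

\begin{array}{ l | c | c | r } \hline \text{Outcome} & \text{1st toss} & \text{2nd toss} & \text{JP} \\ \hline HH & 0.6 & 0.6 & 0.36 \\ \hline HT & 0.6 & 0.4 & 0.24 \\ \hline TH & 0.4 & 0.6 & 0.24 \\ \hline TT & 0.4 & 0.4 & 0.16 \\ \hline \end{array}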

In terms of X and Y, the joint probability distribution looks like this:

\begin{array}{ l | c | c | r } \hline \text{Outcome} & X & Y & \text{JP} \\ \hline HH & 1 & 2 & 0.36 \\ \hline HT & 1 & 1 & 0.24 \\ \hline TH & 0 & 1 & 0.24 \\ \hline TT & 0 & 0 & 0.16 \\ \hline \end{array}

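We can now organize the above as a table with one row for each value of Y and one column for each value of X. Pairs that cannot occur have probability 0:

\begin{array}{ l | c | r } \hline Y \downarrow \ X \rightarrow & 0 & 1 \\ \hline 0 & 0.16 & 0 \\ \hline 1 & 0.24 & 0.24 \\ \hline 2 & 0 & 0.36 \\ \hline \end{array}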
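The table above can also be built by enumerating the four outcomes in code. The following is a minimal sketch, not the notebook's reference solution: the nested dictionary `p_xy[x][y]` and the helper `p_toss` are names chosen here for illustration, and the two tosses are assumed to be independent.

```python
from itertools import product

p_h = 0.6                      # P(head) on a single toss, from the example
p_toss = {1: p_h, 0: 1 - p_h}  # 1 = head, 0 = tail

# p_xy[x][y] = P(X = x, Y = y); pairs that cannot occur stay at 0.0
p_xy = {x: {y: 0.0 for y in (0, 1, 2)} for x in (0, 1)}

# Enumerate HH, HT, TH, TT and add each outcome's probability to its (x, y) cell
for first, second in product((1, 0), repeat=2):
    x = first            # number of heads on the first toss
    y = first + second   # total number of heads in the two tosses
    p_xy[x][y] += p_toss[first] * p_toss[second]

for x in p_xy:
    print(x, {y: round(p, 2) for y, p in p_xy[x].items()})
```

Each inner dictionary corresponds to one column of the table, so `p_xy[1][2]` holds the probability of the outcome HH.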

Marginal Distribution¶

Given two random variables X and Y whose joint distribution is known, the marginal distribution of X is the probability distribution of X alone, regardless of the value of Y.

For discrete random variables, the marginal distribution of X is obtained by summing the joint distribution over all values of Y. Likewise, the marginal distribution of Y is obtained by summing the joint distribution over all values of X.
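In symbols, with f(x, y) denoting the joint distribution as before, the two sums can be written as follows. The subscripts on f are notation added here only to tell the two marginals apart; the tables in this section write them simply as f(x) and f(y).

${\begin{aligned}f_{X}(x)=\sum _{y}f(x,y),\qquad f_{Y}(y)=\sum _{x}f(x,y)\end{aligned}}$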

Let us consider the above joint distribution again:

\begin{array}{ l | c | r } \hline Y \downarrow \ X \rightarrow & 0 & 1 \\ \hline 0 & 0.16 & 0 \\ \hline 1 & 0.24 & 0.24 \\ \hline 2 & 0 & 0.36 \\ \hline \end{array}

Example¶

Compute the marginal distributions f(x) and f(y). Assign them as lists to the variables fX and fY.

In [ ]:

#Exercise
fX = []
fY = []

Hint: sum down each column of the table to get fX, and sum across each row to get fY.

In [5]:

fX = [0.4, 0.6]
fY = [0.16, 0.48, 0.36]

print("fX: ", fX)
print("fY: ", fY)

fX:  [0.4, 0.6]
fY:  [0.16, 0.48, 0.36]

In [8]:

ref_tmp_var = False

try:
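    # Compare with a tolerance so that marginals computed by summing floats also pass
    expected = [0.4, 0.6, 0.16, 0.48, 0.36]
    if len(fX) == 2 and len(fY) == 3 and all(abs(a - e) < 1e-6 for a, e in zip(list(fX) + list(fY), expected)):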
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var


For the above joint distribution, the marginal distributions are given below:

Marginal Distribution of X:

\begin{array}{ l | c | r } \hline X-> & 0 & 1 \\ \hline f(x) & 0.4 & 0.6 \\ \hline \end{array}

Marginal Distribution of Y:

\begin{array}{ l | c | c | r } \hline Y \rightarrow & 0 & 1 & 2 \\ \hline f(y) & 0.16 & 0.48 & 0.36 \\ \hline \end{array}
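The same marginals can be computed rather than typed in. This is a small sketch in plain Python, assuming the joint table is stored as a list of rows (one row per value of Y, one column per value of X); the name `joint` is chosen here for illustration.

```python
# Rows are Y = 0, 1, 2; columns are X = 0, 1 (same layout as the joint table)
joint = [
    [0.16, 0.00],
    [0.24, 0.24],
    [0.00, 0.36],
]

fX = [sum(col) for col in zip(*joint)]  # sum down each column: one value per x
fY = [sum(row) for row in joint]        # sum across each row: one value per y

# Round only for display; the stored sums are ordinary floats
print("fX:", [round(p, 2) for p in fX])
print("fY:", [round(p, 2) for p in fY])
```

Because the sums are floating-point numbers, it is safer to compare them with the expected values using a small tolerance than with `==`.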

http://www.sci.csueastbay.edu/~btrumbo/Stat3401/Hand3401/JointDistnsCor.pdf

Corpus of words¶

Let us consider a corpus (collection) of 100 words in a text. The words are tabulated below by frequency of occurrence, where c(w) is the count of word w, P(w) = c(w)/100 is its probability, X is the word length and Y is the number of vowels.

Let us look at the word table for this corpus:

\begin{array}{ l | c | c | c | r } \hline \text{word} & c(w) & P(w) & X & Y \\ \hline \text{the} & 30 & 0.30 & 3 & 1 \\ \hline \text{to} & 18 & 0.18 & 2 & 1 \\ \hline \text{will} & 16 & 0.16 & 4 & 1 \\ \hline \text{of} & 10 & 0.10 & 2 & 1 \\ \hline \text{hello} & 7 & 0.07 & 5 & 2 \\ \hline \text{in} & 6 & 0.06 & 2 & 1 \\ \hline \text{tools} & 4 & 0.04 & 5 & 2 \\ \hline \text{pose} & 3 & 0.03 & 4 & 2 \\ \hline \text{taste} & 3 & 0.03 & 5 & 2 \\ \hline \text{PGM} & 3 & 0.03 & 3 & 0 \\ \hline \end{array}

From the above table, it is evident that the word "the" occurs 30 times (count column) out of a total of 100 words. Hence the probability of the word "the" is 0.30 (30/100 = 0.30). The X column refers to the length of the word. In this case x=3. The Y column refers to the number of vowels. In this case y=1. Similarly for the word "to" the probability of occurrence is 0.18 (18/100 = 0.18). X and Y are 2 and 1 respectively.

To arrive at the joint probability distribution of X and Y, we must consider all the combinations of X and Y that are observed. For example, consider all the words with a length of 2 (X=2) and exactly 1 vowel (Y=1). There are three such words: "to", "of" and "in". We get the joint probability by summing the individual probabilities of these words, which are 0.18, 0.10 and 0.06. Hence for X=2, Y=1 the joint probability is 0.18 + 0.10 + 0.06 = 0.34. Calculating the joint probabilities for all combinations of X and Y in the same way gives the joint probability distribution table.

The joint probability distribution looks like this:

\begin{array}{ l | c | c | c | r } \hline Y \downarrow \ X \rightarrow & 2 & 3 & 4 & 5 \\ \hline 0 & 0 & 0.03 & 0 & 0 \\ \hline 1 & 0.34 & 0.30 & 0.16 & 0 \\ \hline 2 & 0 & 0 & 0.03 & 0.14 \\ \hline \end{array}
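The grouping described above (add up P(w) for all words that share the same length and vowel count) can be sketched in a few lines. The word counts are copied from the word table; the helper `n_vowels` and the dictionary `joint` are hypothetical names introduced here, and vowels are assumed to be the letters a, e, i, o, u.

```python
from collections import defaultdict

# (word, count) pairs copied from the word table
counts = [("the", 30), ("to", 18), ("will", 16), ("of", 10), ("hello", 7),
          ("in", 6), ("tools", 4), ("pose", 3), ("taste", 3), ("PGM", 3)]

total = sum(c for _, c in counts)  # size of the corpus


def n_vowels(word):
    """Count the letters a, e, i, o, u in a word, ignoring case."""
    return sum(ch in "aeiou" for ch in word.lower())


# joint[(x, y)] = P(X = x, Y = y); combinations that never occur are simply absent
joint = defaultdict(float)
for word, c in counts:
    joint[(len(word), n_vowels(word))] += c / total

for (x, y), p in sorted(joint.items()):
    print(f"X={x}, Y={y}: {p:.2f}")
```

Only the observed (X, Y) combinations appear as keys; every other cell of the table has probability 0.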

Exercise¶

Find the marginal distribution of X and Y from the above joint probability distribution.

Assign them to the variables fX and fY respectively.

In [2]:

#Exercise
fX = []
fY = []

Hint: sum across each row to get fY, and sum down each column to get fX.

In [9]:

fX = [0.34, 0.33, 0.19, 0.14]
fY = [0.03, 0.80, 0.17]

print("fX: ", fX)
print("fY: ", fY)

fX:  [0.34, 0.33, 0.19, 0.14]
fY:  [0.03, 0.8, 0.17]

In [10]:

ref_tmp_var = False

try:
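    # Compare with a tolerance so that marginals computed by summing floats also pass
    expected = [0.34, 0.33, 0.19, 0.14, 0.03, 0.8, 0.17]
    if len(fX) == 4 and len(fY) == 3 and all(abs(a - e) < 1e-6 for a, e in zip(list(fX) + list(fY), expected)):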
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var


For the above joint distribution, the marginal distributions of X and Y are given below:

Marginal Distribution of X:

\begin{array}{ l | c | c | c | r } \hline X \rightarrow & 2 & 3 & 4 & 5 \\ \hline f(x) & 0.34 & 0.33 & 0.19 & 0.14 \\ \hline \end{array}

Marginal Distribution of Y:

\begin{array}{ l | c | c | r } \hline Y \rightarrow & 0 & 1 & 2 \\ \hline f(y) & 0.03 & 0.80 & 0.17 \\ \hline \end{array}

Fraud Modeling Example¶

Consider a simple model of fraudulent transactions whose data contain the variables Sex (S), Age (A), Fraud (F) and Jewelry (J), together with the joint probability P = P(S, A, F, J) of each combination of values:

S	A	F	J	P
S_0	A_0	F_0	J_0	0.0025
S_0	A_0	F_0	J_1	0.0100
S_0	A_0	F_1	J_0	0.1069
...	...	...	...	...
S_1	A_2	F_1	J_1	0.0079

Here F = No corresponds to the value F_1.

Compute the conditional distribution p(S, A, F, J | F=No) and assign it to p_SAFJ.
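Written out, conditioning on F = F_1 follows the usual definition of conditional probability. The lowercase s, a, j below are summation indices introduced here for illustration:

${\begin{aligned}P(S,A,J\mid F=F_{1})={\frac {P(S,A,F=F_{1},J)}{P(F=F_{1})}},\qquad P(F=F_{1})=\sum _{s,a,j}P(s,a,F_{1},j)\end{aligned}}$

In terms of the table, this means keeping only the rows with F = F_1 and dividing each P by the sum of P over those rows, so that the remaining probabilities sum to 1.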

In [11]:

import pandas as pd

fraud_data = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/Fraud/fraud_data.csv')
fraud_data.head()

Out[11]:

	S	A	F	J	P
0	S_0	A_0	F_0	J_0	0.0025
1	S_0	A_0	F_0	J_1	0.0100
2	S_0	A_0	F_1	J_0	0.1069
3	S_0	A_0	F_1	J_1	0.0056
4	S_0	A_1	F_0	J_0	0.0008

Use fraud_data['F'].str.contains('F_1')

In [13]:

# Keep only the rows with F = F_1; .copy() avoids pandas' SettingWithCopyWarning
p_SAFJ = fraud_data[fraud_data['F'].str.contains('F_1')].copy()
# Renormalize so that the remaining probabilities sum to 1
p_SAFJ['P'] = p_SAFJ['P'] / p_SAFJ['P'].sum()
p_SAFJ

Out[13]:

	S	A	F	J	P
2	S_0	A_0	F_1	J_0	0.118778
3	S_0	A_0	F_1	J_1	0.006222
6	S_0	A_1	F_1	J_0	0.190000
7	S_0	A_1	F_1	J_1	0.010000
10	S_0	A_2	F_1	J_0	0.166222
11	S_0	A_2	F_1	J_1	0.008778
14	S_1	A_0	F_1	J_0	0.118778
15	S_1	A_0	F_1	J_1	0.006222
18	S_1	A_1	F_1	J_0	0.190000
19	S_1	A_1	F_1	J_1	0.010000
22	S_1	A_2	F_1	J_0	0.166222
23	S_1	A_2	F_1	J_1	0.008778

In [14]:

ref_tmp_var = False

try:
    # Row label 2 should hold the renormalized value, not the original 0.1069
    if abs(p_SAFJ.loc[2, 'P'] - 0.118778) < 1e-4:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var
