Consider two discrete random variables X and Y. The function given by f(x, y) = P(X = x, Y = y) for each pair of values (x, y) within the range of X and Y is called the joint probability distribution of X and Y.
The joint probability mass function of two discrete random variables X and Y can be written in terms of conditional probabilities:
${\begin{aligned}\mathrm {P} (X=x\ \mathrm {and} \ Y=y)=\mathrm {P} (Y=y\mid X=x)\cdot \mathrm {P} (X=x)=\mathrm {P} (X=x\mid Y=y)\cdot \mathrm {P} (Y=y)\end{aligned}}$
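As a quick numerical check of the product rule, here is a minimal sketch. The values in `p_x` and `p_y_given_x` are hypothetical and chosen only for illustration; they do not come from any example in this section.

```python
# Hypothetical distributions, chosen only for illustration
p_x = {0: 0.3, 1: 0.7}                    # P(X = x)
p_y_given_x = {0: {0: 0.5, 1: 0.5},       # P(Y = y | X = 0)
               1: {0: 0.2, 1: 0.8}}       # P(Y = y | X = 1)

# Product rule: P(X = x, Y = y) = P(Y = y | X = x) * P(X = x)
joint = {(x, y): p_y_given_x[x][y] * p_x[x]
         for x in p_x for y in p_y_given_x[x]}

# Marginal of Y, needed for the reverse factorization
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# Reverse factorization: P(X = x | Y = y) * P(Y = y) gives the same joint
p_x_given_y = {(x, y): p / p_y[y] for (x, y), p in joint.items()}
for (x, y), p in joint.items():
    assert abs(p_x_given_y[(x, y)] * p_y[y] - p) < 1e-12

print(joint)
```

Both factorizations recover the same joint table, which is what the chain of equalities above states.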
A coin is tossed twice. Let X denote the number of heads on the first toss and Y the total number of heads on the 2 tosses. Assume that the coin is biased and a head has a 60% chance of occurring:
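Compute the joint probabilities and assign them to the variables below, where p_xy stands for P(X = x, Y = y) (for example, p_12 is P(X = 1, Y = 2)).
# Assign each joint probability to a variable named p_<X><Y>, e.g. p_12 = P(X=1, Y=2)
p_h = 0.6        # probability of a head
p_t = 1 - p_h    # probability of a tail
# Placeholders to be filled in
p_12 = 0
p_11 = 0
p_01 = 0
p_00 = 0
# The two tosses are independent, so the outcome probabilities multiply
p_12 = p_h * p_h  # HH: X = 1, Y = 2
p_11 = p_h * p_t  # HT: X = 1, Y = 1
p_01 = p_t * p_h  # TH: X = 0, Y = 1
p_00 = p_t * p_t  # TT: X = 0, Y = 0
print("p_12 %.2f, p_11 %.2f, p_01 %.2f, p_00 %.2f" % (p_12, p_11, p_01, p_00))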
ref_tmp_var = False
try:
    if (abs(p_12 - 0.36) < 0.1) and (abs(p_11 - 0.24) < 0.1) and (abs(p_01 - 0.24) < 0.1) and (abs(p_00 - .16) < 0.1):
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var
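The same probabilities can also be generated by enumerating the four outcomes and storing them in a nested dictionary indexed as `p_xy[X][Y]`. This is a minimal sketch; the name `p_side` and the use of `itertools.product` are illustrative choices, not part of the exercise.

```python
from itertools import product

p_h = 0.6                              # probability of a head, as in the exercise
p_side = {"H": p_h, "T": 1 - p_h}      # hypothetical helper: probability of each side

# p_xy[x][y] holds P(X = x, Y = y); impossible pairs stay at 0.0
p_xy = {x: {y: 0.0 for y in range(3)} for x in range(2)}

for first, second in product("HT", repeat=2):
    x = int(first == "H")              # heads on the first toss
    y = x + int(second == "H")         # total heads over both tosses
    p_xy[x][y] += p_side[first] * p_side[second]

for x, row in p_xy.items():
    print(x, {y: round(p, 2) for y, p in row.items()})
```

Pairs that cannot occur, such as X = 0 with Y = 2, keep probability 0, which matches the zeros in the two-way table.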
The joint probability distribution looks like this:
\begin{array}{ l | c | c | r } \hline \text{Outcome} & X & Y & P(X, Y) \\ \hline HH & 1 & 2 & 0.36 \\ \hline HT & 1 & 1 & 0.24 \\ \hline TH & 0 & 1 & 0.24 \\ \hline TT & 0 & 0 & 0.16 \\ \hline \end{array}
We can now organize the above as a table with Y along the rows and X along the columns:
\begin{array}{ l | c | r } \hline Y \backslash X & 0 & 1 \\ \hline 0 & 0.16 & 0 \\ \hline 1 & 0.24 & 0.24 \\ \hline 2 & 0 & 0.36 \\ \hline \end{array}
For two random variables X and Y whose joint distribution is known, the marginal distribution of X is the probability distribution of X alone, with the information about Y summed out. It is calculated by summing the joint probability distribution over all values of Y.
For discrete random variables, this means that the marginal probability of each value x is the sum of the joint probabilities f(x, y) over all values y.
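In symbols, with f(x, y) denoting the joint distribution, the summation can be written as follows. The subscripted names $f_X$ and $f_Y$ are a notational assumption made here; the tables in this section label the marginals f(x) and f(y).

$$f_X(x) = \sum_{y} f(x, y), \qquad f_Y(y) = \sum_{x} f(x, y)$$

For the coin table, for instance, $f_X(0) = f(0, 0) + f(0, 1) + f(0, 2) = 0.16 + 0.24 + 0 = 0.40$.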
Let us consider the above joint distribution again:
\begin{array}{ l | c | r } \hline Y \backslash X & 0 & 1 \\ \hline 0 & 0.16 & 0 \\ \hline 1 & 0.24 & 0.24 \\ \hline 2 & 0 & 0.36 \\ \hline \end{array}
#Exercise
fX = []
fY = []
Sum over the rows (for fY) and over the columns (for fX) of the table to obtain each marginal distribution.
fX = [0.4, 0.6]
fY = [0.16, 0.48, 0.36]
print("fX: ", fX)
print("fY: ", fY)
fX: [0.4, 0.6]
fY: [0.16, 0.48, 0.36]
ref_tmp_var = False
try:
    if fX == [0.4, 0.6] and fY == [0.16, 0.48, 0.36]:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var
For the above joint distribution, the marginal distributions are given below:
Marginal Distribution of X:
\begin{array}{ l | c | r } \hline X & 0 & 1 \\ \hline f(x) & 0.4 & 0.6 \\ \hline \end{array}
Marginal Distribution of Y:
\begin{array}{ l | c | c | r } \hline Y & 0 & 1 & 2 \\ \hline f(y) & 0.16 & 0.48 & 0.36 \\ \hline \end{array}
http://www.sci.csueastbay.edu/~btrumbo/Stat3401/Hand3401/JointDistnsCor.pdf
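The marginals of the coin-toss table can also be computed rather than typed in by hand. The sketch below assumes the joint table is stored as a list of rows (one row per value of Y); the rounding to 10 decimals is only there to hide floating-point noise and is a choice made for this illustration.

```python
# Joint table from the coin example: rows are Y = 0, 1, 2; columns are X = 0, 1
joint = [
    [0.16, 0.00],
    [0.24, 0.24],
    [0.00, 0.36],
]

# Marginal of Y: sum each row (over X). Marginal of X: sum each column (over Y).
fY = [round(sum(row), 10) for row in joint]
fX = [round(sum(col), 10) for col in zip(*joint)]

print("fX:", fX)
print("fY:", fY)
```

`zip(*joint)` transposes the table, so the same row-sum expression yields the column sums.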
Let us consider a corpus (collection) of 100 words of text. The words are tabulated below by frequency of occurrence, where c(w) is the count of word w, P(w) is its probability, X is the word length and Y is the number of vowels.
Let us look at the word table from which the joint probabilities are built:
\begin{array}{ l | c | c | c | r } \hline word & c(w) & P(w) & X & Y \\ \hline the & 30 & 0.30 & 3 & 1 \\ \hline to & 18 & 0.18 & 2 & 1 \\ \hline will & 16 & 0.16 & 4 & 1 \\ \hline of & 10 & 0.10 & 2 & 1 \\ \hline hello & 7 & 0.07 & 5 & 2 \\ \hline in & 6 & 0.06 & 2 & 1 \\ \hline tools & 4 & 0.04 & 5 & 2 \\ \hline pose & 3 & 0.03 & 4 & 2 \\ \hline taste & 3 & 0.03 & 5 & 2 \\ \hline PGM & 3 & 0.03 & 3 & 0 \\ \hline \end{array}
From the above table, it is evident that the word "the" occurs 30 times (count column) out of a total of 100 words. Hence the probability of the word "the" is 0.30 (30/100 = 0.30). The X column gives the length of the word, here x = 3. The Y column gives the number of vowels, here y = 1. Similarly, for the word "to" the probability of occurrence is 0.18 (18/100 = 0.18), and X and Y are 2 and 1 respectively.
To arrive at the joint probability distribution of X and Y, we must consider all the combinations of X and Y that are observed. For example, consider all the words with a length of 2 (X = 2) and exactly 1 vowel (Y = 1). Three words qualify, namely "to", "of" and "in". We get the joint probability by summing the individual probabilities of these words, which are 0.18, 0.10 and 0.06. Hence for X = 2, Y = 1 the joint probability is 0.18 + 0.10 + 0.06 = 0.34. Calculating the joint probabilities for all combinations of X and Y in the same way gives the joint probability distribution table.
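The grouping described above can be sketched in a few lines of Python. The word counts are copied from the table; the helper `n_vowels` and the variable names are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict

# Word counts copied from the table above (100 words in total)
counts = {"the": 30, "to": 18, "will": 16, "of": 10, "hello": 7,
          "in": 6, "tools": 4, "pose": 3, "taste": 3, "PGM": 3}
total = sum(counts.values())

def n_vowels(word):
    """Count the vowels a, e, i, o, u, ignoring case."""
    return sum(ch in "aeiou" for ch in word.lower())

# Accumulate counts per (X, Y) = (word length, number of vowels)
joint_counts = defaultdict(int)
for word, c in counts.items():
    joint_counts[(len(word), n_vowels(word))] += c

# Convert the grouped counts to probabilities
joint = {xy: c / total for xy, c in joint_counts.items()}
for (x, y), p in sorted(joint.items()):
    print(f"P(X={x}, Y={y}) = {p:.2f}")
```

Summing integer counts first and dividing once at the end avoids accumulating floating-point error across the words in each group. Combinations that never occur are simply absent from `joint` and correspond to probability 0.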
The joint probability distribution looks like this:
\begin{array}{ l | c | c | c | r } \hline Y \backslash X & 2 & 3 & 4 & 5 \\ \hline 0 & 0 & 0.03 & 0 & 0 \\ \hline 1 & 0.34 & 0.30 & 0.16 & 0 \\ \hline 2 & 0 & 0 & 0.03 & 0.14 \\ \hline \end{array}
Find the marginal distribution of X and Y from the above joint probability distribution.
Assign them to the variables fX and fY respectively.
#Exercise
fX = []
fY = []
Sum over the rows (for the fY array) and over the columns (for the fX array) to obtain each marginal distribution.
fX = [0.34, 0.33, 0.19, 0.14]
fY = [0.03, 0.80, 0.17]
print("fX: ", fX)
print("fY: ", fY)
fX: [0.34, 0.33, 0.19, 0.14]
fY: [0.03, 0.8, 0.17]
ref_tmp_var = False
try:
    if fX == [0.34, 0.33, 0.19, 0.14] and fY == [0.03, 0.8, 0.17]:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var
For the above joint distribution, the marginal distributions of X and Y are given below:
Marginal Distribution of X:
\begin{array}{ l | c | c | c | r } \hline X & 2 & 3 & 4 & 5 \\ \hline f(X) & 0.34 & 0.33 & 0.19 & 0.14 \\ \hline \end{array}
Marginal Distribution of Y:
\begin{array}{ l | c | c | r } \hline Y & 0 & 1 & 2 \\ \hline f(Y) & 0.03 & 0.80 & 0.17 \\ \hline \end{array}
Consider a simple model of fraudulent transactions with data containing Sex (S), Age (A), Fraud (F) and Jewelry (J), together with the joint probability P = P(S, A, F, J) of each combination:
S | A | F | J | P |
---|---|---|---|---|
S_0 | A_0 | F_0 | J_0 | 0.0025 |
S_0 | A_0 | F_0 | J_1 | 0.0100 |
S_0 | A_0 | F_1 | J_0 | 0.1069 |
... | ... | ... | ... | ... |
S_1 | A_2 | F_1 | J_1 | 0.0079 |
(F = No) corresponds to F_1
import pandas as pd
fraud_data = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/Fraud/fraud_data.csv')
fraud_data.head()
| | S | A | F | J | P |
---|---|---|---|---|---|
0 | S_0 | A_0 | F_0 | J_0 | 0.0025 |
1 | S_0 | A_0 | F_0 | J_1 | 0.0100 |
2 | S_0 | A_0 | F_1 | J_0 | 0.1069 |
3 | S_0 | A_0 | F_1 | J_1 | 0.0056 |
4 | S_0 | A_1 | F_0 | J_0 | 0.0008 |
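Hint: use fraud_data['F'].str.contains('F_1') to select the rows where F = F_1, then divide P by its sum over those rows so that the selected probabilities sum to 1.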
# Keep only the rows with F = F_1; .copy() avoids pandas' SettingWithCopyWarning
p_SAFJ = fraud_data[fraud_data['F'].str.contains('F_1')].copy()
# Renormalize so that the remaining probabilities sum to 1
p_SAFJ['P'] = p_SAFJ['P'] / p_SAFJ['P'].sum()
p_SAFJ
| | S | A | F | J | P |
---|---|---|---|---|---|
2 | S_0 | A_0 | F_1 | J_0 | 0.118778 |
3 | S_0 | A_0 | F_1 | J_1 | 0.006222 |
6 | S_0 | A_1 | F_1 | J_0 | 0.190000 |
7 | S_0 | A_1 | F_1 | J_1 | 0.010000 |
10 | S_0 | A_2 | F_1 | J_0 | 0.166222 |
11 | S_0 | A_2 | F_1 | J_1 | 0.008778 |
14 | S_1 | A_0 | F_1 | J_0 | 0.118778 |
15 | S_1 | A_0 | F_1 | J_1 | 0.006222 |
18 | S_1 | A_1 | F_1 | J_0 | 0.190000 |
19 | S_1 | A_1 | F_1 | J_1 | 0.010000 |
22 | S_1 | A_2 | F_1 | J_0 | 0.166222 |
23 | S_1 | A_2 | F_1 | J_1 | 0.008778 |
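The renormalization in the cell above is an instance of the definition of conditional probability. As a sketch, with $P(F = F_1)$ denoting the total probability of the selected rows:

$$P(S, A, J \mid F = F_1) = \frac{P(S, A, F_1, J)}{P(F = F_1)}, \qquad P(F = F_1) = \sum_{S, A, J} P(S, A, F_1, J)$$

The outputs shown are consistent with $P(F = F_1) \approx 0.9$ (a value inferred from the two tables, not read from the data file): $0.1069 / 0.9 \approx 0.118778$ for row 2 and $0.0056 / 0.9 \approx 0.006222$ for row 3.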
ref_tmp_var = False
try:
    # Row 2 should hold the renormalized value 0.118778, not the original 0.1069
    if abs(p_SAFJ['P'][2] - 0.118778) < 1e-3:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
assert ref_tmp_var