Datasets and simulation


Comma Separated Values


id,name,age,height
0,Tom,32,1.84
1,Mary,45,1.67
2,Lisa,88,1.77
3,Brad,21,1.95
...
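
As a minimal sketch of how such text becomes a table, the four rows shown above can be parsed straight from a string. Here `io.StringIO` stands in for a file on disk, and the names `csv_text` and `dfpeople` are made up for this illustration.

```python
import io

import pandas as pd

# The example rows from above, one record per line, with a header line first.
csv_text = """id,name,age,height
0,Tom,32,1.84
1,Mary,45,1.67
2,Lisa,88,1.77
3,Brad,21,1.95
"""

# The first line supplies the column names; `id` is used as the row index.
dfpeople = pd.read_csv(io.StringIO(csv_text), index_col="id")

# pandas infers a type per column: text for name, integer for age, float for height.
print(dfpeople.dtypes)
```

The same call works on a file path or a URL in place of the `io.StringIO` object.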


R datasets



Pandas


In [1]:
import pandas as pd
In [2]:
dfsleep = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sleep.csv")
dfsleep
Out[2]:
Unnamed: 0 extra group ID
0 1 0.7 1 1
1 2 -1.6 1 2
2 3 -0.2 1 3
3 4 -1.2 1 4
4 5 -0.1 1 5
5 6 3.4 1 6
6 7 3.7 1 7
7 8 0.8 1 8
8 9 0.0 1 9
9 10 2.0 1 10
10 11 1.9 2 1
11 12 0.8 2 2
12 13 1.1 2 3
13 14 0.1 2 4
14 15 -0.1 2 5
15 16 4.4 2 6
16 17 5.5 2 7
17 18 1.6 2 8
18 19 4.6 2 9
19 20 3.4 2 10
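
The `Unnamed: 0` column in the output is most likely the R row number, stored under an empty column name. The sketch below reproduces this on a three-row sample and shows one way to handle it with `index_col=0`. The header layout of the sample is an assumption inferred from the `Unnamed: 0` label; the rows are copied from the output above.

```python
import io

import pandas as pd

# Assumed layout: the first header field is empty (inferred, not checked against the file).
sample = """,extra,group,ID
1,0.7,1,1
2,-1.6,1,2
3,-0.2,1,3
"""

# Default read: pandas invents the name "Unnamed: 0" for the nameless column.
df_default = pd.read_csv(io.StringIO(sample))

# Treating the first column as the index keeps it out of the data columns.
df_indexed = pd.read_csv(io.StringIO(sample), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```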


Investigating


In [3]:
import seaborn as sns
In [4]:
dfiris = pd.read_csv("https://raw.githubusercontent.com/ianmcloughlin/datasets/master/iris.csv")
dfiris
Out[4]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [5]:
sns.pairplot(dfiris, hue="class");
In [6]:
sns.displot(dfiris["petal_width"], kde=True);
In [7]:
sns.displot(dfiris[dfiris["class"] == "setosa"]["petal_width"], kde=True);
In [8]:
dfauto = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dfauto
Out[8]:
18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"
0 15.0 8 350.0 165.0 3693. 11...
1 18.0 8 318.0 150.0 3436. 11...
2 16.0 8 304.0 150.0 3433. 12...
3 17.0 8 302.0 140.0 3449. 10...
4 15.0 8 429.0 198.0 4341. 10...
... ...
392 27.0 4 140.0 86.00 2790. 15...
393 44.0 4 97.00 52.00 2130. 24...
394 32.0 4 135.0 84.00 2295. 11...
395 28.0 4 120.0 79.00 2625. 18...
396 31.0 4 119.0 82.00 2720. 19...

397 rows × 1 columns

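The whole file has ended up in a single column, and the first record has been used as the column name: `read_csv` assumed comma-separated fields and a header row, but this file separates its fields with whitespace and has no header. Use Google to fix it!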
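
One possible fix, sketched on a two-line sample so that it runs offline. The column names are an assumption based on the UCI description of the data set (the file itself carries no header), and treating `?` as a missing value is likewise an assumption about how the file marks gaps. The second sample row is made up for illustration.

```python
import io

import pandas as pd

# Assumed column names; the file has no header row of its own.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin", "car_name"]

# First line copied from the output above; second line is hypothetical,
# included only to show the "?" handling.
sample = (
    '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"\n'
    '20.0 4 100.0 ? 2000. 15.0 75 1\t"hypothetical car"\n'
)

dfsample = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",       # any run of spaces or tabs separates fields
    header=None,      # the first line is data, not column names
    names=columns,
    na_values="?",    # assumed marker for missing values
)
print(dfsample)
```

Passing the URL from the cell above instead of `io.StringIO(sample)` should give one column per field, provided the assumptions about the file hold.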


Self-learning


Have a look at the following blog post.

Can you replicate the analysis using Python?

https://uc-r.github.io/t_test
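
As a possible starting point, here is a sketch of a paired t statistic computed by hand on the sleep data loaded earlier. It is not the analysis from the blog post. It assumes the two groups are measurements on the same ten subjects (matched by `ID`), which is what makes a paired test reasonable; the variable names are made up for this illustration.

```python
import numpy as np

# `extra` values copied from the sleep output earlier, ordered by ID within each group.
group1 = np.array([0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0])
group2 = np.array([1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4])

# Paired test: work with the per-subject differences.
diff = group2 - group1
n = diff.size

# t = mean difference divided by its standard error (sample std uses ddof=1).
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

print(f"mean difference = {diff.mean():.2f}, t = {t_stat:.2f}, df = {n - 1}")
```

If SciPy is installed, `scipy.stats.ttest_rel(group2, group1)` computes the same statistic together with a p-value.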


Simulating


Reformulation of parameters of the logistic function applied to power curves of wind turbines

Daniel Villanueva and Andrés E. Feijóo; Electric Power Systems Research; Vol. 137; Pages 51-58; 2016

The current procedure for obtaining the parameters of the logistic function, used as a model for the power curve of wind turbines, provides meaningless values. These values are different for each wind turbine and obtaining them requires an optimization process. This paper proposes a procedure to obtain the parameters of the 4-parameter logistic function based on the features of the power curve, providing a model that is a function of the power curve parameters supplied by the manufacturer. Furthermore, that model can be used to derive another 4-parameter model and a 3-parameter model is proposed for certain conditions. The three models consist of a continuous function which simplifies the implementation of the curve in a computer program compared to piecewise models. In addition, the probability density function of the output power of a wind turbine is derived by using each model.

$$ P(u) = a \frac{1 + me^{-u/t}}{1+ ne^{-u/t}} $$
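
A quick sanity check on the formula, assuming $t > 0$ so that $e^{-u/t} \to 0$ as $u$ grows: the curve starts at a value set by $a$, $m$ and $n$, and levels off at $a$.

$$ P(0) = a \frac{1 + m}{1 + n}, \qquad \lim_{u \to \infty} P(u) = a $$

With the parameter values used in the code that follows, $P(0) = 2011.1 \times 3.665 / 623.922 \approx 11.81$, which agrees with the first row of the table.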
In [9]:
import numpy as np

a, m, n, t = 2011.1, 2.6650, 622.922, 1.4090

u = np.linspace(0.0, 30.0, 1000)

P_u = a * (1.0 + m * np.exp(-u / t)) / (1.0 + n * np.exp(-u / t))
In [10]:
import pandas as pd

df = pd.DataFrame({"wind": u, "power": P_u})
df
Out[10]:
wind power
0 0.00000 11.813466
1 0.03003 11.882492
2 0.06006 11.953000
3 0.09009 12.025021
4 0.12012 12.098589
... ... ...
995 29.87988 2011.099231
996 29.90991 2011.099247
997 29.93994 2011.099263
998 29.96997 2011.099278
999 30.00000 2011.099293

1000 rows × 2 columns

In [11]:
import seaborn as sns

sns.scatterplot(data=df, x="wind", y="power");
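
The curve above is noise-free. A minimal sketch of simulating measurements from it follows, under stated assumptions: wind speeds drawn uniformly between 0 and 30, and Gaussian noise with a standard deviation of 50 added to the power. Neither choice comes from the paper; the seed and the names `power_curve`, `u_sim` and `dfsim` are likewise made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same parameter values as in the cell above.
a, m, n, t = 2011.1, 2.6650, 622.922, 1.4090


def power_curve(u):
    """Logistic power curve P(u) from the formula above."""
    return a * (1.0 + m * np.exp(-u / t)) / (1.0 + n * np.exp(-u / t))


rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

# Assumption: wind speeds are uniform on [0, 30).
u_sim = rng.uniform(0.0, 30.0, size=500)

# Assumption: additive Gaussian noise with standard deviation 50.
noise = rng.normal(loc=0.0, scale=50.0, size=u_sim.size)

dfsim = pd.DataFrame({"wind": u_sim, "power": power_curve(u_sim) + noise})
print(dfsim.describe())
```

Plotting `dfsim` with the same `sns.scatterplot` call as above shows a cloud of points scattered around the curve. Note that additive noise can push the simulated power below zero at low wind speeds.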

End