
Pandas introduction 2.

https://klajosw.blogspot.com/

pandas: a data processing and analysis tool built on NumPy

In [0]:

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:

# A DataFrame can be created from a dictionary...
data1 = {"a": [1, 1, 2], "b": [3.0, 4.0, None]}
df1 = pd.DataFrame(data1)
print('From a dictionary: ')
print(df1)
print('-----------------')

# ...from a list of (string, list) pairs
data2 = [("a", [1, 1, 2]), ("b", [3.0, 4.0, None])]
df2 = pd.DataFrame.from_dict(dict(data2))
print('From a list of string-list pairs: ')
print(df2)
print('-----------------')

# ...from a list of dictionaries
data3 = [{"a": 1, "b": 3}, {"a": 1, "b": 4}, {"a": 2}]
df3 = pd.DataFrame(data3)
print('From a list of dictionaries: ')
print(df3)
print('-----------------')

# ...and in many other ways

From a dictionary: 
   a    b
0  1  3.0
1  1  4.0
2  2  NaN
-----------------
From a list of string-list pairs: 
   a    b
0  1  3.0
1  1  4.0
2  2  NaN
-----------------
From a list of dictionaries: 
   a    b
0  1  3.0
1  1  4.0
2  2  NaN
-----------------
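
As one of the "many other ways", a DataFrame can also be built from a two-dimensional NumPy array. The sketch below is only an illustration; the names arr and df_arr are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A 3x2 array; column and row labels are supplied separately.
arr = np.array([[1, 3.0], [1, 4.0], [2, np.nan]])
df_arr = pd.DataFrame(arr, columns=["a", "b"], index=["xx", "yy", "zz"])

print(df_arr)
# A NumPy array has a single dtype, so both columns become float64 here.
print(df_arr.dtypes)
```

Unlike the dictionary-based examples, column "a" does not stay an integer column, because the whole array is stored as floats.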

In [8]:

## simple slicing of consecutive fields
def slices(s, *args):  ## slicer
    position = 0  ## start position
    for length in args:
        # yield is a statement that turns this function into a generator:
        # calling slices() returns a generator that produces one slice per request
        yield s[position:position + length]
        position += length  ## move to the start of the next field

print(list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50)))
print('---------------------------')
d,c,h = slices('LajosBélaAttilaFeri', 5, 4, 6)
print(d,c,h)

['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']
---------------------------
Lajos Béla Attila

The yield statement

The yield statement is used when defining a generator function, and only inside the body of that function.

Using a yield statement in a function definition is enough to turn a normal function into a generator function.

When a generator function is called, it returns an iterator known as a generator iterator, or simply a generator.

The body of the function is executed by calling next() on the generator repeatedly, until it raises an exception.

When a yield statement is executed, the state of the generator is frozen and the value of the expression list is returned to the caller of next().
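
The behaviour described above can be observed by calling next() by hand. This is a minimal sketch; the countdown generator is a made-up example, not part of the original notebook.

```python
def countdown(n):
    """Hypothetical example generator: yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(2)   # calling the function does not run the body yet
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes right after the yield, with n preserved

try:
    next(gen)        # the body finishes, so StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
```

A for loop performs the same next() calls internally and stops quietly when StopIteration is raised.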

In [17]:

## yield example
def fib(limit):
    a, b = 0, 1          
    while a < limit:
        yield a      ## hands the current value to the caller and pauses here until the next value is requested
        a, b = b, a + b 

print('Iterating over the Fibonacci sequence: ') 
for n in fib(200):    ## call the generator function, then read and print its values in a for loop
    print(n, end=' ') 

Iterating over the Fibonacci sequence: 
0 1 1 2 3 5 8 13 21 34 55 89 144

In [18]:

# every DataFrame and Series has an index
print(df1.index)
# (by default the index is a sequence number starting at 0 and increasing by 1)

# ...but of course we can specify a different index
df4 = pd.DataFrame(data1, index=["xx", "yy", "zz"])
print(df4.index)

# output of an earlier run (older pandas versions show the default index as Int64Index):
#Int64Index([0, 1, 2], dtype='int64')
#Index(['xx', 'yy', 'zz'], dtype='object')

RangeIndex(start=0, stop=3, step=1)
Index(['xx', 'yy', 'zz'], dtype='object')

In [19]:

# examples of creating a Series:
se1 = pd.Series([2, 3, 4])
se2 = pd.Series([2, 3, 4], ["xx", "yy", "zz"]) # the 2nd argument is the index

# a column can be selected from a DataFrame with the [] operator
df1["a"] # <= returns a Series
# ...or, if the column name is a valid identifier, with the . operator as well
df1.a    # <= returns a Series

Out[19]:

0    1
1    1
2    2
Name: a, dtype: int64

In [20]:

# a row can be selected from a DataFrame through the .iloc attribute
df1.iloc[0]      # <= this also returns a Series
df1.iloc[[1, 0]] # <= returns a DataFrame, because 2 rows were selected

Out[20]:

	a	b
1	1	4.0
0	1	3.0
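
While .iloc selects by position, the .loc attribute selects by index label. The following sketch assumes a small frame with string labels; df_lab and the other names are illustrative only.

```python
import pandas as pd

df_lab = pd.DataFrame(
    {"a": [1, 1, 2], "b": [3.0, 4.0, None]},
    index=["xx", "yy", "zz"],
)

row_by_position = df_lab.iloc[0]         # first row, whatever its label is
row_by_label = df_lab.loc["xx"]          # the row labelled "xx"
cell = df_lab.loc["yy", "b"]             # a single value: row label, column label
subset = df_lab.loc[["zz", "xx"], ["a"]] # lists of labels give a DataFrame

print(row_by_label)
print(cell)
print(subset)
```

With the default 0, 1, 2 index the two attributes look interchangeable, which is why it helps to try them on a frame with non-numeric labels.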

In [21]:

# an element can be selected from a Series with the [] operator
print(se1[0])
print(se2["xx"])

# the raw data can be accessed through the values attribute
se1.values # <= returns a NumPy array

2
2

Out[21]:

array([2, 3, 4])
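
The [] operator on a Series looks up index labels, so being explicit with .loc (label) and .iloc (position) avoids ambiguity. The pandas documentation also recommends to_numpy() over the values attribute. A short sketch with illustrative names:

```python
import pandas as pd

se = pd.Series([2, 3, 4], index=["xx", "yy", "zz"])

by_label = se.loc["yy"]    # lookup by index label
by_position = se.iloc[1]   # lookup by integer position
arr = se.to_numpy()        # the data as a NumPy array

print(by_label, by_position)
print(arr)
```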

In [0]:

# Import
#import pandas as pd
#import numpy as np

path = r'c:\Users\User\Documents\mintak\jupiter\kl\aa_kl_2020\fixlinefile.txt'

# Using Pandas with a column specification
col_specification = [(0, 9), (10, 18), (19, 27), (29, 36), (38, 45), (46, 100)]
data = pd.read_fwf(path, colspecs=col_specification)  ## Read a table of fixed-width formatted lines into DataFrame.
#print(data.dtypes)
#print(data.columns)  ## Index(['ncalls', 'tottime', 'percall', 'cumtime', 'percall.1', 'filename:lineno(function)'], dtype='object')
#print(data.index)    ## RangeIndex(start=0, stop=10, step=1)

print(data.describe())  ## summary statistics of the numeric columns
print('---------------------------')
print(data['ncalls'].min())
print(data['ncalls'].max())
print('---------------------------')
print(data['ncalls'].describe())  ## summary of a single column
print('---------------------------')


print('---------------------------')
print(data)
print('---------------------------')

 
## write to a file
data.to_csv('kimenet.csv', sep='|') ## other possible separators: |  \t  ,  ; ¤  @  ~

## plot
data[['cumtime', 'percall']].plot(figsize=(10, 6), style=['-', '--'], lw=2)

         tottime   apercall    cumtime    percall
count  10.000000  10.000000  10.000000  10.000000
mean    0.001100   0.000300   0.012400   0.006300
std     0.002601   0.000949   0.005296   0.009166
min     0.000000   0.000000   0.008000   0.000000
25%     0.000000   0.000000   0.009000   0.000000
50%     0.000000   0.000000   0.011000   0.000000
75%     0.000000   0.000000   0.012500   0.010250
max     0.008000   0.003000   0.022000   0.022000
---------------------------
1
50
---------------------------
count     10
unique     3
top       50
freq       5
Name: ncalls, dtype: object
---------------------------
---------------------------
   ncalls  tottime  apercall  cumtime  percall  \
0       1    0.000     0.000    0.022    0.022   
1       1    0.000     0.000    0.022    0.022   
2  354/52    0.000     0.000    0.013    0.000   
3       1    0.000     0.000    0.011    0.011   
4      50    0.000     0.000    0.011    0.000   
5      50    0.000     0.000    0.011    0.000   
6      50    0.000     0.000    0.009    0.000   
7      50    0.000     0.000    0.009    0.000   
8      50    0.008     0.000    0.008    0.000   
9       1    0.003     0.003    0.008    0.008   

                           filename:lineno(function)  
0                    {built-in method builtins.exec}  
1                               <string>:5(<module>)  
2  {built-in method numpy.core._multiarray_umath....  
3                             <string>:9(<listcomp>)  
4        <__array_function__ internals>:2(histogram)  
5                       histograms.py:680(histogram)  
6             <__array_function__ internals>:2(sort)  
7                           fromnumeric.py:837(sort)  
8         {method 'sort' of 'numpy.ndarray' objects}  
9             <ipython-input-6-74dc45cb4a27>:1(step)  
---------------------------
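
The input file of the cell above is not included, so here is a self-contained sketch of read_fwf on an in-memory table. The rows are a hypothetical miniature of the profiler output shown above. It also illustrates why ncalls is reported with dtype object: an entry such as 354/52 is not a number, so the whole column is read as text. The cleanup step assumes the cProfile convention that the first number is the total call count.

```python
import io
import pandas as pd

# Hypothetical fixed-width table: every field is padded to a fixed width.
rows = [
    ("ncalls", "tottime", "cumtime"),
    ("1", "0.000", "0.022"),
    ("354/52", "0.000", "0.013"),
    ("50", "0.008", "0.008"),
]
text = "\n".join(f"{a:>9} {b:>8} {c:>8}" for a, b, c in rows)

# colspecs are half-open [start, end) character ranges, one per column.
fw = pd.read_fwf(io.StringIO(text), colspecs=[(0, 9), (10, 18), (19, 27)])

# "354/52" prevents numeric parsing, so ncalls holds strings.
# Assumption: the part before "/" is the total number of calls.
total_calls = pd.to_numeric(fw["ncalls"].str.split("/").str[0])

print(fw)
print(total_calls)
```

Because the column holds strings, min() and max() compare text rather than numbers, which explains the 1 and 50 printed in the output above.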

In [0]:

# The federer.csv file contains statistics of Roger Federer's tennis matches.
# Let's load the file into a DataFrame!
url = "https://github.com/ipython-books/cookbook-2nd-data/blob/master/federer.csv?raw=true"

df = pd.read_csv(url)
print(df.head(3))  ## print the first three rows
print('-------------')

# Notes:
# - pandas.read_csv has a large number of parameters,
#   so that it can load the CSV variants that occur in real life
# - pandas can handle missing data
#   (these appear in the table as NaN values)

# this is how to request summary information about the DataFrame
df.info()

   year          tournament  start date type       surface      draw  \
0  1998  Basel, Switzerland  05.10.1998   WS  Indoor: Hard  Draw: 32   
1  1998    Toulouse, France  28.09.1998   WS  Indoor: Hard  Draw: 32   
2  1998    Toulouse, France  28.09.1998   WS  Indoor: Hard  Draw: 32   

  atp points  atp ranking tournament prize money round  ...  \
0          1        396.0                 $9,800   R32  ...   
1         59        878.0                $10,800   R32  ...   
2         59        878.0                $10,800   R16  ...   

  player2 2nd serve return points total player2 break points converted won  \
0                                  22.0                                4.0   
1                                  19.0                                0.0   
2                                  30.0                                0.0   

  player2 break points converted total player2 return games played  \
0                                  8.0                         8.0   
1                                  1.0                         8.0   
2                                  4.0                        10.0   

  player2 total service points won player2 total service points total  \
0                             36.0                               50.0   
1                             33.0                               65.0   
2                             46.0                               75.0   

   player2 total return points won player2 total return points total  \
0                             26.0                              53.0   
1                              8.0                              41.0   
2                             23.0                              73.0   

  player2 total points won player2 total points total  
0                     62.0                      103.0  
1                     41.0                      106.0  
2                     69.0                      148.0  

[3 rows x 70 columns]
-------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 70 columns):
year                                     1179 non-null int64
tournament                               1179 non-null object
start date                               1179 non-null object
type                                     1179 non-null object
surface                                  1179 non-null object
draw                                     1179 non-null object
atp points                               1139 non-null object
atp ranking                              1177 non-null float64
tournament prize money                   1170 non-null object
round                                    1179 non-null object
opponent                                 1179 non-null object
ranking                                  1105 non-null object
score                                    1179 non-null object
stats link                               1179 non-null object
tournament.1                             1179 non-null object
tournament round                         1179 non-null object
time                                     1179 non-null int64
winner                                   1179 non-null object
player1 name                             1179 non-null object
player1 nationality                      1179 non-null object
player1 aces                             1027 non-null float64
player1 double faults                    1027 non-null float64
player1 1st serves in                    1027 non-null float64
player1 1st serves total                 1027 non-null float64
player1 1st serve points won             1027 non-null float64
player1 1st serve points total           1027 non-null float64
player1 2nd serve points won             1027 non-null float64
player1 2nd serve points total           1027 non-null float64
player1 break points won                 1027 non-null float64
player1 break points total               1027 non-null float64
player1 service games played             1027 non-null float64
player1 1st serve return points won      1027 non-null float64
player1 1st serve return points total    1027 non-null float64
player1 2nd serve return points won      1027 non-null float64
player1 2nd serve return points total    1027 non-null float64
player1 break points converted won       1027 non-null float64
player1 break points converted total     1027 non-null float64
player1 return games played              1027 non-null float64
player1 total service points won         1027 non-null float64
player1 total service points total       1027 non-null float64
player1 total return points won          1027 non-null float64
player1 total return points total        1027 non-null float64
player1 total points won                 1027 non-null float64
player1 total points total               1027 non-null float64
player2 name                             1179 non-null object
player2 nationality                      1110 non-null object
player2 aces                             1027 non-null float64
player2 double faults                    1027 non-null float64
player2 1st serves in                    1027 non-null float64
player2 1st serves total                 1027 non-null float64
player2 1st serve points won             1027 non-null float64
player2 1st serve points total           1027 non-null float64
player2 2nd serve points won             1027 non-null float64
player2 2nd serve points total           1027 non-null float64
player2 break points won                 1027 non-null float64
player2 break points total               1027 non-null float64
player2 service games played             1027 non-null float64
player2 1st serve return points won      1027 non-null float64
player2 1st serve return points total    1027 non-null float64
player2 2nd serve return points won      1027 non-null float64
player2 2nd serve return points total    1027 non-null float64
player2 break points converted won       1027 non-null float64
player2 break points converted total     1027 non-null float64
player2 return games played              1027 non-null float64
player2 total service points won         1027 non-null float64
player2 total service points total       1027 non-null float64
player2 total return points won          1027 non-null float64
player2 total return points total        1027 non-null float64
player2 total points won                 1027 non-null float64
player2 total points total               1027 non-null float64
dtypes: float64(49), int64(2), object(19)
memory usage: 644.9+ KB
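
To give an idea of the read_csv parameters mentioned in the notes, the sketch below parses a small in-memory CSV. The data, the ";" separator and the "?" missing-value marker are invented for illustration and do not describe federer.csv.

```python
import io
import pandas as pd

# Hypothetical CSV: semicolon-separated, "?" marks a missing value,
# dates are written as day.month.year.
csv_text = (
    "start date;atp points;player1 aces\n"
    "05.10.1998;1;5\n"
    "28.09.1998;?;\n"
)

small = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                     # field separator
    na_values=["?"],             # extra strings to treat as NaN
    parse_dates=["start date"],  # convert this column to datetime
    dayfirst=True,               # 05.10.1998 means 5 October
)

print(small)
print(small.dtypes)
```

Empty fields are treated as missing by default, so the last value of "player1 aces" becomes NaN without any extra parameter.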

In [0]:

# print the minimum, maximum and mean value of the columns
data = []
for c in df.columns[22:30]: # eight numeric player1 columns; a mean would be meaningless for text columns such as the date
    se = df[c]
    data.append({"column": c, "min": se.min(), "max": se.max(), "mean": se.mean()})
    # (note: pandas ignores NaN values when computing statistics)
stats = pd.DataFrame(data)
stats

# note: the statistics could also have been obtained with the describe() method
print(df["player1 aces"].describe()) # <= returns the statistics of one column (as a Series)

df.describe() # <= returns the statistics of all numeric columns (as a DataFrame)

df["player1 aces"][:10].plot()

count    1027.000000
mean        7.658228
std         4.791261
min         0.000000
25%         4.000000
50%         7.000000
75%        10.000000
max        50.000000
Name: player1 aces, dtype: float64

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x2281b39dc08>
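
The per-column loop above can also be written with the agg method, which applies several aggregations at once. The sketch uses a tiny made-up frame (toy) instead of the downloaded data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of two numeric columns, with one missing value.
toy = pd.DataFrame({
    "player1 aces": [5.0, 7.0, np.nan],
    "player1 double faults": [1.0, 3.0, 2.0],
})

# One row per aggregation, then transposed to one row per column.
stats = toy.agg(["min", "max", "mean"]).T

print(stats)
```

As in the loop, NaN values are skipped, so the mean of "player1 aces" is computed from the two available values.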

In [0]:

player = 'Roger Federer'
df['win'] = df['winner'] == player
df['win'].tail()

Out[0]:

1174    False
1175     True
1176     True
1177     True
1178    False
Name: win, dtype: bool

In [0]:

won = 100 * df['win'].mean()
print(f"{player} won {won:.0f}% of his matches.")

Roger Federer won 82% of his matches.

In [0]:

date = df['start date']
print(date)
print('------------------')
df['dblfaults'] = (df['player1 double faults'] /  df['player1 total points total'])
print(df['dblfaults'].tail())
print('------------------')
print(df['dblfaults'].describe())

0       05.10.1998
1       28.09.1998
2       28.09.1998
3       28.09.1998
4       24.08.1998
           ...    
1174    16.01.2012
1175    02.01.2012
1176    02.01.2012
1177    02.01.2012
1178    02.01.2012
Name: start date, Length: 1179, dtype: object
------------------
1174    0.018116
1175    0.000000
1176    0.000000
1177    0.011561
1178         NaN
Name: dblfaults, dtype: float64
------------------
count    1027.000000
mean        0.012129
std         0.010797
min         0.000000
25%         0.004444
50%         0.010000
75%         0.018108
max         0.060606
Name: dblfaults, dtype: float64

In [0]:

## win rate by court surface
df.groupby('surface')['win'].mean()

Out[0]:

surface
Indoor: Carpet    0.736842
Indoor: Clay      0.833333
Indoor: Hard      0.836283
Outdoor: Clay     0.779116
Outdoor: Grass    0.871429
Outdoor: Hard     0.842324
Name: win, dtype: float64
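
A win rate is easier to judge next to the number of matches it is based on. The sketch below computes both per group with named aggregation; the toy data and the column names matches and win_rate are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature of the 'surface' and 'win' columns.
toy = pd.DataFrame({
    "surface": ["Clay", "Clay", "Grass", "Grass", "Grass"],
    "win": [True, False, True, True, True],
})

# The mean of a boolean column is the share of True values.
by_surface = toy.groupby("surface")["win"].agg(matches="size", win_rate="mean")

print(by_surface)
```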

In [0]:

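# (the imports were already executed in the first cell)

# 'start date' was loaded as text (day.month.year); convert it so that
# the x axis is a real time axis and max() picks the latest date
df['start date'] = pd.to_datetime(df['start date'], dayfirst=True)
date = df['start date']
gb = df.groupby('year')

fig, ax = plt.subplots(1, 1)
# one faint marker per match
ax.plot(date, df['dblfaults'], 'o', alpha=.25, lw=0)
# yearly mean, drawn at the last tournament start date of each year
ax.plot(gb['start date'].max(), gb['dblfaults'].mean(), '-', lw=3)
ax.set_xlabel('Year')
ax.set_ylabel('Proportion of double faults per match')
ax.set_ylim(0)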

Out[0]:

(0, 0.06363636363636364)
