Weather Classification Using Decision Trees
scikit-learn
Daily Weather Analysis
Importing the Required Libraries
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image, display, HTML
CSS = """
.output {
flex-direction: row;
}
"""
HTML('<style>{}</style>'.format(CSS))
Creating a Pandas DataFrame from a CSV File
url='https://raw.githubusercontent.com/cagriemreakin/Machine-Learning/master/classification/weather/daily_weather.csv'
data = pd.read_csv(url)
About the Data
data.columns
Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am', 'relative_humidity_3pm'], dtype='object')
A first look at the values in the dataset:
data.head(10)
| | number | air_pressure_9am | air_temp_9am | avg_wind_direction_9am | avg_wind_speed_9am | max_wind_direction_9am | max_wind_speed_9am | rain_accumulation_9am | rain_duration_9am | relative_humidity_9am | relative_humidity_3pm |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 918.060000 | 74.822000 | 271.100000 | 2.080354 | 295.400000 | 2.863283 | 0.00 | 0.0 | 42.420000 | 36.160000 |
1 | 1 | 917.347688 | 71.403843 | 101.935179 | 2.443009 | 140.471548 | 3.533324 | 0.00 | 0.0 | 24.328697 | 19.426597 |
2 | 2 | 923.040000 | 60.638000 | 51.000000 | 17.067852 | 63.700000 | 22.100967 | 0.00 | 20.0 | 8.900000 | 14.460000 |
3 | 3 | 920.502751 | 70.138895 | 198.832133 | 4.337363 | 211.203341 | 5.190045 | 0.00 | 0.0 | 12.189102 | 12.742547 |
4 | 4 | 921.160000 | 44.294000 | 277.800000 | 1.856660 | 136.500000 | 2.863283 | 8.90 | 14730.0 | 92.410000 | 76.740000 |
5 | 5 | 915.300000 | 78.404000 | 182.800000 | 9.932014 | 189.000000 | 10.983375 | 0.02 | 170.0 | 35.130000 | 33.930000 |
6 | 6 | 915.598868 | 70.043304 | 177.875407 | 3.745587 | 186.606696 | 4.589632 | 0.00 | 0.0 | 10.657422 | 21.385657 |
7 | 7 | 918.070000 | 51.710000 | 242.400000 | 2.527742 | 271.600000 | 3.646212 | 0.00 | 0.0 | 80.470000 | 74.920000 |
8 | 8 | 920.080000 | 80.582000 | 40.700000 | 4.518619 | 63.000000 | 5.883152 | 0.00 | 0.0 | 29.580000 | 24.030000 |
9 | 9 | 915.010000 | 47.498000 | 163.100000 | 4.943637 | 195.900000 | 6.576604 | 0.00 | 0.0 | 88.600000 | 68.050000 |
The first step in data processing is obtaining clean data. Empty (null) values in cells can make calculations incorrect. Let's find the rows that contain null values.
data[data.isnull().any(axis=1)]
| | number | air_pressure_9am | air_temp_9am | avg_wind_direction_9am | avg_wind_speed_9am | max_wind_direction_9am | max_wind_speed_9am | rain_accumulation_9am | rain_duration_9am | relative_humidity_9am | relative_humidity_3pm |
---|---|---|---|---|---|---|---|---|---|---|---|
16 | 16 | 917.890000 | NaN | 169.200000 | 2.192201 | 196.800000 | 2.930391 | 0.000 | 0.000000 | 48.990000 | 51.190000 |
111 | 111 | 915.290000 | 58.820000 | 182.600000 | 15.613841 | 189.000000 | NaN | 0.000 | 0.000000 | 21.500000 | 29.690000 |
177 | 177 | 915.900000 | NaN | 183.300000 | 4.719943 | 189.900000 | 5.346287 | 0.000 | 0.000000 | 29.260000 | 46.500000 |
262 | 262 | 923.596607 | 58.380598 | 47.737753 | 10.636273 | 67.145843 | 13.671423 | 0.000 | NaN | 17.990876 | 16.461685 |
277 | 277 | 920.480000 | 62.600000 | 194.400000 | 2.751436 | NaN | 3.869906 | 0.000 | 0.000000 | 52.580000 | 54.030000 |
334 | 334 | 916.230000 | 75.740000 | 149.100000 | 2.751436 | 187.500000 | 4.183078 | NaN | 1480.000000 | 31.880000 | 32.900000 |
358 | 358 | 917.440000 | 58.514000 | 55.100000 | 10.021491 | NaN | 12.705819 | 0.000 | 0.000000 | 13.880000 | 25.930000 |
361 | 361 | 920.444946 | 65.801845 | 49.823346 | 21.520177 | 61.886944 | 25.549112 | NaN | 40.364018 | 12.278715 | 7.618649 |
381 | 381 | 918.480000 | 66.542000 | 90.900000 | 3.467257 | 89.400000 | 4.406772 | NaN | 0.000000 | 20.640000 | 14.350000 |
409 | 409 | NaN | 67.853833 | 65.880616 | 4.328594 | 78.570923 | 5.216734 | 0.000 | 0.000000 | 18.487385 | 20.356594 |
517 | 517 | 920.570000 | 53.600000 | 100.100000 | 4.697574 | NaN | 6.285801 | 4.712 | 14842.000000 | 79.880000 | 84.530000 |
519 | 519 | 916.250000 | 55.670000 | 176.400000 | 6.666081 | 188.200000 | NaN | 0.000 | 0.000000 | 72.550000 | 74.390000 |
546 | 546 | NaN | 42.746000 | 251.100000 | 12.929513 | 274.400000 | 17.604718 | 14.627 | 7825.000000 | 87.870000 | 70.770000 |
620 | 620 | 921.200000 | 56.786000 | 192.300000 | 9.551734 | 201.400000 | 11.005745 | NaN | 0.000000 | 59.790000 | 77.750000 |
625 | 625 | 912.400000 | 50.774000 | 171.600000 | NaN | 181.400000 | 4.831790 | 0.000 | 0.000000 | 86.840000 | 64.740000 |
656 | 656 | 920.830000 | 66.344000 | NaN | 15.457255 | 189.400000 | 16.486248 | 0.000 | 0.000000 | 23.770000 | 51.630000 |
670 | 670 | 910.920000 | 48.362000 | 156.500000 | NaN | 177.500000 | 16.128337 | 4.970 | 10560.000000 | 80.560000 | 88.220000 |
672 | 672 | 922.448945 | 72.863773 | NaN | 3.682370 | 214.196160 | 4.849450 | 0.000 | 0.000000 | 16.753670 | 17.804720 |
705 | 705 | 911.900000 | 59.072000 | 199.800000 | 1.275056 | 239.500000 | 1.834291 | NaN | 0.000000 | 77.630000 | 59.130000 |
731 | 731 | 922.970166 | 51.391847 | 33.810942 | NaN | 59.290089 | 11.111555 | 0.000 | 4.735034 | 34.807753 | 18.418179 |
737 | 737 | 917.895130 | 76.804690 | 104.771020 | 1.632705 | 97.178763 | NaN | 0.000 | 0.000000 | 13.771311 | 16.792455 |
788 | 788 | 917.923442 | 73.249717 | 42.101739 | 4.132698 | 64.284969 | 5.345258 | 0.000 | NaN | 6.939692 | 18.793825 |
840 | 840 | 918.043767 | NaN | 181.774042 | 0.964376 | 185.618601 | 1.570007 | 0.000 | 0.000000 | 11.911222 | 18.154358 |
848 | 848 | 915.250000 | 37.562000 | 246.500000 | 11.587349 | 258.700000 | NaN | 3.171 | 2891.000000 | 91.000000 | 90.780000 |
861 | 861 | 919.065408 | NaN | 172.303728 | 2.639600 | 193.058141 | 3.326949 | 0.000 | 0.000000 | 12.497839 | 13.438518 |
869 | 869 | NaN | 45.104000 | 259.000000 | 3.265932 | 275.000000 | 4.026492 | 0.000 | 80.000000 | 85.270000 | 90.260000 |
998 | 998 | 914.140000 | 71.240000 | NaN | 1.722444 | 232.900000 | 2.326418 | 0.000 | 0.000000 | 24.200000 | 41.380000 |
1031 | 1031 | 922.669195 | NaN | 47.946284 | 7.969686 | 65.770066 | 10.262337 | 0.000 | 0.000000 | 18.920805 | 19.641841 |
1035 | 1035 | 919.670000 | 77.576000 | 171.800000 | 6.554234 | 191.000000 | 8.164831 | 0.000 | NaN | 56.860000 | 50.650000 |
1063 | 1063 | 917.300185 | 65.790001 | NaN | 1.879553 | 222.498226 | 2.692862 | 0.000 | 0.000000 | 14.972668 | 20.966267 |
1066 | 1066 | 919.564869 | 73.726732 | 68.704694 | 3.551777 | 102.571616 | 4.861315 | NaN | 0.000000 | 11.657314 | 17.331823 |
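Before dropping anything, it can help to see which columns the missing values fall in. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the weather file); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "max_wind_speed_9am": [2.9, 3.5, np.nan, 5.2],
    "rain_duration_9am": [0.0, 0.0, 20.0, 0.0],
})

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

# Number of rows that contain at least one missing value.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows.
rows_dropped = len(toy) - len(toy.dropna())

print(nulls_per_column)
print(rows_with_nulls, rows_dropped)
```

On the real data, `data.isnull().sum()` would give the per-column counts behind the table above.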
Getting Rid of Unnecessary Data
We don't need the row-number column, so let's delete it.
del data['number']
Let's drop the rows with null values using the pandas dropna function.
silmeden_once = data.shape[0]  # number of rows before dropping
print(silmeden_once)
1095
data = data.dropna()
sildikten_sonra = data.shape[0]  # number of rows after dropping
print(sildikten_sonra)
1064
How many rows were removed by the cleaning step?
silmeden_once - sildikten_sonra
31
Classification
clean_data = data.copy()  # copy the cleaned dataset
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1  # label rows with 3 pm humidity above 24.99 as 1, the rest as 0
print(clean_data['high_humidity_label'])
0       1
1       0
2       0
3       0
4       1
5       1
6       0
7       1
8       0
9       1
10      1
11      1
12      1
13      1
14      0
15      0
17      0
18      1
19      0
20      0
21      1
22      0
23      1
24      0
25      1
26      1
27      1
28      1
29      1
30      1
       ..
1064    1
1065    1
1067    1
1068    1
1069    1
1070    1
1071    1
1072    0
1073    1
1074    1
1075    0
1076    0
1077    1
1078    0
1079    1
1080    0
1081    0
1082    1
1083    1
1084    1
1085    1
1086    1
1087    1
1088    1
1089    1
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int32
Store the result in 'y'.
y=clean_data[['high_humidity_label']].copy()
#y
clean_data['relative_humidity_3pm'].head()
0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64
y.head()  # values greater than 24.99 are labeled 1
| | high_humidity_label |
---|---|
0 | 1 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 1 |
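The labeling rule can be checked on a handful of values. This is a small sketch that reuses the first five `relative_humidity_3pm` readings shown above and adds two made-up boundary values (24.99 and 25.0) to show how the strict `>` comparison treats the cutoff. It uses `.astype(int)`, which gives the same 0/1 labels as multiplying the boolean result by 1.

```python
import pandas as pd

# First five relative_humidity_3pm values from the output above, plus two
# made-up boundary values (24.99 and 25.0) to show how the strict ">" behaves.
humidity = pd.Series([36.16, 19.426597, 14.46, 12.742547, 76.74, 24.99, 25.0])

THRESHOLD = 24.99  # same cutoff as in the post

# Boolean comparison converted to integers; equivalent to (humidity > THRESHOLD) * 1.
labels = (humidity > THRESHOLD).astype(int)

print(labels.tolist())
```

A reading of exactly 24.99 gets label 0, because the comparison is strict.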
Let's use the 9 am sensor readings to predict the humidity at 3 pm.
morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
'rain_duration_9am']
X = clean_data[morning_features].copy()  # copy the columns listed in morning_features from clean_data into X
X.columns
Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am'], dtype='object')
y.columns
Index(['high_humidity_label'], dtype='object')
Creating the Test and Training Sets
display(Image(filename="C:\\Users\\ceakn\\Desktop\\site-resimler\\class_apply.png", embed=True))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
The code above creates the test and training sets that I described in my previous post. Here, 33% of the dataset is set aside for testing.
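To see what a 33% split means in row counts, here is a rough sketch on placeholder arrays with the same number of rows as the cleaned data (1064). The names `X_demo` and `y_demo` and the all-zero contents are assumptions for illustration only; just the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1064  # rows left after dropping nulls

# Placeholder features and labels; only the shapes matter here.
X_demo = np.zeros((n_rows, 8))
y_demo = np.zeros(n_rows, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

# Row counts of the training and test parts.
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.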
#type(X_train)
#type(X_test)
#type(y_train)
#type(y_test)
#X_train.head()
#y_train.describe()
Fitting the Model on the Training Set
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)  # create the tree, allowing at most 10 leaf nodes
humidity_classifier.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=10, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=0, splitter='best')
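The printed parameters show `criterion='gini'`, meaning candidate splits are scored by Gini impurity. As a rough illustration of that measure only (not scikit-learn's internal implementation), the impurity of a node is 1 minus the sum of squared class proportions. The helper name `gini_impurity` is made up for this sketch.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# An evenly mixed node is as impure as a two-class node can be.
print(gini_impurity([0, 0, 1, 1]))

# A node with a single class is pure.
print(gini_impurity([1, 1, 1, 1]))
```

The tree prefers splits whose child nodes have lower impurity, weighted by size, than the parent node.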
type(humidity_classifier)
sklearn.tree.tree.DecisionTreeClassifier
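The effect of `max_leaf_nodes=10` can be seen by comparing a capped tree with an uncapped one. This sketch uses synthetic data (the arrays `X_demo` and `y_demo` are made up and unrelated to the weather file) and assumes a scikit-learn version that provides `get_n_leaves()`, which is newer than the one that produced the output above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 300 rows, 8 features; the label depends noisily on two of them.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
noise = rng.normal(scale=0.5, size=300)
y_demo = ((X_demo[:, 0] + X_demo[:, 1] + noise) > 0).astype(int)

# Same settings as in the post: at most 10 leaves.
capped = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_demo, y_demo)

# No limit: the tree keeps splitting until its leaves are pure.
uncapped = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

print(capped.get_n_leaves(), uncapped.get_n_leaves())
```

Limiting the number of leaves keeps the tree small, which usually reduces overfitting at the cost of a coarser fit.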
Making Predictions on the Test Set
predictions = humidity_classifier.predict(X_test)
predictions[:10]
array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])
y_test['high_humidity_label'][:10]
456     0
845     0
693     1
259     1
723     1
224     1
300     1
442     0
585     1
1057    1
Name: high_humidity_label, dtype: int32
Comparing the values above, you can see that two of the ten predictions differ from the actual labels.
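The comparison can also be done programmatically. This short sketch copies the ten predictions and the ten actual labels from the outputs above and lists the positions where they disagree.

```python
# First ten predictions and actual labels, copied from the outputs above.
predicted = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
actual = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1]

# Positions (0-based, within these ten rows) where the two disagree.
mismatch_positions = [
    i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a
]

print(mismatch_positions)
```

Both mismatches are cases where the model predicted low humidity (0) but the actual label was high (1).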
Computing the Accuracy
accuracy_score(y_true = y_test, y_pred = predictions)
0.81534090909090906
Roughly 81.5% accuracy
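Accuracy is simply the fraction of predictions that match the true labels. Below is a hand-rolled sketch of that calculation (the helper `accuracy` is made up for illustration and is not the scikit-learn function). It also checks the reported score against the test-set size, assuming the 33% split of 1064 rows is rounded up to 352 test rows.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# The ten actual labels and predictions shown earlier in the post.
actual = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
predicted = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
sample_accuracy = accuracy(actual, predicted)

# Assumed test-set size: 33% of 1064 rows, rounded up.
n_test = 352
reported = 0.81534090909090906

# Number of correct predictions implied by the reported score.
n_correct = round(reported * n_test)

print(sample_accuracy, n_correct)
```

Under that assumption, the reported score corresponds to 287 correct predictions out of 352.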