# load all the required packages
import pandas as pd
import numpy as np
import sklearn

# for inline plots
%pylab inline
# %matplotlib inline

# slightly nicer-looking plots (this option exists only in older pandas releases)
pd.set_option('display.mpl_style', 'default')
figsize(10, 3)

datatrain = pd.read_csv('D:\\Competitions\\Rossman\\train.csv')
datatest = pd.read_csv('D:\\Competitions\\Rossman\\test.csv')
datastore = pd.read_csv('D:\\Competitions\\Rossman\\store.csv')
datatrain['StateHoliday'] = datatrain['StateHoliday'].astype(str)  # cast right away

print('train ' + str(datatrain.shape))
print('test ' + str(datatest.shape))
print('stores ' + str(datastore.shape))

print(datatrain[-3:])
datatrain[:3]

# data = data.reindex(index=data.index[::-1])  # why: this becomes clear later
# data = data.iloc[::-1]
# data.index = range(data.shape[0])

print(datatest[-3:])
datatest[:3]

print(datastore[-3:])
datastore[:3]

# look at the unique values of every column
for col in datatrain.columns:
    in_test = col in datatest.columns
    line = str(col)
    line += ' c ' if type(datatrain[col][0]) is str else ' n '  # c = string, n = numeric
    line += 'intrain=' + str(datatrain[col].nunique())
    line += ' intest=' + (str(datatest[col].nunique()) if in_test else ' not in test')
    if datatrain[col].nunique() < 10:
        line += ' uniques=' + str(datatrain[col].unique())
        line += (' / ' + str(datatest[col].unique())) if in_test else ' '
    print(line)

print(datatrain.isnull().sum())  # any NULLs?
print(datatest.isnull().sum())  # any NULLs?

print(datatrain.dtypes)
# print data['Store'].unique().shape
# print data['DayOfWeek'].unique()
# print data['Date'].unique().shape
# print data['Sales'].unique().shape
# print data['Customers'].unique().shape
# print data['Open'].unique()
# print data['Promo'].unique()
# print data['StateHoliday'].unique()
# print data['SchoolHoliday'].unique()

datatrain.Date = pd.to_datetime(datatrain.Date)
datatest.Date = pd.to_datetime(datatest.Date)

datatrain['Month'] = datatrain.Date.map(lambda x: x.month)  # assigning it any other way did not work
datatrain['DayOfYear'] = datatrain.Date.map(lambda x: x.dayofyear)
datatrain['Year'] = datatrain.Date.map(lambda x: x.year)
datatrain['Week'] = datatrain.Date.map(lambda x: x.week)

datatest['Month'] = datatest.Date.map(lambda x: x.month)
datatest['DayOfYear'] = datatest.Date.map(lambda x: x.dayofyear)
datatest['Year'] = datatest.Date.map(lambda x: x.year)
datatest['Week'] = datatest.Date.map(lambda x: x.week)

datatrain.DayOfYear.plot(color='blue')
datatest.DayOfYear.plot(color='red')

pd.crosstab(datatest.Date.map(lambda x: x.dayofweek), datatest.DayOfWeek)

pd.DataFrame({'DayOfYear': datatest.DayOfYear, 'DayOfWeek': datatest.DayOfWeek}).plot(kind='scatter', x='DayOfYear', y='DayOfWeek')

datatrain.Date[0]

# split the training data by year: data1 = 2015, data2 = 2014, data3 = 2013
data1 = datatrain[datatrain.Date >= pd.Timestamp('2015-01-01 00:00:00')]
# The comparisons below were damaged in the source. The bounds follow from the
# 2015 / 2014 / 2013 legends used later; the three plain monthly plots are
# assumed by analogy with the weekly plots that follow.
data2 = datatrain[(datatrain.Date < pd.Timestamp('2015-01-01 00:00:00')) & (datatrain.Date >= pd.Timestamp('2014-01-01 00:00:00'))]
data3 = datatrain[(datatrain.Date < pd.Timestamp('2014-01-01 00:00:00'))]

data1.groupby('Month')['Sales'].mean().plot(color='g')
data2.groupby('Month')['Sales'].mean().plot(color='b')
data3.groupby('Month')['Sales'].mean().plot(color='r')
data1[data1.Sales > 0].groupby('Month')['Sales'].mean().plot(color='g', style='--')
data2[data2.Sales > 0].groupby('Month')['Sales'].mean().plot(color='b', style='--')
data3[data3.Sales > 0].groupby('Month')['Sales'].mean().plot(color='r', style='--')

data1.groupby('Week')['Sales'].mean().plot(color='g')
data2.groupby('Week')['Sales'].mean().plot(color='b')
data3.groupby('Week')['Sales'].mean().plot(color='r')
data1[data1.Sales > 0].groupby('Week')['Sales'].mean().plot(color='g', style='--')
data2[data2.Sales > 0].groupby('Week')['Sales'].mean().plot(color='b', style='--')
data3[data3.Sales > 0].groupby('Week')['Sales'].mean().plot(color='r', style='--')

# number of distinct stores per month, by year and in the test set
data1.groupby('Month')['Store'].nunique().plot(color='g')
data2.groupby('Month')['Store'].nunique().plot(color='b')
data3.groupby('Month')['Store'].nunique().plot(color='r')
datatest.groupby('Month')['Store'].nunique().plot(color='k')

# how many test stores also appear in each training year
print(len(np.intersect1d(datatest.Store.unique(), data1.Store.unique())))
print(len(np.intersect1d(datatest.Store.unique(), data2.Store.unique())))
print(len(np.intersect1d(datatest.Store.unique(), data3.Store.unique())))
print(len(np.intersect1d(datatest.Store.unique(), data2[data2.Month == 12].Store.unique())))

data1.groupby('DayOfYear')['Sales'].mean().plot(color='g')
data2.groupby('DayOfYear')['Sales'].mean().plot(color='b')
data3.groupby('DayOfYear')['Sales'].mean().plot(color='r')

figsize(15, 5)
ax = data1['Sales'].plot(kind='kde', color='g')
data2['Sales'].plot(kind='kde', color='b')
data3['Sales'].plot(kind='kde', color='r')
ax.set_xlim(-2000, 20000)

figsize(15, 5)
ax = data3['Sales'].hist(bins=500, color='r')
data2['Sales'].hist(bins=500, color='b')
data1['Sales'].hist(bins=500, color='g')
ax.set_ylim(0, 5000)
ax.set_xlim(-1, 25000)

figsize(15, 5)
ax = data1.groupby('Week')['Promo'].mean().plot(color='g')
data2.groupby('Week')['Promo'].mean().plot(color='b')
data3.groupby('Week')['Promo'].mean().plot(color='r')
ax.legend(['2015', '2014', '2013'])

np.array([1, 2, 3, 4, 5, 6, 7]) / 7.0

figsize(15, 5)
ax = data1[data1.Sales > 0].groupby('Date')['Promo'].mean().plot(color='g')
data2[data2.Sales > 0].groupby('Date')['Promo'].mean().plot(color='b')
data3[data3.Sales > 0].groupby('Date')['Promo'].mean().plot(color='r')
ax.legend(['2015', '2014', '2013'])

# sales density for each day of the week (day 7 is plotted first, hence the legend order)
cols = ['r', 'b', 'g', 'y', 'c', 'm', 'k']
ax = datatrain[datatrain['DayOfWeek'] == 7]['Sales'].plot(kind='kde', color=cols[6])
for d in range(6):
    datatrain[datatrain['DayOfWeek'] == (d + 1)]['Sales'].plot(kind='kde', color=cols[d])
ax.set_xlim(-500, 15000)
ax.set_ylim(0, 0.0003)
ax.legend(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])

figsize(15, 5)
ax = datatrain[datatrain['DayOfWeek'] == 6]['Sales'].hist(bins=500, color='m')
datatrain[datatrain['DayOfWeek'] == 3]['Sales'].hist(bins=500, color='g')
datatrain[datatrain['DayOfWeek'] == 7]['Sales'].hist(bins=500, color='k')
ax.set_ylim(0, 2000)
ax.set_xlim(-1, 20000)

ax = datatrain[datatrain.Sales > 0].Sales.hist(bins=200)
ax.set_title('All purchases')

ax = datatrain[datatrain.Sales > 0].Sales.apply(lambda x: np.log(x + 1.0)).hist(bins=200)
ax.set_title('Logarithm of all purchases')

cols = ['r', 'b', 'g', 'y', 'c', 'm', 'k']
for i, s in enumerate(datatrain['Store'].unique()[:7]):
    data1[data1['Store'] == s]['Sales'].plot(color=cols[i])

tmp = data1[data1['Store'] == 1].copy()  # copy, so the scaling below does not touch data1
tmp[:5]

# scale the flags so they are visible next to Sales on one plot
tmp['Promo'] *= 1000
tmp['StateHoliday'] = (tmp['StateHoliday'] != '0') * 2000  # string column, so scale a 0/1 indicator
tmp['SchoolHoliday'] *= 3000
tmp.plot()

data1.groupby('DayOfWeek')['Sales'].mean().plot(color='g')
data2.groupby('DayOfWeek')['Sales'].mean().plot(color='b')
data3.groupby('DayOfWeek')['Sales'].mean().plot(color='r')

data1[data1['Month'] == 1].groupby('DayOfWeek')['Sales'].mean().plot(color='g')
data1[data1['Month'] == 2].groupby('DayOfWeek')['Sales'].mean().plot(color='r')
data1[data1['Month'] == 3].groupby('DayOfWeek')['Sales'].mean().plot(color='b')
data1[data1['Month'] == 4].groupby('DayOfWeek')['Sales'].mean().plot(color='m')
data1[data1['Month'] == 5].groupby('DayOfWeek')['Sales'].mean().plot(color='c')
data1[data1['Month'] == 6].groupby('DayOfWeek')['Sales'].mean().plot(color='y')
data1[data1['Month'] == 7].groupby('DayOfWeek')['Sales'].mean().plot(color='k')

ax = data1[data1['Store'] == 2].plot(kind='scatter', x='Sales', y='Customers', color='b')
data2[data2['Store'] == 2].plot(kind='scatter', x='Sales', y='Customers', color='r', ax=ax)
data3[data3['Store'] == 2].plot(kind='scatter', x='Sales', y='Customers', color='g', ax=ax)

ax = data2[data2['Store'] == 2].plot(kind='scatter', x='Sales', y='Customers', color='r')
data2[data2['Store'] == 3].plot(kind='scatter', x='Sales', y='Customers', color='k', ax=ax)
data2[data2['Store'] == 4].plot(kind='scatter', x='Sales', y='Customers', color='m', ax=ax)
data2[data2['Store'] == 5].plot(kind='scatter', x='Sales', y='Customers', color='y', ax=ax)

for m in range(7):
    print(data1[data1['Month'] == m + 1].groupby('Promo')['Sales'].mean())

for y in [2013, 2014, 2015]:
    print(datatrain[datatrain['Year'] == y].groupby('Open')['Sales'].mean())

for y in [2013, 2014, 2015]:
    print(datatrain[datatrain['Year'] == y].groupby('StateHoliday')['Sales'].mean())

for y in [2013, 2014, 2015]:
    print(datatrain[datatrain['Year'] == y].groupby('SchoolHoliday')['Sales'].mean())

# error metric: root mean squared percentage error
def rmspe(y, a):
    # rows with y = 0 are ignored
    return np.mean((((y - a) / y)[y > 0]) ** 2) ** 0.5

# error metric: RMSE divided by the mean of the target
def prmse(y, a):
    # rows with y = 0 are ignored
    return (np.mean(((y - a)[y > 0]) ** 2) ** 0.5) / np.mean(y[y > 0])

# error metric: symmetric mean absolute percentage error
def smape(y, a):
    # rows with y = 0 are ignored
    return np.mean((2 * np.abs(y - a) / np.abs(y + a))[y > 0])

# hold out everything after 2015-06-13 as a local validation set
test = datatrain[datatrain.Date > pd.Timestamp('2015-06-13 00:00:00')].copy()
train = datatrain[datatrain.Date <= pd.Timestamp('2015-06-13 00:00:00')].copy()
print(train.shape, test.shape)

ytrain = train.Sales.values
ytest = test.Sales.values

# constant forecasts
print(rmspe(ytest, 0))
print(rmspe(ytest, ytrain[ytrain > 0].mean()))
print(rmspe(ytest, 1.1 * ytrain[ytrain > 0].mean()))
print(rmspe(ytest, 0.9 * ytrain[ytrain > 0].mean()))
print(rmspe(ytest, 0.7 * ytrain[ytrain > 0].mean()))
print(rmspe(ytest, np.median(ytrain[ytrain > 0])))
print(rmspe(ytest, 0.75 * np.median(ytrain[ytrain > 0])))

# scan a multiplier applied to the median of the non-zero training sales
l = np.linspace(0.4, 1.2, 1001)
e = []
e2 = []
e3 = []
for x in l:
    a = np.median(x * ytrain[ytrain > 0])  # np.mean(y[y>0])
    e.append(rmspe(ytest, a))
    e2.append(prmse(ytest, a))
    e3.append(smape(ytest, a))

figsize(15, 5)
tmp = pd.DataFrame({'rmspe': e, 'prmse': e2, 'smape': e3})
tmp.index = l
ax = tmp.plot()
ax.set_title('Quality of the constant solution')
ax.set_xlabel('multiplier')
ax.set_ylabel('error')
print('min rmspe = ' + str(min(e)))
print('min prmse = ' + str(min(e2)))
print('min smape = ' + str(min(e3)))

train[:3]
test[:3]

# forecast = mean sales for the same day of the week
st = train.groupby('DayOfWeek')['Sales'].mean()
a = test['DayOfWeek'].apply(lambda x: st[x]).values
print(rmspe(ytest, a))

def investlina(a, x1=0.6, x2=1.2):
    # scan multipliers in [x1, x2] applied to the forecast a and plot all three errors
    l = np.linspace(x1, x2, 1001)
    e = []
    e2 = []
    e3 = []
    for x in l:
        a2 = x * a
        e.append(rmspe(ytest, a2))
        e2.append(prmse(ytest, a2))
        e3.append(smape(ytest, a2))
    tmp = pd.DataFrame({'rmspe': e, 'prmse': e2, 'smape': e3})
    tmp.index = l
    ax = tmp.plot()
    ax.set_title('Quality of the constant solution')
    ax.set_xlabel('multiplier')
    ax.set_ylabel('error')
    print('min rmspe = ' + str(min(e)))
    print('min prmse = ' + str(min(e2)))
    print('min smape = ' + str(min(e3)))

investlina(a)

st = train.groupby(['DayOfWeek', 'Store'])['Sales'].mean()  # grouping by two features
a = test[['DayOfWeek', 'Store']].apply(lambda x: st[x[0], x[1]], axis=1).values  # only this way works...
print(rmspe(ytest, a))
investlina(a)

test['Forecast'] = a
ax = test[test.Store == 2][['Sales', 'Forecast']].plot()
test[test.Store == 1000][['Sales', 'Forecast']].plot()

st = train[train.Year == 2015].groupby(['DayOfWeek', 'Store'])['Sales'].mean()  # grouping by two features, 2015 only
a = test[['DayOfWeek', 'Store']].apply(lambda x: st[x[0], x[1]], axis=1).values  # only this way works...
print(rmspe(ytest, a))
investlina(a)

st = train.groupby(['DayOfWeek', 'Store', 'Promo'])['Sales'].mean()  # grouping by three features
a = test[['DayOfWeek', 'Store', 'Promo']].apply(lambda x: st[x[0], x[1], x[2]], axis=1).values  # only this way works...
print(rmspe(ytest, a))
investlina(a)

figsize(15, 5)
test['Forecast'] = a
ax = test[test.Store == 2][['Sales', 'Forecast']].plot()
test[test.Store == 1000][['Sales', 'Forecast']].plot()

st = train[train.Year == 2015].groupby(['DayOfWeek', 'Store'])['Sales'].mean()
tmp = pd.DataFrame(st)
tmp.T.stack(0)

# VERY SLOW...
def takemean(st, x1, x2, x3, x4, x5):
    # group mean if the combination was seen in train, otherwise a fixed fallback value
    if st.index.isin([(x1, x2, x3, x4, x5)]).any():
        return st[(x1, x2, x3, x4, x5)]
    else:
        return 5742.0

st = train.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'SchoolHoliday'])['Sales'].mean()  # grouping by five features
a = test[['DayOfWeek', 'Store', 'Open', 'Promo', 'SchoolHoliday']].apply(lambda x: takemean(st, x[0], x[1], x[2], x[3], x[4]), axis=1).values  # only this way works...
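The row-wise `apply` with `takemean` above is slow because every test row triggers its own index lookup. One possible alternative, sketched here on the assumption that a left `merge` on the grouping columns is acceptable, computes the group means once and joins them onto the test frame, filling unseen combinations with a fallback value. The helper name `group_mean_forecast`, the column name `GroupMean` and the toy frames are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def group_mean_forecast(train, test, feats, target='Sales', fallback=5742.0):
    """Forecast each test row by the mean target of its feature group in train.

    Hypothetical helper: combinations missing from train get `fallback`.
    """
    means = train.groupby(feats)[target].mean().reset_index()
    means = means.rename(columns={target: 'GroupMean'})
    # A left merge keeps every test row, in the original row order.
    merged = test[feats].merge(means, on=feats, how='left')
    return merged['GroupMean'].fillna(fallback).values


# Toy data, made up for illustration only.
toy_train = pd.DataFrame({
    'DayOfWeek': [1, 1, 2, 2],
    'Store': [1, 1, 1, 2],
    'Sales': [100.0, 300.0, 50.0, 70.0],
})
toy_test = pd.DataFrame({'DayOfWeek': [1, 2, 3], 'Store': [1, 2, 1]})

forecast = group_mean_forecast(toy_train, toy_test, ['DayOfWeek', 'Store'])
print(forecast)
```

The last toy row (day 3, store 1) has no match in the toy training data, so it receives the fallback, which plays the same role as the `else` branch of `takemean`.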
print(rmspe(ytest, a))

st = train.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'])['Sales'].mean()
test['dummy'] = 5742.0
st2 = test.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'])['dummy'].mean()
np.sum(~st2.index.isin(st.index))  # the test set contains combinations not seen in train
st = pd.concat([st, st2[~st2.index.isin(st.index)]])  # add them with the fallback value
np.sum(~st2.index.isin(st.index))  # hooray, no unseen combinations left
st[:5]

a = test[['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']].apply(lambda x: st[x[0], x[1], x[2], x[3], x[4], x[5]], axis=1).values  # only this way works...
print(rmspe(ytest, a))
investlina(a)

test['Forecast'] = a
ax = test[test.Store == 2][['Sales', 'Forecast']].plot()
test[test.Store == 1000][['Sales', 'Forecast']].plot()

# a single holiday flag: state holiday or school holiday
train['Holiday'] = (train.StateHoliday != '0') | (train.SchoolHoliday != 0)
test['Holiday'] = (test.StateHoliday != '0') | (test.SchoolHoliday != 0)

feats = ['DayOfWeek', 'Store', 'Promo', 'Holiday']  # , 'SchoolHoliday', 'StateHoliday'
st = train[train.Open > 0].groupby(feats)['Sales'].mean()
test['dummy'] = 6200.0
st2 = test.groupby(feats)['dummy'].mean()
t = (~st2.index.isin(st.index))
print('missing index values: ' + str(np.sum(t)) + ' fraction: ' + str(np.mean(t)))
st = pd.concat([st, st2[~st2.index.isin(st.index)]])
a = test[feats].apply(lambda x: st[x[0], x[1], x[2], x[3]], axis=1).values  # how else could this be done?
print('error = ' + str(rmspe(ytest, a)))
investlina(a)

def mymean(x):
    # weighted mean: weights fall quadratically from the first row of the group to the last
    w = np.linspace(1, 0.1, len(x)) ** 2.0
    w = w / w.sum()
    return np.dot(x, w)

st = train[train.Open > 0].groupby(feats)['Sales'].apply(mymean)
test['dummy'] = 6200.0
st2 = test.groupby(feats)['dummy'].apply(mymean)
t = (~st2.index.isin(st.index))
print('missing index values: ' + str(np.sum(t)) + ' fraction: ' + str(np.mean(t)))
st = pd.concat([st, st2[~st2.index.isin(st.index)]])
a = test[feats].apply(lambda x: st[x[0], x[1], x[2], x[3]], axis=1).values  # how else could this be done?
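To see what `mymean` does to a group, the sketch below prints its weights for a short toy series and checks the result against `np.average`. The weights fall quadratically from the first row to the last, so whichever rows come first in a group dominate the result. Whether those are the most recent days depends on the row order of the input; that the training file is sorted newest-first is an assumption here and should be checked against the actual data.

```python
import numpy as np


def mymean(x):
    # Same weighting as in the notebook: squared linear ramp from 1 down to 0.1.
    w = np.linspace(1, 0.1, len(x)) ** 2.0
    w = w / w.sum()
    return np.dot(x, w)


# Toy values, made up for illustration; the first row gets the largest weight.
sales = np.array([10.0, 20.0, 30.0, 40.0])

weights = np.linspace(1, 0.1, len(sales)) ** 2.0
weights = weights / weights.sum()
print(weights)

weighted = mymean(sales)
reference = np.average(sales, weights=weights)
print(weighted, reference, sales.mean())
```

On this rising toy series the weighted mean lands below the plain mean, because the small early values carry most of the weight.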
print('error = ' + str(rmspe(ytest, a)))
investlina(a)

# reload the full data to build the submission
train = pd.read_csv('D:\\Competitions\\Rossman\\train.csv')
test = pd.read_csv('D:\\Competitions\\Rossman\\test.csv')

test[:3]
test.isnull().sum()
test = test.fillna(1)
train[:3]

st = train.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'])['Sales'].mean()
test['dummy'] = 5742.0
st2 = test.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'])['dummy'].mean()
np.sum(~st2.index.isin(st.index))  # the test set contains combinations not seen in train
st = pd.concat([st, st2[~st2.index.isin(st.index)]])  # add them with the fallback value
np.sum(~st2.index.isin(st.index))  # hooray, no unseen combinations left

a = test[['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']].apply(lambda x: st[x[0], x[1], x[2], x[3], x[4], x[5]], axis=1).values  # only this way works...

forout = pd.DataFrame({'Id': test['Id'], 'Sales': a})
forout[:3]

forout.to_csv('D:\\Competitions\\Rossman\\trivial_10.csv', index=False)  # 0.16046
pd.DataFrame({'Id': test['Id'], 'Sales': 0.95 * a}).to_csv('D:\\Competitions\\Rossman\\trivial_095.csv', index=False)  # 0.1459
pd.DataFrame({'Id': test['Id'], 'Sales': 0.9 * a}).to_csv('D:\\Competitions\\Rossman\\trivial_09.csv', index=False)  # 0.14990

# the same submission with the weighted mean instead of the plain mean
st = train.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'])['Sales'].apply(mymean)
test['dummy'] = 5742.0
st2 = test.groupby(['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday'])['dummy'].apply(mymean)
st = pd.concat([st, st2[~st2.index.isin(st.index)]])
a = test[['DayOfWeek', 'Store', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday']].apply(lambda x: st[x[0], x[1], x[2], x[3], x[4], x[5]], axis=1).values  # only this way works...
pd.DataFrame({'Id': test['Id'], 'Sales': 0.95 * a}).to_csv('D:\\Competitions\\Rossman\\trivialmymean_095.csv', index=False)  # 0.14713
pd.DataFrame({'Id': test['Id'], 'Sales': 0.9 * a}).to_csv('D:\\Competitions\\Rossman\\trivialmymean_09.csv', index=False)  # 0.14120
pd.DataFrame({'Id': test['Id'], 'Sales': 0.85 * a}).to_csv('D:\\Competitions\\Rossman\\trivialmymean_085.csv', index=False)  # 0.15537
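The submissions above scale the forecast by 0.95, 0.9 and 0.85. On a local validation set the RMSPE-optimal multiplier does not need a grid scan: minimising the sum of ((y - c*a) / y)^2 over c gives c = sum(a/y) / sum((a/y)^2), taken over rows with y > 0. The sketch below checks this closed form against a grid scan on made-up numbers. `best_multiplier` is a hypothetical helper, not part of the notebook, and the value it returns on a validation split is only an estimate of what works on the real test set.

```python
import numpy as np


def rmspe(y, a):
    # Rows with y == 0 are ignored, as in the notebook's metric.
    mask = y > 0
    return np.mean(((y[mask] - a[mask]) / y[mask]) ** 2) ** 0.5


def best_multiplier(y, a):
    # Hypothetical helper: closed-form minimiser of rmspe(y, c * a) over c.
    mask = y > 0
    r = a[mask] / y[mask]
    return r.sum() / (r ** 2).sum()


# Toy validation data, made up for illustration only.
y = np.array([100.0, 200.0, 0.0, 400.0, 800.0])
a = np.full_like(y, 300.0)  # a constant forecast

c_star = best_multiplier(y, a)

# Cross-check with a grid scan similar to the one in investlina.
grid = np.linspace(0.1, 1.5, 1401)
errors = [rmspe(y, c * a) for c in grid]
c_grid = grid[int(np.argmin(errors))]

print(c_star, c_grid)
```

Because the error is measured relative to y, over-forecasting small sales is penalised heavily, which is one plausible reason why multipliers below 1 helped in the submissions above.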