Berdasarkan isu #79: request: ambil dataset hujan jam-jaman dari excel
Referensi isu:
hidrokit.contrib.taruma.hk43
#43. (view notebook / manual). (Menggunakan fungsi _get_years()
untuk memperoleh list
tahun pada berkas excel.Deskripsi permasalahan:
pandas.DataFrame
Strategi penyelesaian masalah:
hk43._get_years()
)pandas.DataFrame
Catatan:
Untuk prapemrosesan tabel seperti (cek data yang hilang, dlsbnya) akan dikembangkan dengan modul yang terpisah karena beberapa fungsi sudah tersedia di modul hidrokit.contrib.taruma.hk73
(untuk mengolah berkas dari bmkg).
# Install hidrokit 0.3.x
!pip install hidrokit -q
# Download sample excel
!wget -O sample.xlsx "https://taruma.github.io/assets/hidrokit_dataset/hidrokit_hourly_template.xlsx" -q
# Import library
import pandas as pd
import numpy as np
from calendar import monthrange
from hidrokit.contrib.taruma import hk43
# ref: https://www.reddit.com/r/learnpython/comments/485h1p/how_do_i_check_if_an_object_is_a_collection_in/
from collections.abc import Sequence
def _index_hourly(year, freq='60min'):
"""Return object DatetimeIndex"""
if isinstance(year, Sequence):
year_start, year_end = year
else:
year_start, year_end = year, year
period = '{}0101 00:00,{}1231 23:00'.format(year_start, year_end).split(',')
return pd.date_range(*period, freq=freq)
def _melt_to_array(df):
return df.melt().drop('variable', axis=1)['value'].values
def _get_array_in_month(df, year, month):
n_days = monthrange(year, month)[1]
mask_month = slice(None, n_days)
df_month = df.iloc[mask_month, :].T
return _melt_to_array(df_month)
def _get_year(df, loc=(0,1)):
return df.iloc[loc]
def _get_array_in_year(df, year):
n_rows, _ = df.shape
# configuration (view the excel)
n_month = 1 # number of row to monthID
n_gap = 2 # number of row between month pivot table
n_lines = 31 + n_gap # number of row each month
data = []
for row in range(1, n_rows, n_lines):
mask_start = row + n_month
mask_end = row + n_lines
month = df.iloc[mask_start, 1]
mask_row = slice(mask_start, mask_end)
df_month = df.iloc[mask_row, 4:]
array_month = _get_array_in_month(df_month, year, month)
data.append(array_month)
return np.hstack(data)
def _get_info(file, config_sheet=None):
excel = pd.ExcelFile(file)
first_sheet = excel.sheet_names[0]
config_sheet = first_sheet if config_sheet is None else config_sheet
df = pd.read_excel(
excel, sheet_name=config_sheet, header=None, usecols='A:B'
)
info = {}
for index, _ in df.iterrows():
key = df.iloc[index, 0].lower()
value = df.iloc[index, 1]
info[str(key)] = value
return info
def read_excel_hourly(file, station=None):
excel = pd.ExcelFile(file)
# CONFIG
years = hk43._get_years(excel)
station = 'NA' if station is None else station
# READ DATA
data = []
for year in years:
sheet = pd.read_excel(
excel, sheet_name=str(year),
header=None, nrows=396,
usecols='A:AB'
)
array = _get_array_in_year(sheet, year)
df_year = pd.DataFrame(
data=array,
columns=[station],
index=_index_hourly(year)
)
data.append(df_year)
return pd.concat(data, axis=0)
_index_hourly(year, freq='60min')
¶Tujuan: membuat index menggunakan perintah pd.date_range()
dengan input year
yang berupa bilangan ataupun sequence.
_index_hourly(2000) # jika bilangan harus dalam bentuk integer
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00', '2000-01-01 02:00:00', '2000-01-01 03:00:00', '2000-01-01 04:00:00', '2000-01-01 05:00:00', '2000-01-01 06:00:00', '2000-01-01 07:00:00', '2000-01-01 08:00:00', '2000-01-01 09:00:00', ... '2000-12-31 14:00:00', '2000-12-31 15:00:00', '2000-12-31 16:00:00', '2000-12-31 17:00:00', '2000-12-31 18:00:00', '2000-12-31 19:00:00', '2000-12-31 20:00:00', '2000-12-31 21:00:00', '2000-12-31 22:00:00', '2000-12-31 23:00:00'], dtype='datetime64[ns]', length=8784, freq='60T')
_index_hourly(['2000', 2001]) # jika dalam seq-object bisa berupa integer atau string
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00', '2000-01-01 02:00:00', '2000-01-01 03:00:00', '2000-01-01 04:00:00', '2000-01-01 05:00:00', '2000-01-01 06:00:00', '2000-01-01 07:00:00', '2000-01-01 08:00:00', '2000-01-01 09:00:00', ... '2001-12-31 14:00:00', '2001-12-31 15:00:00', '2001-12-31 16:00:00', '2001-12-31 17:00:00', '2001-12-31 18:00:00', '2001-12-31 19:00:00', '2001-12-31 20:00:00', '2001-12-31 21:00:00', '2001-12-31 22:00:00', '2001-12-31 23:00:00'], dtype='datetime64[ns]', length=17544, freq='60T')
_get_array_in_month(df, year, month)
¶Tujuan: mengambil pivot tabel satu bulan dan mengubahnya (melt) ke bentuk tabel biasa.
_get_year(df, loc)
¶Tujuan: melakukan perintah df.iloc[loc]
_get_array_in_year(df, year)
¶Tujuan: serupa dengan _get_array_in_month()
, fungsi ini mengambil seluruh informasi pada sheet tunggal.
_get_info(file, config_sheet=None)
¶Tujuan: mengambil nilai pada sheet pengaturan (sheet pertama pada file) dan mengubahnya ke dalam bentuk dictionary
.
info_file = _get_info('sample.xlsx', config_sheet='_INFO')
info_file
{'key': 'VALUE', 'station_name': 'AURENE'}
info_file['station_name']
'AURENE'
read_excel_hourly(file, station=None)
¶Tujuan: membaca data jam-jaman yang terdapat pada file lalu mengubahnya ke dalam bentuk pandas.DataFrame
dengan index yang sesuai dengan tahun kejadiannya.
data = read_excel_hourly('sample.xlsx', station=info_file['station_name'])
data
AURENE | |
---|---|
2000-01-01 00:00:00 | - |
2000-01-01 01:00:00 | NaN |
2000-01-01 02:00:00 | NaN |
2000-01-01 03:00:00 | NaN |
2000-01-01 04:00:00 | NaN |
... | ... |
2002-12-31 19:00:00 | - |
2002-12-31 20:00:00 | - |
2002-12-31 21:00:00 | - |
2002-12-31 22:00:00 | - |
2002-12-31 23:00:00 | - |
26304 rows × 1 columns
data.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 26304 entries, 2000-01-01 00:00:00 to 2002-12-31 23:00:00 Freq: 60T Data columns (total 1 columns): AURENE 10918 non-null object dtypes: object(1) memory usage: 411.0+ KB
Catatan: data masih berupa object
, dan belum diubah ke bentuk angka. Prapemrosesan ini serupa pada modul hk73 (untuk membaca berkas bmkg).
data.sample(n=20) # menampilkan sampel 20 baris dalam data secara acak
AURENE | |
---|---|
2002-07-26 01:00:00 | - |
2001-11-09 02:00:00 | NaN |
2000-08-16 00:00:00 | - |
2000-11-26 10:00:00 | NaN |
2001-05-21 09:00:00 | NaN |
2000-09-16 08:00:00 | NaN |
2001-06-30 02:00:00 | NaN |
2000-11-26 00:00:00 | - |
2001-08-22 19:00:00 | NaN |
2001-10-27 18:00:00 | NaN |
2002-03-31 07:00:00 | - |
2000-10-01 06:00:00 | NaN |
2002-07-14 09:00:00 | - |
2000-11-06 16:00:00 | NaN |
2001-03-11 07:00:00 | NaN |
2000-10-04 04:00:00 | NaN |
2002-01-25 15:00:00 | - |
2001-06-27 13:00:00 | NaN |
2002-07-15 07:00:00 | - |
2000-08-08 15:00:00 | NaN |
data['2001'] # menampilkan data pada tahun tertentu
AURENE | |
---|---|
2001-01-01 00:00:00 | NaN |
2001-01-01 01:00:00 | NaN |
2001-01-01 02:00:00 | NaN |
2001-01-01 03:00:00 | NaN |
2001-01-01 04:00:00 | NaN |
... | ... |
2001-12-31 19:00:00 | NaN |
2001-12-31 20:00:00 | NaN |
2001-12-31 21:00:00 | NaN |
2001-12-31 22:00:00 | NaN |
2001-12-31 23:00:00 | NaN |
8760 rows × 1 columns
data['20020101 07:00': '20020101 16:00'] # menampilkan data diantara jam 7.00 sampai 16.00 pada tanggal 1 januari 2002
AURENE | |
---|---|
2002-01-01 07:00:00 | 0.3 |
2002-01-01 08:00:00 | - |
2002-01-01 09:00:00 | - |
2002-01-01 10:00:00 | - |
2002-01-01 11:00:00 | - |
2002-01-01 12:00:00 | - |
2002-01-01 13:00:00 | 0.5 |
2002-01-01 14:00:00 | - |
2002-01-01 15:00:00 | 1.1 |
2002-01-01 16:00:00 | - |
- 20191129 - 1.0.0 - Initial
Source code in this notebook is licensed under a MIT License. Data in this notebook is licensed under a Creative Common Attribution 4.0 International.