노트북 작성: 박상길 skpark1224@hyundai.com
inspired by a book called
『Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale』
런던의 전체 자전거 대여소의 사용 횟수와 평균 사용 시간을 추출해보자.
%%bigquery df
SELECT
start_station_name
, AVG(duration) as duration
, COUNT(duration) as num_trips
FROM `bigquery-public-data`.london_bicycles.cycle_hire
GROUP BY start_station_name
ORDER BY num_trips DESC
df
start_station_name | duration | num_trips | |
---|---|---|---|
0 | Belgrove Street , King's Cross | 1011.076696 | 234458 |
1 | Hyde Park Corner, Hyde Park | 2782.730709 | 215629 |
2 | Waterloo Station 3, Waterloo | 866.376135 | 201630 |
3 | Black Lion Gate, Kensington Gardens | 3588.012004 | 161952 |
4 | Albert Gate, Hyde Park | 2359.413930 | 155647 |
... | ... | ... | ... |
875 | Monier Road | 1576.363636 | 11 |
876 | Blackfriars road, Southwark | 96.000000 | 5 |
877 | LSP2 | 105.000000 | 4 |
878 | PENTON STREET COMMS TEST TERMINAL _ CONTACT MA... | 0.000000 | 1 |
879 | LSP1 | 840.000000 | 1 |
880 rows × 3 columns
df.describe()
duration | num_trips | |
---|---|---|
count | 880.000000 | 880.000000 |
mean | 1348.351153 | 27692.273864 |
std | 434.057829 | 23733.621289 |
min | 0.000000 | 1.000000 |
25% | 1078.684974 | 13033.500000 |
50% | 1255.889223 | 23658.500000 |
75% | 1520.504055 | 35450.500000 |
max | 4836.380090 | 234458.000000 |
다음과 같이 산점도(scatter plot)로 표현할 수 있다.
df.plot.scatter('duration', 'num_trips')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4585f20490>
만일 누군가 어떤 대여소에서 자전거를 빌린 지 10분도 채 되지 않아 다시 원래 대여소로 되돌려 놓았다면 아마도 그 자전거에 문제가 있었기 때문이라고 추정할 수 있다. 이런 상황을 망친 나들이(bad trips)라 부르기로 하고 추출을 시도해보자.
%%bigquery badtrips
WITH all_bad_trips AS (
SELECT
start_station_name
, COUNTIF(duration < 600 AND start_station_name = end_station_name) AS bad_trips
, COUNT(*) as num_trips
FROM `bigquery-public-data`.london_bicycles.cycle_hire
WHERE EXTRACT(YEAR FROM start_date) = 2015
GROUP BY start_station_name
HAVING num_trips > 10
)
SELECT *, bad_trips / num_trips AS fraction_bad FROM all_bad_trips
ORDER BY fraction_bad DESC
badtrips
start_station_name | bad_trips | num_trips | fraction_bad | |
---|---|---|---|---|
0 | Contact Centre, Southbury House | 20 | 48 | 0.416667 |
1 | Monier Road, Newham | 1 | 25 | 0.040000 |
2 | Aberfeldy Street, Poplar | 35 | 955 | 0.036649 |
3 | Ormonde Gate, Chelsea | 315 | 8932 | 0.035266 |
4 | Thornfield House, Poplar | 28 | 947 | 0.029567 |
... | ... | ... | ... | ... |
818 | Sun Street, Liverpool Street | 70 | 28787 | 0.002432 |
819 | Finsbury Circus, Liverpool Street | 100 | 44822 | 0.002231 |
820 | Waterloo Station 3, Waterloo | 151 | 80916 | 0.001866 |
821 | Waterloo Roundabout, Waterloo | 3 | 1678 | 0.001788 |
822 | Monier Road | 0 | 11 | 0.000000 |
823 rows × 4 columns
badtrips.plot.scatter('num_trips', 'fraction_bad')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4585de3e10>
이 그래프를 보면 fraction_bad 컬럼의 값이 높을수록 관련된 대여소의 num_trips 컬럼 값이 낮다는 점을 알 수 있지만 0.4라는 예외 때문에 그 트렌드가 명확하게 보이지 않는다. 다음 seaborn 패키지를 이용해서 데이터를 조금 더 자세히 들여다보자.
regplot
do Plot the data and draw a linear regression model.
import seaborn as sns
ax = sns.regplot(badtrips['num_trips'],badtrips['fraction_bad'])
ax.set_ylim(0, 0.05)
(0.0, 0.05)
이제 num_trips 컬럼 값이 낮은 대여소일수록 fraction_bad 컬럼 값이 높다는 점이 명확해졌다.
그런데, fraction_bad 값이 높을수록 num_trips 값이 낮으므로 단순히 fraction_bad 값이 높은 대여소로 직원을 보내서는 안된다. 그렇다면 임의 조사를 수행할 대여소를 어떻게 골라야 할까?
한 가지 방법은 사용 빈도가 높은 대여소 중에서 상황이 좋지 않은 5곳을 선택하고, 그 다음으로 사용 빈도가 높은 대여소 중 5곳을 선택해 나가는 방법이다. 그러려면 num_trips 컬럼 값을 기준으로 대여소의 변위치(quantile)을 구해 4개의 묶음(band)을 만들고, 각 묶음 안에서 망친 나들이 비율이 가장 높은 5개의 대여소를 찾으면 된다.
stations_to_examine = []
for band in range(1,5):
min_trips = badtrips['num_trips'].quantile(0.2*(band))
max_trips = badtrips['num_trips'].quantile(0.2*(band+1))
query = 'num_trips >= {} and num_trips < {}'.format(
min_trips, max_trips)
print(query) # band
stations = badtrips.query(query)
stations = stations.sort_values(
by=['fraction_bad'], ascending=False)[:5]
print(stations) # 5 worst
stations_to_examine.append(stations)
print()
num_trips >= 4826.4 and num_trips < 8511.8 start_station_name bad_trips num_trips fraction_bad 6 River Street , Clerkenwell 221 8279 0.026694 9 Courland Grove, Wandsworth Road 105 5369 0.019557 10 Stanley Grove, Battersea 92 4882 0.018845 12 Southern Grove, Bow 112 6152 0.018205 18 Richmond Way, Shepherd's Bush 126 8149 0.015462 num_trips >= 8511.8 and num_trips < 11502.6 start_station_name bad_trips num_trips fraction_bad 3 Ormonde Gate, Chelsea 315 8932 0.035266 41 Emperor's Gate, South Kensington 136 10554 0.012886 53 Chepstow Villas, Notting Hill 135 11038 0.012230 57 Salmon Lane, Limehouse 113 9345 0.012092 63 Greyhound Road, Hammersmith 118 9998 0.011802 num_trips >= 11502.6 and num_trips < 16509.2 start_station_name bad_trips num_trips \ 35 Hertford Road, De Beauvoir Town 195 14188 39 Bramham Gardens, Earl's Court 212 16178 62 Lancaster Gate , Bayswater 172 14554 75 Ilchester Gardens, Bayswater 159 14137 80 Westfield Library Corner, Shepherd's Bush 126 11517 fraction_bad 35 0.013744 39 0.013104 62 0.011818 75 0.011247 80 0.010940 num_trips >= 16509.2 and num_trips < 95740.0 start_station_name bad_trips num_trips fraction_bad 25 Queen's Gate, Kensington Gardens 396 27457 0.014423 74 Speakers' Corner 2, Hyde Park 468 41107 0.011385 76 Cumberland Gate, Hyde Park 303 26981 0.011230 77 Albert Gate, Hyde Park 729 66547 0.010955 82 Triangle Car Park, Hyde Park 454 41675 0.010894
import pandas as pd
stations_to_examine = pd.concat(stations_to_examine)
stations_to_examine.head(5)
start_station_name | bad_trips | num_trips | fraction_bad | |
---|---|---|---|---|
6 | River Street , Clerkenwell | 221 | 8279 | 0.026694 |
9 | Courland Grove, Wandsworth Road | 105 | 5369 | 0.019557 |
10 | Stanley Grove, Battersea | 92 | 4882 | 0.018845 |
12 | Southern Grove, Bow | 112 | 6152 | 0.018205 |
18 | Richmond Way, Shepherd's Bush | 126 | 8149 | 0.015462 |
이 데이터를 추후 보고서 작성을 위해 별도 테이블로 저장한다.
from google.cloud import bigquery
PROJECT='myhyundai20'
bq = bigquery.Client(project=PROJECT)
table_id = '{}.experiments.bad_bikes'.format(PROJECT)
job = bq.load_table_from_dataframe(stations_to_examine, table_id)
job.result() # blocks and waits
print("Loaded {} rows into {}".format(job.output_rows, table_id))
Loaded 20 rows into myhyundai20.experiments.bad_bikes
이제 저장한 대여소 위치를 위경도를 지닌 테이블과 조인하여 지도에 표현해본다.
%%bigquery stations_to_examine
SELECT
start_station_name AS station_name
, num_trips
, fraction_bad
, latitude
, longitude
FROM `myhyundai20`.experiments.bad_bikes AS bad
JOIN `myhyundai20`.experiments.cycle_stations AS s
ON bad.start_station_name = s.name
stations_to_examine.head(5)
station_name | num_trips | fraction_bad | latitude | longitude | |
---|---|---|---|---|---|
0 | Ormonde Gate, Chelsea | 8932 | 0.035266 | 51.487964 | -0.161765 |
1 | Stanley Grove, Battersea | 4882 | 0.018845 | 51.470475 | -0.152130 |
2 | Courland Grove, Wandsworth Road | 5369 | 0.019557 | 51.472918 | -0.132103 |
3 | Southern Grove, Bow | 6152 | 0.018205 | 51.523538 | -0.030556 |
4 | Salmon Lane, Limehouse | 9345 | 0.012092 | 51.514115 | -0.033828 |
folium 패키지를 이용해 지도에 표현해보자.
import folium
map_pts = folium.Map(location=[51.5, -0.15], zoom_start=12)
for idx, row in stations_to_examine.iterrows():
folium.Marker( location=[row['latitude'], row['longitude']], popup=row['station_name'] ).add_to(map_pts)
map_pts
이제 이 대여소 위치에 직원을 보내서 대여소를 점검하자.