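The file tabelog_scraping_data_60_pages_per_pref.npy loaded below is a renamed copy of scraping.npy, which was created by running Scraping.ipynb, available at [link]. Note, however, that the contents of scraping.npy have since changed: data such as the distinction between "non-members / free members (associate shop members)" and "paid members (shop members)" were later added in a file named total_contents.npy, whereas the scraping.npy used below is the file from before that addition. See also "食べログ 非会員/無料会員/有料会員の見分け方" (how to tell Tabelog non-members, free members, and paid members apart). [...] is included in scraping.npy. The renamed copy of this scraping.npy, tabelog_scraping_data_60_pages_per_pref.npy, can be downloaded from [link].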
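The data in the first 60 pages look different from prefecture to prefecture, so pooling them into a single graph is not a proper way to draw a graph. Keep in mind that the plots below do exactly this.
They are only rough plots for getting a first feel for the data.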
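To make the caveat about pooling concrete, here is a minimal sketch with hypothetical toy data (the group names `"A"` and `"B"` and all numbers are made up for illustration). It assumes each group keeps only its most-reviewed shops, so the smallest retained review count differs between groups, and a given review-count band may be populated by only some of the groups.

```julia
# Hypothetical toy data: each group retains only its most-reviewed shops,
# so the smallest retained review count (the cutoff) differs between groups.
toy = Dict(
    "A" => [300, 280, 265],   # high cutoff
    "B" => [40, 12, 5, 3],    # low cutoff
)

# Smallest review count retained in each group.
cutoffs = Dict(g => minimum(v) for (g, v) in toy)

# Groups that contribute at least one shop to the band lo <= review count <= hi.
contributors(lo, hi) = sort([g for (g, v) in toy if any(x -> lo <= x <= hi, v)])

low_band  = contributors(1, 50)      # only group "B" reaches this band
high_band = contributors(200, 400)   # only group "A" reaches this band
```

A pooled plot over a low review-count band therefore describes only the groups with a low cutoff, which is why the pooled plots should be read as rough guides.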
The idea of plotting how the mode of the ratings depends on the number of reviews is due to [link]. One can confirm that the modes occur neatly at intervals of 0.2.
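As a small illustration of what "mode of the ratings as a function of the number of reviews" means, the following sketch uses hypothetical toy data and a hand-written helper `simple_mode` (an illustrative name, not a library function). It uses only Base Julia; unlike `StatsBase.mode`, it breaks ties by taking the smallest value, which is an assumption made here for determinism.

```julia
# Hypothetical toy data: review counts and ratings of ten shops (not the scraped data).
toy_review_num = [3, 3, 3, 10, 10, 10, 10, 50, 50, 50]
toy_rating     = [3.02, 3.02, 3.05, 3.20, 3.21, 3.20, 3.20, 3.40, 3.41, 3.40]

# Most frequent value of a vector; ties are broken by the smallest value.
function simple_mode(v)
    counts = Dict{eltype(v),Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    best = maximum(values(counts))
    return minimum(val for (val, c) in counts if c == best)
end

# One mode per distinct review count.
ks = sort(unique(toy_review_num))
modes = [simple_mode(toy_rating[toy_review_num .== k]) for k in ks]
```

Plotting `modes` against `ks` gives the kind of curve drawn at the end of this notebook.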
ENV["LINES"] = 100
using PyCall
np = pyimport("numpy")
using StatsBase
using Plots
gr()
Plots.GRBackend()
# Load the data
npscraping = np.load("tabelog_scraping_data_60_pages_per_pref.npy", allow_pickle=true)
# Flatten the nested array into a single vector of shop records
X = [x for y in npscraping for x in y]
@show length(X)
@show X[1]
@show X[end]
pref = [x["pref"] for x in X]
name = [x["name"] for x in X]
review_num = [x["amounts"] for x in X]
rating = [x["rates"] for x in X]
@show extrema(review_num)
@show extrema(rating);
@show pref_name = collect(Set(pref));
@show length(pref_name);
pref_name = collect(Set(pref)) = ["akita", "hokkaido", "wakayama", "okinawa", "hiroshima", "shimane", "mie", "aomori", "nagasaki", "oita", "fukushima", "kagoshima", "yamanashi", "saga", "saitama", "fukui", "miyazaki", "shizuoka", "tokyo", "niigata", "kanagawa", "okayama", "ibaraki", "tochigi", "toyama", "aichi", "kochi", "kyoto", "gunma", "yamaguchi", "kagawa", "fukuoka", "chiba", "osaka", "kumamoto", "nagano", "miyagi", "iwate", "ishikawa", "hyogo", "ehime", "shiga", "tottori", "nara", "gifu", "yamagata", "tokushima"]
length(pref_name) = 47
pref_list = [
"hokkaido",
"aomori","akita","yamagata","iwate","miyagi", "fukushima",
"chiba","tochigi","ibaraki","gunma","saitama","tokyo","kanagawa",
"aichi","gifu","shizuoka","niigata","yamanashi", "nagano","ishikawa","toyama","fukui",
"mie","osaka","hyogo","kyoto","shiga","nara","wakayama",
"okayama","hiroshima","tottori","shimane","yamaguchi",
"kagawa","tokushima","ehime","kochi",
"fukuoka","saga","nagasaki","kumamoto","oita","miyazaki","kagoshima","okinawa"
]
Set(pref_name) == Set(pref_list)
true
pref_name = pref_list
47-element Array{String,1}: "hokkaido" "aomori" "akita" "yamagata" "iwate" "miyagi" "fukushima" "chiba" "tochigi" "ibaraki" "gunma" "saitama" "tokyo" "kanagawa" "aichi" "gifu" "shizuoka" "niigata" "yamanashi" "nagano" "ishikawa" "toyama" "fukui" "mie" "osaka" "hyogo" "kyoto" "shiga" "nara" "wakayama" "okayama" "hiroshima" "tottori" "shimane" "yamaguchi" "kagawa" "tokushima" "ehime" "kochi" "fukuoka" "saga" "nagasaki" "kumamoto" "oita" "miyazaki" "kagoshima" "okinawa"
review_num_count = [count(review_num .== k) for k in 1:maximum(review_num)]
plot((x -> ifelse(iszero(x), 0.9, x)).(review_num_count);
legend=false, xlabel="review_num", xscale=:log10, yscale=:log10)
The smallest number of reviews in the retrieved data is 3. Because only the first 60 pages were retrieved for each prefecture, shops with few reviews are missing from the data.
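The count vector plotted above can also be built in a single pass over the data. The sketch below is one possible way to do so, shown on hypothetical toy values; the helper name `tally` is made up for illustration, and it assumes all review counts are integers of at least 1.

```julia
# Hypothetical toy review counts (not the scraped data).
toy_review_num = [3, 5, 3, 8, 5, 3]

# Tally how many shops have each review count, in a single pass.
# Assumes every entry is an integer >= 1.
function tally(v::AbstractVector{<:Integer})
    counts = zeros(Int, maximum(v))
    for n in v
        counts[n] += 1
    end
    return counts
end

toy_counts = tally(toy_review_num)

# Zero counts cannot be drawn on a log axis, so replace them by a value
# below 1, as done with 0.9 in the plot above.
toy_plot_values = [c == 0 ? 0.9 : float(c) for c in toy_counts]

# Smallest review count that actually occurs.
smallest = findfirst(>(0), toy_counts)
```

The index of the first nonzero entry is the minimum review count, which is 3 for this toy vector.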
pref_name = unique(pref)
pref_count = [count(pref .== x) for x in pref_name]
@show pref_count;
pref_count = [1180, 1180, 1179, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1179, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1180, 1179, 1180]
The number of shops retrieved for each prefecture is 1179 or 1180. I expected to get 20×60 = 1200 shops per prefecture, but about 20 are missing in each.
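The arithmetic behind this remark can be checked with a short sketch. The counts are taken from the pref_count output above (44 prefectures with 1180 shops and 3 with 1179); their order is not preserved here, and the variable names are chosen for illustration.

```julia
shops_per_page = 20
pages_per_pref = 60
expected = shops_per_page * pages_per_pref   # 20 × 60 = 1200

# Counts from the pref_count output above, regrouped (order not preserved).
observed = vcat(fill(1180, 44), fill(1179, 3))

# Missing shops per prefecture and total number of shops retrieved.
shortfall = expected .- observed
total_shops = sum(observed)

println("shortfall per prefecture ranges over ", extrema(shortfall))
println("total shops retrieved: ", total_shops)
```

The total agrees with the sum of the three group sizes reported further below (24777 + 16520 + 14160).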
review_num_min = [minimum(review_num[pref .== x]) for x in pref_name]
hcat(pref_name, review_num_min)
47×2 Array{Any,2}:
 "hokkaido"    59
 "aomori"       6
 "akita"        7
 "yamagata"     7
 "iwate"        9
 "miyagi"      25
 "fukushima"   13
 "aichi"       61
 "gifu"        12
 "shizuoka"    24
 "mie"         18
 "niigata"     16
 "yamanashi"    8
 "nagano"       3
 "ishikawa"    13
 "toyama"       8
 "fukui"        5
 "okayama"     17
 "hiroshima"   19
 "tottori"      5
 "shimane"      3
 "yamaguchi"    8
 "kagawa"      11
 "tokushima"    6
 "ehime"        9
 "kochi"        3
 "tokyo"      261
 "kanagawa"    77
 "chiba"       35
 "tochigi"     14
 "ibaraki"     15
 "gunma"       16
 "saitama"     31
 "osaka"      111
 "hyogo"       46
 "kyoto"       54
 "shiga"       11
 "nara"        13
 "wakayama"     5
 "fukuoka"     40
 "saga"         4
 "nagasaki"     7
 "kumamoto"     9
 "oita"         5
 "miyazaki"     6
 "kagoshima"    4
 "okinawa"     18
@show pref_small = pref_name[review_num_min .≤ 10]
@show pref_middle = pref_name[10 .< review_num_min .≤ 20]
@show pref_large = pref_name[20 .< review_num_min];
pref_small = pref_name[review_num_min .≤ 10] = ["aomori", "akita", "yamagata", "iwate", "yamanashi", "nagano", "toyama", "fukui", "tottori", "shimane", "yamaguchi", "tokushima", "ehime", "kochi", "wakayama", "saga", "nagasaki", "kumamoto", "oita", "miyazaki", "kagoshima"]
pref_middle = pref_name[10 .< review_num_min .≤ 20] = ["fukushima", "gifu", "mie", "niigata", "ishikawa", "okayama", "hiroshima", "kagawa", "tochigi", "ibaraki", "gunma", "shiga", "nara", "okinawa"]
pref_large = pref_name[20 .< review_num_min] = ["hokkaido", "miyagi", "aichi", "shizuoka", "tokyo", "kanagawa", "chiba", "saitama", "osaka", "hyogo", "kyoto", "fukuoka"]
in_small(x) = in(x, pref_small)
in_middle(x) = in(x, pref_middle)
in_large(x) = in(x, pref_large)
@show count(in_small.(pref))
@show count(in_middle.(pref))
@show count(in_large.(pref));
count(in_small.(pref)) = 24777
count(in_middle.(pref)) = 16520
count(in_large.(pref)) = 14160
There appear to be "anomalies" at 3.60 and 3.80.
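One possible way to put a number on such a visual impression is to compare how many ratings fall just below a threshold with how many fall at or just above it. The sketch below does this on hypothetical toy ratings; the helper `count_between`, the window width of three bins, and the data are all assumptions made for illustration and are not part of the original analysis.

```julia
# Hypothetical ratings with two decimals (not the scraped data).
toy_rating = [3.55, 3.56, 3.57, 3.57, 3.58, 3.58, 3.58,
              3.59, 3.59, 3.59, 3.59, 3.60, 3.61, 3.62]

# Work in integer hundredths so that bin membership does not depend on
# floating-point rounding of values such as 3.6.
hundredths = round.(Int, 100 .* toy_rating)

# Number of ratings h with lo <= h <= hi (both in hundredths).
count_between(h, lo, hi) = count(x -> lo <= x <= hi, h)

threshold = 360   # corresponds to a rating of 3.60
below = count_between(hundredths, threshold - 3, threshold - 1)   # 3.57 to 3.59
above = count_between(hundredths, threshold, threshold + 2)       # 3.60 to 3.62

println("just below: ", below, ", at or just above: ", above)
```

A large imbalance between the two counts, relative to what neighbouring windows show, would correspond to the pile-up seen just below the dashed lines in the histograms that follow.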
histogram(rating, legend=false, bin=minimum(rating):0.01:maximum(rating), xtick=3.0:0.1:4.8)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_small(pref) & (50 < review_num))], legend=false, bin=3.0:0.01:4.1, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[in_small.(pref)], legend=false, bin=minimum(rating):0.01:maximum(rating), xtick=3.0:0.1:4.8)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[in_middle.(pref)], legend=false, bin=minimum(rating):0.01:maximum(rating), xtick=3.0:0.1:4.8)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_middle(pref) & (50 < review_num))], legend=false, bin=3.0:0.01:4.1, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[in_large.(pref)], legend=false, bin=minimum(rating):0.01:maximum(rating), xtick=3.0:0.1:4.8)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (20 < review_num ≤ 30))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (30 < review_num ≤ 40))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (40 < review_num ≤ 50))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (50 < review_num ≤ 60))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (60 < review_num ≤ 70))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (70 < review_num ≤ 80))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (80 < review_num ≤ 100))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(in_large(pref) & (100 < review_num ≤ 1000000))], legend=false, bin=3.0:0.01:4.2, xtick=3.0:0.1:4.2)
vline!([3.6, 3.8], ls=:dashdot)
histogram(rating[@.(0 < review_num ≤ 5)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
histogram(rating[@.(5 < review_num ≤ 10)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
histogram(rating[@.(10 < review_num ≤ 15)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
histogram(rating[@.(15 < review_num ≤ 20)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
histogram(rating[@.(20 < review_num ≤ 25)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
histogram(rating[@.(25 < review_num ≤ 30)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
histogram(rating[@.(30 < review_num ≤ 35)], legend=false, bin=3.0:0.02:4.2, xtick=3.0:0.1:4.2)
vline!(3.0:0.1:4.2, ls=:dashdot)
rc = sort(unique(review_num))
mode_of_rating = [mode(rating[@.(review_num == k)]) for k in rc]
plot(rc, mode_of_rating, xlim=(3, 500), xscale=:log10, ytick=3.0:0.2:4.8,
title="mode of ratings with number of reviews = k", xlabel="k", legend=false)
rc = sort(unique(review_num))
mode_of_rating = [mode(rating[@.(0.9k ≤ review_num ≤ 1.1k)]) for k in rc]
plot(rc, mode_of_rating, xscale=:log10, ytick=3.0:0.2:4.0,
title="mode of ratings with 0.9k <= number of reviews <= 1.1k", xlabel="k", legend=false)
rc = sort(unique(review_num))
mode_of_rating = [mode(rating[@.(0.8k ≤ review_num ≤ 1.2k)]) for k in rc]
plot(rc, mode_of_rating, xscale=:log10, ytick=3.0:0.2:4.0,
title="mode of ratings with 0.8k <= number of reviews <= 1.2k", xlabel="k", legend=false)
rc = sort(unique(review_num))
mode_of_rating = [mode(rating[@.(0.7k ≤ review_num ≤ 1.3k)]) for k in rc]
plot(rc, mode_of_rating, xscale=:log10, ytick=3.0:0.2:4.0,
title="mode of ratings with 0.7k <= number of reviews <= 1.3k", xlabel="k", legend=false)
rc = sort(unique(review_num))
mode_of_rating = [mode(rating[@.(min(0.9k, k-3) ≤ review_num ≤ max(1.1k, k+3))]) for k in rc]
plot(rc, mode_of_rating, xscale=:log10, ytick=3.0:0.2:4.0,
title="mode of ratings with min(0.9k,k-3) <= number of reviews <= max(1.1k, k+3)", titlefontsize=10,
xlabel="k", label="")
plot!([15, 15], [3.0, 4.0], label="k=15", lw=1)
plot!([20, 20], [3.0, 4.0], label="k=20", lw=2, ls=:dot)
plot!([75, 75], [3.0, 4.0], label="k=75", lw=1.5, ls=:dash)
plot!([270, 270], [3.0, 4.0], label="k=270", lw=1.5, ls=:dashdot)
plot!(legend=:topleft)