I'll show how to get the most out of the pandas and Clojure ecosystems at the same time.
This intro is based on this Kaggle notebook; you can follow along with it if you come from the Python world.
The easiest way to get started is the provided Docker image, but if you want to set up your own machine, just follow along.
If you want to install everything at the system level, do something equivalent to the following:
sudo apt-get update
sudo apt-get install libpython3.6-dev
pip3 install numpy pandas
To work within a conda environment just create a new one with:
conda create -n panthera python=3.6 numpy pandas
conda activate panthera
Then start your REPL from the activated conda environment. This is the best way to install the requirements for panthera, because in the process you also get MKL along with NumPy.
Let's just add panthera to our classpath and we're good to go!
(require '[clojupyter.misc.helper :as helper])
(helper/add-dependencies '[panthera "0.1-alpha.11"])
:ok
:ok
Now require panthera's main API namespace and define a little helper to better inspect data-frames:
(require '[panthera.panthera :as pt])
(require '[clojupyter.display :as display])
(require '[libpython-clj.python :as py])
(defn show
  [obj]
  (display/html
    (py/call-attr obj "to_html")))
#'user/show
(helper/add-dependencies '[metasoarous/oz "1.5.4"])
(require '[oz.notebook.clojupyter :as oz])
nil
We will work with Pokemons! Datasets are available here.
We can read data into panthera from various formats; one of the most used functions is read-csv. Most panthera functions accept a data-frame or a series as the first argument, then one or more required arguments, and finally a map of options.
To see which options are available you can check the docs or even the original pandas docs. Just remember that keywords are converted to their Python names automatically (for example :index-col becomes index_col), while strings must use the original Python name.
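To make the keyword rule concrete, here is a minimal plain-Clojure sketch of the conversion described above. The helper kw->py-name is hypothetical and is not part of panthera; it only mimics the documented effect (dashes become underscores, strings are left alone).

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, NOT panthera's implementation: it only
;; reproduces the behaviour described in the text.
(defn kw->py-name
  [k]
  (if (keyword? k)
    (str/replace (name k) "-" "_") ; :index-col -> "index_col"
    k))                            ; strings pass through unchanged

(kw->py-name :index-col)  ;=> "index_col"
(kw->py-name "index_col") ;=> "index_col"
```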
Below, as an example, we call read-csv on our file, but we only want the first 10 rows, so we pass a map of options to the function: {:nrows 10}.
(show (pt/read-csv "../resources/pokemon.csv" {:nrows 10}))
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
5 | 6 | Charmeleon | Fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
6 | 7 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
7 | 8 | Mega Charizard X | Fire | Dragon | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
8 | 9 | Mega Charizard Y | Fire | Flying | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
9 | 10 | Squirtle | Water | NaN | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
The cool thing is that we can chain operations: the thread-first macro is our friend!
Below we read the whole CSV, get the correlation matrix and then show it:
(-> (pt/read-csv "../resources/pokemon.csv")
pt/corr
show)
| | # | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|
# | 1.000000 | 0.097712 | 0.102664 | 0.094691 | 0.089199 | 0.085596 | 0.012181 | 0.983428 | 0.154336 |
HP | 0.097712 | 1.000000 | 0.422386 | 0.239622 | 0.362380 | 0.378718 | 0.175952 | 0.058683 | 0.273620 |
Attack | 0.102664 | 0.422386 | 1.000000 | 0.438687 | 0.396362 | 0.263990 | 0.381240 | 0.051451 | 0.345408 |
Defense | 0.094691 | 0.239622 | 0.438687 | 1.000000 | 0.223549 | 0.510747 | 0.015227 | 0.042419 | 0.246377 |
Sp. Atk | 0.089199 | 0.362380 | 0.396362 | 0.223549 | 1.000000 | 0.506121 | 0.473018 | 0.036437 | 0.448907 |
Sp. Def | 0.085596 | 0.378718 | 0.263990 | 0.510747 | 0.506121 | 1.000000 | 0.259133 | 0.028486 | 0.363937 |
Speed | 0.012181 | 0.175952 | 0.381240 | 0.015227 | 0.473018 | 0.259133 | 1.000000 | -0.023121 | 0.326715 |
Generation | 0.983428 | 0.058683 | 0.051451 | 0.042419 | 0.036437 | 0.028486 | -0.023121 | 1.000000 | 0.079794 |
Legendary | 0.154336 | 0.273620 | 0.345408 | 0.246377 | 0.448907 | 0.363937 | 0.326715 | 0.079794 | 1.000000 |
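If thread-first is new to you, this small plain-Clojure sketch (no panthera involved) shows what the macro does: each result is inserted as the first argument of the next form. The symbols f, g and x are just placeholders.

```clojure
;; The threaded form and the nested form are the same computation.
(def threaded (-> 5 inc (* 2) str))
(def nested   (str (* (inc 5) 2)))

threaded            ;=> "12"
(= threaded nested) ;=> true

;; macroexpand-1 shows the rewriting on placeholder symbols.
(macroexpand-1 '(-> x f (g 1))) ;=> (g (f x) 1)
```

This is why panthera functions take the data-frame or series as their first argument: it lets them slot into a -> chain.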
Since we'll be using pokemon.csv a lot, let's give it a name. defonce is great here:
(defonce pokemon (pt/read-csv "../resources/pokemon.csv"))
#'user/pokemon
Let's see how plotting goes
(defn heatmap
  [data x y z]
  {:data {:values data}
   :width 500
   :height 500
   :encoding {:x {:field x
                  :type "nominal"}
              :y {:field y
                  :type "nominal"}}
   :layer [{:mark "rect"
            :encoding {:color {:field z
                               :type "quantitative"}}}
           {:mark "text"
            :encoding {:text {:field z
                              :type "quantitative"
                              :format ".2f"}
                       :color {:value "white"}}}]})
#'user/heatmap
(-> pokemon
pt/corr
pt/reset-index
(pt/melt {:id-vars :index})
pt/->clj
(heatmap :index :variable :value)
oz/view!)
What we did was plot a heatmap of the correlation matrix shown above. Don't worry too much about all the steps we took; we'll go through them one by one later on!
What if we have already read our data but want to see only some rows? We have the head function for that:
(show (pt/head pokemon))
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
(show (pt/head pokemon 10))
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
5 | 6 | Charmeleon | Fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
6 | 7 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
7 | 8 | Mega Charizard X | Fire | Dragon | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
8 | 9 | Mega Charizard Y | Fire | Flying | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
9 | 10 | Squirtle | Water | NaN | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
Another nice thing we can do is get the column names:
(pt/names pokemon)
Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'], dtype='object')
When you see output like the one above, it means that the data is still in Python. That's OK if you keep working within panthera, but what if you want to do something with the column names in Clojure?
(vec (pt/names pokemon))
["#" "Name" "Type 1" "Type 2" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary"]
That's it! Just call vec and you have a nice Clojure vector that you can work with.
N.B.: many Python objects can be treated directly as the corresponding Clojure collections. For instance, in this case we can do something like the following:
(doseq [a (pt/names pokemon)] (println a))
#
Name
Type 1
Type 2
HP
Attack
Defense
Sp. Atk
Sp. Def
Speed
Generation
Legendary
nil
Plotting is a nice way to learn how to munge data: you get fast visual feedback and the results are usually nice to look at!
Let's plot Speed and Defense:
(defn line-plot
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "line"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "blue"}))))
#'user/line-plot
(-> pokemon
(pt/subset-cols :# :Speed :Defense)
(pt/melt {:id-vars :#})
pt/->clj
(line-plot :# :value :variable)
oz/view!)
Let's look at the operations above:
- subset-cols: we use this to, well, subset columns. We can choose N columns by label, and we get a 'new' data-frame with only the selected columns
- melt: this transforms the data-frame from wide to long format (for more info about it see further below)
- ->clj: this turns data-frames and series into a Clojure vector of maps

subset-cols is pretty straightforward:
(-> pokemon (pt/subset-cols :Speed :Attack) pt/head show)
| | Speed | Attack |
---|---|---|
0 | 45 | 49 |
1 | 60 | 62 |
2 | 80 | 82 |
3 | 80 | 100 |
4 | 65 | 52 |
(-> pokemon (pt/subset-cols :Speed :Attack :HP :#) pt/head show)
| | Speed | Attack | HP | # |
---|---|---|---|---|
0 | 45 | 49 | 45 | 1 |
1 | 60 | 62 | 60 | 2 |
2 | 80 | 82 | 80 | 3 |
3 | 80 | 100 | 80 | 4 |
4 | 65 | 52 | 39 | 5 |
(-> pokemon (pt/subset-cols :# :Attack) pt/head)
   #  Attack
0  1      49
1  2      62
2  3      82
3  4     100
4  5      52
->clj tries to work out the best way to turn panthera data structures into Clojure ones:
(-> pokemon (pt/subset-cols :Speed) pt/head pt/->clj)
[{:speed 45} {:speed 60} {:speed 80} {:speed 80} {:speed 65}]
(-> pokemon (pt/subset-cols :Speed :HP) pt/head pt/->clj)
[{:speed 45, :hp 45} {:speed 60, :hp 60} {:speed 80, :hp 80} {:speed 80, :hp 80} {:speed 65, :hp 39}]
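Once the data is a vector of maps, the whole Clojure toolbox applies. A small sketch, using the exact value returned above (no panthera calls involved):

```clojure
;; The vector of maps printed by ->clj above, pasted in as plain data.
(def rows
  [{:speed 45, :hp 45} {:speed 60, :hp 60} {:speed 80, :hp 80}
   {:speed 80, :hp 80} {:speed 65, :hp 39}])

;; Keywords are functions, so pulling out a column is just a map.
(mapv :speed rows) ;=> [45 60 80 80 65]

;; Keep only the rows where speed is greater than hp.
(filterv #(> (:speed %) (:hp %)) rows) ;=> [{:speed 65, :hp 39}]
```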
Now we want to see what happens when we plot Attack vs Defense:
(defn scatter
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "point"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))
#'user/scatter
(-> pokemon
(pt/subset-cols :Attack :Defense)
pt/->clj
(scatter :attack :defense)
oz/view!)
And now the Speed histogram:
(defn hist
  [data x & [color]]
  (let [spec {:data {:values data}
              :mark "bar"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :bin {:maxbins 50}
                             :type "quantitative"}
                         :y {:aggregate "count"
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))
#'user/hist
(-> pokemon
(pt/subset-cols :Speed)
pt/->clj
(hist :speed)
oz/view!)
(show (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))
| | a | b |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
What if we don't care about column names, or we'd prefer to add them to an already generated data-frame?
(show (pt/data-frame (to-array-2d [[1 2] [3 4]])))
| | 0 | 1 |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
Columns of data-frames are just series:
(-> pokemon (pt/subset-cols "Defense") pt/pytype)
:series
(pt/series [1 2 3])
0    1
1    2
2    3
dtype: int64
The column name is the name of the series:
(pt/series [1 2 3] {:name :my-series})
0    1
1    2
2    3
Name: my-series, dtype: int64
One of the most straightforward ways to filter data-frames is with booleans. We have filter-rows, which takes either booleans or a function that generates booleans:
(-> pokemon
(pt/filter-rows #(-> % (pt/subset-cols "Defense") (pt/gt 200)))
show)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
224 | 225 | Mega Steelix | Steel | Ground | 75 | 125 | 230 | 55 | 95 | 30 | 2 | False |
230 | 231 | Shuckle | Bug | Rock | 20 | 10 | 230 | 10 | 230 | 5 | 2 | False |
333 | 334 | Mega Aggron | Steel | NaN | 70 | 140 | 230 | 60 | 80 | 50 | 3 | False |
gt is exactly what you think it is: >. Check the Basic concepts notebook to better understand how math works in panthera.
Now we have to bring NumPy into the equation. Let's say we want to filter the data-frame on two conditions at the same time; we can do that using npy:
(require '[panthera.numpy :refer [npy]])
nil
(defn my-filter
  [col1 col2]
  (npy :logical-and
       {:args [(-> pokemon
                   (pt/subset-cols col1)
                   (pt/gt 200))
               (-> pokemon
                   (pt/subset-cols col2)
                   (pt/gt 100))]}))
#'user/my-filter
(-> pokemon
(pt/filter-rows (my-filter :Defense :Attack))
show)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
224 | 225 | Mega Steelix | Steel | Ground | 75 | 125 | 230 | 55 | 95 | 30 | 2 | False |
333 | 334 | Mega Aggron | Steel | NaN | 70 | 140 | 230 | 60 | 80 | 50 | 3 | False |
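To see what the two-condition filter computes, here is a plain-Clojure sketch over the three rows that passed the Defense filter earlier. It builds one boolean mask per condition and combines them element by element, which is the role :logical-and plays above. The names candidates, defense-mask, attack-mask and both are made up for this illustration.

```clojure
;; Attack and Defense values copied from the filtered table shown earlier.
(def candidates
  [{:name "Mega Steelix" :attack 125 :defense 230}
   {:name "Shuckle"      :attack 10  :defense 230}
   {:name "Mega Aggron"  :attack 140 :defense 230}])

(def defense-mask (mapv #(> (:defense %) 200) candidates)) ;=> [true true true]
(def attack-mask  (mapv #(> (:attack %) 100) candidates))  ;=> [true false true]

;; Element-wise AND of the two masks.
(def both (mapv #(and %1 %2) defense-mask attack-mask))    ;=> [true false true]

;; Keep the rows whose mask entry is true.
(->> (map vector candidates both)
     (filter second)
     (mapv (comp :name first)))
;=> ["Mega Steelix" "Mega Aggron"]
```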
panthera.numpy works a little differently from regular panthera: usually npy is all you need to access every NumPy function.
For instance:
(-> pokemon
(pt/subset-cols :Defense)
((npy :log))
pt/head)
0    3.891820
1    4.143135
2    4.418841
3    4.812184
4    3.761200
Name: Defense, dtype: float64
Above we just calculated the log of the whole Defense column! Remember that npy operations are vectorized, so it is usually faster to use them (or the equivalent panthera ones) than Clojure ones (unless you're doing more complicated operations, in which case Clojure would probably be faster).
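For comparison, this is roughly what the same element-wise log looks like in plain Clojure on the first five Defense values. It is only a sketch for checking the numbers, not a replacement for the vectorized call.

```clojure
;; First five Defense values, copied from the head of the data-frame.
(def defense-head [49 63 83 123 43])

;; Natural log applied one element at a time.
(def logs (mapv #(Math/log (double %)) defense-head))

;; The first value agrees with the 3.891820 shown above to six decimals.
(< (Math/abs (- (first logs) 3.891820)) 1e-6) ;=> true
```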
Now let's try to do some more complicated things:
(/ (pt/sum (pt/subset-cols pokemon :Speed))
(pt/n-rows pokemon))
27311/400
Above we see how we can combine operations on series, but of course that's a mean, and we have a function for that!
(defn col-mean
[col]
(pt/mean (pt/subset-cols pokemon col)))
#'user/col-mean
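A side note on the 27311/400 printed above: dividing two integers in Clojure yields an exact ratio rather than a float. The sketch below uses 54622 as the column total, which is simply 27311/400 multiplied by the 800 rows.

```clojure
;; Integer division keeps an exact, reduced ratio.
(/ 54622 800)          ;=> 27311/400

;; Convert explicitly when a floating point number is wanted.
(double (/ 54622 800)) ;=> 68.2775

;; Or make one operand a double up front.
(/ 54622 800.0)        ;=> 68.2775
```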
Now we would like to add a new column that says high when the value is above the mean, and low otherwise.
npy is really helpful here:
(npy :where {:args [(pt/gt (pt/head (pt/subset-cols pokemon :Speed)) (col-mean :Speed))
"high"
"low"]})
['low' 'low' 'high' 'high' 'low']
But this is pretty ugly and we can't chain it with other functions. It is pretty easy to wrap it into a chainable function:
(defn where
[& args]
(npy :where {:args args}))
#'user/where
(-> pokemon
(pt/subset-cols :Speed)
pt/head
(pt/gt (col-mean :Speed))
(where "high" "low"))
['low' 'low' 'high' 'high' 'low']
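The reason the wrapper chains is that -> passes the boolean series in as the first of the collected args. A plain-Clojure sketch of the same shape, with hypothetical stand-ins gt-sketch and where-sketch working on vectors instead of series:

```clojure
;; Hypothetical stand-ins for pt/gt and the where wrapper, on plain vectors.
(defn gt-sketch
  [xs threshold]
  (mapv #(> % threshold) xs))

(defn where-sketch
  [mask if-true if-false]
  (mapv #(if % if-true if-false) mask))

;; First five Speed values and the Speed mean computed earlier.
(-> [45 60 80 80 65]
    (gt-sketch 68.2775)
    (where-sketch "high" "low"))
;=> ["low" "low" "high" "high" "low"]
```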
That seems to work! Let's add a new column to our data-frame:
(def speed-level
(-> pokemon
(pt/subset-cols :Speed)
(pt/gt (col-mean :Speed))
(where "high" "low")))
(-> pokemon
(pt/assign {:speed-level speed-level})
(pt/subset-cols :speed_level :Speed)
(pt/head 10)
show)
| | speed_level | Speed |
---|---|---|
0 | low | 45 |
1 | low | 60 |
2 | high | 80 |
3 | high | 80 |
4 | low | 65 |
5 | high | 80 |
6 | high | 100 |
7 | high | 100 |
8 | high | 100 |
9 | low | 43 |
Of course we didn't actually add speed_level to pokemon; we created a new data-frame. Everything here is as immutable as possible. Let's check that this is really the case:
(vec (pt/names pokemon))
["#" "Name" "Type 1" "Type 2" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary"]
Besides head we also have tail:
(show (pt/tail pokemon))
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
795 | 796 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
796 | 797 | Mega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
797 | 798 | Hoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
798 | 799 | Hoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
799 | 800 | Volcanion | Fire | Water | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
We can always check the shape of the data structure we're interested in. shape returns the row and column counts:
(pt/shape pokemon)
(800, 12)
If you want just one of the two you can use either n-rows or n-cols, or get the required value by index:
(pt/n-rows pokemon)
800
((pt/shape pokemon) 0)
800
Now we can move to something a little more interesting: some data analysis.
One of the first things we might want to do is look at some frequencies. value-counts is our friend:
(-> pokemon
(pt/subset-cols "Type 1")
(pt/value-counts {:dropna false}))
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Ground       32
Ghost        32
Dragon       32
Dark         31
Poison       28
Fighting     27
Steel        27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64
As we can see we get counts by group automatically and this can come in handy!
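If you have already pulled a column into Clojure, the core function frequencies gives a comparable count. A small sketch on the first five Type 1 values:

```clojure
;; Type 1 of the first five rows, copied from the head of the data-frame.
(def type-1-head ["Grass" "Grass" "Grass" "Grass" "Fire"])

(frequencies type-1-head) ;=> {"Grass" 4, "Fire" 1}

;; Sort by count, largest first, like the output above.
(sort-by val > (frequencies type-1-head)) ;=> (["Grass" 4] ["Fire" 1])
```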
There's also a nice way to see many stats at once for all the numeric columns: describe
(show (pt/describe pokemon))
| | # | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
---|---|---|---|---|---|---|---|---|
count | 800.0000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 |
mean | 400.5000 | 69.258750 | 79.001250 | 73.842500 | 72.820000 | 71.902500 | 68.277500 | 3.32375 |
std | 231.0844 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 1.66129 |
min | 1.0000 | 1.000000 | 5.000000 | 5.000000 | 10.000000 | 20.000000 | 5.000000 | 1.00000 |
25% | 200.7500 | 50.000000 | 55.000000 | 50.000000 | 49.750000 | 50.000000 | 45.000000 | 2.00000 |
50% | 400.5000 | 65.000000 | 75.000000 | 70.000000 | 65.000000 | 70.000000 | 65.000000 | 3.00000 |
75% | 600.2500 | 80.000000 | 100.000000 | 90.000000 | 95.000000 | 90.000000 | 90.000000 | 5.00000 |
max | 800.0000 | 255.000000 | 190.000000 | 230.000000 | 194.000000 | 230.000000 | 180.000000 | 6.00000 |
If you need some of these stats only for some columns, chances are that there's a function for that!
(-> (pt/subset-cols pokemon :HP)
((juxt pt/mean pt/std pt/minimum pt/maximum)))
[69.25875 25.53466903233207 1 255]
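In case juxt is unfamiliar: it takes several functions and returns one function that applies each of them to the same argument, collecting the results in a vector. A plain-Clojure sketch on the first five HP values (the name summary is made up for this example):

```clojure
;; HP of the first five rows, copied from the head of the data-frame.
(def hp-head [45 60 80 80 39])

;; One function that returns [min max count] of its argument.
(def summary (juxt #(apply min %) #(apply max %) count))

(summary hp-head) ;=> [39 80 5]
```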
One of the most common operations on rectangular data is reshaping it into whatever form makes other operations easier.
R people know exactly what I mean when I talk about tidy data. If you have no idea what that is, check the link, but the main point is that while most people are used to working with double-entry matrices (like the one built above with describe), it is much easier to work with long data: one row per observation and one column per variable.
In panthera, melt is the workhorse for this process:
(-> pokemon pt/head show)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
(-> pokemon pt/head (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) show)
| | Name | variable | value |
---|---|---|---|
0 | Bulbasaur | Attack | 49 |
1 | Ivysaur | Attack | 62 |
2 | Venusaur | Attack | 82 |
3 | Mega Venusaur | Attack | 100 |
4 | Charmander | Attack | 52 |
5 | Bulbasaur | Defense | 49 |
6 | Ivysaur | Defense | 63 |
7 | Venusaur | Defense | 83 |
8 | Mega Venusaur | Defense | 123 |
9 | Charmander | Defense | 43 |
Above we told panthera that we wanted to melt our data-frame, with the column Name acting as the main id, and that we're interested in the values of Attack and Defense.
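To make the wide-to-long idea concrete, here is a plain-Clojure sketch of the reshaping on two of the rows above. melt-sketch is a hypothetical helper written for this illustration; it is not how panthera or pandas implement melt.

```clojure
;; Two wide rows, copied from the head of the data-frame.
(def wide
  [{"Name" "Bulbasaur" "Attack" 49 "Defense" 49}
   {"Name" "Ivysaur"   "Attack" 62 "Defense" 63}])

;; Hypothetical helper: one output row per (value column, input row) pair.
(defn melt-sketch
  [rows id-var value-vars]
  (vec (for [v   value-vars
             row rows]
         {id-var     (get row id-var)
          "variable" v
          "value"    (get row v)})))

(melt-sketch wide "Name" ["Attack" "Defense"])
;=> [{"Name" "Bulbasaur", "variable" "Attack", "value" 49}
;    {"Name" "Ivysaur", "variable" "Attack", "value" 62}
;    {"Name" "Bulbasaur", "variable" "Defense", "value" 49}
;    {"Name" "Ivysaur", "variable" "Defense", "value" 63}]
```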
This makes it much easier to group values by some variable:
(-> pokemon
pt/head
(pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]})
(pt/groupby :variable)
pt/mean)
          value
variable
Attack     69.0
Defense    72.2
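The same group-then-aggregate step can be sketched in plain Clojure with group-by, using the ten melted values shown earlier. The helper mean is defined here for the example only.

```clojure
;; The melted rows from above, reduced to the two columns we need.
(def long-rows
  [{:variable "Attack"  :value 49}  {:variable "Attack"  :value 62}
   {:variable "Attack"  :value 82}  {:variable "Attack"  :value 100}
   {:variable "Attack"  :value 52}  {:variable "Defense" :value 49}
   {:variable "Defense" :value 63}  {:variable "Defense" :value 83}
   {:variable "Defense" :value 123} {:variable "Defense" :value 43}])

(defn mean
  [xs]
  (/ (reduce + xs) (double (count xs))))

;; Group rows by :variable, then average the :value of each group.
(->> long-rows
     (group-by :variable)
     (map (fn [[k rows]] [k (mean (map :value rows))]))
     (into {}))
;=> {"Attack" 69.0, "Defense" 72.2}
```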
If you've ever used Excel you already know about pivot, which is the opposite of melt:
(-> pokemon
pt/head
(pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]})
(pt/pivot {:index "Name" :columns "variable" :values "value"})
show)
variable | Attack | Defense |
---|---|---|
Name | ||
Bulbasaur | 49 | 49 |
Charmander | 52 | 43 |
Ivysaur | 62 | 63 |
Mega Venusaur | 100 | 123 |
Venusaur | 82 | 83 |
What if we have more than one data-frame? We can combine them however we want!
(show
(pt/concatenate
[(pt/head pokemon)
(pt/tail pokemon)]
{:axis 0
:ignore-index true}))
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
5 | 796 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
6 | 797 | Mega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
7 | 798 | Hoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
8 | 799 | Hoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
9 | 800 | Volcanion | Fire | Water | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
Just a second to discuss some options:
- :axis: most panthera operations can be applied either by rows or by columns; we decide which with this keyword, where 0 = rows and 1 = columns
- :ignore-index: panthera works by index; to better understand what kinds of indexes there are and most of their quirks, check Basic concepts

To better understand :axis let's look at another example:
(show
(pt/concatenate
(repeat 2 (pt/head pokemon))
{:axis 1}))
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
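A rough plain-Clojure analogy for the two axes, on vectors of maps: axis 0 stacks rows, axis 1 puts columns side by side. This is only a sketch; pandas aligns on the index rather than on position, so the analogy holds when both sides share the same default index. The names top, bottom and extra are made up for this example.

```clojure
;; A few values copied from the head and tail of the data-frame.
(def top    [{:name "Bulbasaur" :hp 45} {:name "Ivysaur" :hp 60}])
(def bottom [{:name "Hoopa Unbound" :hp 80} {:name "Volcanion" :hp 80}])
(def extra  [{:speed 45} {:speed 60}])

;; Axis 0: more rows, same columns.
(count (into top bottom)) ;=> 4

;; Axis 1: same rows, more columns (rows paired up by position).
(mapv merge top extra)
;=> [{:name "Bulbasaur", :hp 45, :speed 45} {:name "Ivysaur", :hp 60, :speed 60}]
```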
There are many dedicated types, but no worries, there are nice ways to deal with them.
(pt/dtype pokemon)
#              int64
Name          object
Type 1        object
Type 2        object
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object
I guess there isn't much to say about :int64 and :bool, but :object surely looks more interesting. When panthera (NumPy included) finds either strings or something it doesn't know how to deal with, it falls back to the least restrictive type possible, which is :object.
:objects are usually bloated. If we want to save some overhead and it makes sense to treat the values as categorical, we can convert them to :category:
(-> pokemon
(pt/subset-cols "Type 1")
(pt/astype :category)
pt/head)
0    Grass
1    Grass
2    Grass
3    Grass
4     Fire
Name: Type 1, dtype: category
Categories (18, object): [Bug, Dark, Dragon, Electric, ..., Psychic, Rock, Steel, Water]
(-> pokemon
(pt/subset-cols "Speed")
(pt/astype :float)
pt/head)
0    45.0
1    60.0
2    80.0
3    80.0
4    65.0
Name: Speed, dtype: float64
One of the most painful operations for data scientists and engineers is dealing with the unknown: NaN (or nil, Null, etc.).
panthera tries to make this as painless as possible:
(-> pokemon
(pt/subset-cols "Type 2")
(pt/value-counts {:dropna false}))
NaN         386
Flying       97
Ground       35
Poison       34
Psychic      33
Fighting     26
Grass        25
Fairy        23
Steel        22
Dark         20
Dragon       18
Ice          14
Ghost        14
Water        14
Rock         14
Fire         12
Electric      6
Normal        4
Bug           3
Name: Type 2, dtype: int64
We could check for NaN in other ways as well:
(-> pokemon (pt/subset-cols "Type 2") ((juxt pt/hasnans? (comp pt/all? pt/not-na?))))
[true false]
One of the ways to deal with missing data is to just drop rows
(-> pokemon
(pt/dropna {:subset ["Type 2"]})
(pt/subset-cols "Type 2")
(pt/value-counts {:dropna false}))
Flying      97
Ground      35
Poison      34
Psychic     33
Fighting    26
Grass       25
Fairy       23
Steel       22
Dark        20
Dragon      18
Ice         14
Rock        14
Water       14
Ghost       14
Fire        12
Electric     6
Normal       4
Bug          3
Name: Type 2, dtype: int64
But let's say we want to replace missing observations with a flag or value of some kind; we can do that easily with fill-na:
(-> pokemon
(pt/subset-cols "Type 2")
(pt/fill-na :empty)
(pt/head 10))
0    Poison
1    Poison
2    Poison
3    Poison
4     empty
5     empty
6    Flying
7    Dragon
8    Flying
9     empty
Name: Type 2, dtype: object
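The same three ideas (detect, drop, fill) can be sketched in plain Clojure on the first ten Type 2 values. This sketch assumes the missing values are represented as nil on the Clojure side; how NaN actually arrives after a conversion is not covered here.

```clojure
;; First ten Type 2 values, with missing entries written as nil (assumption).
(def type-2-head
  ["Poison" "Poison" "Poison" "Poison" nil nil "Flying" "Dragon" "Flying" nil])

;; Detect: is anything missing, and is everything present?
(boolean (some nil? type-2-head)) ;=> true
(every? some? type-2-head)        ;=> false

;; Drop the missing entries.
(vec (remove nil? type-2-head))
;=> ["Poison" "Poison" "Poison" "Poison" "Flying" "Dragon" "Flying"]

;; Fill the missing entries with a flag.
(mapv #(if (nil? %) "empty" %) type-2-head)
;=> ["Poison" "Poison" "Poison" "Poison" "empty" "empty" "Flying" "Dragon" "Flying" "empty"]
```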
Programmers hate time, that's a fact. panthera tries to make this experience as painless as possible:
(def times
["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"])
(pt/->datetime times)
DatetimeIndex(['1992-01-10', '1992-02-10', '1992-03-10', '1993-03-15', '1993-03-16'], dtype='datetime64[ns]', freq=None)
(-> pokemon
pt/head
(pt/set-index (pt/->datetime times))
show)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1992-01-10 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1992-02-10 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
1992-03-10 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
1993-03-15 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
1993-03-16 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
(-> pokemon
pt/head
(pt/set-index (pt/->datetime times))
(pt/select-rows "1993-03-16" :loc))
#                      5
Name          Charmander
Type 1              Fire
Type 2               NaN
HP                    39
Attack                52
Defense               43
Sp. Atk               60
Sp. Def               50
Speed                 65
Generation             1
Legendary          False
Name: 1993-03-16 00:00:00, dtype: object
(-> pokemon
pt/head
(pt/set-index (pt/->datetime times))
(pt/select-rows (pt/slice "1992-03-10" "1993-03-16") :loc)
show)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1992-03-10 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
1993-03-15 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
1993-03-16 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
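Note that the slice above includes both endpoints. For comparison, here is a sketch of the same inclusive date-range selection in plain Clojure using java.time, which ships with the JVM. The helpers parse-date and between-inclusive? are made up for this example.

```clojure
(import 'java.time.LocalDate)

(def times
  ["1992-01-10" "1992-02-10" "1992-03-10" "1993-03-15" "1993-03-16"])

(defn parse-date
  [^String s]
  (LocalDate/parse s)) ; ISO yyyy-MM-dd strings parse directly

;; True when d falls within [start, end], both ends included.
(defn between-inclusive?
  [^LocalDate start ^LocalDate end ^LocalDate d]
  (not (or (.isBefore d start) (.isAfter d end))))

(->> times
     (map parse-date)
     (filter #(between-inclusive? (parse-date "1992-03-10")
                                  (parse-date "1993-03-16")
                                  %))
     (mapv str))
;=> ["1992-03-10" "1993-03-15" "1993-03-16"]
```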