In this notebook, we explore how to query hierarchical databases.
ENV["LINES"] = 15
include("../citydb_json.jl")
citydb
Dict{AbstractString,Any} with 1 entry: "departments" => Any[Dict{AbstractString,Any}("name"=>"WATER MGMNT","employee…
In hierarchical data model, data is organized in a tree-like structure. In this database, data is stored as a JSON document organized in a 2-level hierarchy:
"departments"
with an array of department objects;"name"
, the name of the department, and "employees"
, an array of employees;"name"
, "surname"
, "position"
, "salary"
describing the employee.Here is a fragment of the dataset:
{
"departments": [
{
"name": "WATER MGMNT",
"employees": [
{
"name": "ALVA",
"surname": "A",
"position": "WATER RATE TAKER",
"salary": 87228
},
...
]
},
...
]
}
We may want to ask some questions about the data. For example,
Even though the raw dataset does not immediately contain any answers to these questions, it has enough information so that the answers could be inferred from the data if we are willing to write some code (we use Julia programming language).
Take a relatively complicated problem:
For each department, find the number of employees with the salary higher than $100k.
It can be solved as follows:
Depts_With_Num_Well_Paid_Empls(data) =
map(d -> Dict(
"name" => d["name"],
"N100k" => length(filter(e -> e["salary"] > 100000, d["employees"]))),
data["departments"])
Depts_With_Num_Well_Paid_Empls(citydb)
35-element Array{Any,1}: Dict{ASCIIString,Any}("name"=>"WATER MGMNT","N100k"=>179) Dict{ASCIIString,Any}("name"=>"POLICE","N100k"=>1493) Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","N100k"=>79) Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","N100k"=>54) Dict{ASCIIString,Any}("name"=>"STREETS & SAN","N100k"=>39) ⋮ Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","N100k"=>2) Dict{ASCIIString,Any}("name"=>"POLICE BOARD","N100k"=>0) Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","N100k"=>12) Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","N100k"=>3) Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","N100k"=>0)
Is it a good solution? Possibly. It is certainly compact, due to our use of map
and filter
to traverse the structure of the database. On the other hand, to write or understand code like this, one needs solid understanding of non-trivial CS concepts such as high-order and anonymous functions. One needs to be a professional programmer.
Is there a way to write this query without use of map
and filter
(or, equivalently, nested loops)? Indeed, there is, but to show how to do it, we need to introduce some new primitives and operations. We start with the notion of JSON combinators.
A JSON combinator is simply a function that maps JSON input to JSON output. Two trivial examples of JSON combinators are:
Const(val)
, which maps each input value to constant value val
.This()
, which copies the input to the output without changes.Const(val) = x -> val
C = Const(42)
C(true), C(42), C([1,2,3])
(42,42,42)
In this example, Const(42)
creates a new constant combinator. It is then applied to various input JSON values, always producing the same output.
This() = x -> x
I = This()
I(true), I(42), I([1,2,3])
(true,42,[1,2,3])
Similarly, This()
creates a new identity combinator. We test it with different input values to assure ourselves that it does not change the input.
Notice the pattern:
In short, by designing a collection of useful combinators, we are creating a query language (embedded in the host language, but this is immaterial).
Now let us define a more interesting combinator. Field(name)
extracts a field value from a JSON object.
Field(name) = x -> x[name]
Field (generic function with 1 method)
Salary = Field("salary")
Salary(Dict("name" => "RAHM", "surname" => "E", "position" => "MAYOR", "salary" => 216210))
216210
Here, to demonstrate field extractors, we defined Salary
, a combinator that extracts value of field "salary"
from the input JSON object.
To build interesting queries, we need a way to construct complex combinators from primitives. Let us define composition (F >> G)
of combinators F
and G
that ties F
and G
by sending the output of F
to the input of G
.
Our first, naive attempt to implement composition is as follows:
import Base: >>
(F >> G) = x -> G(F(x))
>> (generic function with 86 methods)
We can traverse the structure of hierarchical data by chaining field extractors with the composition operator.
$$ \textbf{departments}\gg \quad \begin{cases} \gg\textbf{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$For example, let us find the names of all departments. We can do it by composing extractors for fields "departments"
and "name"
.
Departments = Field("departments")
Name = Field("name")
Dept_Names = Departments >> Name
Dept_Names(citydb)
LoadError: indexing Array{Any,1} with types Tuple{ASCIIString} is not supported while loading In[8], in expression starting on line 5 in error at ./error.jl:21 in getindex at abstractarray.jl:483 in anonymous at In[5]:1 in anonymous at In[7]:3 [inlined code] from essentials.jl:114
What is going on? We expected to get a list of department names, but instead we got an error.
Here is a problem. With the current definition of the >>
operator, expression
(Departments >> Name)(citydb)
is translated to
citydb["departments"]["name"]
But this fails because citydb["departments"]
is a array and thus doesn't have a field called "name"
.
Let us demonstrate the behavior of >>
on the duplicating combinator. Combinator Dup
duplicates its input, that is, for any input value x
, it produces an array [x, x]
. See what happens when we compose Dup
with itself, once or several times:
Dup = x -> Any[x, x]
Dup(0), (Dup >> Dup)(0), (Dup >> Dup >> Dup)(0)
(Any[0,0],Any[Any[0,0],Any[0,0]],Any[Any[Any[0,0],Any[0,0]],Any[Any[0,0],Any[0,0]]])
We need composition (F >> G)
to be smarter. When F
produces an array, the composition should apply G
to each element of the array. In addition, if G
also produces array values, (F >> G)
concatenates all outputs to produce a single array value.
(F >> G) = x -> _flat(_map(G, F(x)))
_flat(z) =
isa(z, Array) ? foldr(vcat, [], z) : z
_map(G, y) =
isa(y, Array) ? map(_expand, map(G, y)) : G(y)
_expand(z_i) =
isa(z_i, Array) ? z_i : z_i != nothing ? [z_i] : []
_expand (generic function with 1 method)
Let us test the updated >>
operator with Dup
again. We see that the output arrays are now flattened:
Dup(0), (Dup >> Dup)(0), (Dup >> Dup >> Dup)(0)
(Any[0,0],Any[0,0,0,0],Any[0,0,0,0,0,0,0,0])
We can get back to our original example, finding the names of all departments. Now we get the result we expected.
Dept_Names = Departments >> Name
Dept_Names(citydb)
35-element Array{Any,1}: "WATER MGMNT" "POLICE" "GENERAL SERVICES" "CITY COUNCIL" "STREETS & SAN" ⋮ "BOARD OF ETHICS" "POLICE BOARD" "BUDGET & MGMT" "ADMIN HEARNG" "LICENSE APPL COMM"
Similarly, we can list values of any attribute in the hierarchy tree. For example, let us find the names of all employees.
$$ \textbf{departments}\gg \quad \begin{cases} \text{name} \\ \gg\textbf{employees}\gg \quad \begin{cases} \gg\textbf{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$We can do it by composing field extractors on the path from the root of the hierarchy to the "name"
attribute:
Employees = Field("employees")
Empl_Names = Departments >> Employees >> Name
Empl_Names(citydb)
32181-element Array{Any,1}: "ELVIA" "VICENTE" "MUHAMMAD" "GIRLEY" "DILAN" ⋮ "NANCY" "DARCI" "THADDEUS" "RACHENETTE" "MICHELLE"
Field extractors and composition give us a way to traverse the hierarchy tree. We still need a way to summarize data.
Consider a query: find the number of departments. To write it down, we need a combinator that can count the number of elements in an array.
Here is our first attempt to implement it:
Count() = x -> length(x)
Count (generic function with 1 method)
We compose it with a combinator that generates an array of departments to calculate the number of departments:
Num_Depts = Departments >> Count()
Num_Depts(citydb)
35-element Array{Any,1}: 2 2 2 2 2 ⋮ 2 2 2 2 2
But that's not what we expected! Here is the problem: the composition operator does not let Count()
see the whole array. Instead, Departments >> Count()
submits each array element to Count()
one by one and then concatenates the outputs of Count()
. Count()
, when its input is a JSON object, returns the number of fields in the object (in this case, 2 fields for all department objects).
The right way to implement Count()
is to add an array-producing combinator as a parameter:
Count(F) = x -> length(F(x))
Count (generic function with 2 methods)
Num_Depts = Count(Departments)
Num_Depts(citydb)
35
How to use composition with Count()
correctly? Here is an example: show the number of employees for each department. Consider this: number of employees is a (derived) property of each department, which suggests us to compose two combinators: one generating department objects and the other calculating the number of employees for a given department. We get:
Num_Empls_Per_Dept = Departments >> Count(Employees)
Num_Empls_Per_Dept(citydb)
35-element Array{Any,1}: 1848 13570 924 397 2090 ⋮ 9 2 43 39 1
On the other hand, if we'd like to calculate the total number of employees, the parameter of Count()
should be the combinator that generates all the employees:
Num_Empls = Count(Departments >> Employees)
Num_Empls(citydb)
32181
We could add other summarizing or aggregate combinators. For example, let us define a combinator that finds the maximum value in an array.
Max(F) = x -> maximum(F(x))
Max (generic function with 1 method)
Aggregate combinators could be combined to answer complex questions. For example, let us find the maximum number of employees per department. We already have a combinator that generates the number of employees for each department, all is left is to apply Max()
.
Max_Empls_Per_Dept = Max(Num_Empls_Per_Dept) # Max(Departments >> Count(Employees))
Max_Empls_Per_Dept(citydb)
13570
We learned how to traverse and summarize data. Let us show how to create new structured data.
Combinator Select()
constructs JSON objects. It is parameterized with a list of field names and constructors.
Select(fields...) =
x -> Dict(map(f -> f.first => f.second(x), fields))
Select (generic function with 1 method)
For each input, Select()
constructs a new JSON object with field values generated by field constructors applied to the input.
Here is a simple example of Select()
summarizing the input array:
S = Select("len" => Count(This()), "max" => Max(This()))
S([10, 20, 30])
Dict{ASCIIString,Int64} with 2 entries: "max" => 30 "len" => 3
Similarly, we can summarize any hierarchical dataset. Let us modify the query that finds the number of employees for each department. Instead of a raw list of numbers, we will generate a table with the name of the department and its size (the number of employees):
Depts_With_Size =
Departments >> Select(
"name" => Name,
"size" => Count(Employees))
Depts_With_Size(citydb)
35-element Array{Any,1}: Dict{ASCIIString,Any}("name"=>"WATER MGMNT","size"=>1848) Dict{ASCIIString,Any}("name"=>"POLICE","size"=>13570) Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","size"=>924) Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","size"=>397) Dict{ASCIIString,Any}("name"=>"STREETS & SAN","size"=>2090) ⋮ Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","size"=>9) Dict{ASCIIString,Any}("name"=>"POLICE BOARD","size"=>2) Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","size"=>43) Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","size"=>39) Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","size"=>1)
This query could easily be expanded to add more information about the department. For that, we only need to add extra field definitions to the Select()
clause. Notably, change in one field constructor cannot in any way affect the values of the other fields.
Let us additionally determine the top salary for each department:
Depts_With_Size_And_Max_Salary =
Departments >> Select(
"name" => Name,
"size" => Count(Employees),
"max_salary" => Max(Employees >> Salary))
Depts_With_Size_And_Max_Salary(citydb)
35-element Array{Any,1}: Dict{ASCIIString,Any}("name"=>"WATER MGMNT","max_salary"=>169512,"size"=>1848) Dict{ASCIIString,Any}("name"=>"POLICE","max_salary"=>260004,"size"=>13570) Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","max_salary"=>157092,"size"=>924) Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","max_salary"=>160248,"size"=>397) Dict{ASCIIString,Any}("name"=>"STREETS & SAN","max_salary"=>157092,"size"=>2090) ⋮ Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","max_salary"=>131688,"size"=>9) Dict{ASCIIString,Any}("name"=>"POLICE BOARD","max_salary"=>97728,"size"=>2) Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","max_salary"=>169992,"size"=>43) Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","max_salary"=>156420,"size"=>39) Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","max_salary"=>69888,"size"=>1)
Remember the problem we stated in the beginning: find the number of employees with the salary higher than $100k. We have almost all pieces we need to construct a solution of this problem. One piece that appears to be missing is a way to refine data. We need a combinator that, given a set of values and a predicate, produces the values that satisfy the predicate condition.
Here is how we can implement it:
Sieve(P) = x -> P(x) ? x : nothing
Sieve (generic function with 1 method)
Combinator Sieve(P)
is parameterized with a predicate combinator P
. A predicate is a combinator that, for any input, returns true
or false
. For example, a predicate combinator (F < G)
with two parameters F
and G
returns, for any input x
, the result of comparison F(x) < G(x)
.
Let us implement common predicate (and also some arithmetic) combinators:
import Base: >, >=, <, <=, ==, !=, !, &, |, +, -, /, %
(>)(F::Function, G::Function) = x -> F(x) > G(x)
(>)(F::Function, n::Number) = F > Const(n)
(>=)(F::Function, G::Function) = x -> F(x) >= G(x)
(>=)(F::Function, n::Number) = F >= Const(n)
(<)(F::Function, G::Function) = x -> F(x) < G(x)
(<)(F::Function, n::Number) = F < Const(n)
(<=)(F::Function, G::Function) = x -> F(x) <= G(x)
(<=)(F::Function, n::Number) = F <= Const(n)
(==)(F::Function, G::Function) = x -> F(x) == G(x)
(==)(F::Function, n::Number) = F == Const(n)
(!=)(F::Function, G::Function) = x -> F(x) != G(x)
(!=)(F::Function, n::Number) = F != Const(n)
(!)(F::Function) = x -> !F(x)
(&)(F::Function, G::Function) = x -> F(x) && G(x)
(|)(F::Function, G::Function) = x -> F(x) || G(x)
(+)(F::Function, G::Function) = x -> F(x) + G(x)
(+)(F::Function, n::Number) = F + Const(n)
(-)(F::Function, G::Function) = x -> F(x) - G(x)
(-)(F::Function, n::Number) = F - Const(n)
(/)(F::Function, G::Function) = x -> F(x) / G(x)
(/)(F::Function, n::Number) = F / Const(n)
(%)(F::Function, G::Function) = x -> F(x) % G(x)
(%)(F::Function, n::Number) = F % Const(n)
rem (generic function with 133 methods)
Sieve(P)
tests its input on the predicate condition P
. If the input satisfies the condition, it is returned without changes. Otherwise, nothing
is returned.
Here is a trivial example to demonstrate how Sieve()
works:
Take_Odd = Sieve(This() % 2 == 1)
Take_Odd(5), Take_Odd(10)
(5,nothing)
When the composition operator accumulates values for array output, it drops nothing
values. Thus, in a composition (F >> Sieve(P))
with an array-generating combinator F
, Sieve(P)
filters the elements of the array.
Let us use this feature to list the departments with more than 1000 employees. We already defined a combinator producing departments with the number of employees, we just need to filter its output:
Size = Field("size")
Large_Depts = Depts_With_Size >> Sieve(Size > 1000)
Large_Depts(citydb)
7-element Array{Any,1}: Dict{ASCIIString,Any}("name"=>"WATER MGMNT","size"=>1848) Dict{ASCIIString,Any}("name"=>"POLICE","size"=>13570) Dict{ASCIIString,Any}("name"=>"STREETS & SAN","size"=>2090) Dict{ASCIIString,Any}("name"=>"AVIATION","size"=>1344) Dict{ASCIIString,Any}("name"=>"FIRE","size"=>4875) Dict{ASCIIString,Any}("name"=>"OEMC","size"=>1135) Dict{ASCIIString,Any}("name"=>"TRANSPORTN","size"=>1200)
Similarly, we can list positions of employees with salary higher than 200k:
Position = Field("position")
Very_Well_Paid_Posns =
Departments >> Employees >> Sieve(Salary > 200000) >> Position
Very_Well_Paid_Posns(citydb)
3-element Array{Any,1}: "SUPERINTENDENT OF POLICE" "FIRE COMMISSIONER" "MAYOR"
With Sieve()
defined, we are finally able to answer the original question using combinators:
For each department, find the number of employees with salary higher than 100k.
Better_Depts_With_Num_Well_Paid_Empls =
Departments >> Select(
"name" => Name,
"N100k" => Count(Employees >> Sieve(Salary > 100000)))
Better_Depts_With_Num_Well_Paid_Empls(citydb)
35-element Array{Any,1}: Dict{ASCIIString,Any}("name"=>"WATER MGMNT","N100k"=>179) Dict{ASCIIString,Any}("name"=>"POLICE","N100k"=>1493) Dict{ASCIIString,Any}("name"=>"GENERAL SERVICES","N100k"=>79) Dict{ASCIIString,Any}("name"=>"CITY COUNCIL","N100k"=>54) Dict{ASCIIString,Any}("name"=>"STREETS & SAN","N100k"=>39) ⋮ Dict{ASCIIString,Any}("name"=>"BOARD OF ETHICS","N100k"=>2) Dict{ASCIIString,Any}("name"=>"POLICE BOARD","N100k"=>0) Dict{ASCIIString,Any}("name"=>"BUDGET & MGMT","N100k"=>12) Dict{ASCIIString,Any}("name"=>"ADMIN HEARNG","N100k"=>3) Dict{ASCIIString,Any}("name"=>"LICENSE APPL COMM","N100k"=>0)
Compare it with the original solution. The new one reads much better!
Depts_With_Num_Well_Paid_Empls(data) =
map(d -> Dict(
"name" => d["name"],
"N100k" => length(filter(e -> e["salary"] > 100000, d["employees"]))),
data["departments"])
Depts_With_Num_Well_Paid_Empls (generic function with 1 method)
We achieved our goal of sketching (a prototype of) a query language for hierarchical databases. Let us explore how it could be developed further. One possible way to improve it is by adding query parameters.
Consider a problem: find the number of employees whose annual salary exceeds 200k. We have all the tools to solve it:
Num_Very_Well_Paid_Empls =
Count(Departments >> Employees >> Sieve(Salary >= 200000))
Num_Very_Well_Paid_Empls(citydb)
3
Now, imagine that we'd like to find the number of employees with salary in a certain range, but we don't know the range at the time we construct the query. Instead, we want to specify the range when we execute the query.
Let us introduce a query context, a collection of parameters and their values. We'd like the query context to travel with the input, where each combinator could access it if necessary. Thus, we have an updated definition of a JSON combinator: a function that maps JSON input and query context to JSON output.
We need to update existing combinators to make them context-aware:
Const(val) = (x, ctx...) -> val
This() = (x, ctx...) -> x
(F >> G) = (x, ctx...) -> _flat(_map(G, F(x, ctx...), ctx...))
_map(G, y, ctx...) =
isa(y, Array) ? map(_expand, map(yi -> G(yi, ctx...), y)) : G(y, ctx...)
Field(name) = (x, ctx...) -> x[name]
Select(fields...) =
(x, ctx...) -> Dict(map(f -> f.first => f.second(x, ctx...), fields))
Count(F) = (x, ctx...) -> length(F(x, ctx...))
Max(F) = (x, ctx...) -> maximum(F(x, ctx...))
Sieve(P) = (x, ctx...) -> P(x, ctx...) ? x : nothing
(>)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) > G(x, ctx...)
(>=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) >= G(x, ctx...)
(<)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) < G(x, ctx...)
(<=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) <= G(x, ctx...)
(==)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) == G(x, ctx...)
(!=)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) != G(x, ctx...)
(!)(F::Function) = (x, ctx...) -> !F(x, ctx...)
(&)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) && G(x, ctx...)
(|)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) || G(x, ctx...)
(+)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) + G(x, ctx...)
(-)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) - G(x, ctx...)
(/)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) / G(x, ctx...)
(%)(F::Function, G::Function) = (x, ctx...) -> F(x, ctx...) % G(x, ctx...)
rem (generic function with 133 methods)
Next, let us add combinator Var(name)
that extracts the value of a parameter from the query context.
Var(name) = (x, ctx...) -> Dict(ctx)[name]
Var (generic function with 1 method)
Now we can make parameterized queries. Find the number of employees with salary in a certain range:
Min_Salary = Var("min_salary")
Max_Salary = Var("max_salary")
Departments = Field("departments")
Employees = Field("employees")
Salary = Field("salary")
Num_Empls_By_Salary =
Count(
Departments >>
Employees >>
Sieve((Salary >= Min_Salary) & (Salary < Max_Salary)))
Num_Empls_By_Salary(citydb, "min_salary" => 100000, "max_salary" => 200000)
3916
Use of context is not limited to query parameters. We can also update the context dynamically.
Consider a problem: find the employee with the highest salary.
It can be solved in two queries. First, find the highest salary:
Max_Salary = Max(Departments >> Employees >> Salary)
Max_Salary(citydb)
260004
Second, find the employee with the given salary:
The_Salary = Var("salary")
Empl_With_Salary = Departments >> Employees >> Sieve(Salary == The_Salary)
Empl_With_Salary(citydb, "salary" => 260004)
1-element Array{Any,1}: Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004)
We need to automate this sequence of operations. Specifically, we take a value calculated by one combinator, assign it to some context parameter, and then evaluate the other combinator in the updated context. That's what Given()
combinator does:
Given(F, vars...) =
(x, ctx...) ->
let ctx = (ctx..., map(v -> v.first => v.second(x, ctx...), vars)...)
F(x, ctx...)
end
Given (generic function with 1 method)
Combining Max_Salary
and Empl_With_Salary
using Given
, we get:
Empl_With_Max_Salary = # Given(Empl_With_Salary, "salary" => Max_Salary)
Given(
Departments >> Employees >> Sieve(Salary == The_Salary),
"salary" => Max(Departments >> Employees >> Salary))
Empl_With_Max_Salary(citydb)
1-element Array{Any,1}: Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004)
This is not just a convenience feature. Indeed, let us change this query to find the highest paid employee for each department. To implement it, we need to pull Departments
from the Given()
clause:
Empls_With_Max_Salary_By_Dept =
Departments >> Given(
Employees >> Sieve(Salary == The_Salary),
"salary" => Max(Employees >> Salary))
Empls_With_Max_Salary_By_Dept(citydb)
35-element Array{Any,1}: Dict{AbstractString,Any}("name"=>"THOMAS","surname"=>"P","position"=>"COMMISSIONER OF WATER MGMT","salary"=>169512) Dict{AbstractString,Any}("name"=>"GARRY","surname"=>"M","position"=>"SUPERINTENDENT OF POLICE","salary"=>260004) Dict{AbstractString,Any}("name"=>"DAVID","surname"=>"R","position"=>"COMMISSIONER OF FLEET & FACILITY MANAGEMENT","salary"=>157092) Dict{AbstractString,Any}("name"=>"MARLA","surname"=>"K","position"=>"CHIEF ADMINISTRATIVE OFFICER","salary"=>160248) Dict{AbstractString,Any}("name"=>"CHARLES","surname"=>"W","position"=>"COMMISSIONER OF STREETS AND SANITATION","salary"=>157092) ⋮ Dict{AbstractString,Any}("name"=>"STEVEN","surname"=>"B","position"=>"EXECUTIVE DIR - BOARD OF ETHICS","salary"=>131688) Dict{AbstractString,Any}("name"=>"MAX","surname"=>"C","position"=>"EXECUTIVE DIR - POLICE BOARD","salary"=>97728) Dict{AbstractString,Any}("name"=>"ALEXANDRA","surname"=>"H","position"=>"BUDGET DIR","salary"=>169992) Dict{AbstractString,Any}("name"=>"PATRICIA","surname"=>"J","position"=>"DIR OF ADMINISTRATIVE HEARINGS","salary"=>156420) Dict{AbstractString,Any}("name"=>"MICHELLE","surname"=>"G","position"=>"STAFF ASST","salary"=>69888)
Consider a problem: find the top salary for each department. This is an easy one:
Max_Salary_By_Dept = Departments >> Max(Employees >> Salary)
Max_Salary_By_Dept(citydb)
35-element Array{Any,1}: 169512 260004 157092 160248 157092 ⋮ 131688 97728 169992 156420 69888
Now change it to: find the top salary for each position. We can't solve it with our current set of combinators. Why?
Look at the database hierarchy diagram:
$$ \text{departments} \quad \begin{cases} \text{name} \\ \text{employees} \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \text{salary} \end{cases} \end{cases} $$The structure of the first query (top salary for each department) fits the structure of the database:
$$ \textbf{departments}\gg \quad \begin{cases} \text{name} \\ \gg\textbf{employees}\gg \quad \begin{cases} \text{name} \\ \text{surname} \\ \text{position} \\ \gg\textbf{salary} \end{cases} \end{cases} $$The structure of the second query (top salary for each position) violates this structure:
$$ \text{departments} \quad \begin{cases} \text{name} \\ \textbf{employees}\lessgtr \quad \begin{cases} \text{name} \\ \text{surname} \\ \ll\textbf{position} \\ \gg\textbf{salary} \end{cases} \end{cases} $$This is not the only limitation. Let us not forget that real databases are decidedly non-hierarchical. For example, this is the database schema (designed by Charles Tirrell) of our flagship product RexStudy. No hierarchy in sight!
As a conclusion, combinators are awesome for querying data as long as:
Otherwise, we are out of luck...
... Or are we?