SERHAT ÇEVİKEL
First let's start our PostgreSQL server again:
sudo service postgresql start
In order to check whether the server is responding by:
psql -U postgres -c "\l"
Now let's create our environment variables and extract the zipped tsv files if they are not yet:
datadir=~/data
imdbdir=$datadir/imdb
mkdir -p $imdbdir/tsv2
find $imdbdir/tsv -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
if [ ! -e ${imdbdir}/tsv2/\${basenm%.gz} ];
then
gunzip -c {} > \
${imdbdir}/tsv2/\${basenm%.gz};
fi
"
Now we will go through two similar shell tools to use sql syntax on csv or tsv files:
The source page for q project is:
q - Text as Data
q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).
q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.
q's web site is http://harelba.github.io/q/. It contains everything you need to download and use q in no time.
The tsv files are as follows:
tldr q
ls $imdbdir/tsv2
A very basic query taking stdin as input:
cat $imdbdir/tsv2/title.ratings.tsv | \
q -t -H "SELECT * \
FROM - \
WHERE tconst = 'tt0000001';"
Or use the filename inside FROM clause:
q -t -H "SELECT * \
FROM $imdbdir/tsv2/title.ratings.tsv \
WHERE tconst = 'tt0000001';"
Part of the csvkit suite:
cat $imdbdir/tsv2/title.ratings.tsv | csvsql -t --query "SELECT * \
FROM stdin \
WHERE tconst = 'tt0000001';"
In order to use sql magic on Python3 kernel:
%load_ext sql
%sql postgres://postgres@localhost/imdb2
You can remember the structure of the data by either typing:
psql -U postgres -d imdb2 -c "\d+ public.*"
Or viewing the file:
cat ~/imdb_database
Now let's remember the query from session 2:
Filter for movies titled Godfather.\*Part
Genres include drama and exclude comedy
Start year not after 1990
People with primary profession actor|actress
First join titles and principal cast on title id's (tconst)
And then join principal cast and name basics on name id's (nconst)
Filter only for actors and actresses
And sort on first names (ascending) then title years
%%sql
SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes, tb.genres,
tp.principalcast,
nb.primaryname, nb.birthyear, nb.deathyear, nb.primaryprofession
FROM title_basics tb
LEFT JOIN title_principals_melt tp USING (tconst)
LEFT JOIN name_basics nb ON tp.principalcast=nb.nconst
WHERE tb.originaltitle ~ 'Godfather.*Part'
AND tb.genres ~ '(?i)drama'
AND NOT tb.genres ~ '(?i)comedy'
AND tb.startyear <= 1990
AND nb.primaryprofession ~'actor|actress'
ORDER BY nb.primaryname, tb.startyear DESC;
We can design this query so that smaller steps are combined:
%%sql
WITH basics_principles AS
(
SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes,
tb.genres, tp.principalcast
FROM title_basics tb
LEFT JOIN title_principals_melt tp
USING (tconst)
WHERE tb.originaltitle ~ '.*Godfather.*Part.*'
AND tb.genres ~ '(?i)drama'
AND NOT tb.genres ~ '(?i)comedy'
AND tb.startyear <= 1990
)
SELECT bp.tconst, bp.originaltitle, bp.startyear, bp.runtimeminutes,
bp.genres, bp.principalcast, nb.primaryname,
nb.birthyear, nb.deathyear, nb.primaryprofession
FROM basics_principles bp LEFT JOIN name_basics nb ON bp.principalcast=nb.nconst
WHERE nb.primaryprofession ~'actor|actress'
ORDER BY nb.primaryname, bp.startyear DESC;
This was already easily implemented without CTE's.
Now a harder example which is much challenging to implement without CTE's:
Return all fields from title_basics for "movies" which feature Martin Scorsese as director and Robert De Niro as actor, sorted by ascending startyear
%%sql
-- first get the nconst of Martin Scorsese
WITH ms_id AS
(
SELECT nb.nconst
FROM name_basics nb
WHERE nb.primaryname = 'Martin Scorsese'
AND nb.primaryprofession ~ 'director'
),
-- then get the titles directed by Martin Scorsese
ms_titles AS
(
SELECT tc.tconst
FROM ms_id LEFT JOIN title_crew tc ON ms_id.nconst=tc.directors
),
-- and get the nconst of Robert De Niro
rdn_id AS
(
SELECT nb.nconst
FROM name_basics nb
WHERE nb.primaryname = 'Robert De Niro'
),
-- and get the titles by Robert De Niro
rdn_titles AS
(
SELECT tp.tconst
FROM rdn_id LEFT JOIN title_principals_melt tp ON rdn_id.nconst=tp.principalcast
)
-- and last, get the intersection of titles by De Niro and Scorsese
SELECT tb.*
FROM ms_titles INNER JOIN rdn_titles ON ms_titles.tconst=rdn_titles.tconst
LEFT JOIN title_basics tb ON rdn_titles.tconst=tb.tconst
WHERE tb.titletype = 'movie'
ORDER BY tb.startyear
Now another:
%%sql
-- get the nconst of pacino
WITH pacino AS
(
SELECT nconst
FROM name_basics nb
WHERE nb.primaryname = 'Al Pacino'
AND nb.birthyear = 1940
),
-- get the titles of pacino
t_ids AS
(
SELECT tp.tconst
FROM pacino LEFT JOIN title_principals_melt tp ON pacino.nconst=tp.principalcast
)
-- get the details of those titles
SELECT tb.*, tr.averagerating
FROM t_ids LEFT JOIN title_basics tb ON t_ids.tconst=tb.tconst
LEFT JOIN title_ratings tr ON tb.tconst=tr.tconst
WHERE tb.titletype = 'movie'
AND tr.averagerating IS NOT NULL
ORDER BY tr.averagerating DESC
EXERCISE 1:
Now remember the last query from session 3, in which we tried to get the oldest cast of each filtered movie.
With CTE's we can do it in a more efficient and neat way:
The version with subquery was as such:
SELECT mb.primarytitle, nb.birthyear, tb.startyear, tb.startyear - nb.birthyear as age,nb.primaryname
FROM
(
SELECT tb.primarytitle, max(nb.birthyear) as maxbirthyear, tb.tconst
FROM title_basics tb
LEFT JOIN title_ratings tr USING (tconst)
LEFT JOIN title_principals_melt tp USING (tconst)
LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast
WHERE tr.averagerating > 8.8
AND tr.numvotes > 500000
AND NOT tb.titletype = 'tvSeries'
GROUP BY tb.primarytitle, tb.tconst
) mb
LEFT JOIN title_basics tb USING (tconst)
LEFT JOIN title_principals_melt tp USING (tconst)
LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast
WHERE nb.birthyear = mb.maxbirthyear
and the basic query listing the primary casts was as such:
SELECT tb.primarytitle, tb.startyear, nb.primaryname, nb.birthyear, tr.averagerating
FROM title_basics tb
LEFT JOIN title_ratings tr USING (tconst)
LEFT JOIN title_principals_melt tp USING (tconst)
LEFT JOIN name_basics nb ON nb.nconst=tp.principalcast
WHERE tr.averagerating > 8.8
AND tr.numvotes > 500000
AND NOT tb.titletype = 'tvSeries'
ORDER BY nb.primaryname, tb.primarytitle;
%%sql
-- statement here
psql -U postgres -d imdb2 <<EOF
-- or you may write your statement here
EOF
SOLUTION 1:
pass1=3082
encrypt=$(cat encrypt_04_01)
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now we will see how we can access to a PostgreSQL database from R
First load necessary package
library(RPostgreSQL)
Load PostgreSQL driver
drvv <- dbDriver("PostgreSQL")
drvv
Create a connection to a database using the driver
con <- dbConnect(drvv,
dbname = "imdb2",
host = "localhost",
port = 5432,
user = "postgres")
con
Get list of table names
table_names <- dbGetQuery(con,
"SELECT table_name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
AND table_schema = 'public'
ORDER BY table_name
"
)
table_names
class(table_names)
Or more easily with:
dbListTables(con)
Good! The query returns a data frame
Now we can loop through tables to automatize queries
List tables:
table_names_vec <- dbListTables(con)
table_names_vec
class(table_names_vec)
This yields a character vector, not a data frame
Get row counts
# create a query template that we can change the parameters of
# each "%s" stands for a variable that we will manipulate with sprintf
query_text <- "SELECT count (*) FROM %s"
# create an empty vector, the length of the table names
# we will populate this vector with the row counts from tables
row_counts <- rep(NA, length(table_names_vec))
# update the names of the vector with table names
names(row_counts) <- table_names_vec
# for across table names indices (not the names themselves)
for (tbl_ind in seq_along(table_names_vec))
{
# replace the %s pointer in query text with table name
current_query <- sprintf(query_text, table_names_vec[tbl_ind])
# run the query and get results
table_name <- dbGetQuery(con, current_query)
# assign the query result to vector
row_counts[tbl_ind] <- table_name[[1]]
}
return(row_counts)
Let's import a table as a data frame into R
title_ratings_df <- dbGetQuery(con, "SELECT * from title_ratings")
Get the head, attributes and summary of the data frame
head(title_ratings_df)
class(title_ratings_df)
Now an easier way to read data into R
title_ratings_df_b <- dbReadTable(con, "title_ratings")
See whether they are identical objects
identical(title_ratings_df, title_ratings_df_b)
So, they are identical
What is the object size of each data frame?
format(object.size(title_ratings_df), units = "auto")
format(object.size(title_ratings_df_b), units = "auto")
They take too much space. Now let's get rid of one of them
rm(title_ratings_df_b)
New get the classes of all columns
str(title_ratings_df)
Let's query title_basics as such:
Let's say, take rows in which
name contains "Star Wars",
not an adult movie
title_type is a movie and
start year is between 1977 and 2016
title_basics_df2 <- dbGetQuery(con,
"SELECT *
FROM title_basics
WHERE primarytitle ~* 'star.*wars'
AND isadult = false
AND titletype = 'movie'
AND startyear BETWEEN 1977 AND 2016"
)
title_basics_df2
str(title_basics_df2)
Did not like NA's in tconst, let's enhance it
title_basics_df3 <- title_basics_df2[
!is.na(title_basics_df2$tconst),
]
title_basics_df3
Much better!
Now let's write this data frame into our imdb2 database as a new table
dbWriteTable(con, "star_wars", value = title_basics_df3, append = TRUE, row.names = FALSE)
Check whether table is created:
dbExistsTable(con, "star_wars")
And read from the database again
star_wars <- dbReadTable(con, "star_wars")
star_wars
See, the only difference seems to be the rownames which we excluded on purpose
Apart from that, let's check whet they are both identical
rownames(title_basics_df3) <- NULL
title_basics_df3
identical(title_basics_df3, star_wars)
Yes they are identical!
Let's remove the table
dbRemoveTable(con, "star_wars")
Check whether it still exists
dbExistsTable(con, "star_wars")
Now let's create the table again
dbWriteTable(con, "star_wars", value = title_basics_df3, append = TRUE, row.names = FALSE)
Check that it exists:
dbExistsTable(con, "star_wars")
Get its rowcount:
dbGetQuery(con, "SELECT count (*) FROM star_wars")
And repeat the data write statement, with append option
dbWriteTable(con, "star_wars", value = title_basics_df3, append = TRUE, row.names = FALSE)
Get the rowcount again
dbGetQuery(con, "SELECT count (*) FROM star_wars")
We did not define a primary key, so postgresql server allowed duplicates to be inserted in to the database
Now delete and recreate table
dbRemoveTable(con, "star_wars")
dbExistsTable(con, "star_wars")
dbWriteTable(con, "star_wars", value = title_basics_df3, append = TRUE, row.names = FALSE)
dbExistsTable(con, "star_wars")
And add a primary key constraint
dbGetQuery(con,
"
ALTER TABLE star_wars
ADD CONSTRAINT tconst_pk_10
PRIMARY KEY (tconst);
")
Let's try to import the data again
dbWriteTable(con, "star_wars", value = title_basics_df3, append = TRUE, row.names = FALSE)
See that:
*could not Retrieve the result : ERROR: duplicate key value violates unique constraint "tconst_pk_10*
Let's check the row count again:
dbGetQuery(con, "SELECT count (*) FROM star_wars")
OK, we duplicates were really not allowed!
Let's close connection
dbDisconnect(con)
And unload the driver
dbUnloadDriver(drvv)
First let's load the package:
library(RPostgreSQL)
library(sqldf)
And set the options:
options(sqldf.RPostgreSQL.user ="postgres",
sqldf.RPostgreSQL.dbname ="imdb2",
sqldf.RPostgreSQL.host ="localhost",
sqldf.RPostgreSQL.port =5432)
Now let's run a query to get average runtimeminutes from title_basics and group them and order them by startyear
query = "SELECT avg(runtimeminutes) runtimeminutes, startyear
FROM title_basics
GROUP BY startyear
ORDER BY startyear;"
av_runtime <- sqldf(query)
av_runtime
class(av_runtime)
It returns a data frame
Now subset years between 1940 and 2016
av_runtime_subset <- av_runtime[av_runtime$startyear %in% 1940:2016,]
av_runtime_subset
Check whether any NA's exist
which(is.na(av_runtime_subset[[1]]))
Let's plot the data
runtime_ts <- ts(av_runtime_subset[[1]],
start = av_runtime_subset[1,2],
end = av_runtime_subset[nrow(av_runtime_subset),2])
runtime_ts
plot(runtime_ts)
Well that data may have many different types of titles including shorts, etc
Let's first subset for movies
query = "SELECT avg(runtimeminutes) runtimeminutes, startyear
FROM title_basics
WHERE titletype = 'movie'
GROUP BY startyear
ORDER BY startyear;"
av_runtime <- sqldf(query)
av_runtime
av_runtime_subset <- av_runtime[av_runtime$startyear %in% 1940:2016,]
av_runtime_subset
runtime_ts <- ts(av_runtime_subset[[1]],
start = av_runtime_subset[1,2],
end = av_runtime_subset[nrow(av_runtime_subset),2])
runtime_ts
plot(runtime_ts)
library(RPostgreSQL)
drvv <- dbDriver("PostgreSQL")
drvv
con <- dbConnect(drvv,
dbname = "imdb2",
host = "localhost",
port = 5432,
user = "postgres")
con
Now the task is:
query <- "
WITH hitch AS
(
SELECT nconst
FROM name_basics nb
WHERE nb.primaryname = 'Alfred Hitchcock'
AND nb.primaryprofession ~ 'director'
),
t_ids AS
(
SELECT tp.tconst
FROM hitch LEFT JOIN title_crew tp ON hitch.nconst=tp.directors
)
SELECT startyear, count(*), avg(tr.averagerating) avrate
FROM t_ids LEFT JOIN title_basics tb ON t_ids.tconst=tb.tconst
LEFT JOIN title_ratings tr ON tb.tconst=tr.tconst
WHERE tb.titletype = 'movie'
AND tr.averagerating IS NOT NULL
GROUP BY startyear
ORDER BY startyear
"
hitch_df <- dbGetQuery(con, query)
hitch_df
In order to get an average rating for each decade weighted with the count of movies for each year, it is good to have a column for sum of ratings for a year:
hitch_df$sumrate <- hitch_df$count * hitch_df$avrate
hitch_df
Now let's get a column for decades:
hitch_df$dec <- (hitch_df$startyear %/% 10) * 10
hitch_df
And let's aggregate for each decade the sums of sumrate values:
hitch_agg <- aggregate(hitch_df[,c(2,4)],
by = list(hitch_df$dec),
FUN = sum)
hitch_agg
And let's get the average rating for each decade, deleting the sumrate column:
hitch_agg$avrate <- hitch_agg$sumrate / hitch_agg$count
hitch_agg$sumrate <- NULL
hitch_agg
Let's have a scatterplot of counts versus average ratings
plot(hitch_agg[,-1])
Let's have the total counts by decade:
plot(x = hitch_agg[[1]], y = hitch_agg[[2]], main = "total count by decade")
lines(x = hitch_agg[[1]], y = hitch_agg[[2]], type = "l")
And let's have the averate ratings by decade
plot(x = hitch_agg[[1]], y = hitch_agg[[3]], main = "av ratings by decade")
lines(x = hitch_agg[[1]], y = hitch_agg[[3]], type = "l")
Now let's go through a final example using tidyverse and data.table and an R package called "dbplyr" that incorporates dplyr verbes, converts a dplyr pipe to an sql query and executes remotely.
Apart from dbplyr usage, this example will be a simple showcase of tidyverse and data.table tools
library(dbplyr)
library(tidyverse)
library(data.table)
library(RPostgreSQL)
drvv <- dbDriver("PostgreSQL")
drvv
Create a connection to a database using the driver
con <- dbConnect(drvv,
dbname = "imdb2",
host = "localhost",
port = 5432,
user = "postgres")
con
And create a remote source object:
title_basics <- tbl(con, "title_basics")
title_basics
Now let's create an sqlquery using dplyr verbes:
query1 <- title_basics %>% select(c("titletype", "genres", "startyear")) %>%
filter(between(startyear, 1950, 2017) &
!is.na(genres) &
titletype == "movie") %>%
group_by(genres, startyear) %>%
summarise(count = n())
This is lazy query it is not executed and returned yet:
query1
Show the resulting sql query:
query1 %>%show_query
Execute and collect the data:
table1 <- query1 %>% collect
Summarize the data:
str(table1)
table1 %>% glimpse
Get unique values:
table1 %>% select(-count) %>% sapply(unique)
Any missing values?
which(!complete.cases(table1))
Convert to a data.table object:
setDT(table1)
Now what we will do with this data is that we will explore the count of multiple genres in titles:
Let's create a new column that shows how many genres each movie has:
table1[,genrec := stringr::str_extract_all(genres, ",", simplify= T) %>% length %>% "+"(1),
by = 1:nrow(table1)]
See how "genres" and "genrec" columns are related:
table1[,.SD[1], by = genres]
Now let's create contingency tables of genre counts as proportions for each startyear:
genre_counts <- table1[, as.list(table(genrec) %>% prop.table %>% "*"(100) %>% round(2)),
by = startyear]
genre_counts
And let's visualize those proportions as a stacked area chart:
genre_counts %>%
gather("key", "value", -startyear) %>%
ggplot(aes(x = startyear, y = value, fill = key)) +
geom_area()
See that over time movies represent multiple genres as opposed to single genres to a larger extent (increase in the area of "3").
This is probably due to changes in spectators' tastes and their desire to see more complex screenplays.
Enter the password provided and run the following cell:
pass=
cat quiz_2019_01q.ipynb.crypt | \
openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass 2> /dev/null > quiz_2019_01q.ipynb
Now open quiz_2019_01q.ipynb file and follow instructions