BY SERHAT ÇEVİKEL
In today's session, we will continue from where we left off:
For simplicity and due to time/resource constraints, we will import and set primary keys on only a select few tables
We will skip the melting of tables, a task for which we provided an example before
We already have a complete working database as imdb2
First, as we did before, we unzip the gzipped tsv files and trim the header rows: PostgreSQL's COPY command can skip a header row only in CSV format, so for tab-separated (text-format) files we must remove the header ourselves. From its own documentation of the COPY command:
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
One option might be to convert a tsv into a csv by translating tab characters to commas using "sed" or, better, "tr". We will also use this option for selected tables
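The tab-to-comma translation can be tried on a toy line first (a minimal sketch; the sample input is made up for illustration):

```shell
# Translate every tab to a comma, turning a tsv line into naive csv.
# Caveat: this breaks if field values themselves contain commas or tabs.
printf 'tconst\taveragerating\tnumvotes\n' | tr '\t' ','
```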
Since starting the server can take 5-10 minutes in our binder environment, it is better to kick it off now; we can then proceed with the file operations without waiting for the daemon to finish starting:
sudo service postgresql start
If an error is returned, you should rerun the cell.
We can check whether the daemon has started with:
psql -U postgres -c "\l"
This will list all databases on our server
Until this command succeeds, the PostgreSQL server is not yet responding.
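The re-running can also be scripted as a polling loop. A hedged sketch, where the `probe` function stands in for the real psql check so the snippet is self-contained:

```shell
# Poll until the server answers. In the notebook, probe would be:
#   probe() { psql -U postgres -c "\l" > /dev/null 2>&1; }
probe() { true; }   # stand-in so this sketch runs anywhere

until probe; do
    echo "waiting for PostgreSQL..."
    sleep 2
done
echo "server is up"
```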
But until we reach 1.2, we do not need to connect to the PostgreSQL server.
First, declare the environment variables for paths
datadir=~/data
imdbdir=$datadir/imdb
Then create the directory for unzipped files:
mkdir -p $imdbdir/tsv2
And unzip the files if they do not exist yet:
find $imdbdir/tsv -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
if [ ! -e ${imdbdir}/tsv2/\${basenm%.gz} ];
then
gunzip -c {} > \
${imdbdir}/tsv2/\${basenm%.gz};
fi
"
The "basename" command extracts the filename portion from the full path so that we can reproduce the path under another directory
tldr basename
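For example (the path here is illustrative):

```shell
# basename keeps only the final path component
basename /home/jovyan/data/imdb/tsv/title.ratings.tsv.gz
# prints: title.ratings.tsv.gz
```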
The ${basenm%.gz} portion is a form of "parameter substitution". It means:
${var%Pattern}, ${var%%Pattern}
${var%Pattern} Remove from $var the shortest part of $Pattern that matches the back end of $var.
${var%%Pattern} Remove from $var the longest part of $Pattern that matches the back end of $var.
(https://www.tldp.org/LDP/abs/html/parameter-substitution.html)
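A quick illustration of the difference between the two forms (the value of f is made up):

```shell
f="title.ratings.tsv.gz"
echo "${f%.gz}"    # shortest suffix matching ".gz" removed -> title.ratings.tsv
echo "${f%%.*}"    # longest suffix matching ".*" removed   -> title
```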
Now let's check whether the gunzipped files are there:
ls -l $imdbdir/tsv2
Now we will trim the first lines and dump into a new directory named tsv3.
mkdir -p $imdbdir/tsv3
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
if [ ! -e ${imdbdir}/tsv3/\${basenm/.tsv/2.tsv} ];
then
tail -n+2 {} > \
${imdbdir}/tsv3/\${basenm/.tsv/2.tsv};
fi
"
The ${basenm/.tsv/2.tsv} part is like sed's substitution command: find ".tsv" and replace it with "2.tsv"
The "-e" test checks whether the file exists; prefixed with "!", the block proceeds only if the file does not exist yet
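Both pieces of the pipeline can be tried in isolation (the sample data is made up):

```shell
# tail -n+2 prints from the second line onwards, dropping the header
printf 'header\nrow1\nrow2\n' | tail -n+2

# ${var/pattern/replacement} replaces the first match, like sed's s command
basenm="title.ratings.tsv"
echo "${basenm/.tsv/2.tsv}"   # title.ratings2.tsv
```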
Check that each file's row count decreased by one:
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
{
printf '%s\t' \$basenm;
cat $imdbdir/tsv2/\$basenm | wc -l;
cat $imdbdir/tsv3/\${basenm/.tsv/2.tsv} | wc -l;
} | tr '\n' '\t' | xargs echo -e
" | column -t
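The `tr '\n' '\t'` step above is what joins the three outputs (filename and two row counts) into a single row; in isolation:

```shell
# tr turns each newline into a tab, collapsing several lines into one row
printf 'name.basics.tsv\n100\n99\n' | tr '\n' '\t'
```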
Now let's convert the tsv files to csv:
mkdir -p $imdbdir/csv
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
if [ ! -e ${imdbdir}/csv/\${basenm/.tsv/.csv} ];
then
cat {} | tr '\t' ',' > \
${imdbdir}/csv/\${basenm/.tsv/.csv};
fi
"
List the csv directory
ls -l $imdbdir/csv
And view the head of a file with column command, with "," as the separator/delimiter character:
head $imdbdir/csv/name.basics.csv | column -t -s ","
We can also use the csv files to create the database
Remember we probed whether postgresql server is listening by:
psql -U postgres -c "\l"
psql is the built-in interactive terminal client for PostgreSQL. It can be used either interactively through its shell or non-interactively with the "-c" (command) flag, as we did above
tldr psql
psql is good for embedding SQL code in shell scripts. For interactive use, however, it does not offer many features.
A better option is the "pgcli" interface. It is part of the dbcli project, which provides improved command-line interfaces (CLIs) as DB clients:
First we open a new terminal window:
last=$(( $(ps aux | awk -F " " '$7 ~ /pts/ { print $7 }' | grep -Po "\d+" | sort -nu | tail -1) + 1 ))
echo $last
echo "Follow the working one of the following links to open a terminal: (that matches the domain of the URL above)"
echo "https://hub.gke.mybinder.org/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last"
echo "https://hub-binder.mybinder.ovh/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last"
echo "https://notebooks.gesis.org/binder/jupyter/user/serhatcevikel-bdm_2019-$(hostname | grep -Po '(?<=2019-2d).*?$')/terminals/$last"
Now please copy and paste the below command inside that terminal:
screen -S 1
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff "pgcli -U postgres\n"
We may use other interfaces, and here we will prefer to execute our commands through the Jupyter notebook. However, it is useful to keep pgcli open in the terminal, since its smart-completion facility makes it easier to write SQL statements; you can then copy a statement into a Jupyter cell to execute it
Another client is pgAdmin4. It used to have a desktop GUI, but now it runs only in web mode, hosted by your web browser - just as Jupyter is.
Due to limitations of the binder environment, we won't be using pgAdmin4 but you can check it from https://www.pgadmin.org/
There are several ways to access the PostgreSQL server from Jupyter:
The SQL kernel of the beakerx extension is the most visually appealing and the easiest to use as a stand-alone notebook. However, it does not play well inside an SoS notebook, and we prefer the "polyglot" nature of the SoS notebook
postgres_kernel is buggy
So we will use either the sql magic on the Python3 kernel or psql in non-interactive mode on the bash kernel
For the rest of the commands to work, PostgreSQL server should be listening on port 5432. Let's check:
psql -U postgres -c "\l"
If the command does not list the existing databases, we wait until it does so by re-running the cell.
Hopefully, we started the database daemon back in the beginning of the session.
First we will create a tablespace - a directory on the filesystem for the PostgreSQL server to store our new database:
22.6. Tablespaces
Tablespaces in PostgreSQL allow database administrators to define locations in the file system where the files representing database objects can be stored. Once created, a tablespace can be referred to by name when creating database objects.
By using tablespaces, an administrator can control the disk layout of a PostgreSQL installation. This is useful in at least two ways. First, if the partition or volume on which the cluster was initialized runs out of space and cannot be extended, a tablespace can be created on a different partition and used until the system can be reconfigured.
(https://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html)
Create a directory on the filesystem:
sudo mkdir -p /pg1
Check the user and group ownership with the stat command:
stat -c "%U %G" /pg1
tldr stat
And make it owned by postgres user:
sudo chown -R postgres:postgres /pg1
And check again:
stat -c "%U %G" /pg1
Note that, in this binder image, the sudoer (admin) jovyan user can run sudo without a password prompt. This suits the non-interactive bash_kernel on Jupyter (which cannot prompt for a password) and makes scripting easier.
This environment is a short-lived container isolated from the rest of the operating system it sits atop. Passwordless sudo should not be allowed in a production environment
Let's check again whether our PostgreSQL server responds to our requests.
Now it is time to create a tablespace at the /pg1 location and create a database there:
Let's review the list of databases again:
psql -U postgres -c "\l"
First let's enable the sql magic on python kernel, so that we can send SQL commands to the PostgreSQL server easily with minimal additional syntax (using psql from the bash is another option):
%load_ext sql
And create a connection with our server into the maintenance database "postgres":
%sql postgres://postgres@localhost/postgres
Now from inside the Python3 kernel, we can execute sql statements easily if we insert a "%%sql" magic in the first line as such:
Comments start with "--"
Statements end with ";"
First let's create a tablespace "pg1" on our /pg1 directory. (We need the END; statement because CREATE TABLESPACE cannot run inside a transaction block; END; closes the transaction that the sql magic opened, so the tablespace creation executes without error.)
%%sql
-- create a new tablespace
END;
CREATE TABLESPACE pg1 LOCATION '/pg1';
And let's create a database on pg1 tablespace.
Note that, since we already have the "imdb2" database, we will create imdb1 for practice purposes.
%%sql
-- create a new database imdb1 on pg1 tablespace
END;
CREATE DATABASE imdb1
TABLESPACE = pg1;
Now connect to the new imdb1 database
%sql postgres://postgres@localhost/imdb1
Make sure we are connected to the newly created imdb1 database:
%sql postgres://postgres@localhost/imdb1
And create a new table title_ratings with defined fields and types
Note that we cannot use a dot "." inside table or field names
%%sql
-- create a new table
CREATE TABLE title_ratings
(
tconst text,
averageRating numeric,
numVotes integer
);
In order to copy files from the local filesystem into the postgresql server, it is best to use the psql command:
psql -U postgres -d imdb1 -c "\copy title_ratings from $imdbdir/tsv3/title.ratings2.tsv"
Note that this runs the COPY command of SQL under the hood; however, it does not run into file-read privilege problems:
Assuming the psql command-line tool, you may use \copy instead of copy.
\copy opens the file and feeds the contents to the server, whereas copy tells the server to open the file itself and read it, which may be problematic permission-wise, or even impossible if client and server run on different machines with no file sharing in between.
Under the hood, \copy is implemented as COPY FROM stdin and accepts the same options as the server-side COPY.
Another advantage of using the "\copy" syntax through psql in Bash is that we can easily pass environment variables such as $imdbdir
Now we can check that the table exists, using the psql meta-command that describes the tables in a database:
psql -U postgres -d imdb1 -c "\dt+"
And get the size of the database:
%%sql
SELECT pg_size_pretty(pg_database_size('imdb1'));
Create, insert and drop operations may, over time, leave junk space inside the database. In that case, it is better to run the "VACUUM" command:
%%sql
VACUUM FULL;
Now let's view the data:
%%sql
SELECT *
FROM title_ratings
LIMIT 5;
To delete the data inside a table we can use either the "DELETE" or the "TRUNCATE" statement:
%%sql
DELETE FROM title_ratings;
Now check whether data is deleted:
%%sql
SELECT *
FROM title_ratings
LIMIT 5;
Now let's copy the data again, this time using the sql command COPY instead of psql built-in \copy:
%%sql
COPY title_ratings from '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';
%%sql
SELECT *
FROM title_ratings
LIMIT 5;
And delete data with TRUNCATE:
%%sql
TRUNCATE title_ratings;
%%sql
SELECT *
FROM title_ratings
LIMIT 5;
Check the size and vacuum:
%%sql
SELECT pg_size_pretty(pg_database_size('imdb1'));
%%sql
VACUUM FULL;
%%sql
SELECT pg_size_pretty(pg_database_size('imdb1'));
Now copy the data again using "\copy" meta-command of psql:
psql -U postgres -d imdb1 -c "\copy title_ratings from $imdbdir/tsv3/title.ratings2.tsv"
Let's report the total row count. Note that "count()" is an example of the aggregate functions used to summarize data, and that we define an alias "countrows" for the aggregate calculation after SELECT:
%%sql
SELECT count(*) as countrows
FROM title_ratings;
More info on aggregate functions at:
https://www.postgresql.org/docs/current/static/functions-aggregate.html
And alias topic is covered in:
https://www.postgresql.org/docs/10/static/queries-table-expressions.html
Now let's make a deliberate mistake and run the COPY query again (either with psql's \copy or the SQL COPY):
%%sql
COPY title_ratings
FROM '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';
%%sql
SELECT count(*) as countrows
FROM title_ratings;
We have twice the number of rows now, because PostgreSQL did not complain about duplicating the same data!
We should define a primary key on the table and put a uniqueness constraint on that key. No problem: we can ALTER anything in the table after it is first created:
%%sql
ALTER TABLE title_ratings
ADD CONSTRAINT title_ratings_pk
PRIMARY KEY (tconst);
It throws an integrity error: the data are duplicated, and a primary key cannot contain duplicate values!
Now let's drop the table altogether (not just delete its contents):
%%sql
DROP TABLE title_ratings;
Create the table again with the constraint:
%%sql
-- create a new table
CREATE TABLE title_ratings
(
tconst text,
averageRating numeric,
numVotes integer,
CONSTRAINT title_ratings_pk
PRIMARY KEY (tconst)
);
And import data:
%%sql
COPY title_ratings
FROM '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';
Get the row count:
%%sql
SELECT count(*) as countrows
FROM title_ratings;
And try to import the data again:
%%sql
COPY title_ratings
FROM '/home/jovyan/data/imdb/tsv3/title.ratings2.tsv';
It did not allow us: the uniqueness constraint prevented us from creating duplicate values in the primary-key field
EXERCISE 1:
Import the title.episode2.tsv into title_episode table in imdb1 database
Remember the structure of title.episode:
Note that "--" starts a comment line that will not be executed
And always finish your statements with a semicolon ";"
%%sql
-- type your statement here
psql -U postgres -d imdb1 <<EOF
-- or you may write your statement here
EOF
SOLUTION 1:
pass1=
encrypt="U2FsdGVkX19fL3ZRoTCbtF1KZ9onuO616t3++mXfoHW6bGrMWcoRBDrUBk/YM5VJ j88eS5+6fA4oYm0V96zw4/9qzgBW4Gu594+lLxW/TQ9xqzODOyRi4j4pIAXIffP0 yrMiwyscX8n20jTEINctgC8/eM6rxM+C+zforrKwZWt0lcxbOarz8OqWZDn8MhRk k0piDCSrj3m4S/tLMl9VJNPwCxLe+RAEmjbCHbMlV+kEcsi6WyfuZhj34UE8H1V2 9/4w8ILi5Z9uJkgaAmEl1RUUHzGKCiNxelC3l2H3xfcjk+QGRZAzodtHEBOLVKoz e+trtOLlLeaaSt4dlz/c02uYecdHchRwqhdonZQVoaYfDZMnJoLMvDY8O//9KAfJ"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
psql and pgcli offer meta-commands that start with "\" and provide shortcuts to common operations
You can get help on them using "\?" - a meta-command itself
psql -U postgres -c "\?"
"\l" lists databases:
psql -U postgres -c "\l"
"\c" connects to database:
psql -U postgres -c "\c imdb2"
"\dt" lists tables in a database: "\dt+" gives more detail
psql -U postgres -d imdb2 -c "\dt+"
"\d" lists fields in a table
"\d+" gives more detail:
psql -U postgres -d imdb2 -c "\d+ title_basics"
To get detailed information on all tables and primary keys in the public schema (a schema is like a named directory of tables, keys and other objects inside a database):
psql -U postgres -d imdb2 -c "\d+ public.*"
See that here we have some additional tables to enhance our queries; they do not exist in the original imdb database:
crew_molten splits the comma-separated directors or writers into separate rows and adds a field for the director/writer profession:
%load_ext sql
%sql postgres://postgres@localhost/imdb2
%%sql
SELECT * FROM crew_molten
LIMIT 5;
genres_molten has tconst and the genres field split on commas:
%%sql
SELECT * FROM genres_molten
LIMIT 5;
title_principals_melt has tconst and the principalcast field split on commas:
%%sql
SELECT * FROM title_principals_melt
LIMIT 5;
Now we will use the full database "imdb2" in our PostgreSQL server:
List the fields and data info:
%load_ext sql
%sql postgres://postgres@localhost/imdb2
In our basic "SELECT" queries we will start with a few clauses:
%%sql
SELECT tconst, originaltitle, startyear, runtimeminutes, genres
FROM title_basics
WHERE originaltitle ~ 'Godfather.*Part'
ORDER BY runtimeminutes;
In fact we are not concerned with documentaries, shorts, or comedy movies.
So we should include only dramas and exclude comedies, and let's sort by runtimeminutes in descending order:
%%sql
SELECT tconst, originaltitle, startyear, runtimeminutes, genres
FROM title_basics
WHERE originaltitle ~ 'Godfather.*Part'
AND genres ~ 'drama'
ORDER BY runtimeminutes;
The query did not return any rows. Why?
Because regex pattern matching is case sensitive unless otherwise specified, and "drama" does not match "Drama".
To make the search case insensitive, we can add (?i) at the beginning of the pattern, or use the ~* operator, which makes the match case insensitive:
%%sql
SELECT tconst, originaltitle, startyear, runtimeminutes, genres
FROM title_basics
WHERE originaltitle ~ 'Godfather.*Part'
AND genres ~ '(?i)drama'
ORDER BY runtimeminutes DESC;
EXERCISE 2:
Note: You can try below cells or pgcli from the terminal to test your commands
%%sql
-- statement here
psql -U postgres -d imdb1 <<EOF
-- or you may write your statement here
EOF
SOLUTION 2:
pass1=
encrypt="U2FsdGVkX19laVXVoXmMBKtRHvrfSnfkNJhynaSpFpRrOpeoRykDziSFF1AiVIWA 0N1ai35oqn6zxx1/lh/qWM6/aMQXEGk/Njoj7TPIg63cInhl/ly/8Ho0Gl3Wop2m CVFRBWtZuzRP7RhyRxNFR8EjA2pSje6RkVrwQrrRTRK2s8cmEZPgt+8wgrLCo33A UgCHxIQ4T3el7GtPsmb7LKzjItZvROAdcXbIJX1Ekj6tJT9YWbIknKTtETtaHzOk OgTURnjlpIH5bYsB1Q6E5l/B/g6FjqZ3Y5lUDs5qxhY="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Note that, unless we explicitly tell the SQL server to exclude NA (NULL) values, it returns them as well
Now, back to the Godfather trilogy...
The results still include a comedy. Now we enhance the WHERE clause so that comedy is also excluded.
Note that we can either make the regex itself case insensitive with (?i) or use the case-insensitive regex operator (~*):
%%sql
SELECT tconst, originaltitle, startyear, runtimeminutes, genres
FROM title_basics
WHERE originaltitle ~ '.*Godfather.*Part.*'
AND genres ~ '(?i)drama'
AND NOT genres ~* 'comedy'
ORDER BY runtimeminutes;
That's OK, but we only want titles from the original trilogy, the last of which was screened in 1990
%%sql
SELECT tconst, originaltitle, startyear, runtimeminutes, genres
FROM title_basics
WHERE originaltitle ~ '.*Godfather.*Part.*'
AND genres ~ '(?i)drama'
AND NOT genres ~ '(?i)comedy'
AND startyear <= 1990
ORDER BY runtimeminutes;
There, we have what we want for further queries
EXERCISE 3:
%%sql
-- statement here
psql -U postgres -d imdb1 <<EOF
-- or you may write your statement here
EOF
SOLUTION 3:
pass1=
encrypt="U2FsdGVkX19EW9F4b4zlMMSvkx1iNjeL5AghD78cx7qqV1OJWShAL5DzI9kivtGh GTxPd+hVCPRscMqSPBWr34XzfQXvktM8Z41FIEKTu/LJ7/2Clrxubh6Zl6fb5j79 9KPMRdRcpRNghOkmvqQVSL5vLPlQPDipQRG6Zp2HEH0lavcTmobC5b9uksKCTlFO srC8sGsq9ssvqGeYgn9Lqdw+O5sWvlEI5VeKtRIZMxCPSlZnKTWuDlCeWFzbw6cC GOq4tx2aLhrDdcNea6Y7R2j+y98k6E5zxW22YMNr6QFkmeHmjbvjQi5q5vYiZcka oIIn5pcoE7pKe4X9ppDqFVsb5EgqiAa+so2dIenZC4uPVlZKzT2YU1xpg53laFp0 EiQVWd/NXRMl6fpBWVahBnl6pE7wSISke8ZovjPcwAE="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
We can represent the runtimeminutes filter with a BETWEEN condition:
pass1=
encrypt="U2FsdGVkX19j4SGpgZW4DfLWs5VI0x/jrCAtXCRUsr/jp8yxZsrbRXmbN22UpfB1 kvsy3i51vT10+p4ItE9XGOXlVEVhgsvW+Zanf4+ZKyTvNkUOBdSydvElBTJM1VIZ HnB8JISJeG/kipPG2LnGuMLdhCxqL/yHtuQDK5uPF74x+Mpm9BQsg5E0MVHWulB1 6CPkEnugOjDm1pPt30v86LSXG8LvFq7ArlQd57BTBJE6cTo7PR3vNbV6z/9MLLaK kCIADJaGtGT+D8fhhUNolgG7Zv/t5uaurtd3xvGt96x7MmRuoHtfGb4KzfoVSZDC ASGVAbX2a1hiObKGrHB4CSFhg3zu748kAivxAjqMqdXb/m9WtlZAUvqiALPTSm13 pf4NqjJiGeidG2GAxSyhXHOJnqCmhH5wWKJwDavw3kc="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Note that regex or regexp (regular expressions) is a domain-specific querying language
And a very powerful tool for mining, querying and manipulating textual data
And well worth the effort to learn if you deal with textual data
This tutorial is a good starting point:
And this site is for testing regex patterns on some text:
And this competition website is good for progressing in regex while having fun at the same time:
Let's say we want to get the unique valued rows from the result of a query, as we did in the first session.
Get unique genres from genres_molten:
%%sql
SELECT DISTINCT genres
FROM genres_molten;
Let's get the average runtime and count of movies including Godfather by decades:
%%sql
SELECT (startyear / 10) * 10 as decade, avg(runtimeminutes) as runavg, count(*)
FROM title_basics
WHERE originaltitle ~ 'Godfather'
GROUP BY decade;
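The decade bucketing works because integer division truncates the remainder; the same arithmetic can be checked in the shell:

```shell
# integer division drops the remainder, so (year / 10) * 10 floors to the decade
echo $(( (1972 / 10) * 10 ))   # 1970
echo $(( (1989 / 10) * 10 ))   # 1980
```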
Now let's filter records on a condition involving an aggregated field: for example, keep only the decades with an average runtimeminutes greater than 80.
Note that we can define an alias for a calculated field or for a table with a long name:
%%sql
SELECT (startyear / 10) * 10 as decade, avg(runtimeminutes) as runavg, count(*)
FROM title_basics
WHERE originaltitle ~ 'Godfather'
GROUP BY decade
HAVING avg(runtimeminutes) > 80;
EXERCISE 4:
%%sql
-- statement here
psql -U postgres -d imdb1 <<EOF
-- or you may write your statement here
EOF
SOLUTION 4:
pass1=
encrypt="U2FsdGVkX1/4Df2hQ29wd6U/qzTn4z2MYoMx8tNYA2nR2w2vVDspxUYSNdg0YNNa oWXZ3p0SSjLU87USaZTdAryMwF51trIJyFebz7jjIgbCupCqzUnBxYUTGvFT32Ic 7qcaiyRWNDZ4UHrgcosBvZbAoDn7po8swpZFlfE3Zp7p3pc0LMCjurxsKCSW1EfA cD88pVWBXcjQLrFiIvldXHB0gaiKJeFFq6o4n2uqIlYjT21A6tDuao3aVtRPMQ84 hlyqsAvsxWE1RRLeSmjgr4nsDn2HtM+/71GR5widmUJgh2SmTlFbFpore1I0lezU WKhENNO+4VOOjBJQrdEkC47NXkz2AlNQPV7aHcpiumdzjvvGtwdM90UogFIndHKl pt6VMU5mAh9iW8HrYXTbYvJEJ5VS8dZ8AZdSi/j3ipZqag+qkuIKiFmRZdhJE4iP"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now we would like to join the titles from the original Godfather trilogy, the last of which was screened in 1990, with the id's of the principal cast
%%sql
SELECT tb.tconst, tb.originaltitle, tb.startyear,
tb.runtimeminutes, tb.genres, tp.principalcast
FROM title_basics tb
LEFT JOIN title_principals_melt tp ON tb.tconst=tp.tconst
WHERE tb.originaltitle ~ 'Godfather.*Part'
AND tb.genres ~ '(?i)drama'
AND NOT tb.genres ~ '(?i)comedy'
AND tb.startyear <= 1990
ORDER BY tb.runtimeminutes;
We could also get the same result by running a subquery
A subquery is a query with an alias (a name) that is executed inside another query
Using the alias, the subquery's results can be treated as a separate table
%%sql
SELECT subq1.*, tp.principalcast
FROM
(
SELECT tconst, originaltitle, startyear, runtimeminutes, genres
FROM title_basics
WHERE originaltitle ~ 'Godfather.*Part'
AND genres ~ '(?i)drama'
AND NOT genres ~ '(?i)comedy'
AND startyear <= 1990
ORDER BY runtimeminutes
) subq1
LEFT JOIN title_principals_melt tp ON subq1.tconst=tp.tconst;
You can find more information on subqueries:
Now let's do something more complicated:
%%sql
SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes, tb.genres,
tp.principalcast,
nb.primaryname, nb.birthyear, nb.deathyear, nb.primaryprofession
FROM title_basics tb
LEFT JOIN title_principals_melt tp ON tb.tconst=tp.tconst
LEFT JOIN name_basics nb ON tp.principalcast=nb.nconst
WHERE tb.originaltitle ~ 'Godfather.*Part'
AND tb.genres ~ '(?i)drama'
AND NOT tb.genres ~ '(?i)comedy'
AND tb.startyear <= 1990
AND nb.primaryprofession ~'actor|actress'
ORDER BY nb.primaryname, tb.startyear DESC;
When the column names are the same in a join, we can use USING as a shorthand:
%%sql
SELECT tb.tconst, tb.originaltitle, tb.startyear, tb.runtimeminutes, tb.genres,
tp.principalcast,
nb.primaryname, nb.birthyear, nb.deathyear, nb.primaryprofession
FROM title_basics tb
LEFT JOIN title_principals_melt tp USING (tconst)
LEFT JOIN name_basics nb ON tp.principalcast=nb.nconst
WHERE tb.originaltitle ~ 'Godfather.*Part'
AND tb.genres ~ '(?i)drama'
AND NOT tb.genres ~ '(?i)comedy'
AND tb.startyear <= 1990
AND nb.primaryprofession ~'actor|actress'
ORDER BY nb.primaryname, tb.startyear DESC;
EXERCISE 5:
List the names of the living principal cast in movies directed by Stanley Kubrick (nm0000040) in the 1970s and 1980s. Report these columns:
Use the following tables:
Follow this plan:
Left join title_crew to title_basics using common tconst field to combine directors with movie details
Left join to title_principals using the common tconst field to combine director/movie details with the principal cast
Left join to name_basics using the relation between principalcast and nconst fields to combine with the names of the principal cast
Filter for the director code nm0000040 in title_crew
Filter for a startyear in the 1970s or 1980s in title_basics
Filter for deathyear in name_basics being NULL
Order by first startyear and then primaryname
Hint:
%%sql
-- statement here
psql -U postgres -d imdb1 <<EOF
-- or you may write your statement here
EOF
SOLUTION 5:
pass1=
encrypt="U2FsdGVkX19HQOMAJZjspC7Lo2IESEwcPIhqrmWzsghnzubsh60z3xkXKld7pEvy jx2MuhEAETgwH32Ukr4XDtAZdDU07wpXT1BrTMNzdO917ZlsPbhReM76GIfN5Adg lotWy7O9hdmTRGxA5+izGCLqgB0VwbH0PyMBxER5SdAvhWNpziDTOJnPqH1lZOi/ eDuqBMYzNELuJJiKDjv73FrcOTYydu6LYKdyglg2bCa6rNJHGSBAmnBJnUTRcgNN Ms7P9JrzIBxUvJ5HlSknLFvlz53OhE/JrtCQV59FMtVaZ3jfEOdTcd35RYLBWFbw Fd4Cl4O+rkrD7Abs15sk0wwu4dDz/jTzflnOV6E88bB5ODavWH+UdnoOzmyqLkS1 8Rwxd9E502uaAnfcykE2inLKncsJPFvVpHvT/OlAGK45TZ2QJhpw2zv4ecgL+bOb nugWFHOixq4Ly28mnuqCMqy200Nh4ZfiWi+mxiKv4ZumuZJb1zzVQWPGEP1IB5v9 B6XFOK+cnznjbmAnz5FEY4zo/9DWqO2iqK//xrWswOXFZSpZYCyQizwybPs9gI6S ianRCR9Z66/6Zk8R0kgzYw=="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Unfortunately, R. Lee Ermey, who starred in one of the definitive scenes in movie history (the opening scene of Full Metal Jacket) and was still alive in 2017 (when this version of the database was last updated), passed away in 2018.
Now let's say we want to add a new record to the title_ratings table
First let's switch to the imdb1 database to leave imdb2 intact
%sql postgres://postgres@localhost/imdb1
%%sql
SELECT *
FROM title_ratings
LIMIT 5;
The structure of a record is as follows:
tt0000005 6.2 1565
Suppose we want to add a record for a title as such:
tt0000000 8.2 2000
We can insert the record as such:
%%sql
INSERT INTO title_ratings
(tconst, averagerating, numvotes)
VALUES ('tt0000000', 8.2, 2000);
Note that we may omit the "(tconst, averagerating, numvotes)" column list, since we are supplying values for all columns in table order
Running the same code again returns an error:
IntegrityError: duplicate key value violates unique constraint "title_ratings_pk"
DETAIL: Key (tconst)=(tt0000000) already exists.
Now let's check whether the record was added:
%%sql
SELECT * FROM title_ratings WHERE tconst = 'tt0000000';
For more info on INSERT command:
https://www.postgresql.org/docs/current/static/sql-insert.html
Now suppose we want to update the record so that averagerating becomes 8.7 and numvotes becomes 3000:
%%sql
UPDATE title_ratings
SET
averagerating = 8.7,
numvotes = 3000
WHERE tconst = 'tt0000000';
And check whether the record is updated:
%%sql
SELECT *
FROM title_ratings
WHERE tconst = 'tt0000000';
For more info on the UPDATE command:
https://www.postgresql.org/docs/current/static/sql-update.html
Now let's delete the record we just added:
%%sql
DELETE FROM title_ratings
WHERE tconst = 'tt0000000';
%%sql
SELECT *
FROM title_ratings
WHERE tconst = 'tt0000000';
To get more info on the delete command:
https://www.postgresql.org/docs/current/static/sql-delete.html
EXERCISE 6:
Into imdb1 database:
psql -U postgres -d imdb1 <<EOF
-- statements here. end each of them with a semicolon
EOF
SOLUTION 6:
pass1=
encrypt="U2FsdGVkX1/7SgGX1CAiSzJ57xtYsg/bcu4afSsoE7A1y2QiXSbeoPNQ3Odswc9q sn/RNaZozo6geal6To9P04xwqgGwGykq8d+D/QPtR3iog7ie+MV57TP0qaVjBcXe BpU6V/F08Bqjq72vOa5nFjco4m1tOr/WPBbkXN6IF3+Qa9v1C8e/jR46aFG5fhoZ JcXmhq+8776FXLScUosrrLBr7twr6ZY+9RSrUu29UVeiPlcA+AkbVWlDzNP38cmN 0IA6fU8jjOo1rtDhJ9ykmZpdqhJgWrdU1STR/9cdtJcIPA+S4L0/dBfKF+jX/YcA xyTmjCkihJ2Ypj0ZxhN/0HNnfP/xsH97DFOFha8P2gfj73IHioYSExxcD9c0FCuM MmWISXyoY7BkTwL/DAUpiFA0p4zIBXtTivdd2Wa7ngehGJRmM+erMa91QM6GDRC+ s+LooUBHtZo3SwH5TGzuGgS2p6lEkKHOlKPkswi4NJGQh2S1VXgAmhJTx7U/puQN cl7Njp/NRuVaylSu43LYlpbj6Xe9LW6BdR58XrQ/dlQzdvIBYC/CZ7f0vWiFd1c/ dRNQT5eM5Ek83gDCDDvoCRmnSWgQU2SCK0H/jum2ImY0b2GN6uG3J+aI1o5gRDRb /V6jl0DHhulPsi+W/1bPaoNqznrnnw2jL5oIyVCfPcKrgENj4XOYya6sOpUO+ly4 MzXsy/RC5LX9qbo0B7PkNA=="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now use your imagination: using any of the tables in imdb2 (you can get the details with \dt+ and \d+ public.* from psql/pgcli), create any interesting query of your own with (single or multiple) joins, (single or multiple) WHERE conditions, GROUP BY, HAVING, ORDER BY, DISTINCT or LIMIT clauses. You can also use subqueries.
But you should first explain in clear English, what your query is meant to do. What is the purpose of your query?
Save your query for future reference. You may need it for a compulsory homework!
Feel free ...