by Serhat Çevikel
JSON is a hierarchical data format for storing and querying data that does not fit naturally into rows and columns.
Let's say we are tracking our contact data in a csv file:
Lastname, Firstname, Phone Number
Membrey, Peter, +852 1234 5678
Thielen, Wouter, +81 1234 5678
Now suppose Peter gets a second phone number. In a flat file the only way to record it is to add a new column, which stays empty for every contact with a single number:
Lastname, Firstname, Phone Number1, Phone Number2
Membrey, Peter, +852 1234 5678, +44 1234 565 555
Thielen, Wouter, +81 1234 5678
But suppose we have millions of records with tens of fields, and in exceptional cases a record may carry many values for a single field: 10 telephone numbers, for instance.
The JSON format is a remedy for these kinds of flexibility issues and for hierarchical data in general.
Integrity rules are softer when handling JSON data.
record1='{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}'
echo $record1
Echoed unquoted like this, the whole JSON collapses onto a single line, which is not easy to read or parse.
We may use online json parsers for this purpose.
You can copy and paste the above string into the input pane:
echo $record1 | jq .
jq is a parser and querying tool for JSON that produces nicely formatted output.
You can find more info on jq at the following links:
{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}
"firstname" is the key, "Peter" is the value, and so on.
We can also have arrays of values for a single key, delimited by square brackets []:
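Beyond pretty printing, jq can extract individual values with simple filters; a quick sketch, redefining the record for self-containment (assuming jq is installed):

```shell
record1='{"firstname": "Peter", "lastname": "Membrey",
  "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]}'

echo "$record1" | jq -r '.firstname'            # key lookup -> Peter
echo "$record1" | jq -r '.phone_numbers[0]'     # array indexing -> +852 1234 5678
echo "$record1" | jq '.phone_numbers | length'  # array length -> 2
```

The -r flag prints raw strings (without the surrounding double quotes).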
record1b='{
"firstname": "Peter",
"lastname": "Membrey",
"numbers": [
{
"phone": "+852 1234 5678"
},
{
"fax": "+44 1234 565 555"
}
]
}'
echo -e $record1b "\n"
echo -e "$record1b\n"
echo $record1b | jq .
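The difference between the two echo variants above comes from word splitting: left unquoted, the expansion collapses each run of whitespace (including newlines) into a single space, while the quoted form preserves the layout. A minimal demonstration:

```shell
v=$'line1\nline2'   # a two-line string

echo $v     # unquoted: word splitting joins the lines -> line1 line2
echo "$v"   # quoted: the newline survives
```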
Multiple documents can be "collected" inside "collections":
A "collection" in NoSQL terminology is analogous to a "table" in the RDBMS jargon. A collection is a collection of similar items (or documents with similar key-value pairs)
UN COMTRADE is one of the largest and most comprehensive databases on international trade:
Let's first save the path:
datadir=~/data
comtrade=$datadir/comtrade_s1
ls $comtrade
jq . $comtrade/reporterAreas.json
Another way to pretty print and navigate through json files is R:
reporter <- jsonlite::fromJSON("~/data/comtrade_s1/reporterAreas.json")
reporter
str(reporter)
See that, for not-so-deeply-nested structures, the data is automatically flattened into a data frame.
listviewer provides a collapsible, interactive gadget for viewing JSON and similar hierarchical data (you may need to execute it several times for the JS gadget to render):
listviewer::jsonedit(reporter, mode = "form")
Now let's traverse through this document to list country texts:
jq '.results[].text' $comtrade/reporterAreas.json | tr -d '"'
And let's list the country codes:
jq -r '.results[].id' $comtrade/reporterAreas.json
reporters=$(jq -r '.results[] | "\(.text)\t\(.id)"' $comtrade/reporterAreas.json | \
xargs -0 -i echo -e "{}")
echo "$reporters" | column -s $'\t' -t
We can filter with grep or awk:
echo "$reporters" | awk -F "\t" '$1=="Turkey"{print $2}'
Another way to flatten fields is:
reporters2=$(jq -r '.results[] | .text + "\t" + .id' $comtrade/reporterAreas.json | xargs -0 -i echo -e "{}")
echo "$reporters2" | column -s $'\t' -t
echo "$reporters" | awk -F "\t" '$1=="Turkey"{print $2}'
Or we can use the "select" statement for filtering values:
jq -r '.results[] | select(.text == "Turkey") | .id' $comtrade/reporterAreas.json
Now let's go through the classification file:
classification <- jsonlite::fromJSON("~/data/comtrade_s1/classificationS1.json")
listviewer::jsonedit(classification, mode = "form")
Now we will filter for those entries, in which text includes "textile" and code is only 3 digits:
jq -r '.results[] | select((.id|test("^\\d+$")) and (.text|test("(?i)textile"))) |
select((.id|tonumber < 1000) and (.id|tonumber > 99)) | .text' \
$comtrade/classificationS1.json | \
sed 's/ - /\t/g' | column -s $'\t' -t
See how it works: test("^\\d+$") keeps only the purely numeric ids, test("(?i)textile") matches "textile" case-insensitively, the tonumber range keeps only the 3-digit codes (100-999), and the sed/column pipeline splits on the " - " separator and aligns the columns.
EXERCISE 1:
Now your turn: Find the id and text of codes parent of which are 651 (Textile yarn and thread).
The result will be:
6511 Thrown silk & silk yarn and thread
6512 Yarn of wool and animal hair
6513 Cotton yarn & thread, grey, not mercerized
6514 Cotton yarn & thread, bleached, dyed, mercerd.
6515 Yarn and thread of flax, ramie and true hemp
6516 Yarn and thread of synthetic fibres
6517 Yarn and thread of regenerated fibres
6518 Yarn of glass fibre
6519 Yarn of textile fibres,nes incl.paper yarn
SOLUTION 1:
pass1=
encrypt="U2FsdGVkX1+WtEYqehLba8zmJ9evvmr1ohvj0BWaggG2cQUjY/HMZia25jIkV9zR jyLxrSmHJRCX55wFh7JIB9U1sjWJ1cvipesjhECBk3B3GaP97Kjsq9cDu0jnjYu7 cM9b9CKPMJN1Am/sywL5RQT+z0nDKopIuq78l3p+eeqEHQNvloe35hu/y19+lFmt 6f4EG2i1cQ2JJq8bBMyl9g=="
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now let's go through the actual data files that include trade volumes:
ls $comtrade/2010e | { head -5; tail -5; }
ls $comtrade/2010e | grep -P "2010_792" | head -5
In each filename, the code after the first underscore is the reporter's country code, and the code after the second underscore is the partner's.
So these are the files in which Turkey (code 792) is either the reporter or the partner.
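That naming convention (year_reporter_partner_classification) can be unpacked directly in the shell; a sketch, assuming the filenames always carry four underscore-separated parts:

```shell
fname="2010_792_100_s1.json"

# strip the extension, then split the remainder on underscores
IFS=_ read -r year reporter partner class <<< "${fname%.json}"

echo "year=$year reporter=$reporter partner=$partner"   # year=2010 reporter=792 partner=100
```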
For how many files Turkey is reporter or a partner?
ls $comtrade/2010e | grep -P "2010_792" | wc -l
ls $comtrade/2010e | grep -P "792_s1" | wc -l
Turkey reported her trade with 227 partners, while 165 reporter countries reported their trade partnering with Turkey (for s1 classification only)
tradedata <- jsonlite::fromJSON("~/data/comtrade_s1/2010e/2010_792_100_s1.json")
listviewer::jsonedit(tradedata, mode = "form")
Descriptions for several variables are:
Now, from all files in which Turkey is the reporter, we will report the TradeValue of exports (rgCode is 2) under commodity code 651 (Textile yarn and thread).
For a single file we have:
jq -r '.dataset[] | select(.cmdCode == "651" and .rgCode == 2) |
"\(.ptTitle)\t\(.TradeValue)\t\(.TradeQuantity)"' $comtrade/2010e/2010_792_100_s1.json | \
xargs -0 -i echo -e "{}"
textileexports=$(for file in $comtrade/2010e/2010_792*.json;
do
jq -r '.dataset[] | select(.cmdCode == "651" and .rgCode == 2) |
"\(.ptTitle)\t\(.TradeValue)\t\(.TradeQuantity)"' $file | \
xargs -0 -i echo -e "{}";
done)
echo "$textileexports" | column -s $'\t' -t
Let's exclude World total:
echo "$textileexports" | awk -F "\t" '$1 != "World"' | \
column -s $'\t' -t
echo "$textileexports" | \
awk -F "\t" '$1 != "World"' | \
sort -t $'\t' -nr -k 2 | \
numfmt -d $'\t' --field=2-3 --to-unit=M | \
column -s $'\t' -t
What we do here: exclude the World aggregate, sort numerically in descending order on the second field (TradeValue), scale fields 2-3 to millions with numfmt, and align the columns with column.
So in 2010, Turkey's significant partners for Textile yarn and thread export are:
EXERCISE 2:
So you should fill in this template below:
for file in $comtrade/2010e/*792_s1.json;
do
jq -r '#fill in this part' $file | \
xargs -0 -i echo -e "{}";
done | \
awk -F "\t" '$1 != "EU-28"' | \
sort -t $'\t' -nr -k 3 | \
numfmt -d $'\t' --field=3 --to-unit=M | \
column -s $'\t' -t
The output will be:
Germany Passenger motor cars, other than buses 2653
Algeria Gas, natural 2591
USA Special transactions 2010
Russian Federation Crude petroleum 1607
Germany Other parts for motor vehicles 1231
China Statistical machines cards or tapes 1027
France Other parts for motor vehicles 1003
Germany Internal combustion engines, not for aircraft 985
Russian Federation Coal /anthracite, bituminous/ 972
Germany Special transactions 948
Spain Passenger motor cars, other than buses 923
Russian Federation Aluminium and aluminium alloys, unwrought 865
USA Raw cotton, other than linters 862
Russian Federation Refined copper including remelted 861
Saudi Arabia Products of polymerization and copolymerization 849
Italy Other parts for motor vehicles 806
France Aircraft, heavier than air 799
Belgium Products of polymerization and copolymerization 644
Qatar Gas, natural 631
Italy Machinery and mechanical appliances, nes 562
You can view a sample data:
tradedata <- jsonlite::fromJSON("~/data/comtrade_s1/2010e/2010_100_792_s1.json")
listviewer::jsonedit(tradedata, mode = "form")
SOLUTION 2:
pass1=
encrypt="U2FsdGVkX1+4esKDuc9iGGz0tWhqscfUMmjPGTa2MIp+YPGWOvdJZY0ZLMx3XSI5 42ueQ3AERCACRgSCZOnUhnDJG4wgnAgcmv+RXW0fR82yLC6g5M2+t3XuODqTIZag Y+g6DkbnapcedfV2K5hg0ELct8YzomN7648I/vQxCQEYEPlwxv/Pdw7IBEeIgFTH zJ25jJlAKJerLYTl9G+XXr01Pwlh5VS36wh16/opSTYjeh++FgvCCH2WKBiFfLjK oxzjfLBcpcqWLVOQ8ozqCean7tnTPIHR0QzjJeiqJ91jBOLbxPpnyT93Yu5Ikxne MbZ/7zAVqLmQVJCpLj6igmHtYxMv3Ck1HC5RKLX2Qrb8sPygcafwyGRp+r0jEsOk nl3MxKOVMuWIwvA+gmJwQhUJXmf7Zm1YX9tcn3PJe6IWPom2P9esB+GXKMnFkpL9 IsDEP/nvQ5TXe+gDGpKWzlEirPp6CwAB6YNiH/phRMUmrbg2DtsRSvZA9bMslFbf eXrItWZ34YuwnqDSlAaquA=="
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
So according to numbers reported by other countries, in 2010 Turkey's largest imports were:
Note that SITC 1 is a very old classification so some modern commodities may not fit well into this schema
Now we will import the files into MongoDB
First we create a directory for MongoDB database:
mkdir -p ~/mongo
And run the mongo server:
mongod -dbpath ~/mongo &
And import the files into a new database called comtrade and collection called 2010s1:
DO NOT REPEAT THIS STEP MULTIPLE TIMES, YOU WILL HAVE DUPLICATE DOCUMENTS! (If you do need to re-import, mongoimport's --drop option drops the collection before loading.)
for file in $comtrade/2010e/*.json;
do mongoimport --db comtrade --collection 2010s1 $file; done
Now, from R, load the mongolite package:
library(mongolite)
We create a connection object:
con <- mongo(collection = "2010s1", db = "comtrade")
Select the document(s) where rtTitle is Bulgaria and return the commodity description, commodity code and TradeValue fields:
bulgaria_vals <- con$find('{"dataset.rtTitle" : "Bulgaria"}',
fields = '{"dataset.cmdDescE": 1, "dataset.cmdCode": 1, "dataset.TradeValue":1 }')
The find() method takes a JSON query document as its first argument; in the fields projection, the fields to be returned get the value 1.
Let's view the structure of the return value:
str(bulgaria_vals)
And view only the dataset:
bulgaria_vals$dataset
Now we will get the values for countries with which Turkey has at least USD 50 mio export of 651 coded "Textile yarn and thread":
export_651 <- con$find('{
"dataset":
{ "$elemMatch":
{ "$and": [
{ "rtTitle" : "Turkey" },
{ "cmdCode" : "651" },
{ "TradeValue" : { "$gt": 50000000 } },
{ "ptTitle" : { "$ne" : "World" } },
{ "rgCode" : 2 }
] } } }',
fields = '{"dataset.ptTitle": 1,
"dataset.rgCode": 1,
"dataset.cmdDescE": 1,
"dataset.cmdCode": 1,
"dataset.TradeValue":1 }')
str(export_651)
export_651_2 <- do.call(rbind, export_651$dataset)
with(export_651_2, export_651_2[rgCode == "2" & cmdCode == "651",])
EXERCISE 3:
Select those countries for which Turkey has at least USD 5 bio exports
SOLUTION 3:
pass <- readline(prompt = "Please enter the password for the solution: ")
encrypt <- "U2FsdGVkX1/cPhgJB5HjYQ4xWOnuyfs5vKrofNpNlV9UHETg+d+KYoLJw7Py0t9p ap3m5rxDCl7JtmwH6VZKMf5sfooPDhhAnI5YXHJJmAc78U5ZYT/aVL75466bi6Te ic4Yd4uM1jcyblnj9LAp2n76eyGFFaEmAY2hmwn+uaDbk91xTNTYnE0gNqmjMJHs t60XDU19QfOTF8YUyG1cbYQwGqMp4k9u1o45Co5f4BRH8aWrgdCp1NBCd+X2itsl ujQYQ45+dnj7D01iezb9gPxvWiEMT+rIP7rWEQh8+wg/mIRR0zctqwk0xrIXLSNo /q1RCrw8DTaoqX5De3EZXxFzWNsw81+/9eehspVmosb9KnoNEKfZzgj99dm/ICDq 8uf9bqxZ373tN6oC23znD1vndhDPVl9KbwoNHjLa9Cb+Do5qoQqQ+hxEnDQm9I96 jai0dRPTnf5RXeb4Zs4Us3+8wahR3gm797WU/qJVIymbZH+5DjSID27XCI2WinPi"
solution <- system(sprintf("echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null", encrypt, pass), intern = T, ignore.stderr = T)
cat(solution, sep = "\n")
eval(parse(text = solution))
What is XML?
(https://www.w3schools.com/XML/xml_whatis.asp)
XML is a software- and hardware-independent tool for storing and transporting data.
XML Does Not DO Anything
Maybe it is a little hard to understand, but XML does not DO anything.
This note is a note to Tove from Jani, stored as XML:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The XML above is quite self-descriptive: it carries the sender, the receiver, a heading, and a message body.
But still, the XML above does not DO anything. XML is just information wrapped in tags.
The Difference Between XML and HTML
XML and HTML were designed with different goals:
XML Does Not Use Predefined Tags
The XML language has no predefined tags.
The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.
HTML works with predefined tags like <p>, <h1>, <table>, etc.
With XML, the author must define both the tags and the document structure.
XML is Extensible
Most XML applications will work as expected even if new data is added (or removed).
Imagine an application designed to display the original version of note.xml (<to> <from> <heading> <body>).
Then imagine a newer version of note.xml with added <date> and <hour> elements, and a removed <heading>.
The way XML is constructed, older version of the application can still work:
<note>
<date>2015-09-01</date>
<hour>08:30</hour>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>
(https://www.w3schools.com/XML/xml_syntax.asp)
The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to use.
XML Documents Must Have a Root Element
XML documents must contain one root element that is the parent of all other elements:
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
In this example, <note> is the root element:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
All XML Elements Must Have a Closing Tag
In XML, it is illegal to omit the closing tag. All elements must have a closing tag:
<p>This is a paragraph.</p>
<br />
(Empty elements may be closed within the opening tag, as in <br />.)
XML Tags are Case Sensitive
XML tags are case sensitive. The tag <Letter> is different from the tag <letter>.
Opening and closing tags must be written with the same case:
<message>This is correct</message>
XML Elements Must be Properly Nested
In HTML, you might see improperly nested elements:
<b><i>This text is bold and italic</b></i>
In XML, all elements must be properly nested within each other:
<b><i>This text is bold and italic</i></b>
In the example above, "properly nested" simply means that since the <i> element is opened inside the <b> element, it must be closed inside the <b> element.
XML Attribute Values Must Always be Quoted
XML elements can have attributes in name/value pairs just like in HTML.
In XML, the attribute values must always be quoted:
<note date="12/11/2007">
<to>Tove</to>
<from>Jani</from>
</note>
Entity References
Some characters have a special meaning in XML.
If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element.
This will generate an XML error:
<message>salary < 1000</message>
To avoid this error, replace the "<" character with an entity reference:
<message>salary &lt; 1000</message>
There are 5 pre-defined entity references in XML:
&lt; < less than
&gt; > greater than
&amp; & ampersand
&apos; ' apostrophe
&quot; " quotation mark
On the 2nd of December 2018, I scraped 994 realty listing pages of residences for sale in Mecidiyekoy, Sisli neighbourhood from www.hurriyetemlak.com
datadir=~/data
hemlak=$datadir/he_sisli
ls -l $hemlak | head
Viewing the data in its raw form is not quite intuitive:
find $hemlak -mindepth 1 | head -1 | xargs cat
An XML/HTML parser can parse, pretty print, and traverse the files.
Xidel is a highly performant one:
find $hemlak -mindepth 1 | head -1 | xargs -i xidel --input-format html --output-format html -e "/" {}
Another way to view an indented version of an XML/HTML file is from R:
a <- xml2::read_html("~/data/he_sisli/10005103")
htmltidy::html_view(a)
However, opening the file with a web browser (preferably Chrome), hitting the F12 key, and viewing the "Elements" pane is the best option.
This way we can obtain the XPath of the nodes holding the information we want from the files:
https://www.w3schools.com/xml/xpath_intro.asp
https://www.w3schools.com/xml/xpath_nodes.asp
XPath Terminology
Nodes
In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes.
XML documents are treated as trees of nodes. The topmost element of the tree is called the root element.
Look at the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Example of nodes in the XML document above:
<bookstore> (root element node)
<author>J K. Rowling</author> (element node)
lang="en" (attribute node)
Atomic values
Atomic values are nodes with no children or parent.
Example of atomic values:
J K. Rowling
"en"
Items
Items are atomic values or nodes.
Parent
Each element and attribute has one parent.
In the following example; the book element is the parent of the title, author, year, and price:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Children
Element nodes may have zero, one or more children.
In the following example; the title, author, year, and price elements are all children of the book element:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Siblings
Nodes that have the same parent.
In the following example; the title, author, year, and price elements are all siblings:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Ancestors
A node's parent, parent's parent, etc.
In the following example; the ancestors of the title element are the book element and the bookstore element:
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Descendants
A node's children, children's children, etc.
In the following example; descendants of the bookstore element are the book, title, author, year, and price elements:
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
We will use the following XML document in the examples below.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. The most useful path expressions are listed below:
Expression | Description |
---|---|
nodename | Selects all nodes with the name "nodename" |
/ | Selects from the root node |
// | Selects nodes in the document from the current node that match the selection no matter where they are |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
In the table below we have listed some path expressions and the result of the expressions:
Path Expression | Result |
---|---|
bookstore | Selects all nodes with the name "bookstore" |
/bookstore | Selects the root element bookstore Note: If the path starts with a slash ( / ) it always represents an absolute path to an element! |
bookstore/book | Selects all book elements that are children of bookstore |
//book | Selects all book elements no matter where they are in the document |
bookstore//book | Selects all book elements that are descendant of the bookstore element, no matter where they are under the bookstore element |
//@lang | Selects all attributes that are named lang |
Path Expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is the child of the bookstore element.
Note: In IE 5,6,7,8,9 first node is[0], but according to W3C, it is [1]. To solve this problem in IE, set the SelectionLanguage to XPath: In JavaScript: xml.setProperty("SelectionLanguage","XPath"); |
/bookstore/book[last()] | Selects the last book element that is the child of the bookstore element |
/bookstore/book[last()-1] | Selects the last but one book element that is the child of the bookstore element |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element |
//title[@lang] | Selects all the title elements that have an attribute named lang |
//title[@lang='en'] | Selects all the title elements that have a "lang" attribute with a value of "en" |
/bookstore/book[price>35.00] | Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00 |
/bookstore/book[price>35.00]/title | Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00 |
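The expressions in these tables can be tried out on the bookstore document with any XPath evaluator; a sketch using xmllint's --xpath option (assuming libxml2's xmllint is installed; the file path /tmp/bookstore.xml is arbitrary):

```shell
# recreate the example document
cat > /tmp/bookstore.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
EOF

xmllint --xpath 'count(//book)' /tmp/bookstore.xml; echo                          # 2
xmllint --xpath '(//title)[1]/text()' /tmp/bookstore.xml; echo                    # Harry Potter
xmllint --xpath '/bookstore/book[price>35.00]/title/text()' /tmp/bookstore.xml; echo  # Learning XML
```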
XPath wildcards can be used to select unknown XML nodes.
Wildcard | Description |
---|---|
* | Matches any element node |
@* | Matches any attribute node |
node() | Matches any node of any kind |
In the table below we have listed some path expressions and the result of the expressions:
Path Expression | Result |
---|---|
/bookstore/* | Selects all the child element nodes of the bookstore element |
//* | Selects all elements in the document |
//title[@*] | Selects all title elements which have at least one attribute of any kind |
Path Expression | Result |
---|---|
//book/title | //book/price | Selects all the title AND price elements of all book elements |
//title | //price | Selects all the title AND price elements in the document |
/bookstore/book/title | //price | Selects all the title elements of the book element of the bookstore element AND all the price elements in the document |
Back to our listings: the price in a page is held in a node like
<span>500.000 TL</span>
Now we want to get the path to this node. Copied from the browser's "Elements" pane, the XPath is:
/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li[1]/span
Now we can use Xidel to get the value at this XPath, provided that the queried XML/HTML file has a similar DOM structure (hierarchy of nodes)
xidel --input-format html --output-format adhoc -e \
"/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li[1]/span" \
$hemlak/10005103 2> /dev/null
However, traversing by indices alone may not be correct in all cases: the count of a given element can change across otherwise similar pages.
So we will use attributes instead, which is more robust:
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()" \
$hemlak/10005103 2> /dev/null
Of course, it is better to keep only the numeric value, dropping the "." thousands separators and the "TL" part:
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()" \
$hemlak/10005103 2> /dev/null | \
tr -d "." | grep -Po "\d+"
Now we can traverse through all 994 files to get price information
he_prices=$(for i in $hemlak/*; do \
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()" \
$i 2> /dev/null | \
tr -d "." | grep -Po "\d+";
done)
echo "$he_prices"
ls $hemlak | wc -l
echo "$he_prices" | wc -l
So we have one price value for each of the 994 listing files.
Similarly we will get the square meter information from listing files:
he_sqm=$(for i in $hemlak/*; do \
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[@id='realtyGrossSqm']/following-sibling::span/text()" \
$i 2> /dev/null | \
tr -d "." | grep -Po "\d+";
done)
echo "$he_sqm" | wc -l
Note that we also have to control for missing values, so that the output stays parallel (one line per file) with the previous ones:
he_kredi=$(for i in $hemlak/*; do \
kredi=$(xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[text()='Krediye Uygunluk']/following-sibling::span/text()" \
$i 2> /dev/null);
if [[ -z $kredi ]]; then echo NA; else echo $kredi; fi;
done)
echo "$he_kredi" | wc -l
EXERCISE 4:
Get the age (Bina Yaşı) info from files and save into "he_age" object.
Check that total length is 994
SOLUTION 4:
pass1=
encrypt="U2FsdGVkX1/eIM2iFwU8EiycbOHQU9x+E17J9APxpeP4xuHq88DO5QUGfmMq0yvi 7BqTIovpXlJ08UUBy+C30htC9YXNU0yc5RO3poMIC57QJ/TsIO1S7cJcmHAQJKYl 0mfbwfMgIEjHmjjAQ8XPwmARSwobvXlyl/6IK6lLNDZL3Ve9bwHBJD9cm8KsSEeX TQGukNiUej07kD7jGD6+q3CKz29an0olwDLvK7dgPaKavfZSGIpa7CAHqAriVx83 Fq+uQKu2s1E5vpuA85huxyrVuWqMJFwoq5uiq2KBv446Sm+AOPSOJDVVxS1B+DB8 Ayh9E18VChQ0JS2Aw+wjlRF3wzE3Ew3Ne1pUUUSbPHBde3t3RgfCzD3u+LdR1K7h QQlGJGIOpbSHe7mRNVeiuhNwl0qGe72jdFVPqcC4DE8ZuW5AySn+SQ2I4kvvmhRs jOjHOddqrH6c9BDkoibzrrWoR2teu8aIbMBCNaBrDIU="
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now, together with the filenames (listing codes), let's combine all the information into tab-separated text:
selected_info=$(paste <(ls $hemlak) <(echo "$he_prices") <(echo "$he_sqm") \
<(echo "$he_kredi") <(echo "$he_age") --delimiters "\t")
echo "$selected_info" | column -s $'\t' -t
And save into a file:
echo "$selected_info" > selected_info.tsv
Now let's analyze the selected info using R
selected_info <- read.delim("selected_info.tsv", header = F)
See the structure:
str(selected_info)
See missing data:
lapply(selected_info, function(x) which(is.na(x)))
selected_info2 <- na.omit(selected_info)
str(selected_info2)
selected_info2
Let's rename columns:
names(selected_info2) <- c("id", "price", "sqm", "eligibility", "age")
selected_info2
Let's add another column for price per sqm
selected_info2$pricesqm <- with(selected_info2, price / sqm)
selected_info2
Now let's see whether price per sqm differs across eligibility values:
with(selected_info2, aggregate(pricesqm, by = list(eligibility), FUN = mean))
Price per sqm for loan-eligible properties is significantly higher.
The distribution across eligibility classes is:
table(selected_info2$eligibility)
Now let's draw a simple scatterplot of price per sqm against age for the loan-eligible properties:
eligible <- selected_info2[selected_info2$eligibility == "Uygun",]
with(eligible, plot(age, pricesqm))
No significant trend is detected without further data wrangling and cleaning (such as omitting outliers)
Let's look at the beginning of a raw listing file once more:
cat ~/data/he_sisli/10005103 | head
We will extract all the information to the left of the picture in a listing as a table using R
First parse the file as html:
library(XML)
parsed <- XML::htmlParse("~/data/he_sisli/10005103")
parsed
The xpath for the table is:
xpath1 <- "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='info-line']/ul[@class='clearfix']"
Apply the XPath to document
table1 <- xpathApply(parsed, xpath1)[[1]]
str(table1)
Convert XMLNode to an XMLInternalDocument:
xml1 <- xmlDoc(table1)
xml1
And extract items as two columns:
table2 <- sapply(1:2, function(x) xpathSApply(xml1, sprintf("//li/span[%s]/text()", x), xmlValue))
table2