by Serhat Çevikel
JSON is a hierarchical data format for storing and querying data that does not fit naturally into rows and columns.
Let's say we are tracking our contact data in a csv file:
Lastname, Firstname, Phone Number
Membrey, Peter, +852 1234 5678
Thielen, Wouter, +81 1234 5678
Now suppose Peter gets a second phone number. In a flat file the only way to record it is to add a new column, which stays empty for every contact with a single number:
Lastname, Firstname, Phone Number1, Phone Number2
Membrey, Peter, +852 1234 5678, +44 1234 565 555
Thielen, Wouter, +81 1234 5678
But suppose we have millions of records with tens of fields, and in exceptional cases a record may carry many values for a single field: 10 telephone numbers, for instance.
The JSON format is a remedy for these kinds of flexibility issues and for hierarchical data in general.
Integrity rules are softer when handling JSON data.
record1='{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}'
echo $record1
Echoed unquoted like this, the whole JSON collapses onto a single line, which is not easy to read or parse.
We may use online json parsers for this purpose.
You can copy and paste the above string into the input pane:
echo $record1 | jq .
jq is a parser and querying tool for JSON that produces nicely formatted output.
You can find more info on jq at the following links:
{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}
"firstname" is the key, "Peter" is the value, and so on.
We can also have arrays of values for a single key, delimited by square brackets []:
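Beyond pretty printing, jq can extract individual values with simple filters; a quick sketch, redefining the record for self-containment (assuming jq is installed):

```shell
record1='{"firstname": "Peter", "lastname": "Membrey",
  "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]}'

echo "$record1" | jq -r '.firstname'            # key lookup -> Peter
echo "$record1" | jq -r '.phone_numbers[0]'     # array indexing -> +852 1234 5678
echo "$record1" | jq '.phone_numbers | length'  # array length -> 2
```

The -r flag prints raw strings (without the surrounding double quotes).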
record1b='{
"firstname": "Peter",
"lastname": "Membrey",
"numbers": [
{
"phone": "+852 1234 5678"
},
{
"fax": "+44 1234 565 555"
}
]
}'
echo -e $record1b "\n"
echo -e "$record1b\n"
echo $record1b | jq .
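The difference between the two echo variants above comes from word splitting: left unquoted, the expansion collapses each run of whitespace (including newlines) into a single space, while the quoted form preserves the layout. A minimal demonstration:

```shell
v=$'line1\nline2'   # a two-line string

echo $v     # unquoted: word splitting joins the lines -> line1 line2
echo "$v"   # quoted: the newline survives
```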
Multiple documents can be "collected" inside "collections":
A "collection" in NoSQL terminology is analogous to a "table" in the RDBMS jargon. A collection is a collection of similar items (or documents with similar key-value pairs)
UN COMTRADE is one of the largest and most comprehensive databases on international trade:
Let's first save the path:
datadir=~/data
comtrade=$datadir/comtrade_s1
ls $comtrade
jq . $comtrade/reporterAreas.json
Another way to pretty print and navigate through json files is R:
reporter <- jsonlite::fromJSON("~/data/comtrade_s1/reporterAreas.json")
reporter
str(reporter)
See that, for not-so-deeply-nested structures, the data is automatically flattened into a data frame.
listviewer provides a collapsible, interactive gadget for viewing JSON and similar hierarchical data (you may need to execute it several times for the JS gadget to render):
listviewer::jsonedit(reporter, mode = "form")
Now let's traverse through this document to list country texts:
jq '.results[].text' $comtrade/reporterAreas.json | tr -d '"'
And let's list the country codes:
jq -r '.results[].id' $comtrade/reporterAreas.json
reporters=$(jq -r '.results[] | "\(.text)\t\(.id)"' $comtrade/reporterAreas.json | \
xargs -0 -i echo -e "{}")
echo "$reporters" | column -s $'\t' -t
We can filter with grep or awk:
echo "$reporters" | awk -F "\t" '$1=="Turkey"{print $2}'
Another way to flatten fields is:
reporters2=$(jq -r '.results[] | .text + "\t" + .id' $comtrade/reporterAreas.json | xargs -0 -i echo -e "{}")
echo "$reporters2" | column -s $'\t' -t
echo "$reporters" | awk -F "\t" '$1=="Turkey"{print $2}'
Or we can use the "select" statement for filtering values:
jq -r '.results[] | select(.text == "Turkey") | .id' $comtrade/reporterAreas.json
Now let's go through the classification file:
classification <- jsonlite::fromJSON("~/data/comtrade_s1/classificationS1.json")
listviewer::jsonedit(classification, mode = "form")
Now we will filter for those entries, in which text includes "textile" and code is only 3 digits:
jq -r '.results[] | select((.id|test("^\\d+$")) and (.text|test("(?i)textile"))) |
select((.id|tonumber < 1000) and (.id|tonumber > 99)) | .text' \
$comtrade/classificationS1.json | \
sed 's/ - /\t/g' | column -s $'\t' -t
See how it works: test("^\\d+$") keeps only the purely numeric ids, test("(?i)textile") matches "textile" case-insensitively, the tonumber range keeps only the 3-digit codes (100-999), and the sed/column pipeline splits on the " - " separator and aligns the columns.
EXERCISE 1:
Now your turn: Find the id and text of codes parent of which are 651 (Textile yarn and thread).
The result will be:
6511 Thrown silk & silk yarn and thread
6512 Yarn of wool and animal hair
6513 Cotton yarn & thread, grey, not mercerized
6514 Cotton yarn & thread, bleached, dyed, mercerd.
6515 Yarn and thread of flax, ramie and true hemp
6516 Yarn and thread of synthetic fibres
6517 Yarn and thread of regenerated fibres
6518 Yarn of glass fibre
6519 Yarn of textile fibres,nes incl.paper yarn
SOLUTION 1:
pass1=
encrypt="U2FsdGVkX1+WtEYqehLba8zmJ9evvmr1ohvj0BWaggG2cQUjY/HMZia25jIkV9zR jyLxrSmHJRCX55wFh7JIB9U1sjWJ1cvipesjhECBk3B3GaP97Kjsq9cDu0jnjYu7 cM9b9CKPMJN1Am/sywL5RQT+z0nDKopIuq78l3p+eeqEHQNvloe35hu/y19+lFmt 6f4EG2i1cQ2JJq8bBMyl9g=="
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now let's go through the actual data files that include trade volumes:
ls $comtrade/2010e | { head -5; tail -5; }
ls $comtrade/2010e | grep -P "2010_792" | head -5
In each filename, the code after the first underscore is the reporter's country code, and the code after the second underscore is the partner's.
So these are the files in which Turkey (code 792) is either the reporter or the partner.
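That naming convention (year_reporter_partner_classification) can be unpacked directly in the shell; a sketch, assuming the filenames always carry four underscore-separated parts:

```shell
fname="2010_792_100_s1.json"

# strip the extension, then split the remainder on underscores
IFS=_ read -r year reporter partner class <<< "${fname%.json}"

echo "year=$year reporter=$reporter partner=$partner"   # year=2010 reporter=792 partner=100
```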
For how many files Turkey is reporter or a partner?
ls $comtrade/2010e | grep -P "2010_792" | wc -l
ls $comtrade/2010e | grep -P "792_s1" | wc -l
Turkey reported her trade with 227 partners, while 165 reporter countries reported their trade partnering with Turkey (for s1 classification only)
tradedata <- jsonlite::fromJSON("~/data/comtrade_s1/2010e/2010_792_100_s1.json")
listviewer::jsonedit(tradedata, mode = "form")
Descriptions for several variables are:
Now, from all files in which Turkey is the reporter, we will report the TradeValue of exports (rgCode is 2) under commodity code 651 (Textile yarn and thread).
For a single file we have:
jq -r '.dataset[] | select(.cmdCode == "651" and .rgCode == 2) |
"\(.ptTitle)\t\(.TradeValue)\t\(.TradeQuantity)"' $comtrade/2010e/2010_792_100_s1.json | \
xargs -0 -i echo -e "{}"
textileexports=$(for file in $comtrade/2010e/2010_792*.json;
do
jq -r '.dataset[] | select(.cmdCode == "651" and .rgCode == 2) |
"\(.ptTitle)\t\(.TradeValue)\t\(.TradeQuantity)"' $file | \
xargs -0 -i echo -e "{}";
done)
echo "$textileexports" | column -s $'\t' -t
Let's exclude World total:
echo "$textileexports" | awk -F "\t" '$1 != "World"' | \
column -s $'\t' -t
echo "$textileexports" | \
awk -F "\t" '$1 != "World"' | \
sort -t $'\t' -nr -k 2 | \
numfmt -d $'\t' --field=2-3 --to-unit=M | \
column -s $'\t' -t
What we do here: exclude the World aggregate, sort numerically in descending order on the second field (TradeValue), scale fields 2-3 to millions with numfmt, and align the columns with column.
So in 2010, Turkey's significant partners for Textile yarn and thread export are:
EXERCISE 2:
So you should fill in this template below:
for file in $comtrade/2010e/*792_s1.json;
do
jq -r '#fill in this part' $file | \
xargs -0 -i echo -e "{}";
done | \
awk -F "\t" '$1 != "EU-28"' | \
sort -t $'\t' -nr -k 3 | \
numfmt -d $'\t' --field=3 --to-unit=M | \
column -s $'\t' -t
The output will be:
Germany Passenger motor cars, other than buses 2653
Algeria Gas, natural 2591
USA Special transactions 2010
Russian Federation Crude petroleum 1607
Germany Other parts for motor vehicles 1231
China Statistical machines cards or tapes 1027
France Other parts for motor vehicles 1003
Germany Internal combustion engines, not for aircraft 985
Russian Federation Coal /anthracite, bituminous/ 972
Germany Special transactions 948
Spain Passenger motor cars, other than buses 923
Russian Federation Aluminium and aluminium alloys, unwrought 865
USA Raw cotton, other than linters 862
Russian Federation Refined copper including remelted 861
Saudi Arabia Products of polymerization and copolymerization 849
Italy Other parts for motor vehicles 806
France Aircraft, heavier than air 799
Belgium Products of polymerization and copolymerization 644
Qatar Gas, natural 631
Italy Machinery and mechanical appliances, nes 562
You can view a sample data:
tradedata <- jsonlite::fromJSON("~/data/comtrade_s1/2010e/2010_100_792_s1.json")
listviewer::jsonedit(tradedata, mode = "form")
SOLUTION 2:
pass1=
encrypt="U2FsdGVkX1+4esKDuc9iGGz0tWhqscfUMmjPGTa2MIp+YPGWOvdJZY0ZLMx3XSI5 42ueQ3AERCACRgSCZOnUhnDJG4wgnAgcmv+RXW0fR82yLC6g5M2+t3XuODqTIZag Y+g6DkbnapcedfV2K5hg0ELct8YzomN7648I/vQxCQEYEPlwxv/Pdw7IBEeIgFTH zJ25jJlAKJerLYTl9G+XXr01Pwlh5VS36wh16/opSTYjeh++FgvCCH2WKBiFfLjK oxzjfLBcpcqWLVOQ8ozqCean7tnTPIHR0QzjJeiqJ91jBOLbxPpnyT93Yu5Ikxne MbZ/7zAVqLmQVJCpLj6igmHtYxMv3Ck1HC5RKLX2Qrb8sPygcafwyGRp+r0jEsOk nl3MxKOVMuWIwvA+gmJwQhUJXmf7Zm1YX9tcn3PJe6IWPom2P9esB+GXKMnFkpL9 IsDEP/nvQ5TXe+gDGpKWzlEirPp6CwAB6YNiH/phRMUmrbg2DtsRSvZA9bMslFbf eXrItWZ34YuwnqDSlAaquA=="
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
So according to numbers reported by other countries, in 2010 Turkey's largest imports were:
Note that SITC 1 is a very old classification so some modern commodities may not fit well into this schema
Now we will import the files into MongoDB
First we create a directory for MongoDB database:
mkdir -p ~/mongo
And run the mongo server:
mongod -dbpath ~/mongo &
And import the files into a new database called comtrade and collection called 2010s1:
DO NOT REPEAT THIS STEP MULTIPLE TIMES, YOU WILL HAVE DUPLICATE DOCUMENTS! (If you do need to re-import, mongoimport's --drop option drops the collection before loading.)
for file in $comtrade/2010e/*.json;
do mongoimport --db comtrade --collection 2010s1 $file; done
Now, from R, load the mongolite package:
library(mongolite)
We create a connection object:
con <- mongo(collection = "2010s1", db = "comtrade")
Select the document(s) where rtTitle is Bulgaria and return the commodity description, commodity code and TradeValue fields:
bulgaria_vals <- con$find('{"dataset.rtTitle" : "Bulgaria"}',
fields = '{"dataset.cmdDescE": 1, "dataset.cmdCode": 1, "dataset.TradeValue":1 }')
The find() method takes a JSON query document as its first argument; in the fields projection, the fields to be returned get the value 1.
Let's view the structure of the return value:
str(bulgaria_vals)
And view only the dataset:
bulgaria_vals$dataset
Now we will get the values for countries with which Turkey has at least USD 50 mio export of 651 coded "Textile yarn and thread":
export_651 <- con$find('{
"dataset":
{ "$elemMatch":
{ "$and": [
{ "rtTitle" : "Turkey" },
{ "cmdCode" : "651" },
{ "TradeValue" : { "$gt": 50000000 } },
{ "ptTitle" : { "$ne" : "World" } },
{ "rgCode" : 2 }
] } } }',
fields = '{"dataset.ptTitle": 1,
"dataset.rgCode": 1,
"dataset.cmdDescE": 1,
"dataset.cmdCode": 1,
"dataset.TradeValue":1 }')
str(export_651)
export_651_2 <- do.call(rbind, export_651$dataset)
with(export_651_2, export_651_2[rgCode == "2" & cmdCode == "651",])
EXERCISE 3:
Select those countries for which Turkey has at least USD 5 bio exports
SOLUTION 3:
pass <- readline(prompt = "Please enter the password for the solution: ")
encrypt <- "U2FsdGVkX1/cPhgJB5HjYQ4xWOnuyfs5vKrofNpNlV9UHETg+d+KYoLJw7Py0t9p ap3m5rxDCl7JtmwH6VZKMf5sfooPDhhAnI5YXHJJmAc78U5ZYT/aVL75466bi6Te ic4Yd4uM1jcyblnj9LAp2n76eyGFFaEmAY2hmwn+uaDbk91xTNTYnE0gNqmjMJHs t60XDU19QfOTF8YUyG1cbYQwGqMp4k9u1o45Co5f4BRH8aWrgdCp1NBCd+X2itsl ujQYQ45+dnj7D01iezb9gPxvWiEMT+rIP7rWEQh8+wg/mIRR0zctqwk0xrIXLSNo /q1RCrw8DTaoqX5De3EZXxFzWNsw81+/9eehspVmosb9KnoNEKfZzgj99dm/ICDq 8uf9bqxZ373tN6oC23znD1vndhDPVl9KbwoNHjLa9Cb+Do5qoQqQ+hxEnDQm9I96 jai0dRPTnf5RXeb4Zs4Us3+8wahR3gm797WU/qJVIymbZH+5DjSID27XCI2WinPi"
solution <- system(sprintf("echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null", encrypt, pass), intern = T, ignore.stderr = T)
cat(solution, sep = "\n")
eval(parse(text = solution))
What is XML?
(https://www.w3schools.com/XML/xml_whatis.asp)
XML is a software- and hardware-independent tool for storing and transporting data.
XML Does Not DO Anything
Maybe it is a little hard to understand, but XML does not DO anything.
This note is a note to Tove from Jani, stored as XML:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The XML above is quite self-descriptive: it carries the sender, the receiver, a heading, and a message body.
But still, the XML above does not DO anything. XML is just information wrapped in tags.
The Difference Between XML and HTML
XML and HTML were designed with different goals:
XML Does Not Use Predefined Tags
The XML language has no predefined tags.
The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.
HTML works with predefined tags like <p>, <h1>, <table>, etc.
With XML, the author must define both the tags and the document structure.
XML is Extensible
Most XML applications will work as expected even if new data is added (or removed).
Imagine an application designed to display the original version of note.xml (<to> <from> <heading> <body>).
Then imagine a newer version of note.xml with added <date> and <hour> elements, and a removed <heading>.
The way XML is constructed, older version of the application can still work:
<note>
<date>2015-09-01</date>
<hour>08:30</hour>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>
(https://www.w3schools.com/XML/xml_syntax.asp)
The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to use.
XML Documents Must Have a Root Element
XML documents must contain one root element that is the parent of all other elements:
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
In this example, <note> is the root element:
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
All XML Elements Must Have a Closing Tag
In XML, it is illegal to omit the closing tag. All elements must have a closing tag:
<p>This is a paragraph.</p>
<br />
(Empty elements may be closed within the opening tag, as in <br />.)
XML Tags are Case Sensitive
XML tags are case sensitive. The tag <Letter> is different from the tag <letter>.
Opening and closing tags must be written with the same case:
<message>This is correct</message>
XML Elements Must be Properly Nested
In HTML, you might see improperly nested elements:
<b><i>This text is bold and italic</b></i>
In XML, all elements must be properly nested within each other:
<b><i>This text is bold and italic</i></b>
In the example above, "properly nested" simply means that since the <i> element is opened inside the <b> element, it must be closed inside the <b> element.
XML Attribute Values Must Always be Quoted
XML elements can have attributes in name/value pairs just like in HTML.
In XML, the attribute values must always be quoted:
<note date="12/11/2007">
<to>Tove</to>
<from>Jani</from>
</note>
Entity References
Some characters have a special meaning in XML.
If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element.
This will generate an XML error:
<message>salary < 1000</message>
To avoid this error, replace the "<" character with an entity reference:
<message>salary &lt; 1000</message>
There are 5 pre-defined entity references in XML:
&lt; < less than
&gt; > greater than
&amp; & ampersand
&apos; ' apostrophe
&quot; " quotation mark
On the 2nd of December 2018, I scraped 994 realty listing pages of residences for sale in Mecidiyekoy, Sisli neighbourhood from www.hurriyetemlak.com
datadir=~/data
hemlak=$datadir/he_sisli
ls -l $hemlak | head
Viewing the data in its raw form is not quite intuitive:
find $hemlak -mindepth 1 | head -1 | xargs cat
An XML/HTML parser can parse, pretty print, and traverse the files.
Xidel is a highly performant one:
find $hemlak -mindepth 1 | head -1 | xargs -i xidel --input-format html --output-format html -e "/" {}
Another way to view an indented version of an XML/HTML file is from R:
a <- xml2::read_html("~/data/he_sisli/10005103")
htmltidy::html_view(a)
However, opening the file with a web browser (preferably Chrome), hitting the F12 key, and viewing the "Elements" pane is the best option.
This way we can obtain the XPath of the nodes holding the information we want from the files:
https://www.w3schools.com/xml/xpath_intro.asp
https://www.w3schools.com/xml/xpath_nodes.asp
XPath Terminology
Nodes
In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes.
XML documents are treated as trees of nodes. The topmost element of the tree is called the root element.
Look at the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Example of nodes in the XML document above:
<bookstore> (root element node)
<author>J K. Rowling</author> (element node)
lang="en" (attribute node)
Atomic values
Atomic values are nodes with no children or parent.
Example of atomic values:
J K. Rowling
"en"
Items
Items are atomic values or nodes.
Parent
Each element and attribute has one parent.
In the following example; the book element is the parent of the title, author, year, and price:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Children
Element nodes may have zero, one or more children.
In the following example; the title, author, year, and price elements are all children of the book element:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Siblings
Nodes that have the same parent.
In the following example; the title, author, year, and price elements are all siblings:
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
Ancestors
A node's parent, parent's parent, etc.
In the following example; the ancestors of the title element are the book element and the bookstore element:
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
Descendants
A node's children, children's children, etc.
In the following example; descendants of the bookstore element are the book, title, author, year, and price elements:
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
We will use the following XML document in the examples below.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
<title lang="en">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="en">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>
XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. The most useful path expressions are listed below:
Expression | Description |
---|---|
nodename | Selects all nodes with the name "nodename" |
/ | Selects from the root node |
// | Selects nodes in the document from the current node that match the selection no matter where they are |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
In the table below we have listed some path expressions and the result of the expressions:
Path Expression | Result |
---|---|
bookstore | Selects all nodes with the name "bookstore" |
/bookstore | Selects the root element bookstore Note: If the path starts with a slash ( / ) it always represents an absolute path to an element! |
bookstore/book | Selects all book elements that are children of bookstore |
//book | Selects all book elements no matter where they are in the document |
bookstore//book | Selects all book elements that are descendant of the bookstore element, no matter where they are under the bookstore element |
//@lang | Selects all attributes that are named lang |
Path Expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is the child of the bookstore element.
Note: In IE 5,6,7,8,9 first node is[0], but according to W3C, it is [1]. To solve this problem in IE, set the SelectionLanguage to XPath: In JavaScript: xml.setProperty("SelectionLanguage","XPath"); |
/bookstore/book[last()] | Selects the last book element that is the child of the bookstore element |
/bookstore/book[last()-1] | Selects the last but one book element that is the child of the bookstore element |
/bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element |
//title[@lang] | Selects all the title elements that have an attribute named lang |
//title[@lang='en'] | Selects all the title elements that have a "lang" attribute with a value of "en" |
/bookstore/book[price>35.00] | Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00 |
/bookstore/book[price>35.00]/title | Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00 |
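The expressions in these tables can be tried out on the bookstore document with any XPath evaluator; a sketch using xmllint's --xpath option (assuming libxml2's xmllint is installed; the file path /tmp/bookstore.xml is arbitrary):

```shell
# recreate the example document
cat > /tmp/bookstore.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
EOF

xmllint --xpath 'count(//book)' /tmp/bookstore.xml; echo                          # 2
xmllint --xpath '(//title)[1]/text()' /tmp/bookstore.xml; echo                    # Harry Potter
xmllint --xpath '/bookstore/book[price>35.00]/title/text()' /tmp/bookstore.xml; echo  # Learning XML
```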
XPath wildcards can be used to select unknown XML nodes.
Wildcard | Description |
---|---|
* | Matches any element node |
@* | Matches any attribute node |
node() | Matches any node of any kind |
In the table below we have listed some path expressions and the result of the expressions:
Path Expression | Result |
---|---|
/bookstore/* | Selects all the child element nodes of the bookstore element |
//* | Selects all elements in the document |
//title[@*] | Selects all title elements which have at least one attribute of any kind |
Path Expression | Result |
---|---|
//book/title | //book/price | Selects all the title AND price elements of all book elements |
//title | //price | Selects all the title AND price elements in the document |
/bookstore/book/title | //price | Selects all the title elements of the book element of the bookstore element AND all the price elements in the document |
Back to our listings: the price in a page is held in a node like
<span>500.000 TL</span>
Now we want to get the path to this node. Copied from the browser's "Elements" pane, the XPath is:
/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li[1]/span
Now we can use Xidel to get the value at this XPath, provided that the queried XML/HTML file has a similar DOM structure (hierarchy of nodes)
xidel --input-format html --output-format adhoc -e \
"/html/body/div[1]/div[2]/div[2]/div[1]/div[2]/div/div/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li[1]/span" \
$hemlak/10005103 2> /dev/null
However, traversing by indices alone may not be correct in all cases: the count of a given element can change across otherwise similar pages.
So we will use attributes instead, which is more robust:
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()" \
$hemlak/10005103 2> /dev/null
Of course, it is better to keep only the numeric value, dropping the "." thousands separators and the "TL" part:
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()" \
$hemlak/10005103 2> /dev/null | \
tr -d "." | grep -Po "\d+"
Now we can traverse through all 994 files to get price information
he_prices=$(for i in $hemlak/*; do \
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()" \
$i 2> /dev/null | \
tr -d "." | grep -Po "\d+";
done)
echo "$he_prices"
ls $hemlak | wc -l
echo "$he_prices" | wc -l
So we have one price value for each of the 994 listing files.
Similarly we will get the square meter information from listing files:
he_sqm=$(for i in $hemlak/*; do \
xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[@id='realtyGrossSqm']/following-sibling::span/text()" \
$i 2> /dev/null | \
tr -d "." | grep -Po "\d+";
done)
echo "$he_sqm" | wc -l
Note that we also have to control for missing values, so that the output stays parallel (one line per file) with the previous ones:
he_kredi=$(for i in $hemlak/*; do \
kredi=$(xidel --input-format html --output-format adhoc -e \
"//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[text()='Krediye Uygunluk']/following-sibling::span/text()" \
$i 2> /dev/null);
if [[ -z $kredi ]]; then echo NA; else echo $kredi; fi;
done)
echo "$he_kredi" | wc -l
EXERCISE 4:
Get the age (Bina Yaşı) info from files and save into "he_age" object.
Check that total length is 994
SOLUTION 4:
pass1=
encrypt="U2FsdGVkX1/eIM2iFwU8EiycbOHQU9x+E17J9APxpeP4xuHq88DO5QUGfmMq0yvi 7BqTIovpXlJ08UUBy+C30htC9YXNU0yc5RO3poMIC57QJ/TsIO1S7cJcmHAQJKYl 0mfbwfMgIEjHmjjAQ8XPwmARSwobvXlyl/6IK6lLNDZL3Ve9bwHBJD9cm8KsSEeX TQGukNiUej07kD7jGD6+q3CKz29an0olwDLvK7dgPaKavfZSGIpa7CAHqAriVx83 Fq+uQKu2s1E5vpuA85huxyrVuWqMJFwoq5uiq2KBv446Sm+AOPSOJDVVxS1B+DB8 Ayh9E18VChQ0JS2Aw+wjlRF3wzE3Ew3Ne1pUUUSbPHBde3t3RgfCzD3u+LdR1K7h QQlGJGIOpbSHe7mRNVeiuhNwl0qGe72jdFVPqcC4DE8ZuW5AySn+SQ2I4kvvmhRs jOjHOddqrH6c9BDkoibzrrWoR2teu8aIbMBCNaBrDIU="
solution=$(echo "$encrypt" | openssl enc -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Now, together with the filenames (listing codes), let's combine all the information into tab-separated text:
selected_info=$(paste <(ls $hemlak) <(echo "$he_prices") <(echo "$he_sqm") \
<(echo "$he_kredi") <(echo "$he_age") --delimiters "\t")
echo "$selected_info" | column -s $'\t' -t
And save into a file:
echo "$selected_info" > selected_info.tsv
Now let's analyze the selected info using R
selected_info <- read.delim("selected_info.tsv", header = F)
See the structure:
str(selected_info)
See missing data:
lapply(selected_info, function(x) which(is.na(x)))
selected_info2 <- na.omit(selected_info)
str(selected_info2)
selected_info2
Let's rename columns:
names(selected_info2) <- c("id", "price", "sqm", "eligibility", "age")
selected_info2
Let's add another column for price per sqm
selected_info2$pricesqm <- with(selected_info2, price / sqm)
selected_info2
Now let's see whether price per sqm differs across eligibility values:
with(selected_info2, aggregate(pricesqm, by = list(eligibility), FUN = mean))
Price per sqm for loan-eligible properties is significantly higher.
The distribution across eligibility classes is:
table(selected_info2$eligibility)
Now let's draw a simple scatterplot of price per sqm against age for the loan-eligible properties:
eligible <- selected_info2[selected_info2$eligibility == "Uygun",]
with(eligible, plot(age, pricesqm))
No significant trend is detected without further data wrangling and cleaning (such as omitting outliers)
Let's look at the beginning of a raw listing file once more:
cat ~/data/he_sisli/10005103 | head
We will extract all the information to the left of the picture in a listing as a table using R
First parse the file as html:
library(XML)
parsed <- XML::htmlParse("~/data/he_sisli/10005103")
parsed
The xpath for the table is:
xpath1 <- "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='info-line']/ul[@class='clearfix']"
Apply the XPath to document
table1 <- xpathApply(parsed, xpath1)[[1]]
str(table1)
Convert XMLNode to an XMLInternalDocument:
xml1 <- xmlDoc(table1)
xml1
And extract items as two columns:
table2 <- sapply(1:2, function(x) xpathSApply(xml1, sprintf("//li/span[%s]/text()", x), xmlValue))
table2