Djoerd Hiemstra, Robin Aly, University of Twente
In this tutorial (made for the SIKS/CBS DataCamp, 6 December 2016) we will go over some basic Spark Scala examples following the paper by (Zahari et al. 2012). The tutorial assumes that you have basic knowledge of Scala. If you are not very strong in Scala, you could first follow this tutorial.
Double click on cells to edit its contents. Run a cell by pressing shift-enter.
Suppose we would like to analyse a huge log file, for instance a search engine's query log. The following line reads the contents of the file log.txt
into a RDD (Resilient Distributed Dataset).
val lines = sc.textFile("log.txt")
NB The variable
spark
in the paper by Zahari et al. (2012) is calledsc
in our case (for 'Spark context').When we introduce the new varable
lines
we use eitherval lines
in Scala to define an immutable variable (a read-only value that cannot be modified) or we usevar lines
to define a mutable variable (that can be modified). Because RDDs are read-only, it is natural to only useval
when defining variables that hold an RDD.
An RDD is a read-only collection of records that may be partitioned over many machines in your cluster. RDDs can only be created from other RDDs through a limited number of operations called transformations (they may also be read from the distributed file system). Table 2 of Zahari et al. (2012) contains 13 transformations of RDDs. Why these transformations? Together these transformations support many algorithms, and, the transformations can be executed efficiently in parallel on clusters of machines.
As an example of a transformation, map()
takes as input a function, and applies this function to each record in the RDD. The functions that may be provided to map may not have any side effects, that is, they may input a record of the RDD and output a transformed record, but they cannot read or write files, nor can they have an internal state that is updated. If functions without side effects are used, then each machine in a large cluster can perform the function on part of the data without needing to know anything about results from the other machines on other parts of the data. As an example, take the following function that takes a line and outputs the line with all letters put to lower case:
line => line.toLowerCase()
This is called an anonymous function in Scala because the function has no name.
val lowerLines = lines.map(line => line.toLowerCase())
Anonymous functions are a consice way to use a function once (maybe they should have been called disposable functions). The same line with a named function would be:
def toLower(line: String): String = {
return line.toLowerCase()
}
val lowerLines2 = lines.map(toLower)
Spark runs all your operations on RDDs in parallel. If you want to do something in linear order in plain old Scala, for instance outputting the contents of the RDD, then the function collect()
turns your RDD into an ordinary Scala array. The function mkString()
turns the array into a string representation by taking an array item separator (in our case "\n"
which is the new line character):
NB Beware, the Jupyter notebook might only show part of your result.
lines.collect().mkString("\n")
INFO 2016-12-05 18:00:00 nl.utwente.santa Heerlijk avondje is gekomen INFO 2016-12-05 18:00:01 nl.utwente.santa Avondje van santa INFO 2016-12-05 18:00:02 nl.utwente.santa Vol verwachting klopt ons hart WARN 2016-12-05 18:00:03 nl.utwente.santa Warning Wie de roe krijgt wie de gard INFO 2016-12-05 18:00:04 nl.utwente.santa Vol verwachting klopt ons hart WARN 2016-12-05 18:00:06 nl.utwente.santa Warning Wie de roe krijgt wie de gard INFO 2016-12-05 18:00:07 nl.utwente.santa Singing approved by Saint Nicolas ERROR 2016-12-05 18:00:08 nl.utwente.santa Error Fire place on Aborting chimney descend WARN 2016-12-05 18:00:09 nl.utwente.santa Warning Rollback presents INFO 2016-12-05 18:01:00 nl.utwente.santa Heerlijk avondje is gekomen INFO 2016-12-05 18:01:01 nl.utwente.santa ...
Follow the examples from Zaharia et al. (2012), and print the number of lines that start with "ERROR". Your solution should print: 4.
Tip: Put your solution between curly brackets if statements that cover multiple lines confuse the Scala interpreter.
Comments in Scala are preceded by a double slash
{
// BEGIN SOLUTION
val errors = lines.filter(line => line.startsWith("ERROR")).persist()
errors.count()
// END SOLUTION
}
4
Follow the examples from Zaharia et al. (2012), and print the number of lines that start with "ERROR" and that contain "chimney". Your solution should print: 3. Tip: you might build on the result of the previous question.
{
// BEGIN SOLUTION
errors.filter(line => line.contains("chimney")).count()
// END SOLUTION
}
3
Follow the examples from Zaharia et al. (2012), and print the time fields of the lines that start with "ERROR" and that contain "chimney". Your solution should print: 2016-12-05 18:00:08
, 2016-12-05 18:01:06
, and 2016-12-05 18:01:06
.
{
// BEGIN SOLUTION
errors
.filter(line => line.contains("chimney"))
.map(line => line.split('\t')(1))
.collect()
.mkString("\n")
// END SOLUTION
}
2016-12-05 18:00:08 2016-12-05 18:01:06 2016-12-05 18:01:06
In 2004, Google employees Jeff Dean and Sanjay Ghemawat proposed a framework for distributed data processing that supports only 2 transformations: map()
and reduce()
. They called their framework appropriately MapReduce (Dean and Ghemawat 2004).
The seminal example they introduce in their paper is word count: Input a large text corpus, and output all words with for each word its count, i.e. the total number of times it occurs in the text corpus. A naive implementation might update a global data structure for each word that it encounters, adding 1 for the particular word. However, remember that we need functions without side effects to be able to distribute computations over many machines (so no updating a data structure!). Dean and Ghemawat therefore propose a solution that splits words and outputs pairs (word, 1) in the "map phase"; and then adds the 1's in the "reduce phase" (after the framework groups all data with the same together). Spark can process every MapReduce algorithm (and many more complex algorithms) using its transformations. Study the word count solution by Dean and Ghemawat, and come up with the equivalent Spark solution. See also the remarks by Zahari et al. (2012), for instance in Section 7.1.
Execute word count on the message fields of the lines that start with "ERROR". Your solution should find that the three most occurring words are "error", "descend" and "chimney", which occur respectively 5, 3 and 3 times.
Tip: build your solution one transformation at a time: Start from the lines that start with "ERROR", then take the message field, then split the (lower-cased) fields on space " " to get the words, then transform each word to (word, 1), etc. Test your solution after adding each transformation.
{
// BEGIN SOLUTION
errors
.map(line => line.split('\t')(3))
.flatMap(line => line.toLowerCase().split(' '))
.map(word => (word, 1))
.reduceByKey((a, b) => a + b) // .groupByKey().map(t => (t._1, t._2.reduce(((a, b) => a + b))))
.sortBy(c => c._2, false)
.collect()
// END SOLUTION
}
Array((error,5), (descend,3), (chimney,3), (aborting,2), (fire,2), (on,2), (place,2), (fatal,1), (retry,1), (shutting,1), (down,1), (failed,1))
Zahari et al. (2012) cover two more elegant examples of iterative algorithms that can be efficiently processed by Spark: logistic regression, and Google's PageRank. Similar solutions exist using MapReduce, but they are less efficient, because MapReduce writes all intermediate data to disk for each iteration of the algorithm, whereas Spark tries to keep data in memory, if possible.
In the following examples we use Spark with Dataframes. In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with the optimizations provided by RDDs under the hood. We base this part of the tutorial on the SIGMOD 2015 paper by (Armbrust et al. 2015)
Let's read some example data in a DataFrame using the standard environment variable sqlContext
. We read data from a json file, because it is self-describing, that is, it contains the schema information needed for DataFrames (except for the table name).
val people = sqlContext.read.json("people.json")
people.printSchema()
people.show()
root |-- gender: string (nullable = true) |-- name: string (nullable = true) |-- organization: string (nullable = true) +------+---------+------------+ |gender| name|organization| +------+---------+------------+ | male| Arjen| RU| | male| Djoerd| UT| | male| Robin| UT| | male| Yuri| UT| |female| Doina| UT| |female| Anna| UT| | male| Piet| CBS| | male| Barteld| CBS| |female|Jacobiene| CBS| | male| Marco| CBS| |female| Claudia| TUD| +------+---------+------------+
DataFrames support special functions like printSchema()
and show()
and the common relational algebra operations: projection, called select()
; selection, called where()
; and join, called join()
, as well as aggregations: (groupBy()
and agg()
).
Yes, you are right: Someone really messed up naming the relational algebra operations! The naming of relational algebra operations differs in some unfortunate ways from the naming used in SQL statements. The Spark algebra operations use the SQL conventions, to make the confusion complete.
So, like RDDs, DataFrames support a limited number of transformations (but now based on relational algebra), and that's what we will have to work with. However, we might also work directly in SQL (see Exercise 2.4 below).
Show the list of names of all people (the projection of the column 'name').
{
// START SOLUTION
people.select("name").show()
// END SOLUTION
}
+---------+ | name| +---------+ | Arjen| | Djoerd| | Robin| | Yuri| | Doina| | Anna| | Piet| | Barteld| |Jacobiene| | Marco| | Claudia| +---------+
Show the list of names of people that work at CBS (Can we first do the projection of the column 'name' and then the selection of the row for which organization="CBS"
?).
{
// START SOLUTION
people.where(people("organization") === "CBS").select("name").show()
// END SOLUTION
}
+---------+ | name| +---------+ | Piet| | Barteld| |Jacobiene| | Marco| +---------+
Let's introduce the following organization DataFrame. Interestingly, this DataFrame contains two structured columns, a feature known from object-relational databases.
val organizations = sqlContext.read.json("organizations.json")
organizations.printSchema()
organizations.show()
root |-- address: struct (nullable = true) | |-- city: string (nullable = true) | |-- number: long (nullable = true) | |-- street: string (nullable = true) |-- attractions: array (nullable = true) | |-- element: string (containsNull = true) |-- color: string (nullable = true) |-- organization: string (nullable = true) +--------------------+--------------------+-----+------------+ | address| attractions|color|organization| +--------------------+--------------------+-----+------------+ |[Enschede,5,drien...|[torentje, hadoop...|black| UT| |[Heerlen,11,CBS-weg]|[mijnmuseum, spar...| blue| CBS| |[Nijmegen,4,Comen...| [doornroosje]| red| RU| | [Delft,5,Postbus]| [nice cluster]| blue| TUD| +--------------------+--------------------+-----+------------+
Show the name and organization of people whose organization's color is blue by joining the two tables.
Question: Can we speed up the computation by changing the order of the statements?
Bonus: Count for each organization the number of people, outputting ("organization", "count"). Note that the Armbrust et al. paper contains an error in Section 3.3: the operation
.agg(count("name"))
should be.count()
.
{
// START SOLUTION
people
.join(organizations, people("organization") === organizations("organization"))
.where(organizations("color") === "blue")
// .groupBy(organizations("organization")).count()
.select(people("name"), organizations("organization"))
.show();
// END SOLUTION
}
+---------+------------+ | name|organization| +---------+------------+ | Piet| CBS| | Barteld| CBS| |Jacobiene| CBS| | Marco| CBS| | Claudia| TUD| +---------+------------+
If you are familiar with SQL, then the following approach would also find the people with blue organizations. Once we registered the DataFrames people
and organizations
, we can use sqlContext.sql()
to execute any SQL query.
people.registerTempTable("people");
organizations.registerTempTable("organizations");
sqlContext.sql(
"SELECT P.name, O.organization " +
"FROM people P, organizations O " +
"WHERE P.organization = O.organization " +
"AND O.color = 'blue'"
).show()
+---------+------------+ | name|organization| +---------+------------+ | Piet| CBS| | Barteld| CBS| |Jacobiene| CBS| | Marco| CBS| | Claudia| TUD| +---------+------------+
In SQL, the complex types address
and attractions
are available as follows: For instance, Organizations.address.city = 'Heerlen'
for organizations in Heerlen.
Count the number of female employees in Enschede. Your answer should be: 2.
{
// START SOLUTION
sqlContext.sql(
"SELECT count(name) AS enschedeWomen " +
"FROM people P, organizations O " +
"WHERE P.organization = O.organization "+
"AND P.gender = 'female' " +
"AND O.address.city = 'Enschede'"
).show()
// END SOLUTION
}
+-------------+ |enschedeWomen| +-------------+ | 2| +-------------+
New to SQL? Then we recommend additional SQL exercises, for instance Learn SQL from Codecademy.
Below you find a samples of the data that is available on the DataCamp Hadoop/Spark cluster. Use the sample data to develop and test your Spark scripts before executing them on the cluster.
We use the Twitter data described by (Tjong-Kim-Sang and Van den Bosch 2013) which is available on the Twente Hadoop cluster under: /data/twitterNL
.
val tweets = sqlContext.read.json("tweets.json.gz")
// uncomment to print the (crazy) schema:
tweets.printSchema()
tweets.select("text").show(5)
root |-- contributors: string (nullable = true) |-- coordinates: struct (nullable = true) | |-- coordinates: array (nullable = true) | | |-- element: double (containsNull = true) | |-- type: string (nullable = true) |-- created_at: string (nullable = true) |-- entities: struct (nullable = true) | |-- hashtags: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- indices: array (nullable = true) | | | | |-- element: long (containsNull = true) | | | |-- text: string (nullable = true) | |-- media: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- display_url: string (nullable = true) | | | |-- expanded_url: string (nullable = true) | | | |-- id: long (nullable = true) | | | |-- id_str: string (nullable = true) | | | |-- indices: array (nullable = true) | | | | |-- element: long (containsNull = true) | | | |-- media_url: string (nullable = true) | | | |-- media_url_https: string (nullable = true) | | | |-- sizes: struct (nullable = true) | | | | |-- large: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | | |-- medium: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | | |-- small: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | | |-- thumb: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | |-- source_status_id: long (nullable = true) | | | |-- source_status_id_str: string (nullable = true) | | | |-- source_user_id: long (nullable = true) | | | |-- source_user_id_str: string (nullable = true) | | | |-- type: string (nullable = true) | | | |-- url: string (nullable = true) | |-- symbols: array (nullable = true) | | |-- element: string (containsNull = true) | |-- urls: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- display_url: string (nullable = true) | | | |-- expanded_url: string (nullable = true) | | | |-- indices: array (nullable = true) | | | | |-- element: long (containsNull = true) | | | |-- url: string (nullable = true) | |-- user_mentions: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- id: long (nullable = true) | | | |-- id_str: string (nullable = true) | | | |-- indices: array (nullable = true) | | | | |-- element: long (containsNull = true) | | | |-- name: string (nullable = true) | | | |-- screen_name: string (nullable = true) |-- extended_entities: struct (nullable = true) | |-- media: array (nullable = true) | | |-- element: struct (containsNull = true) | | | |-- display_url: string (nullable = true) | | | |-- expanded_url: string (nullable = true) | | | |-- id: long (nullable = true) | | | |-- id_str: string (nullable = true) | | | |-- indices: array (nullable = true) | | | | |-- element: long (containsNull = true) | | | |-- media_url: string (nullable = true) | | | |-- media_url_https: string (nullable = true) | | | |-- sizes: struct (nullable = true) | | | | |-- large: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | | |-- medium: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | | |-- small: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | | |-- thumb: struct (nullable = true) | | | | | |-- h: long (nullable = true) | | | | | |-- resize: string (nullable = true) | | | | | |-- w: long (nullable = true) | | | |-- source_status_id: long (nullable = true) | | | |-- source_status_id_str: string (nullable = true) | | | |-- source_user_id: long (nullable = true) | | | |-- source_user_id_str: string (nullable = true) | | | |-- type: string (nullable = true) | | | |-- url: string (nullable = true) | | | |-- video_info: struct (nullable = true) | | | | |-- aspect_ratio: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- duration_millis: long (nullable = true) | | | | |-- variants: array (nullable = true) | | | | | |-- element: struct (containsNull = true) | | | | | | |-- bitrate: long (nullable = true) | | | | | | |-- content_type: string (nullable = true) | | | | | | |-- url: string (nullable = true) |-- favorite_count: long (nullable = true) |-- favorited: boolean (nullable = true) |-- filter_level: string (nullable = true) |-- geo: struct (nullable = true) | |-- coordinates: array (nullable = true) | | |-- element: double (containsNull = true) | |-- type: string (nullable = true) |-- id: long (nullable = true) |-- id_str: string (nullable = true) |-- in_reply_to_screen_name: string (nullable = true) |-- in_reply_to_status_id: long (nullable = true) |-- in_reply_to_status_id_str: string (nullable = true) |-- in_reply_to_user_id: long (nullable = true) |-- in_reply_to_user_id_str: string (nullable = true) |-- is_quote_status: boolean (nullable = true) |-- lang: string (nullable = true) |-- place: struct (nullable = true) | |-- bounding_box: struct (nullable = true) | | |-- coordinates: array (nullable = true) | | | |-- element: array (containsNull = true) | | | | |-- element: array (containsNull = true) | | | | | |-- element: double (containsNull = true) | | |-- type: string (nullable = true) | |-- country: string (nullable = true) | |-- country_code: string (nullable = true) | |-- full_name: string (nullable = true) | |-- id: string (nullable = true) | |-- name: string (nullable = true) | |-- place_type: string (nullable = true) | |-- url: string (nullable = true) |-- possibly_sensitive: boolean (nullable = true) |-- quoted_status: struct (nullable = true) | |-- contributors: string (nullable = true) | |-- coordinates: string (nullable = true) | |-- created_at: string (nullable = true) | |-- entities: struct (nullable = true) | | |-- hashtags: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- text: string (nullable = true) | | |-- media: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- display_url: string (nullable = true) | | | | |-- expanded_url: string (nullable = true) | | | | |-- id: long (nullable = true) | | | | |-- id_str: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- media_url: string (nullable = true) | | | | |-- media_url_https: string (nullable = true) | | | | |-- sizes: struct (nullable = true) | | | | | |-- large: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- medium: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- small: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- thumb: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | |-- type: string (nullable = true) | | | | |-- url: string (nullable = true) | | |-- symbols: array (nullable = true) | | | |-- element: string (containsNull = true) | | |-- urls: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- display_url: string (nullable = true) | | | | |-- expanded_url: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- url: string (nullable = true) | | |-- user_mentions: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- id: long (nullable = true) | | | | |-- id_str: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- name: string (nullable = true) | | | | |-- screen_name: string (nullable = true) | |-- extended_entities: struct (nullable = true) | | |-- media: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- display_url: string (nullable = true) | | | | |-- expanded_url: string (nullable = true) | | | | |-- id: long (nullable = true) | | | | |-- id_str: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- media_url: string (nullable = true) | | | | |-- media_url_https: string (nullable = true) | | | | |-- sizes: struct (nullable = true) | | | | | |-- large: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- medium: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- small: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- thumb: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | |-- type: string (nullable = true) | | | | |-- url: string (nullable = true) | |-- favorite_count: long (nullable = true) | |-- favorited: boolean (nullable = true) | |-- filter_level: string (nullable = true) | |-- geo: string (nullable = true) | |-- id: long (nullable = true) | |-- id_str: string (nullable = true) | |-- in_reply_to_screen_name: string (nullable = true) | |-- in_reply_to_status_id: string (nullable = true) | |-- in_reply_to_status_id_str: string (nullable = true) | |-- in_reply_to_user_id: long (nullable = true) | |-- in_reply_to_user_id_str: string (nullable = true) | |-- is_quote_status: boolean (nullable = true) | |-- lang: string (nullable = true) | |-- place: string (nullable = true) | |-- possibly_sensitive: boolean (nullable = true) | |-- quoted_status_id: long (nullable = true) | |-- quoted_status_id_str: string (nullable = true) | |-- retweet_count: long (nullable = true) | |-- retweeted: boolean (nullable = true) | |-- source: string (nullable = true) | |-- text: string (nullable = true) | |-- truncated: boolean (nullable = true) | |-- user: struct (nullable = true) | | |-- contributors_enabled: boolean (nullable = true) | | |-- created_at: string (nullable = true) | | |-- default_profile: boolean (nullable = true) | | |-- default_profile_image: boolean (nullable = true) | | |-- description: string (nullable = true) | | |-- favourites_count: long (nullable = true) | | |-- follow_request_sent: string (nullable = true) | | |-- followers_count: long (nullable = true) | | |-- following: string (nullable = true) | | |-- friends_count: long (nullable = true) | | |-- geo_enabled: boolean (nullable = true) | | |-- id: long (nullable = true) | | |-- id_str: string (nullable = true) | | |-- is_translator: boolean (nullable = true) | | |-- lang: string (nullable = true) | | |-- listed_count: long (nullable = true) | | |-- location: string (nullable = true) | | |-- name: string (nullable = true) | | |-- notifications: string (nullable = true) | | |-- profile_background_color: string (nullable = true) | | |-- profile_background_image_url: string (nullable = true) | | |-- profile_background_image_url_https: string (nullable = true) | | |-- profile_background_tile: boolean (nullable = true) | | |-- profile_banner_url: string (nullable = true) | | |-- profile_image_url: string (nullable = true) | | |-- profile_image_url_https: string (nullable = true) | | |-- profile_link_color: string (nullable = true) | | |-- profile_sidebar_border_color: string (nullable = true) | | |-- profile_sidebar_fill_color: string (nullable = true) | | |-- profile_text_color: string (nullable = true) | | |-- profile_use_background_image: boolean (nullable = true) | | |-- protected: boolean (nullable = true) | | |-- screen_name: string (nullable = true) | | |-- statuses_count: long (nullable = true) | | |-- time_zone: string (nullable = true) | | |-- url: string (nullable = true) | | |-- utc_offset: long (nullable = true) | | |-- verified: boolean (nullable = true) |-- quoted_status_id: long (nullable = true) |-- quoted_status_id_str: string (nullable = true) |-- retweet_count: long (nullable = true) |-- retweeted: boolean (nullable = true) |-- retweeted_status: struct (nullable = true) | |-- contributors: string (nullable = true) | |-- coordinates: string (nullable = true) | |-- created_at: string (nullable = true) | |-- entities: struct (nullable = true) | | |-- hashtags: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- text: string (nullable = true) | | |-- media: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- display_url: string (nullable = true) | | | | |-- expanded_url: string (nullable = true) | | | | |-- id: long (nullable = true) | | | | |-- id_str: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- media_url: string (nullable = true) | | | | |-- media_url_https: string (nullable = true) | | | | |-- sizes: struct (nullable = true) | | | | | |-- large: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- medium: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- small: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- thumb: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | |-- source_status_id: long (nullable = true) | | | | |-- source_status_id_str: string (nullable = true) | | | | |-- source_user_id: long (nullable = true) | | | | |-- source_user_id_str: string (nullable = true) | | | | |-- type: string (nullable = true) | | | | |-- url: string (nullable = true) | | |-- symbols: array (nullable = true) | | | |-- element: string (containsNull = true) | | |-- urls: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- display_url: string (nullable = true) | | | | |-- expanded_url: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- url: string (nullable = true) | | |-- user_mentions: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- id: long (nullable = true) | | | | |-- id_str: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- name: string (nullable = true) | | | | |-- screen_name: string (nullable = true) | |-- extended_entities: struct (nullable = true) | | |-- media: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- display_url: string (nullable = true) | | | | |-- expanded_url: string (nullable = true) | | | | |-- id: long (nullable = true) | | | | |-- id_str: string (nullable = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- media_url: string (nullable = true) | | | | |-- media_url_https: string (nullable = true) | | | | |-- sizes: struct (nullable = true) | | | | | |-- large: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- medium: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- small: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | | |-- thumb: struct (nullable = true) | | | | | | |-- h: long (nullable = true) | | | | | | |-- resize: string (nullable = true) | | | | | | |-- w: long (nullable = true) | | | | |-- source_status_id: long (nullable = true) | | | | |-- source_status_id_str: string (nullable = true) | | | | |-- source_user_id: long (nullable = true) | | | | |-- source_user_id_str: string (nullable = true) | | | | |-- type: string (nullable = true) | | | | |-- url: string (nullable = true) | | | | |-- video_info: struct (nullable = true) | | | | | |-- aspect_ratio: array (nullable = true) | | | | | | |-- element: long (containsNull = true) | | | | | |-- duration_millis: long (nullable = true) | | | | | |-- variants: array (nullable = true) | | | | | | |-- element: struct (containsNull = true) | | | | | | | |-- bitrate: long (nullable = true) | | | | | | | |-- content_type: string (nullable = true) | | | | | | | |-- url: string (nullable = true) | |-- favorite_count: long (nullable = true) | |-- favorited: boolean (nullable = true) | |-- filter_level: string (nullable = true) | |-- geo: string (nullable = true) | |-- id: long (nullable = true) | |-- id_str: string (nullable = true) | |-- in_reply_to_screen_name: string (nullable = true) | |-- in_reply_to_status_id: long (nullable = true) | |-- in_reply_to_status_id_str: string (nullable = true) | |-- in_reply_to_user_id: long (nullable = true) | |-- in_reply_to_user_id_str: string (nullable = true) | |-- is_quote_status: boolean (nullable = true) | |-- lang: string (nullable = true) | |-- place: struct (nullable = true) | | |-- bounding_box: struct (nullable = true) | | | |-- coordinates: array (nullable = true) | | | | |-- element: array (containsNull = true) | | | | | |-- element: array (containsNull = true) | | | | | | |-- element: double (containsNull = true) | | | |-- type: string (nullable = true) | | |-- country: string (nullable = true) | | |-- country_code: string (nullable = true) | | |-- full_name: string (nullable = true) | | |-- id: string (nullable = true) | | |-- name: string (nullable = true) | | |-- place_type: string (nullable = true) | | |-- url: string (nullable = true) | |-- possibly_sensitive: boolean (nullable = true) | |-- quoted_status: struct (nullable = true) | | |-- contributors: string (nullable = true) | | |-- coordinates: string (nullable = true) | | |-- created_at: string (nullable = true) | | |-- entities: struct (nullable = true) | | | |-- hashtags: array (nullable = true) | | | | |-- element: string (containsNull = true) | | | |-- media: array (nullable = true) | | | | |-- element: struct (containsNull = true) | | | | | |-- display_url: string (nullable = true) | | | | | |-- expanded_url: string (nullable = true) | | | | | |-- id: long (nullable = true) | | | | | |-- id_str: string (nullable = true) | | | | | |-- indices: array (nullable = true) | | | | | | |-- element: long (containsNull = true) | | | | | |-- media_url: string (nullable = true) | | | | | |-- media_url_https: string (nullable = true) | | | | | |-- sizes: struct (nullable = true) | | | | | | |-- large: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | | |-- medium: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | | |-- small: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | | |-- thumb: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | |-- source_status_id: long (nullable = true) | | | | | |-- source_status_id_str: string (nullable = true) | | | | | |-- source_user_id: long (nullable = true) | | | | | |-- source_user_id_str: string (nullable = true) | | | | | |-- type: string (nullable = true) | | | | | |-- url: string (nullable = true) | | | |-- symbols: array (nullable = true) | | | | |-- element: string (containsNull = true) | | | |-- urls: array (nullable = true) | | | | |-- element: struct (containsNull = true) | | | | | |-- display_url: string (nullable = true) | | | | | |-- expanded_url: string (nullable = true) | | | | | |-- indices: array (nullable = true) | | | | | | |-- element: long (containsNull = true) | | | | | |-- url: string (nullable = true) | | | |-- user_mentions: array (nullable = true) | | | | |-- element: struct (containsNull = true) | | | | | |-- id: long (nullable = true) | | | | | |-- id_str: string (nullable = true) | | | | | |-- indices: array (nullable = true) | | | | | | |-- element: long (containsNull = true) | | | | | |-- name: string (nullable = true) | | | | | |-- screen_name: string (nullable = true) | | |-- extended_entities: struct (nullable = true) | | | |-- media: array (nullable = true) | | | | |-- element: struct (containsNull = true) | | | | | |-- display_url: string (nullable = true) | | | | | |-- expanded_url: string (nullable = true) | | | | | |-- id: long (nullable = true) | | | | | |-- id_str: string (nullable = true) | | | | | |-- indices: array (nullable = true) | | | | | | |-- element: long (containsNull = true) | | | | | |-- media_url: string (nullable = true) | | | | | |-- media_url_https: string (nullable = true) | | | | | |-- sizes: struct (nullable = true) | | | | | | |-- large: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | | |-- medium: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | | |-- small: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | | |-- thumb: struct (nullable = true) | | | | | | | |-- h: long (nullable = true) | | | | | | | |-- resize: string (nullable = true) | | | | | | | |-- w: long (nullable = true) | | | | | |-- source_status_id: long (nullable = true) | | | | | |-- source_status_id_str: string (nullable = true) | | | | | |-- source_user_id: long (nullable = true) | | | | | |-- source_user_id_str: string (nullable = true) | | | | | |-- type: string (nullable = true) | | | | | |-- url: string (nullable = true) | | |-- favorite_count: long (nullable = true) | | |-- favorited: boolean (nullable = true) | | |-- filter_level: string (nullable = true) | | |-- geo: string (nullable = true) | | |-- id: long (nullable = true) | | |-- id_str: string (nullable = true) | | |-- in_reply_to_screen_name: string (nullable = true) | | |-- in_reply_to_status_id: string (nullable = true) | | |-- in_reply_to_status_id_str: string (nullable = true) | | |-- in_reply_to_user_id: long (nullable = true) | | |-- in_reply_to_user_id_str: string (nullable = true) | | |-- is_quote_status: boolean (nullable = true) | | |-- lang: string (nullable = true) | | |-- place: string (nullable = true) | | |-- possibly_sensitive: boolean (nullable = true) | | |-- quoted_status_id: long (nullable = true) | | |-- quoted_status_id_str: string (nullable = true) | | |-- retweet_count: long (nullable = true) | | |-- retweeted: boolean (nullable = true) | | |-- source: string (nullable = true) | | |-- text: string (nullable = true) | | |-- truncated: boolean (nullable = true) | | |-- user: struct (nullable = true) | | | |-- contributors_enabled: boolean (nullable = true) | | | |-- created_at: string (nullable = true) | | | |-- default_profile: boolean (nullable = true) | | | |-- default_profile_image: boolean (nullable = true) | | | |-- description: string (nullable = true) | | | |-- favourites_count: long (nullable = true) | | | |-- follow_request_sent: string (nullable = true) | | | |-- followers_count: long (nullable = true) | | | |-- following: string (nullable = true) | | | |-- friends_count: long (nullable = true) | | | |-- geo_enabled: boolean (nullable = true) | | | |-- id: long (nullable = true) | | | |-- id_str: string (nullable = true) | | | |-- is_translator: boolean (nullable = true) | | | |-- lang: string (nullable = true) | | | |-- listed_count: long (nullable = true) | | | |-- location: string (nullable = true) | | | |-- name: string (nullable = true) | | | |-- notifications: string (nullable = true) | | | |-- profile_background_color: string (nullable = true) | | | |-- profile_background_image_url: string (nullable = true) | | | |-- profile_background_image_url_https: string (nullable = true) | | | |-- profile_background_tile: boolean (nullable = true) | | | |-- profile_banner_url: string (nullable = true) | | | |-- profile_image_url: string (nullable = true) | | | |-- profile_image_url_https: string (nullable = true) | | | |-- profile_link_color: string (nullable = true) | | | |-- profile_sidebar_border_color: string (nullable = true) | | | |-- profile_sidebar_fill_color: string (nullable = true) | | | |-- profile_text_color: string (nullable = true) | | | |-- profile_use_background_image: boolean (nullable = true) | | | |-- protected: boolean (nullable = true) | | | |-- screen_name: string (nullable = true) | | | |-- statuses_count: long (nullable = true) | | | |-- time_zone: string (nullable = true) | | | |-- url: string (nullable = true) | | | |-- utc_offset: long (nullable = true) | | | |-- verified: boolean (nullable = true) | |-- quoted_status_id: long (nullable = true) | |-- quoted_status_id_str: string (nullable = true) | |-- retweet_count: long (nullable = true) | |-- retweeted: boolean (nullable = true) | |-- scopes: struct (nullable = true) | | |-- followers: boolean (nullable = true) | |-- source: string (nullable = true) | |-- text: string (nullable = true) | |-- truncated: boolean (nullable = true) | |-- user: struct (nullable = true) | | |-- contributors_enabled: boolean (nullable = true) | | |-- created_at: string (nullable = true) | | |-- default_profile: boolean (nullable = true) | | |-- default_profile_image: boolean (nullable = true) | | |-- description: string (nullable = true) | | |-- favourites_count: long (nullable = true) | | |-- follow_request_sent: string (nullable = true) | | |-- followers_count: long (nullable = true) | | |-- following: string (nullable = true) | | |-- friends_count: long (nullable = true) | | |-- geo_enabled: boolean (nullable = true) | | |-- id: long (nullable = true) | | |-- id_str: string (nullable = true) | | |-- is_translator: boolean (nullable = true) | | |-- lang: string (nullable = true) | | |-- listed_count: long (nullable = true) | | |-- location: string (nullable = true) | | |-- name: string (nullable = true) | | |-- notifications: string (nullable = true) | | |-- profile_background_color: string (nullable = true) | | |-- profile_background_image_url: string (nullable = true) | | |-- profile_background_image_url_https: string (nullable = true) | | |-- profile_background_tile: boolean (nullable = true) | | |-- profile_banner_url: string (nullable = true) | | |-- profile_image_url: string (nullable = true) | | |-- profile_image_url_https: string (nullable = true) | | |-- profile_link_color: string (nullable = true) | | |-- profile_sidebar_border_color: string (nullable = true) | | |-- profile_sidebar_fill_color: string (nullable = true) | | |-- profile_text_color: string (nullable = true) | | |-- profile_use_background_image: boolean (nullable = true) | | |-- protected: boolean (nullable = true) | | |-- screen_name: string (nullable = true) | | |-- statuses_count: long (nullable = true) | | |-- time_zone: string (nullable = true) | | |-- url: string (nullable = true) | | |-- utc_offset: long (nullable = true) | | |-- verified: boolean (nullable = true) |-- source: string (nullable = true) |-- text: string (nullable = true) |-- timestamp_ms: string (nullable = true) |-- truncated: boolean (nullable = true) |-- twinl_lang: string (nullable = true) |-- twinl_source: array (nullable = true) | |-- element: string (containsNull = true) |-- user: struct (nullable = true) | |-- contributors_enabled: boolean (nullable = true) | |-- created_at: string (nullable = true) | |-- default_profile: boolean (nullable = true) | |-- default_profile_image: boolean (nullable = true) | |-- description: string (nullable = true) | |-- favourites_count: long (nullable = true) | |-- follow_request_sent: string (nullable = true) | |-- followers_count: long (nullable = true) | |-- following: string (nullable = true) | |-- friends_count: long (nullable = true) | |-- geo_enabled: boolean (nullable = true) | |-- id: long (nullable = true) | |-- id_str: string (nullable = true) | |-- is_translator: boolean (nullable = true) | |-- lang: string (nullable = true) | |-- listed_count: long (nullable = true) | |-- location: string (nullable = true) | |-- name: string (nullable = true) | |-- notifications: string (nullable = true) | |-- profile_background_color: string (nullable = true) | |-- profile_background_image_url: string (nullable = true) | |-- profile_background_image_url_https: string (nullable = true) | |-- profile_background_tile: boolean (nullable = true) | |-- profile_banner_url: string (nullable = true) | |-- profile_image_url: string (nullable = true) | |-- profile_image_url_https: string (nullable = true) | |-- profile_link_color: string (nullable = true) | |-- profile_sidebar_border_color: string (nullable = true) | |-- profile_sidebar_fill_color: string (nullable = true) | |-- profile_text_color: string (nullable = true) | |-- profile_use_background_image: boolean (nullable = true) | |-- protected: boolean (nullable = true) | |-- screen_name: string (nullable = true) | |-- statuses_count: long (nullable = true) | |-- time_zone: string (nullable = true) | |-- url: string (nullable = true) | |-- utc_offset: long (nullable = true) | |-- verified: boolean (nullable = true) +--------------------+ | text| +--------------------+ |Pinnen over je wa...| |@AllesofNooit @He...| |#dooba #sexdate J...| |En weer een uur v...| |Frederik Meulewae...| +--------------------+ only showing top 5 rows
The Automatic identification system (AIS) is an automatic tracking system used on ships and by vessel traffic services (VTS) for identifying and locating vessels by electronically exchanging data with other nearby ships, AIS base stations, and satellites. The data is available on the Twente Hadoop cluster under: /data/aisUT
.
val ais = sqlContext.read.json("ais.json.gz")
//ais.show(5)
ais.printSchema()
root |-- callsign: string (nullable = true) |-- cog: long (nullable = true) |-- destination: string (nullable = true) |-- dimbow: long (nullable = true) |-- dimport: long (nullable = true) |-- dimstarboard: long (nullable = true) |-- dimstern: long (nullable = true) |-- draught: long (nullable = true) |-- eta_day: long (nullable = true) |-- eta_hour: long (nullable = true) |-- eta_minute: long (nullable = true) |-- eta_month: long (nullable = true) |-- heading: long (nullable = true) |-- imo: long (nullable = true) |-- lat: double (nullable = true) |-- lat2: string (nullable = true) |-- lon: double (nullable = true) |-- lon2: string (nullable = true) |-- mmsi: long (nullable = true) |-- nav_status: long (nullable = true) |-- rot_angle: double (nullable = true) |-- rot_direction: string (nullable = true) |-- shipname: string (nullable = true) |-- shiptype: long (nullable = true) |-- sog: long (nullable = true) |-- timestamp: long (nullable = true) |-- ts: long (nullable = true) |-- type: long (nullable = true)
The Rijksdienst Wegverkeer is the Dutch ministery the takes care of the public roads. The data contains measurements from sensors in the roads. This data is available as comma-separated value (CSV) files and is available on the Twente Hadoop cluster under: /data/cbs/loopraw
.
val rdw = sc.textFile("rdw.csv.gz")
// rdw.map(line => line.split(',')(2)).collect().mkString("\n")
rdw.collect().mkString("\n")
RWS01_MONIBAS_0091hrl0763ra,1,11B,2014-12-27 05:51:00,2014-12-27 05:52:00,,,1,,,,,,,1,,0.000000,,,arithmeticAverageOfSamplesInATimePeriod,,0091hrl0763ra,,2,,southBound,100,60,lane2,greaterThan 12.20,52.6225,4.72241,8,5.5,B,negative,10609,150,P3.1,Tunnel,N9,Ring Alkmaar,Regulierstunnel,n.n.,generatedValue,,,mainCarriageway,Niet bepaald,Niet bepaald,"Nog niet gegenereerd" RWS01_MONIBAS_0091hrl0763ra,1,12B,2014-12-27 05:51:00,2014-12-27 05:52:00,,,1,,,,,,,1,,0.000000,,,arithmeticAverageOfSamplesInATimePeriod,,0091hrl0763ra,,2,,southBound,100,60,lane2,anyVehicle,52.6225,4.72241,8,5.5,B,negative,10609,150,P3.1,Tunnel,N9,Ring Alkmaar,Regulierstunnel,n.n.,generatedValue,,,mainCarriageway,Niet bepaald,Niet bepaald,"Nog niet gegenereerd" RWS01_MONIBAS_0091hrl0785ra,1,1B,2014-12-2...
The Dutch newspaper Volkskrant has a large archive of its articles online. The archive of the years 2000 - 2016 is available on the Twente Hadoop cluster under: /data/volkskrant
.
val volkskrant = sqlContext.read.json("volkskrant.json.gz")
volkskrant.show(5)
volkskrant.printSchema()
+--------+---+--------------------+-----+--------------------+--------------------+---------+--------------------+----+ |category|day| href|month| text| time|timeofday| title|year| +--------+---+--------------------+-----+--------------------+--------------------+---------+--------------------+----+ | Archief| 5|http://www.volksk...| 7|Een provocatieve ...| 5 juli 2011, 00:00| 00:00|Mensen uitlachen ...|2011| |Politiek| 18|http://www.volksk...| 3|Een euforisch gej...|18 maart 2015, 22:39| 22:39|'De boodschap van...|2015| |Economie| 21|http://www.volksk...| 6|De klant is konin...| 21 juni 2008, 02:47| 02:47|Klant geen koning...|2008| |Magazine| 4|http://www.volksk...| 1|Het gaat van kwaa...|4 januari 2013, 1...| 12:04|Indiase politicus...|2013| | Archief| 29|http://www.volksk...| 3|Chris Klomp vindt...|29 maart 2013, 00:00| 00:00|Uitgedaagd op Vol...|2013| +--------+---+--------------------+-----+--------------------+--------------------+---------+--------------------+----+ only showing top 5 rows root |-- category: string (nullable = true) |-- day: long (nullable = true) |-- href: string (nullable = true) |-- month: long (nullable = true) |-- text: string (nullable = true) |-- time: string (nullable = true) |-- timeofday: string (nullable = true) |-- title: string (nullable = true) |-- year: long (nullable = true)