This notebook is part of Hadoop tutorials delivered by IT-DB group

SPARK DATAFRAME Hands-On Lab using table from relational database

Hands-On 1 - Load a relation table into DataFrame

First, lets import the bits we need

In [ ]:
from pyspark import SparkContext, SQLContext, SparkConf
import os

download the oracle jdbc driver

In [ ]:
!curl -o /tmp/ojdbc7.jar https://cernbox.cern.ch/index.php/s/e7NJ5jD3qZB7RNO/download

This is only required in the notebook mode, in spark-shell, spark-submit, pyspark you can direcly pass --jars (e.g pyspark --jars /tmp/ojdbc7.jar)

In [ ]:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /tmp/ojdbc7.jar pyspark-shell'

Create the SparkContext and SQLContext explicitly

In [ ]:
sc = SparkContext()
In [ ]:
sqlContext = SQLContext(sc)

Create the jdbc DataFrame as below, please note how to pass the driver, connection url, username and password

before executing the next cell, please replace PWD, SERVER and PORT with password, database host and port respectively

In [ ]:
db_df = sqlContext.read.format('jdbc') \
    .options(driver='oracle.jdbc.driver.OracleDriver',url='jdbc:oracle:thin:meetup/[email protected]$SERVER:$PORT/PIMT_RAC51.cern.ch',dbtable='meetup_data') \
    .load()

Read the data from the dataframe

In [ ]:
db_df.select("EVENT_ID","EVENT_NAME").show(5)