geotwt_sdf = spark.read.parquet("BMC_UserGeoTwt/BMC_GeoTwt_Snappy*")
print(geotwt_sdf.count())
geotwt_sdf.show(2)
38820846
+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+
|               ctime|      uid|             uname|       lat|        lng|    profile_location|                term|             hashtag|            fulltext|place_full_name|      country|place_type|        bounding_box|c_code|attribute|           geoid| place_name|__index_level_0__|
+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+
|[54 75 65 20 53 6...|126371773|CompassUSAJobBoard|37.5407246|-77.4360481|[43 68 61 72 6C 6...|[6A 6F 69 6E 20 6...|[6A 6F 62 20 46 6...|Join the Crothall...|   Richmond, VA|United States|      city|[7B 75 27 74 79 7...|    US|  [7B 7D]|00f751614d8ce37b|   Richmond|                0|
|[54 75 65 20 53 6...|126371773|CompassUSAJobBoard|28.3252878|-81.5331286|[43 68 61 72 6C 6...|[72 65 63 6F 6D 6...|[6A 6F 62 20 43 6...|Can you recommend...|Celebration, FL|United States|      city|[7B 75 27 74 79 7...|    US|  [7B 7D]|01bbe9ba4078361c|Celebration|                1|
+--------------------+---------+------------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+---------------+-------------+----------+--------------------+------+---------+----------------+-----------+-----------------+
only showing top 2 rows
We notice that several columns have been encoded as binary type. Casting a byte array to a string is easy with the astype function. It gets trickier for a column with structured information, e.g., the bounding_box column.
The bounding_box column contains a JSON-like string that is supposed to be read in as a struct type column. Its leading bytes, 7B 75 27 74 79, decode to {u'ty, so the keys and values carry Python 2 u prefixes that have to be stripped before from_json can parse the string against a schema.
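To make the encoding concrete, here is a minimal plain-Python sketch (Python 3, no Spark needed) of what the bounding_box bytes decode to and two ways such a string could be parsed. The `sample` value is hypothetical: it only mimics the shape suggested by the leading bytes in the output above, and its coordinates are made up for illustration.

```python
import ast
import json
import re

# The leading bytes of bounding_box shown in the output: [7B 75 27 74 79 7...
prefix = bytes.fromhex("7B 75 27 74 79").decode("ascii")  # "{u'ty"

# Hypothetical full value shaped like a Python 2 dict repr (made-up coordinates).
sample = ("{u'type': u'Polygon', u'coordinates': "
          "[[[-77.6, 37.4], [-77.4, 37.4], [-77.4, 37.6], [-77.6, 37.6]]]}")

# Option 1: parse the repr directly as a Python literal.
bbox = ast.literal_eval(sample)

# Option 2: drop the u prefixes and switch to double quotes, then parse as JSON.
as_json = re.sub(r"\bu'", "'", sample).replace("'", '"')
bbox_json = json.loads(as_json)
```

Both options should yield the same dictionary for this sample. The second mirrors the idea of cleaning the string before handing it to a JSON parser, which is the approach the Spark code takes.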
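import pyspark.sql.functions as func
from pyspark.sql.types import (ArrayType, FloatType, StringType, StructField,
                               StructType)

# Target schema for the parsed bounding_box: a type plus nested coordinates.
schema = StructType([
    StructField("type", StringType(), True),
    StructField("coordinates", ArrayType(ArrayType(ArrayType(FloatType()))), True),
])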
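# Cast the binary columns to strings.
temp_sdf = geotwt_sdf.withColumn('ctime_str', geotwt_sdf.ctime.astype('string'))
temp_sdf = temp_sdf.withColumn('term_str', temp_sdf.term.astype('string'))
temp_sdf = temp_sdf.withColumn('bbox_str', temp_sdf.bounding_box.astype('string'))
# Strip only the u prefixes (u'type' -> 'type'); removing every "u" would also
# mangle any key or value that happens to contain that letter.
temp_sdf = temp_sdf.withColumn('coords', func.regexp_replace('bbox_str', "u'", "'"))
# Parse the cleaned string into a struct column using the schema.
temp_sdf = temp_sdf.withColumn('bbox', func.from_json('coords', schema))
temp_sdf = temp_sdf.withColumn('htag_str', temp_sdf.hashtag.astype('string'))
temp_sdf = temp_sdf.withColumn('plocation_str', temp_sdf.profile_location.astype('string'))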
# Lower-left corner: first vertex of the outer ring, stored as [lng, lat].
temp_sdf = temp_sdf.withColumn(
    'll_lat',
    temp_sdf.bbox.coordinates.getItem(0).getItem(0).getItem(1)).withColumn(
        'll_lng',
        temp_sdf.bbox.coordinates.getItem(0).getItem(0).getItem(0))
# Upper-right corner: third vertex of the outer ring.
temp_sdf = temp_sdf.withColumn(
    'ur_lat',
    temp_sdf.bbox.coordinates.getItem(0).getItem(2).getItem(1)).withColumn(
        'ur_lng',
        temp_sdf.bbox.coordinates.getItem(0).getItem(2).getItem(0))
# Drop the raw binary columns and the intermediate helper columns.
temp_sdf = temp_sdf.drop('ctime', 'profile_location', 'term', 'hashtag',
                         'bounding_box', 'attribute', 'bbox', 'coords',
                         'bbox_str')
temp_sdf.show(1)
+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+
|      uid|             uname|       lat|        lng|            fulltext|place_full_name|      country|place_type|c_code|           geoid|place_name|__index_level_0__|           ctime_str|            term_str|            htag_str|plocation_str|   ll_lat|   ll_lng|  ur_lat|  ur_lng|
+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+
|126371773|CompassUSAJobBoard|37.5407246|-77.4360481|Join the Crothall...|   Richmond, VA|United States|      city|    US|00f751614d8ce37b|  Richmond|                0|Tue Sep 25 04:52:...|join crothall tea...|job FacilitiesMgm...|Charlotte, NC|37.447044|-77.60104|37.61272|-77.3853|
+---------+------------------+----------+-----------+--------------------+---------------+-------------+----------+------+----------------+----------+-----------------+--------------------+--------------------+--------------------+-------------+---------+---------+--------+--------+
only showing top 1 row
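The corner extraction relies on the vertex order of the polygon: index 0 is taken as the lower-left corner and index 2 as the upper-right. As a hedged alternative, the plain-Python sketch below derives the same four values from the minimum and maximum of the ring, so it does not depend on vertex order. The helper name `bbox_corners` and the sample ring are hypothetical, and the approach assumes the box does not cross the antimeridian.

```python
def bbox_corners(ring):
    """Return (ll_lat, ll_lng, ur_lat, ur_lng) for a ring of [lng, lat] pairs."""
    lngs = [pt[0] for pt in ring]
    lats = [pt[1] for pt in ring]
    return min(lats), min(lngs), max(lats), max(lngs)


# Hypothetical ring with made-up coordinates, ordered SW, SE, NE, NW.
ring = [[-77.6, 37.4], [-77.4, 37.4], [-77.4, 37.6], [-77.6, 37.6]]
corners = bbox_corners(ring)

# Index-based extraction, as in the Spark code: vertex 0 and vertex 2.
by_index = (ring[0][1], ring[0][0], ring[2][1], ring[2][0])
```

For a ring in this order the two approaches agree. The min/max version is mainly useful as a sanity check if the vertex order in the data is ever in doubt.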