Let's take LibExpat.jl for a whirl

We'll start with a very simple chunk of XML, and then move to a more realistic example.

In [160]:
using LibExpat
In [161]:
names(LibExpat)
Out[161]:
14-element Array{Symbol,1}:
 :LibExpat           
 :XPStreamHandler    
 :free               
 :xpath              
 :pause              
 :ETree              
 symbol("@xpath_str")
 :ParsedData         
 :stop               
 :resume             
 :parse              
 :XPCallbacks        
 :parsefile          
 :xp_parse           

Use xp_parse(string) to load a chunk of XML into an ETree

In [162]:
sm = """<blah id="42" class="top">hi
          <blue id="1" class="cold">hey</blue>
          <red id="2" class="hot">yo</red>
        </blah>"""        
Out[162]:
"<blah id=\"42\" class=\"top\">hi\n  <blue id=\"1\" class=\"cold\">hey</blue>\n  <red id=\"2\" class=\"hot\">yo</red>\n</blah>"
In [162]:
 et=xp_parse(s);  # s is the HTML string defined under "A more realistic example" below

Use LibExpat.find(et, element_path) to return an array of ETree objects matching an element path string

The LibExpat.jl README describes the format of element_path.

Let's check the structure of a simple ETree

  • name = tag name of top element
  • attr = Dict of top level attributes
  • elements = array of top level payload/content, including junk whitespace.
  • parent = parent ETree. (the root node is self-referential, causing it to be displayed multiple times)
In [163]:
esm = xp_parse(sm)
dump(esm)
ETree 
  name: ASCIIString "blah"
  attr: Dict{String,String} len 2
    class: ASCIIString "top"
    id: ASCIIString "42"
  elements: Array(Union(String,ETree),(8,)) ["hi","\n","  ",<blue class="cold" id="1">hey</blue>,"\n","  ",<red class="hot" id="2">yo</red>,"\n"]
  parent: ETree 
    name: ASCIIString ""
    attr: Dict{String,String} len 0
    elements: Array(Union(String,ETree),(1,)) [<blah class="top" id="42">hi
  <blue class="cold" id="1">hey</blue>
  <red class="hot" id="2">yo</red>
</blah>]
    parent: ETree 
      name: ASCIIString ""
      attr: Dict{String,String} len 0
      elements: Array(Union(String,ETree),(1,)) [<blah class="top" id="42">hi
  <blue class="cold" id="1">hey</blue>
  <red class="hot" id="2">yo</red>
</blah>]
      parent: ETree 
        name: ASCIIString ""
        attr: Dict{String,String} len 0
        elements: Array(Union(String,ETree),(1,)) [<blah class="top" id="42">hi
  <blue class="cold" id="1">hey</blue>
  <red class="hot" id="2">yo</red>
</blah>]
        parent: ETree 
          name: ASCIIString ""
          attr: Dict{String,String} len 0
          elements: Array(Union(String,ETree),(1,)) [<blah class="top" id="42">hi
  <blue class="cold" id="1">hey</blue>
  <red class="hot" id="2">yo</red>
</blah>]
          parent: ETree 
In [164]:
esm.name, esm.attr
Out[164]:
("blah",["class"=>"top","id"=>"42"])
In [165]:
esm.elements
Out[165]:
8-element Array{Union(String,ETree),1}:
 "hi"                                
 "\n"                                
 "  "                                
 <blue class="cold" id="1">hey</blue>
 "\n"                                
 "  "                                
 <red class="hot" id="2">yo</red>    
 "\n"                                
In [166]:
typeof(esm.elements[1]) <: String
Out[166]:
true

Extract payload/contents from an element, ignoring whitespace and sub-elements

In [167]:
for e in esm.elements
    stre = strip(string(e))
    if length(stre)>0
        println(stre, "  ", typeof(e))
        if typeof(e) <: String
            println("Payload: ",stre)
        end
    end
end
  
hi  ASCIIString
Payload: hi
<blue class="cold" id="1">hey</blue>  ETree
<red class="hot" id="2">yo</red>  ETree
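
In current Julia releases the same filter is usually written with `isa` and `AbstractString` rather than `typeof(e) <: String`. The following is a minimal sketch under that assumption. To run without LibExpat installed, it uses plain Symbols as hypothetical stand-ins for the child ETree nodes; the list only mimics the shape of esm.elements.

```julia
# Mixed content shaped like esm.elements: text chunks plus child nodes.
# The Symbols :blue and :red are hypothetical placeholders for ETree children.
elements = Any["hi", "\n", "  ", :blue, "\n", "  ", :red, "\n"]

# Keep only the text chunks that are non-empty after trimming whitespace.
payloads = [strip(e) for e in elements if e isa AbstractString && !isempty(strip(e))]

println(payloads)
```

With a real ETree, the same comprehension could be pointed at esm.elements, since sub-elements fail the `isa AbstractString` check and whitespace-only chunks fail the `isempty` check.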

A more realistic example

Here we are scraping data from a chunk of fairly clean HTML. Expat is an XML parser, so this only works when the markup is well-formed XML.

In [167]:
s="""<div id="flight_container" style="padding: 2px;">
	<table class="table_sides" width="100%" cellpadding="0" cellspacing="0" border="0" align=""><tbody><tr>
			<td bgcolor="FFFFFF">
			
<table width="100%" border="0" cellpadding="4" cellspacing="0" class=""><thead>
<tr><td colspan="15" class="table_header" align="left">Flight Info - NXXXXXX(Rogers Bleeblah #)  </td></tr>
	<tr>
<td width="" class="table_row_header" align="left" valign="middle">Date</td>
<td width="" class="table_row_header" align="left" valign="middle">Origin</td>
<td width="" class="table_row_header" align="left" valign="middle">Dest</td>
<td width="" class="table_row_header" align="left" valign="middle">Depart</td>
<td width="" class="table_row_header" align="left" valign="middle">Arrive</td>
<td width="" class="table_row_header" align="left" valign="middle">Hobbs</td>
<td width="" class="table_row_header" align="left" valign="middle">Flight Time</td>
<td width="" class="table_row_header" align="left" valign="middle">Ground Time</td>
<td width="" class="table_row_header" align="left" valign="middle">Flight Distance</td>
<td width="" class="table_row_header" align="left" valign="middle">Taxi Distance</td>
<td width="" class="table_row_header" align="left" valign="middle">Fuel</td>
<td width="" class="table_row_header" align="left" valign="middle">Fuel/hr</td>
<td width="" class="table_row_header" align="left" valign="middle">Fuel/nm</td>
<td width="" class="table_row_header" align="left" valign="middle">Altitude</td>
<td width="" class="table_row_header" align="left" valign="middle">Gnd Speed</td>
</tr></thead><tbody>
<tr class="table_row1" onmouseover="style.backgroundColor=&#39;#FFF9C4&#39;" onmouseout="style.backgroundColor=&#39;#FFFFFF&#39;">
<td width="" class="table_td" align="left" valign="top">Mon, May xx, 2010</td>
<td width="" class="table_td" align="left" valign="top">KMYF</td>
<td width="" class="table_td" align="left" valign="top">XXXX</td>
<td width="" class="table_td" align="left" valign="top">10:44</td>
<td width="" class="table_td" align="left" valign="top">12:43</td>
<td width="" class="table_td" align="left" valign="top">1.92 hrs</td>
<td width="" class="table_td" align="left" valign="top">1.8 hrs (1:48)</td>
<td width="" class="table_td" align="left" valign="top">0.12 hrs (0:07)</td>
<td width="" class="table_td" align="left" valign="top">177.27 nm</td>
<td width="" class="table_td" align="left" valign="top">1.32 nm</td>
<td width="" class="table_td" align="left" valign="top">16.69 gal</td>
<td width="" class="table_td" align="left" valign="top">8.68 gal/hr</td>
<td width="" class="table_td" align="left" valign="top">0.09 gal/nm</td>
<td width="" class="table_td" align="left" valign="top">9511 msl</td>
<td width="" class="table_td" align="left" valign="top">95.21 kts</td>
</tr>
</tbody></table>

</td></tr></tbody></table>
</div>
""";

The // in "/div/table//table//td" lets find skip any number of intermediate element layers, so it matches every td at any depth under the nested table below /div/table.

In [168]:
et = xp_parse(s)   # parse the HTML string defined above
tds = LibExpat.find(et, "/div/table//table//td")
Out[168]:
31-element Array{ETree,1}:
 <td class="table_header" align="left" colspan="15">Flight Info - NXXXXXX(Rogers Bleeblah #)  </td>
 <td class="table_row_header" valign="middle" align="left" width="">Date</td>                      
 <td class="table_row_header" valign="middle" align="left" width="">Origin</td>                    
 <td class="table_row_header" valign="middle" align="left" width="">Dest</td>                      
 <td class="table_row_header" valign="middle" align="left" width="">Depart</td>                    
 <td class="table_row_header" valign="middle" align="left" width="">Arrive</td>                    
 <td class="table_row_header" valign="middle" align="left" width="">Hobbs</td>                     
 <td class="table_row_header" valign="middle" align="left" width="">Flight Time</td>               
 <td class="table_row_header" valign="middle" align="left" width="">Ground Time</td>               
 <td class="table_row_header" valign="middle" align="left" width="">Flight Distance</td>           
 <td class="table_row_header" valign="middle" align="left" width="">Taxi Distance</td>             
 <td class="table_row_header" valign="middle" align="left" width="">Fuel</td>                      
 <td class="table_row_header" valign="middle" align="left" width="">Fuel/hr</td>                   
 ⋮                                                                                                 
 <td class="table_td" valign="top" align="left" width="">10:44</td>                                
 <td class="table_td" valign="top" align="left" width="">12:43</td>                                
 <td class="table_td" valign="top" align="left" width="">1.92 hrs</td>                             
 <td class="table_td" valign="top" align="left" width="">1.8 hrs (1:48)</td>                       
 <td class="table_td" valign="top" align="left" width="">0.12 hrs (0:07)</td>                      
 <td class="table_td" valign="top" align="left" width="">177.27 nm</td>                            
 <td class="table_td" valign="top" align="left" width="">1.32 nm</td>                              
 <td class="table_td" valign="top" align="left" width="">16.69 gal</td>                            
 <td class="table_td" valign="top" align="left" width="">8.68 gal/hr</td>                          
 <td class="table_td" valign="top" align="left" width="">0.09 gal/nm</td>                          
 <td class="table_td" valign="top" align="left" width="">9511 msl</td>                             
 <td class="table_td" valign="top" align="left" width="">95.21 kts</td>                            
In [169]:
el = tds[1]
Out[169]:
<td class="table_header" align="left" colspan="15">Flight Info - NXXXXXX(Rogers Bleeblah #)  </td>
In [170]:
typeof(el)
Out[170]:
ETree (constructor with 2 methods)

Serialize the element back to a string (markup included, not just the text):

In [171]:
string(el)
Out[171]:
"<td class=\"table_header\" align=\"left\" colspan=\"15\">Flight Info - NXXXXXX(Rogers Bleeblah #)  </td>"

Check the attribute Dict to identify elements by class

In [172]:
el.attr["class"]
Out[172]:
"table_header"
In [173]:
get(el.attr, "class","")
Out[173]:
"table_header"

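The two lookups above differ when the attribute is absent: indexing throws a KeyError, while `get` returns the supplied default. A small sketch using a plain Dict as a stand-in for el.attr (the keys and values are copied from the td shown above; the "id" lookup is a hypothetical missing attribute):

```julia
# Plain Dict standing in for el.attr of the header cell.
attr = Dict("class" => "table_header", "align" => "left", "colspan" => "15")

cls = get(attr, "class", "")      # attribute present: its value is returned
no_id = get(attr, "id", "")       # attribute absent: the default "" is returned
has_id = haskey(attr, "id")       # explicit presence check, false here

# Indexing a missing key raises KeyError instead of returning a default.
threw = try
    attr["id"]
    false
catch err
    err isa KeyError
end

println((cls, no_id, has_id, threw))
```

Using `get` with an empty default is what keeps a loop over many td elements from failing on the ones that carry no class attribute.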
Build a dictionary of labels and values by parsing element payloads

To extract from dirty HTML, it might make sense to first match on class="table_td" or class="table_row_header" and then use expat to extract the payloads.

Get the aircraft ID (acid) and type (actype) from the header cell

In [174]:
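function parse_header( hdr )
    # hdr looks like "Flight Info - NXXXXXX (Rogers Bleeblah #)"
    # take the second '-'-separated piece, i.e. the part after "Flight Info"
    hdr = strip( split(hdr,'-')[2] )
    # split "id (type #)" at '(' into the aircraft id and the type
    (acid, actype) = [strip(s) for s in split(hdr,'(')]
    # drop the trailing "#)" from the type
    actype = strip(replace(actype, "#)",""))
    return (acid, actype)
end 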
Out[174]:
parse_header (generic function with 1 method)
In [175]:
parse_header( "Flight Info - NXXXXXX (Rogers Bleeblah #)  " )
Out[175]:
("NXXXXXX","Rogers Bleeblah")
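
The split-based parser assumes exactly one '(' and a '-' before the ID, and it errors on anything else. As an alternative sketch (not the notebook's method), a regular expression can do the same job and return `nothing` for headers that do not fit. The name `parse_header_re` is hypothetical, and the code targets current Julia releases rather than the version used in the cells above.

```julia
# Regex-based variant of parse_header.
# Assumes headers shaped like "Flight Info - <id> (<type> #)".
function parse_header_re(hdr::AbstractString)
    # group 1: the id (no whitespace, no '('); group 2: text up to '#' or ')'
    m = match(r"-\s*([^\s(]+)\s*\(([^#)]*)", hdr)
    m === nothing && return nothing   # header did not have the expected shape
    return (String(m.captures[1]), String(strip(m.captures[2])))
end

println(parse_header_re("Flight Info - NXXXXXX(Rogers Bleeblah #)  "))
println(parse_header_re("no dash here"))
```

Returning `nothing` lets the caller decide what to do with a malformed header instead of failing inside the parser.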

Extract element payloads

In [176]:
labels = ASCIIString[]
values = ASCIIString[]
hdr = ""
for td in tds
    cls = get(td.attr,"class","")   # "" when the td has no class attribute
    if cls=="table_header"
        # title cell: holds the aircraft ID and type
        hdr = strip(td.elements[1])
        (acid, actype) = parse_header(hdr)
    elseif cls=="table_td"
        # data cell
        push!(values, strip(td.elements[1]) )
    elseif cls=="table_row_header"
        # column label cell
        push!(labels, strip(td.elements[1]) )
    end
end    
In [177]:
acid, actype
Out[177]:
("NXXXXXX","Rogers Bleeblah")

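Load into a Dict, dropping units: a value that ends in a digit is kept whole; otherwise only its first space-separated token is kept.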

In [178]:
dmap = Dict()
for (i,el) in enumerate(labels)
    v = values[i]
    if '0'<=v[end]<='9'
        dmap[el] = v
    else
        dmap[el] = split(v,' ')[1]
    end
end
dump(dmap)
Dict{Any,Any} len 15
  Flight Time: ASCIIString "1.8"
  Fuel/hr: ASCIIString "8.68"
  Gnd Speed: ASCIIString "95.21"
  Fuel: ASCIIString "16.69"
  Fuel/nm: ASCIIString "0.09"
  Hobbs: ASCIIString "1.92"
  Flight Distance: ASCIIString "177.27"
  Date: ASCIIString "Mon, May xx, 2010"
  Ground Time: ASCIIString "0.12"
  Taxi Distance: ASCIIString "1.32"
  Dest: ASCIIString "XXXX"
  ...
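
The values in dmap are still strings. A possible next step, sketched here for current Julia releases, is to convert the numeric ones with `tryparse` and leave the rest untouched. The `raw` Dict below is a small sample copied from the dump output above, and `to_number_or_string` is a hypothetical helper name.

```julia
# Sample of the string values shown in the dump above.
raw = Dict("Hobbs" => "1.92", "Fuel" => "16.69", "Dest" => "XXXX",
           "Date" => "Mon, May xx, 2010")

# Parse a value as Float64 where possible; keep the original string otherwise.
function to_number_or_string(v::AbstractString)
    x = tryparse(Float64, v)   # returns nothing when v is not a number
    return x === nothing ? v : x
end

typed = Dict(k => to_number_or_string(v) for (k, v) in raw)
println(typed)
```

Times such as "10:44" are not valid floats, so they would stay as strings under this scheme.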