Handling XML Files

This notebook shows two ways to read XML data:

xml.etree.ElementTree, from the Python standard library
xmltodict, a third-party library

In [1]:

# import required libraries
import xml.etree.ElementTree as ET
import xmltodict

Utilities

In [2]:

def print_nested_dicts(nested_dict, indent_level=0):
    """Print a nested dict, indenting leaf values by nesting depth.

    Args:
        nested_dict (dict): the dictionary to be printed
        indent_level (int): the indentation level for nesting
    Returns:
        None

    """
    for key, val in nested_dict.items():
        if isinstance(val, dict):
            # keys of nested containers are printed flush left;
            # only leaf values are indented
            print("{0} : ".format(key))
            print_nested_dicts(val, indent_level=indent_level + 1)
        elif isinstance(val, list):
            print("{0} : ".format(key))
            for rec in val:
                if isinstance(rec, dict):
                    print_nested_dicts(rec, indent_level=indent_level + 1)
                else:
                    # repeated text-only elements arrive as plain values
                    print("{0}{1}".format("\t" * (indent_level + 1), rec))
        else:
            print("{0}{1} : {2}".format("\t" * indent_level, key, val))

def print_xml_tree(xml_root, indent_level=0):
    """Recursively print the tag, attributes and text of each child element.

    Args:
        xml_root (xml.etree.ElementTree.Element): the element whose
            children are printed
        indent_level (int): the indentation level for nesting
    Returns:
        None

    """
    for child in xml_root:
        print("{0}tag:{1}, attribute:{2}".format(
            "\t" * indent_level, child.tag, child.attrib))
        print("{0}tag data:{1}".format("\t" * indent_level, child.text))
        # descend one level for the children of this element
        print_xml_tree(child, indent_level=indent_level + 1)


def read_xml(file_name):
    """Parse an XML file and print its root and all nested elements.

    Args:
        file_name (str): file path to be read
    Returns:
        None

    """
    try:
        tree = ET.parse(file_name)
    except IOError:
        raise IOError("File path incorrect/ File not found")
    except ET.ParseError:
        # malformed XML is reported by ElementTree as ParseError
        raise ValueError("XML file has errors")

    root = tree.getroot()
    print("Root tag:{0}".format(root.tag))
    print("Attributes of Root:: {0}".format(root.attrib))
    print_xml_tree(root)

    

def read_xml2dict_xml(file_name):
    """Parse an XML file with xmltodict and print the resulting nested dict.

    Args:
        file_name (str): file path to be read
    Returns:
        None

    """
    # xmltodict parses with expat, so malformed XML surfaces as ExpatError
    from xml.parsers.expat import ExpatError

    try:
        # the context manager closes the file even if reading fails
        with open(file_name) as xml_file:
            xml_filedata = xml_file.read()
        ordered_dict = xmltodict.parse(xml_filedata)
    except IOError:
        raise IOError("File path incorrect/ File not found")
    except (ExpatError, ValueError):
        raise ValueError("XML file has errors")

    print_nested_dicts(ordered_dict)

Parse using XML module

The read_xml() function takes the path of the XML file as its only parameter.

In [3]:

read_xml(r'sample_xml.xml')

Root tag:records
Attributes of Root:: {'attr': 'sample xml records'}
tag:record, attribute:{'name': 'rec_1'}
tag data:
	  
	tag:sub_element, attribute:{}
	tag data:
	    
		tag:detail1, attribute:{}
		tag data:Attribute 1
		tag:detail2, attribute:{}
		tag data:2
	tag:sub_element_with_attr, attribute:{'attr': 'complex'}
	tag data:
	    Sub_Element_Text
	  
	tag:sub_element_only_attr, attribute:{'attr_val': 'only_attr'}
	tag data:None
tag:record, attribute:{'name': 'rec_2'}
tag data:
	  
	tag:sub_element, attribute:{}
	tag data:
	    
		tag:detail1, attribute:{}
		tag data:Attribute 1
		tag:detail2, attribute:{}
		tag data:2
	tag:sub_element_with_attr, attribute:{'attr': 'complex'}
	tag data:
	    Sub_Element_Text
	  
	tag:sub_element_only_attr, attribute:{'attr_val': 'only_attr'}
	tag data:None
tag:record, attribute:{'name': 'rec_3'}
tag data:
	  
	tag:sub_element, attribute:{}
	tag data:
	    
		tag:detail1, attribute:{}
		tag data:Attribute 1
		tag:detail2, attribute:{}
		tag data:2
	tag:sub_element_with_attr, attribute:{'attr': 'complex'}
	tag data:
	    Sub_Element_Text
	  
	tag:sub_element_only_attr, attribute:{'attr_val': 'only_attr'}
	tag data:None

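The function prints a nested view that mirrors the structure of the XML itself: each level of indentation corresponds to one level of nesting, and every element is listed with its tag, attributes, and text. Elements that contain only child elements show whitespace as their text, and empty elements show None. Working directly with the element tree gives full control over which nodes are visited and how each one is parsed.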
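As an illustration of that flexibility, here is a minimal sketch that extracts selected fields from each record instead of walking the whole tree. The inline XML is an assumption: it is reconstructed from the printed output above, because the actual contents of sample_xml.xml are not shown. The helper name extract_records is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed sample, reconstructed from the printed output (the real file is not shown).
SAMPLE_XML = """<records attr="sample xml records">
  <record name="rec_1">
    <sub_element>
      <detail1>Attribute 1</detail1>
      <detail2>2</detail2>
    </sub_element>
    <sub_element_with_attr attr="complex">
      Sub_Element_Text
    </sub_element_with_attr>
    <sub_element_only_attr attr_val="only_attr"/>
  </record>
  <record name="rec_2">
    <sub_element>
      <detail1>Attribute 1</detail1>
      <detail2>2</detail2>
    </sub_element>
    <sub_element_with_attr attr="complex">
      Sub_Element_Text
    </sub_element_with_attr>
    <sub_element_only_attr attr_val="only_attr"/>
  </record>
</records>"""


def extract_records(root):
    """Hypothetical helper: collect selected fields from each <record>."""
    rows = []
    for record in root.findall("record"):
        with_attr = record.find("sub_element_with_attr")
        rows.append({
            # attribute lookup on the element itself
            "name": record.get("name"),
            # path lookup through nested children
            "detail1": record.findtext("sub_element/detail1"),
            "detail2": int(record.findtext("sub_element/detail2")),
            # element text carries surrounding whitespace, so strip it
            "text": (with_attr.text or "").strip(),
            "attr": with_attr.get("attr"),
        })
    return rows


root = ET.fromstring(SAMPLE_XML)
records = extract_records(root)
for row in records:
    print(row)
```

Each row is a flat dict, which is convenient when only a few known fields are needed and the rest of the tree can be ignored.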

Parse using xmltodict

The read_xml2dict_xml() function also takes the path of the XML file as its only parameter. It uses xmltodict to do the heavy lifting.

In [4]:

read_xml2dict_xml(r'sample_xml.xml')

records : 
	@attr : sample xml records
record : 
		@name : rec_1
sub_element : 
			detail1 : Attribute 1
			detail2 : 2
sub_element_with_attr : 
			@attr : complex
			#text : Sub_Element_Text
sub_element_only_attr : 
			@attr_val : only_attr
		@name : rec_2
sub_element : 
			detail1 : Attribute 1
			detail2 : 2
sub_element_with_attr : 
			@attr : complex
			#text : Sub_Element_Text
sub_element_only_attr : 
			@attr_val : only_attr
		@name : rec_3
sub_element : 
			detail1 : Attribute 1
			detail2 : 2
sub_element_with_attr : 
			@attr : complex
			#text : Sub_Element_Text
sub_element_only_attr : 
			@attr_val : only_attr

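The output in the above cell shows how xmltodict represents an XML file: attributes become keys prefixed with @, the text of an element that also has attributes is stored under #text, and repeated elements such as record are collected into a list. The library performs the node traversal itself and returns a nested dictionary, so no manual tree walking is needed.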
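The @ and #text keys in the output can be illustrated with a simplified stand-in built on the standard library. This is only a sketch of the naming convention, not xmltodict's actual implementation. The function name element_to_dict and the inline XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Simplified stand-in for the xmltodict convention (not its real code)."""
    # attributes become keys prefixed with "@"
    node = {"@" + name: value for name, value in element.attrib.items()}
    for child in element:
        value = element_to_dict(child)
        if child.tag in node:
            # repeated tags are collected into a list
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (element.text or "").strip()
    if not node:
        # leaf without attributes or children: plain text, None when empty
        return text or None
    if text:
        # text next to attributes or children goes under "#text"
        node["#text"] = text
    return node


# Assumed miniature input, modelled on the tags seen in the output above.
xml_text = """<records attr="sample xml records">
  <record name="rec_1"><sub_element_with_attr attr="complex">Sub_Element_Text</sub_element_with_attr></record>
  <record name="rec_2"><sub_element><detail1>Attribute 1</detail1></sub_element></record>
</records>"""

root = ET.fromstring(xml_text)
parsed = {root.tag: element_to_dict(root)}
print(parsed["records"]["record"][0]["@name"])
```

With xmltodict itself, the same lookup would be written as xmltodict.parse(xml_text)["records"]["record"][0]["@name"]. Note that a tag occurring only once yields a single dict rather than a list, unless the force_list argument of xmltodict.parse is used.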