Markup languages like XML are handy for structured data in which the same attribute can hold multiple values, or in which attributes are nested inside other attributes in a hierarchy. For simple analysis, however, we usually just want a rectangular DataFrame of rows and columns, so all that structure has to be flattened. The following code does a very simple job of converting an XML file into a Pandas DataFrame. It recursively walks every branch in the file, creating a new column and storing its value whenever information is found. It stores not only the raw text of each tag as a variable in the new dataset, but also every attribute stored in the tags.
    from bs4 import BeautifulSoup
    import pandas as pd


    def xml2df(xml_doc):
        # The "xml" parser keeps tag names as written; it requires lxml
        with open(xml_doc, 'r') as f:
            soup = BeautifulSoup(f, 'xml')

        name_list = []
        text_list = []
        attr_list = []

        def recurs(node):
            # Walk every child tag, recording its name, text and attributes
            for j in getattr(node, 'contents', []):
                if j.name is not None:
                    name_list.append(j.name)
                    text_list.append(j.string)  # None when the tag has child tags
                    attr_list.append(j.attrs)
                recurs(j)

        recurs(soup)

        # Flatten attribute names and values in the same order
        attr_names = [k for q in attr_list for k in q.keys()]
        attr_values = [v for q in attr_list for v in q.values()]

        columns = name_list + attr_names
        data = text_list + attr_values

        # One row: tag text first, then attribute values
        df = pd.DataFrame(data=[data], columns=columns)
        return df
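To see what this kind of recursive flattening yields, here is a minimal sketch run on a small in-memory document. It is not the function above: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and it lists each tag's attributes right after the tag rather than grouping all attributes at the end. The sample XML and the name `flatten` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration
SAMPLE = """<library>
  <book id="b1" lang="en">
    <title>Dune</title>
    <year>1965</year>
  </book>
</library>"""


def flatten(element, columns=None, values=None):
    """Recursively collect tag text and attributes as two parallel lists."""
    if columns is None:
        columns, values = [], []
    # Record the tag itself; container tags hold only whitespace, stored as None
    text = (element.text or "").strip()
    columns.append(element.tag)
    values.append(text or None)
    # Record each attribute as its own column
    for key, value in element.attrib.items():
        columns.append(key)
        values.append(value)
    # Descend into every child branch
    for child in element:
        flatten(child, columns, values)
    return columns, values


columns, values = flatten(ET.fromstring(SAMPLE))
for column, value in zip(columns, values):
    print(column, "=", value)
```

Passing the result to `pd.DataFrame([values], columns=columns)` would give a one-row frame with one column per tag and per attribute. Repeated tags would produce repeated column names, so a real file with many records probably needs one row per record instead.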