Parsing XML files to a flat dataframe


Markup languages like XML are handy for structured data in which one attribute can take multiple values, or in which attributes are nested inside other attributes in a hierarchy. For simple analysis, however, we usually just want a rectangular DataFrame with rows and columns, so all of that structure has to be flattened. The following code does a very simple job of converting an XML file into a pandas DataFrame. It recursively walks every branch of the file, creating a new column for each tag it finds and storing the tag's text as the value. It also stores every attribute attached to a tag as a column of its own, so the result is a single wide row with the tag columns first and the attribute columns after them.


from bs4 import BeautifulSoup, Tag
import pandas as pd

def xml2df(xml_doc):
    # The 'xml' parser needs lxml installed; it keeps tag names case-sensitive.
    with open(xml_doc, 'r') as f:
        soup = BeautifulSoup(f, 'xml')

    name_list = []   # tag names, in document order
    text_list = []   # text of each tag (None when the tag has no single string child)
    attr_list = []   # attribute dict of each tag

    def recurs(node):
        for child in node.contents:
            # Skip plain strings, comments, etc.; only tags carry names and attributes.
            if isinstance(child, Tag):
                name_list.append(child.name)
                text_list.append(child.string)
                attr_list.append(child.attrs)
                recurs(child)

    recurs(soup)

    # Flatten the per-tag attribute dicts into parallel name and value lists.
    attr_names_list = [key for attrs in attr_list for key in attrs.keys()]
    attr_values_list = [value for attrs in attr_list for value in attrs.values()]

    # Tag columns come first, followed by attribute columns.
    columns = name_list + attr_names_list
    data = text_list + attr_values_list

    # One row: every tag and attribute becomes its own column.
    df = pd.DataFrame(data=[data], columns=columns)

    return df
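
To make the flattening easier to see on a concrete input, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree instead of BeautifulSoup. It is not the function above, just an illustration under stated assumptions: the name flatten_xml and the sample document are hypothetical, and whitespace-only text is treated as missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE_XML = """<library>
  <book id="1" lang="en"><title>Dune</title></book>
  <book id="2"><title>Emma</title></book>
</library>"""


def flatten_xml(xml_text):
    """Return (columns, values): tag columns first, then attribute columns."""
    root = ET.fromstring(xml_text)
    names, texts, attr_names, attr_values = [], [], [], []
    # Element.iter() visits the root and every descendant in document order.
    for elem in root.iter():
        names.append(elem.tag)
        # Assumption: whitespace-only or absent text is recorded as None.
        text = (elem.text or "").strip()
        texts.append(text or None)
        # Each attribute becomes its own column/value pair.
        for key, value in elem.attrib.items():
            attr_names.append(key)
            attr_values.append(value)
    return names + attr_names, texts + attr_values


columns, values = flatten_xml(SAMPLE_XML)
```

The two lists line up position by position, so pd.DataFrame([values], columns=columns) would give the single wide row described above. Repeated tags such as book produce repeated column names, which is worth keeping in mind before selecting columns by label.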
