Wrangling with Various Data Formats

  |   Source

Wrangling data from various sources can be quite tedious because there are so many data formats out there. Here we address the common issues and handle some of the most popular formats, e.g. CSV, JSON, Excel, and XML.

Data wrangling is a skill every data scientist must have; it is often said that about 75% of a data scientist's time is spent on it. Data wrangling is about gathering data from many sources, and that is the core of the problem: we can't assume the data is clean, because whenever data involves human input, human error will be involved. We don't want missing or erroneous values in our data, since computations on it will fail or give wrong results. And if you work in a team, your teammates will not trust data you provide with errors in it.

Here I want to show you how to clean data in CSV, JSON, Excel, and XML. This is what I learned from Data Wrangling with MongoDB, one of the courses at Udacity. You may want to take a look at the course.

I'm not going to discuss each data format in detail; you may want to look them up individually.

I'm also not going to dig into every detail of the Python modules that handle these formats; for that, you can read their documentation.

CSV

CSV has been around for a while and is one of the most widely used data formats out there.

There are many advantages to using CSV:

  • We can use any text editor to open the file
  • The first row is usually a header containing the field names
  • Fields are simply delimited by commas

Let's take a look at how we handle a simple CSV file.

In [ ]:
# Your task is to read the input DATAFILE line by line, and for the first 10 lines (not including the header)
# split each line on "," and then for each line, create a dictionary
# where the key is the header title of the field, and the value is the value of that field in the row.
# The function parse_file should return a list of dictionaries,
# each data line in the file being a single list entry.
# Field names and values should not contain extra whitespace, like spaces or newline characters.
# You can use the Python string method strip() to remove the extra whitespace.
# You have to parse only the first 10 data lines in this exercise,
# so the returned list should have 10 entries!
import os

def parse_file(datafile):
    with open(datafile, "r") as f:
        items = [line.strip().split(',') for line in f]
        key_list = items[0]
        data = [dict(zip(key_list,items[i])) for i in range(1,11)]
    return data

This will load the data as a list of Python dictionaries. But such code won't work in general, because field values can themselves contain commas.
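To see why the naive split fails, here is a quick sketch (the row content is made up for illustration) comparing a plain split with csv.reader on a quoted field that contains a comma:

```python
import csv
import io

text = 'name,motto\nAda,"Hello, world"'

# Naive split: the comma inside the quoted field is treated as a
# delimiter, so the data row breaks into three pieces instead of two.
naive = [line.split(',') for line in text.splitlines()]
print(naive[1])  # ['Ada', '"Hello', ' world"']

# csv.reader understands quoting and keeps the field intact.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1])  # ['Ada', 'Hello, world']
```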

For that, Python has the csv module to do all the parsing for us.

In [9]:
import csv

def parse_file(datafile):
    name = ""
    data = []
    with open(datafile, newline='') as f:  # in Python 3, open in text mode with newline=''
        items = list(csv.reader(f))
        name = items[0][1]
        data = items[2:]
    # Do not change the line below
    return (name, data)

Here we take advantage of the csv module: csv.reader parses every row, including quoted fields, for us. In this particular file the name sits in the first row and the data rows start from the third row.

It returns the name as a string and the data as a list of rows.

We can also turn a CSV file into a list of dictionaries.

In [2]:
import csv
def process_file(filename):
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)  # uses the first row as the field names
        for row in reader:
            # process row as a dictionary keyed by the field names
            pass
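As a self-contained illustration of DictReader (using an in-memory string instead of a file path, with made-up data):

```python
import csv
import io

text = "city,population\nJakarta,10million\nBandung,2million"

reader = csv.DictReader(io.StringIO(text))
rows = list(reader)

# Each row is a dictionary keyed by the header fields.
print(rows[0]["city"])        # Jakarta
print(rows[1]["population"])  # 2million
```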

Excel

There are also cases where we face Microsoft Excel's format. Excel divides data into rows, columns, and sheets. In Python, we have the xlrd module to handle the Excel format.

In [3]:
import xlrd
def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)

    data = [[sheet.cell_value(r, col) 
                for col in range(sheet.ncols)] 
                    for r in range(sheet.nrows)]

    print("\nList Comprehension")
    print("data[3][2]:", data[3][2])

    print("\nCells in a nested loop:")
    for row in range(sheet.nrows):
        for col in range(sheet.ncols):
            if row == 50:
                print(sheet.cell_value(row, col), end=" ")

    ### other useful methods:
    print("\nROWS, COLUMNS, and CELLS:")
    print("Number of rows in the sheet:", sheet.nrows)
    print("Type of data in cell (row 3, col 2):", sheet.cell_type(3, 2))
    print("Value in cell (row 3, col 2):", sheet.cell_value(3, 2))
    print("Get a slice of values in column 3, from rows 1-3:")
    print(sheet.col_values(3, start_rowx=1, end_rowx=4))

    print("\nDATES:")
    print("Type of data in cell (row 1, col 0):", sheet.cell_type(1, 0))
    exceltime = sheet.cell_value(1, 0)
    print("Time in Excel format:", exceltime)
    print("Convert time to a Python datetime tuple, from the Excel float:")
    print(xlrd.xldate_as_tuple(exceltime, 0))
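Excel stores timestamps as floating-point serial numbers: days since an epoch, with the fraction carrying the time of day. `xlrd.xldate_as_tuple` does this conversion for you, but the arithmetic itself is simple. Here is a pure-Python sketch of the same idea for the default 1900 date mode, using a made-up serial number:

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial):
    """Convert an Excel 1900-date-mode serial number to a datetime.

    Excel's effective day 0 is 1899-12-30; using that epoch also
    absorbs Excel's historical 1900 leap-year bug for dates after
    February 1900.
    """
    epoch = datetime(1899, 12, 30)
    return epoch + timedelta(days=serial)

# 42005.5 is noon on 2015-01-01 in Excel's serial representation.
print(excel_serial_to_datetime(42005.5))  # 2015-01-01 12:00:00
```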

JSON

JSON is a data format that maps very naturally onto Python dictionaries and lists.

Because of this, loading it into a Python dictionary is very simple.

In [5]:
import json

def get_from_file(filename):
    with open(filename, "r") as f:
        return json.load(f)
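A quick round-trip sketch with an in-memory string instead of a file (the record is made up):

```python
import json

record = {"name": "Jakarta", "lat": -6.18, "lon": 106.82, "tags": ["city", "capital"]}

# dumps serializes a Python structure to a JSON string...
text = json.dumps(record)

# ...and loads parses it back into dicts, lists, strings, and numbers.
parsed = json.loads(text)
print(parsed["tags"])  # ['city', 'capital']
```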

XML

XML comes in many shapes. Let's take a look at two example documents, starting with the first one.

In [6]:
XML1="""    <properties><batch>
    <k1>v1</k1>
    <k2>v2</k2>
    <k3>v3</k3>
    <k4>v4</k4>
    </batch>
    <batch>
    </batch>
    </properties>
    """

And this is the second one, taken from OpenStreetMap data.

In [12]:
XML2 = """
<osm generator="Overpass API" version="0.6">
<note>The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
<meta osm_base="2014-11-05T11:38:02Z" />
  <node changeset="20029239" id="29938967" lat="-6.1803929" lon="106.8226699" timestamp="2014-01-16T08:18:23Z" uid="646006" user="Irfan Muhammad" version="13" />
  <node changeset="20029239" id="29938968" lat="-6.1803972" lon="106.8231199" timestamp="2014-01-16T08:18:23Z" uid="646006" user="Irfan Muhammad" version="28" />
</osm>
"""

For XML, Python has a few modules, two of which are xml.dom.minidom and xml.etree.ElementTree. Let's use each of them.

First, I'm going to handle the first document with minidom.

In [11]:
from xml.dom import minidom
def parse_xml(xmlread):
    xParsed = minidom.parseString(xmlread)
    keys = ['k1','k2','k3','k4']
    content = []
    for batch in xParsed.getElementsByTagName("batch"):
        d = {}
        for key in keys:
            try:
                d[key] = batch.getElementsByTagName(key)[0].childNodes[0].nodeValue
            except IndexError:
                pass
        content.append(d)
    return content

print(parse_xml(XML1))
[{'k1': 'v1', 'k2': 'v2', 'k3': 'v3', 'k4': 'v4'}, {}]

For the second one, we use the ElementTree module.

In [17]:
import xml.etree.ElementTree as ET
def parse(xmlread):
    tree = ET.fromstring(xmlread)
    for elem in tree.iter():
        if elem.tag == "node":
            print(elem.attrib)
parse(XML2)
{'changeset': '20029239', 'id': '29938967', 'lat': '-6.1803929', 'lon': '106.8226699', 'timestamp': '2014-01-16T08:18:23Z', 'uid': '646006', 'user': 'Irfan Muhammad', 'version': '13'}
{'changeset': '20029239', 'id': '29938968', 'lat': '-6.1803972', 'lon': '106.8231199', 'timestamp': '2014-01-16T08:18:23Z', 'uid': '646006', 'user': 'Irfan Muhammad', 'version': '28'}
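A common next step in wrangling is converting attribute strings into typed values. Here is a self-contained sketch (using a trimmed-down, made-up node element rather than the OpenStreetMap data above) that pulls each node's id and coordinates out as numbers:

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<osm>
  <node id="1" lat="-6.18" lon="106.82" user="alice" />
  <node id="2" lat="-6.19" lon="106.83" user="bob" />
</osm>
"""

def extract_nodes(xmltext):
    root = ET.fromstring(xmltext)
    # Convert the string attributes we care about into typed values.
    return [
        {
            "id": int(node.get("id")),
            "lat": float(node.get("lat")),
            "lon": float(node.get("lon")),
        }
        for node in root.iter("node")
    ]

nodes = extract_nodes(SAMPLE)
print(nodes[0])  # {'id': 1, 'lat': -6.18, 'lon': 106.82}
```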

So there you go. With these tools you can wrangle data from many sources and many data formats.

There are a lot of data formats out there, and I hope this gives you some better intuition for handling them.

You can also download this blog post from the source link at the top of the post.