FOSS World News: Examining Data Using Pandas

Tuesday, May 22, 2018

Examining Data Using Pandas

You don't need to be a data scientist to use Pandas for some basic analysis.

Traditionally, people who program in Python use the data types that come with the language, such as integers, strings, lists, tuples and dictionaries. Sure, you can create objects in Python, but those objects typically are built out of those fundamental data structures.

If you're a data scientist working with Pandas though, most of your time is spent with NumPy. NumPy might feel like a Python data structure, but it acts differently in many ways. That's not just because all of its operations work via vectors, but also because the underlying data is actually a C-style array. This makes NumPy extremely fast and efficient, consuming far less memory for a given array of numbers than traditional Python objects would do.

The thing is, NumPy is designed to be fast, but it's also a bit low level for some people. To get more functionality and a more flexible interface, many people use Pandas, a Python package that provides two basic wrappers around NumPy arrays: one-dimensional Series objects and two-dimensional Data Frame objects.

I often describe Pandas as "Excel within Python", in that you can perform all sorts of calculations as well as sort data, search through it and plot it.

For all of these reasons, it's no surprise that Pandas is a darling of the data science community. But here's the thing: you don't need to be a data scientist to enjoy Pandas. It has a lot of excellent functionality that's good for Python developers who otherwise would spend their time wrestling with lists, tuples and dictionaries.

So in this article, I describe some basic analysis that everyone can do with Pandas, regardless of whether you're a data scientist. If you ever work with CSV files (and you probably do), I definitely recommend thinking about using Pandas to open, read, analyze and even write to them. And although I don't cover it in this article, Pandas handles JSON and Excel very well too.

Creating Data Frames

Although it's possible to create a data frame from scratch using Python data structures or NumPy arrays, it's more common in my experience to do so from a file. Fortunately, Pandas can load data from a variety of file formats.

Before you can do anything with Pandas, you have to load it. In a Jupyter notebook, do:


%pylab inline
import pandas as pd

For example, Python comes with a csv module that knows how to handle files in CSV (comma-separated value) format. But, then you need to iterate over the file and do something with each of those lines/rows. I often find it easier to use Pandas to work with such files. For example, here's a CSV file: