A Thorough Intro to Pandas for Data Analysis

Data structures: Series and DataFrame

Matías Battocchia
9 min read · Oct 24, 2022

I created this material based on a couple of years of experience using Pandas. I intend to provide the explanations that I wish I had had handy from the beginning. Once understood, the concepts outlined here can make working with Pandas an enjoyable experience.

Pandas is a Python library that provides several tools for manipulating and analysing tabular data obtained from different sources, including formats such as SQL, CSV, XLS, JSON and more!

Pandas can efficiently deal with data. It allows selecting, modifying, inserting, removing, slicing, indexing, splitting, handling missing data, merging datasets and creating new sets of data with ease.

AI image generated with the OpenAI text2img model.

Data structures

Pandas offers two key structures to manage data.

But first let’s see two structures that Python uses natively, among many other structures:

list: an array of items that indexes and retrieves values by position — from zero to list-length-minus-one (bear in mind that Python starts counting from zero). The index of each item is its position, an integer.

dict: a hash table, hash map or dictionary that maps labels to values. In other words, it indexes and retrieves values by label instead of by position. The index of each item is a label, usually a string.

These two built-in Python data structures are great for many purposes but not suitable to work with tabular data. Thus, Pandas provides Series and DataFrame types.

Series: can behave as a list and as a dict at the same time. It functions like a column of a table, and its elements can be located both by position (row number) and by label (row name).

DataFrame: a tabular structure that has both rows and columns, that is, two axes, similar to a spreadsheet. Items (cell values) can be found both by position (row and column number) and by label (row and column name). An entire column can be retrieved by stating its position or label, and the result will be a Series.

A word on indexes. They are important! These mechanisms work under the hood of many data structures, allowing us to get or set values efficiently. Data structures are designed to store values in clever ways depending on each situation. Without indexes, access to data is slow (computationally expensive), sometimes too slow for the problem at hand.

Let's see some examples of the aforementioned structures:

Python list

Creation:

name_of_list = [value_A, value_B]

Notice the comma-separated values and the use of square brackets to create the list.

We can obtain one value according to its position (by using the index operator []).

value = name_of_list[position]
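As a concrete sketch of the pattern above (the variable name and the values are illustrative choices, not part of the template):

```python
# Illustrative values; any Python objects could be stored in the list.
elements = ["hydrogen", "helium", "lithium"]

first = elements[0]                 # positions start at zero
last = elements[len(elements) - 1]  # the last position is list-length-minus-one

print(first, last)
```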

Python dict

Creation:

name_of_dict = {
label_A: value_A,
label_B: value_B,
}

Notice the comma-separated pairs of label:value and the use of curly brackets to create the dict.

We can retrieve just the index labels with dict.keys().

Or just its values (as if it were a list) with dict.values().

Even though the elements of the dict do have a position, they can only be accessed by label —by using the indexing operator []— and not by position.

value = name_of_dict[label]
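The dict operations described above might look like this in practice. The labels and values are illustrative; the last part shows that an integer position is simply treated as one more (missing) label.

```python
# Illustrative labels (chemical symbols) mapped to values (element names).
elements = {
    "H": "hydrogen",
    "He": "helium",
    "Li": "lithium",
}

labels = list(elements.keys())    # just the labels
values = list(elements.values())  # just the values
helium = elements["He"]           # access by label

# Access by position is not supported: 0 is looked up as a label and not found.
try:
    elements[0]
    lookup_by_position_failed = False
except KeyError:
    lookup_by_position_failed = True

print(labels, values, helium, lookup_by_position_failed)
```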

Pandas Series

First, we must import Pandas.

A simple way to create a Series is by passing a dict to the Series constructor.

The chemical symbols (H, He, Li) on the left are not a column but an index. The elements (hydrogen, helium, lithium) are the values of the Series and they constitute a column. Series are one-column structures.
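A minimal sketch of the import and the construction just described. The variable name `elements` is an assumption for illustration; `pd` is the conventional alias for the library.

```python
import pandas as pd

# The dict labels become the index; the dict values become the Series values.
elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# Printing shows the index on the left and the values on the right.
print(elements)
```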

We may want to see just the index by using the Series.index attribute.

Or only the values with the Series.values attribute.
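Both attributes can be inspected as follows; this sketch rebuilds the same illustrative Series so that it runs on its own.

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

labels = elements.index    # an Index object holding the labels
values = elements.values   # the underlying array of values

print(list(labels))
print(list(values))
```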

We can select a value using its label as we would do with a dict.

We can add a value, again, as we would do with a dict.
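Selecting and adding by label could be sketched like this. The added element (`"Be"`, beryllium) is an illustrative choice.

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# Select one value by its label, exactly as with a dict.
helium = elements["He"]

# Assigning to a label that does not exist yet appends a new element.
elements["Be"] = "beryllium"

print(helium)
print(elements)
```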

On a dict we can only select one element but on a Series we can select more than one by providing a list of labels.

Notice the double square brackets to ask for a list of labels.

The returned value will be another, smaller, Series.
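A short sketch of selecting several labels at once, using the same illustrative Series:

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# Outer brackets: the indexing operator. Inner brackets: a list of labels.
subset = elements[["H", "Li"]]

print(subset)
```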

We could also select a value with the Series.at[] label indexer.
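For instance, with the same illustrative Series:

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# at[] takes exactly one label and returns a single value.
lithium = elements.at["Li"]

print(lithium)
```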

Why is there more than one way of selecting? This will be answered soon; for now we just want to show that both exist.

It is possible to select by position, as we can with a list.

Unlike a list, a Series can return more than one value when we pass it a list of positions. The result will be a Series.

More excitingly, we could pass a list of booleans (of the same length as the Series) to select several elements. Neither list nor dict is capable of doing this.
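A sketch of positional selection follows. Note the substitution: it uses the `iloc[]` indexer, which the text has not introduced, because recent Pandas versions deprecate passing integer positions to the bare `[]` operator on a Series whose labels are not integers.

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# One position returns a single value.
first = elements.iloc[0]

# A list of positions returns a smaller Series.
subset = elements.iloc[[0, 2]]

print(first)
print(subset)
```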

Note that the retrieved elements were selected according to the positions of the True values in the list of booleans, namely, positions 1 and 2.
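Boolean selection could look like the following sketch, where the illustrative mask has True at positions 1 and 2:

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# One boolean per element; True keeps the element, False drops it.
mask = [False, True, True]
selected = elements[mask]

print(selected)
```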

Memorise these tricks! What applies to Series will also work for DataFrame.

There exists as well a Series.iat[] position indexer (tip: iat is for integer at).
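A minimal example with the same illustrative Series:

```python
import pandas as pd

elements = pd.Series({"H": "hydrogen", "He": "helium", "Li": "lithium"})

# iat[] takes exactly one integer position and returns a single value.
second = elements.iat[1]

print(second)
```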

A Series can also be created from a list.

HEADS UP: This index lacks the chemical symbols; numbers (0, 1, 2) are shown instead. When we create a Series from a dict, the labels of the dict become the index labels. A list, however, only provides values in a certain order, without any labels, so by default the Series uses the positions of the elements as labels.
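The default labels can be observed with a sketch like this one (the variable name is illustrative):

```python
import pandas as pd

# No labels are given, so Pandas generates the default ones: 0, 1, 2.
elements = pd.Series(["hydrogen", "helium", "lithium"])

print(elements)
print(list(elements.index))
```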

Let's check it!

By label (at[] indexer).

By position (iat[] indexer).

Elements can always be accessed by label or by position. In this case the labels coincide with the positions, so both calls return lithium, the same element. It looks redundant, but it is important to know that labels and positions are not the same thing.
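A sketch of that check, assuming the list-based Series from before and using 2 as both label and position:

```python
import pandas as pd

elements = pd.Series(["hydrogen", "helium", "lithium"])

by_label = elements.at[2]      # 2 is interpreted as a label
by_position = elements.iat[2]  # 2 is interpreted as a position

print(by_label, by_position)
```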

The previous example can be reminiscent of a dict that has integers as labels.

Let's see an example of labels not being the same as positions (even though sometimes they can match). We are going to sort the Series in alphabetical order with the Series.sort_values() method.

Look at the index. Now the labels do not match the positions since the labels stick to the values whereas positions are always relative to the order.
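The sorting step might be sketched as follows; `sorted_elements` is an illustrative name.

```python
import pandas as pd

elements = pd.Series(["hydrogen", "helium", "lithium"])

# sort_values() returns a new Series; each label travels with its value.
sorted_elements = elements.sort_values()

print(sorted_elements)
print(list(sorted_elements.index))
```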

Let's see what happens when we make a selection using the same label and position.

It is hydrogen, the second item of the Series, which happens to have the label 0.

It is helium, whose label is 1 but is the first row of the Series.

Wondering what the indexing operator [] returns?

It is hydrogen, the element with label 0, in case you were expecting helium, the item with position 0.
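The three lookups discussed above can be reproduced with this sketch, which rebuilds and sorts the illustrative Series:

```python
import pandas as pd

sorted_elements = pd.Series(["hydrogen", "helium", "lithium"]).sort_values()

by_label = sorted_elements.at[0]      # label 0 -> "hydrogen"
by_position = sorted_elements.iat[0]  # position 0 -> "helium"
by_operator = sorted_elements[0]      # integer labels: [] looks up the label

print(by_label, by_position, by_operator)
```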

HEADS UP! That is why we will prefer at[] and iat[]. Despite being more verbose, these indexers make our intentions explicit and our code less prone to errors.

Finally, here is an alternative way of creating a Series with a list while specifying labels with a list of names.
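A sketch of that alternative, passing the labels through the `index` argument of the constructor:

```python
import pandas as pd

# Values come from the first list, labels from the list given as index.
elements = pd.Series(
    ["hydrogen", "helium", "lithium"],
    index=["H", "He", "Li"],
)

print(elements)
```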

Pandas DataFrame

If Series is like a single column, DataFrame is a table. Series has only one index, for its rows. It is one-dimensional: think of the index as the number line in elementary mathematics. DataFrame, in turn, has one index for its rows and another for its columns, just as a 2D plot has two axes, so we say it is two-dimensional.

It is important to mention that a DataFrame can have any number of columns, including just one. In that case it would still be a DataFrame and not a Series; we could, for example, add another column to it later, which is something we cannot do with a Series.

In order to obtain a value from a cell, one must access the row and the column. We will later see this in detail.

There are many more ways of creating a DataFrame than a Series. We will pick one that picks up from the last Series example. Instead of a single list we will provide a list of lists: the outer list contains as many inner lists as the table will have rows, and each inner list holds as many values as there will be columns.

As with the previous example of Series, we used the index argument to set up labels for the rows; what is new here is that we also specified labels for the columns through the columns argument.

Note that the index of the rows is called index whereas the index of the columns is called columns.🤪
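A minimal sketch of this construction. The variable name `table`, the column names (`name`, `atomic_number`) and the cell values are assumptions chosen for illustration, not the data from the original example.

```python
import pandas as pd

# Outer list: one inner list per row. Inner lists: one value per column.
table = pd.DataFrame(
    [
        ["hydrogen", 1],
        ["helium", 2],
        ["lithium", 3],
    ],
    index=["H", "He", "Li"],            # labels for the rows
    columns=["name", "atomic_number"],  # labels for the columns
)

print(table)
```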

We do the following in order to obtain the labels of the rows, as we did with Series.

Or this in order to obtain the labels of the columns, which is an index too.

And this, to obtain the table values.
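These three attributes could be inspected as follows, using an illustrative table whose column names and values are assumptions:

```python
import pandas as pd

table = pd.DataFrame(
    [["hydrogen", 1], ["helium", 2], ["lithium", 3]],
    index=["H", "He", "Li"],
    columns=["name", "atomic_number"],
)

row_labels = table.index       # index of the rows
column_labels = table.columns  # index of the columns
cell_values = table.values     # two-dimensional array of cell values

print(list(row_labels))
print(list(column_labels))
print(cell_values)
```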

The indexing operator [] presents several different behaviours for DataFrame, which we will see. One of them is that if you provide a column name, that column will be retrieved as a Series.

Notice that Series can have a name. The ones returned from DataFrame keep the name of the column with them.
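A sketch of retrieving a column, again with an illustrative table (column names are assumptions):

```python
import pandas as pd

table = pd.DataFrame(
    [["hydrogen", 1], ["helium", 2], ["lithium", 3]],
    index=["H", "He", "Li"],
    columns=["name", "atomic_number"],
)

# A column label inside [] returns that column as a Series.
names = table["name"]

print(names)
print(names.name)  # the Series keeps the column label as its name
```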

While the indexing operator [] sets single values in a Series, in a DataFrame it sets entire columns!

Now there is a new label in the columns index: category. It was not necessary to specify row-index labels since they were inferred from the order of items in the list.
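Adding the category column might look like this sketch. The table and the category values are illustrative assumptions; only the column name `category` comes from the text.

```python
import pandas as pd

table = pd.DataFrame(
    [["hydrogen", 1], ["helium", 2], ["lithium", 3]],
    index=["H", "He", "Li"],
    columns=["name", "atomic_number"],
)

# A plain list is matched to the rows by order, one item per row.
table["category"] = ["nonmetal", "noble gas", "alkali metal"]

print(table)
print(list(table.columns))
```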

To wrap up this post, here is how to access a single value from the table. We need to provide two coordinates: row and column.

We can do it by label (at[row, column] indexer).

Or by position (iat[row, column] indexer).
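Both lookups can be sketched on an illustrative table (column names and values are assumptions):

```python
import pandas as pd

table = pd.DataFrame(
    [["hydrogen", 1], ["helium", 2], ["lithium", 3]],
    index=["H", "He", "Li"],
    columns=["name", "atomic_number"],
)

by_label = table.at["Li", "name"]  # row label, column label
by_position = table.iat[2, 0]      # row position, column position

print(by_label, by_position)
```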

Another image generated using the OpenAI text2img model.

That's all!

Stay tuned for the next post in this series. A good understanding of the basic building blocks of Pandas will be crucial for grasping the vast number of features that it offers for working with data.


Matías Battocchia

I studied at Universidad de Buenos Aires. I live in Mendoza, Argentina. Interests: data, NLP, blockchain.