A Thorough Intro to Pandas for Data Analysis II

Inspection, filtering and edition

Matías Battocchia
9 min readNov 11, 2022

In the previous part of this tutorial we saw the two data structures that Pandas provides to manage data: Series and DataFrame, along with native Python structures.

Let’s see now how to work with them.

Data loading

Pandas can import from and export to many different formats: CSV, JSON, SQL, XLS, ODF, etc.

Pandas has top-level reader functions such as pandas.read_csv() that return a DataFrame. The corresponding writer methods are accessed as DataFrame.to_csv(). Here is a table containing available readers and writers.

# load a JSON file
df = pd.read_json("test.json")

# save to JSON file
df.to_json("test.json")

For the following examples we will use a dataset with information about tarot cards and images of the Rider-Waite-Smith deck in a CSV file.

Inspection

The first thing you want to do after loading a dataset is to take a quick look at what it contains. There are several methods to do that.

DataFrame.head() will show us the first 5 rows (or another number that we write between the brackets).

In order to display the cards in the DataFrame and make it so cool-looking, we had to add some lines of code for the visualization that we are not showing you here.

DataFrame.tail() will display the last 5 rows or another number that we write between the brackets.

The DataFrame.shape attribute will show the amount of rows and columns in that order.

DataFrame.sample() will show one random row —a very useful feature— or another number that we specify between the brackets.

DataFrame.info() will provide information about both indexes (rows and columns) of the whole dataset: names of columns, data types of columns (dtypes), amount of non-null values, RAM-memory usage.

The dataset has six columns. One of them is numeric (int64), and the other five are of a generic type (object), that represents a jumble of sundry types of elements, strings (str) in most cases.

DataFrame.describe() displays basic figures about numeric columns by default, even those about which it would be almost pointless to make any statistical work. Here we use the include='all' option to include string columns as well.

Selecting and filtering

The methods seen so far will give information about the whole DataFrame. Let's see now how to select just certain parts of it and create a new DataFrame.

Indexing operator [ ]

This indexing operator has many uses depending on the arguments. Here we recommend to use only the following two and forget about the other possibilities of [].

1. Selecting columns by label
As seen in the previous post, by passing only one column we will obtain a Series.

Important tip: as an alternative to using the indexing operator [] if we wanted to see values from only one column, we could access it as if it were an attribute of the DataFrame.

Passing a list of columns will retrieve a DataFrame. The order in which we arrange the selected columns can be different to the original order.

2. Selecting rows with boolean masks
In the previous part of this tutorial we did something like this to filter values:

df[ [False, True, True, False] ]

This array of True and False values is called boolean mask. According to the position of the boolean values, the mask will hide the elements that were assigned False and will show the True ones. To do this, we create a list of booleans that must have the same length than the number of rows of our DataFrame or Series. We can know its length with the function len.

In the next part of this tutorial we will see smarter and simpler ways of producing boolean masks by querying the dataset. This is just a teaser:

loc[ ] and iloc[ ]

Let's suppose that we want to filter both indexes (rows and columns) simultaneously. In the previous chapter of this tutorial we saw the use of at[] and iat[] indexers to access one value either by label or by position both for Series and DataFrame. In the case of Series, label/row position was enough, but for DataFrame we had to provide coordinates for row and column.

There exists a pair of similar indexers, loc[] (locate) and iloc[] (integer locate). The only difference with at[] and iat[] is that they allow us to access more than one value.

By design, it is not possible to combine labels with positions with any indexer, for example, position of rows and labels of columns, something that we might want to do.

For the sake of better demonstrating how to explore the DataFrame by label or by position, we are going to replace the current index (integer labels that match the order of the rows) with the column “name” by using the DataFrame.set_index() method.

Now, the previous index with integers is lost. If we had wanted to maintain it, we should have called DataFrame.reset_index() first, in order to convert it into a column.

Indexing by label with loc[ ]

To obtain the value of one single cell (a scalar), by label we use:

df.loc[row_label, column_label]

We would obtain the same result with DataFrame.at[].

If we provide the values of more than one row or more than one column, we will obtain a Series. We could not do that with DataFrame.at[].

df.loc[list_of_row_labels, column_label]

df.loc[row_label, list_of_column_labels]

Values of more than one row and more than one column will yield another DataFrame.

df.loc[list_of_row_labels, list_of_column_labels]

So far, so good. What if we want a few columns and all rows, or vice versa?

If we select rows without specifying columns, we will get all of them.

df.loc[one_row_label] # Just a Series

df.loc[list_of_row_labels] # A DataFrame

Conversely, in order to obtain some columns and all the rows, the syntax is the following.

df.loc[:, one_column_label] # Just a Series

df.loc[:, list_of_column_labels] # A DataFrame

Note the colon : in the place of the rows; this is because the indexer usually feeds on two elements but if one is missing it will assume that we are only filtering rows — the first spot in the indexer is always for rows. We need to tell the indexer that we are referring to columns somehow — the second spot in the indexer.

loc[]will even work with masks!

Indexing by position with iloc[ ]

When indexing by position, labels are ignored. This indexer has the same interface than its counterpart, so we will not add too many examples here. Where you saw row labels or columns labels in the examples for loc replace them with row positions and column positions.

Despite not being mandatory, the special colon : notation also works to select all the columns. And of course, if it applies to iloc it also does to loc.

df.iloc[one_row_position, :] # Just a Series

df.iloc[list_of_row_positions, :] # A DataFrame

Slicing

We will try to shed some light on the mysterious colon here:

df.loc[:, ['image', 'suit']]

We already saw how to use lists with loc and iloc. But if the elements that we wanted to obtain were consecutive (e.g. 4, 5, 6, 7), it would be easier to just pass the beginning and the end of such sequence by using a slice 4:7.

df.loc[first_row:last_row]

It would perform similarly with iloc.

slice does not replace list but can be used wherever lists are used within indexers for Series and DataFrame. Slices and lists can play together, for example slice the rows, list the columns.

As we saw in previous instances, specifying columns is not mandatory if we only care about rows.

IMPORTANT: When slicing with iloc, the last row does not enter the selection. That is not the case when slicing with loc.

And if the starting element is the very first label/position, it could be omitted.

Something similar goes for the final value when that value is the last one.

Guess what happens when we need a selection from the first element up to the last one.

Boom! The whole dataset. The mystery of : has been solved.

Here we summarise all what we have seen up to this point with the next example.

Lastly, bear in mind what we said before: do not mix positions with labels.

Assigning values

loc and iloc can also be used to add or change values.

Let's suppose that we want to modify erroneous information about some cards and we also want to add missing data.

The information in the column "reversed keywords" is missing for the three cards. Additionally we find that they also have wrong information in the column "suit".

Or we may want to add data of one row in two columns. We have gathered some keywords for the King of Wands.

Done!

Thanks for reaching the end. It is all for the time being. In the next part of this tutorial we will see more things that we can do with the index, such as using more than one (multi-index).

Used methods in this article

Input/output

  • read_csv()
  • to_csv()

Exploration

  • describe()
  • info()
  • shape
  • head()
  • tail()
  • sample()

Indexing

  • set_index() column to index
  • reset_index() index to column

Homework

Use the help() Python built-in function to get an insight of these unseen handy methods.

  • sort_index() sort by index
  • sort_values() sort by columns
  • rename() change row/column labels
help(pd.DataFrame.sort_index)

--

--

Matías Battocchia

I studied at Universidad de Buenos Aires. I live in Mendoza, Argentina. Interests: data, NLP, blockchain.