Lecture 5: Data manipulation in Python

April, 2022 - François HU

Master of Science - EPITA

This lecture is available here: https://curiousml.github.io/


Table of contents

General introduction

  1. Introduction to DataFrames
  2. Data representation
  3. Data manipulation
  4. Concatenating dataframes
  5. Descriptive statistics with Pandas
  6. Data visualization with Pandas

Exercises

General introduction (a little long)

Data can be represented in various forms: txt, csv, xls (Excel), json, and so on. In Python, for a given extension (.txt for example), there are many suitable modules for importing data. For "classical" files such as txt files, Python has some useful built-in commands for importing and handling them: for example, we can open a txt file in write or read mode with the command open.

Write and add mode

Information is always written in the form of strings and always appended at the end of the file, which grows until all the information has been written. Writing always follows the same scheme:

  1. creation or opening of the file: when the file is opened, the file in which the information will be written is created if it does not exist, or truncated (emptied) if it already exists;
  2. writing, with the write method of the file object f (a TextIOWrapper);
  3. closing: closing allows other programs to read what you have placed in this file.

Read mode

Reading a file retrieves the stored information. It follows the same principle, namely:

  1. opening the file in read mode;

  2. reading directly iterating over the file object or using the readlines method;

  3. closing.

However, there is a difference when reading a file: it is done line by line, whereas writing does not necessarily follow a line-by-line division.
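The three steps above can be sketched as follows; a small sample file is created first so the example is self-contained (the file name sample.txt is arbitrary):

```python
# Create a small sample file so the example is self-contained
with open("sample.txt", "w") as f:
    f.write("first line\nsecond line\n")

# Read mode: open the file, iterate over it line by line, close it
# (the closing is handled automatically by `with`)
with open("sample.txt") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)  # ['first line', 'second line']
```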

Remark: the with statement handles the opening and the closing automatically. Alternatively (although not recommended) we can write, for write mode:

f = open("file_name.txt", "w")  # opening
...                             # writing
...                             # writing
f.close()                       # closing
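For comparison, the recommended with form for write mode (the file name and contents are illustrative):

```python
# `with` opens the file and guarantees it is closed,
# even if an error occurs while writing
with open("file_name.txt", "w") as f:
    f.write("one line of text\n")
    f.write("another line\n")
# the file is closed here, no explicit f.close() needed

# Read it back to check what was written
with open("file_name.txt") as f:
    content = f.read()
```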

External packages

With the above Python built-in tools, importing and manipulating more "complex" types of data quickly becomes cumbersome. For instance, let us import a csv file with the above method and store the values in a list. You can download the iris dataset here. The iris dataset is one of the best-known toy datasets in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

As you can see, each line is returned as a single string, so we end up handling string objects instead of the values we want. In this case it is recommended to use external packages.
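A minimal sketch of the problem, using a two-line stand-in for the iris file (the file name mini_iris.csv is an assumption):

```python
# Write a tiny CSV so the example is self-contained
with open("mini_iris.csv", "w") as f:
    f.write("5.1,3.5,1.4,0.2,Iris-setosa\n")
    f.write("4.9,3.0,1.4,0.2,Iris-setosa\n")

# Reading with built-in tools only: every line comes back as one string
with open("mini_iris.csv") as f:
    rows = [line.rstrip("\n") for line in f]

print(type(rows[0]))  # <class 'str'>

# We still have to split and convert each field by hand
values = [float(x) for x in rows[0].split(",")[:4]]
```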

1. Introduction to DataFrames

This lecture explores how to represent and manipulate data, and more precisely datasets. Simply put, a dataset is just a collection of data, often represented as a table where:

The most well-known package in Python for efficiently handling data as a two-dimensional table is pandas, which provides a container for tables, called DataFrame.

The main features of Pandas and its dataframe are:

Like always, in a terminal (e.g. anaconda prompt), you can install the package pandas with the command:

pip install pandas

Note that pandas is conventionally imported under the alias pd (import pandas as pd).

Below you will find the main differences between list, array and dataframe:

[Comparison table of list vs. array vs. dataframe — image not reproduced]

2. Data representation

Reading a dataframe

An existing dataset can be read into a dataframe with the function read_csv (see the documentation for more details).

Creating a dataframe

There are several ways to create a dataframe from scratch:

  1. specify the data (a Python list or numpy array), the index and the columns
  2. specify feature by feature (columns) thanks to a dictionary
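Both ways can be sketched as follows (the column names SepalLength and SepalWidth, the row labels, and the values are illustrative):

```python
import pandas as pd

# Way 1: data + index + columns
df1 = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0]],
    index=["flower_1", "flower_2"],
    columns=["SepalLength", "SepalWidth"],
)

# Way 2: a dictionary, one entry per column (feature)
df2 = pd.DataFrame({
    "SepalLength": [5.1, 4.9],
    "SepalWidth": [3.5, 3.0],
})
```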

As with dictionaries, it is possible to add a new column named column_name holding values values:

df[column_name] = values

Viewing a dataframe

Instead of viewing the whole table, pandas provides several methods for peeking at it.

Concerning the table's information, the dataframe object has many useful attributes:

Or in a more compact way, the method:

In particular, info indicates the categorical variables (which are not treated by describe).

One can also sort the rows according to their index labels or to a column's values.
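The viewing and sorting tools above can be sketched on a small stand-in dataframe (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical toy dataframe standing in for iris
df = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.3],
    "SepalWidth": [3.5, 3.0, 3.3],
    "Species": ["setosa", "setosa", "virginica"],
})

print(df.head(2))      # sneak at the first rows
print(df.shape)        # (3, 3): number of rows and columns
df.info()              # compact summary, incl. the categorical Species column
stats = df.describe()  # summary statistics of the numeric columns only

by_length = df.sort_values("SepalLength")  # sort rows by a column's values
by_label = df.sort_index(ascending=False)  # sort rows by their index labels
```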

3. Data manipulation

Let us consider the iris dataset as our toy dataset for this section. As a reminder:

Updating row and column labels

It is possible to rename the row and column labels.

The DataFrame object has the attribute columns. We can reassign it easily with a list.

For the row labels, the pandas DataFrame object offers many methods for updating it (see documentation with help command for more details about input arguments):

and finally, as with columns, one can reassign the index attribute of the DataFrame object.
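A short sketch of both approaches (the labels used are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[5.1, 3.5], [4.9, 3.0]], columns=["a", "b"])

# Reassign the columns attribute with a list
df.columns = ["SepalLength", "SepalWidth"]

# rename updates selected row labels through a mapping
# (it returns a new dataframe)
df = df.rename(index={0: "flower_1", 1: "flower_2"})

# Reassigning the index attribute also works
df.index = ["f1", "f2"]
```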

Natural indexing

Like previous data structures (e.g. Python lists or numpy arrays), natural indexing is performed with []. This indexes the columns of "dataframes" and the rows of "series".

Series is the data structure for a single column of a DataFrame: a DataFrame is actually stored in memory as a collection of Series.

You may want to extract several columns or several rows.

Remark: selecting with [[]] always returns a dataframe.
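The natural-indexing rules above can be sketched as (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "SepalLength": [5.1, 4.9],
    "SepalWidth": [3.5, 3.0],
})

col = df["SepalLength"]                   # single [] on a dataframe -> a Series
sub = df[["SepalLength"]]                 # double [[]] -> still a DataFrame
both = df[["SepalLength", "SepalWidth"]]  # several columns at once
first = col[0]                            # [] on a Series indexes its rows
```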

Label based indexing and slicing: method .loc[]

Label based indexing is an enhancement of natural indexing, accessible with .loc[]. Indexing should be thought of as matrix indexing, but with labels instead of positions. Hence, the rows are indexed first (instead of the columns with []).
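A minimal sketch of .loc[], with illustrative row and column labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0]},
    index=["flower_1", "flower_2"],
)

# rows first, then columns, all by label
one = df.loc["flower_1", "SepalWidth"]  # a single value
row = df.loc["flower_2"]                # a whole row as a Series

# label slices include both endpoints
block = df.loc["flower_1":"flower_2", ["SepalLength"]]
```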

Position based indexing and slicing: method .iloc[]

Integer location (or position) based indexing is done with .iloc[]. It is similar to .loc[] but uses integer positions instead of labels.

Remark: endpoints are not included (similarly to numpy arrays).
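A minimal sketch of .iloc[] (labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"SepalLength": [5.1, 4.9, 6.3], "SepalWidth": [3.5, 3.0, 3.3]},
    index=["f1", "f2", "f3"],
)

val = df.iloc[0, 1]     # row 0, column 1
rows = df.iloc[0:2]     # endpoint excluded: rows 0 and 1 only
first_col = df.iloc[:, :1]  # all rows, first column only
```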

Boolean indexing and slicing

Similarly to Numpy arrays, dataframes can be indexed with Boolean variables thanks to .loc[].

The isin method enables a selection through an existence (membership) condition:

Remark: it is possible to select rows at random with the method sample.

Adding and deleting items

Let us consider a copy of the first 10 rows of iris.

Adding a column:

Adding a row:

Deleting rows and columns
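The additions and deletions above can be sketched on a small stand-in dataframe (labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"SepalLength": [5.1, 4.9], "SepalWidth": [3.5, 3.0]})

df["Species"] = ["setosa", "setosa"]  # add a column
df.loc[2] = [6.3, 3.3, "virginica"]   # add a row under a new label

df = df.drop(columns=["SepalWidth"])  # delete a column
df = df.drop(index=[0])               # delete a row
```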

4. Concatenating dataframes

Thanks to the function concat of pandas, it is easy to concatenate pandas objects along a particular axis.

Remark: as always (see lecture on scientific computing), axis=0 is for index and axis=1 is for columns

Let us concatenate df (a copy of the first 10 rows of iris) with the following dataframes:
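The dataframes used in the original notebook are not reproduced here; a minimal stand-in sketch of both directions of concatenation:

```python
import pandas as pd

df = pd.DataFrame({"SepalLength": [5.1, 4.9]})
extra_rows = pd.DataFrame({"SepalLength": [6.3]})
extra_col = pd.DataFrame({"Species": ["setosa", "setosa"]})

# axis=0: stack along the index (rows); ignore_index renumbers the rows
stacked = pd.concat([df, extra_rows], axis=0, ignore_index=True)

# axis=1: stack along the columns
side = pd.concat([df, extra_col], axis=1)
```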

5. Descriptive statistics

A dataframe comes with many methods for descriptive statistics (a non-exhaustive list):

Remark: we can find these methods on numpy arrays as well.

Let us study only the dataframe df with the first 4 columns.

Aggregation: compute a summary statistic for each group. Some examples:

The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:

SELECT mean(col1), min(col2), max(col3), median(col4), max(col4) - min(col4), col5
FROM Table
GROUP BY col5

We can do these aggregations with pandas:
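A minimal sketch of the pandas counterpart of the SQL query above, on a small stand-in dataframe (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.3, 5.8],
    "Species": ["setosa", "setosa", "virginica", "virginica"],
})

# SELECT mean(...), min(...), max(...) FROM df GROUP BY Species
summary = df.groupby("Species")["SepalLength"].agg(["mean", "min", "max"])
print(summary)
```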

6. Data visualization with Pandas

A dataframe also comes with many methods for data visualization (see lecture 5). These methods are based on the package matplotlib and therefore the customization of lecture 5 can be applied here (see documentation for more details).

Here, we illustrate just a few of them:

Line plot

Histogram

Scatter plot
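The three plot types can be sketched as follows on a small stand-in dataframe (values are illustrative; the non-interactive Agg backend is selected so the sketch runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window needed
import pandas as pd

df = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.3, 5.8],
    "SepalWidth": [3.5, 3.0, 3.3, 2.7],
})

ax_line = df.plot()                            # line plot, one line per column
ax_hist = df["SepalLength"].plot.hist(bins=3)  # histogram of one column
ax_scatter = df.plot.scatter(x="SepalLength", y="SepalWidth")  # scatter plot
```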

Exercises

Exercise 1:

[Exercise 1 statement provided as images — not reproduced]

Exercise 2:

With the dataset imported in Exercise 1, generate the following figure:

[target figure — image not reproduced]

Exercise 3: Additional plots

Using the dataset iris_plus previously generated, reproduce the following figures:

[target figures — images not reproduced]

Exercise 4:

From the dataset iris (or iris_plus), find the average values of SepalLength and SepalWidth for each of the three species.