Lecture 4 : Data visualization with Python

April, 2022 - François HU

Master of Science - EPITA

This lecture is available here: https://curiousml.github.io/

image-4.png

Table of contents

Introduction to Matplotlib

  1. Pyplot interface
  2. Line plot
  3. Scatter plot
  4. Histogram
  5. Global customization
  6. Subplots
  7. Object-oriented interface
  8. [optional] Other plots: a quick overview

Exercises

Introduction to Matplotlib

Data visualization (or DataViz) is the process of representing data in graphical form in order to understand it (e.g. to reveal patterns, trends or correlations). Many Python packages are available for that purpose: in this notebook, we will visualize data with Matplotlib, the most widely used Python package for plotting.

Matplotlib is quite popular in Python thanks to its low-level interface, which offers a lot of freedom. Note that many higher-level data visualization packages are built on top of Matplotlib, for instance Seaborn and the plotting methods of pandas.

1. Pyplot interface

Pyplot is a submodule of matplotlib that contains a collection of functions for creating and modifying figures. Pyplot is well suited to basic graphs such as line charts, bar charts, histograms and many more. All pyplot commands act on the same current figure, so the state (i.e., the figure) is preserved across successive function calls (i.e., the functions that modify the figure).

As usual, one can install the package Matplotlib with the command

pip install matplotlib

in the Anaconda prompt. Once the package is installed, one can import its submodule pyplot under the conventional alias plt with the command

import matplotlib.pyplot as plt

In a notebook, the magic command %matplotlib configures the backend that Matplotlib uses to draw figures. It performs a number of steps that prepare the display of the figure. It is recommended here to use it with the argument inline, which indicates that figures are rendered directly inside the notebook. This directive should be placed at the very beginning of the notebook, even before the package imports.
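As a minimal sketch, a first notebook cell could look like the following. The magic command is left as a comment here because it is IPython syntax rather than plain Python; printing the version is only an optional sanity check.

```python
# %matplotlib inline   # IPython magic: uncomment when running inside a Jupyter notebook

import matplotlib
import matplotlib.pyplot as plt  # conventional alias used throughout this lecture

# Optional sanity check: which version of Matplotlib is installed?
print(matplotlib.__version__)
```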

2. Line plot

A line plot is a graph that uses lines to connect individual data points. It displays quantitative values over a specified interval, which is particularly useful to visualize a (mathematical) function (e.g. sine, cosine, exponential or our own function). Let us plot a sine curve. For that purpose we will use one of the most popular functions in matplotlib.pyplot: plot (see the documentation for more information).

Toy example
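A minimal sketch of such a toy example, assuming the sine curve is drawn on the interval $[0, 2\pi]$ with 100 sample points (both values are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced points on [0, 2*pi]
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Draw the curve on the current figure
plt.plot(x, y);
```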

At the end of the last line, we add ; in order to suppress the textual output that the notebook would otherwise display below the cell.

Simple customization

We can customize (non-exhaustive, see documentation for more details):
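The sketch below illustrates a few commonly used keyword arguments of plot. The particular selection (color, linestyle, linewidth, marker, markersize) is an assumption made for illustration and is not necessarily the list intended here.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)

# Dashed red sine curve, slightly thicker than the default
plt.plot(x, np.sin(x), color="red", linestyle="--", linewidth=2)

# Green cosine curve with a small circular marker at every data point
plt.plot(x, np.cos(x), color="green", marker="o", markersize=4);
```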

3. Scatter plot

Instead of curves, we may want to draw a scatter plot: a graph in which the values of two variables are plotted as points along two axes.

Toy example

Let us first generate random points in the space $[0,1] \times [0,1]$
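One possible way to draw such points uses NumPy's random generator. The number of points (100) and the seed are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so that the example is reproducible
n_points = 100

# Coordinates drawn uniformly at random from [0, 1)
x = rng.random(n_points)
y = rng.random(n_points)
```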

In Matplotlib we can use the scatter method for creating a scatter plot
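A minimal sketch with pyplot's scatter function; the points are regenerated here (100 points, arbitrary seed) so that the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x, y = rng.random(100), rng.random(100)

# One marker per (x, y) pair
points = plt.scatter(x, y)
```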

Simple customization

The scatter plot can be customized. For instance we can customize (non-exhaustive, see documentation for more details):
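As an illustration, the sketch below varies the marker size (s), color (c with a colormap cmap), transparency (alpha) and shape (marker). Which options were originally listed here is an assumption; the sizes and colors are made-up values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x, y = rng.random(100), rng.random(100)

sizes = 200 * rng.random(100)  # marker areas, in points^2
colors = x + y                 # one scalar per point, mapped to a color by cmap

points = plt.scatter(x, y, s=sizes, c=colors, cmap="viridis", alpha=0.6, marker="o")
plt.colorbar(points, label="x + y");  # color scale next to the plot
```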

4. Histogram

A histogram is a graphical display of numerical data that shows the number of data points falling within specified ranges of values (called "bins").

In Matplotlib we can create a histogram with the hist function.

Toy example
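A minimal sketch, assuming the data are 1000 samples from a standard normal distribution (an illustrative choice). By default hist uses 10 bins and returns the counts, the bin edges and the drawn bars.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)

# counts[i] is the number of samples in [bin_edges[i], bin_edges[i+1])
counts, bin_edges, patches = plt.hist(data)
```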

Simple customization

For histograms, we can customize (non-exhaustive, see documentation for more details):
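The following sketch shows a few frequently used options: the number of bins, the fill color, the edge color and the transparency. This selection is an assumption made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
data = rng.normal(size=1000)

counts, bin_edges, patches = plt.hist(
    data,
    bins=30,            # number of equal-width bins
    color="steelblue",  # fill color of the bars
    edgecolor="black",  # outline color of the bars
    alpha=0.7,          # transparency (0 = invisible, 1 = opaque)
)
```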

We can also make hist return a probability density instead of the raw counts with the argument density=True.
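With density=True the bar heights are rescaled so that the total area of the bars equals 1. The sketch below checks this numerically on illustrative normal samples.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
data = rng.normal(size=1000)

density, bin_edges, patches = plt.hist(data, bins=30, density=True)

# Area of the histogram = sum over bins of (height * bin width)
area = np.sum(density * np.diff(bin_edges))
print(np.isclose(area, 1.0))
```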

5. Global customization

More generally, every figure can be customized. In a nutshell, one can customize (non-exhaustive) the axis labels and the title, the legend, the figure size and the Axes.

In pyplot, the functions that draw a graph or edit a label apply by default to the current state (the most recently created figure or Axes). As a consequence, you must write your code as one sequence of instructions (for example, do not split instructions that refer to the same graph across two different notebook cells).

Remark: here, let us use the sine and cosine plots of Section 2 as a toy example.

Adding axis labels and a title
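A minimal sketch using xlabel, ylabel and title on the sine curve; the label texts are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))

plt.xlabel("x")            # label of the horizontal axis
plt.ylabel("y")            # label of the vertical axis
plt.title("Sine curve");   # title of the current Axes
```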

Adding another line
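Calling plot several times in the same cell draws on the same Axes, as in this short sketch with the sine and cosine curves:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x))   # first curve
plt.plot(x, np.cos(x));  # second curve, added to the same Axes
```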

Adding a legend
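One common pattern, sketched below, is to give each curve a label and then call legend; the location "upper right" is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")

# The legend entries are taken from the label of each curve
legend = plt.legend(loc="upper right")
```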

Customize the figure size
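The figure size can be set when the figure is created, through the figsize argument (width and height in inches). The values below are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Create a new, wide figure before plotting: 10 inches wide, 3 inches high
fig = plt.figure(figsize=(10, 3))

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x));
```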

Customize the Axes
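The sketch below adjusts the axis limits, the tick positions and labels, and adds a grid. This choice of settings is an assumption about what "customizing the Axes" covers here.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))

plt.xlim(0, 2 * np.pi)   # range shown on the horizontal axis
plt.ylim(-1.5, 1.5)      # range shown on the vertical axis

# Custom tick positions with custom labels
plt.xticks([0, np.pi, 2 * np.pi], ["0", r"$\pi$", r"$2\pi$"])

plt.grid(True);          # light grid behind the curve
```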

6. Subplots

Let us clarify the differences between several terms that will be used later. We call:

image.png
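In Matplotlib's own terminology, a Figure is the whole drawing, an Axes is one plotting region inside the figure, and an Axis is one of the graduated lines (x or y) of an Axes. The short sketch below, which is an illustration rather than part of the original material, inspects these three kinds of objects:

```python
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
from matplotlib.axes import Axes
from matplotlib.axis import Axis

fig, ax = plt.subplots()  # one Figure containing one Axes

print(isinstance(fig, Figure))     # the whole drawing
print(isinstance(ax, Axes))        # one plotting region of the figure
print(isinstance(ax.xaxis, Axis))  # the x-axis of that region
print(fig.axes == [ax])            # the figure keeps a list of its Axes
```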

Up until now, we have seen a single figure containing a single Axes. Let us add more Axes to the current figure. Matplotlib offers many ways to do so. We present one of the best-known approaches (among many others), based on the pyplot interface:

Subplots with the pyplot interface

We can create a figure with subplots using the function subplot of pyplot:

subplot(nrows, ncols, index, **kwargs)

Calling this function automatically creates a figure (if none exists yet) laid out as an nrows × ncols grid, and adds the Axes at the index-th position of that grid, which becomes the current Axes. The index starts at 1 and increases row by row.
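A minimal sketch with one row and two columns, plotting the sine curve on the left and the cosine curve on the right (titles and colors are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.subplot(1, 2, 1)            # 1 row, 2 columns: select position 1 (left)
plt.plot(x, np.sin(x))
plt.title("sin")

plt.subplot(1, 2, 2)            # select position 2 (right)
plt.plot(x, np.cos(x), color="red")
plt.title("cos");
```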

Important remark: thus far we have seen pyplot-based approaches to plotting. These have the advantage of being quick and easy to write. However, when a plot becomes more complex, it is recommended to use the object-oriented interface.

7. Object-oriented interface

Instead of relying on the implicit state of the submodule pyplot, we can create the Figure and the set of Axes as explicit objects: we call this the object-oriented (OO) approach. It is a more robust and more customizable way of plotting. Indeed, these objects (figure and axes) are stored and can be used or modified even after they have been displayed.

A cleaner way to set up your figure is as follows:
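One common setup, sketched here under the assumption that plt.subplots is used (as in Exercise 2), creates the Figure and a single Axes in one call and then configures the Axes through its methods:

```python
import matplotlib.pyplot as plt

# One Figure containing a single Axes; figsize is an illustrative choice
fig, ax = plt.subplots(figsize=(6, 4))

# Customization goes through methods of the Axes object
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("An empty Axes");
```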

Plotting a curve
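With the OO interface, the curve is drawn by the plot method of the Axes, as in this minimal sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x), color="green")  # plot returns a list of Line2D objects
ax.set_title("Sine curve");
```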

Scatter plot
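The same random points as in Section 3 can be drawn with the scatter method of the Axes (100 points and the seed are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
x, y = rng.random(100), rng.random(100)

fig, ax = plt.subplots()
points = ax.scatter(x, y, alpha=0.7)
ax.set_xlabel("x")
ax.set_ylabel("y");
```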

Plotting histograms
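A sketch that overlays two histograms on the same Axes with the hist method. The two normal samples are made-up illustrative data, and alpha keeps both histograms visible where they overlap.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
data_a = rng.normal(loc=0.0, scale=1.0, size=1000)
data_b = rng.normal(loc=2.0, scale=0.5, size=1000)

fig, ax = plt.subplots()
counts_a, edges_a, _ = ax.hist(data_a, bins=30, alpha=0.5, label="a")
counts_b, edges_b, _ = ax.hist(data_b, bins=30, alpha=0.5, label="b")
ax.legend();
```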

Important remark: since the figure is an object, we can display it again (or even update it) after the cell that created it has been executed.
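The sketch below mimics two notebook cells: the first creates the figure, the second updates it and displays it again. In a notebook, evaluating fig as the last expression of a cell shows the figure; in a plain script that line has no visible effect.

```python
import numpy as np
import matplotlib.pyplot as plt

# --- first cell: create the figure ---
x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")

# --- a later cell: update the stored objects ---
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_title("Updated figure")
ax.legend()

fig  # in a notebook, this re-displays the updated figure
```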

Subplots with the object-oriented approach

We present here two approaches to do subplots:

  1. with the function subplots
  2. with the method add_subplot of matplotlib.figure.Figure
add_subplot(nrows, ncols, index, **kwargs)
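The two approaches can be sketched side by side as follows. With subplots, all Axes are created at once and returned as a NumPy array; with add_subplot, each Axes is added to an existing figure one at a time. Figure sizes and titles are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# Approach 1: create the figure and a 1 x 2 grid of Axes in one call
fig1, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
axes[0].plot(x, np.sin(x))
axes[0].set_title("sin")
axes[1].plot(x, np.cos(x))
axes[1].set_title("cos")

# Approach 2: create an empty figure, then add the Axes one by one
fig2 = plt.figure(figsize=(8, 3))
ax_left = fig2.add_subplot(1, 2, 1)   # same (nrows, ncols, index) convention
ax_left.plot(x, np.sin(x))
ax_right = fig2.add_subplot(1, 2, 2)
ax_right.plot(x, np.cos(x));
```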

8. [optional] Other plots: a quick overview

Pie plot (or pie chart)
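A minimal sketch with the pie method of an Axes. The categories and shares are made-up values used only for illustration.

```python
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]   # hypothetical categories
shares = [50, 30, 20]      # hypothetical values, here summing to 100

fig, ax = plt.subplots()
ax.pie(shares, labels=labels, autopct="%1.1f%%")  # autopct prints the percentage on each wedge
ax.set_title("Pie chart");
```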

Bar plot
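A minimal sketch with the bar method of an Axes, again on made-up illustrative values:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]  # hypothetical categories
values = [5, 3, 8]            # hypothetical values

fig, ax = plt.subplots()
bars = ax.bar(categories, values, color="steelblue")  # one vertical bar per category
ax.set_ylabel("value");
```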

3D plot
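A possible sketch of a 3D surface plot: an Axes with a 3D projection is requested through the projection argument of add_subplot. The surface $z = \sin(\sqrt{x^2 + y^2})$ is an arbitrary example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Regular grid on [-3, 3] x [-3, 3]
grid = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(grid, grid)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")         # Axes with a 3D projection
surface = ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y");
```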

Exercises

Exercise 1: line plot

  1. In the same Axes, plot the sine curve (in green) and the logarithmic curve (in red) in the interval $(0, 8]$

  2. then, in the same figure, color in steelblue the area between the sine and the logarithmic curve.

You should have the following graph:

image-3.png

Exercise 2: scatter plot

The following code plots a circle:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Parametrize the circle by its angle theta
theta = np.linspace(0, 2*np.pi, 100)
r = np.sqrt(1.0)  # radius of the circle
x1 = r*np.cos(theta)
x2 = r*np.sin(theta)

ax.plot(x1, x2)
ax.set_aspect(1);  # same scale on both axes, so the circle is not distorted

Generate 500 random points in the space $[-1, 1]\times[-1, 1]$ such that:

You should have (approximately) the following figure

image-2.png

Exercise 3: subplots

Plot side by side the graphs produced in Exercises 1 and 2.

You should have:

image-2.png