*April, 2022 - François HU*

*Master of Science - EPITA*

*This lecture is available here: https://curiousml.github.io/*

Data visualization (or DataViz) is the process of representing data graphically in order to understand it (e.g. to spot patterns, trends or correlations). Many Python packages serve that purpose; in this notebook we will visualize data with Matplotlib, the most widely used Python package for plotting.

Matplotlib is quite popular in Python thanks to its low-level interface, which offers a lot of freedom. Note that several "advanced" data visualization tools are built on top of Matplotlib. For instance:

- Seaborn for statistical data visualization (see the next lecture)
- the plotting methods of Pandas DataFrames (see the next lecture)

Other packages, such as Plotly for creating interactive plots, are independent of Matplotlib.

**Pyplot** is a submodule of Matplotlib that contains a collection of functions for creating and modifying figures. Pyplot is well suited to basic graphs such as line charts, bar charts and histograms. It is stateful: all pyplot commands act on the same current figure, so the state (i.e., the figure) is preserved across successive function calls (i.e., the functions that modify the figure).

We will see how the **pyplot interface** works in sections 3, 4, 5 and 6. In sections 7 and 8 we will see another interface: the **object-oriented interface**. This interface is generally more flexible than the pyplot interface.
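As a quick preview of the difference between the two interfaces, here is a minimal sketch that draws the same sine curve twice. The variable names (`x_demo`, `fig`, `ax`) are only illustrative choices, not names used elsewhere in this lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.linspace(0, 2 * np.pi, 100)  # illustrative sample points

# pyplot interface: each call acts on the implicit "current" figure and axes
plt.figure()
plt.plot(x_demo, np.sin(x_demo))
plt.title("pyplot interface")

# object-oriented interface: we keep explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(x_demo, np.sin(x_demo))
ax.set_title("object-oriented interface")
```

In the second version every call names the object it modifies, which is what makes that interface easier to manage once a figure holds several plots.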

As usual, one can install Matplotlib with the command `pip install matplotlib` in the Anaconda prompt. Once the package is installed, one can import its submodule `pyplot` under the commonly used alias `plt` with the command:

In [1]:

```
import matplotlib.pyplot as plt
```

In a notebook, the magic command `%matplotlib` configures the backend that Matplotlib uses to draw figures. It is recommended here to use it with the argument `inline`, which displays each figure directly in the notebook, below the cell that produced it. This directive should be placed at the very beginning of your notebook, before the package imports.

In [2]:

```
%matplotlib inline
import matplotlib.pyplot as plt # for plotting
import numpy as np # for array
```

A line plot is a graph that uses lines to connect individual data points. It displays quantitative values over a specified interval, which is particularly useful for visualizing a (mathematical) function (e.g. sine, cosine, exponential or our own function). Let us plot a sine curve. For that purpose we will use one of the most popular functions in `matplotlib.pyplot`: `plot` (see the documentation for more information).

In [3]:

```
# our toy example
x = np.arange(0, 2*np.pi, 0.1) # points on the horizontal axis (x axis): from 0 to 2 pi (excluded), in steps of 0.1
y1 = np.sin(x) # sine values on the vertical axis (y axis)
y2 = np.cos(x) # cosine values on the vertical axis (y axis)
```

In [4]:

```
plt.plot(x, y1); # plot the sine curve. The chosen interval is [0, 2 pi)
# plt.show() # this line is used in IDEs for showing the plot. We don't need it in notebooks
```

In [5]:

```
plt.plot(x, y2); # plot the cosine curve. The chosen interval of analysis is [0, 2 pi)
# plt.show() # this line is used in IDEs for showing the plot. We don't need it in notebooks
```

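At the end of the last line of a cell, we add `;` in order to suppress the textual output (the value returned by the last function call) that the notebook would otherwise display above the figure.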

We can customize (non-exhaustive, see documentation for more details):

- the color of the line;
- the "alpha" (degree of transparency) of the line;
- the style of the line and its width.

In [6]:

```
# changing linestyle, color and linewidth
plt.plot(x, y1, linestyle='dashed', color="red", linewidth=10);
```

In [7]:

```
# changing linestyle, color and linewidth
plt.plot(x, y2, linestyle='dashdot', color='blue', linewidth=2);
```
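The "alpha" parameter listed above is not used in the two previous cells. The following sketch, with illustrative names (`x_demo`, `sin_line`, `cos_line`), overlays a thick half-transparent sine curve and a thin cosine curve. It also uses the format-string shorthand of `plot`, where `"g--"` stands for a green dashed line.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.arange(0, 2 * np.pi, 0.1)  # illustrative sample points

# thick, mostly transparent sine curve (alpha=0 is invisible, alpha=1 is opaque)
(sin_line,) = plt.plot(x_demo, np.sin(x_demo), color="red", linewidth=10, alpha=0.3)

# "g--" is a shorthand for color="green" and linestyle="dashed"
(cos_line,) = plt.plot(x_demo, np.cos(x_demo), "g--", linewidth=2)
```

Since `plot` returns a list of line objects, unpacking it with `(sin_line,) = ...` gives direct access to the line that was just drawn.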

In [8]:

```
# generating random points in the space [0, 1] x [0, 1]
import numpy as np
n = 50
x_scatter = np.random.rand(n)
y_scatter = np.random.rand(n)
```

In Matplotlib we can use the `scatter` function to create a scatter plot:

In [9]:

```
plt.scatter(x_scatter, y_scatter);
```

The scatter plot can be customized. For instance we can customize (non-exhaustive, see documentation for more details):

- the color of the points;
- the "alpha" (degree of transparency) of the points;
- the marker style and its size.

In [10]:

```
plt.scatter(x_scatter, y_scatter, c="red", alpha=0.4, marker="o", s=200);
```

In [11]:

```
size = np.random.rand(n)
size = np.exp(size) * 200
plt.scatter(x_scatter, y_scatter, c="green", alpha=0.4, marker="o", s=size);
```
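Just as `s` accepts one size per point, `c` accepts one numerical value per point, which is then mapped to a color through a colormap. The sketch below is one possible illustration: the seeded generator, the variable names and the choice of coloring by distance to the origin are all assumptions made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded generator, so the figure is reproducible
n_points = 50
x_pts = rng.random(n_points)
y_pts = rng.random(n_points)

# color each point by its distance to the origin, using the "viridis" colormap
distance = np.sqrt(x_pts**2 + y_pts**2)
points = plt.scatter(x_pts, y_pts, c=distance, cmap="viridis", alpha=0.8, s=100)
plt.colorbar(points, label="distance to origin");  # color scale next to the plot
```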

A **histogram** is a graphical display of numerical data by showing the number of data points that fall within a specified range of values (called "bins").

In Matplotlib we can create a histogram using the `hist` function:

In [12]:

```
import numpy as np
n = 5000 # number of points
mu = 100 # mean of distribution
sigma = 15 # standard deviation of distribution
sample = mu + sigma * np.random.randn(n) # we just have some random points around mu=100 and with a deviation of sigma = 15
plt.hist(sample);
```
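To make the notion of bins concrete, here is a small sketch on a hand-made data set (the array `values` is a made-up example). Besides drawing the bars, `hist` returns the count in each bin and the edges of the bins, so we can check the binning by hand.

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])  # made-up data

# 4 equal-width bins between min=1 and max=9: [1, 3), [3, 5), [5, 7), [7, 9]
counts, bin_edges, patches = plt.hist(values, bins=4)

print(counts)     # [3. 5. 1. 1.]
print(bin_edges)  # [1. 3. 5. 7. 9.]
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.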

For histograms, we can customize (non-exhaustive, see documentation for more details):

- the number of equal-width bins;
- the "alpha" (degree of transparency) of the bars;
- the color of the bars.

We can also make `hist` return a probability density instead of the raw counts with the argument `density=True`:

In [13]:

```
import numpy as np
n = 5000
mu = 100
sigma = 15
sample = mu + sigma * np.random.randn(n)
num_bins = 50
plt.hist(sample, num_bins, density=True, facecolor='red', alpha=0.2);
```
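With `density=True` the bar heights are rescaled so that the total area of the bars equals 1. The following sketch checks this numerically; the seeded generator and the names `sample_demo`, `heights`, `edges` and `area` are choices made for this illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded generator, so the result is reproducible
sample_demo = 100 + 15 * rng.standard_normal(5000)

heights, edges, _ = plt.hist(sample_demo, bins=50, density=True, alpha=0.2)

# total area = sum over bins of (bar height x bin width)
area = np.sum(heights * np.diff(edges))
print(area)  # equal to 1 up to floating-point rounding
```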

More generally, every figure can be customized. In a nutshell, one can customize (non-exhaustive):

- axes labels and title;
- legend;
- additional information (e.g. texts, arrows, another curve, ...);
- overall look of your matplotlib plot (e.g. size of the figure, having subfigures, ...).

In pyplot, the functions that draw a graph or edit a label apply by default to the current state (the most recently created figure or axes, for example). As a consequence, you must write your code as a sequence of instructions: for example, do not split instructions that refer to the same graph across two different notebook cells.

**Remark:** here, we reuse the sine and cosine plots of section 2 as a toy example.
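The "current state" can be observed directly. In this minimal sketch (the names `fig_a` and `fig_b` are illustrative), two figures are created in a row; the pyplot calls that follow only affect the most recent one, which `plt.gcf()` ("get current figure") returns.

```python
import matplotlib.pyplot as plt

fig_a = plt.figure()
fig_b = plt.figure()  # the most recently created figure becomes the current one

plt.plot([0, 1], [0, 1])     # drawn on fig_b, not on fig_a
plt.title("drawn on fig_b")

print(plt.gcf() is fig_b)    # True
print(len(fig_a.axes))       # 0: fig_a was left empty
```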

In [14]:

```
plt.plot(x, y1)
plt.grid()
plt.xlabel("x") # horizontal label
plt.ylabel("sin(x)") # vertical label
plt.title("Sine curve"); # title of the figure
```

In [15]:

```
plt.plot(x, y1)
plt.plot(x, y2) # adding cosine curve
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sine and cosine curves");
```

In [16]:

```
# adding a legend thanks to the argument `label` of plot and the function `legend` of pyplot
plt.plot(x, y1, label="sin") # adding a label for the legend
plt.plot(x, y2, label="cos") # adding a label for the legend
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sine and cosine curves")
plt.legend(); # adding legend
```
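Additional information such as a text with an arrow can be added with the pyplot function `annotate`. The sketch below marks the maximum of the sine curve; the text position `(3, 0.8)` and the variable names are arbitrary choices for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.arange(0, 2 * np.pi, 0.1)  # illustrative sample points
plt.plot(x_demo, np.sin(x_demo), label="sin")

# arrow pointing at the maximum of the sine curve, located at (pi/2, 1)
note = plt.annotate(
    "maximum",
    xy=(np.pi / 2, 1),                # point being annotated (arrow head)
    xytext=(3, 0.8),                  # position of the text (arrow tail)
    arrowprops={"arrowstyle": "->"},  # draw a simple arrow
)
plt.legend();
```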

In [17]:

```
# choosing the size of the figure
plt.figure(figsize=(15, 5)) # where figsize sets width x height in inches
plt.plot(x, y1, label="sin")
plt.plot(x, y2, label="cos")
plt.grid()
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sine and cosine curves")
plt.legend();
```
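Finally, a figure can be split into several subfigures with the pyplot function `subplot`. This is a minimal sketch with illustrative names: `plt.subplot(1, 2, k)` selects panel `k` of a grid with 1 row and 2 columns, and the pyplot calls that follow apply to that panel.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.arange(0, 2 * np.pi, 0.1)  # illustrative sample points
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)  # grid of 1 row x 2 columns, select panel 1 (left)
plt.plot(x_demo, np.sin(x_demo))
plt.title("Sine curve")

plt.subplot(1, 2, 2)  # same grid, select panel 2 (right)
plt.plot(x_demo, np.cos(x_demo))
plt.title("Cosine curve")

panels = plt.gcf().axes  # list of the two axes created above
```

Because each `subplot` call changes the current axes, all the instructions for one panel must come before the call that selects the next panel.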