April, 2022 - François HU
Master of Science - EPITA
This lecture is available here: https://curiousml.github.io/
Data can be represented in various forms: txt, csv, xls (Excel), json, etc. In Python, given a specific extension (.txt for example), many modules are available for importing the data. For "classical" files such as txt files, Python has useful built-in commands for importing and handling them: for instance, we can open a txt file in write or read mode with the command open.
The information is always written as strings and always appended at the end of the file, which grows until all the information is written. Writing always follows the same scheme: call the write method of f (a TextIOWrapper object).
# if we want to write in a .txt file in Python
# "w" is for write mode, we import the file "file_name.txt" as f and after the "with" block, the program close automatically
with open ("file_name.txt", "w") as f:
f.write("writing whatever I want in this file...")
f.write("and adding another information. ")
f.write("Let us skip two lines: \n\n")
f.write("Let us add tabulates: \t\t")
f.write("End.\n")
# "a" is for add mode, in the same file "file_name.txt", let us add more informations
with open ("file_name.txt", "a") as f:
f.write("\nAdding an information without erasing the previous informations")
Reading a file retrieves the stored information. It follows the same principle, namely:
opening the file in read mode;
reading, by iterating directly over the file object or by using the readlines method;
closing.
However, there is a difference when reading a file: it is done line by line, whereas writing does not necessarily follow a line-by-line division.
# if we want to read in a .txt file in Python
with open ("file_name.txt", "r") as f:
for ligne in f:
print(ligne)
writing whatever I want in this file...and adding another information. Let us skip two lines: Let us add tabulates: End. Adding an information without erasing the previous informations
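Besides iterating over the file object, the read and readlines methods mentioned above can also be used. Here is a minimal self-contained sketch; the file name demo.txt is just a throwaway example, not one used elsewhere in this lecture.

```python
# Create a small throwaway file so the example is self-contained
with open("demo.txt", "w") as f:
    f.write("first line\n")
    f.write("second line\n")

# read() returns the whole file as a single string
with open("demo.txt", "r") as f:
    content = f.read()

# readlines() returns a list of lines (newline characters included)
with open("demo.txt", "r") as f:
    lines = f.readlines()
```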
Remark: the with command handles both the opening and the closing. Alternatively (although not recommended) we can write (for write mode):
f = open ("file_name.txt", "w") # opening
... # writing
... # writing
f.close () # closing
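The reason this manual pattern is not recommended is that an error raised between open and close would leave the file open. A common safeguard, sketched below, is to wrap the writing in try/finally, which is essentially what the with block does for us.

```python
# Equivalent of the `with` block: try/finally guarantees the file is
# closed even if an error occurs while writing
f = open("file_name.txt", "w")
try:
    f.write("some content")
finally:
    f.close()
```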
With the above Python built-in processes, importing and manipulating more "complex" types of data becomes cumbersome. For instance, let us import a csv file with the above method and store the values in a list. You can download the iris dataset here. The Iris dataset is one of the best-known toy databases in the pattern recognition literature. The dataset contains 3 classes (of 50 instances each):
"Iris-setosa";
"Iris-versicolor";
"Iris-virginica".
Each class refers to a type of iris plant.
table = []
with open ("data/iris.csv", "r") as f:
table = f.readlines()
##equivalently you can write ...
#for lines in f:
# table.append(lines)
table[:5] # let's look at the first 5 rows
['Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n', '1,5.1,3.5,1.4,0.2,Iris-setosa\n', '2,4.9,3.0,1.4,0.2,Iris-setosa\n', '3,4.7,3.2,1.3,0.2,Iris-setosa\n', '4,4.6,3.1,1.5,0.2,Iris-setosa\n']
As you can see, each line is represented as a string, so we end up handling string objects instead of the desired values. In this case it is recommended to use external packages.
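Before moving to pandas, note that the strings above could be converted by hand with split, or with the built-in csv module. A minimal sketch, using an in-memory string that mimics the first rows of the file (the values here are assumed for illustration):

```python
import csv
import io

# A small in-memory stand-in for a csv file (assumed sample rows)
raw = "Id,SepalLengthCm\n1,5.1\n2,4.9\n"

# By hand: split each line on commas, then convert the fields
rows = [line.split(",") for line in raw.strip().split("\n")[1:]]
values = [(int(i), float(x)) for i, x in rows]

# With the built-in csv module, which handles separators and quoting for us
reader = csv.reader(io.StringIO(raw))
header = next(reader)
parsed = [(int(i), float(x)) for i, x in reader]
```

This works, but the conversions stay manual, which is exactly what pandas automates below.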
This lecture explores how to represent and manipulate data, and more precisely datasets. Simply put, a dataset is just a collection of data, often represented by tables.
The most well-known Python package for efficiently handling data as a two-dimensional table is pandas, which provides a container for tables called DataFrame.
pandas and its DataFrame offer a rich set of features for loading, transforming and summarizing tabular data.
As always, in a terminal (e.g. the Anaconda prompt), you can install the pandas package with the command:
pip install pandas
Note that pandas is conventionally imported under the alias pd.
import pandas as pd
Below you will find the main differences between a list, an array and a dataframe.
An existing dataframe can be read with the function read_csv (see the documentation for more details).
# we give the file path, the character separating the values, and the column chosen as the row index
iris = pd.read_csv('data/Iris.csv', sep=',', index_col="Id")
iris
SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|
Id | |||||
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 5 columns
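The reverse operation, saving a dataframe back to disk, is done with the to_csv method. A minimal sketch on a tiny dataframe (the values and the file name small.csv are assumptions for illustration):

```python
import pandas as pd

# A tiny dataframe standing in for a real table (assumed values)
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]},
                  index=pd.Index([10, 11], name="Id"))

# Write it to disk, then read it back with the same index column
df.to_csv("small.csv", sep=",")
back = pd.read_csv("small.csv", sep=",", index_col="Id")
```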
There are many ways of creating a dataframe from scratch:
import numpy as np
df1 = pd.DataFrame(data = np.random.rand(6,4),
index = ["u", "v", "w", "x", "y", "z"], # if not specified, it will be 0, 1, .. 5
columns = ["a", "b", "c", "d"]) # if not specified, it will be 0, 1, .. 5
df1
a | b | c | d | |
---|---|---|---|---|
u | 0.204438 | 0.670869 | 0.737901 | 0.384048 |
v | 0.719042 | 0.061850 | 0.285012 | 0.197061 |
w | 0.339598 | 0.792641 | 0.428429 | 0.253828 |
x | 0.138391 | 0.081931 | 0.396658 | 0.427210 |
y | 0.857614 | 0.135360 | 0.602730 | 0.196086 |
z | 0.698373 | 0.578144 | 0.732524 | 0.644210 |
df2 = pd.DataFrame({'a' : 1., # Single item
'b' : "same", # Single item
'c' : np.random.rand(5), # Multiple items
'd' : pd.Categorical(["test", "train", "train", "train", "test"])}) # Multiple items
df2
a | b | c | d | |
---|---|---|---|---|
0 | 1.0 | same | 0.131254 | test |
1 | 1.0 | same | 0.513057 | train |
2 | 1.0 | same | 0.094079 | train |
3 | 1.0 | same | 0.235618 | train |
4 | 1.0 | same | 0.301023 | test |
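Two other common constructors, sketched below on assumed toy values: a list of dictionaries (each dict becomes one row) and a list of lists with explicit column names.

```python
import pandas as pd

# From a list of dictionaries: each dict becomes one row
df3 = pd.DataFrame([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])

# From a list of lists, with explicit column names
df4 = pd.DataFrame([[1, "x"], [2, "y"]], columns=["a", "b"])
```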
Like dictionaries, it is possible to add a new column (e.g. column_name) with values values:
df[column_name] = values
# let us add `e` to df1
df1['e'] = np.random.randint(0, 2, size=df1.shape[0])
df1
a | b | c | d | e | |
---|---|---|---|---|---|
u | 0.204438 | 0.670869 | 0.737901 | 0.384048 | 1 |
v | 0.719042 | 0.061850 | 0.285012 | 0.197061 | 1 |
w | 0.339598 | 0.792641 | 0.428429 | 0.253828 | 0 |
x | 0.138391 | 0.081931 | 0.396658 | 0.427210 | 0 |
y | 0.857614 | 0.135360 | 0.602730 | 0.196086 | 0 |
z | 0.698373 | 0.578144 | 0.732524 | 0.644210 | 0 |
Instead of viewing the whole table, pandas provides several methods for peeking at it.
iris.head(n=3) # first 3 rows / observations of the table
SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|
Id | |||||
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
iris.tail(n=3) # last 3 rows / observations of the table
SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|
Id | |||||
148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
iris.values[:3] # values are in a numpy array (let us view only the first 3 rows)
array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'], [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], [4.7, 3.2, 1.3, 0.2, 'Iris-setosa']], dtype=object)
Concerning the table's information, the DataFrame object has many useful attributes:
# iris data
print(iris.index, end="\n\n") # index of the table
print(iris.columns, end="\n\n") # columns of the table
print(iris.shape, end="\n\n") # shape of the table
print(iris.dtypes, end="\n\n") # data types of each variable
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], dtype='int64', name='Id', length=150) Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species'], dtype='object') (150, 5) SepalLengthCm float64 SepalWidthCm float64 PetalLengthCm float64 PetalWidthCm float64 Species object dtype: object
# df1 data
print(df1.index, end="\n\n") # index of the table
print(df1.columns, end="\n\n") # columns of the table
print(df1.shape, end="\n\n") # shape of the table
print(df1.dtypes, end="\n\n") # data types of each variable
Index(['u', 'v', 'w', 'x', 'y', 'z'], dtype='object') Index(['a', 'b', 'c', 'd', 'e'], dtype='object') (6, 5) a float64 b float64 c float64 d float64 e int64 dtype: object
In a more compact way, the method info gives general information about the dataframe and describe gives quantitative information. In particular, info indicates the categorical variables (which are not treated by describe).
iris.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 150 entries, 1 to 150 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SepalLengthCm 150 non-null float64 1 SepalWidthCm 150 non-null float64 2 PetalLengthCm 150 non-null float64 3 PetalWidthCm 150 non-null float64 4 Species 150 non-null object dtypes: float64(4), object(1) memory usage: 7.0+ KB
iris.describe()
SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
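As noted above, describe skips categorical columns by default; passing include="object" (or include="all") brings them back. A minimal sketch on a small assumed frame mixing numeric and categorical columns:

```python
import pandas as pd

# A small frame mixing numeric and categorical columns (assumed data)
df = pd.DataFrame({"x": [1.0, 2.0, 3.0],
                   "species": ["setosa", "setosa", "virginica"]})

num_stats = df.describe()                  # numeric columns only
cat_stats = df.describe(include="object")  # categorical columns: count, unique, top, freq
```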
One can also sort the rows by index labels or by a column's values:
df1.sort_index(ascending=False)
a | b | c | d | e | |
---|---|---|---|---|---|
z | 0.698373 | 0.578144 | 0.732524 | 0.644210 | 0 |
y | 0.857614 | 0.135360 | 0.602730 | 0.196086 | 0 |
x | 0.138391 | 0.081931 | 0.396658 | 0.427210 | 0 |
w | 0.339598 | 0.792641 | 0.428429 | 0.253828 | 0 |
v | 0.719042 | 0.061850 | 0.285012 | 0.197061 | 1 |
u | 0.204438 | 0.670869 | 0.737901 | 0.384048 | 1 |
df1.sort_values(by='a')
a | b | c | d | e | |
---|---|---|---|---|---|
x | 0.138391 | 0.081931 | 0.396658 | 0.427210 | 0 |
u | 0.204438 | 0.670869 | 0.737901 | 0.384048 | 1 |
w | 0.339598 | 0.792641 | 0.428429 | 0.253828 | 0 |
z | 0.698373 | 0.578144 | 0.732524 | 0.644210 | 0 |
v | 0.719042 | 0.061850 | 0.285012 | 0.197061 | 1 |
y | 0.857614 | 0.135360 | 0.602730 | 0.196086 | 0 |
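sort_values also accepts several columns, breaking ties on the first column with the next one. A sketch on a small assumed frame:

```python
import pandas as pd

# Ties on "e" are broken by "a"; each column gets its own sort direction
df = pd.DataFrame({"e": [1, 0, 1, 0], "a": [0.7, 0.3, 0.2, 0.9]},
                  index=list("uvwx"))
out = df.sort_values(by=["e", "a"], ascending=[False, True])
```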
Let us consider the iris
dataset as our toy dataset for this section. As a reminder:
iris.head()
SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species | |
---|---|---|---|---|---|
Id | |||||
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
It is possible to rename the row and column labels. The DataFrame object has the attribute columns, which we can easily reassign with a list.
print(iris.columns)
iris.columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Species"]
print(iris.columns)
Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species'], dtype='object') Index(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'], dtype='object')
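When only some labels change, the rename method is handier than reassigning the whole columns attribute: it takes a mapping and leaves the other labels untouched, returning a new dataframe by default. A sketch on a small assumed frame:

```python
import pandas as pd

df = pd.DataFrame({"SepalLengthCm": [5.1], "Species": ["Iris-setosa"]})

# Rename only the columns we mention; the others are left as-is
out = df.rename(columns={"SepalLengthCm": "SepalLength"})
```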
For the row labels, the pandas DataFrame object offers many methods for updating it (see documentation with help
command for more details about input arguments):
reset_index
which resets an index of a Data Frame. reset_index() method sets a list of integers ranging from 0 to length of data as an index;set_index
which sets the DataFrame index (a.k.a row labels) using existing (one or more) columns;and finally like columns, reassign the index
attribute of the DataFrame object.
iris.head(3)
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
Id | |||||
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
# you can reset index with:
iris.reset_index(inplace=True) # the Id is now a new column of the dataframe
iris.head(3)
Id | SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
# you can set the "Id" column back as the dataframe index:
iris.set_index("Id", drop=True, inplace=True)
iris.head(3)
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
Id | |||||
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
# Let us rename the index:
# rk: alternatively we can use `reindex` which creates a new index and reindex the dataframe
iris.index = ["lab " + str(i) for i in range(len(iris))]
iris.head()
## in the same logic but more compact:
#iris.index = "lab " + iris.index.map(str)
#head(iris)
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Like previous data structures (e.g. Python lists or numpy arrays), natural indexing is performed with []. This indexes the columns of a DataFrame and the rows of a Series.
A Series is the data structure for a single column of a DataFrame: a DataFrame is actually stored in memory as a collection of Series.
iris['SepalWidth'] # iris is a "DataFrame" and the result of this execution is a "Series"
lab 0 3.5 lab 1 3.0 lab 2 3.2 lab 3 3.1 lab 4 3.6 ... lab 145 3.0 lab 146 2.5 lab 147 3.0 lab 148 3.4 lab 149 3.0 Name: SepalWidth, Length: 150, dtype: float64
s = iris['SepalWidth'] # iris is a "DataFrame" and the result of this execution is a "Series"
s[2]
3.2
You may want to extract several columns or several rows.
iris[['SepalWidth', 'PetalLength']]
SepalWidth | PetalLength | |
---|---|---|
lab 0 | 3.5 | 1.4 |
lab 1 | 3.0 | 1.4 |
lab 2 | 3.2 | 1.3 |
lab 3 | 3.1 | 1.5 |
lab 4 | 3.6 | 1.4 |
... | ... | ... |
lab 145 | 3.0 | 5.2 |
lab 146 | 2.5 | 5.0 |
lab 147 | 3.0 | 5.2 |
lab 148 | 3.4 | 5.4 |
lab 149 | 3.0 | 5.1 |
150 rows × 2 columns
Remark: selecting with [[]] always returns a dataframe.
iris[['SepalWidth']]
SepalWidth | |
---|---|
lab 0 | 3.5 |
lab 1 | 3.0 |
lab 2 | 3.2 |
lab 3 | 3.1 |
lab 4 | 3.6 |
... | ... |
lab 145 | 3.0 |
lab 146 | 2.5 |
lab 147 | 3.0 |
lab 148 | 3.4 |
lab 149 | 3.0 |
150 rows × 1 columns
.loc[]
Label-based indexing is an enhancement of natural indexing, accessible with .loc[]. Indexing has to be thought of as in a matrix, but with labels instead of positions. Hence, the rows are indexed first (instead of the columns, as with []).
iris.loc["lab 3"] # Single row
SepalLength 4.6 SepalWidth 3.1 PetalLength 1.5 PetalWidth 0.2 Species Iris-setosa Name: lab 3, dtype: object
iris.loc[:, 'SepalWidth'].head() # Single column
lab 0 3.5 lab 1 3.0 lab 2 3.2 lab 3 3.1 lab 4 3.6 Name: SepalWidth, dtype: float64
iris.loc[['lab 3', 'lab 16']] # Multiple rows
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
iris.loc['lab 3':'lab 16'] # Row slicing
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
lab 10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
lab 11 | 4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa |
lab 12 | 4.8 | 3.0 | 1.4 | 0.1 | Iris-setosa |
lab 13 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
lab 14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
lab 15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
lab 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
iris.loc['lab 3':'lab 16':3] # Row slicing (with step 3)
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
lab 12 | 4.8 | 3.0 | 1.4 | 0.1 | Iris-setosa |
lab 15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
iris.loc['lab 3':'lab 16':3, :'PetalWidth'] # Row and column slicing
SepalLength | SepalWidth | PetalLength | PetalWidth | |
---|---|---|---|---|
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 |
lab 12 | 4.8 | 3.0 | 1.4 | 0.1 |
lab 15 | 5.7 | 4.4 | 1.5 | 0.4 |
.iloc[]
Integer-location (position-based) indexing is done with .iloc[]. It is similar to .loc[], but uses integer positions instead of labels.
Remark: endpoints are not included (as with numpy arrays).
iris.iloc[:2]
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
iris.iloc[10:20:2, ::2]
SepalLength | PetalLength | Species | |
---|---|---|---|
lab 10 | 5.4 | 1.5 | Iris-setosa |
lab 12 | 4.8 | 1.4 | Iris-setosa |
lab 14 | 5.8 | 1.2 | Iris-setosa |
lab 16 | 5.4 | 1.3 | Iris-setosa |
lab 18 | 5.7 | 1.7 | Iris-setosa |
Similarly to numpy arrays, dataframes can be indexed with Boolean masks thanks to .loc[].
iris.loc[iris['PetalWidth'] > 2.3] # Row selection
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 100 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
lab 109 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
lab 114 | 5.8 | 2.8 | 5.1 | 2.4 | Iris-virginica |
lab 136 | 6.3 | 3.4 | 5.6 | 2.4 | Iris-virginica |
lab 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
lab 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
iris.loc[iris['PetalWidth'] > 2.3, ["PetalWidth", "Species"]] # Row and column selection
PetalWidth | Species | |
---|---|---|
lab 100 | 2.5 | Iris-virginica |
lab 109 | 2.5 | Iris-virginica |
lab 114 | 2.4 | Iris-virginica |
lab 136 | 2.4 | Iris-virginica |
lab 140 | 2.4 | Iris-virginica |
lab 144 | 2.5 | Iris-virginica |
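Several conditions can be combined with & (and), | (or) and ~ (not); the parentheses around each condition are mandatory because these operators bind tighter than the comparisons. A sketch on a small assumed frame:

```python
import pandas as pd

df = pd.DataFrame({"PetalWidth": [0.2, 2.4, 2.5],
                   "SepalLength": [5.1, 5.8, 6.7]})

# Parentheses are mandatory around each condition
both = df.loc[(df["PetalWidth"] > 2.3) & (df["SepalLength"] > 6.0)]
either = df.loc[(df["PetalWidth"] > 2.3) | (df["SepalLength"] < 5.5)]
```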
The isin method enables a selection through an existence condition:
# we select only rows with "Iris-setosa" OR "Iris-virginica" species
iris.loc[ iris['Species'].isin(['Iris-setosa', 'Iris-virginica']) ]
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
lab 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
lab 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
lab 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
lab 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
lab 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
100 rows × 5 columns
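The complementary selection is obtained by negating the isin mask with ~. A sketch on a small assumed frame:

```python
import pandas as pd

df = pd.DataFrame({"Species": ["setosa", "versicolor", "virginica"]})

# ~ negates the Boolean mask: keep rows whose species is NOT listed
out = df.loc[~df["Species"].isin(["setosa", "virginica"])]
```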
Remark: it is possible to do random sampling with the method sample:
iris.sample(n=5) # sample 5 random rows
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
lab 107 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
lab 92 | 5.8 | 2.6 | 4.0 | 1.2 | Iris-versicolor |
lab 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
Let us consider a copy of the first 10 rows of iris
.
df = iris.iloc[:10].copy()
df
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
Adding a column:
df['RandomNumbers'] = np.random.rand(10)
df
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | RandomNumbers | |
---|---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.181950 |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.929935 |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.060427 |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.132539 |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.026432 |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.800114 |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.635281 |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.677129 |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.132401 |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.589566 |
Adding a row:
df.loc['duplicate last row'] = df.iloc[-1,:]
df
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | RandomNumbers | |
---|---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0.181950 |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0.929935 |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0.060427 |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0.132539 |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0.026432 |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 0.800114 |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 0.635281 |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 0.677129 |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0.132401 |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.589566 |
duplicate last row | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 0.589566 |
Deleting rows and columns:
df.drop('RandomNumbers', inplace=True, axis=1)
df.drop('duplicate last row', inplace=True, axis=0)
df
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
Thanks to the pandas function concat, it is easy to concatenate pandas objects along a particular axis.
Remark: as always (see the lecture on scientific computing), axis=0 refers to the index (rows) and axis=1 to the columns.
Let us concatenate df (a copy of the first 10 rows of iris) with the following dataframes:
# new (very giant!) species
df_row = pd.DataFrame({
'SepalLength': [10, 20, 30],
'SepalWidth': [10, 20, 30],
'PetalLength': [10, 20, 30],
'PetalWidth': [10, 20, 30],
'Species': ["Iris-giant", "Iris-giant", "Iris-giant"]
}, index=["new 1", "new 2", "new 3"])
df_row
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
new 1 | 10 | 10 | 10 | 10 | Iris-giant |
new 2 | 20 | 20 | 20 | 20 | Iris-giant |
new 3 | 30 | 30 | 30 | 30 | Iris-giant |
# new information: "Age" and "Country" of the iris
df_col = pd.DataFrame({
'Age': np.random.randint(0, 10, 10),
'Country': ["France", "Italy", "Spain", "China", "US", "France", "Spain", "Spain", "France", "Japan"]
}, index=df.index)
df_col
Age | Country | |
---|---|---|
lab 0 | 9 | France |
lab 1 | 2 | Italy |
lab 2 | 7 | Spain |
lab 3 | 5 | China |
lab 4 | 8 | US |
lab 5 | 7 | France |
lab 6 | 8 | Spain |
lab 7 | 6 | Spain |
lab 8 | 0 | France |
lab 9 | 5 | Japan |
# concatenate with axis=0
pd.concat((df, df_row), axis=0)
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
new 1 | 10.0 | 10.0 | 10.0 | 10.0 | Iris-giant |
new 2 | 20.0 | 20.0 | 20.0 | 20.0 | Iris-giant |
new 3 | 30.0 | 30.0 | 30.0 | 30.0 | Iris-giant |
# concatenate with axis=1 and assign it to the variable name `data`
data = pd.concat((df, df_col), axis=1)
data
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | Age | Country | |
---|---|---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 9 | France |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 2 | Italy |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 7 | Spain |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 5 | China |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 8 | US |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 7 | France |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 8 | Spain |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 6 | Spain |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 0 | France |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 5 | Japan |
# for a more advanced concatenation (merge inner/outer join, ...), see the documentation
help(pd.concat)
Help on function concat in module pandas.core.reshape.concat: concat(objs: 'Iterable[NDFrame] | Mapping[Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity: 'bool' = False, sort: 'bool' = False, copy: 'bool' = True) -> 'FrameOrSeriesUnion' Concatenate pandas objects along a particular axis with optional set logic along the other axes. Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number. Parameters ---------- objs : a sequence or mapping of Series or DataFrame objects If a mapping is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised. axis : {0/'index', 1/'columns'}, default 0 The axis to concatenate along. join : {'inner', 'outer'}, default 'outer' How to handle indexes on other axis (or axes). ignore_index : bool, default False If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join. keys : sequence, default None If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level. levels : list of sequences, default None Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys. names : list, default None Names for the levels in the resulting hierarchical index. verify_integrity : bool, default False Check whether the new concatenated axis contains duplicates. 
This can be very expensive relative to the actual data concatenation. sort : bool, default False Sort non-concatenation axis if it is not already aligned when `join` is 'outer'. This has no effect when ``join='inner'``, which already preserves the order of the non-concatenation axis. .. versionchanged:: 1.0.0 Changed to not sort by default. copy : bool, default True If False, do not copy data unnecessarily. Returns ------- object, type of objs When concatenating all ``Series`` along the index (axis=0), a ``Series`` is returned. When ``objs`` contains at least one ``DataFrame``, a ``DataFrame`` is returned. When concatenating along the columns (axis=1), a ``DataFrame`` is returned. See Also -------- Series.append : Concatenate Series. DataFrame.append : Concatenate DataFrames. DataFrame.join : Join DataFrames using indexes. DataFrame.merge : Merge DataFrames by indexes or columns. Notes ----- The keys, levels, and names arguments are all optional. A walkthrough of how this method fits in with other tools for combining pandas objects can be found `here <https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html>`__. Examples -------- Combine two ``Series``. >>> s1 = pd.Series(['a', 'b']) >>> s2 = pd.Series(['c', 'd']) >>> pd.concat([s1, s2]) 0 a 1 b 0 c 1 d dtype: object Clear the existing index and reset it in the result by setting the ``ignore_index`` option to ``True``. >>> pd.concat([s1, s2], ignore_index=True) 0 a 1 b 2 c 3 d dtype: object Add a hierarchical index at the outermost level of the data with the ``keys`` option. >>> pd.concat([s1, s2], keys=['s1', 's2']) s1 0 a 1 b s2 0 c 1 d dtype: object Label the index keys you create with the ``names`` option. >>> pd.concat([s1, s2], keys=['s1', 's2'], ... names=['Series name', 'Row ID']) Series name Row ID s1 0 a 1 b s2 0 c 1 d dtype: object Combine two ``DataFrame`` objects with identical columns. >>> df1 = pd.DataFrame([['a', 1], ['b', 2]], ... 
columns=['letter', 'number']) >>> df1 letter number 0 a 1 1 b 2 >>> df2 = pd.DataFrame([['c', 3], ['d', 4]], ... columns=['letter', 'number']) >>> df2 letter number 0 c 3 1 d 4 >>> pd.concat([df1, df2]) letter number 0 a 1 1 b 2 0 c 3 1 d 4 Combine ``DataFrame`` objects with overlapping columns and return everything. Columns outside the intersection will be filled with ``NaN`` values. >>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], ... columns=['letter', 'number', 'animal']) >>> df3 letter number animal 0 c 3 cat 1 d 4 dog >>> pd.concat([df1, df3], sort=False) letter number animal 0 a 1 NaN 1 b 2 NaN 0 c 3 cat 1 d 4 dog Combine ``DataFrame`` objects with overlapping columns and return only those that are shared by passing ``inner`` to the ``join`` keyword argument. >>> pd.concat([df1, df3], join="inner") letter number 0 a 1 1 b 2 0 c 3 1 d 4 Combine ``DataFrame`` objects horizontally along the x axis by passing in ``axis=1``. >>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], ... columns=['animal', 'name']) >>> pd.concat([df1, df4], axis=1) letter number animal name 0 a 1 bird polly 1 b 2 monkey george Prevent the result from including duplicate index values with the ``verify_integrity`` option. >>> df5 = pd.DataFrame([1], index=['a']) >>> df5 0 a 1 >>> df6 = pd.DataFrame([2], index=['a']) >>> df6 0 a 2 >>> pd.concat([df5, df6], verify_integrity=True) Traceback (most recent call last): ... ValueError: Indexes have overlapping values: ['a']
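As the comment above notes, SQL-style inner/outer joins on columns (rather than on the index) are done with merge. A minimal sketch on two small assumed frames sharing a key column:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")  # keys present in both frames
outer = pd.merge(left, right, on="key", how="outer")  # union of keys, NaN elsewhere
```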
A dataframe comes with many methods for descriptive statistics: a non-exhaustive list includes sum, mean, median, min, max, std and value_counts.
Remark: most of these methods exist on numpy arrays as well.
Let us study only the first 4 columns of the dataframe df.
df = df.drop(["Species"], axis=1, inplace=False)
df
SepalLength | SepalWidth | PetalLength | PetalWidth | |
---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 |
lab 5 | 5.4 | 3.9 | 1.7 | 0.4 |
lab 6 | 4.6 | 3.4 | 1.4 | 0.3 |
lab 7 | 5.0 | 3.4 | 1.5 | 0.2 |
lab 8 | 4.4 | 2.9 | 1.4 | 0.2 |
lab 9 | 4.9 | 3.1 | 1.5 | 0.1 |
df.median() # Median of numeric columns
# alternatively: df.median(axis=0)
SepalLength 4.9 SepalWidth 3.3 PetalLength 1.4 PetalWidth 0.2 dtype: float64
df.median(axis=1) # Median of rows (numeric objects only)
lab 0 2.45 lab 1 2.20 lab 2 2.25 lab 3 2.30 lab 4 2.50 lab 5 2.80 lab 6 2.40 lab 7 2.45 lab 8 2.15 lab 9 2.30 dtype: float64
df['SepalLength'].value_counts()
4.9 2 4.6 2 5.0 2 5.1 1 4.7 1 5.4 1 4.4 1 Name: SepalLength, dtype: int64
df.max(axis=0)
SepalLength 5.4 SepalWidth 3.9 PetalLength 1.7 PetalWidth 0.4 dtype: float64
Aggregation: compute a summary statistic for each group. Some examples:
iris.groupby("Species").sum() # group sums
SepalLength | SepalWidth | PetalLength | PetalWidth | |
---|---|---|---|---|
Species | ||||
Iris-setosa | 250.3 | 170.9 | 73.2 | 12.2 |
Iris-versicolor | 296.8 | 138.5 | 213.0 | 66.3 |
Iris-virginica | 329.4 | 148.7 | 277.6 | 101.3 |
iris.groupby("Species").mean() # group means
SepalLength | SepalWidth | PetalLength | PetalWidth | |
---|---|---|---|---|
Species | ||||
Iris-setosa | 5.006 | 3.418 | 1.464 | 0.244 |
Iris-versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
Iris-virginica | 6.588 | 2.974 | 5.552 | 2.026 |
The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:
SELECT mean(col1), min(col2), max(col3), median(col4), max(col4) - min(col4), col5
FROM Table
GROUP BY col5
We can do these aggregations with pandas:
iris.groupby("Species").agg(
SLmean = pd.NamedAgg(column="SepalLength", aggfunc=np.mean),
SWmin = pd.NamedAgg(column="SepalWidth", aggfunc=np.min),
PLmax = pd.NamedAgg(column="PetalLength", aggfunc=np.max),
PWmed = pd.NamedAgg(column="PetalWidth", aggfunc=np.median),
PWdif = pd.NamedAgg(column="PetalWidth", aggfunc=lambda x: max(x)-min(x))
)
SLmean | SWmin | PLmax | PWmed | PWdif | |
---|---|---|---|---|---|
Species | |||||
Iris-setosa | 5.006 | 2.3 | 1.9 | 0.2 | 0.5 |
Iris-versicolor | 5.936 | 2.0 | 5.1 | 1.3 | 0.8 |
Iris-virginica | 6.588 | 2.2 | 6.9 | 2.0 | 1.1 |
A dataframe also comes with many methods for data visualization (see lecture 5). These methods are based on the package matplotlib, so the customization from lecture 5 can be applied here (see the documentation for more details).
Here, we illustrate just a few of them on two datasets:
iris;
iris_plus, generated below.
iris.head()
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | |
---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
# generate iris_plus
countries = ["France", "Italy", "Spain", "China", "US", "Japan"]
probabilities = [0.35, 0.2, 0.05, 0.1, 0.1, 0.2]
extension = pd.DataFrame({
'Age': np.random.randint(0, 10, len(iris)),
'Country': np.random.choice(countries, size=len(iris), replace=True, p=probabilities)
}, index=iris.index)
iris_plus = pd.concat((iris, extension), axis=1)
iris_plus.head()
SepalLength | SepalWidth | PetalLength | PetalWidth | Species | Age | Country | |
---|---|---|---|---|---|---|---|
lab 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 3 | Italy |
lab 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 9 | China |
lab 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 8 | France |
lab 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 4 | France |
lab 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 2 | Italy |
ax = iris.plot(figsize=(10, 6)) # Columns vs index
ax.set_title("Line plot w.r.t. species");
iris.plot(subplots=True, figsize=(10, 8)); # Columns vs index
ax = iris.plot.hist(alpha=0.5, figsize=(10, 6))
ax.set_title("Histogram")
ax.set_xlabel("in cm");
iris.plot.hist(alpha=0.5, subplots=True, figsize=(10, 8));
iris.plot.scatter(x="SepalLength", y="PetalLength");
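These plotting methods combine naturally with groupby: aggregate first, then plot the summary. A sketch on a small stand-in frame (the values are assumed; the Agg backend is selected so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import pandas as pd

# Small stand-in for iris (assumed values)
df = pd.DataFrame({"SepalLength": [5.0, 5.1, 6.5, 6.6],
                   "Species": ["setosa", "setosa", "virginica", "virginica"]})

# Bar plot of the per-species means
means = df.groupby("Species").mean()
ax = means.plot.bar(figsize=(6, 4))
ax.set_title("Mean sepal length per species")
```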
Import the dataset defra_consumption (you can first download it from my website). You should have the following first 5 rows. How many rows/columns and quantitative/qualitative features do we have in this dataset?
Extract the following subtable:
With the dataset imported from exercise 1, generate the following figure:
The next questions use the dataset iris_plus previously generated.
From the dataset iris (or iris_plus), find out the average values of SepalLength and SepalWidth for all three species.