Lecture 5: Data manipulation in Python
November, 2021 - François HU
Master of Science - EPITA
This lecture is available here: https://curiousml.github.io/
General introduction (a little long)
Data can be stored in various formats: txt, csv, xls (Excel), json, etc. In Python, each format (.txt for example) has dedicated modules for importing data. For "classical" files such as txt files, Python provides built-in commands for importing and handling them: for example, the built-in function open opens a txt file in write or read mode.
Write and append mode
The information is always written as strings and always added at the end of the file, which grows until all the information is written. Writing always follows the same scheme:
- creation or opening of the file: when the file is opened in write mode, it is created if it does not exist or emptied if it already exists (in append mode, the existing content is kept);
- writing with the write method of the file object (a TextIOWrapper object);
- closing: closing allows other programs to read what you have placed in this file.
# to write to a .txt file in Python
# "w" is for write mode: we open the file "file_name.txt" as f, and the file is closed automatically at the end of the "with" block
with open("file_name.txt", "w") as f:
    f.write("writing whatever I want in this file...")
    f.write("and adding another information. ")
    f.write("Let us skip two lines: \n\n")
    f.write("Let us add tabulates: \t\t")
    f.write("End.\n")
# "a" is for append mode: let us add more information to the same file "file_name.txt"
with open("file_name.txt", "a") as f:
    f.write("\nAdding an information without erasing the previous informations")
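To see the difference between the two modes, here is a minimal sketch that writes to a throwaway file and reads the content back after each step. The file name demo.txt and the use of the tempfile module are assumptions made for illustration; they are not part of the lecture's files.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for this demo.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")

# "w" creates the file (or empties it if it already exists).
with open(path, "w") as f:
    f.write("first version\n")

# Opening again in "w" mode discards the previous content.
with open(path, "w") as f:
    f.write("second version\n")

with open(path, "r") as f:
    after_write = f.read()

# "a" keeps the existing content and writes at the end of the file.
with open(path, "a") as f:
    f.write("appended line\n")

with open(path, "r") as f:
    after_append = f.read()

print(repr(after_write))
print(repr(after_append))
```

After the second "w" block only the second string remains, whereas the "a" block leaves it in place and adds a new line after it.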
Read mode
Reading a file retrieves the stored information. It follows the same principle, namely:
- opening the file in read mode;
- reading, either by iterating directly over the file object or by using the readlines method;
- closing.
However, there is a difference when reading a file: it is done line by line, whereas writing does not necessarily follow a line-by-line division.
# to read a .txt file in Python
with open("file_name.txt", "r") as f:
    for ligne in f:
        print(ligne)
writing whatever I want in this file...and adding another information. Let us skip two lines: 



Let us add tabulates: 		End.



Adding an information without erasing the previous informations
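When iterating over a file, each string keeps its trailing newline character, and print adds one more. The sketch below, which uses a hypothetical scratch file (demo_read.txt in a temporary directory, an assumption for illustration), compares direct iteration with readlines and shows one possible way to strip the trailing newline before printing.

```python
import os
import tempfile

# Hypothetical scratch file for this demo.
path = os.path.join(tempfile.mkdtemp(), "demo_read.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n\nfourth line")

# Iterating over the file yields one string per line, newline included.
with open(path, "r") as f:
    raw_lines = [line for line in f]

# readlines() returns the same list in a single call.
with open(path, "r") as f:
    same_lines = f.readlines()

# Removing the trailing "\n" avoids the extra blank line added by print.
clean_lines = [line.rstrip("\n") for line in raw_lines]
for line in clean_lines:
    print(line)
```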
Remark: the with statement handles both the opening and the closing of the file. Alternatively (although not recommended), we can write (for write mode):
f = open("file_name.txt", "w")  # opening
...  # writing
...  # writing
f.close()  # closing
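One reason the manual form is less convenient is that the file stays open if an error occurs before close is reached. A minimal sketch of a safer manual version, assuming a hypothetical scratch file (demo_close.txt in a temporary directory), wraps the writing in try/finally, which is roughly what the with statement does for us:

```python
import os
import tempfile

# Hypothetical scratch file for this demo.
path = os.path.join(tempfile.mkdtemp(), "demo_close.txt")

f = open(path, "w")          # opening
try:
    f.write("some text\n")   # writing
finally:
    f.close()                # closing, even if write() raised an error

closed_manually = f.closed

# The "with" block closes the file on exit without an explicit close().
with open(path, "a") as g:
    g.write("more text\n")
closed_by_with = g.closed

print(closed_manually, closed_by_with)
```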
External packages
With the above built-in Python tools, importing and manipulating more "complex" types of data quickly becomes tedious. For instance, let us import a csv file with the above method and store the values in a list. You can download the iris dataset here. The iris dataset is one of the best-known toy datasets in the pattern recognition literature. It contains 3 classes (of 50 instances each):
- "Iris-setosa";
- "Iris-versicolor";
- and "Iris-virginica".
Each class refers to a type of iris plant.
table = []
with open("data/iris.csv", "r") as f:
    table = f.readlines()
    ## equivalently you can write ...
    # for lines in f:
    #     table.append(lines)
table[:5]  # let's look at the first 5 rows
['Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n',
 '1,5.1,3.5,1.4,0.2,Iris-setosa\n',
 '2,4.9,3.0,1.4,0.2,Iris-setosa\n',
 '3,4.7,3.2,1.3,0.2,Iris-setosa\n',
 '4,4.6,3.1,1.5,0.2,Iris-setosa\n']
As you can see, each line is read as a single string, so we end up handling string objects instead of the wanted values. In this case it is recommended to use external packages.
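To give an idea of the manual work involved, here is a minimal sketch that converts a few of the raw lines into typed values by hand. The lines are copied from the output shown earlier; the variable names and the assumed column types (an integer Id, four floats, one string) are illustrative choices.

```python
# A few raw lines, as returned by readlines() on the iris file.
raw_lines = [
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n",
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n",
    "2,4.9,3.0,1.4,0.2,Iris-setosa\n",
]

# The first line holds the column names.
header = raw_lines[0].strip().split(",")

rows = []
for line in raw_lines[1:]:
    fields = line.strip().split(",")
    # Assumed layout: integer Id, four float measurements, species name.
    row = [int(fields[0])] + [float(v) for v in fields[1:5]] + [fields[5]]
    rows.append(row)

print(header)
print(rows)
```

Even in this simple case we have to strip newlines, split on commas and convert each column ourselves, and a plain split would break on fields that contain commas inside quotes.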
1. Introduction to Dataframes
This lecture explores how to represent and manipulate data, and more precisely datasets. Simply put, a dataset is a collection of data, often represented by tables where:
- each column of a table represents a variable (e.g. height, weight, age, grade, ...)
- and each row of a table represents an observation (a single case), with one value per variable.
The best-known Python package for efficiently handling data as a two-dimensional table is pandas, which provides a container for tables called DataFrame.
The main features of pandas and its DataFrame are:
- reading data from csv and Excel files;
- giving names to variables and index to observations;
- providing methods for visualization and descriptive statistics.
As always, you can install the pandas package from a terminal (e.g. Anaconda Prompt) with the command:
pip install pandas
By convention, pandas is imported under the alias pd.
import pandas as pd
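As a first taste of the features listed above, here is a minimal sketch that reads csv content into a DataFrame, uses the Id column as index and computes descriptive statistics. To stay self-contained, it reads the first rows of the iris file from an in-memory string (via io.StringIO, an assumption made for illustration) rather than from data/iris.csv.

```python
import io

import pandas as pd

# First rows of the iris file, kept in memory for this demo.
csv_text = (
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,4.9,3.0,1.4,0.2,Iris-setosa\n"
    "3,4.7,3.2,1.3,0.2,Iris-setosa\n"
    "4,4.6,3.1,1.5,0.2,Iris-setosa\n"
)

# read_csv parses the header, splits the fields and infers numeric types.
df = pd.read_csv(io.StringIO(csv_text), index_col="Id")

print(df.head())     # first rows of the table
print(df.dtypes)     # one type per column (variable)

# Descriptive statistics of the numeric columns.
summary = df.describe()
print(summary)
```

Unlike the list of strings obtained with readlines, the numeric columns are directly usable as numbers.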
Below you will find the main differences between list, array and dataframe: