The correction is available on this website : https://curiousml.github.io/
This part is an exploration of two popular Python packages: NumPy and Matplotlib
import numpy as np
Q1.
The chessboard can be generated as follows :
A = np.zeros((8,8))
A[1::2, ::2] = 1
A[::2, 1::2] = 1
print(A)
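As a quick sanity check (an alternative construction, not asked in the TP), the same board can be obtained from the parity of the row and column indices:
# A[i, j] = 1 exactly when i + j is odd
A_alt = np.indices((8, 8)).sum(axis=0) % 2
print(np.array_equal(A, A_alt))  # expected: True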
Q2.
The generation of the matrix can be coded as :
# we choose the Fortran-like order and not the C-like order
M = np.arange(1, 21).reshape((4, 5), order='F')
print(M)
The code for the extraction is the following :
M[:,[1, -1, 2]][[1, 2, 0],:]
No question here, but you can run each line of code from the TP to become familiar with the package.
import matplotlib.pyplot as plt
Q3. Let us generate 15 real numbers drawn uniformly on $[-1, 1]$. These numbers define our array x.
import numpy as np
a = -1
b = 1
m = 15
x = a + (b - a) * np.random.random_sample(m)
print(x)
Q4. Let us define the array y as specified in the TP.
def g(x):
    return (3/2)*x**3 - x**2 - (3/4)*x + 1
y = g(x) + (1/20)*np.random.randn(m)
Q5. We are going to visualize the generated data in the plane.
plt.plot(x,y,'bo')
plt.show()
Small remark: even from this graph alone we can get an intuition of which family of predictors would be appropriate. But (unfortunately?) in reality, we rarely have a single descriptive feature.
Q6. With the same process, generate a test sample of size $m' = 30$, named $x_{test}$ and $y_{test}$.
m_test = 30
x_test = 2*np.random.random_sample(m_test)-1
y_test = g(x_test) + .05*np.random.randn(m_test)
We define a least-squares linear regression predictor $f$.
from sklearn.linear_model import LinearRegression
f = LinearRegression()
The predictor needs a two-dimensional array of descriptive data $X$ as input:
X = x[:, np.newaxis]
X_test = x_test[:, np.newaxis]
Train our model :
f.fit(X, y)
Q7. Let's visualise the performance :
xplot = np.linspace(-1,1,500).reshape(-1,1)
plt.plot(x,y,'bo')
plt.plot(xplot,f.predict(xplot))
plt.show()
Small remark 1: we can certainly do better...
Small remark 2: we can evaluate the performance better than just visually. Let's quantify it!
Q8. We compute the mean training error and the mean test error:
print("for train set :", sum((y-f.predict(X))**2)/m)
print("for test set :", sum((y_test-f.predict(X_test))**2)/m_test)
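Equivalently (just as a cross-check), the same quantities can be obtained with sklearn.metrics.mean_squared_error:
from sklearn.metrics import mean_squared_error
print("for train set :", mean_squared_error(y, f.predict(X)))
print("for test set :", mean_squared_error(y_test, f.predict(X_test)))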
We see that the predictor is much more accurate on the training set than on the test set: the predictor overfits the training set.
Q9. We are going to use a new predictor and look at its performance. This time we use a polynomial regression of degree 2.
## a
from sklearn.preprocessing import PolynomialFeatures
psi = PolynomialFeatures(2,include_bias=False).fit_transform
f = LinearRegression().fit(psi(X),y)
## b
plt.plot(x,y,'bo')
plt.plot(xplot,f.predict(psi(xplot)))
plt.show()
## c
print("for train set :", sum((y-f.predict(psi(X)))**2)/m)
print("for test set :", sum((y_test-f.predict(psi(X_test)))**2)/m_test)
We observe a boost in performance compared to linear regression, and this model seems to overfit the data less.
Q10. Let us generalize the computation to any maximum degree of polynomials $n\geq 1$
## a
def reg_poly(n):
    # polynomial feature map of degree n (no constant column, the intercept is handled by LinearRegression)
    psi = PolynomialFeatures(n, include_bias=False).fit_transform
    return LinearRegression().fit(psi(X), y), psi
## b
plt.plot(x,y,'bo')
poly = {}
psi = {}
for n in [3, 4, 13, 14]:
    poly[n], psi[n] = reg_poly(n)
    plt.plot(xplot, poly[n].predict(psi[n](xplot)))
plt.axis([-1, 1, -1.5, 1.5])
plt.legend(['data','n = 3', 'n = 4', 'n = 13', 'n = 14'])
plt.show()
## c
print("intercept for n = 3", poly[3].intercept_)
print()
print("coefficients for n = 3", poly[3].coef_)
For $n=3$, these coefficients seem to be similar to the coefficients of the function $g$.
## d
For $n=14$, there is a polynomial that passes through each of the 15 points (the Lagrange interpolation polynomial). This polynomial obviously minimizes the empirical risk, so it is the one given by the minimization of the empirical risk.
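A quick numerical check of this claim (just a sketch, reusing the degree-14 fit computed in part b): its training error should be numerically close to zero.
# mean training error of the degree-14 polynomial: expected to be very close to 0
print(sum((y - poly[14].predict(psi[14](X)))**2)/m)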
## e
mean_train_error = []
mean_test_error = []
n_max = 14
for n in range(1, n_max+1):
    f, psi = reg_poly(n)
    mean_train_error.append(sum((f.predict(psi(X))-y)**2)/m)
    mean_test_error.append(sum((f.predict(psi(X_test))-y_test)**2)/m_test)
plt.plot(range(1,n_max+1),mean_train_error)
plt.plot(range(1,n_max+1),mean_test_error)
plt.axis([0, n_max, 0, .2])
plt.show()
The higher $n$ is, the more complex the resulting polynomials are (and so the training error is lower).
The test error decreases first. Then, when n is large, it increases: this is called overfitting. The high complexity of the polynomials makes it easier to stick to the training points, but does not allow good generalization to new points.
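As a small complement (not explicitly asked), the degree reaching the smallest mean test error can be read off directly:
# +1 because the degrees tested start at 1
n_best = int(np.argmin(mean_test_error)) + 1
print("degree with the smallest mean test error:", n_best)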
Q11. We know that with $n=14$ the predictor overfits the data. One of the most popular methods for preventing overfitting is to desensitize the model to the training data with regularization, such as LASSO regularization. Let us look at this in more detail.
from sklearn.linear_model import Lasso
def reg_poly_lasso(n, alpha=1):
    # same polynomial feature map as reg_poly, but fitted with LASSO regularization
    psi = PolynomialFeatures(n, include_bias=False).fit_transform
    return Lasso(alpha).fit(psi(X), y), psi
for alpha in [10**a for a in range(-6, 1)]:
    f, psi = reg_poly_lasso(14, alpha)
    plt.plot(x, y, 'bo')
    plt.plot(xplot, f.predict(psi(xplot)))
    plt.axis([-1, 1, -1.5, 1.5])
    print('Coefficients:', f.coef_)
    print()
    plt.show()
For large alpha values, the algorithm produces polynomials with many zero or close to zero coefficients. This limits the complexity of the polynomials and thus the overfitting.
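To make this observation quantitative (a small sketch, not required by the TP), one can count the non-zero coefficients for each value of alpha (convergence warnings may appear for very small alpha):
for alpha in [10**a for a in range(-6, 1)]:
    f, psi = reg_poly_lasso(14, alpha)
    # number of coefficients kept (i.e. not set exactly to zero) by the LASSO
    print("alpha =", alpha, "-> non-zero coefficients:", int(np.sum(f.coef_ != 0)))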
Q12. (Q13.)
mean_train_error = []
mean_test_error = []
nonzero_coefficients = []
degree = []
alpha_range = [10**a for a in range(-6,1)]
for alpha in alpha_range:
    f, psi = reg_poly_lasso(14, alpha)
    mean_train_error.append(sum((f.predict(psi(X))-y)**2)/m)
    mean_test_error.append(sum((f.predict(psi(X_test))-y_test)**2)/m_test)
plt.plot(np.log10(alpha_range),mean_train_error)
plt.plot(np.log10(alpha_range),mean_test_error)
plt.axis([-6,0,0,0.5])
plt.show()
The alpha parameter plays a role similar to n in question 10 (but in the opposite direction).
When alpha decreases, the polynomials obtained are more and more complex, and the learning error decreases. But when alpha is small enough, the test error increases again : there is overfitting.
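As before, we can also extract the value of alpha reaching the smallest mean test error:
alpha_best = alpha_range[int(np.argmin(mean_test_error))]
print("alpha with the smallest mean test error:", alpha_best)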
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target
Q14. Let us split the data into a training sample and a test sample. We can simply use the function train_test_split from sklearn.model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)
# train/test split at 75% - 25% (the default proportions)
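The split can also be made explicit and reproducible, for instance as follows (the value of random_state is an arbitrary choice, not part of the TP):
# equivalent explicit split; random_state fixed only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)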
TEST: First, we are going to train an SVM and report its score on the test sample as follows:
from sklearn.svm import LinearSVC
svm = LinearSVC(C=1).fit(X_train,y_train)
print('Score on test sample :', svm.score(X_test,y_test))
Q15. Let us train an SVM predictor again, this time with the hyperparameter C calibrated by cross-validation.
from sklearn.model_selection import validation_curve
C_range = np.logspace(-6,-2,50)
train_scores, valid_scores = validation_curve(LinearSVC(), X_train, y_train, param_name="C", param_range=C_range, cv=10)
train_scores_mean = np.mean(train_scores,axis=1)
valid_scores_mean = np.mean(valid_scores,axis=1)
plt.plot(np.log10(C_range), train_scores_mean, label="training scores")
plt.plot(np.log10(C_range), valid_scores_mean, label="validation scores")
plt.legend()
plt.xlabel('log(C)')
plt.ylabel('score')
plt.show()
After setting the hyperparameter $C$ correctly, we observe an improvement in the score!
C_best = C_range[np.argmax(valid_scores_mean)]
svm = LinearSVC(C=C_best).fit(X_train,y_train)
print('Score:', svm.score(X_test,y_test))
y_test_predict = svm.predict(X_test)
For more information on the hyperparameter C: https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel
Q16. Compute the precision and recall from the coefficients of the confusion matrix :
from sklearn.metrics import confusion_matrix
confusion_matrix_test = confusion_matrix(y_test,y_test_predict)
print(confusion_matrix_test)
tp = confusion_matrix_test[0,0]  # true label 0, predicted 0 (class 0 is the positive class here)
fp = confusion_matrix_test[1,0]  # true label 1, predicted 0
fn = confusion_matrix_test[0,1]  # true label 0, predicted 1
recall = tp/(tp+fn)
precision = tp/(tp+fp)
print('Recall :', recall)
print('Precision :', precision)
from sklearn.metrics import recall_score, precision_score
print('Recall: ',recall_score(y_test,y_test_predict,pos_label=0))
print('Precision: ',precision_score(y_test,y_test_predict,pos_label=0))
Q17. We are going to define a function false_positive_rate
def false_positive_rate(y_true, y_predict, pos_label):
    # proportion of actual negatives (y_true != pos_label) that are predicted as positive
    return np.sum(y_true[y_predict == pos_label] != pos_label)/np.sum(y_true != pos_label)
print(false_positive_rate(y_test,y_test_predict,0))
Q18. Define the function modified_predictor
def modified_predictor(X, tau):
    # predict class 1 when the decision function is above the threshold tau, class 0 otherwise
    return (svm.decision_function(X) >= tau).astype('int')
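A quick sanity check (a sketch; >= and > can only differ on ties at exactly 0): with $\tau = 0$, the modified predictor should essentially coincide with svm.predict.
# fraction of test points where the two predictors agree; expected to be 1.0 (or extremely close)
print(np.mean(modified_predictor(X_test, 0) == svm.predict(X_test)))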
Q19.
Q20. Let us find the minimum and maximum values of $\tau$ and plot the corresponding ROC curve.
decision_function_train = svm.decision_function(X_train)
tau_range = np.linspace(np.min(decision_function_train),np.max(decision_function_train),100)
recalls = []
fprs = []
for tau in tau_range:
    y_train_predict = modified_predictor(X_train, tau)
    recalls.append(recall_score(y_train, y_train_predict, pos_label=0))
    fprs.append(false_positive_rate(y_train, y_train_predict, 0))
plt.plot(fprs,recalls)
plt.show()
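As a cross-check (not required by the TP), sklearn.metrics.roc_curve produces the same curve; since our positive class is 0 while the decision function is oriented toward class 1, we pass the negated scores:
from sklearn.metrics import roc_curve
# fpr_sk and tpr_sk should draw the same curve as (fprs, recalls) above
fpr_sk, tpr_sk, _ = roc_curve(y_train, -decision_function_train, pos_label=0)
plt.plot(fprs, recalls, label="manual")
plt.plot(fpr_sk, tpr_sk, '--', label="sklearn roc_curve")
plt.legend()
plt.show()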
Q21. Let us determine the $\tau$ that guarantees a recall of at least 0.95 while keeping the false positive rate as small as possible:
recalls_array = np.array(recalls)
fprs_array = np.array(fprs)
good_enough_recalls_index = (recalls_array >= .95)
tau_best = (tau_range[good_enough_recalls_index])[np.argmin(fprs_array[good_enough_recalls_index])]
y_test_predict = modified_predictor(X_test,tau_best)
from sklearn.metrics import accuracy_score
print('Accuracy score: ', accuracy_score(y_test, y_test_predict))
print('Recall: ', recall_score(y_test, y_test_predict,pos_label=0))
print('Precision: ', precision_score(y_test, y_test_predict,pos_label=0))
print('False positive rate: ', false_positive_rate(y_test, y_test_predict,pos_label=0))
The accuracy is lower than with the previous SVM, but we obtain a better recall (recall at 0.96)!
import numpy as np # to manipulate numerical values
import pandas as pd # to manipulate dataframes
import matplotlib.pyplot as plt # to display graphics
import seaborn as sns # graphics library specially made for data visualization
Let us prepare the dataset :
data = pd.read_csv("wine_dataset.csv", sep = ";")
data.head(5)
This data contains :
print(data.shape)
6497 examples (rows) and 13 features (columns).
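As a quick extra exploration, we can also look at how the two wine colors are distributed:
print(data["color"].value_counts())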
After keeping only two features (chlorides and total sulfur dioxide) and rescaling them to $[-1, 1]$, the new data is the following:
from sklearn.preprocessing import MinMaxScaler
X = data[["chlorides", "total sulfur dioxide"]]
y = data["color"]
X = MinMaxScaler(feature_range=(-1, 1)).fit(X).transform(X)
print(X[:5,])
print(y[:5])
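Note that the fit/transform pair above can also be written more compactly with fit_transform (an equivalent formulation, shown only for reference):
# same result as fit(X).transform(X)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data[["chlorides", "total sulfur dioxide"]])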
Q22. Separate the dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y)
We will first perform a logistic regression.
from sklearn.linear_model import LogisticRegressionCV
Q23. Let's display its score on the test sample
logreg = LogisticRegressionCV(cv=5, Cs = np.logspace(-6,6,20)).fit(X_train, y_train)
logreg.score(X_test, y_test)
Since we only have two explanatory variables, we can represent the inputs in the plane, together with the decision boundary of the logreg predictor.
plt.scatter(X[:,0],X[:,1],c=y,edgecolors='k')
figure = plt.gca()
xlim = figure.get_xlim()
ylim = figure.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 100)
yy = np.linspace(ylim[0], ylim[1], 100)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
P = logreg.decision_function(xy).reshape(XX.shape)
plt.contour(XX, YY, P, levels=[0], alpha=1)
plt.show()
Q24. Let us define the function plot_decision_boundary
def plot_decision_boundary(predictor):
    # scatter plot of the data together with the decision boundary
    # (level set decision_function = 0) of the given predictor
    fig, ax = plt.subplots()
    ax.scatter(X[:,0], X[:,1], c=y, edgecolors='k')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    xx = np.linspace(xlim[0], xlim[1], 100)
    yy = np.linspace(ylim[0], ylim[1], 100)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    P = predictor.decision_function(xy).reshape(XX.shape)
    ax.contour(XX, YY, P, levels=[0], alpha=1)
    return None
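For instance, the plot of Q23 can now be reproduced in a single call:
plot_decision_boundary(logreg)
plt.show()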
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
grid = {'C':np.logspace(-6,6,20)}
svc = GridSearchCV(LinearSVC(),cv=5,param_grid=grid,verbose=2)
print('Score :',svc.fit(X_train,y_train).score(X_test,y_test))
plot_decision_boundary(svc)
print(svc.best_params_)
As the plot suggests, we can hope for an improvement by using SVMs with kernels.
Q25. SVM with polynomial kernels
from sklearn.svm import SVC
svcpoly = dict()
for d in range(2, 9):
    svcpoly[d] = SVC(kernel="poly", C=1, coef0=1, degree=d, gamma="auto").fit(X_train, y_train)
    plot_decision_boundary(svcpoly[d])
    plt.title("SVM with polynomial kernel : degree " + str(d))
    plt.show()
Visually, the SVM with a polynomial kernel of degree 3 (or maybe 4) seems to be a good predictor. Let us quantify this observation.
Q26. Train a polynomial-kernel SVM with the degree as a hyperparameter tuned by cross-validation (using GridSearchCV).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
grid = {'degree':np.arange(2,9)}
svc = GridSearchCV(SVC(kernel="poly",C=1,coef0=1,gamma="auto"),cv=5,param_grid=grid,verbose=2)
print('Score :',svc.fit(X_train,y_train).score(X_test,y_test))
plot_decision_boundary(svc)
print(svc.best_params_)
As a conclusion, we see that the SVM with a polynomial kernel of degree 3 is indeed the best predictor (with the other hyperparameters fixed). It can also be noticed that the score is higher than before.