Feature extraction and PCA

We have gathered 1,288 images of people's faces, and each one has 1,850 features (pixels) that identify that person. Here's the code we used to display one of those faces; the result can be seen in Figure 11.22:
plt.imshow(X[100].reshape((h, w)), cmap=plt.cm.gray)
lfw_people.target_names[y[100]]

The output is as follows:

'George W Bush'

Figure 11.22 – A face from our dataset: George W. Bush
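In case you're wondering where X, y, h, w, and lfw_people came from, they were loaded from scikit-learn's Labeled Faces in the Wild (LFW) dataset. Here's a minimal sketch of that setup, assuming fetch_lfw_people is called with min_faces_per_person=70 and resize=0.4 (settings that yield exactly these dimensions); your earlier loading code may differ slightly:

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

# download (or load from cache) the Labeled Faces in the Wild dataset
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# the images come back as an (n_samples, height, width) array
n_samples, h, w = lfw_people.images.shape

# X is the flattened pixel matrix, y is the integer ID of each person
X = lfw_people.data
y = lfw_people.target
n_features = X.shape[1]  # 1,850 pixels per face at this resize factor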

Great! To get a glimpse at the type of dataset we are looking at, let’s look at a few overall metrics:
# the label to predict is the id of the person
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

The output is as follows:

Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7

So, we have 1,288 images, 1,850 features, and 7 classes (people) to choose from. Our goal is to build a classifier that assigns a name to each face based on the 1,850 pixel values given to us.

Let's establish a baseline and see how logistic regression (a classifier based on linear regression) performs on our data without doing anything to our dataset.

I know we haven't formally introduced logistic regression before, but it is a very lightweight classifier that works off assumptions very similar to those of linear regression from the last chapter. All we need to know for now is that it performs classification!
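(For the curious: under the hood, logistic regression passes a linear combination of the features through a sigmoid to turn it into a class probability. In the binary case, that looks like

\[
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top} x)}}
\]

and scikit-learn handles the extension to our seven classes for us.)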
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from time import time  # for timing our work

# get our training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

t0 = time()  # get the time now

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predicting people's names on the test set
y_pred = logreg.predict(X_test)
print(accuracy_score(y_pred, y_test), "Accuracy")
print((time() - t0), "seconds")

The output is as follows:
0.810559006211 Accuracy
6.31762504578 seconds

So, in about 6.3 seconds, we were able to get 81% accuracy on our test set. Not too bad.

Now, let’s try this with our decomposed faces:


from sklearn import decomposition

# Compute a PCA (eigenfaces) on the face dataset (treated as an unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 75

print("Extracting the top %d eigenfaces from %d faces"
      % (n_components, X_train.shape[0]))

# whiten rescales the extracted components to have unit variance,
# which often helps the downstream estimator
pca = decomposition.PCA(n_components=n_components, whiten=True).fit(X_train)

# Project the input data onto the eigenfaces orthonormal basis
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
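Before we plug these in anywhere, it can be illuminating to peek at what PCA actually extracted. The following is a quick, optional sketch (not part of the original pipeline) that reshapes the learned components back into h x w images and checks how much of the original pixel variance the 75 components retain:

# reshape each of the 75 components back into an h x w "eigenface" image
eigenfaces = pca.components_.reshape((n_components, h, w))

# fraction of the original pixel variance kept by the 75 components
print(pca.explained_variance_ratio_.sum())

# plot the first few eigenfaces
fig, axes = plt.subplots(1, 4, figsize=(8, 2))
for i, ax in enumerate(axes):
    ax.imshow(eigenfaces[i], cmap=plt.cm.gray)
    ax.set_title("eigenface %d" % i)
    ax.axis("off")
plt.show()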

The PCA step above collects 75 extracted columns from our 1,850 unprocessed columns. These are our super faces. Now, let's plug our newly extracted columns into our logistic regression and compare:


t0 = time()

# Predicting people's names on the test set WITH PCA
logreg.fit(X_train_pca, y_train)
y_pred = logreg.predict(X_test_pca)
print(accuracy_score(y_pred, y_test), "Accuracy")
print((time() - t0), "seconds")

The output is as follows:

0.82298136646 Accuracy
0.194181919098 seconds

Wow! Not only was this entire calculation about 30 times faster than with the unprocessed images, but the predictive performance also got better! This shows us that PCA, and feature extraction in general, can help us across the board when performing ML on complex datasets with many columns. By searching for these patterns in the dataset and extracting new feature columns, we can speed up and enhance our learning algorithms.
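As a closing note, once you're comfortable with this pattern, the PCA-then-classify flow can be packaged into a single scikit-learn Pipeline object. The following is a minimal sketch of that idea, assuming the same 75 components and the same train/test split as above:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# chain the feature extraction and the classifier into one estimator
face_pipeline = Pipeline([
    ("pca", PCA(n_components=75, whiten=True)),
    ("logreg", LogisticRegression()),
])

# fitting the pipeline runs PCA and then trains the logistic regression
face_pipeline.fit(X_train, y_train)

# score() reports accuracy on the held-out faces
print(face_pipeline.score(X_test, y_test))

This keeps the transformation and the model together, so the same PCA projection is automatically applied to any new faces we ask the pipeline to predict.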