Our goal is to predict whether a person gave a 5-star or a 1-star review based on the words they used in that review. Let’s set a baseline with logistic regression and see how well we can predict this binary category:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

lr = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
# Make our training and testing sets

vect = CountVectorizer(stop_words='english')
# Count the number of words but remove stop words like a, an, the, you, etc

X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# transform our text into document-term matrices

lr.fit(X_train_dtm, y_train)  # fit to our training set
lr.score(X_test_dtm, y_test)  # score on our testing set
The output is as follows:
0.91193737
So, by using all of the words in our corpus, our model achieves over 91% accuracy. Not bad!
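To get a sense of how meaningful that 91% is, one quick sanity check is to compare it against a model that always guesses the most common star rating. Here is a minimal sketch using scikit-learn’s DummyClassifier on the same document-term matrices (this check is an aside, not part of the main workflow):

from sklearn.dummy import DummyClassifier

# a "model" that ignores the text and always predicts the most common rating
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_dtm, y_train)
dummy.score(X_test_dtm, y_test)  # our 91% is only impressive if it beats this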
Let’s try only using the top 100 used words:
vect = CountVectorizer(stop_words='english', max_features=100)
# Only use the 100 most used words

X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
print(X_test_dtm.shape)  # (1022, 100)

lr.fit(X_train_dtm, y_train)
lr.score(X_test_dtm, y_test)
The output is as follows:
0.8816
Note how our training and testing matrices have 100 columns. This is because I told our vectorizer to only look at the top 100 words. See also that our performance took a hit and is now down to 88% accuracy. This makes sense because we are ignoring over 4,700 words in our corpus.
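If you want to verify that count yourself, the fitted vectorizer keeps its full vocabulary around. A quick sketch (the exact number depends on your corpus):

full_vect = CountVectorizer(stop_words='english')
full_vect.fit(X_train)
len(full_vect.vocabulary_)  # number of unique non-stop words in our training corpus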
Now, let’s take a different approach. Let’s import a PCA module and tell it to make us 100 new super columns and see how that performs:
from sklearn import decomposition

# We will be creating 100 super columns
vect = CountVectorizer(stop_words='english')  # don't ignore any words
pca = decomposition.PCA(n_components=100)  # instantiate a PCA object

X_train_dtm = vect.fit_transform(X_train).todense()
# A dense matrix is required to pass into PCA; this does not affect the overall message
X_train_dtm = pca.fit_transform(X_train_dtm)

X_test_dtm = vect.transform(X_test).todense()
X_test_dtm = pca.transform(X_test_dtm)

print(X_test_dtm.shape)  # (1022, 100)

lr.fit(X_train_dtm, y_train)
lr.score(X_test_dtm, y_test)
The output is as follows:
0.89628
Our matrices still have 100 columns, but those columns are no longer individual words from our corpus; they are 100 brand-new columns built as complex combinations of the original word-count columns. Also, note that using 100 of these new columns gives us better predictive performance than using the top 100 words!
Feature extraction is a great way to use mathematical transformations to construct brand-new columns that generally perform better than simply selecting the best of the original columns beforehand.
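One way to quantify how much of the original information those 100 super columns retain is to inspect the explained variance ratio of the fitted PCA object from above. A quick sketch:

import numpy as np

# fraction of the corpus's total variance captured by each super column
pca.explained_variance_ratio_[:5]

# cumulative fraction of variance captured by all 100 super columns together
np.cumsum(pca.explained_variance_ratio_)[-1]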
But how do we visualize these new super columns? Well, I can think of no better way than to look at an example using image analysis. Specifically, let’s make facial recognition software. OK? OK. Let’s begin by importing some faces given to us by scikit-learn:
from sklearn.datasets import fetch_lfw_people
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images array to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = lfw_people.data
y = lfw_people.target
n_features = X.shape[1]

X.shape  # (1288, 1850)
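Before we start extracting super columns from these faces, it is worth looking at one raw sample. Here is a minimal sketch using matplotlib (assuming it is installed), which reshapes a flattened row of X back into an h x w image:

import matplotlib.pyplot as plt

# each row of X is a flattened h x w grayscale face, so reshape it to display it
plt.imshow(X[0].reshape(h, w), cmap='gray')
plt.title(lfw_people.target_names[y[0]])
plt.show()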