Feature extraction and PCA
A common problem when working with data, particularly in ML, is having an overwhelming number of columns and not enough rows to support that many columns.
A great example of this came up in our naïve Bayes example earlier, when we were looking at the phrase send cash now. Remember that we had literally zero instances of texts with that exact phrase? In that case, we turned to a naïve assumption that allowed us to extrapolate a probability for both of our categories.
The reason we had this problem in the first place is because of something called the curse of dimensionality (COD). The COD basically says that as we introduce new feature columns, we need exponentially more rows (data points) to consider the increased number of possibilities.
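To get a feel for how quickly this blows up, here is a minimal sketch (a made-up illustration, not part of the original example): if every feature were split into just 10 bins, the number of distinct "cells" of feature space we would need at least one data point for grows exponentially with the number of features:
for num_features in [1, 2, 3, 5, 10]:
    # 10 bins per feature means 10 ** num_features possible cells
    print(num_features, 'feature(s) ->', 10 ** num_features, 'cells to cover')
# 1 feature(s) -> 10 cells to cover
# 2 feature(s) -> 100 cells to cover
# 3 feature(s) -> 1000 cells to cover
# 5 feature(s) -> 100000 cells to cover
# 10 feature(s) -> 10000000000 cells to cover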
Consider an example where we attempt to use a learning model that relies on the distance between points, applied to a corpus of 4,086 pieces of text that has been count-vectorized using scikit-learn. Now, let’s do an experiment. I will first consider a single word as the only dimension of our text. Then, I will count how many pieces of text are within 1 unit of each other. For example, if two sentences both contain that word (the same number of times), they would be 0 units away from one another and, similarly, if neither of them contains the word, they would also be 0 units apart.
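In the following snippets, X is the count-vectorized corpus stored as a DataFrame (it was built earlier in the chapter). As a minimal sketch of how such an X could be constructed, assuming a variable texts holding the raw documents, it might look like this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# texts is assumed to hold the 4,086 raw pieces of text
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)
# keep the counts in a DataFrame so we can slice columns with .iloc below
X = pd.DataFrame(counts.todense(), columns=vectorizer.get_feature_names_out())
With X in place, we can begin with a single word as the only dimension: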
d = 1
# Let’s look for points within 1 unit of one another
X_first_word = X.iloc[:,:1]
# Only looking at the first column, but ALL of the rows
from sklearn.neighbors import NearestNeighbors
# this module will calculate for us distances between each point
neigh = NearestNeighbors(n_neighbors=4086)
neigh.fit(X_first_word)
# tell the module to calculate each distance between each point
A = neigh.kneighbors_graph(X_first_word, mode='distance').todense()
# This matrix holds all of the distances (over 16 million of them)
num_points_within_d = (A < d).sum()
# Count the number of pairs of points within 1 unit of distance, 16,258,504
Important note
Note that we have 16,695,396 (4086*4086) distances to scan over.
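As a quick sanity check (using the A matrix from the snippet above), the shape of the dense distance matrix confirms this count:
A.shape
# (4086, 4086) – one row and one column per text, and 4086 * 4086 = 16,695,396 entries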
So, 16.2 million pairs of texts are within a single unit of distance. Now, let’s try again with the first two words:
X_first_two_words = X.iloc[:,:2]
neigh = NearestNeighbors(n_neighbors=4086)
neigh.fit(X_first_two_words)
A = neigh.kneighbors_graph(X_first_two_words, mode='distance').todense()
num_points_within_d = (A < d).sum()
# num_points_within_d is now 16,161,970
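To see why close pairs disappear as columns are added, here is a minimal, hypothetical sketch (the word counts are invented and not taken from the real corpus) showing that adding a coordinate can only push two points further apart:
import numpy as np
# distance between two texts using only one word column
a, b = np.array([2]), np.array([0])
print(np.linalg.norm(a - b))
# 2.0
# add a second word column on which the two texts also differ
a2, b2 = np.array([2, 1]), np.array([0, 3])
print(np.linalg.norm(a2 - b2))
# roughly 2.83 – the same pair is now further apart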
By considering a single new column, we lost about 100,000 pairs of points that were within a single unit of distance. This is because every dimension we add puts more space between points: the distance between two texts can never shrink when we consider an extra column; it can only stay the same or grow. Let’s take this test a step further and calculate this number for the first 100 words, and then plot the results:
num_columns = range(1, 101)
# Looking at the first 100 columns
points = []
# We will be collecting the number of points within 1 unit for a graph
neigh = NearestNeighbors(n_neighbors=X.shape[0])
for subset in num_columns:
    X_subset = X.iloc[:,:subset]
    # look at the first column, then the first two columns, then the first three columns, etc.
    neigh.fit(X_subset)
    A = neigh.kneighbors_graph(X_subset, mode='distance').todense()
    num_points_within_d = (A < d).sum()
    # calculate the number of points within 1 unit and store it for the graph
    points.append(num_points_within_d)
Now, let’s plot the number of points within 1 unit versus the number of dimensions we consider in Figure 11.21:
Figure 11.21 – The COD says that as we increase the number of feature columns in our dataset, data points become further away from each other due to the increase in high-dimensional space
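Figure 11.21 shows the resulting curve; a minimal matplotlib sketch that would produce a similar chart (assuming the num_columns and points variables collected in the loop above) might look like this:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(list(num_columns), points)
plt.title('Pairs of texts within 1 unit vs. number of dimensions considered')
plt.xlabel('Number of word columns (dimensions)')
plt.ylabel('Number of pairs within 1 unit of each other')
plt.show()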