Feature extraction and PCA

Put another way, the COD states that as we increase the number of feature columns, we need exponentially more data to maintain the same level of model performance. This is because, in high-dimensional spaces, even the nearest neighbors of a given data point can be very far away, making it difficult to make good predictions. High dimensionality also increases the risk of overfitting, as the model may start to fit noise in the data rather than the actual signal.

Moreover, with more dimensions, the volume of the space increases so rapidly that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a reliable result, the amount of data needed to support the analysis often grows exponentially with the dimensionality.

We can see clearly that the number of points within a single unit of one another goes down dramatically as we introduce more and more columns. And this is only the first 100 columns!
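
If you want to reproduce a plot like this yourself, a minimal simulation along these lines will do. The sample size of 1,000 points and the specific column counts here are my own choices for illustration: we count how many pairs of uniformly drawn points sit within a single unit of each other as we add columns.

import numpy as np
from scipy.spatial.distance import pdist

np.random.seed(0)
n_points = 1000  # assumed sample size for this sketch

for d in (1, 2, 3, 10, 50, 100):
    # draw n_points uniformly from a d-dimensional unit hypercube
    points = np.random.uniform(size=(n_points, d))
    # pairwise Euclidean distances between every pair of points
    distances = pdist(points)
    # fraction of pairs that lie within a single unit of one another
    close = (distances < 1).mean()
    print(f"{d:>3} columns: {close:.1%} of pairs within 1 unit")

As the number of columns grows, the fraction of close pairs collapses toward zero, which is exactly the sparsity the COD warns us about.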

All of the space that we add by considering new columns makes it harder for the finite number of points we have to stay happily within range of each other. We would have to add more points to fill in this gap. And that, my friends, is why we should consider using dimension reduction.

The COD can be mitigated either by adding more data points (which is not always possible) or by implementing dimension reduction. Dimension reduction is simply the act of reducing the number of columns in our dataset, not the number of rows. There are two ways of implementing dimension reduction:

  • Feature selection: This is the act of creating a subset of our column features and only using the best features
  • Feature extraction: This is the act of mathematically transforming our feature set into a new extracted coordinate system

We are already familiar with feature selection: it is the process of saying, for example, that the Embarked_Q column is not helping our decision tree, so let's get rid of it and see how the model performs without it. It is literally when we (or the machine) decide to ignore certain columns.
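
As a quick sketch of what that looks like in code (titanic_X and titanic_y here are hypothetical names standing in for the one-hot encoded Titanic features and labels from the earlier decision tree example):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# titanic_X / titanic_y are assumed to already exist (features and labels
# from the earlier Titanic decision tree example)
tree = DecisionTreeClassifier(random_state=0)

# baseline: cross-validated accuracy with every column included
baseline = cross_val_score(tree, titanic_X, titanic_y, cv=5).mean()

# feature selection at its simplest: drop the column we suspect is unhelpful
reduced = cross_val_score(tree, titanic_X.drop(columns=['Embarked_Q']),
                          titanic_y, cv=5).mean()

print(f"all columns: {baseline:.3f}  without Embarked_Q: {reduced:.3f}")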

Feature extraction is a bit trickier.

In feature extraction, we use fairly complicated mathematical formulas to obtain new super columns that are usually better than any single original column.

Our primary model for doing so is called principal component analysis (PCA). PCA extracts a set number of super columns in order to represent our original data with far fewer columns. Let's take a concrete example. Previously, I mentioned a text dataset with 4,086 rows and over 18,000 columns. That dataset is actually a set of Yelp online reviews:


import pandas as pd

url = '../data/yelp.csv'
yelp = pd.read_csv(url, encoding='unicode-escape')

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# define X (raw review text) and y (True for 5-star reviews, False for 1-star)
X = yelp_best_worst.text
y = yelp_best_worst.stars == 5
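
To see where those 18,000-plus columns come from, and what PCA does with them, here is a rough sketch. CountVectorizer and two components are my own choices for illustration; the actual analysis may use different settings.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# turn each review into a row of word counts, one column per unique word,
# which is where the thousands of columns come from
vect = CountVectorizer()
X_counts = vect.fit_transform(X)
print(X_counts.shape)  # (number of reviews, number of unique words)

# extract a small number of super columns that summarize the originals
# (PCA needs a dense array; TruncatedSVD can work on the sparse matrix directly)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_counts.toarray())
print(X_pca.shape)  # (number of reviews, 2)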