It is usually the data type of the problem (e.g., vector data, list data, etc.) that determines the suitable technique.
Feature Construction is one of the key steps in the data analysis process, largely conditioning the success of any machine learning endeavor.
One should beware of losing information at the feature construction stage. It may be a good idea to add the RAW FEATURES to the preprocessed data and to use FEATURE SELECTION (or REGULARIZATION) in the machine learning stage. The caveat is that adding all those features comes at a price: it increases the dimensionality of the patterns and thereby immerses the relevant information in a sea of possibly irrelevant, noisy, or redundant features, making it harder for a learning machine to find a good solution in the bigger search space.
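As a rough illustration of that idea (not from the tutorial itself), the sketch below appends hypothetical raw features to constructed ones and lets an L1-regularized linear model prune the enlarged feature space; all data and names are made up.

```python
# Sketch: keep the RAW features alongside the constructed ones, then let an
# L1-regularized model (feature selection via regularization) prune the
# enlarged, noisier feature space. All data and names are made up.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 10))         # hypothetical raw features
X_constructed = X_raw ** 2                 # hypothetical constructed features
y = X_raw[:, 0] + 0.5 * X_constructed[:, 1] + 0.1 * rng.normal(size=200)

X_all = np.hstack([X_raw, X_constructed])  # dimensionality doubles

model = LassoCV(cv=5).fit(X_all, y)        # L1 penalty zeroes out weak columns
kept = np.flatnonzero(model.coef_ != 0)
print(f"kept {kept.size} of {X_all.shape[1]} features:", kept)
```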
Data value types can be binary, categorical, or continuous.
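A minimal sketch of routing each value type to a suitable encoding, assuming a pandas/scikit-learn setup; the column names are hypothetical.

```python
# Sketch: one preprocessing route per value type. Column names are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "is_member": [0, 1, 1, 0],                  # binary: usable as-is
    "color": ["red", "blue", "red", "green"],   # categorical: one-hot encode
    "height_cm": [170.0, 182.5, 165.2, 190.1],  # continuous: standardize
})

pre = ColumnTransformer([
    ("binary", "passthrough", ["is_member"]),
    ("categorical", OneHotEncoder(), ["color"]),
    ("continuous", StandardScaler(), ["height_cm"]),
])
X = pre.fit_transform(df)
print(X.shape)  # (4, 5): 1 binary + 3 one-hot + 1 continuous column
```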
To understand the organization of the tutorial, note that there are four aspects of feature extraction (a sketch contrasting them follows this list):
filters: methods that select features without optimizing the performance of a predictor.
wrappers: methods that utilize a learning machine as a black box to score subsets of features according to their predictive power.
embedded methods: methods that perform feature selection in the process of training and are usually specific to given learning machines.
ensembles of wrappers/embedded methods: wrappers and embedded methods may yield very different feature subsets under small perturbations of the dataset; one way of minimizing this effect is to use ensemble methods.
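As a rough, self-contained illustration of the four families (not from the tutorial), the sketch below uses one representative scikit-learn selector per family and approximates the ensemble idea by majority voting over bootstrap resamples; the data is synthetic.

```python
# Sketch: one representative per family, on synthetic data. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Filter: scores features without optimizing any predictor's performance.
filt = SelectKBest(f_classif, k=5).fit(X, y)

# Wrapper: uses a learning machine as a black box to score feature subsets.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: selection happens during training (here via an L1 penalty).
emb = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

# Ensemble: stabilize an embedded selector against data perturbations by
# rerunning it on bootstrap resamples and keeping majority-voted features.
rng = np.random.default_rng(0)
votes = np.zeros(X.shape[1])
for _ in range(20):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    sel = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    ).fit(X[idx], y[idx])
    votes += sel.get_support()
stable = np.flatnonzero(votes >= 10)             # kept in >= half the runs

for name, mask in [("filter", filt.get_support()),
                   ("wrapper", wrap.get_support()),
                   ("embedded", emb.get_support())]:
    print(name, np.flatnonzero(mask))
print("ensemble", stable)
```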
Some basic techniques, and discussion of how feature selection can be done and why it is done this way:
Individual Relevance Ranking
Pearson correlation coefficient
Fisher criterion (a ranking sketch covering both criteria follows below)
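A minimal sketch of individual relevance ranking with both criteria; the Fisher score is written in one common two-class form, (mu1 - mu0)^2 / (var1 + var0), which is an assumption rather than the notes' exact definition.

```python
# Sketch: rank each feature individually by |Pearson correlation| with the
# target, and by a two-class Fisher score. Data is synthetic.
import numpy as np

def pearson_scores(X, y):
    """|Pearson correlation| between each column of X and the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

def fisher_scores(X, y):
    """Two-class Fisher score per feature (one common form; an assumption)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X1.mean(axis=0) - X0.mean(axis=0)) ** 2 / (
        X1.var(axis=0) + X0.var(axis=0)
    )

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 8))
X[:, 0] += y                     # make feature 0 genuinely relevant

print("Pearson ranking:", np.argsort(-pearson_scores(X, y)))
print("Fisher ranking: ", np.argsort(-fisher_scores(X, y)))
```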
Ranking criteria can also be applied to linearly transformed features; PCA can be used to perform such a linear transformation.
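A short sketch of that idea, assuming scikit-learn's PCA: transform the inputs, then rank the resulting components with the same individual criterion.

```python
# Sketch: rank PCA components instead of raw inputs. Illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

Z = PCA(n_components=4).fit_transform(X)   # linear transformation of inputs
corr = [abs(np.corrcoef(Z[:, j], y)[0, 1]) for j in range(Z.shape[1])]
print("component ranking:", np.argsort(corr)[::-1])
```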
Multivariate Feature Ranking
Eliminating meaningless features is not critical.
A filter as simple as the Pearson correlation coefficient proves to be very effective