So you've heard of feature engineering, and you're wondering how it applies to you and your work? We've got you covered in this article. Raw data often contains irrelevant features that can negatively impact the accuracy of machine learning models. We obviously don't want this!
This kind of data is often referred to as 'noisy data' and may result from human error, program error, or system failures. Studies suggest that data scientists spend around 80% of their time cleaning and preprocessing data. Feature engineering is one of the key techniques in this process, and it's what you need to become a better engineer, so let's get into it.
What is feature engineering?
Feature engineering is the process of selecting, extracting, and organising features from a raw data set using various data mining techniques. Besides getting the data ready to be fed into the model, feature engineering also significantly improves the performance and accuracy of the model. In addition, feature engineering involves creating new variables or features from existing ones.
If you're wondering what a feature is, we'll tell you. A feature, also referred to as an attribute or variable, is a specific, measurable characteristic of the aspect under study. Selecting appropriate and independent features is a fundamental practice that significantly influences the accuracy of models that you're training.
Importance of Feature Engineering.
The primary goal of feature engineering is to improve model accuracy, allowing organisations to solve a business problem in the best possible way. This also means that predictive algorithms will be more reliable and better placed to provide insight into a company's next course of action.
Fitting models to raw (noisy) data is known to result in overfitting and in overly complex models that try to accommodate the errors in the data. Feature engineering, alongside other data preprocessing techniques, makes models simpler to code, maintain, and comprehend.
Feature engineering also significantly improves the quality of your data, which translates into models you can rely on to solve a particular problem.
Feature Engineering Processes.
Feature engineering is a multi-step technique that mainly involves four processes. These processes are aimed at accomplishing different tasks, and they include:
Feature creation - This is the process of creating new variables from existing ones by using various methods such as division and addition. Feature creation can also involve getting rid of some features.
Feature selection - Data sets often contain several variables/features. While some are useful, others may be redundant and will only hurt the accuracy of your models. Therefore, you need to select the subset of features that are relevant to the model you're building. This process is known as feature selection.
Feature transformation - This is the process of modifying existing features using mathematical operations to increase accuracy and reduce the error range.
Feature extraction - An automated process under feature engineering aimed at reducing the dimensionality of the data set. This process reduces the number of features without affecting data quality, making it easier for data scientists to develop models more efficiently. Techniques used here include principal component analysis (PCA), often alongside exploratory data analysis (EDA).
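To make feature extraction concrete, here's a minimal sketch of PCA-style dimensionality reduction using only NumPy. All the numbers and shapes are made up for illustration; in practice you'd typically reach for a library implementation such as scikit-learn's PCA.

```python
import numpy as np

# Toy data set: 100 samples, 5 correlated features (hypothetical values).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 columns, rank ~2

# Centre the data, then use SVD to find the principal components.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Keep the top 2 components: 5 features reduced to 2.
X_reduced = X_centred @ Vt[:2].T
print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 2)
```

Because three of the five columns are linear combinations of the first two, almost all of the variance survives the reduction.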
Feature Engineering Techniques.
1. Handling Outliers.
Outliers are data points that deviate markedly from the rest of the data. They can result from an anomaly during data collection or can occur naturally. Handling outliers is part of the feature engineering process and is aimed at improving model performance.
The first step in handling outliers is detecting them, which you can do using visualisation tools such as a scatter plot. If you're fancy, you can also use statistical techniques such as the Interquartile Range (IQR). Once detected, outliers can be handled by removal, capping, or replacement.
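As a sketch, here's how IQR-based detection and capping might look in NumPy. The values are made up for illustration:

```python
import numpy as np

# Hypothetical feature with one obvious outlier.
values = np.array([12.0, 14.5, 13.2, 15.1, 14.0, 13.8, 95.0])

# Interquartile Range (IQR): anything beyond 1.5 * IQR from the
# quartiles is flagged as an outlier.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
capped = np.clip(values, lower, upper)  # one way to handle them: capping
print(outliers)  # [95.]
```

The 1.5 multiplier is the conventional choice, but it's a tunable threshold rather than a law; widen it if your data legitimately has heavy tails.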
2. Handling Missing Values.
Missing values are typical of any real-world data set. Some of the reasons you could have missing values in your data include device malfunction, an error during data entry, or failure of the respondent to record a response. It happens to the best of us!
Missing values introduce bias and can also negatively impact the performance of a machine learning model. Imputation is one of the techniques used to handle missing values, and it involves substituting them with estimated values. It can be categorised into two types, depending on the values we're imputing. These are:
- Numerical imputation involves replacing missing values with a default value, or with the mean or median of the respective column.
- Categorical imputation involves replacing missing values with the most frequent value (the mode) of the respective column.
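Both kinds of imputation can be sketched in a few lines of pandas. The data set here is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with gaps in a numerical and a categorical column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Nairobi", "Mombasa", None, "Nairobi", "Nairobi"],
})

# Numerical imputation: fill missing ages with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical imputation: fill missing cities with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

The median is often preferred over the mean for numerical imputation because it's robust to the very outliers discussed in the previous section.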
3. Logarithm Transformation.
Also known as log transformation, this is one of the most widely used transformations in machine learning. Data can be transformed for many reasons. Among them is convenience, for example representing data in percentiles rather than the original values.
Log transformation, in particular, is widely used to deal with skewed data. Since symmetrical data is much easier to work with, it's often worth trying to reduce the skew. In addition, many algorithms perform better on data with a near-normal distribution.
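Here's a minimal sketch of a log transformation compressing a skewed feature. The income figures are invented for illustration:

```python
import numpy as np

# A right-skewed feature (e.g. incomes); values are hypothetical.
incomes = np.array([30_000, 35_000, 40_000, 45_000, 1_200_000], dtype=float)

# log1p (log(1 + x)) handles zeros safely and compresses the long tail.
log_incomes = np.log1p(incomes)

# The extreme value is pulled far closer to the rest of the data.
print(incomes.max() / incomes.min())        # 40x spread before
print(log_incomes.max() / log_incomes.min())  # under 1.5x spread after
```

Note that log transformation only applies to non-negative data; `log1p` is a common choice precisely because it tolerates zeros.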
4. Feature Splitting.
As the name suggests, feature splitting involves creating new features by splitting a pre-existing feature into two. This technique is known to make it easier for algorithms to pick up patterns in the data. During the feature splitting process, it's also easier to cluster data into groups and discover more about the data set you're working with. Executed well, feature splitting can boost the performance of your models.
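A quick sketch of feature splitting with pandas, using a made-up column that packs two pieces of information into one string:

```python
import pandas as pd

# Hypothetical column combining manufacturer and model name in one string.
df = pd.DataFrame({"car": ["Toyota Corolla", "Honda Civic", "Ford Focus"]})

# Split one feature into two: the make and the model name.
df[["make", "model"]] = df["car"].str.split(" ", n=1, expand=True)

print(df)
```

After the split, the new `make` column can be used on its own, for example to group cars by manufacturer.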
5. Binning.
Data binning is a technique that groups feature values into intervals known as 'bins'. This approach aims to reduce the impact of noisy data that often leads to overfitting. Overfitting occurs when an algorithm fits the training data almost perfectly but performs poorly on unseen data.
Besides improving the performance of a model, binning can also be used to identify outliers and missing values.
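Binning can be sketched in one call with pandas. The ages and bin boundaries below are hypothetical:

```python
import pandas as pd

# Hypothetical continuous feature to be grouped into bins.
ages = pd.Series([5, 17, 23, 35, 52, 70])

# Cluster the continuous values into labelled bins.
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young adult", "adult", "senior"])

print(bins.tolist())
```

The model then sees four coarse categories instead of raw ages, which smooths out small fluctuations at the cost of some precision.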
And that's it! Well, at least for the basics of feature engineering - we'll be back soon with an even more in-depth look into this. You're also in luck: many modern tools exist that automate feature engineering, and one such framework is Featuretools. Automated feature engineering can be more efficient and less prone to errors, so we hope you give it a try in your projects!
Like what you've read or want more like this? Let us know! Email us here or DM us: Twitter, LinkedIn, Facebook, we'd love to hear from you.