In this post, we introduce the basic process of data preprocessing (or data cleaning). Data preprocessing is not negligible, and the topic is discussed in detail in the slides from Indiana and WashU.

Know Your Data

It is important to have a glance at your data before preprocessing or training; otherwise you may spend several days training a model only to find that its impressive precision simply reflects a high rate of the label 1 in your data.
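As a quick sketch of this first look (the DataFrame below is a made-up toy example), pandas' describe and value_counts reveal both the scale of each feature and any label imbalance before training starts:

```python
import pandas as pd

# Hypothetical toy dataset: a binary label column that is heavily skewed.
df = pd.DataFrame({
    "feature": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9],
    "label":   [1,   1,   1,    1,   1,   0],
})

# Summary statistics show the scale and spread of each numeric column.
print(df.describe())

# Class proportions expose label imbalance at a glance.
print(df["label"].value_counts(normalize=True))
```

Here a model predicting the constant label 1 would already reach about 83% accuracy, which is exactly the trap described above.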

Normalizing

Normalizing is the process of scaling each vector to unit length. It is an important step according to paper [1]. sklearn provides the normalize function to scale input vectors individually to unit norm (vector length).
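A minimal example of sklearn's normalize, using a small made-up matrix so the result is easy to check by hand:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Scale each row (sample) to unit L2 norm.
X_unit = normalize(X, norm="l2")

print(X_unit)                           # [[0.6, 0.8], [1.0, 0.0]]
print(np.linalg.norm(X_unit, axis=1))   # every row now has length 1
```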

Missing Data

Missing data is a very common problem in practical datasets. Pandas has a discussion on working with missing data.
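Two common strategies from the pandas toolkit, dropping incomplete rows versus imputing with a column statistic, sketched on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, np.nan]})

dropped = df.dropna()          # keep only rows with no missing values
filled = df.fillna(df.mean())  # impute NaNs with the per-column mean
```

Dropping is safest when missingness is rare; mean imputation keeps all rows but can distort the distribution, so the right choice depends on why the data is missing.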

Imbalanced Classes

The problem of imbalanced classes is very common in practical work; here we discuss how to handle imbalanced classes.

A Python library for this is imbalanced-learn.
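imbalanced-learn packages several resampling strategies; the simplest of them, random oversampling of the minority class, can be sketched in plain numpy (the data below is a toy example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 8 majority-class (0) vs 2 minority-class (1) samples.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of samples.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # -> [8 8]
```

imbalanced-learn's RandomOverSampler implements the same idea behind a fit_resample API, alongside smarter schemes such as SMOTE.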

Removing Outliers

Whether we should remove the outliers from the dataset is a tricky problem. Some comments here argue that outliers should not be removed from the data, or that we can conduct a sensitivity analysis with and without those outliers.

Some methods for removing outliers from the data are summarized here.

A good practice is to first make a boxplot of the data to get a general picture of it.

Standard Deviation

A very simple method is to assume the samples follow a normal distribution and remove the outliers located more than \(3\sigma\) away from the mean \(\mu\), where \(\sigma\) is the standard deviation of the sample data. That is

\[X =\{ x\in X_{\text{raw}} \mid \mu-3\sigma < x < \mu+3\sigma \}\]
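A minimal numpy sketch of this rule, with two obvious outliers injected into synthetic normal data for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=1000)
x = np.append(x, [10.0, -12.0])  # inject two obvious outliers

# Keep only samples within 3 standard deviations of the mean.
mu, sigma = x.mean(), x.std()
mask = (x > mu - 3 * sigma) & (x < mu + 3 * sigma)
x_clean = x[mask]
```

Note that the injected outliers inflate \(\mu\) and \(\sigma\) themselves, which is one reason robust alternatives like the boxplot rule below are often preferred.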

Boxplot

The boxplot is another very helpful method since it makes no distributional assumptions nor does it depend on a mean or standard deviation.
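The boxplot flags points beyond 1.5 times the interquartile range (IQR) from the quartiles; the same rule is easy to apply directly with numpy (toy data below):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

# Boxplot rule: whiskers extend 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
x_clean = x[(x >= lower) & (x <= upper)]  # 50.0 is dropped
```

Because quartiles are insensitive to extreme values, the injected outlier does not shift the fences the way it shifts a mean and standard deviation.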

According to WashU, clustering is a good method for detecting outliers in a dataset. According to a paper, k-means can be a practical clustering method; sklearn provides the KMeans class for running K-Means clustering quickly.
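One common way to turn K-Means into an outlier detector is to flag points unusually far from their assigned cluster centre. The synthetic data and the 3-standard-deviation threshold below are illustrative choices, not a fixed recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight clusters plus one far-away point (index 100).
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(5.0, 0.1, size=(50, 2)),
    [[20.0, 20.0]],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from each point to its assigned cluster centre; points far
# from every centre are outlier candidates.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
threshold = dist.mean() + 3 * dist.std()
outliers = np.flatnonzero(dist > threshold)
```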

Local Outlier Factor (LOF) is another classical method to perform outlier detection on moderately high dimensional datasets.
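sklearn implements this as LocalOutlierFactor, whose fit_predict marks predicted outliers with -1. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A dense cluster plus one obvious outlier (the last row).
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               [[8.0, 8.0]]])

# LOF compares each point's local density to that of its neighbours;
# fit_predict returns -1 for points flagged as outliers.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
```

Unlike the global 3-sigma rule, LOF is local: a point is anomalous relative to the density of its own neighbourhood, which makes it usable when clusters have different densities.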

Ref.

[1] Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157-1182.