In this post, we introduce the basic process of data preprocessing (also known as data cleaning). Data preprocessing is not a negligible step, and this topic is discussed at length in the slides from Indiana and WashU.

Know Your Data

It is important to take a glance at your data before preprocessing or training; otherwise, after spending several days training a model, you may find that an impressive precision actually comes from a high proportion of 1s in your labels.


Normalization

Normalizing is the process of scaling a vector to unit length. Normalizing is an important step according to paper [1]. sklearn provides the normalize function to scale input vectors individually to unit norm (vector length).
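A minimal sketch of row-wise normalization with sklearn's normalize, on a made-up two-sample matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Scale each row (sample) to unit L2 norm: x / ||x||_2.
X_unit = normalize(X, norm="l2")
```

By default normalize works per sample (per row); the first row [3, 4] becomes [0.6, 0.8].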

Missing Data

Missing data is a very common problem in practical datasets. pandas has a discussion on working with missing data.
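The basic pandas tools for missing data are isna, dropna, and fillna; a small sketch on a toy frame (column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values.
df = pd.DataFrame({"age": [25.0, np.nan, 31.0],
                   "income": [50.0, 62.5, np.nan]})

missing_per_col = df.isna().sum()   # count missing values per column
df_drop = df.dropna()               # drop rows containing any NaN
df_fill = df.fillna(df.mean())      # impute each column with its mean
```

Whether to drop or impute depends on how much data is missing and whether it is missing at random; mean imputation is only the simplest option.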

Imbalanced Classes

The problem of imbalanced classes is very common in practice; here we discuss how to handle imbalanced classes.

A Python library for this is imbalanced-learn.
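Random oversampling of the minority class is one of the simplest remedies; imbalanced-learn packages it as RandomOverSampler, but the idea can be sketched with NumPy alone (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced problem: 9 negatives, 1 positive.
y = np.array([0] * 9 + [1])
X = np.arange(10, dtype=float).reshape(-1, 1)

# Resample the minority class with replacement until both
# classes have the same number of samples.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
```

Oversampling must be applied only to the training split, never before the train/test split, or the test set will leak duplicated samples.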

Removing Outliers

Whether we should remove outliers from the dataset is a tricky question. Some comments here argue that outliers should not be removed from the data, or that we can conduct a sensitivity analysis with and without them.

Some methods to remove outliers from the data are summarized here.

A good practice is to first make a boxplot of the data to get a general picture of it.

Standard Deviation

A very simple method is to assume all samples follow a normal distribution and remove outliers located more than \(3\sigma\) away from the mean \(\mu\), where \(\sigma\) is the standard deviation of the sample data. That is,

\[X = \{\, x \in X_{\text{raw}} \mid \mu - 3\sigma < x < \mu + 3\sigma \,\}\]
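The rule above is a one-liner in NumPy; here is a sketch on synthetic data with two planted outliers:

```python
import numpy as np

rng = np.random.default_rng(42)
# 1000 standard-normal samples plus two planted outliers.
x = np.concatenate([rng.normal(0.0, 1.0, 1000), [8.0, -9.0]])

mu, sigma = x.mean(), x.std()
# Keep only samples within mu ± 3*sigma.
kept = x[(x > mu - 3 * sigma) & (x < mu + 3 * sigma)]
```

Note that the planted outliers inflate \(\sigma\) itself, which is one reason robust alternatives such as the boxplot rule below are often preferred.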


The boxplot is another very helpful method since it makes no distributional assumptions nor does it depend on a mean or standard deviation.
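The boxplot's whisker rule (Tukey's fences) can be applied directly as a filter: drop points outside \([Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR}]\). A sketch on made-up data:

```python
import numpy as np

x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 50.0])  # 50.0 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences, as drawn by a boxplot
kept = x[(x >= lo) & (x <= hi)]
```

Because quartiles are barely affected by a few extreme values, this rule is far more robust than the \(3\sigma\) cut.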

According to WashU, clustering is a good way to detect outliers in a dataset. According to a paper, k-means can be a practical clustering method, and sklearn provides the KMeans class to run k-means clustering quickly.
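One simple clustering-based detector, sketched with sklearn's KMeans on synthetic data: cluster the points, then flag samples unusually far from their own cluster centre (the threshold below is a heuristic, not a fixed rule):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight clusters plus one far-away point at index 100.
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2)),
               [[20.0, 20.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of each sample to its assigned cluster centre.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
# Flag points whose distance is far above the typical value.
outliers = np.flatnonzero(dist > dist.mean() + 3 * dist.std())
```

The number of clusters and the distance threshold both have to be chosen for the data at hand, which is the main weakness of this approach.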

Local Outlier Factor (LOF) is another classical method for outlier detection on moderately high-dimensional datasets.
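sklearn ships LOF as LocalOutlierFactor; fit_predict labels inliers +1 and outliers -1. A sketch on synthetic data with one planted outlier:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# One dense cluster plus a single far-away point at index 100.
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), [[6.0, 6.0]]])

# fit_predict returns +1 for inliers and -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)
outlier_idx = np.flatnonzero(labels == -1)
```

Unlike the global \(3\sigma\) rule, LOF compares each point's local density to that of its neighbours, so it can catch outliers relative to clusters of different densities.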


[1] Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157-1182.