Standardize

A function that performs column-based standardization on a NumPy array.

from mlxtend.preprocessing import standardize

Overview

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll have the properties of a standard normal distribution with

and .

where is the mean (average) and is the standard deviation from the mean; standard scores (also called z scores) of the samples are calculated as

Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for the optimal performance of many machine learning algorithms.

One family of algorithms that is scale-invariant encompasses tree-based learning algorithms. Let's take the general CART decision tree algorithm. Without going into much depth regarding information gain and impurity measures, we can think of the decision as "is feature x_i >= some_val?" Intuitively, we can see that it really doesn't matter on which scale this feature is (centimeters, Fahrenheit, a standardized scale -- it really doesn't matter).

Some examples of algorithms where feature scaling matters are:

There are many more cases than I can possibly list here ... I always recommend you to think about the algorithm and what it's doing, and then it typically becomes obvious whether we want to scale your features or not.

In addition, we'd also want to think about whether we want to "standardize" or "normalize" (here: scaling to [0, 1] range) our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multi-layer perceptron with tanh activation units to 0 or small random values centered around zero, we want to update the model weights "equally." As a rule of thumb I'd say: When in doubt, just standardize the data, it shouldn't hurt.

Example 1 - Standardize a Pandas DataFrame

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=(range(6)))
s2 = pd.Series([10, 9, 8, 7, 6, 5], index=(range(6)))
df = pd.DataFrame(s1, columns=['s1'])
df['s2'] = s2
df
s1 s2
0 1 10
1 2 9
2 3 8
3 4 7
4 5 6
5 6 5
from mlxtend.preprocessing import standardize
standardize(df, columns=['s1', 's2'])
s1 s2
0 -1.46385 1.46385
1 -0.87831 0.87831
2 -0.29277 0.29277
3 0.29277 -0.29277
4 0.87831 -0.87831
5 1.46385 -1.46385

Example 2 - Standardize a NumPy Array

import numpy as np

X = np.array([[1, 10], [2, 9], [3, 8], [4, 7], [5, 6], [6, 5]])
X
array([[ 1, 10],
       [ 2,  9],
       [ 3,  8],
       [ 4,  7],
       [ 5,  6],
       [ 6,  5]])
from mlxtend.preprocessing import standardize
standardize(X, columns=[0, 1])
array([[-1.46385011,  1.46385011],
       [-0.87831007,  0.87831007],
       [-0.29277002,  0.29277002],
       [ 0.29277002, -0.29277002],
       [ 0.87831007, -0.87831007],
       [ 1.46385011, -1.46385011]])

Example 3 - Re-using parameters

In machine learning contexts, it is desired to re-use the parameters that have been obtained from a training set to scale new, future data (including the independent test set). By setting return_params=True, the standardize function returns a second object, a parameter dictionary containing the column means and standard deviations that can be re-used by feeding it to the params parameter upon function call.

import numpy as np
from mlxtend.preprocessing import standardize

X_train = np.array([[1, 10], [4, 7], [3, 8]])
X_test = np.array([[1, 2], [3, 4], [5, 6]])

X_train_std, params = standardize(X_train, 
                                  columns=[0, 1], 
                                  return_params=True)
X_train_std
array([[-1.33630621,  1.33630621],
       [ 1.06904497, -1.06904497],
       [ 0.26726124, -0.26726124]])
params
{'avgs': array([ 2.66666667,  8.33333333]),
 'stds': array([ 1.24721913,  1.24721913])}
X_test_std = standardize(X_test, 
                         columns=[0, 1], 
                         params=params)
X_test_std
array([[-1.33630621, -5.0779636 ],
       [ 0.26726124, -3.47439614],
       [ 1.87082869, -1.87082869]])

API

standardize(array, columns=None, ddof=0, return_params=False, params=None)

Standardize columns in pandas DataFrames.

Parameters

Notes

If all values in a given column are the same, these values are all set to 0.0. The standard deviation in the parameters dictionary is consequently set to 1.0 to avoid dividing by zero.

Returns

Examples

For usage examples, please see http://rasbt.github.io/mlxtend/user_guide/preprocessing/standardize/