Similarity

Normalization

In this section of the Similarity tutorial, you will learn how to scale a distance or similarity used as a performance index into the range from 0 to 1, written [0, 1] for short. The process of transforming an index from its original values into the range [0, 1] is called normalization. I will also briefly discuss statistical normalization in this section.

Suppose the dissimilarity index lies in the range [$d_{\min}$, $d_{\max}$] rather than [0, 1], and we want to transform it into the range [0, 1]. Let $d$ denote the original dissimilarity and $\delta$ the normalized dissimilarity.

There are several ways to normalize an index. In principle, to map a sequence of numbers into the range [0, 1] we need to make them non-negative and divide by something that is at least as large as the numerator. Using this principle, we can use any inequality to normalize the index. The following simple transformations can be used in a wide range of applications. Pay attention to the condition attached to each transformation.

See also the sections on Statistical Normalization and Normalizing Negative Data below.

Mathematical Normalization Methods

1. One way to normalize an index $d$ is to use the function

$$\delta = \frac{d}{\sqrt{d^2 + \alpha}} \qquad (1)$$

The value of $\delta$ lies in the range -1 to +1 for $\alpha > 0$. Equation (1) can easily be mapped to the range [0, 1] by the transformation

$$\delta' = \frac{\delta + 1}{2} \qquad (2)$$

It gives

$$\delta' = \frac{1}{2}\left(\frac{d}{\sqrt{d^2 + \alpha}} + 1\right) \qquad (3)$$

[Figure: normalized value $\delta'$ against $d$ for several values of the smoothing parameter $\alpha$]

Setting a higher value of $\alpha$ makes the curve of $\delta'$ against $d$ smoother, as shown in the figure above. In general, $d = 0$ produces $\delta' = 0.5$, and if $d < 0$, then $\delta' < 0.5$. For $\alpha = 0$, the function produces only the binary values 0 and +1, with a discontinuity at $d = 0$; thus $\alpha = 0$ should not be used. The value of the smoothing parameter $\alpha$ depends on how smooth we want the curve to be and how large the values of $d$ are. For $\alpha > 0$, the values of $\delta'$ in equation (3) can only reach 0 or 1 asymptotically.

For example, with $d = 3$ and $\alpha = 16$, then $\delta' = \frac{1}{2}\left(\frac{3}{5} + 1\right) = 0.8$.
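As a rough illustration of method 1, the sketch below assumes that equation (1) has the soft-sign form d / sqrt(d^2 + alpha). Both that form and the function name `soft_sign_normalize` are assumptions made for this example, not something the tutorial confirms.

```python
import math


def soft_sign_normalize(d: float, alpha: float = 1.0) -> float:
    """Map any real index d into the open interval (0, 1).

    Assumes equation (1) is d / sqrt(d^2 + alpha); alpha is the
    smoothing parameter and must be positive.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive; alpha = 0 gives a step function")
    delta = d / math.sqrt(d * d + alpha)  # equation (1): value in (-1, 1)
    return (delta + 1.0) / 2.0            # equation (2): value in (0, 1)


# Larger alpha flattens the curve around d = 0.
for alpha in (1.0, 16.0, 100.0):
    row = [round(soft_sign_normalize(d, alpha), 3) for d in (-10, -3, 0, 3, 10)]
    print(alpha, row)
```

Under this assumption, d = 0 always maps to 0.5, and the output approaches 0 or 1 only as d grows very large in magnitude.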

2. If we know the maximum and minimum values of our index, then the transformation

$$\delta = \frac{d - d_{\min}}{d_{\max} - d_{\min}} \qquad (4)$$

maps it into the range [0, 1]. If $d = d_{\max}$, then $\delta = 1$. If $d = d_{\min}$, then $\delta = 0$. Special care must be taken to avoid division by zero when $d_{\max} - d_{\min}$ is zero. If the value of our index is always zero or positive and we know its maximum value, then we can set $d_{\min} = 0$, and equation (4) simplifies to

$$\delta = \frac{d}{d_{\max}} \qquad (5)$$

The graph of $\delta$ is linear in $d$, with a slope that depends on $d_{\max}$.

[Figure: graph of $\delta$ against $d$ for equation (5)]
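The min-max transformation of method 2 can be sketched in a few lines of Python. The function names `min_max_normalize` and `max_normalize` are hypothetical, and the sketch assumes the minimum and maximum are taken from the data itself rather than known in advance.

```python
def min_max_normalize(values):
    """Equation (4): (d - d_min) / (d_max - d_min) for each d."""
    d_min, d_max = min(values), max(values)
    if d_max == d_min:
        raise ValueError("all values are equal, so d_max - d_min is zero")
    return [(d - d_min) / (d_max - d_min) for d in values]


def max_normalize(values):
    """Equation (5): d / d_max, for values that are zero or positive."""
    if any(d < 0 for d in values):
        raise ValueError("equation (5) assumes non-negative values")
    d_max = max(values)
    if d_max == 0:
        raise ValueError("d_max is zero")
    return [d / d_max for d in values]


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
print(max_normalize([0, 5, 10]))     # [0.0, 0.5, 1.0]
```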

3. Suppose we know that the value of our index is always zero or positive, but we do not know its maximum value. If the number of indices is fixed at $n$, then we can use the total of the indices in place of the maximum value, which gives

$$\delta_i = \frac{d_i}{\sum_{j=1}^{n} d_j} \qquad (6)$$

The normalized value from (6) is never larger than that from (5) because $\sum_{j=1}^{n} d_j \ge d_{\max}$. Special care must be taken to avoid division by zero when all indices are zero.
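A minimal sketch of normalizing by the total, assuming the indices are held in a plain Python list; the name `sum_normalize` is hypothetical.

```python
def sum_normalize(values):
    """Divide each non-negative index by the total of all indices."""
    if any(d < 0 for d in values):
        raise ValueError("this method assumes values that are zero or positive")
    total = sum(values)
    if total == 0:
        raise ValueError("all indices are zero, so the total is zero")
    return [d / total for d in values]


shares = sum_normalize([1, 3, 4])
print(shares)       # [0.125, 0.375, 0.5]
print(sum(shares))  # the normalized values add up to 1.0
```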

4. If our index can take negative values, we can normalize each index by taking its absolute value or its square relative to the total:

$$\delta_i = \frac{|d_i|}{\sum_{j=1}^{n} |d_j|} \qquad (7)$$

or

$$\delta_i = \frac{d_i^2}{\sum_{j=1}^{n} d_j^2} \qquad (8)$$
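The two variants of method 4 might look as follows in Python. This is only a sketch: the function names are hypothetical, and the forms used are the relative absolute value and the relative square described in the text.

```python
def abs_normalize(values):
    """Relative absolute value: |d_i| / sum of |d_j|."""
    total = sum(abs(d) for d in values)
    if total == 0:
        raise ValueError("all indices are zero")
    return [abs(d) / total for d in values]


def square_normalize(values):
    """Relative square: d_i^2 / sum of d_j^2."""
    total = sum(d * d for d in values)
    if total == 0:
        raise ValueError("all indices are zero")
    return [d * d / total for d in values]


data = [-1, 3, 4]
print(abs_normalize(data))     # [0.125, 0.375, 0.5]
print(square_normalize(data))  # squares 1, 9, 16 divided by 26
```

Note that both variants discard the sign of the original index, and the squared form gives more weight to values of large magnitude.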

5. Bray Curtis normalization. If we have a pair of indices that are always zero or positive and cannot both be zero at the same time, we can normalize them using the absolute difference divided by the sum:

$$\delta = \frac{|d_1 - d_2|}{d_1 + d_2} \qquad (9)$$

Removing the absolute value sign gives $\delta$ the range [-1, 1]. If $d_1 = d_2$, then $\delta = 0$. If exactly one of the two indices is zero, then $\delta = 1$. For example, with $d_1 = 3$ and $d_2 = 1$, we get $\delta = 2/4 = 0.5$.

See also: Canberra distance, Bray Curtis distance
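A short sketch of the Bray Curtis style normalization for a single pair of non-negative indices. The function name and the choice to raise an error when both inputs are zero are assumptions for this example.

```python
def bray_curtis_normalize(d1: float, d2: float) -> float:
    """Absolute difference divided by the sum, for two non-negative indices."""
    if d1 < 0 or d2 < 0:
        raise ValueError("both indices must be zero or positive")
    if d1 + d2 == 0:
        raise ValueError("both indices are zero, so the sum is zero")
    return abs(d1 - d2) / (d1 + d2)


print(bray_curtis_normalize(3, 1))  # 0.5
print(bray_curtis_normalize(2, 2))  # 0.0, equal indices
print(bray_curtis_normalize(0, 5))  # 1.0, one index is zero
```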

6. To normalize the ordinal value of a comparison index, perform the following steps:

  1. Convert the ordinal value into a rank ($r = 1$ to $R$).
  2. Normalize the rank into a standardized value from zero to one, [0, 1], by

$$\delta = \frac{r - 1}{R - 1} \qquad (10)$$

For example, with $R = 5$ ranks, the ranks 1, 2, 3, 4, 5 become 0, 0.25, 0.5, 0.75, 1.

See also: Distance for ordinal variables
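The two steps for ordinal values can be sketched as below, assuming the rank is mapped by (r - 1) / (R - 1). The function name and the category labels are hypothetical and chosen only for illustration.

```python
def normalize_ordinal(observations, ordered_categories):
    """Convert ordinal labels to ranks 1..R, then scale ranks to [0, 1]."""
    R = len(ordered_categories)
    if R < 2:
        raise ValueError("at least two categories are needed, otherwise R - 1 is zero")
    # Step 1: rank of each category, starting from 1 for the lowest.
    rank = {category: i + 1 for i, category in enumerate(ordered_categories)}
    # Step 2: (r - 1) / (R - 1) maps rank 1 to 0 and rank R to 1.
    return [(rank[obs] - 1) / (R - 1) for obs in observations]


scale = ["low", "medium", "high"]
print(normalize_ordinal(["low", "high", "medium"], scale))  # [0.0, 1.0, 0.5]
```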

7. We know from mathematics that for any positive values, the arithmetic mean is always larger than or equal to the geometric mean. We can use this knowledge to normalize our index. Provided that $d_i > 0$ for all $i$, we have

$$0 < \delta = \frac{\sqrt[n]{\prod_{i=1}^{n} d_i}}{\frac{1}{n}\sum_{i=1}^{n} d_i} \le 1 \qquad (11)$$

For example, with $d_1 = 1$ and $d_2 = 4$, we get $\delta = 2 / 2.5 = 0.8$.

See also: Mean and Average
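A minimal sketch of the geometric-to-arithmetic mean ratio, assuming that is the normalized index intended by method 7. The function name is hypothetical, and the geometric mean is computed through logarithms to avoid overflow for long lists.

```python
import math


def gm_am_ratio(values):
    """Geometric mean divided by arithmetic mean, for strictly positive values."""
    if not values or any(d <= 0 for d in values):
        raise ValueError("all values must be strictly positive")
    n = len(values)
    geometric_mean = math.exp(sum(math.log(d) for d in values) / n)
    arithmetic_mean = sum(values) / n
    return geometric_mean / arithmetic_mean


print(round(gm_am_ratio([1, 4]), 6))     # 0.8
print(round(gm_am_ratio([3, 3, 3]), 6))  # 1.0 when all values are equal
```

The ratio equals 1 only when all values are equal and moves toward 0 as the values become more spread out.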

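8. Another inequality from mathematics states that the absolute value of the arithmetic mean is smaller than or equal to the quadratic mean. We can use this knowledge to normalize our index for any real values $d_i$ that are not all zero:

$$0 \le \delta = \frac{\left|\frac{1}{n}\sum_{i=1}^{n} d_i\right|}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^2}} \le 1 \qquad (12)$$

For example, with $d_1 = -1$ and $d_2 = 7$, we get $\delta = 3 / 5 = 0.6$.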
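The inequality in method 8 suggests a ratio of the absolute arithmetic mean to the quadratic mean (root mean square). The sketch below assumes that ratio is the intended index; the function name is hypothetical.

```python
import math


def am_qm_ratio(values):
    """|arithmetic mean| divided by quadratic mean, for any real values."""
    n = len(values)
    if n == 0:
        raise ValueError("no values given")
    quadratic_mean = math.sqrt(sum(d * d for d in values) / n)
    if quadratic_mean == 0:
        raise ValueError("all values are zero")
    return abs(sum(values) / n) / quadratic_mean


print(round(am_qm_ratio([-1, 7]), 6))  # 0.6
print(round(am_qm_ratio([-3, 3]), 6))  # 0.0, the mean is zero
print(round(am_qm_ratio([2, 2]), 6))   # 1.0, all values equal
```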

Normalizing negative data

Most of the normalizations above work well if your data is positive or zero. What if your data contains some negative numbers? For example, you have the data -1, 3 and 4. The sum is 6. If you normalize by the sum you will get -1/6, 1/2, and 2/3. The sum of the three is still one, but now you have a negative number (-1/6) as part of your index. How do we solve this problem?

The solution is simple: shift your data by adding the absolute value of the most negative number (the minimum of your data) to every number, so that the most negative one becomes zero and all other numbers become positive. Then you can normalize your data as usual with any of the procedures above.

For example:

Your data is -1, 3 and 4. The most negative number is -1, so you add +1 to every number to get 0, 4, 5, and then normalize by the sum (9) to get 0, 4/9 and 5/9.
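The shift-then-normalize procedure can be sketched as follows, using the data -1, 3, 4 from the example. The function name is hypothetical, and exact fractions are used only so the output matches the hand calculation.

```python
from fractions import Fraction


def shift_and_sum_normalize(values):
    """Shift data so the minimum becomes zero, then divide by the total."""
    lowest = min(values)
    shift = -lowest if lowest < 0 else 0  # absolute value of the most negative number
    shifted = [Fraction(v) + shift for v in values]
    total = sum(shifted)
    if total == 0:
        raise ValueError("all shifted values are zero")
    return [v / total for v in shifted]


result = shift_and_sum_normalize([-1, 3, 4])
print([str(x) for x in result])  # ['0', '4/9', '5/9']
```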

Statistical Normalization

Finally, I would like to add a note about another type of normalization, called statistical normalization. The purpose of statistical normalization is to convert data drawn from any Normal distribution into a Normal distribution with mean zero and variance 1.

The formula of statistical normalization is

Z = (X - u) / s

You take your data as a vector X, subtract the mean of the data, u, and divide the difference by the standard deviation, s. You get another vector Z with zero mean and unit variance; if X is Normally distributed, Z follows the Standard Normal distribution, N(0,1). However, the range of the standard Normal distribution is not [0, 1]. It is about -3 to +3 (in theory it runs from negative infinity to positive infinity, but -3 to +3 already captures about 99.7% of the data).

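For example:

You have the data

$$X = (2, 5, 3, 1)$$

The mean of the data is (2+5+3+1)/4 = 11/4 = 2.75, and the sample standard deviation (dividing by $n - 1$) is $s = \sqrt{8.75/3} \approx 1.708$.

The Z values are

$$Z \approx (-0.439,\ 1.317,\ 0.146,\ -1.025)$$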
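A minimal sketch of statistical normalization using Python's standard `statistics` module, applied to the four values 2, 5, 3, 1 that appear in the mean computation above. It assumes the sample standard deviation (dividing by n - 1); `statistics.pstdev` would give the population version instead.

```python
import statistics


def z_scores(data):
    """Subtract the mean and divide by the sample standard deviation."""
    u = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation, divides by n - 1
    if s == 0:
        raise ValueError("standard deviation is zero")
    return [(x - u) / s for x in data]


X = [2, 5, 3, 1]
Z = z_scores(X)
print([round(z, 3) for z in Z])
print(round(statistics.mean(Z), 6), round(statistics.stdev(Z), 6))
```

The second line of output checks that the transformed values have mean zero and unit standard deviation, up to rounding.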

This tutorial is copyrighted.

The preferred reference for this tutorial is

Teknomo, Kardi (2015) Similarity Measurement. http://people.revoledu.com/kardi/tutorial/Similarity/