Distance for Nominal/Categorical Variable
Gender is a nominal variable with value of
One main characteristic of a nominal or categorical variable is consistent labeling. Order is not important in a nominal variable as long as the labeling is consistent. As an extreme example, we can change the label of the fruit I like above into
Gender labeling can be changed as
As long as you consistently remember your own labels and use them according to your own label definition, they are valid labels for a nominal variable. Of course, you will label them as simply as possible rather than using the extreme labeling.
To calculate the distance between two objects represented by nominal variables, we need to consider the number of categories in each variable. If the number of categories is only two, we can use a distance for binary variables such as simple matching, Jaccard's or Hamming distance. If the number of categories is more than two, we need to transform these categories into a set of dummy variables that have binary values. There are two methods to transform a categorical nominal variable (with more than two categories) into dummy variables:
Method 1: Assign each value of category as a binary dummy variable
Method 2: Assign each value of category into several binary dummy variables
The two methods produce different distances. In both methods, we should avoid giving preference to variables that have a higher number of categories. For that reason, the distance is computed based on the original variables: the dummy variables that represent the values of one original variable must be calculated first, before the result is combined with the other variables. The distance between two objects is the ratio of the number of unmatched dummy variables to the total number of dummy variables. If
Method 1: Assign each value of category as a binary dummy variable
We assign each value of Mode as a binary dummy variable. The distance between two objects is the ratio of the number of unmatched dummy variables to the total number of dummy variables.
For example, we have two variables: Gender and Mode. Gender has two values: 0 = male and 1 = female. Mode has three choices of public transport mode to go to school: Bus, Train and Van. Suppose we have three subjects: Alex (Male) uses Bus, Brian (Male) uses Van and Cherry (Female) uses Bus.
We assign each value of Mode as a binary dummy variable. Let us set the first coordinate as Gender and the second coordinate as Mode (Bus, Train, Van). We have
To compute the distance between objects, we need to calculate it for each original variable.
Suppose we use Hamming distance (= the number of digits that differ).
The distance between two objects is the ratio of the number of unmatched dummy variables to the total number of dummy variables.
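The Method 1 example above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tutorial's own code: the helper names (`encode_method1`, `hamming`, `per_variable_distance`) are hypothetical, Gender is coded 0 = male and 1 = female, and the Mode dummy variables are ordered (Bus, Train, Van). The sketch reports one ratio per original variable and does not merge them into a single number.

```python
from fractions import Fraction

MODES = ("Bus", "Train", "Van")  # order of the Mode dummy variables

def encode_method1(gender, mode):
    """Return (gender_bits, mode_bits) with one dummy variable per Mode value."""
    return (gender,), tuple(int(mode == m) for m in MODES)

def hamming(a, b):
    """Count the positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def per_variable_distance(obj1, obj2):
    """Ratio of unmatched to total dummy variables, one ratio per original variable."""
    return [Fraction(hamming(a, b), len(a)) for a, b in zip(obj1, obj2)]

alex = encode_method1(0, "Bus")     # ((0,), (1, 0, 0))
brian = encode_method1(0, "Van")    # ((0,), (0, 0, 1))
cherry = encode_method1(1, "Bus")   # ((1,), (1, 0, 0))

for name, pair in [("Alex-Brian", (alex, brian)),
                   ("Alex-Cherry", (alex, cherry)),
                   ("Brian-Cherry", (brian, cherry))]:
    gender_d, mode_d = per_variable_distance(*pair)
    print(name, "Gender:", gender_d, "Mode:", mode_d)
```

Under these assumptions, Alex and Brian match on Gender and differ in 2 of the 3 Mode dummy variables, so their per-variable ratios are 0 and 2/3.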
Method 2: Assign each value of category into several binary dummy variables
If the number of categories is
The ceiling function rounds a number up to the nearest integer.
For instance, the modes of public transportation to school are Bus, Train and Van. We have 3 categories, and we need 2 dummy variables because
When dummy variable DV1 is 1 and DV2 is 1, it is a bus. When DV1 is 1 and DV2 is 0, it is a train, and when DV1 is 0 and DV2 is 1, it is a van. The assignment to dummy variables is somewhat arbitrary, but it must be consistent.
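The count of dummy variables and the assignment described above can be sketched as follows. The exact formula is not shown in this section; the sketch assumes it is the ceiling of the base-2 logarithm of the number of categories, which agrees with 3 categories needing 2 dummy variables. The names `dummy_count` and `MODE_CODES` are hypothetical.

```python
import math

def dummy_count(num_categories):
    """Assumed rule: ceil(log2(c)) binary dummy variables for c categories."""
    return math.ceil(math.log2(num_categories))

# (DV1, DV2) assignment taken from the text; arbitrary, but it must stay consistent.
MODE_CODES = {"Bus": (1, 1), "Train": (1, 0), "Van": (0, 1)}

print(dummy_count(len(MODE_CODES)))  # 2
```

With two binary dummy variables there are four possible codes, so one code, here (0, 0), is left unused for three categories.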
For example, we have two variables: Gender and Mode. Gender has two values: 0 = male and 1 = female. Mode has three choices of public transport mode to go to school: Bus, Train and Van. Suppose we have three subjects: Alex (Male) uses Bus, Brian (Male) uses Van and Cherry (Female) uses Bus.
We assign each value of Mode into two binary dummy variables. Let us set the first coordinate as Gender and the second coordinate as Mode (DV1, DV2). We have
To compute the distance between objects, we need to calculate it for each original variable.
Suppose we use Hamming distance (= the number of digits that differ).
The distance between two objects is the ratio of the number of unmatched dummy variables to the total number of dummy variables.
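The Method 2 example can be sketched in the same way. Again this is only an illustration under stated assumptions: the helper names (`encode_method2`, `per_variable_distance`) are hypothetical, Gender is coded 0 = male and 1 = female, and the (DV1, DV2) codes follow the assignment given above (Bus = 1,1; Train = 1,0; Van = 0,1). Only the per-variable ratios are computed.

```python
from fractions import Fraction

# (DV1, DV2) codes for Mode, as assigned in the text.
MODE_CODES = {"Bus": (1, 1), "Train": (1, 0), "Van": (0, 1)}

def encode_method2(gender, mode):
    """Return (gender_bits, mode_bits) with Mode packed into two dummy variables."""
    return (gender,), MODE_CODES[mode]

def per_variable_distance(obj1, obj2):
    """Hamming distance divided by the number of dummy variables, per original variable."""
    return [Fraction(sum(x != y for x, y in zip(a, b)), len(a))
            for a, b in zip(obj1, obj2)]

alex = encode_method2(0, "Bus")     # ((0,), (1, 1))
brian = encode_method2(0, "Van")    # ((0,), (0, 1))
cherry = encode_method2(1, "Bus")   # ((1,), (1, 1))

for name, pair in [("Alex-Brian", (alex, brian)),
                   ("Alex-Cherry", (alex, cherry)),
                   ("Brian-Cherry", (brian, cherry))]:
    gender_d, mode_d = per_variable_distance(*pair)
    print(name, "Gender:", gender_d, "Mode:", mode_d)
```

Under these assumptions, the Mode ratio for Alex and Brian is 1/2, whereas one dummy variable per value (Method 1) gives 2 unmatched out of 3. This illustrates why the two methods produce different distances.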
The preferred reference for this tutorial is Teknomo, Kardi. Similarity Measurement. http:\\people.revoledu.com\kardi\tutorial\Similarity\
© 2006 Kardi Teknomo. All Rights Reserved. Designed by CNV Media |