Question updated!!

I have 15 columns of categorical variables and want the correlation among them. The data set is 20,000+ rows long and looks like this:

state | job | hair_color | car_color | marital_status
ny    | cs  | brown      | blue      | s
fl    | mt  | black      | blue      | d
ny    | md  | blond      | white     | m
ny    | cs  | brown      | red       | s

Notice that in the first row and the last row, ny, cs, and s repeat. I want to find out these kinds of patterns, e.g. whether ny and cs are highly correlated. I need to rank the combinations of values in the columns. I hope the question makes sense. Please notice that I am not counting ny or cs on their own; I am finding out how many times ny and blond appear in the same row. I need this for all values, by row. I hope that makes sense.

I tried to use cor() in R, but since these are categorical variables the function doesn't work. How can I work with this data set to find the correlation among them?

You may wish to refer to ways to calculate similarity. Suppose your data is

d <- structure(list(
  state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("fl", "ny"), class = "factor"),
  job = structure(c(2L, 1L, 4L, 3L, 2L), .Label = c("bs", "cs", "md", "mt"), class = "factor"),
  hair_color = structure(c(3L, 3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"),
  car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue", "red", "white"), class = "factor"),
  marital_status = structure(c(3L, 1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")),
  .Names = c("state", "job", "hair_color", "car_color", "marital_status"),
  class = "data.frame", row.names = c(NA, -5L))

Data:

> d
  state job hair_color car_color marital_status
1    ny  cs      brown      blue              s
2    fl  bs      brown       red              d
3    fl  mt      black      blue              d
4    ny  md      blond     white              m
5    ny  cs      brown       red              s
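If the goal is the one described in the question, counting how often two values appear in the same row, a contingency table is one possible starting point. The sketch below is only an illustration under assumptions: it rebuilds the same toy data so it runs on its own, it looks at just the state and job columns, and the names `counts` and `ranked` are made up for this example.

```r
# Same toy data as above, rebuilt with data.frame() so the block is self-contained.
d <- data.frame(
  state          = c("ny", "fl", "fl", "ny", "ny"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = TRUE
)

# Cross-tabulate two columns: each cell counts the rows in which
# that pair of values occurs together.
tab <- table(d$state, d$job)

# Flatten to one row per (state, job) pair, drop pairs that never
# occur, and sort so the most frequent combination comes first.
counts <- as.data.frame(tab, stringsAsFactors = FALSE)
names(counts) <- c("state", "job", "n")
ranked <- counts[counts$n > 0, ]
ranked <- ranked[order(-ranked$n), ]
print(ranked)
```

For 15 columns the same tabulation could presumably be repeated over every pair of column names, for example the pairs returned by combn(names(d), 2).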

We can calculate "dissimilarities" between observations:

library(cluster)
# All columns are factors, so daisy() ignores `metric` and uses Gower's coefficient
daisy(d, metric = "euclidean")

Output:

> daisy(d, metric = "euclidean")
Dissimilarities :
    1   2   3   4
2 0.8
3 0.8 0.6
4 0.8 1.0 1.0
5 0.2 0.6 1.0 0.8

Metric :  mixed ;  Types = N, N, N, N, N
Number of objects : 5

Which tells us that observations 1 and 5 are the least dissimilar. With many observations it is impossible to visually inspect the dissimilarity matrix, but we can filter out the pairs that fall below a threshold, e.g.

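out <- daisy(d, metric = "euclidean")
# All (row, column) index pairs; Var1 varies fastest
pairs <- expand.grid(2:5, 1:4)
# Keep only the lower triangle (Var1 > Var2) so the rows line up
# one-to-one with the 10 values stored in `out`
pairs <- pairs[pairs[, 1] > pairs[, 2], ]
similars <- pairs[which(out < .8), ]

Given a threshold of 0.8,

> similars
  Var1 Var2
4    5    1
6    3    2
8    5    2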
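The positional filtering above depends on the rows of expand.grid() lining up with the order in which daisy() stores its lower triangle. A minimal sketch of an alternative, assuming the same toy data: convert the result to a full matrix and let which(arr.ind = TRUE) report the row and column of each hit. It also checks by hand that, for purely nominal columns, the dissimilarity is the share of columns in which two rows differ. The names `m`, `idx` and `manual_1_5` are illustrative and do not come from the original answer.

```r
library(cluster)  # provides daisy()

# Same toy data as above, rebuilt so the block runs on its own.
d <- data.frame(
  state          = c("ny", "fl", "fl", "ny", "ny"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = TRUE
)

# With factor columns daisy() uses Gower's coefficient.
out <- daisy(d, metric = "gower")

# Manual check for rows 1 and 5: fraction of columns whose values differ.
row1 <- sapply(d[1, ], as.character)
row5 <- sapply(d[5, ], as.character)
manual_1_5 <- mean(row1 != row5)

# Full symmetric matrix; keep only the lower triangle so each pair
# is reported once, then look up the positions below the threshold.
m <- as.matrix(out)
idx <- which(m < 0.8 & lower.tri(m), arr.ind = TRUE)

similars <- data.frame(
  row = idx[, "row"],
  col = idx[, "col"],
  dissimilarity = m[idx]
)
# Most similar pairs first.
similars <- similars[order(similars$dissimilarity), ]
print(similars)
```

Because the row and column indices come straight from the matrix, no assumption about storage order is needed. With 20,000+ rows the full matrix would be very large, so this form is probably only practical on a sample.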


R - Correlation for multiple categorical variables
