Question updated!
I have 15 columns of categorical variables and want to find the correlation among them. The data set is 20,000+ rows long and looks like this:
state | job | hair_color | car_color | marital_status
ny | cs | brown | blue | s
fl | mt | black | blue | d
ny | md | blond | white | m
ny | cs | brown | red | s
Notice that in the 1st row and the last row, ny, cs, and s repeat. I want to find that kind of pattern: here ny and cs are highly correlated. I need to rank the combinations of values across columns. I hope the question makes sense. Please notice that this is not counting ny or cs on their own. It is finding out how many times, for example, ny and blond appear in the same row. I need to do this for all the values, row by row.
I tried to use cor() in R, but since these are categorical variables the function doesn't work. How can I work with this data set to find the correlation among them?
You may wish to refer to ways of calculating similarity. Suppose your data is
    # Five-row toy data set; every column is a factor
    d <- structure(list(
      state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("fl", "ny"), class = "factor"),
      job = structure(c(2L, 1L, 4L, 3L, 2L), .Label = c("bs", "cs", "md", "mt"), class = "factor"),
      hair_color = structure(c(3L, 3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"),
      car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue", "red", "white"), class = "factor"),
      marital_status = structure(c(3L, 1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")),
      .Names = c("state", "job", "hair_color", "car_color", "marital_status"),
      class = "data.frame", row.names = c(NA, -5L))
Data:
    > d
      state job hair_color car_color marital_status
    1    ny  cs      brown      blue              s
    2    fl  bs      brown       red              d
    3    fl  mt      black      blue              d
    4    ny  md      blond     white              m
    5    ny  cs      brown       red              s
We can calculate "dissimilarities" between observations:
    library(cluster)
    daisy(d, metric = "euclidean")
Output:
    > daisy(d, metric = "euclidean")
    Dissimilarities :
        1   2   3   4
    2 0.8
    3 0.8 0.6
    4 0.8 1.0 1.0
    5 0.2 0.6 1.0 0.8

    Metric :  mixed ;  Types = N, N, N, N, N
    Number of objects : 5
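This tells us that observations 1 and 5 are the least dissimilar. Because every column is a factor, daisy() does not actually use the Euclidean metric here; it switches to Gower's coefficient (reported as "mixed"), so each value is the fraction of columns in which two rows differ. With many observations it is impossible to inspect the dissimilarity matrix visually, but we can filter out the pairs that fall below a certain threshold, e.g.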
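    out <- daisy(d, metric = "euclidean")
    # All (i, j) pairs with i > j, in the same column-wise order in which
    # the lower triangle of the dissimilarity object is stored
    pairs <- expand.grid(2:5, 1:4)
    pairs <- pairs[pairs[, 1] > pairs[, 2], ]
    similars <- pairs[which(out < .8), ]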
Given a threshold of 0.8,
    > similars
      Var1 Var2
    4    5    1
    6    3    2
    8    5    2
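The dissimilarities above compare whole rows. The question also asks how often particular values (such as ny and cs) appear together in the same row, so here is a minimal sketch of one way to count and rank those co-occurrences with base R's table(). The helper name count_value_pairs and its output column names are made up for illustration, and the toy data is rebuilt with data.frame() only to keep the example self-contained.

```r
# Toy data from above, rebuilt with data.frame() for brevity.
d <- data.frame(
  state          = c("ny", "fl", "fl", "ny", "ny"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = TRUE
)

# Hypothetical helper (not from any package): for every pair of columns,
# cross-tabulate the values and keep the combinations that occur at least
# once, then rank all combinations by how often they share a row.
count_value_pairs <- function(df) {
  col_pairs <- combn(names(df), 2, simplify = FALSE)
  counts <- lapply(col_pairs, function(cp) {
    tab <- as.data.frame(table(df[[cp[1]]], df[[cp[2]]]),
                         stringsAsFactors = FALSE)
    names(tab) <- c("value1", "value2", "n")
    tab <- tab[tab$n > 0, ]
    data.frame(col1 = rep(cp[1], nrow(tab)),
               col2 = rep(cp[2], nrow(tab)),
               tab, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, counts)
  out <- out[order(-out$n), ]   # most frequent combinations first
  rownames(out) <- NULL
  out
}

ranked <- count_value_pairs(d)
head(ranked)
```

On the toy data, the ny/cs combination gets n = 2 (rows 1 and 5), while ny/blond gets n = 1. With 15 columns this builds 105 cross-tabulations, which should be manageable for 20,000 rows, though columns with very many distinct values will produce larger tables.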