i have rather large collection of n d-dimensional vectors integer coordinates (d 50), except in cases coordinates special sentinel "don't care" value, i'll denote *. i'm trying find efficient algorithm merging vectors compare equal 1 another, "equal" means "each pair of coordinates in vectors match, assuming * entries can match anything." example, given these row vectors:
[* 1 2 * *] [1 1 * 2 *] [2 1 3 * 1] [2 * 3 * *] [1 * 3 4 *]
we'd break them these clusters:
[* 1 2 * *], [1 1 * 2 *] [2 1 3 * *], [2 * 3 * *] [1 * 3 4 *]
the requirement clusters need have vectors in them pairwise equivalent. it's possible there many ways cluster things meeting these criteria, loose definition of "not many" there should "not many" clusters.
it's possible comparing vectors pairwise , pairing off ones match, repeating until vectors clustered. option (the 1 we're using) start first vector in own cluster, each vector in order check if matches existing clusters or whether needs go own. these approaches either quadratic or run in time proportional product of number of vectors times number of clusters, our application isn't fast enough.
are there efficient algorithms solving particular problem?
so looks problem np-hard - , not np-hard, it's 1 of np-hard problems that's going hard approximate answer to.
to show this, here's quick reduction chromatic number problem (given graph, what's minimum number of colors needed color it?) problem. i'll illustrate example. imagine have graph:
-- b -- c | | | d -- e -- f
i'm going start graph , form set of vectors "don't cares" each cluster corresponds set of nodes no edges running between them (an independent set). overall, each cluster represent set of nodes can given same color, finding minimal clustering corresponds finding chromatic number of graph.
since there 6 nodes in graph, each vector have 6 components. we'll form 1 vector per node in graph. example, vector node be
[ 0 1 * 1 * * ]
the columns of vector, in order, refer nodes a, b, c, d, e, , f. 0 in column means "this vector a." 1's in columns b , d mean "there edges node nodes b , d." *'s in other columns mean "there aren't edges running between node , other nodes." if form set of vectors nodes in graph, we'll this:
b c d e f a: [ 0 1 * 1 * * ] b: [ 1 0 1 * 1 * ] c: [ * 1 0 * * 1 ] d: [ 1 * * 0 1 * ] e: [ * 1 * 1 0 1 ] f: [ * * 1 * 1 0 ]
now, imagine trying cluster these vectors. notice if take 2 adjacent nodes (say, b , e), vectors don't match, because b vector has 0 in b column , e vector has 1 in b column. however, if take nodes aren't adjacent, match, since
- the columns nodes have 0's in 1 vector , *'s in other, since each vector tagged node , there's no edge running between them, and
- each other column in each vector either 1 or *.
overall, means cluster of vectors corresponds collection of nodes have no edges running between them, finding minimal clustering gives coloring of graph. in case, notice original graph can 2-colored making a, c, , e same color , b, d, , f color. if @ vectors, can clustered a, c, , e , b, d, , f:
b c d e f a: [ 0 1 * 1 * * ] c: [ * 1 0 * * 1 ] e: [ * 1 * 1 0 1 ] b: [ 1 0 1 * 1 * ] d: [ 1 * * 0 1 * ] f: [ * * 1 * 1 0 ]
more generally, given graph n nodes, form n vectors of form, each vector corresponds node, has 0 in column node, 1 in column each adjacent node, , * in other columns. minimum number of clusters corresponds chromatic number of original graph, , reduction can done in polynomial time.
the problem problem of finding minimum chromatic number of graph np-hard , no approximation algorithm known gets close constant factor approximation. result, there's no approximation algorithm original problem described unless p = np.
it's possible special case of problem need solve happens easier approximate, might post follow-up question more specifics of i'm trying do.
Comments
Post a Comment