python - How to cluster data that has been split into a binary vector based on their path with scikit-learn?


I have a large list of URLs, and I want to group them based on similarity for later use. I have already grouped them based on domain. I tried looking at the paths and files. However, the method I used took too long because it iterated through every character in each word.

I have split every URL's path into the words that make it up, and put them into a data frame. The data frame's columns are every unique word that appears in a URL's path, and every row is filled with 0s and 1s, based on whether the word appears in that URL or not. I thought a way to group them would be to pass this data frame to scikit-learn so I can use KMeans.
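For reference, the encoding described above could be sketched roughly like this. It is a minimal illustration and not my exact code: the helper name `path_words` and the three sample URLs are assumptions, and it uses plain lists instead of a data frame to stay self-contained.

```python
from urllib.parse import urlsplit


def path_words(url):
    """Return the non-empty path segments of a URL (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or '//' is present,
    # so prepend '//' for bare URLs such as 'reddit.com/r/python'.
    if "//" not in url:
        url = "//" + url
    return [part for part in urlsplit(url).path.split("/") if part]


# Sample input, assumed for illustration only.
urls = [
    "reddit.com/r/python",
    "reddit.com/r/eve",
    "reddit.com/r/eve/submit",
]

words_per_url = [path_words(u) for u in urls]

# One column per unique word, in order of first appearance.
vocabulary = []
for words in words_per_url:
    for word in words:
        if word not in vocabulary:
            vocabulary.append(word)

# One row per URL: 1 if the word occurs in that URL's path, else 0.
matrix = [
    [1 if word in set(words) else 0 for word in vocabulary]
    for words in words_per_url
]

for url, row in zip(urls, matrix):
    print(url, row)
```

Each URL is only split on `/`, so the cost grows with the number of path segments rather than with the number of characters compared between URLs.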

Edit:

| URL                     | r | python | eve | submit |
|-------------------------|---|--------|-----|--------|
| reddit.com/r/python     | 1 | 1      | 0   | 0      |
| reddit.com/r/eve        | 1 | 0      | 1   | 0      |
| reddit.com/r/eve/submit | 1 | 0      | 1   | 1      |
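What I have in mind is something along these lines: a rough sketch that feeds the three example rows to scikit-learn's `KMeans`. The choice of `n_clusters=2` and the fixed `random_state` are assumptions for this toy data; I do not know the right number of clusters for the real list.

```python
import numpy as np
from sklearn.cluster import KMeans

urls = [
    "reddit.com/r/python",
    "reddit.com/r/eve",
    "reddit.com/r/eve/submit",
]

# Binary word-presence rows, columns: r, python, eve, submit.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
])

# n_clusters=2 is an assumed value chosen for this tiny example.
# n_init=10 reruns the algorithm from several starting points and keeps
# the run with the lowest inertia; random_state makes it repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for url, label in zip(urls, labels):
    print(label, url)
```

KMeans measures Euclidean distance between rows, so on 0/1 data two URLs are close when they differ in few words. Here the two `/r/eve` URLs differ in one column, while `/r/python` differs from each of them in at least two.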

