python - How to combine n-grams into one vocabulary in Spark?


I am wondering if there is a built-in Spark feature to combine 1-gram, 2-gram, ..., n-gram features into a single vocabulary. Setting n=2 in NGram followed by an invocation of CountVectorizer results in a dictionary containing only 2-grams. What I want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

You can train separate NGram and CountVectorizer models and merge them using VectorAssembler.

from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline


def build_ngrams(inputCol="tokens", n=3):

    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
            outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)
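To see why this yields a combined vocabulary, note that VectorAssembler simply concatenates the per-n count vectors end to end, so the resulting feature space is the per-n vocabularies laid side by side. A plain-Python sketch of that idea (no Spark needed; the vocabularies and counts below are hypothetical, not produced by Spark):

```python
# Hypothetical per-n vocabularies and count vectors for one document.
unigram_vocab = ["a", "b", "c", "d", "e"]
bigram_vocab = ["a b", "b c", "c d", "d e", "e d"]

unigram_counts = [1, 1, 1, 1, 0]
bigram_counts = [1, 1, 1, 0, 0]

# VectorAssembler-style concatenation: one combined vocabulary,
# one combined feature vector, indices aligned.
combined_vocab = unigram_vocab + bigram_vocab
combined_counts = unigram_counts + bigram_counts
```

Each fitted CountVectorizer keeps its own vocabulary, so index i of the assembled vector still maps back to exactly one n-gram.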

example usage:

df = spark.createDataFrame([
    (1, ["a", "b", "c", "d"]),
    (2, ["d", "e", "d"])
], ("id", "tokens"))

build_ngrams().fit(df).transform(df)
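For intuition about what ends up in the merged dictionary, here is a plain-Python sketch that counts 1-grams and 2-grams over the same toy corpus in one Counter (the `ngrams` helper is mine, not part of Spark; like Spark's NGram, it joins grams with spaces):

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, joined with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [["a", "b", "c", "d"], ["d", "e", "d"]]

counts = Counter()
for tokens in corpus:
    for n in range(1, 3):  # 1-grams and 2-grams
        counts.update(ngrams(tokens, n))
```

After this, `counts` holds unigrams and bigrams together, which is the single-dictionary view the question asks for; the Spark pipeline above produces the same information as one assembled feature vector per document.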
