I'm wondering if there is a built-in Spark feature to combine 1-, 2-, ..., n-gram features into a single vocabulary. Setting n=2 in NGram followed by an invocation of CountVectorizer results in a dictionary containing only 2-grams. What I want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.
You can train separate NGram and CountVectorizer models and merge them using VectorAssembler.
from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline

def build_ngrams(inputCol="tokens", n=3):
    # One NGram stage per order (1-grams, 2-grams, ..., n-grams)
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    # One CountVectorizer per n-gram column
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
                        outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    # Concatenate all count vectors into a single feature vector
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)

Example usage:
df = spark.createDataFrame([
    (1, ["a", "b", "c", "d"]),
    (2, ["d", "e", "d"])
], ("id", "tokens"))

build_ngrams().fit(df).transform(df)
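If you also want the merged vocabulary itself (not just the assembled feature vector), you can read it off the fitted CountVectorizerModel stages of the pipeline. A minimal sketch, assuming the stage order produced by build_ngrams above; the combined_vocab name is purely illustrative:

from pyspark.ml.feature import CountVectorizerModel

model = build_ngrams(n=2).fit(df)

# Collect the vocabularies of all fitted CountVectorizerModel stages;
# their concatenation matches the layout of the assembled "features" vector.
combined_vocab = [
    term
    for stage in model.stages
    if isinstance(stage, CountVectorizerModel)
    for term in stage.vocabulary
]
print(combined_vocab)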