I am wondering if there is a built-in Spark feature to combine 1-, 2-, ..., n-gram features into a single vocabulary. Setting n=2 in NGram followed by an invocation of CountVectorizer results in a dictionary containing only 2-grams. What I want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.
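For concreteness, this is roughly the setup I mean (a minimal sketch; the column names are just placeholders):

from pyspark.ml.feature import NGram, CountVectorizer

# n=2 produces only bigrams, so the fitted vocabulary contains no unigrams
bigrams = NGram(n=2, inputCol="tokens", outputCol="bigrams")
cv = CountVectorizer(inputCol="bigrams", outputCol="features")
# cv.fit(bigrams.transform(df)).vocabulary -> 2-grams only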
You can train separate NGram and CountVectorizer models and merge them using VectorAssembler.
from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline

def build_ngrams(inputCol="tokens", n=3):
    # One NGram stage per order: 1-grams, 2-grams, ..., n-grams
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    # One CountVectorizer per n-gram column, each with its own vocabulary
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
                        outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    # Concatenate the per-order count vectors into a single features column
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)
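Note that this does not build one merged vocabulary: each CountVectorizer keeps its own per-order vocabulary, and VectorAssembler simply concatenates the resulting count vectors into a single feature vector.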
Example usage:

df = spark.createDataFrame([
    (1, ["a", "b", "c", "d"]),
    (2, ["d", "e", "d"])
], ("id", "tokens"))

build_ngrams().fit(df).transform(df)
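If you also want the combined dictionary itself rather than just the assembled vectors, one way (a sketch, not part of the original answer) is to read the vocabulary off each fitted CountVectorizerModel stage:

# Collect the per-order vocabularies from the fitted pipeline.
# Uses only PipelineModel.stages and CountVectorizerModel.vocabulary.
model = build_ngrams().fit(df)
combined_vocabulary = [
    term
    for stage in model.stages
    if hasattr(stage, "vocabulary")  # keep only the CountVectorizerModel stages
    for term in stage.vocabulary
]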