step_tokenfilter() creates a specification of a recipe step that will filter a tokenlist based on token frequency.
step_tokenfilter(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  max_times = Inf,
  min_times = 0,
  percentage = FALSE,
  max_tokens = 100,
  res = NULL,
  skip = FALSE,
  id = rand_id("tokenfilter")
)

# S3 method for step_tokenfilter
tidy(x, ...)
recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the recipe has been baked.
columns: A list of tibble results that define the encoding. This is NULL until the step is trained by prep().
max_times: An integer. Maximal number of times a word can appear before getting removed.
min_times: An integer. Minimum number of times a word can appear before getting removed.
percentage: A logical. Should max_times and min_times be interpreted as a percentage instead of a count?
max_tokens: An integer. Will only keep the top max_tokens tokens after filtering done by max_times and min_times. Defaults to 100.
res: The words that will be kept are stored here once this preprocessing step has been trained by prep().
skip: A logical. Should the step be skipped when the recipe is baked by bake()?
id: A character string that is unique to this step to identify it.
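To see how these arguments fit together, here is a minimal sketch of a count-based filter; the data frame my_data and its column text are placeholders, not part of this package:

library(recipes)
library(textrecipes)

# Hypothetical data: keep tokens that appear at least 5 times but at
# most 1000 times, then retain only the 500 most frequent survivors.
rec <- recipe(~ text, data = my_data) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, min_times = 5, max_times = 1000, max_tokens = 500)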
An updated version of recipe with the new step added to the sequence of existing steps (if any).
This step allows you to limit the tokens you are looking at by filtering on their occurrence in the corpus. You are able to exclude tokens if they appear too many times or too few times in the data. The limits can be specified as counts using max_times and min_times, or as percentages by setting percentage = TRUE. In addition, one can filter to only use the top max_tokens most-used tokens. If max_tokens is set to Inf then all the tokens will be used. This will generally lead to very large datasets when the tokens are words or trigrams. A good strategy is to start with a low token count and go up according to how much RAM you want to use.
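As a minimal sketch of what percentage-based filtering might look like, assuming that with percentage = TRUE the limits are given as proportions between 0 and 1, and again using a hypothetical data frame my_data with a character column text:

library(recipes)
library(textrecipes)

# Hypothetical data: drop tokens that account for more than 10% of all
# tokens (likely filler words) or less than 0.1%, and keep every token
# that passes the frequency filter.
rec_pct <- recipe(~ text, data = my_data) %>%
  step_tokenize(text) %>%
  step_tokenfilter(
    text,
    percentage = TRUE,
    max_times = 0.1,
    min_times = 0.001,
    max_tokens = Inf
  )

The recipe would then be estimated with prep() and applied with bake() or juice(), as in the examples below.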
step_tokenize() to turn a character vector into a tokenlist.
library(recipes)
library(modeldata)
data(okc_text)

okc_rec <- recipe(~ ., data = okc_text) %>%
  step_tokenize(essay0) %>%
  step_tokenfilter(essay0)

okc_obj <- okc_rec %>%
  prep()

juice(okc_obj, essay0) %>%
  slice(1:2)
#> # A tibble: 2 x 1
#>   essay0
#>   <tknlist>
#> 1 [83 tokens]
#> 2 [13 tokens]

juice(okc_obj) %>%
  slice(2) %>%
  pull(essay0)
#> <textrecipes_tokenlist>
#> [13 tokens]
#> # Unique Tokens: 7

tidy(okc_rec, number = 2)
#> # A tibble: 1 x 3
#>   terms  value id
#>   <chr>  <int> <chr>
#> 1 essay0    NA tokenfilter_TxTTv

tidy(okc_obj, number = 2)
#> # A tibble: 1 x 3
#>   terms  value  id
#>   <quos> <list> <chr>
#> 1 essay0 <int>  tokenfilter_TxTTv