Under the hood - tokenlistSource:
library(textrecipes) #> Loading required package: recipes #> Loading required package: dplyr #> #> Attaching package: 'dplyr' #> The following objects are masked from 'package:stats': #> #> filter, lag #> The following objects are masked from 'package:base': #> #> intersect, setdiff, setequal, union #> #> Attaching package: 'recipes' #> The following object is masked from 'package:stats': #> #> step
textrecipes have been using lists of character vectors to carry around the tokens. A simple S3 vector class has been implemented with the vctrs package to handle that list of tokens, henceforth to be known as a
If you are only using this package for preprocessing then you most likely won’t even notice that this change has happened. However if you are thinking of contributing to textrecipes then knowing about
tokenlists will be essential.
tokenlist is based around a simple list of character vectors, and has 3 attributes,
tokens attribute is a vector of the unique tokens contained in the data list. This attribute is calculated automatically when using
tokenlist(). If a function is applied to the tokenlist where the resulting unique tokens can be derived then
new_tokenlist() can be used to create a tokenlist with known
pos attribute are used in the same way. They default to
NULL but can be filled depending on which engine is being used in
step_tokenize(). The attribute is a list of characters in the exact shape and size as the tokenlist and should have a one-to-one relationship.
If a specific element is removed in the tokenlist then the corresponding element in
pos should be removed.