step_text_normalization() creates a specification of a recipe step that will perform Unicode normalization on character variables.

step_text_normalization(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
normalization_form = "nfc",
skip = FALSE,
id = rand_id("text_normalization")
)

# S3 method for step_text_normalization
tidy(x, ...)

## Arguments

recipe
A recipe object. The step will be added to the sequence of operations for this recipe.

...
One or more selector functions to choose which variables will be transformed. See recipes::selections() for more details. For the tidy method, these are not currently used.

role
Not used by this step since no new variables are created.

trained
A logical to indicate if the recipe has been baked.

columns
A list of tibble results that define the encoding. This is NULL until the step is trained by recipes::prep.recipe().

normalization_form
A single character string determining the Unicode Normalization. Must be one of "nfc", "nfd", "nfkd", "nfkc", or "nfkc_casefold". Defaults to "nfc". See stringi::stri_trans_nfc() for more details.

skip
A logical. Should the step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id
A character string that is unique to this step to identify it.

x
A step_text_normalization object.
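To make the choice of normalization_form more concrete, the following is a minimal sketch that calls the stringi transformation functions directly, outside of any recipe. It assumes that the step applies the stringi function matching the chosen form, which the reference to stringi::stri_trans_nfc() suggests but this page does not spell out. The variable names are illustrative only.

```r
library(stringi)

# Two spellings of the same word: a precomposed o-umlaut (U+00F6)
# versus a plain "o" followed by a combining diaeresis (U+0308).
composed   <- "sch\u00f6n"
decomposed <- "scho\u0308n"

# Without normalization the strings differ code point by code point.
same_before <- composed == decomposed

# NFC composes characters, so the decomposed spelling becomes
# identical to the precomposed one.
same_after_nfc <- stri_trans_nfc(decomposed) == composed

# NFD decomposes characters, so the precomposed spelling gains one
# extra code point for the combining mark.
n_codepoints_nfd <- stri_length(stri_trans_nfd(composed))

# The compatibility forms (NFKC, NFKD) additionally replace characters
# such as the "fi" ligature (U+FB01) with their plain equivalents.
ligature_nfkc <- stri_trans_nfkc("\ufb01")

# NFKC casefold also removes case distinctions.
folded <- stri_trans_nfkc_casefold("SCH\u00d6N")
```

Here same_before is FALSE and same_after_nfc is TRUE, which is the situation the step is meant to resolve: visually identical strings that would otherwise be treated as different values.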

## Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

## See also

step_texthash() for feature hashing.

## Examples

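if (requireNamespace("stringi", quietly = TRUE)) {
  library(recipes)
  library(textrecipes)

  # Two spellings of the same word: precomposed vs. combining diaeresis
  sample_data <- tibble(text = c("sch\U00f6n", "scho\U0308n"))

  # Specify the step; nothing is computed until prep() is called
  okc_rec <- recipe(~ ., data = sample_data) %>%
    step_text_normalization(text)

  okc_obj <- okc_rec %>%
    prep()

  # Inspect the normalized training data
  juice(okc_obj, text) %>%
    slice(1:2)

  juice(okc_obj) %>%
    slice(2) %>%
    pull(text)

  # tidy() works on both the untrained and the trained recipe
  tidy(okc_rec, number = 1)
  tidy(okc_obj, number = 1)
}
#> # A tibble: 1 x 3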
#>   terms      normalization_form id
#>   <quos>     <chr>              <chr>
#> 1 text       nfc                text_normalization_n1s9f
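
The example above uses the default form. As a further sketch, a non-default normalization_form can be requested in the same way. This assumes the recipes and textrecipes packages are installed and uses bake() with new_data = NULL to retrieve the processed training data. The object names nfd_rec and nfd_text are illustrative only.

```r
library(recipes)
library(textrecipes)

# Same two spellings as above: precomposed vs. combining diaeresis
sample_data <- tibble(text = c("sch\u00f6n", "scho\u0308n"))

# Request canonical decomposition (NFD) instead of the default NFC
nfd_rec <- recipe(~ ., data = sample_data) %>%
  step_text_normalization(text, normalization_form = "nfd") %>%
  prep()

# as.character() guards against the column being returned as a factor
nfd_text <- as.character(bake(nfd_rec, new_data = NULL)$text)

# Both rows should now hold the same decomposed string
nfd_text[1] == nfd_text[2]
```

Whichever form is chosen, the same form should be applied to any text that will later be compared or joined against the normalized column.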