step_text_normalization() creates a specification of a recipe step that will perform Unicode normalization on character variables.
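As a rough illustration of why this matters, the sketch below calls stringi directly rather than going through the recipe step. It shows two spellings of the same word that look identical but are made of different code points, and that they only compare equal after normalization. The variable names are hypothetical and chosen just for this example.

```r
library(stringi)

# Two visually identical spellings of the same word
precomposed <- "sch\u00f6n"   # "ö" stored as a single code point (U+00F6)
decomposed  <- "scho\u0308n"  # "o" followed by a combining diaeresis (U+0308)

# Code point counts differ, so a code-point-wise comparison fails
lengths_before <- stri_length(c(precomposed, decomposed))
same_before <- stri_cmp_eq(precomposed, decomposed)

# After NFC normalization both strings use the precomposed form
same_after <- stri_cmp_eq(
  stri_trans_nfc(precomposed),
  stri_trans_nfc(decomposed)
)
```

Without normalization, downstream steps such as tokenization or counting could treat the two spellings as different terms.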

step_text_normalization(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  normalization_form = "nfc",
  skip = FALSE,
  id = rand_id("text_normalization")
)

# S3 method for step_text_normalization
tidy(x, ...)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

One or more selector functions to choose which variables will be transformed. See recipes::selections() for more details. For the tidy method, these are not currently used.

role

Not used by this step since no new variables are created.

trained

A logical to indicate if the step has been trained by recipes::prep.recipe().

columns

A character vector of the names of the variables selected by the ... argument. This is NULL until the step is trained by recipes::prep.recipe().

normalization_form

A single character string determining the Unicode Normalization. Must be one of "nfc", "nfd", "nfkd", "nfkc", or "nfkc_casefold". Defaults to "nfc". See stringi::stri_trans_nfc() for more details.
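The following sketch compares the five forms by calling the corresponding stringi functions directly. It is meant only as an illustration of how the forms differ, on the assumption that each value of normalization_form corresponds to the stringi function of the same name. The sample strings and object names are hypothetical.

```r
library(stringi)

x <- c(
  "\ufb01",   # "fi" ligature as a single compatibility code point (U+FB01)
  "e\u0301",  # "e" followed by a combining acute accent
  "\u00c5"    # precomposed capital A with ring above
)

forms <- list(
  nfc           = stri_trans_nfc(x),           # canonical composition
  nfd           = stri_trans_nfd(x),           # canonical decomposition
  nfkc          = stri_trans_nfkc(x),          # compatibility composition
  nfkd          = stri_trans_nfkd(x),          # compatibility decomposition
  nfkc_casefold = stri_trans_nfkc_casefold(x)  # NFKC plus case folding
)

# Number of code points per string (rows) under each form (columns)
n_codepoints <- sapply(forms, stri_length)
n_codepoints
```

The canonical forms ("nfc", "nfd") leave the ligature untouched, while the compatibility forms ("nfkc", "nfkd") replace it with the two letters "f" and "i". "nfkc_casefold" additionally folds case, so it is the most aggressive choice and can discard distinctions that matter for some tasks.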

skip

A logical. Should the step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

x

A step_text_normalization object.

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

See also

step_texthash() for feature hashing.

Examples

if (requireNamespace("stringi", quietly = TRUE)) {
  library(recipes)
  library(textrecipes)

  # Two spellings of the same word: precomposed "ö" and "o" + combining diaeresis
  sample_data <- tibble(text = c("sch\U00f6n", "scho\U0308n"))

  okc_rec <- recipe(~ ., data = sample_data) %>%
    step_text_normalization(text)

  okc_obj <- okc_rec %>%
    prep()

  juice(okc_obj, text) %>%
    slice(1:2)

  juice(okc_obj) %>%
    slice(2) %>%
    pull(text)

  tidy(okc_rec, number = 1)
  tidy(okc_obj, number = 1)
}
#> # A tibble: 1 x 3
#>   terms  normalization_form id
#>   <quos> <chr>              <chr>
#> 1 text   nfc                text_normalization_n1s9f