Changelog • textrecipes

textrecipes (development version)

textrecipes 1.1.0

CRAN release: 2025-03-18

Improvements

The following steps have gained the argument sparse. When set to "yes", they will produce sparse vectors. (#277)
- step_dummy_hash()
- step_texthash()
- step_tf()
- step_tfidf()
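
A minimal sketch of the new argument, assuming textrecipes 1.1.0 or later and a recipes version with sparse data support are installed. The toy data frame `reviews` is made up for illustration and is not part of the package.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, used only for illustration
reviews <- data.frame(
  text = c("good good film", "bad film"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = reviews) |>
  step_tokenize(text) |>
  step_tf(text, sparse = "yes") # request sparse vectors for the new columns

baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

The same argument can be passed to the other steps in the list above.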

textrecipes 1.0.7

CRAN release: 2025-01-23

Improvements

Documentation for tidy methods for all steps has been improved to describe the return value more accurately. (#262)
Calling ?tidy.step_*() now sends you to the documentation for step_*() where the outcome is documented. (#261)
step_textfeatures() has been made faster and more robust. (#265)

Bug Fixes

Fixed bug in step_clean_levels() where it would produce NAs for character columns. (#274)

textrecipes 1.0.6

CRAN release: 2023-11-15

textfeatures has been removed from Suggests. (#255)
step_textfeatures() no longer returns a politeness feature. (#254)

textrecipes 1.0.5

CRAN release: 2023-10-20

step_untokenize() and step_text_normalization() now return factors instead of strings. (#247)

textrecipes 1.0.4

CRAN release: 2023-08-17

Improvements

step_clean_names() now throws an informative error if needed non-standard role columns are missing during bake(). (#235)
The keep_original_cols argument has been added to step_tokenmerge(). With this change, every step that produces new columns should have the keep_original_cols argument. (#242)
Many internal changes to improve consistency, along with slight speed increases.

Bug Fixes

Fixed bug where step_dummy_hash() and step_texthash() would add new columns before old columns. (#235)
Fixed bug where vocabulary_size wasn’t tunable in step_tokenize_bpe(). (#239)

textrecipes 1.0.3

CRAN release: 2023-04-14

Improvements

Steps with tunable arguments now have those arguments listed in the documentation.
All steps that add new columns will now informatively error if name collision occurs.

Bug Fixes

Fixed bug where the weight argument of step_tf() wasn’t tunable.

textrecipes 1.0.2

CRAN release: 2022-12-21

Setting token = "tweets" in step_tokenize() has been deprecated due to tokenizers::tokenize_tweets() being deprecated. (#209)
step_sequence_onehot(), step_dummy_hash(), and step_texthash() now return integers. step_tf() returns integers when weight_scheme is "binary" or "raw count".
All steps now have required_pkgs() methods.

textrecipes 1.0.1

CRAN release: 2022-10-06

Examples no longer include if (require(...)) code.

textrecipes 1.0.0

CRAN release: 2022-07-02

Indicate which steps support case weights (none), to align documentation with other packages.

textrecipes 0.5.2

CRAN release: 2022-05-04

Remove use of okc_text in vignette
Fix bug in printing of tokenlists

textrecipes 0.5.1

CRAN release: 2022-03-29

step_tfidf() now correctly saves the idf values and applies them to the testing data set.
tidy.step_tfidf() now returns calculated IDF weights.

textrecipes 0.5.0

CRAN release: 2022-03-20

New steps

step_dummy_hash() generates binary indicators (possibly signed) from simple factor or character vectors.
step_tokenize() has gained three cousin functions, step_tokenize_bpe(), step_tokenize_sentencepiece() and step_tokenize_wordpiece(), which wrap {tokenizers.bpe}, {sentencepiece} and {wordpiece} respectively (#147).

Improvements and Other Changes

Added all_tokenized() and all_tokenized_predictors() to more easily select tokenized columns (#132).
Use show_tokens() to more easily debug a recipe involving tokenization.
Reorganize documentation for all recipe step tidy methods (#126).
Steps now have a dedicated subsection detailing what happens when tidy() is applied. (#163)
All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
step_ngram() has been given a speed increase to put it in line with other packages' performance.
step_tokenize() will now try to error if the vocabulary size is too low when using engine = "tokenizers.bpe" (#119).
The warning given by step_tokenfilter() when filtering failed to apply now refers to the correct argument name (#137).
step_tf() now returns 0 instead of NaN when there aren’t any tokens present (#118).
step_tokenfilter() now has a new argument filter_fun, which takes a function that can be used to filter tokens. (#164)
tidy.step_stem() now correctly shows if a custom stemmer was used.
Added keep_original_cols argument to step_lda(), step_texthash(), step_tf(), step_tfidf(), step_word_embeddings(), step_dummy_hash(), step_sequence_onehot(), and step_textfeatures() (#139).
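
As a rough illustration of two of the items above, show_tokens() and the filter_fun argument of step_tokenfilter() can be combined to check which tokens survive a custom filter. This is a sketch only: the data frame `docs` and the length-based filter are hypothetical choices made for this example.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, used only for illustration
docs <- data.frame(
  text = c("a tiny example of text", "an even tinier one"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = docs) |>
  step_tokenize(text) |>
  # filter_fun takes a character vector of tokens and returns a logical
  # vector of the same length; here tokens longer than three characters are kept
  step_tokenfilter(text, filter_fun = function(x) nchar(x) > 3)

# Inspect the tokens as they look at this point in the recipe
tokens <- show_tokens(rec, text)
tokens
```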

Breaking Changes

Steps with a prefix argument now create names according to the pattern prefix_variablename_name/number. (#124)
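
A small sketch of the naming pattern, using step_tf() as one example of a step with a prefix argument. The data frame `docs` is hypothetical; with a variable called text and the prefix "tf", the new column names are expected to start with tf_text_.

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, used only for illustration
docs <- data.frame(
  text = c("good good film", "bad film"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = docs) |>
  step_tokenize(text) |>
  step_tf(text, prefix = "tf")

baked <- bake(prep(rec), new_data = NULL)

# Column names follow prefix_variablename_name
names(baked)
```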

textrecipes 0.4.1

CRAN release: 2021-07-11

Bug fixes

Fixed a bug in step_tokenfilter() and step_sequence_onehot() that sometimes caused crashes in R 4.1.0.

textrecipes 0.4.0

CRAN release: 2020-11-12

Breaking Changes

step_lda() now takes a tokenlist instead of a character variable. See readme for more detail.

New Features

step_sequence_onehot() now takes tokenlists as input.
Added {tokenizers.bpe} engine to step_tokenize().
Added {udpipe} engine to step_tokenize().
Added new steps for cleaning variable names or levels with {janitor}, step_clean_names() and step_clean_levels(). (#101)

textrecipes 0.3.0

CRAN release: 2020-07-08

The stopwords package has been moved from Imports to Suggests.
step_ngram() gained an argument min_num_tokens to be able to return multiple n-grams together. (#90)
Adds step_text_normalization() to perform unicode normalization on character vectors. (#86)
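
For the min_num_tokens argument mentioned above, a minimal sketch might look like the following. The data frame `docs` is hypothetical, and show_tokens() is used here only for inspection; it was added in a later release than step_ngram().

```r
library(recipes)
library(textrecipes)

# Hypothetical toy data, used only for illustration
docs <- data.frame(
  text = c("not a bad film", "a good film"),
  stringsAsFactors = FALSE
)

rec <- recipe(~ text, data = docs) |>
  step_tokenize(text) |>
  # Return both unigrams and bigrams in the same tokenlist
  step_ngram(text, num_tokens = 2, min_num_tokens = 1)

ngrams <- show_tokens(rec, text)
ngrams
```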

textrecipes 0.2.3

CRAN release: 2020-05-22

textrecipes 0.2.2

CRAN release: 2020-05-10

step_word_embeddings() gained an argument aggregation_default to specify the value used in cases where no words match the embedding.

textrecipes 0.2.1

CRAN release: 2020-05-04

textrecipes 0.2.0

CRAN release: 2020-04-14

step_tokenize() gained an engine argument to specify packages other than tokenizers to tokenize.
spacyr has been added as an engine to step_tokenize().
step_lemma() has been added to extract lemma attribute from tokenlists.
step_pos_filter() has been added to allow filtering of tokens based on their part-of-speech tags.
step_ngram() has been added to generate ngrams from tokenlists.
step_stem() now correctly uses the options argument. (Thanks to @grayskripko for finding the bug, #64)

textrecipes 0.1.0

CRAN release: 2020-03-05

step_word2vec() has been changed to step_lda() to reflect what is actually happening.
step_word_embeddings() has been added. Allows for use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional “meaning” space. (@jonthegeek, #20)
text2vec has been changed from Imports to Suggests.
textfeatures has been changed from Imports to Suggests.
step_tfidf() calculations are slightly changed due to a flaw in the original implementation; see https://github.com/dselivanov/text2vec/issues/280.

textrecipes 0.0.2

CRAN release: 2019-09-07

A custom stemming function can now be used in step_stem() using the custom_stemmer argument.
step_textfeatures() has been added; it allows multiple numerical features to be pulled from text.
step_sequence_onehot() has been added; it allows one-hot encoding of sequences of fixed width.
step_word2vec() has been added; it calculates word2vec dimensions.
step_tokenmerge() has been added; it combines multiple list-columns into one list-column.
step_texthash() now correctly accepts signed argument.
Documentation has been improved to showcase the importance of filtering tokens before applying step_tf() and step_tfidf().

textrecipes 0.0.1

CRAN release: 2018-12-17

First CRAN version