textrecipes 1.1.0
Improvements
- The following steps have gained the argument sparse. When set to "yes", they will produce sparse vectors. (#277)
textrecipes 1.0.7
CRAN release: 2025-01-23
Bug Fixes
- Fixed bug in step_clean_levels() where it would produce NAs for character columns. (#274)
textrecipes 1.0.5
CRAN release: 2023-10-20
- step_untokenize() and step_normalization() now return factors instead of strings. (#247)
textrecipes 1.0.4
CRAN release: 2023-08-17
Improvements
- step_clean_names() now throws an informative error if needed non-standard role columns are missing during bake(). (#235)
- The keep_original_cols argument has been added to step_tokenmerge(). This change should mean that every step that produces new columns has the keep_original_cols argument. (#242)
- Many internal changes to improve consistency, along with slight speed increases.
Bug Fixes
- Fixed bug where step_dummy_hash() and step_texthash() would add new columns before old columns. (#235)
- Fixed bug where vocabulary_size wasn’t tunable in step_tokenize_bpe(). (#239)
textrecipes 1.0.3
CRAN release: 2023-04-14
Improvements
- Steps with tunable arguments now have those arguments listed in the documentation.
- All steps that add new columns will now informatively error if name collision occurs.
Bug Fixes
- Fixed bug where step_tf() wasn’t tunable for the weight argument.
textrecipes 1.0.2
CRAN release: 2022-12-21
- Setting token = "tweets" in step_tokenize() has been deprecated due to tokenizers::tokenize_tweets() being deprecated. (#209)
- step_sequence_onehot(), step_dummy_hash(), step_dummy_texthash() now return integers.
- step_tf() returns integer when weight_scheme is "binary" or "raw count".
- All steps now have required_pkgs() methods.
textrecipes 1.0.0
CRAN release: 2022-07-02
- Indicate which steps support case weights (none), to align documentation with other packages.
textrecipes 0.5.2
CRAN release: 2022-05-04
- Remove use of okc_text in vignette.
- Fix bug in printing of tokenlists.
textrecipes 0.5.1
CRAN release: 2022-03-29
- step_tfidf() now correctly saves the idf values and applies them to the testing data set.
- tidy.step_tfidf() now returns calculated IDF weights.
textrecipes 0.5.0
CRAN release: 2022-03-20
New steps
- step_dummy_hash() generates binary indicators (possibly signed) from simple factor or character vectors.
- step_tokenize() has gotten a couple of cousin functions, step_tokenize_bpe(), step_tokenize_sentencepiece() and step_tokenize_wordpiece(), which wrap {tokenizers.bpe}, {sentencepiece} and {wordpiece} respectively (#147).
Improvements and Other Changes
- Added all_tokenized() and all_tokenized_predictors() to more easily select tokenized columns (#132).
- Use show_tokens() to more easily debug a recipe involving tokenization.
- Reorganize documentation for all recipe step tidy methods (#126).
- Steps now have a dedicated subsection detailing what happens when tidy() is applied. (#163)
- All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect (#141).
- step_ngram() has been given a speed increase to put it in line with other packages' performance.
- step_tokenize() will now try to error if vocabulary size is too low when using engine = "tokenizers.bpe" (#119).
- Warning given by step_tokenfilter() when filtering failed to apply now correctly refers to the right argument name (#137).
- step_tf() now returns 0 instead of NaN when there aren’t any tokens present (#118).
- step_tokenfilter() now has a new argument filter_fun which takes a function that can be used to filter tokens. (#164)
- tidy.step_stem() now correctly shows if custom stemmer was used.
- Added keep_original_cols argument to step_lda(), step_texthash(), step_tf(), step_tfidf(), step_word_embeddings(), step_dummy_hash(), step_sequence_onehot(), and step_textfeatures() (#139).
Breaking Changes
- Steps with a prefix argument now create names according to the pattern prefix_variablename_name/number. (#124)
textrecipes 0.4.1
CRAN release: 2021-07-11
Bug fixes
- Fixed a bug in step_tokenfilter() and step_sequence_onehot() that sometimes caused crashes in R 4.1.0.
textrecipes 0.4.0
CRAN release: 2020-11-12
Breaking Changes
- step_lda() now takes a tokenlist instead of a character variable. See readme for more detail.
New Features
- step_sequence_onehot() now takes tokenlists as input.
- Added {tokenizers.bpe} engine to step_tokenize().
- Added {udpipe} engine to step_tokenize().
- Added new steps for cleaning variable names or levels with {janitor}, step_clean_names() and step_clean_levels(). (#101)
textrecipes 0.3.0
CRAN release: 2020-07-08
- The stopwords package has been moved from Imports to Suggests.
- step_ngram() gained an argument min_num_tokens to be able to return multiple n-grams together. (#90)
- Added step_text_normalization() to perform Unicode normalization on character vectors. (#86)
textrecipes 0.2.2
CRAN release: 2020-05-10
- step_word_embeddings() got an argument aggregation_default to specify the value to use in cases where no words match the embedding.
textrecipes 0.2.0
CRAN release: 2020-04-14
- step_tokenize() got an engine argument to specify packages other than tokenizers to tokenize.
- spacyr has been added as an engine to step_tokenize().
- step_lemma() has been added to extract the lemma attribute from tokenlists.
- step_pos_filter() has been added to allow filtering of tokens based on their part of speech tags.
- step_ngram() has been added to generate ngrams from tokenlists.
- step_stem() now correctly uses the options argument. (Thanks to @grayskripko for finding bug, #64)
textrecipes 0.1.0
CRAN release: 2020-03-05
- step_word2vec() has been changed to step_lda() to reflect what is actually happening.
- step_word_embeddings() has been added. Allows for use of pre-trained word embeddings to convert token columns to vectors in a high-dimensional “meaning” space. (@jonthegeek, #20)
- text2vec has been changed from Imports to Suggests.
- textfeatures has been changed from Imports to Suggests.
- step_tfidf() calculations are slightly changed due to a flaw in the original implementation https://github.com/dselivanov/text2vec/issues/280.
textrecipes 0.0.2
CRAN release: 2019-09-07
- Custom stemming function can now be used in step_stem() using the custom_stemmer argument.
- step_textfeatures() has been added, allows for multiple numerical features to be pulled from text.
- step_sequence_onehot() has been added, allows for one hot encoding of sequences of fixed width.
- step_word2vec() has been added, calculates word2vec dimensions.
- step_tokenmerge() has been added, combines multiple list columns into one list column.
- step_texthash() now correctly accepts the signed argument.
- Documentation has been improved to showcase the importance of filtering tokens before applying step_tf() and step_tfidf().
