This is an extremely simple tokenizer that splits text on spaces. It can also optionally apply the cleaning processes from prepare_text().
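To make the splitting behavior concrete, here is a minimal sketch in base R. The tokenize_space() name is hypothetical, and this only illustrates space-splitting; it is not the package's implementation:

# Hypothetical illustration: split each element of a character vector
# on single spaces; strsplit() returns one vector of tokens per input.
tokenize_space <- function(text) {
  strsplit(text, " ", fixed = TRUE)
}

tokenize_space("This is a sentence.")
#> [[1]]
#> [1] "This"      "is"        "a"         "sentence."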
Arguments
- text
A character vector to clean and tokenize.
- prepare
Logical; should the text be passed through prepare_text()?
- ...
Arguments passed on to prepare_text (two of these cleaning steps are sketched after this list):
squish_whitespace
Logical scalar; squish whitespace characters (using str_squish)?
remove_control_characters
Logical scalar; remove control characters?
remove_replacement_characters
Logical scalar; remove the "replacement character", U+FFFD?
remove_diacritics
Logical scalar; remove diacritical marks (accents, etc.) from characters?
space_cjk
Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
space_punctuation
Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?
remove_terminal_hyphens
Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken".
space_hyphens
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
space_abbreviations
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
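As promised above, here is a sketch of two of the cleaning steps using base R and stringr. The clean_text() helper is hypothetical and only illustrates the documented behavior (str_squish comes from stringr, as noted in the squish_whitespace entry); it is not the package's implementation:

library(stringr)

# Hypothetical sketch of two cleaning steps (illustrative only).
clean_text <- function(text,
                       squish_whitespace = TRUE,
                       remove_terminal_hyphens = TRUE) {
  if (remove_terminal_hyphens) {
    # Rejoin words broken across a line: "un-\nbroken" -> "unbroken".
    text <- gsub("([[:alpha:]])-\n([[:alpha:]])", "\\1\\2", text)
  }
  if (squish_whitespace) {
    # Collapse runs of whitespace to single spaces and trim the ends.
    text <- str_squish(text)
  }
  text
}

clean_text("An un-\nbroken   line.")
#> [1] "An unbroken line."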