This is an extremely simple tokenizer that splits text on spaces. It can
also optionally apply the cleaning processes from
prepare_text.
Arguments
- text
  A character vector to clean.
- prepare
  Logical; should the text be passed through
  prepare_text?
- ...
  Arguments passed on to
  prepare_text:
  - squish_whitespace
    Logical scalar; squish whitespace characters (using
    str_squish)?
  - remove_control_characters
    Logical scalar; remove control characters?
  - remove_replacement_characters
    Logical scalar; remove the "replacement character",
    U+FFFD?
  - remove_diacritics
    Logical scalar; remove diacritical marks (accents, etc.) from characters?
  - space_cjk
    Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
  - space_punctuation
    Logical scalar; add spaces around punctuation (to make it easier to keep
    punctuation during tokenization)?
  - remove_terminal_hyphens
    Logical; should hyphens at the end of lines after a word be removed?
    For example, "un-\nbroken" would become "unbroken".
  - space_hyphens
    Logical; treat hyphens between letters and at the start/end of words as
    parts of words? Other hyphens are always treated as punctuation.
  - space_abbreviations
    Logical; treat apostrophes between letters as parts of words (as in
    contractions and abbreviations)? Other apostrophes are always treated as
    punctuation.
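A minimal usage sketch in R. This page does not name the documented function, so the name prepare_and_tokenize below is an assumption (adjust it to the actual function name); the arguments shown are the ones described above.

```r
# Hypothetical function name: prepare_and_tokenize (not named on this page).
text <- c("An  example   sentence.", "Some un-\nbroken text.")

# Clean with prepare_text first (prepare = TRUE), then split on spaces:
prepare_and_tokenize(text, prepare = TRUE)

# Cleaning options are passed through ... to prepare_text, e.g. to space
# out punctuation so it survives tokenization, and to rejoin words
# hyphenated across line breaks:
prepare_and_tokenize(
  text,
  prepare = TRUE,
  space_punctuation = TRUE,
  remove_terminal_hyphens = TRUE
)
```

With prepare = FALSE, the text is split on spaces as-is; runs of whitespace and attached punctuation are then left untouched.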