This is an extremely simple tokenizer that splits text on spaces. It can also optionally apply the cleaning processes from prepare_text().
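To make the splitting behavior concrete, here is a minimal sketch in base R. The tokenize_space() name is hypothetical, and this only illustrates space-splitting; it is not the package's implementation:

# Hypothetical illustration: split each element of a character vector
# on single spaces; strsplit() returns one vector of tokens per input.
tokenize_space <- function(text) {
  strsplit(text, " ", fixed = TRUE)
}

tokenize_space("This is a sentence.")
#> [[1]]
#> [1] "This"      "is"        "a"         "sentence."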
Arguments
- text
A character vector to clean and tokenize.
- prepare
Logical; should the text be passed through prepare_text()?
- ...
Arguments passed on to prepare_text (two of these cleaning steps are sketched after this list):
squish_whitespace
Logical scalar; squish whitespace characters (using str_squish)?
remove_control_characters
Logical scalar; remove control characters?
remove_replacement_characters
Logical scalar; remove the "replacement character", U+FFFD?
remove_diacritics
Logical scalar; remove diacritical marks (accents, etc.) from characters?
space_cjk
Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
space_punctuation
Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?
remove_terminal_hyphens
Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken".
space_hyphens
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
space_abbreviations
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
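As promised above, here is a sketch of two of the cleaning steps using base R and stringr. The clean_text() helper is hypothetical and only illustrates the documented behavior (str_squish comes from stringr, as noted in the squish_whitespace entry); it is not the package's implementation:

library(stringr)

# Hypothetical sketch of two cleaning steps (illustrative only).
clean_text <- function(text,
                       squish_whitespace = TRUE,
                       remove_terminal_hyphens = TRUE) {
  if (remove_terminal_hyphens) {
    # Rejoin words broken across a line: "un-\nbroken" -> "unbroken".
    text <- gsub("([[:alpha:]])-\n([[:alpha:]])", "\\1\\2", text)
  }
  if (squish_whitespace) {
    # Collapse runs of whitespace to single spaces and trim the ends.
    text <- str_squish(text)
  }
  text
}

clean_text("An un-\nbroken   line.")
#> [1] "An unbroken line."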