To keep punctuation marks as separate tokens during tokenization, it's convenient to add spaces around each punctuation character. This function does that, with options to keep certain hyphens and apostrophes attached to the word they belong to.
Arguments
- text
A character vector to clean.
- space_hyphens
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
- space_abbreviations
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
Value
A character vector of the same length as the input text, with spaces added around punctuation characters.
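The default behavior can be approximated in base R with two regular-expression passes. This is a minimal sketch under stated assumptions, not the package's implementation: space_punct_sketch is a hypothetical name, and the sketch ignores the space_hyphens and space_abbreviations options, so it spaces every punctuation character.

```r
# Hypothetical helper for illustration only; not part of the package.
space_punct_sketch <- function(text) {
  # Pass 1: put a space on both sides of every punctuation character.
  spaced <- gsub("([[:punct:]])", " \\1 ", text)
  # Pass 2: collapse the runs of spaces that adjacent punctuation creates.
  gsub(" +", " ", spaced)
}

space_punct_sketch("Wait, what?")
```

Because gsub() is vectorized, the result has one element per input element, which is consistent with the return value described above.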
Examples
to_space <- "This is some 'gosh-darn' $5 text. Isn't it lovely?"
to_space
#> [1] "This is some 'gosh-darn' $5 text. Isn't it lovely?"
space_punctuation(to_space)
#> [1] "This is some ' gosh - darn ' $ 5 text . Isn ' t it lovely ? "
space_punctuation(to_space, space_hyphens = FALSE)
#> [1] "This is some ' gosh-darn ' $ 5 text . Isn ' t it lovely ? "
space_punctuation(to_space, space_abbreviations = FALSE)
#> [1] "This is some ' gosh - darn ' $ 5 text . Isn't it lovely ? "
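Once punctuation is spaced, a plain whitespace split keeps each punctuation mark as its own token. The sketch below uses only base R and starts from the default output shown above, copied as a literal string; the variable names spaced and tokens are illustrative.

```r
# Default output of space_punctuation(to_space), copied from the example above.
spaced <- "This is some ' gosh - darn ' $ 5 text . Isn ' t it lovely ? "

# Split on one or more spaces; strsplit() returns a list with one element
# per input string, so take the first element.
tokens <- strsplit(spaced, " +")[[1]]
tokens
```

Each quote, hyphen, dollar sign, period, and question mark ends up as a separate element of tokens, rather than being glued to a neighboring word.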