This function combines the other functions in this package to prepare text for tokenization. The text is first converted to valid UTF-8 (if possible), and then the cleaning functions selected by the arguments below are applied.
Usage
prepare_text(
  text,
  squish_whitespace = TRUE,
  remove_terminal_hyphens = TRUE,
  remove_control_characters = TRUE,
  remove_replacement_characters = TRUE,
  remove_diacritics = TRUE,
  space_cjk = TRUE,
  space_punctuation = TRUE,
  space_hyphens = TRUE,
  space_abbreviations = TRUE
)
Arguments
- text
A character vector to clean.
- squish_whitespace
Logical scalar; squish whitespace characters (using str_squish)?
- remove_terminal_hyphens
Logical scalar; remove a hyphen that ends a line directly after a word? For example, "un-\nbroken" would become "unbroken".
- remove_control_characters
Logical scalar; remove control characters?
- remove_replacement_characters
Logical scalar; remove the "replacement character", U+FFFD?
- remove_diacritics
Logical scalar; remove diacritical marks (accents, etc) from characters?
- space_cjk
Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
- space_punctuation
Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?
- space_hyphens
Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
- space_abbreviations
Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
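To make three of these options more concrete, the following base R sketch approximates what remove_terminal_hyphens, space_punctuation, and squish_whitespace describe. The regular expressions and the variable names (x, step1, step2, step3) are assumptions chosen for illustration; they are not the package's implementation, and they ignore the finer rules for hyphens and apostrophes described above.

```r
# Illustrative approximations only; not how prepare_text() is implemented.
x <- "  The un-\nbroken   line, really.  "

# remove_terminal_hyphens: drop a hyphen that ends a line after a word character.
step1 <- gsub("(\\w)-\\s*\\n\\s*", "\\1", x, perl = TRUE)

# space_punctuation: pad every punctuation character with spaces.
step2 <- gsub("([[:punct:]])", " \\1 ", step1)

# squish_whitespace: collapse runs of whitespace, then trim both ends.
step3 <- trimws(gsub("\\s+", " ", step2))

print(step3)
#> [1] "The unbroken line , really ."
```

The order matters in this sketch: the terminal hyphen is removed before punctuation is spaced, otherwise the hyphen would be padded like any other punctuation mark and the word would stay split.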
Examples
piece1 <- " This is a \n\nfa\xE7ile\n\n example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
  piece1,
  "It has the bell character, \a, and the replacement character,",
  intToUtf8(65533)
)
prepare_text(example_text)
#> [1] "This is a facile example . It has the bell character , , and the replacement character ,"
prepare_text(example_text, squish_whitespace = FALSE)
#> [1] " This is a facile example . It has the bell character , , and the replacement character , "
prepare_text(example_text, remove_control_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , \a , and the replacement character ,"
prepare_text(example_text, remove_replacement_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , , and the replacement character , �"
prepare_text(example_text, remove_diacritics = FALSE)
#> [1] "This is a façile example . It has the bell character , , and the replacement character ,"