This function combines the other functions in this package to prepare text for tokenization. The text is first converted to valid UTF-8 (if possible), and then the cleaning functions selected by the arguments are applied.
Usage
prepare_text(
  text,
  squish_whitespace = TRUE,
  remove_terminal_hyphens = TRUE,
  remove_control_characters = TRUE,
  remove_replacement_characters = TRUE,
  remove_diacritics = TRUE,
  space_cjk = TRUE,
  space_punctuation = TRUE,
  space_hyphens = TRUE,
  space_abbreviations = TRUE
)
Arguments
- text
- A character vector to clean. 
- squish_whitespace
- Logical scalar; squish whitespace characters (using str_squish)?
- remove_terminal_hyphens
- Logical; should a hyphen that splits a word across a line break be removed? For example, "un-\nbroken" would become "unbroken".
- remove_control_characters
- Logical scalar; remove control characters? 
- remove_replacement_characters
- Logical scalar; remove the "replacement character", U+FFFD?
- remove_diacritics
- Logical scalar; remove diacritical marks (accents, etc.) from characters?
- space_cjk
- Logical scalar; add spaces around Chinese/Japanese/Korean ideographs? 
- space_punctuation
- Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)? 
- space_hyphens
- Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation. 
- space_abbreviations
- Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation. 
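To make a few of the cleaning steps above more concrete, here is a rough sketch in base R of what removing terminal hyphens, removing the replacement character, and squishing whitespace might look like. The `sketch_*` helpers are hypothetical names introduced only for illustration. They are not part of this package and are not claimed to match its actual implementation, which may order or handle these steps differently.

```r
# Hypothetical helpers (base R only); NOT the package's implementation.

# Join a word that was split by a hyphen at a line break:
# "un-\nbroken" -> "unbroken".
sketch_remove_terminal_hyphens <- function(text) {
  gsub("([[:alpha:]])-\\n([[:alpha:]])", "\\1\\2", text)
}

# Drop the Unicode replacement character, U+FFFD (code point 65533).
sketch_remove_replacement <- function(text) {
  gsub(intToUtf8(65533), "", text, fixed = TRUE)
}

# Collapse runs of whitespace to one space and trim both ends,
# similar in spirit to str_squish.
sketch_squish <- function(text) {
  trimws(gsub("\\s+", " ", text))
}

# Apply the three steps in sequence (the order is an assumption).
sketch_clean <- function(text) {
  text <- sketch_remove_terminal_hyphens(text)
  text <- sketch_remove_replacement(text)
  sketch_squish(text)
}

x <- paste0(" un-\nbroken   text ", intToUtf8(65533), " ")
sketch_clean(x)
#> [1] "unbroken text"
```

Each helper corresponds loosely to one logical argument of prepare_text, so switching an argument to FALSE is roughly like skipping that step in the chain.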
Examples
piece1 <- " This is a    \n\nfa\xE7ile\n\n    example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
  piece1,
  "It has the bell character, \a, and the replacement character,",
  intToUtf8(65533)
)
prepare_text(example_text)
#> [1] "This is a facile example . It has the bell character , , and the replacement character ,"
prepare_text(example_text, squish_whitespace = FALSE)
#> [1] " This is a    facile    example .  It has the bell character ,   ,  and the replacement character ,  "
prepare_text(example_text, remove_control_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , \a , and the replacement character ,"
prepare_text(example_text, remove_replacement_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , , and the replacement character , �"
prepare_text(example_text, remove_diacritics = FALSE)
#> [1] "This is a façile example . It has the bell character , , and the replacement character ,"