Prepare Text for Tokenization — prepare

This function combines the other functions in this package to prepare text for tokenization. The text is first converted to valid UTF-8 where possible, and then the cleaning steps selected by the arguments are applied.
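As an illustration of the UTF-8 conversion step, the base R sketch below marks a Latin-1 byte string with its encoding and converts it. It uses only base R (validUTF8, Encoding, enc2utf8) and is an assumption about what such a step can look like, not this package's internal code.

```r
# A string containing the single byte E7 ("c with cedilla" in Latin-1).
raw_text <- "fa\xE7ile"

# validUTF8() inspects the bytes only; a lone E7 byte is not valid UTF-8.
was_valid <- validUTF8(raw_text)

# Declare the encoding the bytes are actually in, then convert to UTF-8.
Encoding(raw_text) <- "latin1"
utf8_text <- enc2utf8(raw_text)

is_valid <- validUTF8(utf8_text)
n_chars <- nchar(utf8_text)
```

After conversion the string is valid UTF-8 and counts as six characters, which is why the Examples set the encoding of their input explicitly.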

Usage

prepare_text(
  text,
  squish_whitespace = TRUE,
  remove_terminal_hyphens = TRUE,
  remove_control_characters = TRUE,
  remove_replacement_characters = TRUE,
  remove_diacritics = TRUE,
  space_cjk = TRUE,
  space_punctuation = TRUE,
  space_hyphens = TRUE,
  space_abbreviations = TRUE
)

Arguments

text: A character vector to clean.
squish_whitespace: Logical scalar; squish whitespace characters (using str_squish)?
remove_terminal_hyphens: Logical scalar; remove hyphens that split a word across a line break? For example, "un-\nbroken" would become "unbroken".
remove_control_characters: Logical scalar; remove control characters?
remove_replacement_characters: Logical scalar; remove the "replacement character", U+FFFD?
remove_diacritics: Logical scalar; remove diacritical marks (accents, etc.) from characters?
space_cjk: Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
space_punctuation: Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?
space_hyphens: Logical scalar; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
space_abbreviations: Logical scalar; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
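
To make a few of these options concrete, here is a rough base R sketch of terminal-hyphen removal, punctuation spacing, and whitespace squishing. The sketch_* helper names are hypothetical and the regular expressions are simplifying assumptions; the package's own functions may handle more cases (for example, the hyphen and apostrophe rules described above).

```r
# Hypothetical helpers for illustration only; not the package's implementation.

sketch_remove_terminal_hyphens <- function(x) {
  # Join a word split by a hyphen at a line end: "un-\nbroken" -> "unbroken".
  gsub("([[:alpha:]])-[ \t]*\n[ \t]*([[:alpha:]])", "\\1\\2", x)
}

sketch_space_punctuation <- function(x) {
  # Put a space on each side of every punctuation character.
  gsub("([[:punct:]])", " \\1 ", x)
}

sketch_squish_whitespace <- function(x) {
  # Collapse runs of whitespace to one space, then trim both ends.
  trimws(gsub("[[:space:]]+", " ", x))
}

x <- "An un-\nbroken   line, at last."
joined <- sketch_remove_terminal_hyphens(x)
spaced <- sketch_space_punctuation(joined)
cleaned <- sketch_squish_whitespace(spaced)
cleaned
#> [1] "An unbroken line , at last ."
```

The order matters in this sketch: hyphens are joined before punctuation is spaced, since spacing first would separate the hyphen from the line break it precedes.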

Value

The character vector, cleaned as specified.

Examples

piece1 <- " This is a    \n\nfa\xE7ile\n\n    example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
  piece1,
  "It has the bell character, \a, and the replacement character,",
  intToUtf8(65533)
)
prepare_text(example_text)
#> [1] "This is a facile example . It has the bell character , , and the replacement character ,"
prepare_text(example_text, squish_whitespace = FALSE)
#> [1] " This is a    facile    example .  It has the bell character ,   ,  and the replacement character ,  "
prepare_text(example_text, remove_control_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , \a , and the replacement character ,"
prepare_text(example_text, remove_replacement_characters = FALSE)
#> [1] "This is a facile example . It has the bell character , , and the replacement character , �"
prepare_text(example_text, remove_diacritics = FALSE)
#> [1] "This is a façile example . It has the bell character , , and the replacement character ,"
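
The bell and replacement characters dropped in the examples above can also be stripped with base R alone. The following is a minimal sketch under that assumption, not how prepare_text does it internally; the variable names are illustrative.

```r
# Build a string containing the bell character and U+FFFD.
replacement_char <- intToUtf8(65533)
x <- paste0("bell: \a, replacement: ", replacement_char, "!")

# Drop control characters such as the bell (\a).
no_control <- gsub("[[:cntrl:]]", "", x)

# Drop the replacement character, matched literally.
no_replacement <- gsub(replacement_char, "", no_control, fixed = TRUE)
no_replacement
#> [1] "bell: , replacement: !"
```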