This is a very simple tokenizer that splits text on spaces. It can also optionally apply the cleaning processes from prepare_text first.
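
When prepare = TRUE, the result is conceptually similar to cleaning the text with prepare_text and then splitting on single spaces. A minimal sketch of that idea (an illustration only, not the package's actual implementation; it assumes the piecemaker namespace):

# Roughly: clean first, then split on spaces.
text <- "This is some text."
cleaned <- piecemaker::prepare_text(text)
strsplit(cleaned, " ", fixed = TRUE)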

Usage

prepare_and_tokenize(text, prepare = TRUE, ...)

Arguments

text

A character vector to clean.

prepare

Logical; should the text be passed through prepare_text?

...

Arguments passed on to prepare_text; see the example sketch after this argument list.

squish_whitespace

Logical scalar; squish whitespace characters (using str_squish)?

remove_control_characters

Logical scalar; remove control characters?

remove_replacement_characters

Logical scalar; remove the "replacement character", U+FFFD?

remove_diacritics

Logical scalar; remove diacritical marks (accents, etc.) from characters?

space_cjk

Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?

space_punctuation

Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?

remove_terminal_hyphens

Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken".

space_hyphens

Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.

space_abbreviations

Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
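
As a brief, hedged illustration of the pass-through arguments (outputs omitted; exact results depend on the prepare_text defaults described above):

# Strip accents before tokenizing, e.g. "Café" becomes "Cafe":
prepare_and_tokenize("Caf\u00e9 au lait.", remove_diacritics = TRUE)

# Collapse runs of whitespace before splitting:
prepare_and_tokenize("too   many   spaces", squish_whitespace = TRUE)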

Value

The text as a list of character vectors. Each element of each vector is roughly equivalent to a word.

Examples

prepare_and_tokenize("This is some text.")
#> [[1]]
#> [1] "This" "is"   "some" "text" "."   
#> 
prepare_and_tokenize("This is some text.", space_punctuation = FALSE)
#> [[1]]
#> [1] "This"  "is"    "some"  "text."
#>
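A further example sketch (output omitted): with prepare = FALSE the raw text is split on spaces as-is, so no cleaning is applied and punctuation stays attached to words.

prepare_and_tokenize("This is some text.", prepare = FALSE)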