Function reference
-
prepare_and_tokenize() - Split Text on Spaces
-
prepare_text() - Prepare Text for Tokenization
-
remove_control_characters() - Remove Non-Character Characters
-
remove_diacritics() - Remove Diacritical Marks on Characters
-
remove_replacement_characters() - Remove the Unicode Replacement Character
-
space_cjk() - Add Spaces Around CJK Ideographs
-
space_punctuation() - Add Spaces Around Punctuation
-
squish_whitespace() - Remove Extra Whitespace
-
tokenize_space() - Break Text at Spaces
-
validate_utf8() - Clean Up Text to UTF-8
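
As a rough illustration of how the operations named above might chain together, here is a minimal Python sketch. It is not the implementation of the functions in this reference: every name is hypothetical, the order of the steps is an assumption, and the CJK range covers only the basic CJK Unified Ideographs block.

```python
import re
import unicodedata


def strip_control(text: str) -> str:
    # Drop control characters (category Cc) but keep tab, newline, and carriage return.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\t\n\r"
    )


def strip_replacement(text: str) -> str:
    # Remove U+FFFD, the Unicode replacement character.
    return text.replace("\ufffd", "")


def strip_diacritics(text: str) -> str:
    # Decompose, drop nonspacing combining marks (category Mn), then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def pad_punctuation(text: str) -> str:
    # Surround every punctuation character (categories P*) with spaces.
    return "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def pad_cjk(text: str) -> str:
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
    return re.sub(r"([\u4e00-\u9fff])", r" \1 ", text)


def squish(text: str) -> str:
    # Collapse runs of whitespace to single spaces and trim both ends.
    return " ".join(text.split())


def sketch_prepare_and_tokenize(text: str) -> list[str]:
    # Hypothetical pipeline: clean, pad, squish, then split on single spaces.
    for step in (strip_control, strip_replacement, strip_diacritics,
                 pad_punctuation, pad_cjk, squish):
        text = step(text)
    return text.split(" ") if text else []


if __name__ == "__main__":
    print(sketch_prepare_and_tokenize("Café,  naïve\ufffd 東京!"))
```

Padding punctuation and ideographs before squishing whitespace means the final split on spaces yields each punctuation mark and each ideograph as its own token.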