Function reference

prepare_and_tokenize()
- Split Text on Spaces

prepare_text()
- Prepare Text for Tokenization

remove_control_characters()
- Remove Non-Character Characters

remove_diacritics()
- Remove Diacritical Marks on Characters

remove_replacement_characters()
- Remove the Unicode Replacement Character

space_cjk()
- Add Spaces Around CJK Ideographs

space_punctuation()
- Add Spaces Around Punctuation

squish_whitespace()
- Remove Extra Whitespace

tokenize_space()
- Break Text at Spaces

validate_utf8()
- Clean Up Text to UTF-8
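
The entries above give only names and one-line titles. As a rough illustration of what steps like these typically involve, here is a minimal Python sketch built on the standard library alone. Everything in it is an assumption made for illustration: the function bodies, the parameters and return types, the restriction of "CJK ideographs" to the U+4E00 to U+9FFF block, and the order in which prepare_text() chains the steps. It reuses the listed names only to make the mapping easy to follow and should not be read as the package's actual signatures or behavior.

```python
import re
import unicodedata


def validate_utf8(raw: bytes) -> str:
    # Assumed behavior: decode as UTF-8, substituting U+FFFD for invalid bytes.
    return raw.decode("utf-8", errors="replace")


def remove_replacement_characters(text: str) -> str:
    # Drop the Unicode replacement character U+FFFD.
    return text.replace("\ufffd", "")


def remove_control_characters(text: str) -> str:
    # Drop code points in the Unicode "Other" (C*) categories, but keep
    # tab, newline, and carriage return so they can be squished later.
    return "".join(
        ch
        for ch in text
        if ch in "\t\n\r" or not unicodedata.category(ch).startswith("C")
    )


def remove_diacritics(text: str) -> str:
    # Decompose, drop nonspacing marks (category Mn), then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def space_cjk(text: str) -> str:
    # Simplification: only the CJK Unified Ideographs block is handled.
    return re.sub(r"([\u4e00-\u9fff])", r" \1 ", text)


def space_punctuation(text: str) -> str:
    # Surround every character in a Unicode punctuation category (P*) with spaces.
    return "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def squish_whitespace(text: str) -> str:
    # Collapse runs of whitespace to single spaces and trim both ends.
    return " ".join(text.split())


def tokenize_space(text: str) -> list[str]:
    # Break at single spaces; empty input yields no tokens.
    return text.split(" ") if text else []


def prepare_text(raw: bytes) -> str:
    # Assumed ordering of the cleaning steps (hypothetical).
    text = validate_utf8(raw)
    text = remove_replacement_characters(text)
    text = remove_control_characters(text)
    text = remove_diacritics(text)
    text = space_cjk(text)
    text = space_punctuation(text)
    return squish_whitespace(text)


def prepare_and_tokenize(raw: bytes) -> list[str]:
    # Prepare the text, then split it at spaces.
    return tokenize_space(prepare_text(raw))


sample = "Héllo,\x00  wörld!中文".encode("utf-8") + b"\xff"
tokens = prepare_and_tokenize(sample)
# → ['Hello', ',', 'world', '!', '中', '文']
```

In this sketch the spacing steps deliberately over-insert spaces and leave cleanup to squish_whitespace(), which keeps each step simple and lets the final split at single spaces produce no empty tokens.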