To tokenize Chinese, Japanese, and Korean (CJK) text, it is convenient to add a space on each side of every CJK character, so that each character becomes its own whitespace-delimited token.
Examples
to_space <- intToUtf8(13312:13320)
to_space
#> [1] "㐀㐁㐂㐃㐄㐅㐆㐇㐈"
space_cjk(to_space)
#> [1] " 㐀 㐁 㐂 㐃 㐄 㐅 㐆 㐇 㐈 "
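As a rough illustration of the idea, the spacing step can be approximated with a single regular-expression substitution. This is a minimal sketch and not the package's implementation: the function name `space_cjk_sketch` is hypothetical, and it assumes only two Unicode blocks count as CJK (CJK Unified Ideographs Extension A, U+3400 to U+4DBF, and CJK Unified Ideographs, U+4E00 to U+9FFF). It also leaves doubled spaces between adjacent characters, so its exact spacing may differ from what `space_cjk()` returns.

```r
# Hypothetical sketch; NOT the source of space_cjk().
# Assumption: only U+3400-U+4DBF and U+4E00-U+9FFF are treated as CJK.
space_cjk_sketch <- function(text) {
  # Wrap every matching character in a pair of spaces.
  # perl = TRUE is needed for the \x{...} code point syntax.
  gsub("([\\x{3400}-\\x{4DBF}\\x{4E00}-\\x{9FFF}])", " \\1 ", text, perl = TRUE)
}

to_space <- intToUtf8(13312:13320)

# Splitting on runs of whitespace then yields one token per character.
tokens <- strsplit(trimws(space_cjk_sketch(to_space)), "\\s+")[[1]]
length(tokens)
#> [1] 9
```

Splitting on runs of whitespace absorbs the doubled spaces, so the resulting tokens do not depend on exactly how many spaces separate neighbouring characters.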