A fast way to split Khmer words with Intl.Segmenter

Khmer script does not put a separator between words, so we cannot split text on spaces the way we do in English. Instead, we can use Intl.Segmenter, a browser API for splitting text into graphemes, words, or sentences.

This is useful when rendering text on screen, for example to know where a line can wrap.
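To make the wrapping use case concrete, here is a minimal sketch that inserts a zero-width space (U+200B) between segments so a renderer has places to break the line. The function name `insertBreakOpportunities` is my own, and it assumes the runtime already provides Intl.Segmenter.

```javascript
// Hypothetical helper: join the segments of `text` with zero-width spaces
// so that a renderer is allowed to wrap at each segment boundary.
function insertBreakOpportunities(text) {
  const segmenter = new Intl.Segmenter("km", { granularity: "word" });
  // Segments objects are iterable, so Array.from can map over them directly.
  return Array.from(segmenter.segment(text), (s) => s.segment).join("\u200B");
}

const wrappable = insertBreakOpportunities("កូនខ្មែរអាចធ្វើបាន");
console.log(wrappable.length);
```

The visible text stays the same; only invisible break opportunities are added. If the text already contains spaces or punctuation, this simple version puts a zero-width space around those too.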

Requirements

The API is fairly new, but major browsers already support it. You can check current support on caniuse.

If you need a polyfill, you can use intl-segmenter-polyfill.
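Before deciding whether to load a polyfill, you can check for the API at runtime. This is only a sketch: `hasSegmenter` and `supportsKhmer` are names I made up, and `supportedLocalesOf` reports locale support in general, which may not guarantee good Khmer word breaking.

```javascript
// Returns true when the runtime exposes Intl.Segmenter.
function hasSegmenter() {
  return typeof Intl !== "undefined" && typeof Intl.Segmenter === "function";
}

// Returns true when Intl.Segmenter exists and reports support for the "km" locale.
function supportsKhmer() {
  return hasSegmenter() && Intl.Segmenter.supportedLocalesOf(["km"]).length > 0;
}

if (!supportsKhmer()) {
  // This is the point where a polyfill could be loaded.
  console.log("Intl.Segmenter with Khmer support is not available");
}
```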

// Create a word segmenter for the Khmer locale ("km")
const segmenter = new Intl.Segmenter("km", { granularity: "word" });
const text = "កូនខ្មែរអាចធ្វើបាន";

// segment() returns an iterable Segments object
const segments = segmenter.segment(text);

// Convert it to an array
const array = [...segments];
console.log(array);

The result will be:

[
	{
		"segment": "កូន",
		"index": 0,
		"input": "កូនខ្មែរអាចធ្វើបាន",
		"isWordLike": true
	},
	{
		"segment": "ខ្មែរ",
		"index": 3,
		"input": "កូនខ្មែរអាចធ្វើបាន",
		"isWordLike": true
	},
	{
		"segment": "អាច",
		"index": 8,
		"input": "កូនខ្មែរអាចធ្វើបាន",
		"isWordLike": true
	},
	{
		"segment": "ធ្វើ",
		"index": 11,
		"input": "កូនខ្មែរអាចធ្វើបាន",
		"isWordLike": true
	},
	{
		"segment": "បាន",
		"index": 15,
		"input": "កូនខ្មែរអាចធ្វើបាន",
		"isWordLike": true
	}
]
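Most of the time you only want the words themselves, not the full segment objects. Here is a small sketch of a helper that keeps the segments flagged as `isWordLike` and drops spaces and punctuation. The name `splitKhmerWords` is mine, not part of the API.

```javascript
// Hypothetical helper: return only the word-like segments of `text` as strings.
function splitKhmerWords(text) {
  const segmenter = new Intl.Segmenter("km", { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

console.log(splitKhmerWords("កូនខ្មែរអាចធ្វើបាន"));
```

For the sample text above, this should give the same words as the `segment` fields in the result, although the exact split depends on the dictionary shipped with the runtime.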


The API is based on the Unicode ICU project, which uses a large dictionary of Khmer words to find word boundaries. The dictionary lookup relies on a data structure called a prefix tree, or trie. This approach is typically faster than alternatives such as machine-learning models. However, the result is not 100% accurate.
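To give a feel for how a trie helps with splitting, here is a toy sketch that stores a few words in a trie and always takes the longest dictionary word at the current position. This is not ICU's actual implementation, which is more elaborate. The `Trie` class, the `segmentGreedy` function, and the five-word dictionary are all made up for illustration.

```javascript
// Toy trie: each node has a Map of children and a flag marking the end of a word.
class Trie {
  constructor(words = []) {
    this.root = { children: new Map(), isWord: false };
    for (const word of words) this.insert(word);
  }

  insert(word) {
    let node = this.root;
    for (let i = 0; i < word.length; i++) {
      const ch = word[i];
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    node.isWord = true;
  }

  // Length (in UTF-16 code units) of the longest dictionary word
  // starting at `start`, or 0 when nothing matches.
  longestMatch(text, start) {
    let node = this.root;
    let best = 0;
    for (let i = start; i < text.length; i++) {
      node = node.children.get(text[i]);
      if (!node) break;
      if (node.isWord) best = i - start + 1;
    }
    return best;
  }
}

// Greedy longest-match segmentation (a simplification, not ICU's algorithm).
function segmentGreedy(text, trie) {
  const result = [];
  let i = 0;
  while (i < text.length) {
    // Unknown characters are emitted one at a time.
    const length = trie.longestMatch(text, i) || 1;
    result.push(text.slice(i, i + length));
    i += length;
  }
  return result;
}

const trie = new Trie(["កូន", "ខ្មែរ", "អាច", "ធ្វើ", "បាន"]);
console.log(segmentGreedy("កូនខ្មែរអាចធ្វើបាន", trie));
```

Each lookup walks the trie one character at a time, so the cost depends on the length of the words rather than the size of the dictionary. A greedy split like this can pick the wrong boundary when one word is a prefix of another, which is one reason dictionary-based results are not always correct.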

JavaScript Library

I've made a library just for this. You can import it and use it right away, and it comes with a Khmer language polyfill as well.

seanghay/split-khmer

Let me know what you think and thanks for reading.

2023 © Seanghay Yath