Tokenization Functions

A set of functions to tokenize a given text into sentences or words, with optional conditions and morphological operations.

Tokenization is a way of separating a piece of text into smaller units called tokens.

- Analytics Vidhya

tokenize

Tokenizes a given text into words, with optional conditions and morphological operations.

Parameters

  • text (String) The input text to be tokenized into words.

  • conditions (Array | Function, optional) An array of functions, each representing a condition that a token must meet. If a single function is provided, it will be wrapped in an array.

  • morphs (Array | Function, optional) An array of functions, each representing a morphological operation to be applied to the tokens. If a single function is provided, it will be wrapped in an array.

Returns

  • Array An array of tokens.

Example

const text = "test ذَهب مُحمد الى السوق";

const tokens = tokenize(text);
const tokensWithParams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens);  // Output: [ 'test', 'ذَهب', 'مُحمد', 'الى', 'السوق' ]
console.log(tokensWithParams);  // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
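
To illustrate how conditions and morphs could interact, here is a minimal sketch, not the library's implementation. The names tokenizeSketch, isArabicRangeSketch, stripHarakatSketch, and toArray are hypothetical stand-ins, and the sketch assumes that tokens are split on whitespace, that morphs run before conditions, and that a token is kept only if every condition returns true. The real functions may differ on any of these points.

```javascript
// Hypothetical stand-ins; the library's own helpers may behave differently.
const HARAKAT = /[\u064B-\u0652]/g;          // fathatan through sukun
const ARABIC_ONLY = /^[\u0600-\u06FF]+$/;    // basic Arabic block only

const stripHarakatSketch = (token) => token.replace(HARAKAT, "");
const isArabicRangeSketch = (token) => ARABIC_ONLY.test(token);

// Accept either a single function or an array of functions.
const toArray = (value) => (Array.isArray(value) ? value : [value]);

function tokenizeSketch(text, conditions = [], morphs = []) {
  const conds = toArray(conditions);
  const ops = toArray(morphs);
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0)
    // Assumption: morphs are applied first, in the order given.
    .map((token) => ops.reduce((acc, morph) => morph(acc), token))
    // Assumption: a token survives only if it meets every condition.
    .filter((token) => token.length > 0 && conds.every((cond) => cond(token)));
}

const sample = "test ذَهب مُحمد الى السوق";
console.log(tokenizeSketch(sample));
console.log(tokenizeSketch(sample, isArabicRangeSketch, stripHarakatSketch));
```

Because toArray wraps a bare function, the last call passes single functions rather than arrays, mirroring the behaviour described for the conditions and morphs parameters.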

sentenceTokenize

Tokenizes a given text into sentences.

Parameters

  • text (String) The input text to be tokenized into sentences.

Returns

  • Array An array of sentences.

Example

const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
const sentences = sentenceTokenize(text);
console.log(sentences);
// Output: [ 'لعلم بحر لا نهاية له،', 'فمهما حاول طالب العلم الإبحار في شتى العلوم،', 'سيُدرك أنّه ما زال على الشاطئ.' ]
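
The output above shows segments ending at the Arabic comma as well as at the full stop. A rough sketch of that behaviour, under the assumption that a segment ends at '.', '!', '?', '؟' or '،' and keeps its closing punctuation, might look like the following. sentenceTokenizeSketch and SENTENCE_PATTERN are hypothetical names, and the library's actual delimiter set is not documented here.

```javascript
// Assumed delimiters: . ! ? plus Arabic question mark (U+061F) and comma (U+060C).
const SENTENCE_PATTERN = /[^.!?\u061F\u060C]+[.!?\u061F\u060C]*/g;

function sentenceTokenizeSketch(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const matches = text.match(SENTENCE_PATTERN) || [];
  return matches
    .map((segment) => segment.trim())
    .filter((segment) => segment.length > 0);
}

console.log(sentenceTokenizeSketch("لعلم بحر لا نهاية له، فمهما حاول."));
```

Each segment keeps its trailing delimiter, and text without a final delimiter is returned as one last segment.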

tokenizeWithLocation

Tokenizes a given text into words and provides the start and end indices of each token in the original text.

Parameters

  • text (String) The input text to be tokenized into words.

Returns

  • Array An array of objects, each containing a token and its start and end indices in the original text.

Example

const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);
console.log(tokensWithLocation);
// Output: [
//  { token: 'لعلم', start: 0, end: 4 },
//  { token: 'بحر', start: 5, end: 8 },
//  { token: 'لا', start: 9, end: 11 },
//  { token: 'نهاية', start: 12, end: 17 },
//  { token: 'له', start: 18, end: 20 }
//  ]
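
In the output above, end is exclusive: each end minus start equals the token length, and punctuation is not reported as a token. One way to produce offsets of that shape is sketched below. tokenizeWithLocationSketch is a hypothetical name, and the assumption that a token is a run of letters, combining marks, or digits is mine rather than the library's.

```javascript
function tokenizeWithLocationSketch(text) {
  // Assumption: a token is a run of letters, combining marks, or digits.
  const pattern = /[\p{L}\p{M}\p{N}]+/gu;
  return Array.from(text.matchAll(pattern), (match) => ({
    token: match[0],
    start: match.index,
    // Exclusive end, measured in UTF-16 code units like String.prototype.slice.
    end: match.index + match[0].length,
  }));
}

const located = tokenizeWithLocationSketch("لعلم بحر لا نهاية له.");
console.log(located);
```

With offsets of this kind, text.slice(start, end) returns the token itself, which is convenient for highlighting or annotating spans in the original string.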
