Tokenization Functions

A set of functions to tokenize a given text into sentences or words, with optional conditions and morphological operations.

Tokenization is a way of separating a piece of text into smaller units called tokens.

- Analytics Vidhya

tokenize

Tokenizes a given text into words, with optional conditions and morphological operations.

Parameters

  • text (String) The input text to be tokenized into words.

  • conditions (Array | Function, optional) An array of functions, each representing a condition that a token must meet to be included in the result. If a single function is provided, it will be wrapped in an array.

  • morphs (Array | Function, optional) An array of functions, each representing a morphological operation to be applied to the tokens. If a single function is provided, it will be wrapped in an array.

Returns

  • Array An array of tokens.

Example

const text = "test ذَهب مُحمد الى السوق";

const tokens = tokenize(text);
const tokenswithparams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens);  // Output: [ 'test', 'ذَهب', 'مُحمد', 'الى', 'السوق' ]
console.log(tokenswithparams);  // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
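
The way conditions and morphs are described above can be illustrated with a small standalone sketch. This is not the library's implementation: `sketchTokenize` is a hypothetical name, the versions of `isArabicRange` and `stripHarakat` defined here are simplified stand-ins, and the order of operations (conditions first, then morphs) is an assumption.

```javascript
// Hypothetical stand-ins; the library's real helpers may behave differently.
// Accepts a token only if every character is in the Arabic Unicode block.
const isArabicRange = (token) => /^[\u0600-\u06FF]+$/.test(token);
// Removes the common short-vowel marks (fathatan through sukun).
const stripHarakat = (token) => token.replace(/[\u064B-\u0652]/g, "");

// Sketch: split on whitespace, keep tokens that pass every condition,
// then apply each morph in order (assumed ordering).
function sketchTokenize(text, conditions = [], morphs = []) {
  // Wrap a single function in an array, as the parameter notes describe.
  const conds = Array.isArray(conditions) ? conditions : [conditions];
  const ops = Array.isArray(morphs) ? morphs : [morphs];

  return text
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .filter((token) => conds.every((cond) => cond(token)))
    .map((token) => ops.reduce((acc, op) => op(acc), token));
}

const sample = "test ذَهب مُحمد الى السوق";
const plain = sketchTokenize(sample);
// Single functions are passed here to exercise the array wrapping.
const filtered = sketchTokenize(sample, isArabicRange, stripHarakat);
console.log(plain);
console.log(filtered);
```

In this sketch the Latin token is dropped by the condition and the remaining tokens lose their diacritics, which mirrors the documented output above.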

sentenceTokenize

Tokenizes a given text into sentences.

Parameters

  • text (String) The input text to be tokenized into sentences.

Returns

  • Array An array of sentences.

Example
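
As a rough illustration of sentence tokenization, the sketch below splits text after sentence-ending punctuation. It is not the library's implementation: `sketchSentenceTokenize` is a hypothetical name, and the set of terminators (`.`, `!`, `?`, and the Arabic question mark `؟`) is an assumption.

```javascript
// Hypothetical helper, not part of the library: split at whitespace that
// follows a sentence terminator, keeping the terminator with its sentence.
function sketchSentenceTokenize(text) {
  return text
    .split(/(?<=[.!?\u061F])\s+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}

const sentences = sketchSentenceTokenize("ذهب محمد الى السوق. هل عاد؟ نعم!");
console.log(sentences);
```

A rule this simple will also split after abbreviations and decimal points, so real sentence tokenizers usually add further checks.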

tokenizeWithLocation

Tokenizes a given text into words and provides the start and end indices of each token in the original text.

Parameters

  • text (String) The input text to be tokenized into words.

Returns

  • Array An array of objects, each containing a token and its start and end indices in the original text.

Example
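
The idea of returning each token with its position can be sketched as follows. This is an illustration only: `sketchTokenizeWithLocation` is a hypothetical name, and the field names (`token`, `start`, `end`) and the exclusive `end` index are assumptions, since the exact shape of the library's result objects is not shown here.

```javascript
// Hypothetical helper, not part of the library. Each result records the
// token plus its start index and exclusive end index in the input string.
function sketchTokenizeWithLocation(text) {
  const results = [];
  const pattern = /\S+/g; // runs of non-whitespace characters
  let match;
  while ((match = pattern.exec(text)) !== null) {
    results.push({
      token: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return results;
}

const locationInput = "ذهب محمد الى السوق";
const located = sketchTokenizeWithLocation(locationInput);
console.log(located);
```

With an exclusive end index, `locationInput.slice(start, end)` recovers each token directly from the original text.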
