🍰Tokenization Functions

A set of functions to tokenize a given text into sentences or words, with optional conditions and morphological operations.

Tokenization is a way of separating a piece of text into smaller units called tokens.

- Analytics Vidhya

tokenize

Tokenizes a given text into words, with optional conditions and morphological operations.

Parameters

  • text (String) The input text to be tokenized into words.

  • conditions (Array | Function, optional) An array of functions, each representing a condition that a token must meet. A single function may be passed instead; it is wrapped in an array.

  • morphs (Array | Function, optional) An array of functions, each representing a morphological operation to apply to the tokens. A single function may be passed instead; it is wrapped in an array.

Returns

  • Array An array of tokens.

Example

const text = "test ذَهب مُحمد الى السوق";

// Without conditions or morphs, all words are returned unchanged.
const tokens = tokenize(text);
// Keep only the Arabic tokens and strip their harakat (diacritics).
const tokensWithParams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens);  // Output: [ 'test', 'ذَهب', 'مُحمد', 'الى', 'السوق' ]
console.log(tokensWithParams);  // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
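
To make the roles of conditions and morphs concrete, here is a minimal sketch of how such a pipeline could work. It is not the library's implementation: simpleTokenize, toArray, isArabicWord and removeDiacritics are hypothetical stand-ins, and both the whitespace splitting and the order (conditions first, then morphs) are assumptions.

```javascript
// Hypothetical helper: accept nothing, a single function, or an array of functions.
const toArray = (value) =>
  value == null ? [] : Array.isArray(value) ? value : [value];

// Hypothetical condition: true when every character is in the Arabic block (U+0600 to U+06FF).
const isArabicWord = (token) => /^[\u0600-\u06FF]+$/.test(token);

// Hypothetical morph: remove harakat (U+064B to U+0652).
const removeDiacritics = (token) => token.replace(/[\u064B-\u0652]/g, "");

// Sketch of a tokenizer: split on whitespace (an assumption), keep tokens that
// satisfy every condition, then apply each morph in order.
function simpleTokenize(text, conditions, morphs) {
  const conds = toArray(conditions);
  const ops = toArray(morphs);
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .filter((token) => conds.every((cond) => cond(token)))
    .map((token) => ops.reduce((current, op) => op(current), token));
}

const sampleText = "test ذَهب مُحمد الى السوق";
console.log(simpleTokenize(sampleText, isArabicWord, removeDiacritics));
// Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
```

Passing a bare function instead of an array works here because toArray wraps it, which mirrors the behaviour described for the conditions and morphs parameters.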

sentenceTokenize

Tokenizes a given text into sentences.

Parameters

  • text (String) The input text to be tokenized into sentences.

Returns

  • Array An array of sentences.

Example

const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
const sentences = sentenceTokenize(text);
console.log(sentences);
// Output: [
//  'لعلم بحر لا نهاية له،',
//  'فمهما حاول طالب العلم الإبحار في شتى العلوم،',
//  'سيُدرك أنّه ما زال على الشاطئ.'
//  ]
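
The output above suggests that the text is split after the Arabic comma as well as after the full stop, with the delimiter kept at the end of each piece. A rough sketch of that behaviour follows. splitSentences is a hypothetical function, and the delimiter set (., !, ?, ؟ and ،) is an assumption, not the library's documented rule.

```javascript
// Hypothetical sketch: split text into pieces that each end with their delimiter.
// Assumed delimiters: '.', '!', '?', Arabic question mark (U+061F), Arabic comma (U+060C).
function splitSentences(text) {
  // A run of non-delimiter characters followed by any trailing delimiters.
  const pieces = text.match(/[^.!?\u061F\u060C]+[.!?\u061F\u060C]*/gu) || [];
  return pieces.map((piece) => piece.trim()).filter((piece) => piece.length > 0);
}

console.log(splitSentences("First part\u060C second part. Third part!"));
// Output: [ 'First part،', 'second part.', 'Third part!' ]
```

The ASCII comma is deliberately left out of the delimiter set in this sketch, so "One, two." stays a single piece.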

tokenizeWithLocation

Tokenizes a given text into words and provides the start and end indices of each token in the original text.

Parameters

  • text (String) The input text to be tokenized into words.

Returns

  • Array An array of objects, each containing a token and its start and end indices in the original text.

Example

const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);
console.log(tokensWithLocation);
// Output: [
//  { token: 'لعلم', start: 0, end: 4 },
//  { token: 'بحر', start: 5, end: 8 },
//  { token: 'لا', start: 9, end: 11 },
//  { token: 'نهاية', start: 12, end: 17 },
//  { token: 'له', start: 18, end: 20 }
//  ]
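
In the example above, end appears to be exclusive, so text.slice(start, end) should give back the token. The sketch below shows one way such offsets could be computed and checked. locateTokens is a hypothetical function, and treating a token as a run of letters, combining marks and digits is an assumption rather than the library's rule.

```javascript
// Hypothetical sketch: find word-like runs and record their UTF-16 offsets.
function locateTokens(text) {
  // Letters, combining marks (e.g. harakat) and digits form a token; punctuation is skipped.
  const wordPattern = /[\p{L}\p{M}\p{N}]+/gu;
  return Array.from(text.matchAll(wordPattern), (match) => ({
    token: match[0],
    start: match.index,
    end: match.index + match[0].length, // exclusive end index
  }));
}

const sample = "لعلم بحر لا نهاية له.";
for (const { token, start, end } of locateTokens(sample)) {
  // Slicing the original text with the offsets recovers the token.
  console.log(token, start, end, sample.slice(start, end) === token);
}
```

Offsets of this kind are useful for highlighting tokens in the original string without re-searching for them.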
