🍰 Tokenization Functions
A set of functions to tokenize a given text into sentences or words, with optional conditions and morphological operations.
tokenize
Tokenizes a given text into words, with optional conditions and morphological operations.
Parameters
- text (String): The input text to be tokenized into words.
- conditions (Array | Function, optional): An array of functions, each representing a condition that a token must meet. If a single function is provided, it will be wrapped in an array.
- morphs (Array | Function, optional): An array of functions, each representing a morphological operation to be applied to the tokens. If a single function is provided, it will be wrapped in an array.
Returns
- Array: An array of tokens.
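To make the roles of conditions and morphs concrete, here is a minimal sketch of one way such a pipeline could behave. It is not the library's implementation: sketchTokenize, isArabicSketch, stripHarakatSketch, and toArray are hypothetical names, and the whitespace split, the Unicode ranges, and the filter-then-transform order are all assumptions.

```javascript
// Hypothetical stand-ins for condition and morph helpers (assumed ranges).
const isArabicSketch = (token) => /^[\u0600-\u06FF]+$/.test(token);
const stripHarakatSketch = (token) => token.replace(/[\u064B-\u0652]/g, "");

// Wrap a single function in an array, as the parameter descriptions state.
const toArray = (value) =>
  value == null ? [] : Array.isArray(value) ? value : [value];

// Assumed behavior: split on whitespace, keep tokens that satisfy every
// condition, then apply each morph in order to the surviving tokens.
function sketchTokenize(text, conditions, morphs) {
  const conds = toArray(conditions);
  const ops = toArray(morphs);
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .filter((token) => conds.every((cond) => cond(token)))
    .map((token) => ops.reduce((acc, op) => op(acc), token));
}

// A single function is accepted in place of an array.
const sketchResult = sketchTokenize(
  "test ذَهب محُمد الى السوق",
  isArabicSketch,
  stripHarakatSketch
);
console.log(sketchResult); // [ 'ذهب', 'محمد', 'الى', 'السوق' ]
```

Under these assumptions, a condition is any function that takes a token and returns a boolean, and a morph is any function that takes a token and returns a transformed string.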
Example
const text = "test ذَهب محُمد الى السوق";
const tokens = tokenize(text);
const tokensWithParams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens);  // Output: [ 'test', 'ذَهب', 'محُمد', 'الى', 'السوق' ]
console.log(tokensWithParams);  // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
sentenceTokenize
Tokenizes a given text into sentences.
Parameters
- text (String): The input text to be tokenized into sentences.
Returns
- Array: An array of sentences.
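The delimiter set used for sentence splitting is not documented here. As a rough illustration only, the sketch below splits after a period, exclamation mark, Arabic comma (U+060C), Arabic semicolon (U+061B), or Arabic question mark (U+061F), keeping the delimiter attached to its sentence. The function name sketchSentenceTokenize and the delimiter list are assumptions, not the library's behavior.

```javascript
// Assumed delimiters: . ! and the Arabic comma, semicolon, and question mark.
const SENTENCE_PIECE = /[^.!\u060C\u061B\u061F]+[.!\u060C\u061B\u061F]*/g;

function sketchSentenceTokenize(text) {
  // Each match is a run of non-delimiters followed by any trailing delimiters.
  const pieces = text.match(SENTENCE_PIECE) || [];
  return pieces.map((piece) => piece.trim()).filter((piece) => piece.length > 0);
}

const sketchSentences = sketchSentenceTokenize(
  "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ."
);
console.log(sketchSentences.length); // 3
```

Keeping the delimiter on each piece means the sentences can be joined with single spaces to recover the original text for input like this.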
Example
const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
const sentences = sentenceTokenize(text);
console.log(sentences);
// Output: [ 'لعلم بحر لا نهاية له،', 'فمهما حاول طالب العلم الإبحار في شتى العلوم،', 'سيُدرك أنّه ما زال على الشاطئ.' ]
tokenizeWithLocation
Tokenizes a given text into words and provides the start and end indices of each token in the original text.
Parameters
- text (String): The input text to be tokenized into words.
Returns
- Array: An array of objects, each containing a token and its start and end indices in the original text. As the example shows, end is exclusive: it is the index just past the token's last character.
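One way to compute such offsets is to scan the text with a global regular expression and read each match's index. The sketch below is illustrative rather than the library's code: sketchTokenizeWithLocation is a hypothetical name, and treating a token as a run of letters, combining marks, and digits is an assumption chosen so that trailing punctuation is excluded.

```javascript
function sketchTokenizeWithLocation(text) {
  const located = [];
  // \p{L} = letters, \p{M} = combining marks (e.g. harakat), \p{N} = digits.
  for (const match of text.matchAll(/[\p{L}\p{M}\p{N}]+/gu)) {
    located.push({
      token: match[0],
      start: match.index,                 // index of the first character
      end: match.index + match[0].length, // exclusive end index
    });
  }
  return located;
}

const locatedText = "لعلم بحر لا نهاية له.";
const locatedTokens = sketchTokenizeWithLocation(locatedText);

// With an exclusive end, slicing the original text recovers each token.
const allMatch = locatedTokens.every(
  ({ token, start, end }) => locatedText.slice(start, end) === token
);
console.log(allMatch); // true
```

The indices here are JavaScript string indices (UTF-16 code units), so they can be passed directly to slice or substring on the original text.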
Example
const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);
console.log(tokensWithLocation);
// Output: [
//  { token: 'لعلم', start: 0, end: 4 },
//  { token: 'بحر', start: 5, end: 8 },
//  { token: 'لا', start: 9, end: 11 },
//  { token: 'نهاية', start: 12, end: 17 },
//  { token: 'له', start: 18, end: 20 }
//  ]