🍰 Tokenization Functions
A set of functions to tokenize a given text into sentences or words, with optional conditions and morphological operations.
Table of Contents
tokenize
sentenceTokenize
tokenizeWithLocation
tokenize
Tokenizes a given text into words, with optional conditions and morphological operations.
Parameters
text
(String) The input text to be tokenized into words.
conditions
(Array | Function, optional) An array of functions, each representing a condition that a token must meet. If a single function is provided, it will be wrapped in an array.
morphs
(Array | Function, optional) An array of functions, each representing a morphological operation to be applied to the tokens. If a single function is provided, it will be wrapped in an array.
Returns
Array
An array of tokens.
Example
const text = "test ذَهب محُمد الى السوق";
const tokens = tokenize(text);
const tokensWithParams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens); // Output: [ 'test', 'ذَهب', 'محُمد', 'الى', 'السوق' ]
console.log(tokensWithParams); // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
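Because conditions and morphs each accept either an array or a single function, custom callbacks can be passed directly. The sketch below uses two illustrative helpers (longerThanTwo and trimToken, not part of the library) to show both calling forms:

// Illustrative custom condition and morph functions.
const longerThanTwo = (token) => token.length > 2; // condition: keep tokens longer than two characters
const trimToken = (token) => token.trim();          // morph: strip surrounding whitespace

// A single function is wrapped in an array internally, so both forms are equivalent:
const filtered = tokenize(text, longerThanTwo, trimToken);
const filteredToo = tokenize(text, [longerThanTwo], [trimToken]);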
sentenceTokenize
Tokenizes a given text into sentences.
Parameters
text
(String) The input text to be tokenized into sentences.
Returns
Array
An array of sentences.
Example
const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
const sentences = sentenceTokenize(text);
console.log(sentences);
// Output: [ 'لعلم بحر لا نهاية له،', 'فمهما حاول طالب العلم الإبحار في شتى العلوم،', 'سيُدرك أنّه ما زال على الشاطئ.' ]
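A common follow-up (shown as a sketch using the two functions documented on this page) is to split a text into sentences first and then tokenize each sentence into words:

const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
// Tokenize each sentence separately; wordsPerSentence[0] holds the word tokens of the first sentence.
const wordsPerSentence = sentenceTokenize(text).map((sentence) => tokenize(sentence));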
tokenizeWithLocation
Tokenizes a given text into words and provides the start and end indices of each token in the original text.
Parameters
text
(String) The input text to be tokenized into words.
Returns
Array
An array of objects, each containing a token and its start and end indices in the original text.
Example
const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);
console.log(tokensWithLocation);
// Output: [
// { token: 'لعلم', start: 0, end: 4 },
// { token: 'بحر', start: 5, end: 8 },
// { token: 'لا', start: 9, end: 11 },
// { token: 'نهاية', start: 12, end: 17 },
// { token: 'له', start: 18, end: 20 }
// ]
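As a quick check of the reported indices (a sketch; it assumes end is exclusive, as the example output above suggests), each token can be recovered from the original text with String.prototype.slice:

const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);

// Each token should match the corresponding slice of the original text.
for (const { token, start, end } of tokensWithLocation) {
  console.log(token === text.slice(start, end)); // true for every token
}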