Tokenization Functions

A set of functions to tokenize a given text into sentences or words, with optional conditions and morphological operations.

Last updated 2 years ago

Tokenization is a way of separating a piece of text into smaller units called tokens.

tokenize

Tokenizes a given text into words, with optional conditions and morphological operations.

Parameters

text (String) The input text to be tokenized into words.
conditions (Array | Function, optional) An array of functions, each representing a condition that a token must meet to be kept. A single function is wrapped in an array automatically.
morphs (Array | Function, optional) An array of functions, each representing a morphological operation applied to the tokens. A single function is wrapped in an array automatically.

Returns

Array An array of tokens.

Example

// Assumes tokenize, isArabicRange, and stripHarakat are already imported from this library.
const text = "test ذَهب مُحمد الى السوق";

const tokens = tokenize(text);
const tokensWithParams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens);  // Output: [ 'test', 'ذَهب', 'مُحمد', 'الى', 'السوق' ]
console.log(tokensWithParams);  // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]
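
The way conditions and morphs combine can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's implementation: tokenizeSketch, isLongerThanTwo, and toUpper are hypothetical names, and the whitespace split and the filter-then-transform order are assumptions.

```javascript
// Hypothetical sketch, not the library's source: shows one way the
// `conditions` and `morphs` parameters could be applied to tokens.
function tokenizeSketch(text, conditions = [], morphs = []) {
  // A single function is wrapped in an array, as the parameter docs describe.
  const conds = Array.isArray(conditions) ? conditions : [conditions];
  const ops = Array.isArray(morphs) ? morphs : [morphs];

  return text
    .split(/\s+/) // assumption: tokens are separated by whitespace
    .filter((token) => token.length > 0)
    // Keep a token only if it meets every condition.
    .filter((token) => conds.every((cond) => cond(token)))
    // Apply each morph in order to the surviving tokens.
    .map((token) => ops.reduce((acc, op) => op(acc), token));
}

// Stand-in helpers for illustration only.
const isLongerThanTwo = (token) => token.length > 2;
const toUpper = (token) => token.toUpperCase();

console.log(tokenizeSketch("a quick brown fox", isLongerThanTwo, [toUpper]));
// [ 'QUICK', 'BROWN', 'FOX' ]
```

Any function that takes a token and returns a boolean can serve as a condition in this sketch, and any function that takes a token and returns a string can serve as a morph.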

sentenceTokenize

Tokenizes a given text into sentences.

Parameters

text (String) The input text to be tokenized into sentences.

Returns

Array An array of sentences.

Example

const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
const sentences = sentenceTokenize(text);
console.log(sentences);
// Output: [
//  'لعلم بحر لا نهاية له،',
//  'فمهما حاول طالب العلم الإبحار في شتى العلوم،',
//  'سيُدرك أنّه ما زال على الشاطئ.'
//  ]
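
The output above suggests that the text is split at punctuation, including the Arabic comma, and that each delimiter stays attached to its sentence. A rough sketch of that behaviour follows. It is not the library's implementation: splitSentencesSketch is a hypothetical name, and the exact delimiter set is an assumption.

```javascript
// Hypothetical sketch, not the library's source. Assumes a sentence ends at
// an Arabic comma (U+060C), a full stop, an Arabic question mark (U+061F),
// or an exclamation mark, and that the delimiter stays with its sentence.
function splitSentencesSketch(text) {
  // A run of non-delimiter characters followed by at most one delimiter.
  const pattern = /[^\u060C.\u061F!]+[\u060C.\u061F!]?/g;
  const parts = text.match(pattern) || [];
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}

const sampleSentences = "First clause\u060C second clause. Third!";
console.log(splitSentencesSketch(sampleSentences));
// [ 'First clause،', 'second clause.', 'Third!' ]
```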

tokenizeWithLocation

Tokenizes a given text into words and provides the start and end indices of each token in the original text.

Parameters

text (String) The input text to be tokenized into words.

Returns

Array An array of objects, each containing a token and its start (inclusive) and end (exclusive) indices in the original text.

Example

const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);
console.log(tokensWithLocation);
// Output: [
//  { token: 'لعلم', start: 0, end: 4 },
//  { token: 'بحر', start: 5, end: 8 },
//  { token: 'لا', start: 9, end: 11 },
//  { token: 'نهاية', start: 12, end: 17 },
//  { token: 'له', start: 18, end: 20 }
//  ]
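
Because end is exclusive in the output above, slicing the original text from start to end gives back the token. The sketch below illustrates that relationship with a self-contained stand-in. It is not the library's implementation: tokenizeWithLocationSketch is a hypothetical name, and treating a token as a run of letters, combining marks, and digits is an assumption.

```javascript
// Hypothetical sketch, not the library's source. Assumes a token is a run of
// Unicode letters, combining marks, or digits, so punctuation is left out.
function tokenizeWithLocationSketch(text) {
  const results = [];
  for (const match of text.matchAll(/[\p{L}\p{M}\p{N}]+/gu)) {
    results.push({
      token: match[0],
      start: match.index, // inclusive
      end: match.index + match[0].length, // exclusive
    });
  }
  return results;
}

const sampleText = "لعلم بحر لا نهاية له.";
for (const { token, start, end } of tokenizeWithLocationSketch(sampleText)) {
  // Slicing the original text with start and end reproduces the token.
  console.log(token, start, end, sampleText.slice(start, end) === token);
}
```

The indices in this sketch count UTF-16 code units, which is what String.prototype.slice expects.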
