# Tokenization Functions

{% hint style="info" %}
Tokenization is a way of separating a piece of text into smaller units called tokens.

Source: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/#:~:text=Tokenization%20is%20a%20way%20of,n%2Dgram%20characters\)%20tokenization.)
{% endhint %}

### Table of Contents

* [tokenize](#tokenize)
* [sentenceTokenize](#sentencetokenize)
* [tokenizeWithLocation](#tokenizewithlocation)

### tokenize

Tokenizes a given text into words, with optional conditions and morphological operations.

**Parameters**

* `text` (String) The input text to be tokenized into words.
* `conditions` (Array | Function, <mark style="color:yellow;">optional</mark>) An array of functions, each representing a condition that a token must meet to be kept. A single function is also accepted and is wrapped in an array.
* `morphs` (Array | Function, <mark style="color:yellow;">optional</mark>) An array of functions, each representing a morphological operation to apply to the tokens. A single function is also accepted and is wrapped in an array.

**Returns**

* `Array` An array of tokens.

**Example**

```javascript
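// Assumes tokenize, isArabicRange, and stripHarakat are already in scope.
const text = "test ذَهب مُحمد الى السوق";

// Without conditions or morphs, every word is returned unchanged.
const tokens = tokenize(text);
// Keep only Arabic tokens and strip their harakat (diacritics).
const tokensWithParams = tokenize(text, [isArabicRange], [stripHarakat]);
console.log(tokens);  // Output: [ 'test', 'ذَهب', 'مُحمد', 'الى', 'السوق' ]
console.log(tokensWithParams);  // Output: [ 'ذهب', 'محمد', 'الى', 'السوق' ]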
```
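To make the roles of `conditions` and `morphs` concrete, here is a minimal sketch of how such a pipeline could be composed. It is not the library's implementation: `sketchTokenize`, `toArray`, `hasArabicLetter`, and `removeHarakat` are hypothetical names, and both the whitespace splitting and the order (filter first, then transform) are assumptions.

```javascript
// Hypothetical sketch, not the library's source.
// Accept undefined, a single function, or an array of functions.
const toArray = (value) =>
  value == null ? [] : Array.isArray(value) ? value : [value];

function sketchTokenize(text, conditions, morphs) {
  const conditionList = toArray(conditions);
  const morphList = toArray(morphs);

  return text
    .split(/\s+/) // assumption: tokens are separated by whitespace
    .filter((token) => token.length > 0)
    // keep a token only if it satisfies every condition
    .filter((token) => conditionList.every((condition) => condition(token)))
    // apply each morph in order to the surviving tokens
    .map((token) => morphList.reduce((current, morph) => morph(current), token));
}

// Illustrative stand-ins, not the library's isArabicRange or stripHarakat.
const hasArabicLetter = (token) => /[\u0621-\u064A]/.test(token);
const removeHarakat = (token) => token.replace(/[\u064B-\u0652]/g, "");

const sample = "test ذَهب مُحمد الى السوق";
const plain = sketchTokenize(sample);
const filtered = sketchTokenize(sample, hasArabicLetter, removeHarakat);
console.log(plain);    // five tokens, including 'test'
console.log(filtered); // four Arabic tokens with diacritics removed
```

Passing a bare function instead of an array works here because `toArray` wraps it, which mirrors the wrapping behavior described in the parameter list.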

### sentenceTokenize

Tokenizes a given text into sentences.

**Parameters**

* `text` (String) The input text to be tokenized into sentences.

**Returns**

* `Array` An array of sentences.

**Example**

```javascript
const text = "لعلم بحر لا نهاية له، فمهما حاول طالب العلم الإبحار في شتى العلوم، سيُدرك أنّه ما زال على الشاطئ.";
const sentences = sentenceTokenize(text);
console.log(sentences);
// Output: [  'لعلم بحر لا نهاية له،',  'فمهما حاول طالب العلم الإبحار في شتى العلوم،',  'سيُدرك أنّه ما زال على الشاطئ.'  ]
```
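The output above suggests that sentences are cut after punctuation marks, including the Arabic comma, and that each mark stays attached to its sentence. A rough sketch of that idea follows; `sketchSentenceTokenize` is a hypothetical name, and the delimiter set is an assumption inferred from the example rather than taken from the library.

```javascript
// Hypothetical sketch, not the library's source.
function sketchSentenceTokenize(text) {
  // Assumed delimiters: . ! ? plus Arabic comma (\u060C),
  // Arabic semicolon (\u061B), and Arabic question mark (\u061F).
  const pattern = /[^.!?\u060C\u061B\u061F]+[.!?\u060C\u061B\u061F]*/g;
  const pieces = text.match(pattern) || [];
  // Trim surrounding whitespace and drop empty pieces.
  return pieces.map((piece) => piece.trim()).filter((piece) => piece.length > 0);
}

const sampleSentences = sketchSentenceTokenize("One. Two! Three?");
console.log(sampleSentences); // [ 'One.', 'Two!', 'Three?' ]
```

Whether a comma should end a sentence is a design choice; drop `\u060C` from both character classes if only full stops and question marks should split the text.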

### tokenizeWithLocation

Tokenizes a given text into words and provides the start and end indices of each token in the original text.

**Parameters**

* `text` (String) The input text to be tokenized into words.

**Returns**

* `Array` An array of objects, each containing a token and its start and end indices in the original text.

**Example**

```javascript
const text = "لعلم بحر لا نهاية له.";
const tokensWithLocation = tokenizeWithLocation(text);
console.log(tokensWithLocation);
// Output: [
//  { token: 'لعلم', start: 0, end: 4 },
//  { token: 'بحر', start: 5, end: 8 },
//  { token: 'لا', start: 9, end: 11 },
//  { token: 'نهاية', start: 12, end: 17 },
//  { token: 'له', start: 18, end: 20 }
//  ]
```
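In the example output, the four-character first token has `start: 0` and `end: 4`, so `end` appears to be exclusive, which means `text.slice(start, end)` recovers the token. The sketch below shows one way such offsets could be produced with `String.prototype.matchAll`. `sketchTokenizeWithLocation` is a hypothetical name, and the token pattern (runs of letters, combining marks, and digits) is an assumption, not the library's rule.

```javascript
// Hypothetical sketch, not the library's source.
function sketchTokenizeWithLocation(text) {
  // Assumption: a token is a run of letters, combining marks, or digits.
  const pattern = /[\p{L}\p{M}\p{N}]+/gu;
  return Array.from(text.matchAll(pattern), (match) => ({
    token: match[0],
    start: match.index,                 // index of the first character
    end: match.index + match[0].length, // exclusive end index
  }));
}

const sampleText = "لعلم بحر لا نهاية له.";
const located = sketchTokenizeWithLocation(sampleText);

// Slicing the original text with each pair of offsets gives back the token.
for (const { token, start, end } of located) {
  console.log(token, start, end, sampleText.slice(start, end) === token);
}
```

The offsets count UTF-16 code units, which is what JavaScript string indexing uses, so they can be passed directly to `slice` or `substring`.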

