Search Arabic text with diacritics using SQLite tokenizer
SQLite is a popular database engine that is widely used in mobile apps, particularly on Android and iOS, and it is also available on macOS, Windows, and Linux.
How does FTS work?
FTS (Full-Text Search) is a feature in SQLite that enables users to conduct text-based searches on a database. It operates by dividing the text in a virtual FTS table into tokens and then comparing the tokens of a query with the stored tokens to return relevant results.
To illustrate: when text is inserted into a virtual FTS table, it is tokenized. When a search is performed, the query is tokenized the same way and compared against the stored tokens. For example, if we insert the text “SQLite is the best”, FTS creates the tokens [SQLite, is, the, best]; searching for the word “is” then returns “SQLite is the best” along with any other rows containing the token “is.”
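Assuming an SQLite build with FTS5 enabled (true of most recent Python installs), the example above can be reproduced in a few lines of Python. The `docs` table name is our own, not from the article:

```python
import sqlite3

# Minimal FTS5 demo: insert two rows, then search for the token "is".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany("INSERT INTO docs(body) VALUES (?)",
                 [("SQLite is the best",), ("tokenizers split text",)])
# The query string is tokenized and compared against the stored tokens.
rows = conn.execute("SELECT body FROM docs WHERE docs MATCH 'is'").fetchall()
print(rows)  # only the row whose tokens include "is"
```

Only the first row comes back, because only it produced an “is” token at insertion time.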
Our problem:
At GTAF, we were facing a problem with searching Arabic text. Arabic can be read or written with or without diacritics, and the team needed a way to search texts either way. For those unfamiliar with diacritics, here is an example:
Arabic text with diacritics, ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ (Quran 1:2)
Arabic text without diacritics, الحمد لله رب العلمين (Quran 1:2)
User Story:
As a user, I want to search with or without tashkeel (diacritics) and get back every sentence in our database that contains the word, with its tashkeel preserved in the result.
e.g:
Searching with كتب or كَتَبَ
Result: كَتَبَ خَالِدٌ عَليٰ السَّبُّوْرَةِ،
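Python’s sqlite3 module cannot register a custom FTS5 tokenizer, but the behavior this user story asks for can be sketched by stripping tashkeel by hand before indexing and before querying — a stand-in for what our tokenizer does inside SQLite. The table name, column names, and the diacritic code-point set below are illustrative, not taken from our implementation:

```python
import sqlite3

# Illustrative subset of tashkeel code points: fathatan..sukun (U+064B-
# U+0652) plus the superscript alef (U+0670). The real tokenizer may
# cover more Quranic annotation marks.
TASHKEEL = set(range(0x064B, 0x0653)) | {0x0670}

def strip_tashkeel(text):
    return "".join(ch for ch in text if ord(ch) not in TASHKEEL)

conn = sqlite3.connect(":memory:")
# Index the stripped form; keep the original (with tashkeel) for display.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(plain, original UNINDEXED)")
sentence = ("\u0643\u064e\u062a\u064e\u0628\u064e "
            "\u062e\u064e\u0627\u0644\u0650\u062f\u064c")  # كَتَبَ خَالِدٌ
conn.execute("INSERT INTO docs VALUES (?, ?)",
             (strip_tashkeel(sentence), sentence))

results = []
for query in ("\u0643\u062a\u0628",                      # كتب (no tashkeel)
              "\u0643\u064e\u062a\u064e\u0628\u064e"):   # كَتَبَ (with tashkeel)
    rows = conn.execute("SELECT original FROM docs WHERE docs MATCH ?",
                        (strip_tashkeel(query),)).fetchall()
    results.append(rows)
print(results)  # both query variants return the sentence with tashkeel intact
```

Both spellings of the query normalize to the same token, so both find the fully vocalized sentence.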
What is Tokenizer?
To solve this problem, the team decided to develop a custom tokenizer for use with the FTS5 (Full-Text Search) feature of SQLite.
When creating a virtual table, we can specify which tokenizer to use:
CREATE VIRTUAL TABLE table_name USING fts5(title, subtitle, text, tokenize='tokenizer_name');
By writing a custom tokenizer, we control how tokens are created. A tokenizer is a function that is called when data is inserted into, deleted from, or queried in a virtual FTS table. It is responsible for creating (or deleting) tokens, and for comparing a query’s tokens with the stored tokens to return the appropriate results. In our case, the team needed a custom tokenizer that detects diacritics in a given text and removes them.
To create this custom tokenizer, the team had to handle Unicode text in C, which has no native Unicode support. FTS5 passes the tokenizer UTF-8 text, in which each Arabic letter and diacritic occupies two bytes, so they wrote a function to decode the Unicode code points from the raw bytes. The tokenizer’s entry point, xTokenize, receives *pText, the text to tokenize, and nText, the length of that text in bytes — required because the text is not guaranteed to be null-terminated. xTokenize splits the text into words and tokenizes each one using a separate function called remove_diacritic.
The remove_diacritic function checks each word for diacritic code points and removes any it finds; the resulting token is then emitted through the xToken() callback. The same tokenizer runs when a query hits the database, so the query’s tokens are compared against the tokens saved at insertion time. Because both sides are normalized the same way, the search returns the same results whether or not the text contains diacritics.
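As a rough Python sketch of this pipeline — not the actual C implementation, where the emit step is the xToken() callback and the diacritic code-point set is more complete:

```python
# Illustrative subset of diacritic code points: fathatan..sukun
# (U+064B-U+0652) plus the superscript alef (U+0670).
TASHKEEL = set(range(0x064B, 0x0653)) | {0x0670}

def remove_diacritic(word):
    # Drop every code point that is a diacritic mark.
    return "".join(ch for ch in word if ord(ch) not in TASHKEEL)

def tokenize(text):
    # Split into words, normalize each one; in the real tokenizer each
    # normalized word would be handed to xToken().
    return [remove_diacritic(w) for w in text.split()]

with_marks = "\u0643\u064e\u062a\u064e\u0628\u064e \u0639\u064e\u0644\u064e\u0649"  # كَتَبَ عَلَى
without    = "\u0643\u062a\u0628 \u0639\u0644\u0649"                                # كتب على
print(tokenize(with_marks) == tokenize(without))  # True: identical token lists
```

Since insertion and querying both pass through the same tokenize step, vocalized and unvocalized text always produce identical tokens.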
For a clearer picture, check out the actual code; you can find it here.
Overall, the use of a custom tokenizer with SQLite’s FTS5 feature allowed the GTAF team to solve the problem of searching Arabic texts with or without diacritics. As a bonus, our database size decreased by a few MBs, since we no longer need to ship a separate non-diacritic version of the database. You can find more information about this process and the code used in the linked article. In a future article, we will discuss how to create and use an FTS table with this custom tokenizer on various platforms.
Android Example: https://github.com/GreentechApps/android-sqlite3-extension-loading-example
Flutter Example: https://github.com/GreentechApps/tokenizer_flutter_demo
Resources:
https://github.com/simonw/sqlite-fts5-trigram
https://github.com/abiliojr/fts5-snowball
https://github.com/nalgeon/sqlean/
https://stackoverflow.com/questions/10017328/unicode-stored-in-c-char
This article has been written by me, with help from Nabil Mosharraf Hossain and the Greentech Apps Foundation team.