Automatic Transliteration from Arabic to German conforming to the standard of the "Deutsche Morgenländische Gesellschaft"

25 Apr 2024 - 03:12 | Version 10 | rh0457fu

worked on by: Rashid Harvey

Outline

The Arabic alphabet is fundamentally different from the Latin alphabet, so people have developed transliteration and transcription systems to represent the Arabic writing system with Latin letters. There are several use cases:

1. The user cannot write Arabic, for example because they do not know how to or because they do not have an Arabic keyboard.

2. The user understands Arabic and can read and write it. However, they want to represent it in a way that people who cannot read Arabic script can still read.

The DMG standard (DIN 31635) emerged in Germany in 1935 and has inspired many other popular Latin transliteration systems. However, it is mainly used by scholars studying the Arabic language.

The standard mostly serves the second use case, because it is almost as complicated to write as Arabic itself: most letters are transliterated one-to-one, which requires many special characters that do not appear in everyday writing or on a regular keyboard. In that sense the standard is quite outdated, as it was conceived when most writing was done with pen and paper, but it is still widely taught and used in German academia. Transliterating Arabic by hand remains difficult and tedious, and I was told that an automatic transliteration tool would be helpful and does not yet exist for DMG. After some consideration, I started working on one. Once I felt confident that I could achieve a good result, I registered the project as my bachelor's thesis.

A first prototype is currently available at https://transliteration.eu.pythonanywhere.com/

Thesis Requirements

Firstly, the thesis should define what a valid transliteration is, depending on a fixed set of preferences. It might seem counter-intuitive to solve a problem that you define yourself, especially since there is an official public standard (https://www.aai.uni-hamburg.de/voror/medien/dmg.pdf). In practice, however, the rules are often adapted, and different people and organizations have developed different variants. The official standard is almost 90 years old, and a lot has changed in the meantime. It is therefore necessary to determine and specify what exactly constitutes a correct (and maybe even preferred) transliteration.

Then the thesis should contain an algorithm that takes a fully vocalized Arabic text and returns a corresponding transliteration. The correct transliteration of Arabic names is excluded from this requirement.
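As a rough illustration of what such an algorithm has to do, the following minimal Python sketch maps consonants one-to-one and derives short and long vowels from the diacritics. It is not the thesis implementation: the function name `transliterate` and the table names are assumptions made for this example, and hamza, word-initial alif, the definite article, tanwin, ta marbutah, diphthong preferences and names are deliberately left out.

```python
# Minimal sketch (assumed names, not the thesis code): fully vocalized Arabic -> DMG.
FATHA, DAMMA, KASRA, SHADDA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0651", "\u0652"
ALIF, WAW, YA = "\u0627", "\u0648", "\u064A"
MARKS = {FATHA, DAMMA, KASRA, SHADDA, SUKUN}

# One-to-one consonant table of the DMG romanization.
CONSONANTS = {
    "ب": "b", "ت": "t", "ث": "ṯ", "ج": "ǧ", "ح": "ḥ", "خ": "ḫ",
    "د": "d", "ذ": "ḏ", "ر": "r", "ز": "z", "س": "s", "ش": "š",
    "ص": "ṣ", "ض": "ḍ", "ط": "ṭ", "ظ": "ẓ", "ع": "ʿ", "غ": "ġ",
    "ف": "f", "ق": "q", "ك": "k", "ل": "l", "م": "m", "ن": "n",
    "ه": "h", "و": "w", "ي": "y",
}
SHORT = {FATHA: "a", DAMMA: "u", KASRA: "i"}
# A short vowel followed by its bare carrier letter is read as a long vowel.
LONG = {FATHA: (ALIF, "ā"), DAMMA: (WAW, "ū"), KASRA: (YA, "ī")}


def transliterate(text: str) -> str:
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        i += 1
        if ch not in CONSONANTS:
            out.append(ch)  # spaces, punctuation and unsupported characters pass through
            continue
        # Collect the diacritics attached to this consonant (in any order).
        marks = ""
        while i < n and text[i] in MARKS:
            marks += text[i]
            i += 1
        latin = CONSONANTS[ch]
        out.append(latin * 2 if SHADDA in marks else latin)  # shadda doubles the consonant
        vowel = next((m for m in marks if m in SHORT), None)
        if vowel is None:
            continue  # sukun or no vowel mark: consonant only
        carrier, long_vowel = LONG[vowel]
        carrier_is_bare = i + 1 >= n or text[i + 1] not in MARKS
        if i < n and text[i] == carrier and carrier_is_bare:
            out.append(long_vowel)
            i += 1  # the carrier letter is consumed by the long vowel
        else:
            out.append(SHORT[vowel])
    return "".join(out)


print(transliterate("كِتَاب"))  # kitāb
print(transliterate("كَتَبَ"))  # kataba
```

Even this toy version shows why full vocalization is required as input: without the diacritics, the vowels and the consonant doubling cannot be recovered from the letters alone.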

Furthermore, the thesis should contain the design of a usable application for the specific use case of a researcher who regularly uses the app in their work. All the core aspects of usability should be considered here: the application should be easy to learn, efficient, error-resistant and satisfying.

A final evaluation is essential. It should estimate the accuracy of the transliteration algorithm and the usability of the application. A necessary part of this is a user test. The evaluation should also give an outlook into the future: What is still missing and why? What could be improved and how? How could the algorithm be extended to serve more purposes?
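The accuracy metric is not fixed yet. One common option, sketched here under that assumption, is the character error rate: the edit distance between the system output and a reference transliteration, divided by the length of the reference. The function names and the two sample pairs below are purely illustrative.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs) -> float:
    """pairs: iterable of (system_output, reference) strings."""
    # NFC normalization so that precomposed and combining forms compare as equal.
    pairs = [(unicodedata.normalize("NFC", out), unicodedata.normalize("NFC", ref))
             for out, ref in pairs]
    edits = sum(levenshtein(out, ref) for out, ref in pairs)
    reference_length = sum(len(ref) for _, ref in pairs)
    return edits / reference_length if reference_length else 0.0


# Illustrative pairs: one missing macron in 11 reference characters.
sample = [("kitab", "kitāb"), ("kataba", "kataba")]
print(round(character_error_rate(sample), 3))
```

A character-level metric fits this task because many errors are a single wrong or missing diacritic, which a word-level accuracy would count as a completely wrong word.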

Optionally, depending on how much time is left, there will be an implementation that mainly relies on AI. In a few tests, I have found that both ChatGPT and GitHub Copilot can very easily pick up on how the transliteration works if sufficient examples are provided. Pre-trained LLMs could be an easy win for transliteration, as they are flexible and have a built-in sense of semantics. However, hallucinations and their nondeterministic nature might pose a challenge. Other approaches using AI should be considered as well. This section should also have a separate evaluation that includes a comparison to the non-AI algorithm and prospects for further development.
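Providing "sufficient examples" usually means few-shot prompting. The sketch below only builds such a prompt as a string and does not call any model; the wording of the instruction, the function name `build_few_shot_prompt` and the example pairs are assumptions for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Build a few-shot prompt from (arabic, dmg) example pairs and a new input."""
    lines = [
        "Transliterate the vocalized Arabic text into Latin script "
        "following the DMG standard.",
        "",
    ]
    for arabic, dmg in examples:
        lines += [f"Arabic: {arabic}", f"DMG: {dmg}", ""]
    # The prompt ends with an open slot that the model is expected to complete.
    lines += [f"Arabic: {query}", "DMG:"]
    return "\n".join(lines)


examples = [("كِتَاب", "kitāb"), ("مُعَلِّم", "muʿallim")]
print(build_few_shot_prompt(examples, "كَتَبَ"))
```

Keeping the prompt construction separate from the model call makes it possible to send the same prompt to different LLMs and compare them against the rule-based algorithm.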

Planning

The next steps are the following:

- Further improvements, bug fixes, etc. in correspondence with an expert

- Working on the usability of the application, like enabling different input methods and explaining the UI.

- Literature review (Finding and reading related works)

- Gathering necessary testing and training data (contacting libraries and universities, etc.)

- Testing different approaches to certain problems like NER, vocalization and prefixing.

- Once the application is more mature, iterative synchronous and asynchronous user tests; in person as soon as possible (when I am in Berlin)

Reception so far

Anyway, it's very much an open topic in research. You could probably come up with a solution that would be of interest to people, even if it weren't flawless.

The drudgery of manual romanization is something that scholars in fields like Arabistik and Islamwissenschaft would love to put behind them.
-- Dr. Theodore S. Beers

I just tried out your software and am really impressed.
It already works very well.
-- Doğa Akpınar

Many thanks for your email and for pointing me to your exciting work.
-- Dr. Till Grallert

the project sounds very exciting

Otherwise, I think it is entirely legitimate for a BA thesis to require vocalization in the input.

All in all, this is a nice BA project, and I would very much like to encourage you not to be discouraged by the language or by the poor state of research and software.
-- Dr. Jonas Müller-Laackman

thank you very much for your message. Your bachelor's thesis sounds like an exciting project!
-- Dr. Victoria Mummelthei

Your inquiry is very interesting to us
-- Dr. Ruben Schenzle

Weekly Status

Week 1 (CW 13)

Activities

Writing this Wiki
Working on the UI and some bug fixes
Reading a few papers from ArabicNLP2023 to get a feeling for the type of topics that are accepted, which is mostly AI:
- Summarization, Translation (especially of dialects) and RAG/QA and LLMs in general
Looking into the Qalamos-Project for potential uses for this project
Correspondence with Dr. Theodore Beers

Results

I believe it might be difficult to submit a paper to ArabicNLP2024, but worth a shot. Drafts are explicitly allowed. Possible paper sizes are 2, 4, and 8 pages. So 2 pages might be a good size.
Qalamos transliterates its titles and author names. These are likely to be very valuable data.

Theo caught a nice bug. He also referred me to Jonas Müller-Laackman, whom I have already contacted, and to Till Grallert. He also told me that "Most researchers in English-speaking countries are now using the IJMES standard (and libraries use ALA-LC)." The "now" is particularly interesting. For ALA-LC, a solution is already online (https://transliterate.arabicalphabet.net/). It explicitly states how the tool works, and I have already taken some inspiration from it by using Mishkal. Theo rated Mishkal well: "It actually does a decent job most of the time." He ended the email with: "The thing is, the problem space is almost unbounded. You could keep improving the program to account for more quirks of Arabic orthography, but there would always be the possibility that it would choke on some valid input that you haven't tested before. I think this will turn out to be a labor of love, if you stick with it. For a BA thesis project, you've done a lot already." This is interesting, as I have only just started writing my bachelor's thesis and still plan to improve the tool week by week.

Next Steps

I will integrate a virtual Arabic keyboard to make it more accessible to people without an Arabic keyboard
I will fix a few more bugs like the one that Theo caught
I will revisit Named Entity Recognition. As it stands, it is too slow to use in practice, and it does not even attempt to fulfil the specification.
I will contact Qalamos to acquire easy API access to their data. Otherwise, I will write a scraper to gather the necessary data from their public website
I will dig even deeper into how the programming libraries I am using work, even though I have already dug quite deep (no deliverables)
Apart from digging deeper, I will also try to recreate some of the code, as I have already done with some of the data. The goal is to be able to improve on these libraries, as they are no longer being developed and their quality is very weak. I might even fork the projects on GitHub and use my forks instead of the publicly available versions
Lastly, if the time suffices, I will work on writing more tests. These are important to catch regressions, and I have been a bit sloppy with them too often in the past.

Problems

No real problems

Week 2 (CW 14) 01.04.2024 – 07.04.2024

Activities

Emails
UI

Results

The UI now has a virtual keyboard
I got a lot more feedback and a bug report
Also, a newer, more modern specification document from the University of Bamberg: Translit.pdf

Next Steps

Most of the things from before
Reworking the feedback process
Adding examples to the settings
Developing the tests jointly with a specification

Problems

Unfortunately, I didn't have a lot of resources to invest this week, partially due to a paper, a lab report and a trip
Also, adding a virtual keyboard was much more difficult than I had estimated.

Week 3 (CW 15) 08.04. - 15.04.

Activities

Emails
UI
Bug fixing
Writing tests
Gathering data

Results

Found a lot more libraries for POS-tagging and lemmatization on GitHub
Even better keyboard
One-click feedback
Added illustrative examples to settings
More encompassing tests using more real inputs
Some more data and models for NER (1, 2, 3, 4)
More project referrals:
- CtG (Closing the Gap in non-Latin script data): Possibly more transliteration data
- Rule-based IJMES to Arabic reverse transliteration in XSLT and in Python using a transitional representation called BetaCode
- How could this be useful? Maybe something can be learned from the algorithms or from how they were implemented. The same applies to any test data, etc.
- Also, if it is easy to make a DMG to IJMES conversion (which is unlikely), then these algorithms could be used to validate more automatically only using random Arabic text (Arabic → DMG → IJMES → Arabic)
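The round-trip idea (Arabic → DMG → IJMES → Arabic) can be sketched as a small harness that chains the conversion steps and reports every input that does not come back unchanged. The three `toy_*` converters below are stand-ins that cover only three letters; they are assumptions for the demonstration and not the real DMG or IJMES tools.

```python
from functools import reduce


def round_trip_failures(texts, steps):
    """Apply the steps in order and return (input, result) pairs that differ."""
    failures = []
    for text in texts:
        result = reduce(lambda value, step: step(value), steps, text)
        if result != text:
            failures.append((text, result))
    return failures


# Toy stand-ins for the real converters (three letters only, illustration).
def toy_arabic_to_dmg(text):
    return text.replace("ش", "š").replace("س", "s").replace("ه", "h")


def toy_dmg_to_ijmes(text):
    return text.replace("š", "sh")  # IJMES writes this letter as a digraph


def toy_ijmes_to_arabic(text):
    return text.replace("sh", "ش").replace("s", "س").replace("h", "ه")


steps = [toy_arabic_to_dmg, toy_dmg_to_ijmes, toy_ijmes_to_arabic]
failures = round_trip_failures(["ش", "سه"], steps)
print(failures)  # the second input collapses into the digraph and is reported
```

The toy run also shows a limitation of this kind of validation: a failure can come from any step in the chain, here from the ambiguous digraph in the intermediate representation rather than from the DMG step.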

Next Steps

NER
Vocalization
Writing the paper for ArabicNLP (deadline 05.2024). I believe the most interesting topic for the conference would be a study of different transformers and LLMs for transliteration. Google Colab is probably a good platform for developing this

Problems

It is hard for me to determine the optimal solution for specific problems. Either the solution is rule-based and therefore generally of poor quality, or it uses a DNN (RNN, GRU, LSTM, Transformer), which is too resource-intensive. The most valuable resource is time, especially response time: the complete algorithm should run in a few milliseconds, and DNNs generally cannot provide this kind of efficiency.
The other problem is qualitative assessment: even among the rule-based approaches, or among the DNN-based ones, it is difficult for me to determine which algorithms are the best, most modern and most general. I also find it difficult to determine how to improve the quality further. Is it just a matter of data, or do I need to adapt the rules to optimize them for this specific use case?
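Whether a candidate component fits into a budget of a few milliseconds can be checked empirically before committing to it. The following sketch measures the median and 95th-percentile latency of any callable; `latency_ms` and the string-reversing stand-in are hypothetical names used only for this example.

```python
import statistics
import time


def latency_ms(func, sample, repeats=200):
    """Call func(sample) repeatedly and return latency statistics in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return {
        "median": statistics.median(timings),
        "p95": timings[int(0.95 * (len(timings) - 1))],
    }


# Stand-in for a real pipeline step (e.g. NER or vocalization).
def stand_in_step(text):
    return text[::-1]


stats = latency_ms(stand_in_step, "some input text " * 50)
print(f"median {stats['median']:.4f} ms, p95 {stats['p95']:.4f} ms")
```

Measuring each step separately (NER, vocalization, prefixing, rule application) shows which one dominates the response time and therefore where a heavier model would be affordable.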

Week 4 (CW 16) 16.04. - 22.04.

Activities

Meetings
Work on transliteration

Results

Resolved a few doubts and questions with Professor Prechelt
Had a meeting with my Arabic teacher and got a lot of feedback
Tried to get an appointment with the Coranica project, which has a Quranic transliteration
Better prefixing
Custom NER detection
Better hamzatul wasl and ta marbutah handling