Note: the Vietnamese version of this article is available at the link below.

https://duongnt.com/restore-vietnamese-diacritics-vie

Even though the Vietnamese alphabet is based on the Latin script, it has some extra diacritics. For example, aside from a, we also have ă, â, à, ả, ã, á, ạ, and so on. These diacritics produce an accurate representation of the spoken language: unlike in other languages such as English, if you can spell a word in Vietnamese, then you know how to pronounce it.

Unfortunately, sometimes due to technical limitations, we have to type Vietnamese without diacritics. While it is possible to read Vietnamese without those diacritics and still more or less understand the meaning, it can lead to some amusing misunderstandings. A running joke in Vietnamese is that vợ đẻ (wife gives birth) and vỡ đê (the dike breaks) are both vo de without diacritics.

Today, we will use a Transformer model to restore diacritics for Vietnamese texts. Our approach will take inspiration from how machine translation works.

You can download all the sample code from the link below.

https://github.com/duongntbk/restore_vietnamese_diacritics

And you can download the model and the saved vectorization files from this link.

https://drive.google.com/drive/folders/1duBcp3YTsKeYz8xQThBsDjEUx3zRoLS3

Use a Transformer model to restore Vietnamese diacritics

The Transformer model

Traditionally, RNN models such as LSTM and GRU dominated the NLP landscape. But since its introduction in 2017, the Transformer has become the go-to choice for many NLP problems, one of which is machine translation. In fact, it is the same technology behind Google Translate.

Below is the architecture of an end-to-end Transformer model. This picture is taken from page 359 of the book Deep Learning with Python by Francois Chollet.

Transformer model

As we can see from the diagram, a Transformer model consists of an encoder and a decoder. They work together to map texts from a source to a target.

Transformer model and machine translation

The role of the encoder is to turn the source text into a set of vectors that form an encoded representation of the input. Crucially, the encoder keeps this representation in sequence format, so each vector is aware of the context around it.

The role of the decoder is to generate the output. Given the first N tokens of the output, it uses those tokens and the encoded representation of the input to predict the (N+1)th output token. While doing this, it can identify which token in the input is most closely related to the token it is currently predicting, which helps it make full use of the input's context. You might wonder how we get the first output token to kick-start the entire process: the answer is that we use a dummy token as a seed in the first step.

Assuming that we have padded all target texts with the seed word [Start] and the stop word [End] during training, below is how we use the Transformer model to translate I have a dream into Tôi có một giấc mơ (a code sketch of this loop follows the list).

  • Run I have a dream through the encoder to retrieve an encoded representation.
  • Run the encoded representation and the seed word [Start] through the decoder to (hopefully) get the first word Tôi.
  • Append Tôi to the seed word to get [Start] Tôi.
  • Run the encoded representation and the current output [Start] Tôi through the decoder again to get the next word có.
  • Repeat the two steps above until we meet the stop word [End]. If everything goes as planned, the output should be [Start] Tôi có một giấc mơ [End].
  • Strip the seed word and the stop word to retrieve the final result: Tôi có một giấc mơ.
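Below is a minimal sketch of this decoding loop in Keras, assuming a trained model transformer_model, the two TextVectorization layers used to prepare the data, and a fixed maximum output length. The function and variable names are illustrative, not the exact code from the repository.

import numpy as np

def greedy_decode(transformer_model, source_vectorization, target_vectorization,
                  input_sentence, max_len=20):
    # Map token indices back to words so we can read the prediction.
    vocab = target_vectorization.get_vocabulary()
    index_to_word = dict(enumerate(vocab))

    tokenized_input = source_vectorization([input_sentence])
    decoded = "[start]"  # the seed word; vectorization typically lowercases tokens
    for i in range(max_len):
        tokenized_target = target_vectorization([decoded])[:, :-1]
        predictions = transformer_model([tokenized_input, tokenized_target])
        # Greedily pick the most likely token at the current position.
        next_word = index_to_word[int(np.argmax(predictions[0, i, :]))]
        decoded += " " + next_word
        if next_word == "[end]":  # the stop word
            break
    return decoded.replace("[start]", "").replace("[end]", "").strip()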

What does machine translation have to do with our problem?

After reading about machine translation with Transformer, I had this crazy idea. Maybe we can consider Vietnamese without diacritics as a whole new language and try to translate it back to proper Vietnamese. Compared to a normal machine translation problem, we have some big advantages.

  • Given a Vietnamese text, it’s dead easy to convert it to the diacritics-less version (see the sketch after this list). Because of that, we can skip the proofreading step when creating a training dataset.
  • A normal Vietnamese text and its diacritics-less version always have the same number of words, so we always know how long the output should be. We don’t need a stop word to train our model; a seed word alone is enough.
  • A diacritics-less word can only be mapped to a limited number of Vietnamese words. For example, ma can only be mapped to ma/mà/mả/mã/má/mạ. Hopefully, our model can learn this limitation from the data.
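As a rough illustration of the first point, stripping diacritics only takes a few lines of Python: decompose the string with Unicode normalization, drop the combining marks, and handle đ as a special case. This is a sketch for illustration, not the code used by VietnameseCrawler (which is a C# application).

import unicodedata

def strip_diacritics(text):
    # đ/Đ are standalone letters, not base letters with combining marks,
    # so they have to be replaced explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # Decompose each character, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_diacritics("vợ đẻ"))  # vo de
print(strip_diacritics("vỡ đê"))  # vo de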

Based on that, I believe that translating from diacritics-less Vietnamese to proper Vietnamese can achieve much higher accuracy than normal machine translation.

Prepare a training dataset

We will use the Old Newspapers dataset from Kaggle in today’s article. It has around 16 million sentences, but only 720,000 or so are in Vietnamese. The dataset is stored as a TSV file and is nearly 6 GB. I wrote a simple console application called VietnameseCrawler to read the whole corpus, extract every Vietnamese sentence, and export each one along with its diacritics-less version to a new file.

We can run our application from the command line.

dotnet run <path to input file> <path to output file>

Below are some examples from the output file.

cac nhac sy do deu dang duoc vinh danh ca -> các nhạc sỹ đó đều đáng được vinh danh cả
cai ma ngay nay cho nao cung thieu -> cái mà ngày nay chỗ nào cũng thiếu
gia ve dao dong tu 78 den 158 usd -> giá vé dao động từ 78 đến 158 usd
khong cho phep tre em duoi 4 tuoi -> không cho phép trẻ em dưới 4 tuổi
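For reference, a rough Python equivalent of the extraction step could look like the sketch below, reusing strip_diacritics from earlier. It assumes the TSV has the language label in its first column and the sentence text in its last column, and writes tab-separated pairs; check the actual column layout of the Kaggle file and the format expected by the training code before relying on it.

def extract_vietnamese(input_path, output_path):
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2 or fields[0] != "Vietnamese":
                continue  # skip rows in other languages
            sentence = fields[-1].lower()
            dst.write(f"{strip_diacritics(sentence)}\t{sentence}\n")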

The first training attempt

Preprocessing

As usual, the code to train models is written in Python, using the Keras framework. First, we need to load data from the text file and split it into a training set, a validation set, and a test set.

from data_loader import load_data

file_path = 'dataset/old-newspaper-vietnamese.txt'
train_pairs, val_pairs, test_pairs = load_data(file_path, limit=10000)

Note that we only load the first 10,000 sentences from the text file. This is because we want to run some experiments before committing to training on the full corpus.
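The load_data helper lives in the repository and is not reproduced in this article. As a sketch, it might look something like the snippet below, assuming one tab-separated sentence pair per line, a [start] seed word prepended to each target, and a roughly 70/15/15 split (the separator, seed token, and split ratios are all assumptions).

import random

def load_data(file_path, limit=None, seed=1337):
    pairs = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            source, target = line.rstrip("\n").split("\t")
            # Prepend the seed word expected by the decoder.
            pairs.append((source, f"[start] {target}"))
            if limit is not None and len(pairs) >= limit:
                break

    random.Random(seed).shuffle(pairs)
    num_val = int(0.15 * len(pairs))
    num_train = len(pairs) - 2 * num_val
    return (pairs[:num_train],
            pairs[num_train:num_train + num_val],
            pairs[num_train + num_val:])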

We create a source vectorization and a target vectorization from the training dataset. These vectorization objects can be used to convert plain text data into tensors, so that we can train a deep learning model on them. Remember to save those vectorization objects to disk.

from data_loader import create_vectorizations, save_vectorization

source_vectorization, target_vectorization = create_vectorizations(train_pairs)
save_vectorization(source_vectorization, 'result/source_vectorization_layer.pkl')
save_vectorization(target_vectorization, 'result/target_vectorization_layer.pkl')
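Like load_data, create_vectorizations and save_vectorization come from the repository. A minimal sketch of what they might do with the Keras TextVectorization layer and pickle is shown below; the vocabulary size and sequence length here are illustrative values, not the project's actual settings.

import pickle
from tensorflow.keras.layers import TextVectorization

def create_vectorizations(train_pairs, vocab_size=20000, sequence_length=20):
    source_texts = [pair[0] for pair in train_pairs]
    target_texts = [pair[1] for pair in train_pairs]

    source_vectorization = TextVectorization(
        max_tokens=vocab_size, output_mode="int",
        output_sequence_length=sequence_length)
    # Target sequences are one token longer, because they will be shifted
    # by one position when building the training dataset.
    target_vectorization = TextVectorization(
        max_tokens=vocab_size, output_mode="int",
        output_sequence_length=sequence_length + 1)

    source_vectorization.adapt(source_texts)
    target_vectorization.adapt(target_texts)
    return source_vectorization, target_vectorization

def save_vectorization(vectorization, file_path):
    # Persist the layer's config and vocabulary so it can be rebuilt later.
    data = {"config": vectorization.get_config(),
            "weights": vectorization.get_weights()}
    with open(file_path, "wb") as f:
        pickle.dump(data, f)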

The next step is to convert our plain text dataset into tf.data.Dataset objects. We will have a separate tf.data.Dataset for the training set, validation set, and test set.

from data_loader import make_dataset

batch_size = 64
train_ds = make_dataset(train_pairs, source_vectorization, target_vectorization, batch_size)
val_ds = make_dataset(val_pairs, source_vectorization, target_vectorization, batch_size)
test_ds = make_dataset(test_pairs, source_vectorization, target_vectorization, batch_size) # We will use this in the evaluation step
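A plausible make_dataset implementation pairs the source text and the target minus its last token as the inputs, and the target shifted left by one position as the label, which is exactly the "predict token N+1 from the first N tokens" setup described above. The sketch below follows that standard recipe; the input dictionary keys are assumptions and must match the names of the model's Input layers.

import tensorflow as tf

def make_dataset(pairs, source_vectorization, target_vectorization, batch_size=64):
    source_texts, target_texts = zip(*pairs)

    def format_dataset(source, target):
        source = source_vectorization(source)
        target = target_vectorization(target)
        # Inputs: the full source and target[:-1]; label: target shifted by one.
        return ({"source": source, "target": target[:, :-1]}, target[:, 1:])

    dataset = tf.data.Dataset.from_tensor_slices((list(source_texts), list(target_texts)))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.shuffle(2048).prefetch(tf.data.AUTOTUNE).cache()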

Train model with the TransformerModel class

We train our model using the TransformerModel class. At first, we will keep the default settings: 256 dimensions in the embedding layer, 2,048 dimensions in the Transformer's Dense layer, 8 attention heads, and a dropout rate of 0.5.

from transformer_model import TransformerModel

transformer = TransformerModel(source_vectorization=source_vectorization,
    target_vectorization=target_vectorization)
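Passing the settings explicitly should be equivalent to the call above; dense_dim, num_heads, and drop_out are the keyword names used later in this article, while embed_dim is only my guess at the name of the embedding-size parameter.

# Equivalent to the defaults described above; embed_dim is an assumed name.
transformer = TransformerModel(source_vectorization=source_vectorization,
    target_vectorization=target_vectorization,
    embed_dim=256, dense_dim=2048, num_heads=8, drop_out=0.5)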

Finally, we can build the model and start training. We will train for 50 epochs and only keep the model with the highest validation accuracy.

import tensorflow as tf

transformer.build_model(optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
transformer.fit(train_ds, epochs=50, validation_data=val_ds,
    callbacks=[
        tf.keras.callbacks.ModelCheckpoint(
            filepath='result/restore_diacritic.keras',
            save_best_only=True,
            monitor='val_accuracy'
        )
    ])

The first result

Since we trained on only 10,000 sentences, the model did not take long to converge. On my low-end GPU, 50 epochs took around 30 minutes. However, the results are not great: we reached a test loss of 0.6837 and a test accuracy of 69.13%. This means our model predicts the next token correctly only 69.13% of the time.

transformer.evaluate(test_ds) # Print [0.6836796998977661, 0.691321611404419]

Let’s take a look at the loss/accuracy by epochs.

First model graph

We can see that the validation loss/accuracy closely tracks the training loss/accuracy while both remain mediocre. This means our model is underfitting: the default settings do not give it enough capacity to handle our translation task.

Increase the model’s capacity

There are a few ways to increase the capacity of our model: increase the number of dimensions in the Transformer's embedding and Dense layers, increase the number of attention heads, or reduce the dropout rate. Unfortunately, even the default settings are already pushing my GPU to its limit. I cannot increase the embedding dimensions past 256, and 16 attention heads is the maximum number my GPU can handle without throwing an out-of-memory exception.

Next, we will train two models on 300,000 sentences with the following settings (constructed as shown in the snippet after this list).

  • Settings 1: 8,192 dimensions in the Transformer's Dense layer, 8 attention heads, and a dropout rate of 0.2.
  • Settings 2: 2,048 dimensions in the Transformer's Dense layer, 16 attention heads, and a dropout rate of 0.2.
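In code, the two configurations only differ in their dense_dim and num_heads arguments (using the same keyword names that appear in the full-corpus run below).

# Settings 1: wider Dense layer, 8 attention heads, dropout rate 0.2.
transformer_1 = TransformerModel(source_vectorization=source_vectorization,
    target_vectorization=target_vectorization,
    dense_dim=8192, num_heads=8, drop_out=0.2)

# Settings 2: default Dense width, 16 attention heads, dropout rate 0.2.
transformer_2 = TransformerModel(source_vectorization=source_vectorization,
    target_vectorization=target_vectorization,
    dense_dim=2048, num_heads=16, drop_out=0.2)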

The results are below.

The next 2 models graph

We achieved over 90% validation accuracy. This is to be expected because we trained on more data. Both settings above have similar accuracy, but Settings 1 runs 20% faster on my GPU. Moreover, we haven’t seen any overfitting in either case yet.

Training on the full corpus

Given the result in the last section, we will use Settings 1 to train a model on the full corpus. But this time, we remove the Dropout layer altogether.

train_pairs, val_pairs, test_pairs = load_data(file_path, limit=None)

# ...omitted

transformer = TransformerModel(source_vectorization=source_vectorization,
    target_vectorization=target_vectorization,
    dense_dim=8192, num_heads=8, drop_out=0)

This time, the model took more than two days to converge. Below is the result.

Big model graph to restore Vietnamese diacritics

transformer.evaluate(test_ds) # [0.12277301400899887, 0.9405344128608704]

We have reached 94.05% test accuracy. I believe we can still improve our model by increasing its capacity. Unfortunately, this is already the limit of my GPU.

Some examples

Let’s try running some Vietnamese sentences through our model.

texts = [
    'ten toi la thai duong',
    'toi sinh ra o ha noi',
    'ngay mai troi se nang'
]

for text in texts:
    print(transformer.predict(text))
Original               | Diacritics-less        | English                  | Model’s output
tên tôi là thái dương  | ten toi la thai duong  | my name is thai duong    | tên tôi là thái dương
tôi sinh ra ở hà nội   | toi sinh ra o ha noi   | I was born in hanoi      | tôi sinh ra ở hà nội
ngày mai trời sẽ nắng  | ngay mai troi se nang  | it’ll be sunny tomorrow  | ngày mai trời sẽ nắng

How to resume training

As mentioned earlier, our model took two days to converge when training on the full corpus. Because of that, being able to pause and resume training is beneficial. Fortunately, this is easy with Keras.

import tensorflow as tf

from data_loader import load_vectorization_from_disk

source_vectorization = load_vectorization_from_disk('<path to source vectorization file>')
target_vectorization = load_vectorization_from_disk('<path to target vectorization file>')

# ...omitted

transformer = TransformerModel(source_vectorization=source_vectorization,
    target_vectorization=target_vectorization,
    model_path='<path to saved model>')
transformer.fit(train_ds, epochs=50, validation_data=val_ds,
    callbacks=[
        tf.keras.callbacks.ModelCheckpoint(
            filepath='result/restore_diacritic.keras',
            save_best_only=True,
            monitor='val_accuracy'
        )
    ])

Conclusion

Restoring Vietnamese diacritics with a Transformer is an interesting problem with real-world applications. I found this article from my alma mater (my classmates might remember Dr. Trang, who taught us object-oriented programming in our 3rd year). They used a Transformer in a hybrid approach to restore diacritics at the character level (instead of the word level) and reported 98.37% test accuracy. But they had access to a Tesla V100 PCIe GPU and a dataset 15 times bigger than mine.

I sure wish I had a Tesla V100 lying around :). Maybe I can achieve similar accuracy if I can increase the capacity of my model and train it on a bigger dataset. Either way, I feel that for a toy project, 94.05% test accuracy is not too bad.
