%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
In this notebook, we will build a simple, fast, and accurate Arabic-language text classification model with minimal effort. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.
The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).
Each entry in the dataset includes a review in Arabic and a rating between 1 and 5. We will convert this to a binary classification dataset by assigning a positive label to reviews rated above 3 and a negative label to reviews rated below 3.
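The "above 3 → positive, below 3 → negative" rule can be sketched on a toy ratings column before applying it to the real data (the values below are illustrative, not from the dataset):

```python
import pandas as pd

# toy ratings illustrating the binarization rule described above
ratings = pd.Series([1, 2, 4, 5, 5, 2])

# ratings below 3 become 'neg'; the rest become 'pos'
labels = ratings.apply(lambda r: 'neg' if r < 3 else 'pos')

print(labels.tolist())  # → ['neg', 'neg', 'pos', 'pos', 'pos', 'neg']
```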
(Disclaimer: I don't speak Arabic. Please forgive mistakes.)
# convert ratings to a binary format: pos=positive, neg=negative
import pandas as pd
df = pd.read_csv('data/arabic_hotel_reviews/balanced-reviews.txt', delimiter='\t', encoding='utf-16')
df = df[['rating', 'review']]
df['rating'] = df['rating'].apply(lambda x: 'neg' if x < 3 else 'pos')
df.head()
|   | rating | review |
|---|--------|--------|
| 0 | neg | “ممتاز”. النظافة والطاقم متعاون. |
| 1 | pos | استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل... |
| 2 | pos | استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر... |
| 3 | neg | “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ... |
| 4 | pos | جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا... |
Let's split out a training and validation set.
df_train = df.sample(frac=0.85, random_state=42)
df_test = df.drop(df_train.index)
len(df_train), len(df_test)
(89843, 15855)
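The sample/drop idiom above yields two disjoint frames whose row counts sum to the original. A quick sanity check on a toy frame (illustrative data, not the hotel reviews):

```python
import pandas as pd

# toy frame standing in for the review DataFrame
df_toy = pd.DataFrame({'rating': ['pos', 'neg'] * 50, 'review': ['x'] * 100})

train = df_toy.sample(frac=0.85, random_state=42)  # 85% for training
test = df_toy.drop(train.index)                    # the remaining 15%

assert len(train) + len(test) == len(df_toy)       # nothing lost
assert train.index.intersection(test.index).empty  # no overlap
print(len(train), len(test))  # → 85 15
```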
With ktrain's Transformer API, we can select any Hugging Face transformers model appropriate for our data. Since we are dealing with Arabic, we will use AraBERT by the AUB MIND Lab instead of multilingual BERT (the default that ktrain's alternative text_classifier API uses for non-English datasets). As you can see below, with only 1 epoch, we obtain 96.37% accuracy on the validation set.
import ktrain
from ktrain import text
MODEL_NAME = 'aubmindlab/bert-base-arabertv01'
t = text.Transformer(MODEL_NAME, maxlen=128)
trn = t.preprocess_train(df_train.review.values, df_train.rating.values)
val = t.preprocess_test(df_test.review.values, df_test.rating.values)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)
learner.fit_onecycle(5e-5, 1)
preprocessing train...
language: ar
train sequence lengths:
    mean : 24
    95percentile : 67
    99percentile : 120

Is Multi-Label? False
preprocessing test...
language: ar
test sequence lengths:
    mean : 24
    95percentile : 67
    99percentile : 121

begin training using onecycle policy with max lr of 5e-05...
Train for 2808 steps, validate for 496 steps
2808/2808 [==============================] - 1104s 393ms/step - loss: 0.1447 - accuracy: 0.9466 - val_loss: 0.1054 - val_accuracy: 0.9637
<tensorflow.python.keras.callbacks.History at 0x7f06344b84a8>
p = ktrain.get_predictor(learner.model, t)
Predicting the label for the text:
"The room was clean, the food excellent, and I loved the view from my room."
p.predict("الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.")
'pos'
Predicting the label for the text:
"This hotel was too expensive and the staff is rude."
p.predict('كان هذا الفندق باهظ الثمن والموظفين غير مهذبين.')
'neg'
# save model for later use
p.save('/tmp/arabic_predictor')
# reload from disk
p = ktrain.load_predictor('/tmp/arabic_predictor')
# still works as expected after reloading from disk
p.predict("الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.")
'pos'