LIEA-BERT: A Linguistically Enriched Framework for Hate Speech Detection in Low-Resource Thai

Tracking #: 909-1889

Authors:
Steve Nwaiwu (ORCID: https://orcid.org/0009-0001-0190-2167)

Submission Type: 

Research Paper

Abstract: 

Hate speech detection in low-resource languages poses significant challenges due to the scarcity of annotated datasets and language-specific NLP tools. This study addresses these limitations by proposing a weakly supervised learning framework tailored to detecting hate speech in Thai, a low-resource language with complex linguistic characteristics. We constructed a weakly labeled dataset by combining a curated lexicon of Thai toxicity terms with sentiment-labeled data, reducing reliance on manual annotation. To make this supervision more robust, we incorporated label smoothing, which mitigates label noise and improves generalization. Our model builds on multilingual BERT (mBERT) and is refined with Linguistically Informed Embedding Alignment (LIEA), which enriches embeddings with phonological and syntactic features. To evaluate embedding alignment, we applied Proto-MAML, leveraging auxiliary tasks such as phoneme recognition and classification-loss monitoring, which significantly enhanced the model's representational capacity. The proposed approach achieved a validation accuracy of 99.65% and a test accuracy of 97.35%, demonstrating strong generalization on Thai hate speech detection. These findings highlight the effectiveness of combining weak supervision with linguistically informed and meta-learning strategies in low-resource settings.
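As a rough illustration of the weak-supervision step the abstract describes, the sketch below pairs lexicon-based labeling with standard label smoothing. This is a minimal sketch assuming a binary hate/non-hate setup; the lexicon entries, helper names, and smoothing factor are hypothetical and not taken from the paper, and the LIEA and Proto-MAML components are not reproduced here.

```python
import torch
import torch.nn.functional as F

def weak_label(text: str, toxic_lexicon: set[str]) -> int:
    """Weakly label a text as hateful (1) if any lexicon term occurs.

    Thai is written without word boundaries, so plain substring
    matching is a crude heuristic; a Thai tokenizer would be more
    robust. The lexicon itself is assumed to be curated separately.
    """
    return int(any(term in text for term in toxic_lexicon))

def smooth_labels(labels: torch.Tensor, num_classes: int = 2,
                  epsilon: float = 0.1) -> torch.Tensor:
    """Standard label smoothing over one-hot targets.

    The true class gets probability 1 - epsilon and the remaining
    epsilon mass is spread uniformly over the other classes, which
    softens the impact of noisy weak labels during training.
    """
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * epsilon / (num_classes - 1)

# Hypothetical usage; the lexicon entries are placeholders, not the
# paper's curated toxicity terms.
lexicon = {"คำหยาบ", "คำด่า"}
texts = ["ประโยคที่เป็นกลาง", "ข้อความที่มีคำด่าอยู่"]
hard = torch.tensor([weak_label(t, lexicon) for t in texts])  # [0, 1]
soft = smooth_labels(hard)  # [[0.9, 0.1], [0.1, 0.9]]
```

Note that recent PyTorch versions expose the same smoothing effect directly via torch.nn.CrossEntropyLoss(label_smoothing=0.1), so an explicit target-smoothing helper like the one above is mainly useful when the soft targets feed a custom loss.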

Manuscript: 

Supplementary Files (optional): 

Tags: 

  • Reviewed

Data repository URLs: 

Date of Submission: 

Friday, March 28, 2025

Date of Decision: 

Friday, April 4, 2025

Nanopublication URLs:

Decision: 

Reject (Pre-Screening)