Abstract
Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides multi-label classification with 1 to 4 labels, handling subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer outperforms the others, recognizing decomposed characters in each hate speech class.
Original language | English |
---|---|
Title of host publication | Proceedings of the 29th International Conference on Computational Linguistics |
Editors | Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na |
Place of Publication | Gyeongju |
Publisher | International Committee on Computational Linguistics |
Pages | 3530-3538 |
Number of pages | 9 |
Publication status | Published - Oct 2022 |
Externally published | Yes |
Event | The 29th International Conference on Computational Linguistics - Gyeongju, Korea, Republic of; Duration: 12 Oct 2022 → 17 Oct 2022; Conference number: 29; https://coling2022.org/ |
Conference
Conference | The 29th International Conference on Computational Linguistics |
---|---|
Abbreviated title | coling2022 |
Country/Territory | Korea, Republic of |
City | Gyeongju |
Period | 12/10/22 → 17/10/22 |
Internet address | https://coling2022.org/ |