K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment

Jean Lee, Taejun Lim, Heejun Lee, Bogeun Jo, Yang Sok Kim, Hee-Geun Yoon, Soyeon Caren Han

Research output: Chapter in Book/Conference paper · Conference paper · peer-review

Abstract

Online hate speech detection has become an important issue as online content grows, but resources in languages other than English remain extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments, provides multi-label annotations with 1 to 4 labels per utterance, and handles subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer performs best, recognizing decomposed characters in each hate speech class.
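The abstract describes a multi-label setup in which each comment can carry one to four labels. Below is a minimal Python sketch, not the authors' released code, of how such a multi-label classifier could be set up with Hugging Face Transformers; the stand-in model checkpoint, the label count, and the 0.5 decision threshold are illustrative assumptions rather than values taken from this record.

# Minimal multi-label sketch (assumed setup, not the authors' released code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumptions: a stand-in Korean BERT checkpoint and label count; the paper's
# best model (KR-BERT with a sub-character tokenizer) would be swapped in here.
MODEL_NAME = "klue/bert-base"
NUM_LABELS = 9  # assumed size of the K-MHaS label inventory

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid outputs with BCE loss, one score per label
)

comment = "뉴스 댓글 예시 문장"  # an example news comment
inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.sigmoid(logits)[0]                     # independent probability per label
predicted = (probs > 0.5).nonzero(as_tuple=True)[0]  # an utterance may carry several labels
print(predicted.tolist())

In this setup the labels are not mutually exclusive, so each label gets its own sigmoid score and a simple threshold decides which labels apply, which matches the 1-to-4-labels-per-utterance scheme described in the abstract.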
Original language: English
Title of host publication: Proceedings of the 29th International Conference on Computational Linguistics
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Place of publication: Gyeongju
Publisher: International Committee on Computational Linguistics
Pages: 3530-3538
Number of pages: 9
Publication status: Published - Oct 2022
Externally published: Yes
Event: The 29th International Conference on Computational Linguistics - Gyeongju, Korea, Republic of
Duration: 12 Oct 2022 - 17 Oct 2022
Conference number: 29
https://coling2022.org/

Conference

Conference: The 29th International Conference on Computational Linguistics
Abbreviated title: coling2022
Country/Territory: Korea, Republic of
City: Gyeongju
Period: 12/10/22 - 17/10/22
Internet address: https://coling2022.org/
