TY - GEN
T1 - CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection.
AU - Weld, Henry
AU - Huang, Guanghao
AU - Lee, Jean
AU - Zhang, Tongshu
AU - Wang, Kunze
AU - Guo, Xinghong
AU - Long, Siqu
AU - Poon, Josiah
AU - Han, Soyeon Caren
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
N2 - Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on CONDA, providing fine-grained results for different intent classes and slot classes. Furthermore, we examine the coverage of toxicity nature in our dataset by comparing it with other toxicity datasets. © 2021 Association for Computational Linguistics
AB - Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on CONDA, providing fine-grained results for different intent classes and slot classes. Furthermore, we examine the coverage of toxicity nature in our dataset by comparing it with other toxicity datasets. © 2021 Association for Computational Linguistics
UR - http://www.scopus.com/inward/record.url?scp=85123909046&partnerID=8YFLogxK
UR - https://dblp.org/rec/conf/acl/WeldHLZWGLPH21
U2 - 10.18653/V1/2021.FINDINGS-ACL.213
DO - 10.18653/V1/2021.FINDINGS-ACL.213
M3 - Conference paper
T3 - Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
SP - 2406
EP - 2416
BT - Findings of the Association for Computational Linguistics
A2 - Zong, Chengqing
A2 - Xia, Fei
A2 - Li, Wenjie
A2 - Navigli, Roberto
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Y2 - 1 August 2021 through 6 August 2021
ER -