TY - JOUR
T1 - Constructing specialised corpora through analysing domain representativeness of websites
AU - Wong, Wilson
AU - Liu, Wei
AU - Bennamoun, Mohammed
PY - 2011
Y1 - 2011
AB - The role of the Web for text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.
U2 - 10.1007/s10579-011-9141-4
DO - 10.1007/s10579-011-9141-4
M3 - Article
SN - 1574-020X
VL - 45
SP - 209
EP - 241
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 2
ER -