A comparative study of language modeling to instance-based methods, and feature combinations for authorship attribution

Olga Fourkioti, Symeon Symeonidis, Avi Arampatzis

pp. 274-286

Abstract

We present a comparative study of language modeling and traditional instance-based methods for authorship attribution, using several basic units as features, such as characters, words, and other simple lexical measurements. We also propose the use of part-of-speech (POS) tags as features for language modeling. In contrast to many other studies, which focus on small sets of documents written by major writers on several topics, we consider a relatively large corpus of documents edited by non-professional writers on the same topic. We find that language models based on either characters or POS tags are the most effective, and that the latter provide additional efficiency benefits and robustness against data sparsity. Moreover, we experiment with linearly combining several language models, as well as with employing unions of several feature types in instance-based methods. Both kinds of combination prove to be viable strategies that generally improve effectiveness. By linearly combining three language models, based respectively on character, word, and POS trigrams, we achieve the best generalization accuracy of 96%.
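The linear combination of per-author language models mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their smoothing scheme, combination weights, or corpus, so the add-one smoothing over whole trigrams, the equal weights, the toy corpus, and all names (`NgramModel`, `views`, `train`, `attribute`) are assumptions made for illustration. Only character and word views are built here; a POS view could be added by supplying pre-tagged sequences under a third feature key.

```python
import math
from collections import Counter


def ngrams(seq, n=3):
    """Return all n-grams (as tuples) of a sequence of units."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


class NgramModel:
    """Add-one smoothed distribution over n-grams (a deliberate simplification)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = Counter()
        self.total = 0

    def fit(self, sequences):
        for seq in sequences:
            grams = ngrams(seq, self.n)
            self.counts.update(grams)
            self.total += len(grams)
        return self

    def avg_log_prob(self, seq, vocab_size):
        """Mean log-probability per n-gram, so document length cancels out."""
        grams = ngrams(seq, self.n)
        if not grams:
            return 0.0  # too short for this feature: same score for every author
        denom = self.total + vocab_size
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def views(text):
    """Hypothetical feature views of a document: characters and words."""
    return {"char": list(text), "word": text.split()}


def train(corpus, features):
    """Fit one model per (author, feature) and count distinct n-grams per feature."""
    models = {
        author: {f: NgramModel().fit([views(d)[f] for d in docs]) for f in features}
        for author, docs in corpus.items()
    }
    # +1 reserves one smoothing slot for n-grams never seen in training.
    vocab_sizes = {
        f: len(set().union(*(m[f].counts for m in models.values()))) + 1
        for f in features
    }
    return models, vocab_sizes


def attribute(doc_views, models, weights, vocab_sizes):
    """Pick the author whose weighted sum of per-feature scores is highest."""
    scores = {
        author: sum(
            weights[f] * by_feature[f].avg_log_prob(doc_views[f], vocab_sizes[f])
            for f in weights
        )
        for author, by_feature in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy corpus and equal weights, both invented for this example.
corpus = {
    "alice": ["the cat sat on the mat", "the cat ate the fish"],
    "bob": ["stocks fell sharply on monday", "markets fell again on friday"],
}
weights = {"char": 0.5, "word": 0.5}

models, vocab_sizes = train(corpus, list(weights))
predicted, scores = attribute(views("the cat sat on the fish"), models, weights, vocab_sizes)
print(predicted, scores)
```

In practice the weights would be tuned on held-out data rather than fixed by hand, and a conditional n-gram model with a stronger smoothing method would likely replace the crude add-one estimate used here.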

Publication details

Published in:

Kamps Jaap, Tsakonas Giannis, Manolopoulos Yannis, Iliadis Lazaros, Karydis Ioannis (2017) Research and advanced technology for digital libraries: 21st international conference on theory and practice of digital libraries, TPDL 2017, Thessaloniki, Greece, September 18-21, 2017. Dordrecht, Springer.

Pages: 274-286

DOI: 10.1007/978-3-319-67008-9_22

Full citation:

Fourkioti Olga, Symeonidis Symeon, Arampatzis Avi (2017) „A comparative study of language modeling to instance-based methods, and feature combinations for authorship attribution“, In: J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis & I. Karydis (eds.), Research and advanced technology for digital libraries, Dordrecht, Springer, 274–286.