Paper list for the paper "Authorship Attribution in the Era of Large Language Models: Problems, Methodologies, and Challenges"

baixianghuang/survey-authorship

Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges

Overview

This repository hosts the paper list from the paper "Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges." ACM SIGKDD Explorations (2024) [arXiv] [Project Website]

TLDR: This survey paper systematically categorizes authorship attribution in the era of LLMs into four problems: attributing unknown texts to human authors, detecting LLM-generated texts, identifying specific LLMs or human authors, and classifying texts as human-authored, machine-generated, or co-authored by both, while also highlighting key challenges and open problems.

As illustrated in the figure below, authorship attribution can be systematically categorized into four problems. Each problem presents unique challenges that call for corresponding solutions. Researchers continually adapt and refine attribution methods as the focus moves from human-authored texts to LLM-generated content and on to the complex interweaving found in human-LLM co-authored works.

BibTeX

@article{huang2024authorshipattributionerallms,
    title   = {Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges},
    author  = {Baixiang Huang and Canyu Chen and Kai Shu},
    year    = {2024},
    journal = {arXiv preprint arXiv: 2408.08946},
    url     = {https://arxiv.org/abs/2408.08946}, 
}

Benchmarks and Detectors

The table below is a summary of Authorship Attribution Datasets and Benchmarks with LLM-Generated Text. Size is shown as the sum of LLM-generated and human-written texts (with the percentage of human-written texts in parentheses). Language is displayed using the two-letter ISO 639 abbreviation. Columns P2, P3, and P4 indicate whether the dataset supports problems described in Problem 2, 3, and 4, respectively.

| Name | Domain | Size | Length | Language | Model | P2 | P3 | P4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TuringBench | News | 168,612 (5.2%) | 100 to 400 words | en | GPT-1,2,3, GROVER, CTRL, XLM, XLNET, FAIR, TRANSFORMER-XL, PPLM | | | |
| TweepFake | Social media | 25,572 (50.0%) | less than 280 characters | en | GPT-2, RNN, Markov, LSTM, CharRNN | | | |
| ArguGPT | Academic essays | 8,153 (49.5%) | 300 words on average | en | GPT2-XL, text-babbage-001, text-curie-001, davinci-001,002,003, GPT-3.5-Turbo | | | |
| AuTexTification | Tweets, reviews, news, legal, and how-to articles | 163,306 (42.5%) | 20 to 100 tokens | en, es | BLOOM, GPT-3 | | | |
| CHEAT | Academic paper abstracts | 50,699 (30.4%) | 163.9 words on average | en | ChatGPT | | | |
| GPABench2 | Academic paper abstracts | 2.385M (6.3%) | 70 to 350 words | en | ChatGPT | | | |
| Ghostbuster | News, student essays, creative writing | 23,091 (87.0%) | 77 to 559 (median words per document) | en | ChatGPT, Claude | | | |
| HC3 | Reddit, Wikipedia, medicine, finance | 125,230 (64.5%) | 25 to 254 words | en, zh | ChatGPT | | | |
| HC3 Plus | News, social media | 214,498 | N/A | en, zh | ChatGPT | | | |
| HC-Var | News, reviews, essays, QA | 144k (68.8%) | 50 to 200 words | en | ChatGPT | | | |
| HANSEN | Transcripts of speech (spoken text), statements (written text) | 535k (96.1%) | less than 1k tokens | en | ChatGPT, PaLM2, Vicuna-13B | | | |
| M4 | Wikipedia, WikiHow, Reddit, QA, news, paper abstracts, peer reviews | 147,895 (24.2%) | more than 1k characters | ar, bg, en, id, ru, ur, zh | davinci-003, ChatGPT, GPT-4, Cohere, Dolly2, BLOOMz | | | |
| MGTBench | News, student essays, creative writing | 21k (14.3%) | 1 to 500 words | en | ChatGPT, ChatGLM, Dolly, GPT4All, StableLM, Claude | | | |
| MULTITuDE | News | 74,081 (10.8%) | 200 to 512 tokens | ar, ca, cs, de, en, es, nl, pt, ru, uk, zh | GPT-3,4, ChatGPT, Llama-65B, Alpaca-LoRa-30B, Vicuna-13B, OPT-66B, OPT-IML-Max-1.3B | | | |
| OpenGPTText | OpenWebText | 58,790 (50.0%) | less than 2k words | en | ChatGPT | | | |
| OpenLLMText | OpenWebText | 344,530 (20%) | 512 tokens | en | ChatGPT, PaLM, Llama, GPT2-XL | | | |
| Scientific Paper | Scientific papers | 29k (55.2%) | 900 tokens on average | en | SCIgen, GPT-2,3, ChatGPT, Galactica | | | |
| RAID | News, Wikipedia, paper abstracts, recipes, Reddit, poems, book summaries, movie reviews | 523,985 (2.9%) | 323 tokens on average | cs, de, en | GPT-2,3,4, ChatGPT, Mistral-7B, MPT-30B, Llama2-70B, Cohere command and chat | | | |
| M4GT-Bench | Wikipedia, Wikihow, Reddit, arXiv abstracts, academic paper reviews, student essays | 5,368,998 (96.6%) | more than 50 characters | ar, bg, de, en, id, it, ru, ur, zh | ChatGPT, davinci-003, GPT-4, Cohere, Dolly-v2, BLOOMz | | | |
| MAGE | Reddit, reviews, news, QA, story writing, Wikipedia, academic paper abstracts | 448,459 (34.4%) | 263 words on average | en | GPT, Llama, GLM-130B, FLAN-T5, OPT, T0, BLOOM-7B1, GPT-J-6B, GPT-NeoX-2 | | | |
| MIXSET | Email, news, game reviews, academic paper abstracts, speeches, blogs | 3.6k (16.7%) | 50 to 250 words | en | GPT-4, Llama2 | | | |
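
Because the Size column packs two numbers into one cell (the total and the human-written share), it may help to see how the two class counts can be recovered. The following is a minimal sketch, not part of the survey: the helper name `split_size` and the parsing rules (comma separators, optional `k`/`M` suffix) are assumptions made for illustration, and the resulting counts are only estimates because the percentages in the table are rounded.

```python
import re
from typing import Optional, Tuple


def split_size(size_field: str) -> Optional[Tuple[int, int]]:
    """Estimate (human_written, llm_generated) counts from a Size cell.

    Hypothetical helper: expects text such as '168,612 (5.2%)' or '144k (68.8%)'.
    Returns None when the cell has no human-written percentage.
    """
    match = re.fullmatch(r"([\d,.]+)([kM]?)\s*\(([\d.]+)%\)", size_field.strip())
    if match is None:
        return None  # e.g. the HC3 Plus row lists a total without a percentage
    number, suffix, pct = match.groups()
    scale = {"": 1, "k": 1_000, "M": 1_000_000}[suffix]
    total = round(float(number.replace(",", "")) * scale)
    human = round(total * float(pct) / 100)  # share of human-written texts
    return human, total - human


print(split_size("25,572 (50.0%)"))   # TweepFake   -> (12786, 12786)
print(split_size("168,612 (5.2%)"))   # TuringBench -> (8768, 159844)
print(split_size("214,498"))          # HC3 Plus    -> None
```

Read this way, a low percentage such as TuringBench's 5.2% signals a heavily imbalanced benchmark, which is worth keeping in mind when comparing detector accuracy across datasets.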

The table below presents an overview of LLM-generated text detectors.

| Detector | Price | API | Website |
| --- | --- | --- | --- |
| GPTZero | 150k words at $10/month, 10k words for free per month | Yes | https://gptzero.me/ |
| ZeroGPT | 100k characters for $9.99, 15k characters for free | Yes | https://www.zerogpt.com/ |
| Sapling | 50k characters for $25, 2k characters for free | Yes | https://sapling.ai/ai-content-detector |
| Originality.AI | 200k words at $14.95/month | Yes | https://originality.ai/ |
| CopyLeaks | 300k words at $7.99/month | Yes | https://copyleaks.com/ai-content-detector |
| Winston | 80k words at $12/month | Yes | https://gowinston.ai/ |
| GPT Radar | $0.02/100 tokens | N/A | https://gptradar.com/ |
| Turnitin’s AI detector | License required | N/A | https://www.turnitin.com/solutions/topics/ai-writing/ai-detector/ |
| GPT-2 Output Detector | Free | N/A | https://github.com/openai/gpt-2-output-dataset/tree/master/detector |
| Crossplag | Free | N/A | https://crossplag.com/ai-content-detector/ |
| CatchGPT | Free | N/A | https://www.catchgpt.ai/ |
| Quill.org | Free | N/A | https://aiwritingcheck.org/ |
| Scribbr | Free | N/A | https://www.scribbr.com/ai-detector/ |
| Draft Goal | Free | N/A | https://detector.dng.ai/ |
| Writefull | Free | Yes | https://x.writefull.com/gpt-detector |
| Phrasly | Free | Yes | https://phrasly.ai/ai-detector |
| Writer | Free | Yes | https://writer.com/ai-content-detector/ |

Paper List

1. Human-written Text Attribution

  • Who could be behind QAnon? Authorship attribution with supervised machine-learning. Florian Cafiero, and Jean-Baptiste Camps. arXiv preprint arXiv:2303.02078 (2023) [link]

  • PART: Pre-trained Authorship Representation Transformer. Javier Huertas-Tato, Alvaro Huertas-Garcia, Alejandro Martin, and David Camacho. arXiv preprint arXiv:2209.15373 (2022) [link]

  • Tracing text provenance via context-aware lexical substitution. Xi Yang, Jie Zhang, Kejiang Chen, Weiming Zhang, Zehua Ma, Feng Wang, and Nenghai Yu. Proceedings of the AAAI Conference on Artificial Intelligence (2022) [link]

  • A Survey on Authorship Analysis Tasks and Techniques. Arta Misini, Arbana Kadriu, and Ercan Canhasi. SEEU Review (2022) [link]

  • On the state of the art in authorship attribution and authorship verification. Jacob Tyo, Bhuwan Dhingra, and Zachary C Lipton. arXiv preprint arXiv:2209.06869 (2022) [link]

  • Overview of PAN 2022: Authorship Verification, Profiling Irony and Stereotype Spreaders, and Style Change Detection. Janek Bevendorff, Berta Chulvi, Elisabetta Fersini, Annina Heini, Mike Kestemont, Krzysztof Kredens, Maximilian Mayerl, Reynier Ortega-Bueno, Piotr Pęzik, Martin Potthast, and others. International Conference of the Cross-Language Evaluation Forum for European Languages (2022) [link]

  • Same author or just same topic? towards content-independent style representations. Anna Wegmann, Marijn Schraagen, and Dong Nguyen. arXiv preprint arXiv:2204.04907 (2022) [link]

  • Adversarial Authorship Attribution for Deobfuscation. Wanyue Zhai, Jonathan Rusert, Zubair Shafiq, and Padmini Srinivasan. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2022) [link]

  • Creating and detecting fake reviews of online products. Joni Salminen, Chandrashekhar Kandpal, Ahmed Mohamed Kamel, Soon-gyo Jung, and Bernard J Jansen. Journal of Retailing and Consumer Services (2022) [link]

  • Unified and multilingual author profiling for detecting haters. Ipek Baris Schlicht, and Angel Felipe Magnossão de Paula. arXiv preprint arXiv:2109.09233 (2021) [link]

  • Performing multilingual analysis with Linguistic Inquiry and Word Count 2015 (LIWC2015). An equivalence study of four languages. Diana Paula Dudău, and Florin Alin Sava. Frontiers in Psychology (2021) [link]

  • The topic confusion task: A novel evaluation scenario for authorship attribution. Malik Altakrori, Jackie Chi Kit Cheung, and Benjamin CM Fung. Findings of the Association for Computational Linguistics: EMNLP 2021 (2021) [link]

  • Learning universal authorship representations. Rafael A Rivera-Soto, Olivia Elizabeth Miano, Juanita Ordonez, Barry Y Chen, Aleem Khan, Marcus Bishop, and Nicholas Andrews. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021) [link]

  • Posnoise: An effective countermeasure against topic biases in authorship analysis. Oren Halvani, and Lukas Graner. Proceedings of the 16th International Conference on Availability, Reliability and Security (2021) [link]

  • Overview of the Cross-Domain Authorship Verification Task at PAN 2021. Mike Kestemont, Enrique Manjavacas, Ilia Markov, Janek Bevendorff, Matti Wiegmann, Efstathios Stamatatos, Martin Potthast, and Benno Stein. CLEF (Working Notes) (2021) [link]

  • Authorship attribution of social media messages. Antonio Theophilo, Romain Giot, and Anderson Rocha. IEEE Transactions on Computational Social Systems (2021) [link]

  • Siamese networks for large-scale author identification. Chakaveh Saedi, and Mark Dras. Computer Speech & Language (2021) [link]

  • Developing a benchmark for reducing data bias in authorship attribution. Benjamin Murauer, and Günther Specht. Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems (2021) [link]

  • Transferring bert-like transformers’ knowledge for authorship verification. Andrei Manolache, Florin Brad, Elena Burceanu, Antonio Barbalau, Radu Ionescu, and Marius Popescu. arXiv preprint arXiv:2112.05125 (2021) [link]

  • The importance of suppressing domain style in authorship analysis. Sebastian Bischoff, Niklas Deckers, Marcel Schliebs, Ben Thies, Matthias Hagen, Efstathios Stamatatos, Benno Stein, and Martin Potthast. arXiv preprint arXiv:2005.14714 (2020) [link]

  • Overview of pan 2020: Authorship verification, celebrity profiling, profiling fake news spreaders on twitter, and style change detection. Janek Bevendorff, Bilal Ghanem, Anastasia Giachanou, Mike Kestemont, Enrique Manjavacas, Ilia Markov, Maximilian Mayerl, Martin Potthast, Francisco Rangel, Paolo Rosso, and others. Experimental IR Meets Multilinguality, Multimodality, and Interaction: 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22--25, 2020, Proceedings 11 (2020) [link]

  • Forensic authorship analysis of microblogging texts using n-grams and stylometric features. Nicole Mariah Sharon Belvisi, Naveed Muhammad, and Fernando Alonso-Fernandez. 2020 8th International Workshop on Biometrics and Forensics (IWBF) (2020) [link]

  • Masking domain-specific information for cross-domain deception detection. Javier Sánchez-Junquera, Luis Villaseñor-Pineda, Manuel Montes-y-Gómez, Paolo Rosso, and Efstathios Stamatatos. Pattern Recognition Letters (2020) [link]

  • Cross-domain authorship attribution using pre-trained language models. Georgios Barlas, and Efstathios Stamatatos. Artificial Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, June 5--7, 2020, Proceedings, Part I 16 (2020) [link]

  • Deep Learning based Authorship Identification. Arth Talati, A Sharma, and R Narayanan. No venue (2020) [link]

  • BertAA: BERT fine-tuning for Authorship Attribution. Maél Fabien, Esaú Villatoro-Tello, Petr Motlicek, and Shantipriya Parida. Proceedings of the 17th International Conference on Natural Language Processing (ICON) (2020) [link]

  • Text messaging forensics: Txt 4n6: idiolect-free authorship analysis? Tim Grant. The Routledge handbook of forensic linguistics (2020) [link]

  • Exclusive: FBI document warns conspiracy theories are a new domestic terrorism threat. Jana Winter. Yahoo News (2019)

  • Paraphrase plagiarism identification with character-level features. Fernando Sánchez-Vega, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Paolo Rosso, Efstathios Stamatatos, and Luis Villasenor-Pineda. Pattern Analysis and Applications (2019) [link]

  • A Girl Has No Name: Automated Authorship Obfuscation using Mutant-X. Asad Mahmood, Faizan Ahmad, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar. Proc. Priv. Enhancing Technol. (2019) [link]

  • Code authorship attribution: Methods and challenges. Vaibhavi Kalgutkar, Ratinder Kaur, Hugo Gonzalez, Natalia Stakhanova, and Alina Matyukhina. ACM Computing Surveys (CSUR) (2019) [link]

  • A survey on stylometric text features. Ksenia Lagutina, Nadezhda Lagutina, Elena Boychuk, Inna Vorontsova, Elena Shliakhtina, Olga Belyaeva, Ilya Paramonov, and PG Demidov. 2019 25th Conference of Open Innovations Association (FRUCT) (2019) [link]

  • Attributing the Bixby Letter using n-gram tracing. Jack Grieve, Isobelle Clarke, Emily Chiang, Hannah Gideon, Annina Heini, Andrea Nini, and Emily Waibel. Digital Scholarship in the Humanities (2019) [link]

  • Similarity learning for authorship verification in social media. Benedikt Boenninghoff, Robert M Nickel, Steffen Zeiler, and Dorothea Kolossa. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019) [link]

  • Improving author verification based on topic modeling. Nektaria Potha, and Efstathios Stamatatos. Journal of the Association for Information Science and Technology (2019) [link]

  • Learning invariant representations of social media users. Nicholas Andrews, and Marcus Bishop. arXiv preprint arXiv:1910.04979 (2019) [link]

  • Explainable authorship verification in social media via attention-based similarity learning. Benedikt Boenninghoff, Steffen Hessler, Dorothea Kolossa, and Robert M Nickel. 2019 IEEE International Conference on Big Data (Big Data) (2019) [link]

  • Classification for authorship of tweets by comparing logistic regression and naive bayes classifiers. Opeyemi Aborisade, and Mohd Anwar. 2018 IEEE International Conference on Information Reuse and Integration (IRI) (2018) [link]

  • What represents “style” in authorship attribution? Kalaivani Sundararajan, and Damon Woodard. Proceedings of the 27th International Conference on Computational Linguistics (2018) [link]

  • Masking topic-related information to enhance authorship attribution. Efstathios Stamatatos. Journal of the Association for Information Science and technology (2018) [link]

  • Bert: Pre-training of deep bidirectional transformers for language understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. arXiv preprint arXiv:1810.04805 (2018) [link]

  • An investigation of supervised learning methods for authorship attribution in short hinglish texts using char & word n-grams. Abhay Sharma, Ananya Nandan, and Reetika Ralhan. arXiv preprint arXiv:1812.10281 (2018) [link]

  • Syntax encoding with application in authorship attribution. Richong Zhang, Zhiyuan Hu, Hongyu Guo, and Yongyi Mao. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018) [link]

  • Universal language model fine-tuning for text classification. Jeremy Howard, and Sebastian Ruder. arXiv preprint arXiv:1801.06146 (2018) [link]

  • Topic or style? exploring the most useful features for authorship attribution. Yunita Sari, Mark Stevenson, and Andreas Vlachos. Proceedings of the 27th international conference on computational linguistics (2018) [link]

  • Computational forensic authorship analysis: Promises and pitfalls. Shlomo Argamon. Language and Law/Linguagem e Direito (2018) [link]

  • Authorship verification applied to detection of compromised accounts on online social networks: A continuous approach. Sylvio Barbon, Rodrigo Augusto Igawa, and Bruno Bogaz Zarpelão. Multimedia Tools and Applications (2017) [link]

  • Convolutional neural networks for authorship attribution of short texts. Prasha Shrestha, Sebastian Sierra, Fabio A González, Manuel Montes, Paolo Rosso, and Thamar Solorio. Proceedings of the 15th conference of the European chapter of the association for computational linguistics: Volume 2, short papers (2017) [link]

  • Deep learning based authorship identification. Chen Qian, Tianchang He, and Rao Zhang. Report, Stanford University (2017)

  • Stylometric authorship attribution of collaborative documents. Edwin Dauber, Rebekah Overdorf, and Rachel Greenstadt. Cyber Security Cryptography and Machine Learning: First International Conference, CSCML 2017, Beer-Sheva, Israel, June 29-30, 2017, Proceedings 1 (2017) [link]

  • A unified approach to interpreting model predictions. Scott M Lundberg, and Su-In Lee. Advances in neural information processing systems (2017) [link]

  • Surveying stylometry techniques and applications. Tempestt Neal, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard. ACM Computing Surveys (CSuR) (2017) [link]

  • Authorship verification: a review of recent advances. Efstathios Stamatatos. Research in Computing Science (2016) [link]

  • Authorship attribution for social media forensics. Anderson Rocha, Walter J Scheirer, Christopher W Forstall, Thiago Cavalcante, Antonio Theophilo, Bingyu Shen, Ariadne RB Carvalho, and Efstathios Stamatatos. IEEE transactions on information forensics and security (2016) [link]

  • Domain adaptation for authorship attribution: Improved structural correspondence learning. Upendra Sapkota, Thamar Solorio, Manuel Montes, and Steven Bethard. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2016) [link]

  • Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. Sebastian Ruder, Parsa Ghaffari, and John G Breslin. arXiv preprint arXiv:1609.06686 (2016) [link]

  • Profile-based authorship analysis. Jonathan Dunn, Shlomo Argamon, Amin Rasooli, and Geet Kumar. Digital Scholarship in the Humanities (2016) [link]

  • The Rowling case: a proposed standard analytic protocol for authorship questions. Patrick Juola. Digital Scholarship in the Humanities (2015) [link]

  • The development and psychometric properties of LIWC2015. James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. No venue (2015) [link]

  • Author identification using multi-headed recurrent neural networks. Douglas Bagnall. arXiv preprint arXiv:1506.04891 (2015) [link]

  • Does size matter? Authorship attribution, small samples, big problem. Maciej Eder. Digital Scholarship in the Humanities (2015) [link]

  • Authorship attribution with topic models. Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. Computational Linguistics (2014) [link]

  • Authorship attribution for forensic investigation with thousands of authors. Min Yang, and Kam-Pui Chow. ICT Systems Security and Privacy Protection: 29th IFIP TC 11 International Conference, SEC 2014, Marrakech, Morocco, June 2-4, 2014. Proceedings 29 (2014) [link]

  • The secret life of pronouns. what our words say about us. John Nerbonne. Literary and Linguistic Computing (2014) [link]

  • Breaking the closed-world assumption in stylometric authorship attribution. Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. Advances in Digital Forensics X: 10th IFIP WG 11.9 International Conference, Vienna, Austria, January 8-10, 2014, Revised Selected Papers 10 (2014) [link]

  • Zipf’s word frequency law in natural language: A critical review and future directions. Steven T Piantadosi. Psychonomic bulletin & review (2014) [link]

  • The handbook of language variation and change. Jack K Chambers, Peter Trudgill, and Natalie Schilling-Estes. John Wiley & Sons (2013) [link]

  • Automated authorship attribution using advanced signal classification techniques. Maryam Ebrahimpour, Tālis J Putniņš, Matthew J Berryman, Andrew Allison, Brian W-H Ng, and Derek Abbott. PloS one (2013) [link]

  • Authorship verification for short messages using stylometry. Marcelo Luiz Brocardo, Issa Traore, Sherif Saad, and Isaac Woungang. 2013 International Conference on Computer, Information and Telecommunication Systems (CITS) (2013) [link]

  • Detecting hoaxes, frauds, and deception in writing style online. Sadia Afroz, Michael Brennan, and Rachel Greenstadt. 2012 IEEE Symposium on Security and Privacy (2012) [link]

  • Stylometry and immigration: A case study. Patrick Juola. JL & Pol'y (2012) [link]

  • The “fundamental problem” of authorship attribution. Moshe Koppel, Jonathan Schler, Shlomo Argamon, and Yaron Winter. English Studies (2012) [link]

  • Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. Michael Brennan, Sadia Afroz, and Rachel Greenstadt. ACM Transactions on Information and System Security (TISSEC) (2012) [link]

  • Finding deceptive opinion spam by any stretch of the imagination. Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. arXiv preprint arXiv:1107.4557 (2011) [link]

  • Ghosts from the high court's past: Evidence from computational linguistics for Dixon ghosting for McTiernan and Rich. Yanir Seroussi, Russell Smyth, and Ingrid Zukerman. University of New South Wales Law Journal, The (2011) [link]

  • Plagiarism and authorship analysis: introduction to the special issue. Efstathios Stamatatos, and Moshe Koppel. Language Resources and Evaluation (2011) [link]

  • Domain independent authorship attribution without domain adaptation. Rohith Menon, and Yejin Choi. Proceedings of the International Conference Recent Advances in Natural Language Processing 2011 (2011) [link]

  • Person identification from text and speech genre samples. Jade Goldstein, Ransom Winder, and Roberta Sabin. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009) (2009) [link]

  • Automatically profiling the author of an anonymous text. Shlomo Argamon, Moshe Koppel, James W Pennebaker, and Jonathan Schler. Communications of the ACM (2009) [link]

  • A survey of modern authorship attribution methods. Efstathios Stamatatos. Journal of the American Society for information Science and Technology (2009) [link]

  • Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. Ahmed Abbasi, and Hsinchun Chen. ACM Transactions on Information Systems (TOIS) (2008) [link]

  • Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation. Ben Allison, and Louise Guthrie. LREC (2008) [link]

  • Author identification: Using text sampling to handle the class imbalance problem. Efstathios Stamatatos. Information Processing & Management (2008) [link]

  • Detecting Fake Content with Relative Entropy Scoring. Thomas Lavergne, Tanguy Urvoy, and François Yvon. Pan (2008) [link]

  • Quantifying evidence in forensic authorship analysis. Tim Grant. International Journal of Speech, Language & the Law (2007) [link]

  • Authorship attribution in law enforcement scenarios. Moshe Koppel, Jonathan Schler, and Eran Messeri. NATO Security Through Science Series D-Information and Communication Security (2008) [link]

  • Opinion spam and analysis. Nitin Jindal, and Bing Liu. Proceedings of the 2008 international conference on web search and data mining (2008) [link]

  • Plagiarism detection without reference collections. Sven Meyer zu Eissen, Benno Stein, and Marion Kulig. Advances in Data Analysis: Proceedings of the 30 th Annual Conference of the Gesellschaft für Klassifikation eV, Freie Universität Berlin, March 8--10, 2006 (2007) [link]

  • On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Jeffrey T Hancock, Lauren E Curry, Saurabh Goorha, and Michael Woodworth. Discourse Processes (2007) [link]

  • Measuring Differentiability: Unmasking Pseudonymous Authors. Moshe Koppel, Jonathan Schler, and Elisheva Bonchek-Dokow. Journal of Machine Learning Research (2007) [link]

  • An algorithm for identifying authors using synonyms. Jonathan H Clark, and Charles J Hannon. Eighth Mexican International Conference on Current Trends in Computer Science (ENC 2007) (2007) [link]

  • Authorship attribution. Ilker Nadi Bozkurt, Ozgur Baghoglu, and Erkan Uyar. 2007 22nd international symposium on computer and information sciences (2007) [link]

  • Bigrams of syntactic labels for authorship discrimination of short texts. Graeme Hirst, and Ol’ga Feiguina. Literary and Linguistic Computing (2007) [link]

  • A framework for authorship identification of online messages: Writing-style features and classification techniques. Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. Journal of the American society for information science and technology (2006) [link]

  • Authorship attribution with thousands of candidate authors. Moshe Koppel, Jonathan Schler, Shlomo Argamon, and Eran Messeri. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (2006) [link]

  • Effects of age and gender on blogging. Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W Pennebaker. AAAI spring symposium: Computational approaches to analyzing weblogs (2006) [link]

  • Author identification on the large scale. David Madigan, Alexander Genkin, David D Lewis, Shlomo Argamon, Dmitriy Fradkin, and Li Ye. Proceedings of the 2005 Meeting of the Classification Society of North America (CSNA) (2005) [link]

  • Determining an author's native language by mining a text for errors. Moshe Koppel, Jonathan Schler, and Kfir Zigdon. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (2005) [link]

  • Who’s at the keyboard? Authorship attribution in digital evidence investigations. Carole E Chaski. International journal of digital evidence (2005) [link]

  • Bayesian multinomial logistic regression for author identification. David Madigan, Alexander Genkin, David D Lewis, and Dmitriy Fradkin. AIP conference proceedings (2005) [link]

  • On compression-based text classification. Yuval Marton, Ning Wu, and Lisa Hellerstein. Advances in Information Retrieval: 27th European Conference on IR Research, ECIR 2005, Santiago de Compostela, Spain, March 21-23, 2005. Proceedings 27 (2005) [link]

  • Authorship verification as a one-class classification problem. Moshe Koppel, and Jonathan Schler. Proceedings of the twenty-first international conference on Machine learning (2004) [link]

  • Author identification, idiolect, and linguistic uniqueness. Malcolm Coulthard. Applied linguistics (2004) [link]

  • Ad-hoc authorship attribution competition. Patrick Juola. Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (2004) [link]

  • Lying words: Predicting deception from linguistic styles. Matthew L Newman, James W Pennebaker, Diane S Berry, and Jane M Richards. Personality and social psychology bulletin (2003) [link]

  • Gender, genre, and writing style in formal written texts. Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Text & talk (2003) [link]

  • Psychological aspects of natural language use: Our words, our selves. James W Pennebaker, Matthias R Mehl, and Kate G Niederhoffer. Annual review of psychology (2003) [link]

  • Automatically categorizing written texts by author gender. Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Literary and linguistic computing (2002) [link]

  • Grammatical word class variation within the British National Corpus sampler. Paul Rayson, Andrew Wilson, and Geoffrey Leech. New frontiers of corpus research (2002) [link]

  • Machine learning in automated text categorization. Fabrizio Sebastiani. ACM computing surveys (CSUR) (2002) [link]

  • Computer-based authorship attribution without lexical measures. Efstathios Stamatatos, Nikos Fakotakis, and Georgios Kokkinakis. Computers and the Humanities (2001) [link]

  • Mining e-mail content for author identification forensics. Olivier De Vel, Alison Anderson, Malcolm Corney, and George Mohay. ACM Sigmod Record (2001) [link]

  • Support vector machines for spam categorization. Harris Drucker, Donghui Wu, and Vladimir N Vapnik. IEEE Transactions on Neural networks (1999) [link]

  • The state of authorship attribution studies: Some problems and solutions. Joseph Rudman. Computers and the Humanities (1997) [link]

  • Authorship attribution. David I Holmes. Computers and the Humanities (1994) [link]

  • Stylometry. Jose Nilo G Binongo. Notes and Queries (1996) [link]

  • The Federalist revisited: New directions in authorship attribution. David I Holmes, and Richard S Forsyth. Literary and Linguistic computing (1995) [link]

  • Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Frederick Mosteller, and David L Wallace. Journal of the American Statistical Association (1963) [link]

  • The statistical study of literary vocabulary. G Udny Yule. No venue (1944) [link]

  • On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. G Udny Yule. Biometrika (1939) [link]

  • The characteristic curves of composition. Thomas Corwin Mendenhall. Science (1887) [link]

2. LLM-generated Text Detection

  • Scalable watermarking for identifying large language model outputs. Sumanth Dathathri, et al. Nature 634.8035 (2024): 818-823 [link]

  • Fighting fire with fire: can ChatGPT detect AI-generated text? Amrita Bhattacharjee, and Huan Liu. ACM SIGKDD Explorations Newsletter (2024) [link]

  • Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Advances in Neural Information Processing Systems (2024) [link]

  • Adaptive ensembles of fine-tuned transformers for llm-generated text detection. Zhixin Lai, Xuesheng Zhang, and Suiyao Chen. arXiv preprint arXiv:2403.13335 (2024) [link]

  • ALISON: Fast and Effective Stylometric Authorship Obfuscation. Eric Xing, Saranya Venkatraman, Thai Le, and Dongwon Lee. arXiv preprint arXiv:2402.00835 (2024) [link]

  • Authorship obfuscation in multilingual machine-generated text detection. Dominik Macko, Robert Moro, Adaku Uchendu, Ivan Srba, Jason Samuel Lucas, Michiharu Yamashita, Nafis Irtiza Tripto, Dongwon Lee, Jakub Simko, and Maria Bielikova. arXiv preprint arXiv:2401.07867 (2024) [link]

  • On the detectability of chatgpt content: benchmarking, methodology, and evaluation through the lens of academic writing. Zeyan Liu, Zijun Yao, Fengjun Li, and Bo Luo. ACM SIGSAC Conference on Computer and Communications Security (2024) [link]

  • Can Large Language Models Identify Authorship?. Baixiang Huang, Canyu Chen, and Kai Shu. arXiv preprint arXiv:2403.08213 (2024) [link]

  • RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. Liam Dugan, Alyssa Hwang, Filip Trhlik, Josh Magnus Ludan, Andrew Zhu, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. arXiv preprint arXiv:2405.07940 (2024) [link]

  • Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text. Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. arXiv preprint arXiv:2401.12070 (2024) [link]

  • Red teaming language model detectors with language models. Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, and Cho-Jui Hsieh. Transactions of the Association for Computational Linguistics (2024) [link]

  • MAGE: Machine-generated Text Detection in the Wild. Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Zhilin Wang, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang. arXiv preprint arXiv:2305.13242 (2024) [link]

  • Machine-made media: Monitoring the mobilization of machine-generated articles on misinformation and mainstream news websites. Hans WA Hanley, and Zakir Durumeric. Proceedings of the International AAAI Conference on Web and Social Media (2024) [link]

  • Few-Shot Detection of Machine-Generated Text using Style Representations. Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Chen, Marcus Bishop, and Nicholas Andrews. arXiv preprint arXiv:2401.06712 (2024) [link]

  • M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Toru Sasaki, and others. arXiv preprint arXiv:2305.14902 (2023) [link]

  • DeepTextMark: Deep Learning based Text Watermarking for Detection of Large Language Model Generated Text. Travis Munyer, and Xin Zhong. arXiv preprint arXiv:2305.05773 (2023) [link]

  • Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. arXiv preprint arXiv:2303.13408 (2023) [link]

  • A survey on detection of llms-generated content. Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda Petzold, William Yang Wang, and Wei Cheng. arXiv preprint arXiv:2310.15654 (2023) [link]

  • A survey on llm-generated text detection: Necessity, methods, and future directions. Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F Wong, and Lidia S Chao. arXiv preprint arXiv:2310.14724 (2023) [link]

  • A watermark for large language models. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. International Conference on Machine Learning (2023) [link]

  • Machine-generated text: A comprehensive survey of threat models and detection methods. Evan Crothers, Nathalie Japkowicz, and Herna L Viktor. IEEE Access (2023) [link]

  • ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models. Yikang Liu, Ziyin Zhang, Wanyang Zhang, Shisen Yue, Xiaojing Zhao, Xinyuan Cheng, Yiwen Zhang, and Hai Hu. arXiv preprint arXiv:2304.07666 (2023) [link]

  • Distinguishing Fact from Fiction: A Benchmark Dataset for Identifying Machine-Generated Scientific Papers in the LLM Era. Edoardo Mosca, Mohamed Hesham Ibrahim Abdalla, Paolo Basso, Margherita Musumeci, and Georg Groh. Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023) (2023) [link]

  • Large language models can be used to effectively scale spear phishing campaigns. Julian Hazell. arXiv preprint arXiv:2305.06972 (2023) [link]

  • AI model GPT-3 (dis) informs us better than humans. Giovanni Spitale, Nikola Biller-Andorno, and Federico Germani. Science Advances (2023) [link]

  • ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. Brady D Lund, Ting Wang, Nishith Reddy Mannuru, Bing Nie, Somipam Shimray, and Ziang Wang. Journal of the Association for Information Science and Technology (2023) [link]

  • Evade ChatGPT detectors via a single space. Shuyang Cai, and Wanyun Cui. arXiv preprint arXiv:2307.02599 (2023) [link]

  • Do language models plagiarize?. Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. Proceedings of the ACM Web Conference 2023 (2023) [link]

  • How reliable are ai-generated-text detectors? an assessment framework using evasive soft prompts. Tharindu Kumarage, Paras Sheth, Raha Moraffah, Joshua Garland, and Huan Liu. arXiv preprint arXiv:2310.05095 (2023) [link]

  • Large language models can be guided to evade ai-generated text detection. Ning Lu, Shengcai Liu, Rui He, Qi Wang, Yew-Soon Ong, and Ke Tang. arXiv preprint arXiv:2305.10847 (2023) [link]

  • Efficient Black-Box Adversarial Attacks on Neural Text Detectors. Vitalii Fishchuk, and Daniel Braun. arXiv preprint arXiv:2311.01873 (2023) [link]

  • Detectgpt: Zero-shot machine-generated text detection using probability curvature. Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. arXiv preprint arXiv:2301.11305 (2023) [link]

  • Deepfake text detection: Limitations and opportunities. Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, and Bimal Viswanath. 2023 IEEE Symposium on Security and Privacy (SP) (2023) [link]

  • Can ai-generated text be reliably detected?. Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. arXiv preprint arXiv:2303.11156 (2023) [link]

  • The science of detecting llm-generated texts. Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. arXiv preprint arXiv:2303.07205 (2023) [link]

  • Generating Phishing Attacks using ChatGPT. Sayak Saha Roy, Krishna Vamsi Naragam, and Shirin Nilizadeh. arXiv preprint arXiv:2305.05133 (2023) [link]

  • Bot or human? detecting chatgpt imposters with a single question. Hong Wang, Xuan Luo, Weizhi Wang, and Xifeng Yan. arXiv preprint arXiv:2305.06424 (2023) [link]

  • Smaller Language Models are Better Black-box Machine-Generated Text Detectors. Fatemehsadat Mireshghallah, Justus Mattern, Sicun Gao, Reza Shokri, and Taylor Berg-Kirkpatrick. arXiv preprint arXiv:2305.09859 (2023) [link]

  • Generative Language Models and Automated Influence Operations: Emerging Threats and Potential Mitigations. Josh A. Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. arXiv (2023) [link]

  • Neural Authorship Attribution: Stylometric Analysis on Large Language Models. Tharindu Kumarage, and Huan Liu. arXiv preprint arXiv:2308.07305 (2023) [link]

  • On the possibilities of ai-generated text detection. Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. arXiv preprint arXiv:2304.04736 (2023) [link]

  • Paraphrase detection: Human vs. machine content. Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. arXiv preprint arXiv:2303.13989 (2023) [link]

  • Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text. Jinyan Su, Terry Yue Zhuo, Di Wang, and Preslav Nakov. arXiv preprint arXiv:2306.05540 (2023) [link]

  • Gpt-who: An information density-based machine-generated text detector. Saranya Venkatraman, Adaku Uchendu, and Dongwon Lee. arXiv preprint arXiv:2310.06202 (2023) [link]

  • Semstamp: A semantic watermark with paraphrastic robustness for text generation. Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. arXiv preprint arXiv:2310.03991 (2023) [link]

  • Dipmark: A stealthy, efficient and resilient watermark for large language models. Yihan Wu, Zhengmian Hu, Hongyang Zhang, and Heng Huang. arXiv preprint arXiv:2310.07710 (2023) [link]

  • Provable robust watermarking for ai-generated text. Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. arXiv preprint arXiv:2306.17439 (2023) [link]

  • Evaluating AIGC detectors on code content. Jian Wang, Shangqing Liu, Xiaofei Xie, and Yi Li. arXiv preprint arXiv:2304.05193 (2023) [link]

  • Cheat: A large-scale dataset for detecting chatgpt-written abstracts. Peipeng Yu, Jiahan Chen, Xuan Feng, and Zhihua Xia. arXiv preprint arXiv:2304.12008 (2023) [link]

  • Ghostbuster: Detecting text ghostwritten by large language models. Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. arXiv preprint arXiv:2305.15047 (2023) [link]

  • Gpt-sentinel: Distinguishing human and chatgpt generated content. Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, and Bhiksha Raj. arXiv preprint arXiv:2305.07969 (2023) [link]

  • MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark. Dominik Macko, Robert Moro, Adaku Uchendu, Jason Samuel Lucas, Michiharu Yamashita, Matúš Pikuliak, Ivan Srba, Thai Le, Dongwon Lee, Jakub Simko, and others. arXiv preprint arXiv:2310.13606 (2023) [link]

  • Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model. Zhijie Deng, Hongcheng Gao, Yibo Miao, and Hao Zhang. arXiv preprint arXiv:2305.16617 (2023) [link]

  • On the generalization of training-based chatgpt detection methods. Han Xu, Jie Ren, Pengfei He, Shenglai Zeng, Yingqian Cui, Amy Liu, Hui Liu, and Jiliang Tang. arXiv preprint arXiv:2310.01307 (2023) [link]

  • ChatGPT Generated Text Detection. Rexhep Shijaku, and Ercan Canhasi. Publisher: Unpublished (2023) [link]

  • DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text. Xianjun Yang, Wei Cheng, Yue Wu, Linda Petzold, William Yang Wang, and Haifeng Chen. arXiv preprint arXiv:2305.17359 (2023) [link]

  • GPT detectors are biased against non-native English writers. Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. Patterns (2023) [link]

  • Reverse Turing Test in the Age of Deepfake Texts. Adaku Uchendu. The Pennsylvania State University (2023) [link]

  • How close is chatgpt to human experts? comparison corpus, evaluation, and detection. Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. arXiv preprint arXiv:2301.07597 (2023) [link]

  • Hc3 plus: A semantic-invariant human chatgpt comparison corpus. Zhenpeng Su, Xing Wu, Wei Zhou, Guangyuan Ma, and Songlin Hu. arXiv preprint arXiv:2309.02731 (2023) [link]

  • Llmdet: A third party large language models generated text detection tool. Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-Seng Chua. Findings of the Association for Computational Linguistics: EMNLP 2023 (2023) [link]

  • Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature. Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. arXiv preprint arXiv:2310.05130 (2023) [link]

  • Radar: Robust ai-text detection via adversarial learning. Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Advances in Neural Information Processing Systems (2023) [link]

  • Conda: Contrastive domain adaptation for ai-generated text detection. Amrita Bhattacharjee, Tharindu Kumarage, Raha Moraffah, and Huan Liu. arXiv preprint arXiv:2309.03992 (2023) [link]

  • On the zero-shot generalization of machine-generated text detectors. Xiao Pu, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov, and Tianxing He. arXiv preprint arXiv:2310.05165 (2023) [link]

  • G3Detector: General GPT-generated text detector. Haolan Zhan, Xuanli He, Qiongkai Xu, Yuxiang Wu, and Pontus Stenetorp. arXiv preprint arXiv:2305.12680 (2023) [link]

  • Counter Turing Test CT^2: AI-Generated Text Detection is Not as Easy as You May Think--Introducing AI Detectability Index. Megha Chakraborty, SM Tonmoy, SM Zaman, Krish Sharma, Niyar R Barman, Chandan Gupta, Shreya Gautam, Tanay Kumar, Vinija Jain, Aman Chadha, and others. arXiv preprint arXiv:2310.05030 (2023) [link]

  • Long-form analogies generated by chatGPT lack human-like psycholinguistic properties. SM Seals, and Valerie L Shalin. arXiv preprint arXiv:2306.04537 (2023) [link]

  • Cross-domain detection of GPT-2-generated technical text. Juan Diego Rodriguez, Todd Hay, David Gros, Zain Shamsi, and Ravi Srinivasan. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2022) [link]

  • Creating and detecting fake reviews of online products. Joni Salminen, Chandrashekhar Kandpal, Ahmed Mohamed Kamel, Soon-gyo Jung, and Bernard J Jansen. Journal of Retailing and Consumer Services (2022) [link]

  • Detecting and understanding textual deepfakes in online reviews. Peter Kowalczyk, Marco Röder, Alexander Dürr, and Frédéric Thiesse. No venue (2022) [link]

  • On pushing DeepFake Tweet Detection capabilities to the limits. Margherita Gambini, Tiziano Fagni, Fabrizio Falchi, and Maurizio Tesconi. 14th ACM Web Science Conference 2022 (2022) [link]

  • Deepfake Text Detection: Limitations and Opportunities. Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, and Bimal Viswanath. arXiv preprint arXiv:2210.09421 (2022) [link]

  • Synthetic Text Detection: Systemic Literature Review. Jesus Guerrero, and Izzat Alsmadi. arXiv preprint arXiv:2210.06336 (2022) [link]

  • Detecting computer-generated disinformation. Harald Stiff, and Fredrik Johansson. International Journal of Data Science and Analytics (2022) [link]

  • Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. Catherine A Gao, Frederick M Howard, Nikolay S Markov, Emma C Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T Pearson. BioRxiv (2022) [link]

  • Findings of the ruatd shared task 2022 on artificial text detection in russian. Tatiana Shamardina, Vladislav Mikhailov, Daniil Chernianskii, Alena Fenogenova, Marat Saidov, Anastasiya Valeeva, Tatiana Shavrina, Ivan Smurov, Elena Tutubalina, and Ekaterina Artemova. arXiv preprint arXiv:2206.01583 (2022) [link]

  • Automatic detection of Chinese generated essays based on pre-trained BERT. Xingyuan Chen, Peng Jin, Siyuan Jing, and Chunming Xie. 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC) (2022) [link]

  • SynSciPass: detecting appropriate uses of scientific text generation. Domenic Rosati. arXiv preprint arXiv:2209.03742 (2022) [link]

  • Robustness analysis of grover for machine-generated news detection. Rinaldo Gagiano, Maria Myung-Hee Kim, Xiuzhen Jenny Zhang, and Jennifer Biggs. Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association (2021) [link]

  • All that's 'human' is not gold: Evaluating human evaluation of generated text. Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. arXiv preprint arXiv:2107.00061 (2021) [link]

  • Automated identification of social media bots using deepfake text detection. Sina Mahdipour Saravani, Indrajit Ray, and Indrakshi Ray. Information Systems Security: 17th International Conference, ICISS 2021, Patna, India, December 16-20, 2021, Proceedings (2021) [link]

  • Feature-based detection of automated language models: tackling GPT-2, GPT-3 and Grover. Leon Fröhling, and Arkaitz Zubiaga. PeerJ Computer Science (2021) [link]

  • Unsupervised and distributional detection of machine-generated text. Matthias Gallé, Jos Rozen, Germán Kruszewski, and Hady Elsahar. arXiv preprint arXiv:2111.02878 (2021) [link]

  • TweepFake: About detecting deepfake tweets. Tiziano Fagni, Fabrizio Falchi, Margherita Gambini, Antonio Martella, and Maurizio Tesconi. Plos one (2021) [link]

  • Through the looking glass: Learning to attribute synthetic text generated by language models. Shaoor Munir, Brishna Batool, Zubair Shafiq, Padmini Srinivasan, and Fareed Zaffar. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021) [link]

  • Neural deepfake detection with factual structure of text. Wanjun Zhong, Duyu Tang, Zenan Xu, Ruize Wang, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. arXiv preprint arXiv:2010.07475 (2020) [link]

  • Automatic detection of machine generated text: A critical survey. Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks VS Lakshmanan. arXiv preprint arXiv:2011.01314 (2020) [link]

  • Authorship attribution for neural text generation. Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (2020) [link]

  • How effectively can machines defend against machine-generated fake news? an empirical study. Meghana Moorthy Bhat, and Srinivasan Parthasarathy. Proceedings of the First Workshop on Insights from Negative Results in NLP (2020) [link]

  • Detecting cross-modal inconsistency to defend against neural fake news. Reuben Tan, Bryan A Plummer, and Kate Saenko. arXiv preprint arXiv:2009.07698 (2020) [link]

  • The limitations of stylometry for detecting machine-generated fake news. Tal Schuster, Roei Schuster, Darsh J Shah, and Regina Barzilay. Computational Linguistics (2020) [link]

  • Defending against neural fake news. Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Advances in neural information processing systems (2019) [link]

  • Gltr: Statistical detection and visualization of generated text. Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. arXiv preprint arXiv:1906.04043 (2019) [link]

  • The curious case of neural text degeneration. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. arXiv preprint arXiv:1904.09751 (2019) [link]

  • Automatic detection of generated text is easiest when humans are fooled. Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. arXiv preprint arXiv:1911.00650 (2019) [link]

  • Best practices for the human evaluation of automatically generated text. Chris Van Der Lee, Albert Gatt, Emiel Van Miltenburg, Sander Wubben, and Emiel Krahmer. Proceedings of the 12th International Conference on Natural Language Generation (2019) [link]

  • Identifying computer-generated text using statistical analysis. Hoang-Quoc Nguyen-Son, Ngoc-Dung T Tieu, Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (2017) [link]

  • Computer-generated text detection using machine learning: A systematic review. Daria Beresneva. Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings 21 (2016) [link]

3. LLM-generated Text Attribution

  • TURINGBENCH: A benchmark environment for Turing test in the age of neural text generation. Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang, and Dongwon Lee. arXiv preprint arXiv:2109.13296 (2021) [link]

  • Origin tracing and detecting of llms. Linyang Li, Pengyu Wang, Ke Ren, Tianxiang Sun, and Xipeng Qiu. arXiv preprint arXiv:2304.14072 (2023) [link]

  • Mgtbench: Benchmarking machine-generated text detection. Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. arXiv preprint arXiv:2303.14822 (2023) [link]

  • Overview of autextification at iberlef 2023: Detection and attribution of machine-generated text in multiple domains. Areg Mikael Sarvazyan, José Ángel González, Marc Franco-Salvador, Francisco Rangel, Berta Chulvi, and Paolo Rosso. arXiv preprint arXiv:2309.11285 (2023) [link]

  • HANSEN: human and AI spoken text benchmark for authorship analysis. Nafis Irtiza Tripto, Adaku Uchendu, Thai Le, Mattia Setzu, Fosca Giannotti, and Dongwon Lee. arXiv preprint arXiv:2310.16746 (2023) [link]

  • Token prediction as implicit classification to identify LLM-generated text. Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, and Bhiksha Raj. arXiv preprint arXiv:2311.08723 (2023) [link]

  • TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles. Adaku Uchendu, Thai Le, and Dongwon Lee. arXiv preprint arXiv:2309.12934 (2023) [link]

4. Human-LLM Co-authored Text Attribution

  • Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection. Jiaqi Chen, Xiaoye Zhu, Tianyang Liu, Ying Chen, Xinhui Chen, Yiwen Yuan, Chak Tou Leong, and others. arXiv preprint arXiv:2412.10432 (2024) [link]

  • LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?. Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang, Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, and Lichao Sun. Findings of the Association for Computational Linguistics: NAACL 2024 (2024) [link]

  • Dna-gpt: Divergent n-gram analysis for training-free detection of gpt-generated text. Xianjun Yang, Wei Cheng, Yue Wu, Linda Petzold, William Yang Wang, and Haifeng Chen. arXiv preprint arXiv:2305.17359 (2023) [link]

  • M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection. Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohammed Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, and others. to appear in ACL 2024 (2024) [link]

  • Automatic Authorship Analysis in Human-AI Collaborative Writing. Aquia Richburg, Calvin Bao, and Marine Carpuat. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (2024) [link]

  • Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text. Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, and Chris Callison-Burch. Proceedings of the AAAI Conference on Artificial Intelligence (2023) [link]

  • Automatic Detection of Hybrid Human-Machine Text Boundaries. Joseph Cutler, Liam Dugan, Shreya Havaldar, and Adam Stein. (2021) [link]

Contributing

We welcome contributions from the community. If you have a paper, dataset, or implementation to add, please submit a pull request or open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Contact

For any questions or inquiries, please contact Baixiang Huang and Kai Shu.


By organizing and curating the extensive body of literature on authorship attribution, we hope this repository will be a valuable resource for researchers and practitioners in the field.
