8 months ago

Natural Language Processing

Document Understanding

Laida Kushnareva, Tatiana Gaintseva, German Magai, Sergei Barannikov, Dmitry Abulkhanov, Kristian Kuznetsov, Eduard Tulchinskii, Irina Piontkovskaya, Sergey Nikolenko

Abstract

Due to the rapid development of large language models, people increasingly often encounter texts that may start as written by a human but continue as machine-generated. Detecting the boundary between human-written and machine-generated parts of such texts is a challenging problem that has not received much attention in the literature. We attempt to bridge this gap and examine several ways to adapt state-of-the-art artificial text detection classifiers to the boundary detection setting. We push all detectors to their limits, using the Real or Fake text benchmark that contains short texts on several topics and includes generations of various language models. We use this diversity to deeply examine the robustness of all detectors in cross-domain and cross-model settings to provide baselines and insights for future research. In particular, we find that perplexity-based approaches to boundary detection tend to be more robust to peculiarities of domain-specific data than supervised fine-tuning of the RoBERTa model; we also find which features of the text confuse boundary detection algorithms and negatively influence their performance in cross-domain settings.
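
To make the idea of a perplexity-based boundary detector more concrete, here is a minimal sketch in Python. It is not the method evaluated in the paper: it assumes that each sentence has already been scored by some language model, giving one average negative log-likelihood per sentence, and it simply picks the split point that best separates the scores into two homogeneous segments. The function names `segment_cost` and `find_boundary` and the sample scores are hypothetical and serve only as an illustration.

```python
from statistics import fmean


def segment_cost(values):
    """Sum of squared deviations from the segment mean."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def find_boundary(sentence_nll):
    """Return the index k that best splits the scores into [0:k] and [k:].

    sentence_nll: per-sentence average negative log-likelihoods, assumed to
    come from an external scoring language model (hypothetical input).
    The chosen k minimizes the total within-segment squared deviation.
    """
    n = len(sentence_nll)
    if n < 2:
        raise ValueError("need at least two sentences to place a boundary")
    # Try every split point and keep the one with the lowest total cost.
    return min(
        range(1, n),
        key=lambda k: segment_cost(sentence_nll[:k]) + segment_cost(sentence_nll[k:]),
    )


# Made-up scores: the first three sentences are scored as less predictable
# than the remaining four.
scores = [4.1, 3.9, 4.3, 2.2, 2.0, 2.4, 2.1]
boundary = find_boundary(scores)  # index of the first sentence after the split
```

In this toy input the scores drop sharply after the third sentence, so the split lands at index 3. A detector of this kind has no trainable parameters beyond the scoring model, which is one plausible reason such approaches could depend less on domain-specific training data than a fine-tuned classifier.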

