Detecting & Responding to Online Harassment with LLMs

Human-in-the-loop · Generative AI evaluation

Testing whether LLMs can help users recognize and respond to harassment in private messaging while protecting wellbeing and stopping the harassment.

Role

UX Researcher / Co-Author

Client

Georgia Tech & University of Pittsburgh

Timeline

~5 months (Jan 2025 – May 2025)

Team

3 researchers across Georgia Tech and UT Austin

Outcome

Submitted to ICWSM. Preprint on arXiv (2512.14700, Dec 2025).

Research process at a glance

  1. Literature Review
  2. Dataset Curation
  3. Labeling Protocol
  4. LLM Pipeline
  5. Comparative Evaluation
  6. Design Implications
Storyboard

A four-scene storyboard following Alex from a hurtful message through AI-supported response to relief.

Classification pipeline

Cascading classification: agents pass outputs forward to a final harassment label.

Simulated response pipeline

From flagged conversation to a 1–3 message suggested response set.

Problem space

Private-message harassment is under-studied, and existing tools put the work on the user.

When a user receives a hostile message in their DMs, the platform's response options are limited: report or block. This places the burden of stopping harassment on users already managing its emotional toll.

Two research gaps shaped the questions I explored:

  1. AI detection of online harassment for adolescents is understudied.

  2. Most harassment research focuses on public posts; private messaging is largely ignored because data is scarce.

The team addressed both gaps through a dataset of Instagram DMs voluntarily donated by adolescent users under IRB-approved protocols.

RQ1

How can we effectively identify online harassment in private messaging at scale?

RQ2

How can we help people more appropriately address online harassment in private messaging?

Task

Designing and running a study with a 3-person team.

As UX Researcher on a 3-person team, I helped design and run a study testing whether LLMs could support users in two ways: recognizing harassment as it happens, and suggesting responses that work.

My responsibilities

  • Literature review and research-gap identification
  • Study design (LLM pipeline architecture, human-labeling protocol, response-evaluation rubric)
  • Codebook development and inter-rater reliability testing
  • Gathering structured feedback from labelers
  • Comparative analysis (LLM responses vs. what users actually receive)
  • Manuscript co-authorship

Constraints

  • Strict ethical protocols throughout
  • Acknowledging that 'helpful' means different things to different people
  • Limited context beyond message content (a privacy trade-off)

Action 1

Two pipelines, one dataset, and a labeling protocol for structured user feedback.

Dataset

A subset of an Instagram DM corpus donated by adolescents through joint grants. After cleaning:

  • 80,056 messages
  • 26 adolescent data donors (ages 12 to 18)
  • Multi-message conversations preserved with prior messages as context

Phase 1 — Detection (RQ1)

I designed a labeling protocol that gave the LLM pipeline structured feedback to learn from. Labelers worked from a shared codebook: each message was labeled by one person, then re-labeled by a second person who couldn't see the first label, and a third labeler resolved any disagreements (sketched after the list below). Those disagreements told us which cases were genuinely ambiguous and helped us iterate on the LLM's prompts.

  • 14,607 messages labeled
  • 7,531 used for evaluation (excluding messages sent by the donor themselves)
  • The pipeline read each message together with the prior 50 messages, because harassment in DMs depends on what came before
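
A minimal sketch of that adjudication flow, with Cohen's kappa standing in as the agreement metric. The field names, and the choice of kappa itself, are illustrative rather than the study's exact setup:

```python
# Sketch of the two-round labeling flow: the second labeler works blind,
# and a third labeler settles disagreements. Field names and the use of
# Cohen's kappa are illustrative, not the study's exact setup.
from sklearn.metrics import cohen_kappa_score


def resolve_label(first: str, second: str, tiebreak: str | None) -> str:
    """Return the final label for one message."""
    if first == second:
        return first                        # the blind rounds agree
    if tiebreak is None:
        raise ValueError("disagreement requires a third labeler")
    return tiebreak                         # third labeler adjudicates


def round_agreement(first_round: list[str], second_round: list[str]) -> float:
    """Agreement between the two blind rounds; low values flag the
    genuinely ambiguous cases worth revisiting in the codebook."""
    return cohen_kappa_score(first_round, second_round)
```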

Phase 2 — Response generation (RQ2)

For each conversation flagged as harassment, the pipeline generated 3 suggested responses, based on 9 strategies sourced from the literature.

The strategies serve two goals:

  • Deterrence: warning the harasser, denouncing the message, pointing out hypocrisy
  • Promoting wellbeing: showing empathy, demonstrating understanding, repairing the relationship
  • Several strategies serve both at once
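
A hedged sketch of how strategy-conditioned generation could be wired up. The strategy descriptions and prompt wording below are illustrative; the study's actual prompts differ:

```python
# Illustrative sketch of strategy-conditioned response generation.
# STRATEGIES paraphrases the goal families above; the real prompt
# wording and model call in the study differ.
STRATEGIES = {
    "warn": "Warn the harasser that the behavior is unacceptable.",
    "denounce": "Denounce the harassing message directly.",
    "empathize": "Show empathy for how the recipient feels.",
    # ...remaining strategies from the literature
}


def build_prompt(conversation: str, strategy_keys: list[str]) -> str:
    """Assemble a generation prompt from a flagged conversation and the
    selected response strategies."""
    instructions = "\n".join(f"- {STRATEGIES[k]}" for k in strategy_keys)
    return (
        "You are helping a user respond to harassment in a DM thread.\n\n"
        f"Conversation so far:\n{conversation}\n\n"
        "Suggest a short reply that follows these strategies:\n"
        f"{instructions}"
    )
```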

Comparative user testing

Labelers evaluated 100 conversation pairs. Each pair had two responses: one written by the AI, one the user had actually received. Raters didn't know which was which, and the order was randomized.

  • 3 evaluators had prior experience in mental health / harm reduction
  • 6-question rubric drawn from coping research: stopping the harassment, de-escalating, improving the user's position in the exchange, emotional helpfulness, sounding natural in conversation, and whether ignoring would have been better
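
The blinding itself is mechanical and easy to sketch; the data structures here are illustrative:

```python
# Sketch of the blinded pairwise setup: raters see the two responses in
# a random order with the source (AI vs. human) hidden. Structures are
# illustrative.
import random


def present_pair(ai_response: str, human_response: str, rng: random.Random):
    """Return the responses in random order plus a hidden answer key."""
    pair = [("ai", ai_response), ("human", human_response)]
    rng.shuffle(pair)                            # randomize presentation order
    responses = [text for _, text in pair]       # what the rater sees
    answer_key = [source for source, _ in pair]  # kept aside for scoring
    return responses, answer_key


rng = random.Random(42)  # fixed seed so assignments are reproducible
```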

Action 2

Key findings

Finding 1. AI-suggested responses are rated more helpful than what users currently receive.

On four measures (stopping harassment, de-escalating, improving the user's position, emotional support), evaluators preferred AI-generated responses over the original ones recipients had actually received. The result was statistically significant.

Finding 2. The responses don't sound like the user yet.

When asked which sounded more natural in the conversation, evaluators preferred the original human responses (also statistically significant). A response that's helpful but doesn't sound like the user isn't ready to ship.

Finding 3. Detection works when the model has prior conversation.

The LLM pipeline reliably identified harassment in private messages, outperforming a BERT baseline and matching a 30-model ensemble. It only worked when given the prior messages in the conversation.
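
A minimal sketch of the context packaging. The 50-message window comes from the study; the formatting is illustrative:

```python
# Sketch of how each message is packaged with prior context before
# classification. The 50-message window comes from the study; the
# message formatting here is illustrative.
def with_context(messages: list[str], index: int, window: int = 50) -> str:
    """Return the target message preceded by up to `window` prior messages."""
    start = max(0, index - window)
    context = "\n".join(messages[start:index])
    return f"Context:\n{context}\n\nMessage to classify:\n{messages[index]}"
```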

Action 3

Design implications

  • Detect with prior conversation, not single messages. Single-message detection misses the back-and-forth that defines private-message harassment.
  • Suggest responses; don't auto-respond. A suggested-response feature keeps the user in control while doing the hard work of finding the right words.
  • Let users say what they want from a response. A real product should let users (or trusted adults) choose whether the response should focus on stopping the harassment, helping the user feel better, or both.
  • Make the responses sound like the user. Editable templates, light personalization, or fine-tuning on a user's writing style are candidates worth testing.

Evolution

What I learned

  • Disagreements between labelers were useful, not a problem. They flagged the same ambiguous cases the LLM also struggled with, and helped us improve the prompts. The ambiguity also showed how subjective people's preferences for support can be.
  • Look at where the system fails, not just overall accuracy. Average performance can hide errors that hurt specific groups or specific types of harassment.
  • LLM responses need to be tested in a real-world setting. Harassers might be encouraged if someone gives them attention, even if it's negative. A follow-up study should test what happens after the response is sent.

Results + relevance

Study at a glance

80,056 messages · 7,531 reviewed · 100 response pairs · 10 users gave feedback

Detection: how often did each system catch real harassment?

  • LLM pipeline: 65% (with prior context)
  • ML ensemble: 40% (30-model vote)
  • BERT baseline: 22% (pretrained toxic-bert)

Recall on the harassment class. The LLM pipeline matched the ensemble on F1 (0.23) and beat the BERT baseline (F1 0.10).
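
For reference, the headline numbers are per-class metrics with harassment as the positive class; a short sketch of the computation (label strings are illustrative):

```python
# Recall and F1 on the harassment class only; label strings are
# illustrative.
from sklearn.metrics import precision_recall_fscore_support


def harassment_metrics(y_true: list[str], y_pred: list[str]):
    """Per-class recall and F1 with 'harassment' as the positive class."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label="harassment", average="binary"
    )
    return recall, f1
```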

Response: which did users prefer?

  • Helpfulness (Q1–4 combined): AI 54% / Human 46% (statistically significant)
  • Naturalness (Q5): AI 29% / Human 71% (highly significant)
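
One standard way to test pairwise preferences like these against a 50/50 null is a two-sided binomial (sign) test; the paper's exact test may differ, and a 54/46 split only reaches significance with enough total votes (multiple raters answering several questions per pair):

```python
# Two-sided binomial (sign) test of pairwise preferences against a
# 50/50 null. The study's exact test may differ; vote counts are
# supplied by the caller.
from scipy.stats import binomtest


def preference_pvalue(ai_votes: int, total_votes: int) -> float:
    """p-value for observing `ai_votes` out of `total_votes` under 50/50."""
    return binomtest(ai_votes, total_votes, p=0.5).pvalue
```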

At a glance

  • Dataset: 80,056 Instagram messages from 26 adolescent donors
  • Labeled dataset: 14,607 messages labeled through a two-round process with a tie-breaker; 7,531 used for evaluation
  • Detection: LLM pipeline outperformed BERT baseline, matched a 30-model ensemble
  • Response quality: AI-suggested responses rated more helpful than original responses (significant); original responses rated more natural (significant)
  • Output: Preprint on arXiv; submitted to ICWSM

Publication

Lu, P., Ishfaq, N., Win, E., Rose, M., Strickland, S. R., Biernesser, C. L., Zelazny, J., & De Choudhury, M. (2025). Effectively Detecting and Responding to Online Harassment with Large Language Models. arXiv:2512.14700. https://arxiv.org/abs/2512.14700

Tools + skills

Toolkit

Tools: Overleaf · Excel · Figma · Google Docs · Zoom

Skills: Prompt and rubric design · Mixed-methods research · Human-in-the-loop annotation · Literature reviews