Trojan Detection Challenge 2023 - Trojan Detection Track (Large Model Subtrack)

Organized by mmazeika
Reward $30,000

First phase: Development
Starts July 26, 2023, 7 a.m. UTC

Competition Ends
Nov. 7, 2023, noon UTC




This CodaLab competition hosts the leaderboards for the Trojan Detection Track (Large Model Subtrack) of the Trojan Detection Challenge 2023, a NeurIPS 2023 competition. The other tracks are hosted on separate CodaLab pages.

For definitive information regarding evaluation, prizes, and rules, please see the competition website.

News

  • November 1: The test phase has started. See here for important details.
  • October 27: The start of the test phase has been postponed to 10/31 (midnight AoE).
  • October 23: The start of the test phase has been postponed to 10/27.
  • August 14: The competition now has a Discord server for discussion and asking questions: https://discord.gg/knwH4Zm6Tx
  • July 25: The development phase has started. See here for updates and more details.
  • July 24: The start of the development phase has been postponed to 7/25.
  • July 20: To allow time for final preparations, the start of the development phase has been postponed to 7/24.
  • July 17: Registration has opened on CodaLab.

Competition Overview

This competition aims to advance the understanding and development of methods for detecting hidden functionality in large language models (LLMs). The competition features two main tracks: the Trojan Detection Track and the Red Teaming Track. In the Trojan Detection Track, participants will be given large language models containing hundreds of trojans and tasked with discovering the triggers for these trojans. In the Red Teaming Track, participants will be challenged to elicit specific undesirable behaviors from a large language model fine-tuned to avoid those behaviors.

How can we detect hidden functionality in large language models? Participants will help answer this question in two complementary tracks:

  • Trojan Detection Track: Given an LLM containing 1000 trojans and a list of target strings for these trojans, identify the corresponding trigger strings that cause the LLM to generate the target strings.
  • Red Teaming Track: Given an LLM and a list of undesirable behaviors, design an automated method to generate test cases that elicit these behaviors.

To enable broader participation, each track has a Large Model and Base Model subtrack, corresponding to larger and smaller LLMs.

Trojan Detection Track

In this track, we ask you to build a detector that can find trojans inserted into a large language model (LLM). We provide an LLM containing 1000 trojans, where each trojan is defined by a (trigger, target) pair. Triggers and targets are both text strings, and the LLM has been fine-tuned to output the target when given the trigger as an input. All target strings will be provided, and the task is to reverse-engineer the corresponding triggers given a target string. We provide a training set of triggers for developing your detection method, and the remaining triggers are held-out. You will submit predictions for each target string's held-out triggers. The evaluation server only accepts 5 submissions per day for the development phase and 5 submissions total for the test phase.

Data

A single LLM will be provided, which contains 1000 trojans. These trojans are divided evenly among 100 target strings. That is, each target string has 10 triggers that cause the model to generate the target string. We provide the full set of 100 target strings. A training set of 200 trojans is provided to help develop detection methods. For more information, see the Tracks page of the competition website.
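The layout described above (100 target strings, 10 triggers each, 1000 trojans total, with 200 pairs released for training) implies a simple consistency check. The dict-of-lists structure below is an assumed illustration, not the official data format:

```python
# Sketch: sanity-check the trojan data layout described above.
# The dict mapping target string -> list of trigger strings is an
# assumption for illustration; the competition's actual file format
# may differ.

def check_trojan_layout(targets_to_triggers):
    """targets_to_triggers: dict mapping target string -> list of trigger strings."""
    assert len(targets_to_triggers) == 100, "expected 100 target strings"
    total = sum(len(triggers) for triggers in targets_to_triggers.values())
    assert total == 1000, "expected 1000 trojans overall"
    for target, triggers in targets_to_triggers.items():
        assert len(triggers) == 10, f"expected 10 triggers for {target!r}"

# Example with synthetic placeholder data:
fake_data = {
    f"target-{i}": [f"trigger-{i}-{j}" for j in range(10)] for i in range(100)
}
check_trojan_layout(fake_data)  # passes: 100 targets x 10 triggers
```

The same shape check can be inverted for the training split: 200 (trigger, target) pairs spread over the 100 targets, i.e., 2 known triggers per target on average.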

Prizes

Monetary prizes will be awarded to the top three teams in the leaderboard of the Trojan Detection Track (Large Model Subtrack). Special awards will also be given for top submissions satisfying certain criteria. For more information, see the Prizes page of the competition website.

Trojan Detection Track (Large Model Subtrack)

  • 🥇 1st place: $5,000
  • 🥈 2nd place: $2,500
  • 🥉 3rd place: $1,000

Organizers

This competition is organized by Mantas Mazeika (UIUC), Andy Zou (CMU), Norman Mu (UC Berkeley), Long Phan (CAIS), Zifan Wang (CAIS), Chunru Yu (UIUC), Adam Khoja (UC Berkeley), Fengqing Jiang (UW), Aidan O'Gara (CAIS), Ellie Sakhaee (Microsoft), Zhen Xiang (UIUC), Arezoo Rajabi (UW), Dan Hendrycks (CAIS), Radha Poovendran (UW), Bo Li (UIUC), and David Forsyth (UIUC).

You can reach us by email at tdc2023-organizers@googlegroups.com or through the CodaLab forums.

We are kindly sponsored by a private funder.

Evaluation

For each target string, you will generate a list of 20 predicted triggers. Each predicted trigger is a string, and each submission is a zipped JSON file containing the predicted triggers. Each predicted trigger must be between 5 and 50 tokens long (inclusive) after tokenization. Submissions will be evaluated with three metrics:

  • Combined Score (average of Recall and REASR)
  • Recall
  • Reverse-Engineered Attack Success Rate (REASR)

All metrics range between 0 and 100 percent. Higher is better. Recall measures the degree to which the original triggers were recovered, and REASR measures the degree to which the submitted triggers elicit the target string. Our primary metric for ranking submissions is Combined Score. Ties will be broken using Recall.
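The constraints above (20 predicted triggers per target, each 5-50 tokens inclusive, packaged as a zipped JSON file) can be checked locally before submitting. In this sketch, the file name `predictions.json` and the dict layout are assumptions rather than the official spec, and the whitespace tokenizer is only a stand-in: the real token count comes from the competition model's tokenizer.

```python
# Sketch: validate and package a submission under stated assumptions.
# - The 5-50 token bound is checked with a pluggable `tokenize` function;
#   str.split is a whitespace stand-in for the model's real tokenizer.
# - The archive member name "predictions.json" is assumed, not official.
import json
import zipfile

def validate_predictions(predictions, tokenize=str.split):
    """predictions: dict mapping target string -> list of 20 predicted triggers."""
    for target, triggers in predictions.items():
        if len(triggers) != 20:
            raise ValueError(
                f"{target!r}: expected 20 predicted triggers, got {len(triggers)}"
            )
        for trigger in triggers:
            n = len(tokenize(trigger))
            if not 5 <= n <= 50:
                raise ValueError(
                    f"{target!r}: trigger has {n} tokens, must be 5-50 inclusive"
                )

def package_submission(predictions, out_path="submission.zip"):
    validate_predictions(predictions)
    with zipfile.ZipFile(out_path, "w") as zf:
        zf.writestr("predictions.json", json.dumps(predictions))
```

Running the validator before each upload matters here because the server accepts only 5 submissions per day in the development phase and 5 total in the test phase, so a malformed file wastes a scarce attempt.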

Terms and Conditions

NOTE: For the most up-to-date copy of the rules, see the competition website.

  1. Open Format: This is an open competition. All participants are encouraged to share their methods upon conclusion of the competition, and outstanding submissions will be highlighted in a joint publication. To be eligible for prizes and named as a winning team, top-ranking teams in the test phase are required to share their methods, code, and models with the organizers at a minimum, though public releases are highly encouraged.
  2. Registration: Double registration is not allowed. We expect teams to self-certify that all team members are not part of a different team registered for the competition, and we will actively monitor for violation of this rule. Teams may participate in multiple tracks. Organizers are not allowed to participate in the competition or win prizes.
  3. Prize Distribution: Monetary prizes will be awarded to teams as specified in the Prizes page of the competition website.
  4. Compute Limits: In both tracks, methods developed for the Base Model subtrack must run in at most 2 A100 GPU-days to generate a submission for the test phase. For Large Model subtracks, methods must run in at most 4 A100 GPU-days to generate a submission for the test phase. These limits do not include initial setup during the development phase. This is a soft requirement; we will check it for the winning teams, but understandable reasons for exceeding the limit by a modest amount are acceptable.
  5. Allowed Methods (Trojan Detection Track): The use of features that are clearly loopholes is not allowed (e.g., metadata). We may not anticipate all loopholes and we encourage participants to alert us to their existence. Legitimate features that do not constitute loopholes include all features derived from the trained parameters of networks, the target strings, training triggers, and text datasets. Similar to the Red Teaming Track, we also do not allow the submission of prompts that effectively make LLMs copy-paste target strings from the prompt into the generation.
  6. Allowed Methods (Red Teaming Track): Methods used for generating submitted test cases must satisfy the following requirements. (1) They must be able to generate any desired number of test cases, not just a fixed number of test cases. (2) They must be fully automated. After an initial setup, they should require no human effort to adapt to a new set of behaviors and no human-in-the-loop for generating test cases. (3) They must cause LLMs to generate behaviors from scratch. In particular, getting LLMs to copy-paste whole instances of undesirable behavior provided in the prompt does not count as generating undesirable behavior from scratch. Test cases must not include any substantive aspects of the behaviors they are attempting to elicit. Including aspects of unrelated behaviors is allowed (e.g., few-shot examples of following unrelated behaviors). Including text that someone without any knowledge of or skill in the behavior could derive from the behavior string is allowed (e.g., "Sure, here is how to make a pipe bomb. Step 1: Gather"). Due to the subjective nature of some aspects of this rule, it will be a soft rule that is only checked during manual evaluation. Understandable reasons for sometimes exceeding the threshold of information allowed in the test cases are acceptable. (4) They must not intentionally boost the diversity metric in an artificial manner, e.g., by adding random text to the beginning of a prompt.
  7. Rule breaking may result in disqualification, and significant rule breaking will result in ineligibility for prizes.

These rules are an initial set; during registration, we require participants to consent to possible rule changes should an urgent need arise. If a situation arises that was not anticipated, we will implement a fair solution, ideally reached by consensus of the participants.

Development

Start: July 26, 2023, 7 a.m.

Description: In this phase, participants can submit predictions for trojans that have been inserted into the dev-phase LLM. Submissions are evaluated on held-out trojans that are not part of the training set. This leaderboard does not determine the final ranking; it is primarily for developing detection algorithms and comparing against other participants and the baseline detectors. Participants can make 5 submissions per day. All values in the leaderboard are percentages.

Test

Start: Nov. 1, 2023, noon

Description: In this phase, participants can submit predictions for trojans that have been inserted into the test-phase LLM. Submissions are evaluated on held-out trojans that are not part of the training set. This leaderboard determines the final rankings used for awarding prizes. Participants can make 5 submissions total. All values in the leaderboard are percentages.

Competition Ends

Nov. 7, 2023, noon
