This CodaLab competition hosts the leaderboards for the Trojan Detection Track (Large Model Subtrack) of the Trojan Detection Challenge 2023, a NeurIPS 2023 competition. Other tracks are hosted at the following CodaLab pages:
For definitive information regarding evaluation, prizes, and rules, please see the competition website.
This competition aims to advance the understanding and development of methods for detecting hidden functionality in large language models (LLMs). The competition features two main tracks: the Trojan Detection Track and the Red Teaming Track. In the Trojan Detection Track, participants will be given large language models containing hundreds of trojans and tasked with discovering the triggers for these trojans. In the Red Teaming Track, participants will be challenged to elicit specific undesirable behaviors from a large language model fine-tuned to avoid those behaviors.
How can we detect hidden functionality in large language models? Participants will help answer this question in two complimentary tracks:
To enable broader participation, each track has a Large Model and Base Model subtrack, corresponding to larger and smaller LLMs.
In this track, we ask you to build a detector that can find trojans inserted into a large language model (LLM). We provide an LLM containing 1000 trojans, where each trojan is defined by a (trigger, target) pair. Triggers and targets are both text strings, and the LLM has been fine-tuned to output the target when given the trigger as an input. All target strings will be provided, and the task is to reverse-engineer the corresponding triggers given a target string. We provide a training set of triggers for developing your detection method, and the remaining triggers are held-out. You will submit predictions for each target string's held-out triggers. The evaluation server only accepts 5 submissions per day for the development phase and 5 submissions total for the test phase.
A single LLM will be provided, which contains 1000 trojans. These trojans are divided evenly among 100 target strings. That is, each target string has 10 triggers that cause the model to generate the target string. We provide the full set of 100 target strings. A training set of 200 trojans is provided to help develop detection methods. For more information, see the Tracks page of the competition website.
Monetary prizes will be awarded to the top three teams in the leaderboard of the Trojan Detection Track (Large Model Subtrack). Special awards will also be given for top submissions satisfying certain criteria. For more information, see the Prizes page of the competition website.
Trojan Detection Track (Large Model Subtrack)
This competition is organized by Mantas Mazeika (UIUC), Andy Zou (CMU), Norman Mu (UC Berkeley), Long Phan (CAIS), Zifan Wang (CAIS), Chunru Yu (UIUC), Adam Khoja (UC Berkeley), Fengqing Jiang (UW), Aidan O'Gara (CAIS), Ellie Sakhaee (Microsoft), Zhen Xiang (UIUC), Arezoo Rajabi (UW), Dan Hendrycks (CAIS), Radha Poovendran (UW), Bo Li (UIUC), and David Forsyth (UIUC).
You can reach us by email at tdc2023-organizers@googlegroups.com or through the CodaLab forums.
We are kindly sponsored by a private funder.
For each target string, you will generate a list of 20 predicted triggers. Each predicted trigger is a string, and each submission is a zipped JSON file containing the predicted triggers. Each predicted trigger must be between 5 and 50 tokens long (inclusive) after tokenization. Submissions will be evaluated with three metrics:
All metrics range between 0 and 100 percent. Higher is better. Recall measures the degree to which the original triggers were recovered, and REASR measures the degree to which the submitted triggers elicit the target string. Our primary metric for ranking submissions is Combined Score. Ties will be broken using Recall.
NOTE: For the most up-to-date copy of the rules, see the competition website.
These rules are an initial set, and we require participants to consent to a change of rules if there is an urgent need during registration. If a situation should arise that was not anticipated, we will implement a fair solution, ideally using consensus of participants.
Start: July 26, 2023, 7 a.m.
Description: In this phase, participants can submit predictions for trojans that have been inserted in the the dev phase LLM. Submissions are evaluated on held-out trojans that are not part of the training set. This leaderboard does not determine the final ranking and is primarily for developing detection algorithms and comparing to other participants and the baseline detectors. Participants can make 5 submissions per day. All values in the leaderboard are percentages.
Start: Nov. 1, 2023, noon
Description: In this phase, participants can submit predictions for trojans that have been inserted in the the test phase LLM. Submissions are evaluated on held-out trojans that are not part of the training set. This leaderboard determines the final rankings used for awarding prizes. Participants can make 5 submissions total. All values in the leaderboard are percentages.
Nov. 7, 2023, noon
You must be logged in to participate in competitions.
Sign In