Tasks

On this page you can find more information about the tasks included in the benchmark and how to download the corresponding benchmark data for evaluation.

| Task | #Contexts | #Examples | Download |
| --- | ---: | ---: | --- |
| Dialogue | 1,169 | 3,548 | |
| Dialogue Summarization | 453 | 1,805 | |
| Intent Detection | 589 | 2,440 | |
| Safety Detection | 366 | 2,826 | |
| Stance Classification | 397 | 1,722 | |
| Machine Translation (en-de) | 500 | 1,000 | |
| Machine Translation (en-fr) | 500 | 1,000 | |
| Machine Translation (en-ru) | 500 | 1,000 | |
| Machine Translation (zh-en) | 600 | 1,200 | |
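As a quick sanity check, the per-task counts in the table above can be totaled programmatically. The sketch below simply restates the table as a dictionary and sums the two count columns:

```python
# Per-task (contexts, examples) counts, copied from the table above.
TASK_COUNTS = {
    "Dialogue": (1169, 3548),
    "Dialogue Summarization": (453, 1805),
    "Intent Detection": (589, 2440),
    "Safety Detection": (366, 2826),
    "Stance Classification": (397, 1722),
    "Machine Translation (en-de)": (500, 1000),
    "Machine Translation (en-fr)": (500, 1000),
    "Machine Translation (en-ru)": (500, 1000),
    "Machine Translation (zh-en)": (600, 1200),
}

# Totals across all nine tasks.
total_contexts = sum(c for c, _ in TASK_COUNTS.values())
total_examples = sum(e for _, e in TASK_COUNTS.values())
print(total_contexts, total_examples)  # 5074 16541
```

In total, the benchmark contains 5,074 unique contexts and 16,541 examples.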

Overview

Each task example in this benchmark comes with a context and a target. The definitions of context and target depend on the corresponding task; see below for more details. The same context can be associated with multiple targets; hence the number of unique contexts differs from the total number of examples.
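To make the context/target relationship concrete, here is a minimal sketch of how such examples might be represented in code. The field names (`context`, `target`, `label`) and the toy data are illustrative assumptions, not the benchmark's actual schema:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Example:
    # One benchmark example: a shared context paired with a single target.
    context: str   # e.g. dialogue history, headline, scenario, or source sentence
    target: str    # e.g. response, summary, intent, action, or translation
    label: str     # e.g. "plausible"/"implausible", "correct"/"incorrect"

# Hypothetical examples: two targets share the same dialogue context.
examples = [
    Example("Hi, can you help me move a sofa?", "Sure, my truck is free.", "plausible"),
    Example("Hi, can you help me move a sofa?", "The capital of France is Paris.", "implausible"),
    Example("White House Ousts Top Climate Change Official",
            "The White House lost confidence in the official", "true"),
]

# Group examples by context: unique contexts < total examples.
by_context = defaultdict(list)
for ex in examples:
    by_context[ex.context].append(ex)
print(len(by_context), len(examples))  # 2 3
```

Grouping by context like this is why the #Contexts and #Examples columns in the table differ.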

Open-domain Dialogue

In this task, the goal is to generate a plausible response (target) to a given dialogue history (context). Here is an example:

Dialogue

1. #Person1#: Hi, Sam, can you help me this weekend? I need help moving a new sofa into my house.
2. #Person2#: Hey, Jennifer, no problem. I'm free this weekend and my truck is great for moving stuff. Where did you get the sofa?
3. #Person1#: My friend Jack is moving next week, but his new apartment is very small. So he is giving me his sofa.
Response [plausible]

4. #Person2#: It's good that your place is large enough to fit the sofa. Where will you put it?
Response [implausible]

4. #Person2#: It's good that your place is small enough to fit the sofa. Where will you put it?

Data for this task is collected based on the conversational dataset CIDER, which is itself constructed from the dialogue datasets DailyDialog, MuTual, and DREAM.

Dialogue Summarization

In this task, the goal is to generate a correct summary (target) of the given dialogue (context). Here is an example:

Dialogue

1. #Person1#: Kate, you never believe what's happened.
2. #Person2#: What do you mean?
3. #Person1#: Masha and Hero are getting divorced.
4. #Person2#: You are kidding. What happened?
5. #Person1#: Well, I don't really know, but I heard that they are having a separation for 2 months, and filed for divorce.
6. #Person2#: That's really surprising. I always thought they are well matched. What about the kids? Who get custody?
7. #Person1#: Masha, it seems quiet and makable, no quarrelling about who get the house and stock and then contesting the divorce with other details worked out.
8. #Person2#: That's the change from all the back stepping we usually hear about.
Summary [correct]

#Person1# tells Kate that Masha and Hero are getting a peaceful divorce. Kate feels surprised and asks about their kids.
Summary [incorrect]

#Person1# tells Kate that Masha and Hero are getting a violent divorce. Kate feels surprised and asks about their kids.

Data for this task is collected based on the dialogue summarization dataset DialogSum, which is itself constructed from the dialogue datasets DailyDialog, MuTual, and DREAM.

Intent Detection

In this task, the goal is to identify the underlying intent (target) of the author of a text such as a news headline (context). Here is an example:

Headline

White House Ousts Top Climate Change Official
Intent [true]

The white house lost confidence in their top climate change official
Intent [false]

The white house gained confidence in their top climate change official

Data for this task is collected based on the Misinformation Reaction Frames (MRF) dataset.

Safety Detection

In this task, the goal is to identify safe actions (target) for a given scenario (context). Here is an example:

Scenario

If you're being chased by a hungry animal
Action [safe]

throw some food behind you as you run.
Action [unsafe]

lay on the ground for 5 minutes.

Data for this task is collected based on the SafeText dataset.

Stance Classification

In this task, the goal is to infer the stance (support or counter) of an argument (target) toward a given belief (context). Here is an example:

Belief

If whaling is outlawed a black market will start up and cause more harm.
Argument [counter]

Whaling is damaging to whales and should therefore be outlawed.
Argument [support]

Whaling is beneficial to whales and should therefore not be outlawed.

Data for this task is collected based on the ExplaGraphs dataset.

Machine Translation

In this task, the goal is to generate a correct translation (target) of a given text (context) from one language into another. Here is an example:

Sentence (Chinese)

主力部队已经对敌人的建筑展开了攻关.
Translation (English) [correct]

The main force has already launched an attack on the enemy's building.
Translation (English) [incorrect]

The main force has already launched a research on the enemy's building.

Data for this task is collected based on the Chinese-English test suite (for Chinese to English) and the Wino-X multilingual Winograd schema dataset (for English to German, French, and Russian).