Shared Task 1: Commonsense Inference in Everyday Narrations

This task is a revised and extended version of SemEval 2018 Task 11 (Ostermann et al., 2018). The data for the task are short narrations about everyday scenarios with multiple-choice questions about them.


Prompt: I need to go grocery shopping this weekend so I am in the process of making a shopping list. I'll look through my fridge, freezer and cabinets to see what I am low on and need to restock that I use regularly; coffee, stevia, mint tea bags, paper towels. I'll also check to see if I need anything like shampoo or deodorant. When I get the grocery store ads and coupons in the mail I will look through to see what is on sale or what I have a coupon for and I will plan my meals based on what I can get at a good price. I will then make my list organized according to what I can get at what store. I do the majority of my shopping at one store so that will be the big chunk of the list but there are some things that I have to go to other stores for, I will put that on the other side of the paper. Once I have my list made, I'll double check it and head out to go shopping.


  1. What did they check to see if they already had?

    1. Items that they use frequently
    2. Items that are not used regularly
  2. What are they referring to for ingredients?

    1. grocery store ads and coupons
    2. a cookbook

Training and development data for the task are released here.

The data structure is self-explanatory: the XML files contain instance elements, each of which holds a text and several question elements. Each question element contains 2 answer elements, one of which is labeled as correct. For evaluation purposes, each text is annotated with a scenario, and each question is annotated with a reasoning type: commonsense for questions that require commonsense inference, and text for questions that can be answered from the text alone. Note that the test data do not contain these annotations!
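The structure described above can be read with Python's standard library. Note that the exact element and attribute names below (e.g. a `correct` attribute on answers, a `type` attribute on questions, a `scenario` attribute on instances) are assumptions based on the description and should be checked against the released files:

```python
# Minimal sketch of reading the task's XML format.
# Element/attribute names are assumed from the description above.
import xml.etree.ElementTree as ET

SAMPLE = """\
<data>
  <instance id="1" scenario="grocery shopping">
    <text>I need to go grocery shopping this weekend ...</text>
    <question id="0" type="commonsense"
              text="What did they check to see if they already had?">
      <answer id="0" text="Items that they use frequently" correct="True"/>
      <answer id="1" text="Items that are not used regularly" correct="False"/>
    </question>
  </instance>
</data>
"""

def load_instances(xml_string):
    """Yield (scenario, passage, questions) triples from the XML data."""
    root = ET.fromstring(xml_string)
    for instance in root.iter("instance"):
        passage = instance.findtext("text")
        questions = []
        for q in instance.iter("question"):
            # Each question has exactly 2 answers; one is marked correct.
            answers = [(a.get("text"), a.get("correct") == "True")
                       for a in q.iter("answer")]
            questions.append((q.get("text"), q.get("type"), answers))
        yield instance.get("scenario"), passage, questions

for scenario, passage, questions in load_instances(SAMPLE):
    for q_text, q_type, answers in questions:
        gold = [a for a, is_correct in answers if is_correct]
        print(scenario, "|", q_type, "|", q_text, "->", gold)
```

Since the test data lack the scenario and reasoning-type annotations, `instance.get("scenario")` and `q.get("type")` will simply return `None` there.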

The submission is handled via Codalab Worksheets, under this link.

Note that the test data will not be published. Submissions during the practice phase will be displayed on the dev data leaderboard. Submissions during the eval phase will not be displayed on a leaderboard. The test data leaderboard will only be published after the eval phase.

| Rank | Model | Accuracy |
| --- | --- | --- |
| | Human Performance | 0.974 |
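Systems are ranked by accuracy, i.e. the fraction of questions answered correctly. A minimal sketch of the scoring, assuming gold labels and predictions are given as per-question answer indices (the actual evaluation script and its input format are not specified here):

```python
# Accuracy over multiple-choice questions: fraction of questions where the
# predicted answer index matches the gold answer index.
def accuracy(gold, predicted):
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

print(accuracy([0, 1, 0, 0], [0, 1, 1, 0]))  # → 0.75
```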