In recent years, there has been increased interest in how to assess the quality of arguments systematically. Wachsmuth et al. proposed a framework for quality assessment consisting of the following top-level dimensions: logic, rhetoric, and dialectic. Regarding the dimension of logic, there has been some work on automatically assessing the quality of an argument or conclusion.
Recently, there has also been interest in the generation of conclusions or arguments. To guide the process of automatically generating a conclusion, our assumption is that we need metrics that can be computed automatically to estimate the suitability and quality of a candidate conclusion. One important objective is that the conclusion is valid, that is, that it “follows” from the premise. At the same time, it is easy to produce conclusions that “follow” from the premise by repeating (parts of) the premise in the conclusion, trivially generating a “valid” but vacuous conclusion. Hence, it is important to assess whether conclusions/arguments are not only valid, but also novel.
We define validity as requiring the existence of logical inferences that link the premise to the conclusion. In contrast, novelty requires the presence of novel premise-related content and/or a combination of the contents of the premise in a way that goes beyond what is stated in the premise. Hence, a conclusion that is valid but not novel could be a repetition, a paraphrase, or a summary of the premise; only a novel conclusion offers a piece of information that extends what is already covered by the premise – whether it supports or contests the premise.
We divide the task of Validity-Novelty-Prediction into two subtasks.
- Task A: The first subtask is a binary classification task along the dimensions of validity and novelty: given a textual premise, classify a conclusion as being valid/novel or not.
- Task B: The second subtask consists of a comparison of two conclusions in terms of validity/novelty.
Participants can choose whether to address Task A or Task B, or both.
Given a premise and a conclusion in natural language, the task is to predict:
- whether the given conclusion is valid with respect to the premise
- whether the given conclusion is novel with respect to the premise
Hence, we expect two binary decisions as output.
Premise: The notion of man's dominion over animals need not be thought of as a blank check for man to exploit animals. Indeed, it may be appropriate to connect the notion of "dominion" to "stewardship" over animals. Yet, humans can be good stewards of animals while continuing to eat them. It is merely necessary that humans maintain balance, order, and sustainability in the animal kingdom. But, again, this does not require the abandonment of meat-eating.
Conclusion | Validity | Novelty |
---|---|---|
Two-party systems are more stable | no | no |
Man's "dominion" over animals does not imply abandoning meat. | yes | no |
The idea of "dominionism" is unnecessary. | no | yes |
Dominion over animals can and should be used responsibly | yes | yes |
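For illustration, a Task A instance such as the last row of the table above could be represented as follows. This is only a sketch; the field names are our own and not the official column names of the data files.

```python
# Minimal sketch of a Task A instance (field names are illustrative, not the
# official data format); the premise is shortened for brevity.
from dataclasses import dataclass

@dataclass
class TaskAInstance:
    premise: str      # argumentative text the conclusion is judged against
    conclusion: str   # candidate conclusion to be classified
    validity: bool    # does the conclusion follow from the premise?
    novelty: bool     # does it add content beyond the premise?

example = TaskAInstance(
    premise="The notion of man's dominion over animals need not be thought of as a blank check [...]",
    conclusion="Dominion over animals can and should be used responsibly",
    validity=True,
    novelty=True,
)
```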
Please read the Data Description beforehand.
If you use the data, please cite our overview paper.
- Train: here
- you're allowed to extend the train set with further (synthetic) samples. However, if you do so, you have to describe/provide the algorithm that extends the training set. This algorithm must be executable automatically, without any human interaction (hence, without further manual annotation or manual user feedback); a minimal sketch of such an algorithm follows this list.
- Dev: here
- Test: without labels => submit
- (New) Test: with labels
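As noted in the introduction, simply repeating (parts of) the premise yields a conclusion that is valid but not novel. A minimal sketch of an automatically executable extension algorithm exploiting exactly this observation might look as follows; this is not an official baseline, and the field names are illustrative.

```python
# Sketch of an automatically executable training-set extension.
# Assumption (see the task description): repeating a sentence of the premise as the
# conclusion yields a conclusion that is valid but not novel.
import random

def extend_with_trivial_conclusions(train_set, seed=42):
    """For each training premise, add one synthetic (valid, not novel) instance
    whose conclusion is a sentence copied verbatim from the premise."""
    rng = random.Random(seed)
    synthetic = []
    for instance in train_set:
        sentences = [s.strip() for s in instance["premise"].split(".") if s.strip()]
        if not sentences:
            continue
        synthetic.append({
            "premise": instance["premise"],
            "conclusion": rng.choice(sentences) + ".",
            "validity": 1,   # follows trivially from the premise
            "novelty": 0,    # adds nothing beyond the premise
        })
    return train_set + synthetic
```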
Evaluation: we consider the f1_macro-score, counting an instance as correctly predicted only if both validity and novelty are predicted correctly.
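One plausible reading of this metric, assuming that each instance is mapped to a joint (validity, novelty) label before computing macro F1; this is a sketch, not the official evaluation script.

```python
# Sketch of the Task A metric: macro F1 over joint (validity, novelty) labels,
# so an instance only counts as correct if both dimensions are predicted correctly.
from sklearn.metrics import f1_score

def joint_f1_macro(gold_val, gold_nov, pred_val, pred_nov):
    gold = [f"val={v},nov={n}" for v, n in zip(gold_val, gold_nov)]
    pred = [f"val={v},nov={n}" for v, n in zip(pred_val, pred_nov)]
    return f1_score(gold, pred, average="macro")

# Example: two of three instances have both dimensions predicted correctly.
print(joint_f1_macro([1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 0, 1]))
```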
Given a premise and two conclusions A and B in natural language, the task is to predict:
- whether conclusion A is better than conclusion B in terms of validity
- whether conclusion A is better than conclusion B in terms of novelty
There are three possible labels for this task: better/worse/tie.
Premise: These large ships release significant pollution into the oceans, and carry some risk of hitting the shore, and causing a spill.
Conclusion A | Conclusion B | Validity | Novelty |
---|---|---|---|
Transporting offshore oil to shores by ship has environmental costs. | Need for water does not qualify water as a right. | A > B | A > B |
Oil drilling releases significant pollutants into the ocean | Transporting offshore oil to shores by ship has environmental costs. | A = B | A < B |
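For illustration, the first comparison in the table above could be represented as follows; this is a sketch with hypothetical field names, not the official data format.

```python
# Minimal sketch of a Task B instance: conclusion A beats conclusion B
# on both validity and novelty (field names are illustrative).
example = {
    "premise": "These large ships release significant pollution into the oceans, "
               "and carry some risk of hitting the shore, and causing a spill.",
    "conclusion_a": "Transporting offshore oil to shores by ship has environmental costs.",
    "conclusion_b": "Need for water does not qualify water as a right.",
    "validity": "A > B",
    "novelty": "A > B",
}
```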
Please read the Data Description beforehand.
If you use the data, please cite our overview paper.
- Train: here
- you're allowed to extend the train set with further (synthetic) samples. However, if you do so, you have to describe/provide the algorithm that extends the training set. This algorithm must be executable automatically, without any human interaction (hence, without further manual annotation or manual user feedback); see the sketch given for Task A above.
- Dev: here
- Test: without labels => submit
- (New) Test: with labels
Evaluation: We require the following format for each instance and each aspect (validity/novelty).
- -1: A>B: Conclusion A is more (valid/ novel)
- 0: A=B: Conclusion A and B are equally valid/ novel
- 1: A<B: Conclusion B is more (valid/ novel)
We consider the average of the f1_macro-score for validity and novelty.
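A minimal sketch of this scoring, assuming the -1/0/1 labels above and scikit-learn's macro F1; this is not the official evaluation script.

```python
# Sketch of the Task B metric: macro F1 computed separately over the -1/0/1
# labels for validity and for novelty, then averaged.
from sklearn.metrics import f1_score

def task_b_score(gold_val, pred_val, gold_nov, pred_nov):
    f1_val = f1_score(gold_val, pred_val, average="macro", labels=[-1, 0, 1])
    f1_nov = f1_score(gold_nov, pred_nov, average="macro", labels=[-1, 0, 1])
    return (f1_val + f1_nov) / 2

# Example with three comparison instances:
print(task_b_score([-1, 0, 1], [-1, 1, 1], [-1, -1, 0], [-1, -1, 0]))
```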
- 01.08.22: Test data without labels is released
- 05.08.22: Final submission of team results on test data
- (until) 12.08.22: Publication of overall results of the task
Please find more info here
By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, email us, and we will consider if an exception can be made.
- By submitting results to this competition, you consent to the public release of your scores at this website, at the ArgMining-2022 workshop, and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgements, qualitative judgements, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
- You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgement that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
- A participant should normally be involved in only one team. Participating in more than one team is not recommended, but it is not forbidden, provided the person does not apply the same approach in different teams.
- Up to 5 submissions from different approaches (systems) are allowed per team and per subtask. Submissions must be sent via e-mail. You are allowed to withdraw your submission at any time until the final deadline (also via e-mail).
- You must not use any data from the development split as training instances. You must not use any test instance in the training of the model (also not indirectly for model selection). Approaches that violate this data separation are disqualified.
- Once the competition is over, we will release the gold labels, and you will be able to determine results on various system variants you may have developed. We encourage you to report results on all of your systems (or system variants) in the system-description paper. However, we will ask you to clearly indicate the result of your official submission.
- We will make the final submissions of the teams public at some point after the evaluation period.
- The organizers and their affiliated institutions make no warranties regarding the datasets provided, including but not limited to their being correct or complete. They cannot be held liable for providing access to the datasets or for the usage of the datasets.
- The dataset should only be used for scientific or research purposes. Any other use is explicitly prohibited.
- The datasets must not be redistributed or shared in part or full with any third party. Redirect interested parties to this website.
Table entries are ranked by the respective main evaluation metric.
Team | mF1 Valid&Novel | mF1 Valid | mF1 Novel |
---|---|---|---|
CLTeamL-3 | 45.16 | 74.64 | 61.75 |
AXiS@EdUni-1 | 43.27 | 69.8 | 62.43 |
ACCEPT-1 | 43.13 | 59.2 | 70.0 |
CLTeamL-5 | 43.1 | 74.64 | 58.95 |
CSS(*) | 42.4 | 70.76 | 59.86 |
AXiS@EdUni-2 | 39.74 | 66.69 | 61.63 |
CLTeamL-2 | 38.7 | 65.03 | 61.75 |
CLTeamL-1 | 35.32 | 74.64 | 46.07 |
CLTeamL-4 | 33.11 | 56.74 | 58.95 |
ACCEPT-3 | 30.13 | 58.63 | 56.81 |
ACCEPT-2 | 29.92 | 56.8 | 48.1 |
NLP@UIT | 25.89 | 61.72 | 43.36 |
RoBERTa | 23.9 | 59.96 | 36.12 |
CSS | 21.08 | 51.61 | 43.75 |
Harshad | 17.35 | 56.31 | 39.0 |
CSS(*): post-deadline submission of CSS, after an output formatting bug was detected and corrected
Team | mean of mF1 Val & mF1 Nov | mF1 Valid | mF1 Novel |
---|---|---|---|
NLP@UIT | 41.5 | 44.6 | 38.39 |
AXiS@EdUni | 29.16 | 32.47 | 25.86 |
RoBERTa | 21.46 | 19.82 | 23.09 |
Newsletter/ Google-Group: https://groups.google.com/g/argmining2022-shared-task
- Bielefeld University
- Philipp Heinisch: pheinisch@techfak.uni-bielefeld.de (main organizer + contact person)
- Philipp Cimiano: cimiano@cit-ec.uni-bielefeld.de
- Heidelberg University
- Anette Frank: frank@cl.uni-heidelberg.de
- Juri Opitz: opitz@cl.uni-heidelberg.de