- Q: Are verbs and other PoS among the target words?
A: Yes, verbs, adjectives and adverbs are included (see the full description at http://nlp.cs.swarthmore.edu/semeval/tasks/task10/description.shtml).
There may be minor modifications to the description as we nail down a few measures.
- Q: Will each target word have at least one substitute?
A: Yes, all words will have at least one substitute. We are going to weed out
cases which are clearly problematic.
- Q: Is it true that the word types (lexelt items) included in the trial
data (e.g. bright, film, take, etc.) are not necessarily the words that you
will include in the test data?
A: You are correct that the words in the trial dataset are not those that
will be in the test set.
- Q: Will there be any training data? If so, will those lexelt items be
representative of the test data, or will they be similar to the trial data
(in that you make no claims about those words appearing in the test data)?
A: There is no training data (from the full description: "For this reason we will
not provide training data since this would mean we would need to specify
potential substitutes in advance."). There is no guarantee as to how the test
words will behave or what the synonyms will be. We are not certain yet how
much data we can get annotated in time. We should have 1,500 test sentences
at minimum (but hopefully more). Each word will have 10 sentences (one or two
sentences may be taken out if they are problematic in some way). Some words
are selected manually and some randomly from lists of potential candidates.
In the test data, approximately 20 words for each PoS will have their sentences
selected manually (rather than by the random process). This may mean that the
sense distribution in those sentences is not representative of the corpus as
a whole. As mentioned in our description, we will provide a breakdown of
scores for these two sentence sampling approaches.
- Q: I don't understand why score.pl uses 298 as the total number of items in the trial data when there are 300?
A: This is because it only uses items with 2 or more non-NIL, non-proper-name responses from the annotators (see task10documentation.pdf).
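For illustration, the filtering amounts to something like the sketch below (Python; the response lists and the proper-name check are invented for this example, and score.pl together with task10documentation.pdf remains authoritative):

    # Sketch only: an item is scorable if at least two annotator responses
    # remain after discarding NILs and proper names.  The capitalisation
    # test is a stand-in heuristic, not how score.pl detects proper names.
    def is_scorable(responses):
        kept = [r for r in responses if r != "NIL" and not r[:1].isupper()]
        return len(kept) >= 2

    # Invented responses for three hypothetical items:
    items = {
        "bright.a 1": ["colourful", "vivid", "brilliant"],
        "film.n 7":   ["NIL", "movie"],       # only one usable response
        "take.v 12":  ["Hoover", "NIL"],      # proper name plus NIL
    }
    print(sum(is_scorable(r) for r in items.values()))   # -> 1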
- Q: We noticed that some suggested synonyms in the trial data were spelled in British English (e.g. bright.a #3 lists 'colourful' as the first suggested replacement). Will spelling differences be accounted for in the scoring, or should we make every effort to report our answers with British spellings?
A: The annotators are all British, so I would advise that you provide
substitutes with British spellings. (We do mention on the task description
web page that all subjects are living in the UK.) Our subjects are free to use
American spellings, though, and we suspect that will happen some of the time.
We do not want to promise to allow for this in the scoring in case we add
rules which inadvertently cause errors.
- Q: I have a question about the examples you provided. The word to be
substituted may not be in its base form, for example "<head>takes</head>".
If we find "last" as a correct substitute for "take" in the given sentence,
do we have to change "last" to "lasts"? In other words, if we output "last",
will it be judged as correct?
A: We are expecting substitutes in lemmatised form, i.e. "last" rather than
"lasts". This is stated in our documentation (see task10documentation.pdf) and
is hopefully evident from the trial gold standard. The multiword
identification and detection subtask is perhaps more complicated: in some
cases it isn't obvious that the lemmatised form is the canonical one, and
in those cases we take the response from the annotators as is.
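For illustration, here is a minimal sketch of lemmatising substitutes before output, assuming NLTK and its WordNet data are installed (this is our own example, not part of the task software):

    # Sketch only: normalise candidate substitutes to their lemmatised
    # form (e.g. "lasts" -> "last") before writing an answer file.  NLTK
    # is an assumption; any lemmatiser returning the base form would do.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def lemmatise_substitute(word, pos):
        # The task's PoS letters (n, v, a, r) coincide with WordNet's
        # tags for noun, verb, adjective and adverb.
        return lemmatizer.lemmatize(word, pos=pos)

    print(lemmatise_substitute("lasts", "v"))   # -> "last"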
- Q: If we participate in the Best or OOT method, are we expected to come up with the multi-word synonyms?
Using your scorer (distributed with the trial data) it seems like we
are penalised for not guessing these, even though MW is evaluated as a
separate task. It would be useful (to us, at least) if there was a scoring method
that only judged us on our ability to guess single-word replacements.
A: The MW task is identification of "multiwords" in the original sentence
rather than scoring multiword substitutes. In response to your request, we
have given the scorer for the test run an option to score only the single-word
substitutes in the gold standard. This will work on the subset of the data that
has 2 or more single-word (non-proper-name) responses from the annotators.
We will provide this breakdown to participants, as well as the scores on the
full set of substitutes.
- Q: Is it true that in the "best" scoring a precision of 100% is not
always possible?
For instance, in the example given in Section 4, here is what the
various possible submissions would get for item 9999, if my
understanding is correct:
glad => 3/7
merry => 2/7
cheerful => 1/7
glad;merry;cheerful;jovial => (7/4)/7 = 1/4
A: Yes, that is right. The reason is that there is more uncertainty about the
correct answer for items with more variation. To get maximum credit you are
best off guessing the mode and giving only one answer. We want to favour
systems which provide the best answer, and we put more weight on items where
there is more agreement. For oot, 100% is possible provided |Hi| does not
exceed 10. (A worked sketch of this calculation is given after this exchange.)
Q: So is there a simple reason why the maximum score in the best evaluation
can only be achieved by giving a single answer?
A: The system should be trying to find the best substitute and not hedging
its bets. If a system really thought several were equally good then it
should provide these as best; this would be reflected by equal choices
from the annotators. The system needs to guess the favourite from the
annotators. The idea of scoring against all of the annotators' responses
is that there will be variation. It is not a black and white situation, and
we want to emphasise test items with better agreement and less variation.
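To make the arithmetic in the worked example above concrete, here is a minimal Python sketch (ours, not part of score.pl) of the per-item best credit, together with the oot credit as we understand it from the documentation; the gold frequencies are those of item 9999:

    # Sketch only: reproduces the per-item "best" credit from the worked
    # example above.  Function names are ours; score.pl is authoritative.

    def best_credit(guesses, gold_freqs):
        # Sum the annotator frequencies of the guesses, then divide by the
        # number of guesses and by |Hi| (total annotator responses).
        h_i = sum(gold_freqs.values())                    # |Hi| = 7 here
        found = sum(gold_freqs.get(g, 0) for g in guesses)
        return found / (len(guesses) * h_i)

    def oot_credit(guesses, gold_freqs):
        # Out-of-ten credit as we understand the documentation: up to ten
        # guesses allowed, with no division by the number of guesses.
        h_i = sum(gold_freqs.values())
        return sum(gold_freqs.get(g, 0) for g in guesses) / h_i

    gold = {"glad": 3, "merry": 2, "cheerful": 1, "jovial": 1}
    print(best_credit(["glad"], gold))                                 # 3/7
    print(best_credit(["merry"], gold))                                # 2/7
    print(best_credit(["glad", "merry", "cheerful", "jovial"], gold))  # 1/4
    print(oot_credit(["glad", "merry", "cheerful", "jovial"], gold))   # 1.0

Note that listing all four substitutes as best scores 1/4, whereas giving only the mode ("glad") scores 3/7, which is the point made above about not hedging.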