Annotating words with USAS #204
Agree
using
I think the best solution for this situation is to introduce a <TEI>
<!-- ... -->
<standOff type="USAS-SEM">
<span ana="usas:Z8" target="#t1"/>
<span ana="usas:Z5" target="#t2"/>
<span ana="usas:A13.3" target="#t3 #t4"/>
<span ana="usas:Q2.2" target="#t5"/>
<span ana="usas:Z5" target="#t6"/>
<span ana="usas:G1.1" target="#t7"/>
<span ana="usas:X7p" target="#t8"/>
</standOff>
</TEI> |
I would say it's a tweak, not a hack, and I didn't actually say I don't like it - in fact, I do!
But then you could say exactly the same for
Yes, exactly.
Yikes! This would open a whole new can of worms:
Why? What is wrong with the Parla-CLARIN recommendation? Except that the description is rather brief... This is the way I encoded speech alignment in GosVL, cf. http://hdl.handle.net/11356/1444, and thought to use the same system in ParlaMint. This is quite similar to what is proposed in the TEI-based ISO 24624:2016, although they do use annotationBlock to wrap elements. |
I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged Perhaps a separate
|
I don't think so. We annotate the link between two entities (not child or parent node) in the annotation of syntactic relation. But in USAS case we want to annotate the node (word or mwe) not the ptr.
How I understand this
I don't think that source: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASOstdf
Sorry, this was a misleading comparison. I am not sure if we can agree now. I hope it will be brighter tomorrow. |
@matthewcoole I used your online tagger http://ucrel-api.lancaster.ac.uk/usas/tagger.html, and it made me think that I finally understand the point of MWEs. You are annotating both:
I have tried this sentence: The Manx name of the Isle of Man is Ellan Vannin.
We can decompose this issue into two:
And then, we can annotate words in |
I think that makes sense. I should've been clearer. MWEs should have the same tags I think, but the tokens themselves may have other tags, as in the |
OK, if I try to summarize, also to check if I understand:
As for the encoding:
All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations. |
Agree. For the record, a full sample of the sentence "The Manx name of the Isle of Man is Ellan Vannin.":
<s>
<w ana="usas:Z5">The</w>
<w ana="usas:Z99">Manx</w>
<w ana="usas:Q2.2">name</w>
<w ana="usas:Z5">of</w>
<w ana="usas:Z5">the</w>
<phr ana="usas:Z2">
<w ana="usas:W3">Isle</w>
<w ana="usas:Z5">of</w>
<w>Man</w>
</phr>
<w ana="usas:A3+ usas:Z5">is</w> <!-- pointing to invalid ID, but it is not the point of this example -->
<phr ana="usas:Z1mf usas:Z3c">
<w>Ellan</w>
<w>Vannin</w>
</phr>
<pc>.</pc>
</s> |
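To make the sample easier to check, here is a minimal Python sketch that lists the USAS tags found on w and phr elements. It assumes a namespace-free fragment like the one in this thread (real ParlaMint TEI files use the TEI namespace, which would need handling), and the function name usas_annotations is just illustrative, not part of any ParlaMint tooling.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample sentence from this thread.
SAMPLE = """<s>
  <w ana="usas:Z5">the</w>
  <phr ana="usas:Z2">
    <w ana="usas:W3">Isle</w>
    <w ana="usas:Z5">of</w>
    <w>Man</w>
  </phr>
  <w ana="usas:A3+ usas:Z5">is</w>
</s>"""


def usas_annotations(sentence_xml):
    """Return (element name, text, [USAS tags]) for each annotated w or phr."""
    root = ET.fromstring(sentence_xml)
    result = []
    for el in root.iter():
        if el.tag not in ("w", "phr"):
            continue
        ana = el.get("ana")
        if ana is None:
            continue  # e.g. a word inside an MWE with no tag of its own
        # @ana is a space-separated list of prefixed references.
        tags = [v.split(":", 1)[1] for v in ana.split() if v.startswith("usas:")]
        # Collapse whitespace so a phr yields "Isle of Man".
        text = " ".join("".join(el.itertext()).split())
        result.append((el.tag, text, tags))
    return result


for name, text, tags in usas_annotations(SAMPLE):
    print(name, text, tags)
```

Words without @ana (like "Man" above) are simply skipped, so the phr-level tag is the only annotation covering them.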
I just answered something along similar lines in #202. The tagger itself will output all the possible semantic tags for each word and MWE, but following contextual disambiguation, the first tag in the list should be the most likely. So, we could simplify things for ParlaMint and just provide the first choice tag and remove the remainder. That will give us a certain level of accuracy, but reduce recall of course. |
So, it is now settled that USAS annotation:
I would split this encoding question into two parts, how to encode USAS tags in CoNLL-U (relevant for @perayson), and how in TEI, which will be generated from CoNLL-U (for @matyaskopp and @TomazErjavec). In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but, as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap). For this the standard way is to use the IOB encoding, which we already use for NER markup in CoNLL-U. This means that
So, we would have something like (with irrelevant columns skipped and randomly picked USAS tags):
@matyaskopp, do you agree? As for the TEI encoding, I would postpone this discussion until the CoNLL-U format is finalised. |
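Since the TEI is to be generated from CoNLL-U, the IOB labels will have to be turned back into spans at some point. A minimal sketch of that step in Python, assuming the labels have already been extracted from the MISC column into a plain list, one per token (how they are keyed inside MISC is not settled here, so that part is left out). The function name iob_to_spans is hypothetical.

```python
def iob_to_spans(labels):
    """Group IOB labels into (start, end, tag) spans; end is exclusive."""
    spans = []
    start, tag = None, None
    for i, label in enumerate(labels):
        if label.startswith("I-") and tag == label[2:]:
            continue  # this token extends the currently open span
        if start is not None:
            spans.append((start, i, tag))  # close the open span
            start, tag = None, None
        if label != "O":
            # B- opens a new span; a stray I- is leniently treated like B-.
            start, tag = i, label[2:]
    if start is not None:
        spans.append((start, len(labels), tag))
    return spans


print(iob_to_spans(["B-Z5", "B-Z2", "I-Z2", "I-Z2", "O"]))
```

The B prefix is what keeps two neighbouring single words with the same tag apart: ["B-Z5", "B-Z5"] gives two one-token spans rather than one two-token span.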
Tags do not contain spaces, spaces are tag separators, so they can be replaced with a character that is not present in any tag, e.g.
can be encoded this way:
True if "overlap" means that the tags in the second and later positions have the same "semantic segmentation", i.e. the second and later tags describe the same span (MWE or single token) as the first one. Otherwise, we have to add
|
Now I see my previous comment and example, it seems that there are nested semantic spans:
|
@matyaskopp it looks like you either haven't seen or read what I wrote, because:
Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.
I already proposed adding B and I prefixes to all the tags, we need them to be able to properly encode MWEs (even without nesting). We cannot count on two neighbouring words to always have different tags, also, we already encode NEs with IOB, so it is only sensible to encode semantic annotations in the same way.
If this is true (and maybe @perayson can confirm), then we do have a problem. One option would of course be to ignore the inner tags, the same way we do for NER (well, except CZ). If this is for some reason not possible, we have to think again... |
Sorry, I hadn't understood it properly. I imagined a vertical sequence of tags that correspond to multiple tokens. |
The format with the 10 digits at the start of each line comes from the C version of USAS, and the addition of sequences like Note that punctuation also gets a |
OK, this is a relief!
Is there maybe a document explaining the format? Namely, the only output example I understand there is for English, and that one is rather short. In particular I'm still not completely clear on whether all words inside a MWE have the same tags, or whether they can differ, so that the complete MWE has one tag, but there can be further word-specific tags in the tag list, so that in effect tags can be nested. By what you write below, my guess is that the second is the case, so I continue with that understanding. I think we then have two options: 1. use IOB for all sem tags, and always keep only the first tag. The fact that something is a MWE or not is distinguished by the I tag on the second, third etc. MWE token. As you write that all tokens get a sem tag, we would not in fact have any O tags. So, let's say we have a sentence "a b c", with b and c being a MWE. a has sem tags 1,2, b has 3, c has 4, while the MWE tag is M, so that b has the list "M,3" and c has "M,4". So, the first option would give:
and the second would be:
I hope it is more or less clear what I mean.... I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?
OK, so I guess this means that all tokens get a sem tag, nice. |
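For what it's worth, the first option on the "a b c" example can be sketched in a few lines of Python. This is only one reading of the description above: it assumes each token comes with its tag list (first tag first) and that the MWE boundaries are known separately as token ranges. The name option_one and the input shapes are made up for illustration.

```python
def option_one(tag_lists, mwes):
    """Keep only the first tag per token and add IOB prefixes.

    tag_lists: one list of tags per token, most likely (or MWE) tag first.
    mwes: (start, end) token ranges of MWEs, end exclusive, non-overlapping.
    """
    inside = set()
    for start, end in mwes:
        # Every MWE token except the first gets the I prefix.
        inside.update(range(start + 1, end))
    return [
        ("I-" if i in inside else "B-") + tags[0]
        for i, tags in enumerate(tag_lists)
    ]


# "a b c": a has tags 1,2; b has "M,3"; c has "M,4"; b and c form the MWE.
print(option_one([["1", "2"], ["M", "3"], ["M", "4"]], [(1, 3)]))
```

As every token gets a sem tag, no O label is ever produced; the word-specific tags 3 and 4 are dropped, which is the information loss the second option avoids.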
I see another option:
It is a simple encoding, and an MWE will never overlap with a different MWE. You can always get the first option from it:
yes because we don't have syntactic analysis for the translated version (so no syntactic tokens). |
Wow, good one! It is a slight perversion over the usual IOB rules but I think it solves all the problems, so I also vote for it! |
I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled
|
The idea is that there is no formal distinction between single and multi word expressions, the only difference is that tags for single words will always be B, while for MWEs the second, third etc. word will be I. And, as every token gets a semantic tag, there are no tokens outside some semantic tag, so nothing is marked with O. As for your suggestion, the use of IOB here is different from what it is otherwise, e.g. in our NER annotation. If you want to split MWE annotations from single word annotations, a more common way would be:
Still, not to overcomplicate: why don't you do it in the way that you feel most comfortable with, and then me and @matyaskopp can, if necessary, modify it to be the most in line with our NER annotation? |
Thanks, so I've tweaked the tagging script to produce this format:
If you approve, then I can start to set up the tagging jobs. By the way, please can you confirm where I should download the final MT CONLLU format data from? |
Hey folks, sorry for the delay, been dealing with a major issue on another project.
Just highlighting what had changed, with the table being so large - sorry, should have said :)
.. that's odd - I've got tar set to build relative paths, will look to fix this on the next build, then to repackage the existing ones with the fixed paths. I'm still a little maxed out on the other project, but I'll see if I can get AT and CZ running overnight, along with the path fix. |
Not a problem, we are busy with finishing the original language corpora anyway.
No need, I've got my unpacking set up the way it is now, so it would only mean I have to change things at my end again. One thing: I tried running my scripts over BA, and after everything crashed, found out that one of your CoNLL-U files abruptly terminates in the middle of the original file, this one: ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu Anyway, could you re-annotate ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu please? |
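A cheap way to catch this kind of truncation before the scripts crash: CoNLL-U requires a blank line after every sentence, including the last one, so a complete file ends in two newlines. A rough Python sketch, assuming Unix line endings; looks_truncated is a hypothetical helper, and it only detects files cut off mid-sentence, not ones that happen to end exactly on a sentence boundary.

```python
import os


def looks_truncated(path):
    """Return True if the file does not end with a blank line."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Only the last two bytes matter; avoid reading the whole file.
        fh.seek(max(0, size - 2))
        tail = fh.read()
    return tail != b"\n\n"
```

Running this over all unpacked .conllu files before the integration step would flag a file like the BA one above.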
In addition to AT and CZ, some other MTed files are now also ready:
FI is still to come, will be finished shortly. And, yes, I need a newly annotated ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu |
FI is now also available: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-FI-en.conllu.zip |
I just tried kicking off the UA and HU jobs, but the zip files seem to be malformed?
With equivalent output for the UA file. Can you take a look @TomazErjavec ? |
I just tried getting the file and unzipping it on my machine, and it works fine, also for UA and FI. Weird. Anyway, I now made .tgz files for FI, HU, UA, hope that will be better. Same location as before, i.e https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/?C=M;O=D |
Super weird, I re-downloaded the zips to check again and they seem to be happy now? In any case, just a quick note here to say I've been running everything at our side here, and it all looks good bar one issue where the script isn't super happy affecting a single file. I'm currently investigating that and hope to have everything published tonight/tomorrow in the web folder. As we have a few versions of these archives kicking around now, I'm also generating an md5 for each of the files, so you can confirm you have the latest/correct version. |
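For anyone who does want to check an archive against the published md5, a minimal Python sketch using only the standard library. The function name md5_of is made up, and how the expected checksums are distributed (file names, format) is not specified in this thread, so the comparison itself is left to the caller.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reading keeps memory use flat for multi-GB archives.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result can be compared as a plain string with the checksum listed next to the archive.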
I've processed and uploaded a new set of tar's over at http://ucrel-api-01.lancaster.ac.uk/vidler/ - the only failing file is For now, I suggest not using the BA-en files. There's also now an I've fixed the tar enclosed paths @TomazErjavec so I'm afraid your programs will need reverting to their previous paths - it was a bug that needed fixing as any changes to our build system here would mean a different path in the resultant .tar - sorry! Also, according to my load tests - I can now apparently rebuild this lot over about a 24-48hr period no problem at all 🙂 |
Thanks @JohnVidler, got the missing files. I don't see any particular need to use the checksums, unless something freaky happens again. And good to hear that speed is not an issue.
|
No problem, I'll set ES-CT going in a moment, and I'll be looking at GA and GB today, so they should be up shortly, barring any major issue Edit: Ah, GB got missed because it didn't follow the 'XX-en' pattern I was using to automatically download everything - whoops. Getting that started too. |
@JohnVidler, any news on GB and ES-CT? As well as on the missing BA file? And we now also got the last corpus translated, if you could process this one as well please: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-ES-PV-en.conllu.zip |
Hey @TomazErjavec - disruption to my working pattern slowed me down - ES-CT is now in the usual spot: ( http://ucrel-api-01.lancaster.ac.uk/vidler/ ) GB is mostly playing well with the tooling here now, but I've got a couple of errors I'm fixing today so that should be up shortly. I'll kick off ES-PV now and it'll be up for tomorrow morning, assuming we hit no problems. |
Thanks @JohnVidler for ES-CT. Some problems, because a) it seems ES-CT actually deleted some files from the first round and b) you seem to have expanded the new corpus into the old directory, so those files persisted there. The result was havoc with my integration program, but managed to identify the spurious files and delete them, so now all ok. |
Heads up @JohnVidler, we are now running very late. We still need:
We would really need to release ParlaMint-en soon, and the processing at this end takes some time too (assuming no problems, otherwise even more...) |
GB to follow shortly, apologies for the delay! |
Note that GB has rather large log files, as the tooling repeatedly complains about the missing sources - I've left these in for now as the output still needs to be sanity checked by @perayson, but I've uploaded the .tar.gz anyway so you can get started @TomazErjavec on the assumption that all is well. If the log size is a problem, let me know and I'll strip the warnings out and re-upload. |
@JohnVidler, thanks for corpora. Got them all 3, at first glance looks ok (but do have to comment on the inventiveness of the paths, ES-PV and BA in mnt/zfs/ucrel-data, and GB in home/ubuntu/:). |
I've had a look at GB this morning, the semantic tagging looks fine, however we never really agreed an input/output format for GB as it's different from the translated corpora. Can you have a look @TomazErjavec and let us know what else needs to be retained, if anything, from the input? |
Argh, apologies for the path mixup - I had to run GB on its own, hence the different path, but the darned version of |
No @JohnVidler, it's ok, I have all the files now the way I want them here. But I thought I should mention it!
@perayson, I think it's ok the way it is. I did some pre-processing and nothing broke. So, I think we can consider the delivery of all the files done! (well, except if some later stage of processing, in particular the conversion into TEI, shows some unexpected problems, but I am optimistic that it won't). |
ok, great, thanks for confirming! |
The points here have been mostly solved, what remains should be taken up in #827. |
The USAS semantic tags will be encoded in a taxonomy (cf. #202), but there remains the question of how to encode these tags (or, rather, references to the IDs of the taxonomy categories) on word tokens. An important complication is that USAS can also tag multi-word expressions (MWEs).
One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well-formed XML or difficult fixes.
An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).
In line with this, the encoding (suitably simplified) could be like:
@matyaskopp, do you see any problems with this suggestion?