German FrameNet

Thursday, June 15, 2006

New German Chunker

Hello Mario

Thanks for the input. I noticed some discrepancies, though. You mentioned that /home/framenet/may06/sandbox/framenet/collin/Adjusting.calibrate.v.v.chunked is the input for D. However, your pipeline flowchart shows it to be the output file from Abney's chunker.

Anyway, with the new German chunker that I am testing, such a file matches neither the input nor the output format.
The input here is in one-word-per-line format. Each sentence has to be preceded by an <s> tag and an empty line, for example:
<s>
In
den
Großraumduschen
lag
die
Seife
schon
bereit
.
</s>

which is pretty much what we have at the end of the pre-processing stage.
So do we still have to go through the IMS TreeTagger and the other steps in between?
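
The one-word-per-line input shown above could be produced with something like the following. This is only a rough sketch: the function name to_chunker_input is made up for illustration, and putting an empty line between sentences is an assumption about what the chunker expects.

```python
def to_chunker_input(sentences):
    """Render tokenized sentences in one-word-per-line format.

    `sentences` is a list of token lists. Each sentence is wrapped in
    <s> ... </s>, as in the example above. Separating sentences with an
    empty line is an assumption, not a documented requirement.
    """
    blocks = []
    for tokens in sentences:
        lines = ["<s>"] + list(tokens) + ["</s>"]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sentence = ["In", "den", "Großraumduschen", "lag", "die",
                "Seife", "schon", "bereit", "."]
    print(to_chunker_input([sentence]), end="")
```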

Do let me know what you think.

Thanks
Sumeet

____________________________


Sumeet:

Let us consider the following example extracted from Complaining.lament.v.v.9:


<s aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1">
Prince nnp Prince
Philip person Philip
<target>lamented</target> vvd lament
that comp that
`` nil ``
lots nns lot
of of of
resources nns resource
are ber be
going vvg go
into in into
economic jj economic
development nn development
and cc and
very rb very
little jj little
into in into
conservation nn conservation
of of of
Nature organization Nature
. sent .
'' nil ''
</s>



You can use your NEW German tagger but I am thinking that its input (more precisely, its eventual output) will have to contain extra information such as:
  • the target word, marked up as in <target>lamented</target> (in the original pipeline, the target word is given by the CQP output).
  • [optionally] the named entities. For example, if you compare the output of the intermediate stages, you will notice that "Nature" was tagged as "organization", and that this tag came not from TreeTagger but from runIdentitiTagger.
  • the information in the opening "s" tag, such as aPos="2784485" corpus="AP" docInfo="apwsE941123.0183", which FN eventually needs in order to have some sequence number to "control" internal functions.
Thus, in the aforementioned example from v.9.9, all of this information is present. If your NEW tagger lets you incorporate all this information and also produce output in the nested-bracket format, then you will be able to call abney_to_done.pl and the rest of the pipeline.
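
To make the three pieces of information concrete, here is a rough sketch of how they could be read back out of a sentence block like the one above. The function name parse_sentence is made up for illustration and is not part of the existing pipeline. It also assumes that every token line has exactly three whitespace-separated fields (word, tag, lemma) and that the target is a single word.

```python
import re

# Matches attribute pairs such as aPos="2784485" in the opening <s ...> tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# Matches a word wrapped as <target>...</target>.
TARGET_RE = re.compile(r"^<target>(.*)</target>$")


def parse_sentence(block):
    """Return (attrs, tokens, target_index) for one <s> ... </s> block.

    attrs        : dict of the attributes in the opening <s> tag
    tokens       : list of (word, tag, lemma) tuples
    target_index : position of the target word in tokens, or None
    """
    attrs, tokens, target_index = {}, [], None
    for line in block.strip().splitlines():
        line = line.strip()
        if not line or line == "</s>":
            continue
        if line == "<s>" or line.startswith("<s "):
            attrs = dict(ATTR_RE.findall(line))
            continue
        # Assumption: exactly three fields per token line.
        word, tag, lemma = line.split()
        match = TARGET_RE.match(word)
        if match:
            word = match.group(1)
            target_index = len(tokens)
        tokens.append((word, tag, lemma))
    return attrs, tokens, target_index
```

The words alone (the first element of each tuple) could then be handed to the new tagger, while attrs, target_index and tags such as "organization" are kept aside and merged back into its output.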

Thanks,
Mario

Wednesday, June 07, 2006

Mario has left Austin for a summer internship in Bangalore, India. Sumeet, who happens to come from that same city (!), is continuing work on the GFN setup where Mario left off: finishing all steps of the pipeline so that we can start with sample annotations.