German FrameNet

Wednesday, May 17, 2006

CQP, Chunker & FarinaImport

CQP
I have been working on producing a CQP query result for the German corpora in a format similar to this sample from the English corpus (keyword=laments):
1517620: apwsE941117.0373=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office .
1528335: apwsE941117.0393=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office

versus the current German output (keyword=des):
260: überzeugenden Darlegungen Chefs {des} Europäischen W
262: den Darlegungen {des} Chefs Europäischen Währungsins


However, there are a few things that I cannot output in the German CQP query result.
  • Sentence context: While the English sample outputs the entire sentence containing the matched keyword, the German counterpart can only output a fixed number of characters surrounding the keyword, because it has no information about sentence boundaries. Our original German corpus has XML tags to delimit paragraphs but none to delimit sentences, so when it is imported into CQP it has no "s" (sentence) s-attribute defined. Entering set context s; therefore produces an error in CQP, and full sentences cannot be output.
    A possible workaround would be to write a script that inserts "s" XML tags around each sentence so that CQP has the sentence "s" attribute. Do you know whether CQP can accept a context whose boundary is a string (in this case we could use the period "." as the end-of-sentence boundary)? I have already tried, for instance, set context "."; and received an error.
  • Additional information: I noticed that the part of the CQP output consisting of apwsE941117.0373=1=7=1 is eventually transformed into aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1". However, the CQP output produced from our German corpus does not include this information. What does this information represent, and is it necessary that we include it? If so, would taking a look at the pipeline you use to import the English corpus into CQP help us? (It would also help us check that our pre-CQP pipeline is not missing anything.)
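
To make the two points above concrete, here is a minimal Python sketch. It is not part of the existing pipeline, and both function names are hypothetical. add_sentence_tags assumes the corpus is available in a one-token-per-line form with the STTS part-of-speech tag as the last field, that "$." marks sentence-final punctuation, and that the only XML tags present are paragraph-level ones. parse_match_line assumes the English result lines have the shape shown in the sample above; the field names are taken from the attribute names quoted in the second point, and whether the leading number corresponds to aPos is only a guess.

```python
import re

# Assumption: STTS tags sentence-final punctuation as "$.".
SENTENCE_END_TAG = "$."

def add_sentence_tags(lines):
    """Wrap token lines in <s>...</s>, closing a sentence after each "$." token."""
    out = []
    in_sentence = False
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("<") and stripped.endswith(">"):
            # Structural tag (assumed paragraph-level): close any open sentence first.
            if in_sentence:
                out.append("</s>")
                in_sentence = False
            out.append(stripped)
            continue
        if not in_sentence:
            out.append("<s>")
            in_sentence = True
        out.append(stripped)
        # The POS tag is assumed to be the last whitespace-separated field.
        if stripped.split()[-1] == SENTENCE_END_TAG:
            out.append("</s>")
            in_sentence = False
    if in_sentence:
        out.append("</s>")
    return out

# Shape assumed from the English sample: "<number>: <doc>=<text>=<para>=<sent> <sentence>"
MATCH_RE = re.compile(r"^\s*(\d+):\s+(\S+)=(\d+)=(\d+)=(\d+)\s+(.*)$")

def parse_match_line(line):
    """Split one English-style result line into its parts, or return None if it does not fit."""
    m = MATCH_RE.match(line)
    if m is None:
        return None
    position, doc_info, text_no, para_no, sent_no, text = m.groups()
    return {
        "position": position,  # leading number; its relation to aPos is a guess
        "docInfo": doc_info,
        "textNo": text_no,
        "paraNo": para_no,
        "sentNo": sent_no,
        "text": text,
    }
```

If the corpus were re-encoded with "s" declared as an s-attribute after such a preprocessing step, set context s; should then have a region to work with. A German line such as 260: überzeugenden Darlegungen ... makes parse_match_line return None, which is one way to see that the document, paragraph and sentence identifiers are missing from our output.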
Chunker
Since, for the English FN, Abney's chunker produces output as follows:
[nmess lemma=
h=[nx lemma=
h=[person
[nnp lemma=Prince Prince]
[person lemma=Philip Philip]]]]
[vvd lemma=lament lamented]
[comp lemma=that that]
[nil lemma=`` ``]
[nmess lemma
...

and since our IMS chunker for German produces an output as follows:
<NC>
Eine ART
weitere ADJA
Schwierigkeit NN
</NC>
<VC>
besteht VVFIN
</VC>
darin PAV
, $,
daß KOUS
<NC>
die ART
Kameras NN
</NC>
nur ADV
dann ADV
<NC>
verwertbares ADJA
Bildmaterial NN
</NC>
<VC>
liefern VVFIN
</VC>
, $,
wenn KOUS
<NC>
die ART
See NN
</NC>
einigermaßen ADV
ruhig ADJD
<VC>
ist VAFIN
</VC>
. $.


Is it feasible to use Abney's chunker/parser to parse the German chunked data and produce a format with nested brackets similar to the English counterpart? Would it be better to modify the Java classes of FN to support the current format of the chunked German text? Or is another method more feasible?
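
As one illustration of the "another method" option, the following is a rough Python sketch that rewrites the IMS chunker output into nested brackets. It is only loosely modeled on the English sample: the function name chunks_to_brackets is hypothetical, the output carries no lemma= field because the German sample above has no lemmas, and I do not know whether the FN Java classes would accept this exact layout. It assumes each token line ends with its POS tag and that chunk tags such as <NC> and </NC> sit on their own lines.

```python
def chunks_to_brackets(lines):
    """Convert IMS-style chunk lines into bracketed lines, one token per line."""
    out = []
    depth = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            # Closing chunk tag: attach the closing bracket to the previous line.
            depth -= 1
            if depth < 0:
                raise ValueError("closing tag without matching opening tag: " + line)
            out[-1] += "]"
        elif line.startswith("<") and line.endswith(">"):
            # Opening chunk tag, e.g. <NC> becomes "[nc".
            out.append("  " * depth + "[" + line[1:-1].lower())
            depth += 1
        else:
            # Token line: the word, then the POS tag as the last field.
            word, tag = line.rsplit(None, 1)
            out.append("  " * depth + "[" + tag.lower() + " " + word + "]")
    if depth != 0:
        raise ValueError("unclosed chunk tag at end of input")
    return out
```

Applied to the first lines of the German sample, this turns the <NC> block into "[nc" followed by indented "[art Eine]", "[adja weitere]" and "[nn Schwierigkeit]]". Adding lemmas would need a separate lemmatization step that is not shown here.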

FarinaImport.sh
Assuming that the pipeline is complete, I used the file Adjusting.calibrate.v.v.processed to test whether FarinaImport.sh works correctly with our German FN. However, after executing this script I get the following error:

~/framenet/client/german-client/bin> FarinaImport.sh ~/framenet/collin/Adjusting.calibrate.v.v.processed
[FNProperties] ./..
[FNProperties] loading from file ./../conf/fnclient.properties
[FNProperties] loading from file /u/guajardo/.fnclient.properties
log4j:WARN No appenders could be found for logger (fn2.farina.clients.FNProperties).
log4j:WARN Please initialize the log4j system properly.
[FNProperties] Using server [framenet...]
username:[my user]
password:[my pass]

Importing /u/guajardo/framenet/collin/Adjusting.calibrate.v.v.processed...
Processing on server...Exception in thread "main" fn2.farina.exception.ImportException: Import Exception: javax.ejb.TransactionRolledbackLocalException: Unexpected Error
java.lang.NoClassDefFoundError
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
...

Do you know what might be the reason for this exception? I went over the contents of the MySQL DB for the German FN and noticed that the following relevant tables are empty:
  • Corpus
  • Document
  • SubCorpus
  • Genre
  • Paragraph
  • Sentence
  • Annotation Set
Do you have sample initial values for these tables? I tried to infer what some of the initial records for these tables might be, but my attempt was pretty much trial and error. For instance, I added a new record to each of the Corpus, Document and Genre tables, and the aforementioned Java exception was still thrown. Also, do we need to set a particular status in FNDesktop for Adjusting.calibrate.v? I suspect that a given status might need to be set in order for FarinaImport.sh to work on that LU.
