CQP, Chunker & FarinaImport
CQP
I have been working on producing CQP query results for the German corpora in a format similar to this sample from the English corpus (keyword=laments):
1517620: apwsE941117.0373=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office .
1528335: apwsE941117.0393=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office
versus the current German output (keyword=des):
260: überzeugenden Darlegungen Chefs {des} Europäischen W
262: den Darlegungen {des} Chefs Europäischen Währungsins
However, there are a few things that I cannot output in the German CQP query result.
- Sentence context: While the English sample outputs the entire sentence containing the matched keyword, the German counterpart can only output a fixed number of characters surrounding the keyword, since it carries no information about sentence boundaries. Our original German corpus has XML tags only for delimiting paragraphs, with no tag delimiting sentences; consequently, when it is imported into CQP, the corpus has no "s" (sentence) s-attribute defined. Entering set context s; therefore produces an error in CQP, which leaves us unable to output full sentences. A possible workaround would be to write a script that inserts "s" XML tags delimiting each sentence, so that CQP gets the "s" attribute; a sketch of such a script follows this list. Do you know whether CQP can accept a context whose boundary is a string (in this case we could use the period "." as an end-of-sentence boundary)? I have already tried, for instance, set context "."; and received an error.
- Additional information: I noted that the part of the CQP output consisting of apwsE941117.0373=1=7=1 is eventually transformed into aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1". However, the CQP output produced from our German corpus does not include this information. What does this information represent, and is it necessary that we include it? If so, would taking a glance at the pipeline you use to import the English corpus into CQP help us? (This would also help us check whether our pre-CQP pipeline is missing anything; see also the session sketch after this list.)
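Here is a minimal sketch of the sentence-tagging script I have in mind (Python, my own draft, not existing pipeline code). It assumes verticalized one-token-per-line input with the word form in the first whitespace-separated column, and uses sentence-final punctuation as a naive boundary heuristic; abbreviations such as "z.B." would be mis-split, so this is only a first approximation:

#!/usr/bin/env python
# Sketch: insert <s>...</s> tags into a verticalized (one token per line)
# corpus file so that CQP can be given an "s" structural attribute.
# Assumptions: the word form is the first whitespace-separated column,
# and ".", "!" or "?" mark sentence ends (abbreviations will mis-split).
import sys

open_s = False
for line in sys.stdin:
    token = line.rstrip('\n')
    if not token.strip():
        continue
    if token.startswith('<'):      # pass existing XML tags (e.g. <p>) through
        if open_s:                 # close a sentence left open at a region end
            print('</s>')
            open_s = False
        print(token)
        continue
    if not open_s:
        print('<s>')
        open_s = True
    print(token)
    if token.split()[0] in ('.', '!', '?'):
        print('</s>')
        open_s = False
if open_s:                         # close a dangling sentence at end of file
    print('</s>')

The corpus would then have to be re-encoded so that CQP knows about the new region (cwb-encode's -S option declares structural attributes, if I read the documentation correctly).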
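Regarding the missing metadata, my guess (which is why I ask above) is that docInfo, textNo, paraNo and sentNo are values of structural attributes encoded with the English corpus, and aPos is the corpus position of the match, so they would only appear for us once the corresponding attributes exist in the German corpus. A hypothetical CQP session might then look like the following (GERMAN and text_id are placeholder names, not our real corpus or attribute names):

GERMAN> set Context 1 s;
GERMAN> set PrintStructures "text_id";
GERMAN> "des";

I am not certain this is how your English pipeline produces those fields, so please correct me if the mechanism is different.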
Chunker
Since for the English FN, Abney's chunker produces an output as follows:
[nmess lemma=
h=[nx lemma=
h=[person
[nnp lemma=Prince Prince]
[person lemma=Philip Philip]]]]
[vvd lemma=lament lamented]
[comp lemma=that that]
[nil lemma=`` ``]
[nmess lemma
...
and since our IMS chunker for German produces an output as follows:
<NC>
Eine ART
weitere ADJA
Schwierigkeit NN
</NC>
<VC>
besteht VVFIN
</VC>
darin PAV
, $,
daß KOUS
<NC>
die ART
Kameras NN
</NC>
nur ADV
dann ADV
<NC>
verwertbares ADJA
Bildmaterial NN
</NC>
<VC>
liefern VVFIN
</VC>
, $,
wenn KOUS
<NC>
die ART
See NN
</NC>
einigermaßen ADV
ruhig ADJD
<VC>
ist VAFIN
</VC>
. $.
Is it feasible to use Abney's chunker/parser to parse the German chunked data and produce a format with nested brackets similar to the English counterpart? Would it be better to modify the Java classes of FN to support the current format of the chunked German text? Or is another method more feasible?
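In case it helps the discussion, here is a rough Python sketch (my own draft, not existing FN code) that rewrites the flat IMS format above into nested brackets loosely modelled on Abney's output. Since the IMS output carries no lemmas, the surface form is simply repeated where a lemma would go:

#!/usr/bin/env python
# Sketch: convert IMS chunker output (token and STTS tag per line, chunk
# boundaries marked by <NC>/<VC>) into a bracketed, Abney-like format.
# The lemma field is faked with the surface form -- a real conversion
# would have to plug in an actual German lemmatizer.
import sys

def convert(stream):
    depth = 0
    for raw in stream:
        line = raw.strip()
        if not line:
            continue
        if line in ('<NC>', '<VC>'):
            yield '  ' * depth + '[' + line[1:-1].lower()
            depth += 1
        elif line in ('</NC>', '</VC>'):
            depth -= 1
            yield '  ' * depth + ']'
        else:
            parts = line.split()
            word, tag = ' '.join(parts[:-1]), parts[-1]
            yield '  ' * depth + '[%s lemma=%s %s]' % (tag.lower(), word, word)

if __name__ == '__main__':
    for out_line in convert(sys.stdin):
        print(out_line)

This only mirrors the bracketing, of course: it reproduces neither Abney's head (h=) marking nor phrase categories such as nmess, which is precisely why I am unsure whether converting the format or adapting the FN Java classes is the sounder route.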
FarinaImport.sh
Assuming that the pipeline is complete, I used the file Adjusting.calibrate.v.v.processed to test whether FarinaImport.sh works correctly with our German FN. However, executing the script produces the following error:
~/framenet/client/german-client/bin> FarinaImport.sh ~/framenet/collin/Adjusting.calibrate.v.v.processed
[FNProperties] ./..
[FNProperties] loading from file ./../conf/fnclient.properties
[FNProperties] loading from file /u/guajardo/.fnclient.properties
log4j:WARN No appenders could be found for logger (fn2.farina.clients.FNProperties).
log4j:WARN Please initialize the log4j system properly.
[FNProperties] Using server [framenet...]
username:[my user]
password:[my pass]
Importing /u/guajardo/framenet/collin/Adjusting.calibrate.v.v.processed...
Processing on server...Exception in thread "main" fn2.farina.exception.ImportException: Import Exception: javax.ejb.TransactionRolledbackLocalException: Unexpected Error
java.lang.NoClassDefFoundError
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
...
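Before anything else I will probably try to get a more informative stack trace: the two log4j warnings above mean that no appender is configured on the client, and if the server side is in the same state, detail about the NoClassDefFoundError (e.g. which class failed to load; a JAR missing from the server classpath, such as the JDBC driver, would be my first guess) is being swallowed. A minimal log4j 1.x configuration that routes everything to the console would be the following (whether the Farina client and server actually pick up a log4j.properties from their conf directories is an assumption on my part):

# Minimal log4j configuration: send all log output to the console.
log4j.rootLogger=DEBUG, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %-5p %c - %m%n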
Do you know what the reason for this exception might be? I went over the contents of the MySQL DB for the German FN and noted that the following relevant tables are empty: