German FrameNet

Monday, May 22, 2006

How-to's

Configure client:
client_home is the path to the directory containing the client files.
  1. Open the file client_home/bin/RunClass.sh
  2. Replace the variables at the top with the IP address or hostname of the FN server and the path to the java binary (at the prompt, run which java to determine this path).

Start client:
client_home is the path to the directory containing the client files. At the prompt, execute:
# cd client_home/bin
# ./FNDesktop.sh

Note that the server must be running for the client to work. Also, an active FN account is needed to use FNDesktop.sh.

Start server:
There are two copies of the server files under /home/framenet/current, one for the English and one for the German FrameNet, running on TCP ports 1098 and 1099. The two copies are identical, except that some tables that contain sample records in the English version are empty in the German counterpart.

To start either server, follow the instructions given in the file server_home/bin/README
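Since the client only works while the server is running, a quick reachability check can save some confusion before launching FNDesktop.sh. The following is a minimal sketch, not part of the FN distribution: the function name is_port_open is made up for illustration, and the host you pass in is whatever you configured in RunClass.sh. The ports 1098 and 1099 are the ones mentioned above.

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only shows that something is listening on
    the port, not that the FN server behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved hostnames.
        return False
```

For example, is_port_open("localhost", 1098) and is_port_open("localhost", 1099) would report whether anything is listening on the two server ports on the local machine.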

Log into MySQL Database:
At the prompt:
# su
# su framenet
# mysql
[mysql]# use gnframenet; -- there is also an English database
[mysql]# show tables;

General Pipeline | Pending Parts


B.
In order to properly import a corpus into CQP, the corpus needs "p" XML tags surrounding every paragraph and "s" tags surrounding every sentence. However, the text in the German corpus contains only "p" tags. Therefore, we have to look for a program that performs sentence-boundary detection for us so that we can add the "s" tags. The program mxterminator is used as the sentence-boundary detector for the English FN.
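To make the target of this step concrete, here is a minimal sketch of wrapping the sentences of one paragraph in "s" tags. It is only an illustration under stated assumptions: the regular expression is a naive stand-in for a real sentence-boundary detector such as mxterminator, the function names are made up, and the paragraph text is assumed to be already XML-escaped.

```python
import re

# Naive assumption: a sentence ends at '.', '!' or '?' followed by
# whitespace and then an uppercase letter, a digit or an opening quote.
# This will wrongly split after abbreviations such as "z. B.".
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-ZÄÖÜ0-9"„])')

def split_sentences(paragraph):
    """Split one paragraph of plain text into a list of sentences."""
    parts = _BOUNDARY.split(paragraph.strip())
    return [p for p in parts if p]

def tag_paragraph(paragraph):
    """Return the paragraph wrapped in <p>, one <s> element per sentence."""
    body = "\n".join("<s>" + s + "</s>" for s in split_sentences(paragraph))
    return "<p>\n" + body + "\n</p>"
```

A real detector would replace split_sentences; the wrapping in tag_paragraph would stay the same.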

A.
Eventually, every sentence imported into FN will need an ID assigned to it. These IDs are not part of the original corpus; rather, a (Perl) script written for the English FN takes an entire corpus and outputs the same corpus with an ID prepended to each sentence. A sample ID along with its sentence looks as follows:
apwsE941117.0373=1=7=1 A sample sentence.
After being transformed by intermediate scripts, the ID information is written to the final XML file given to FarinaImport.sh like this:
docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1"
We can either ask Collin to send us this script so that we can reuse it, or write a similar script ourselves.
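If we end up writing a similar script ourselves, its core could look roughly like the sketch below. This is not Collin's Perl script, which we have not seen. It assumes that the four "="-separated fields of the ID are the document name, text number, paragraph number and sentence number (as the attribute names above suggest), that numbering starts at 1, and that the ID and the sentence are separated by a single space. All function names are made up.

```python
def make_sentence_id(doc_info, text_no, para_no, sent_no):
    """Build an ID such as 'apwsE941117.0373=1=7=1'."""
    return "{}={}={}={}".format(doc_info, text_no, para_no, sent_no)

def prepend_ids(doc_info, text_no, paragraphs):
    """paragraphs is a list of paragraphs, each a list of sentences.

    Returns one output line per sentence, with its ID prepended.
    """
    lines = []
    for para_no, sentences in enumerate(paragraphs, start=1):
        for sent_no, sentence in enumerate(sentences, start=1):
            sent_id = make_sentence_id(doc_info, text_no, para_no, sent_no)
            lines.append(sent_id + " " + sentence)
    return lines

def split_id_line(line):
    """Split 'ID sentence' into a dict of ID fields and the sentence."""
    sent_id, _, sentence = line.partition(" ")
    # rsplit keeps any '=' inside the document name intact.
    doc_info, text_no, para_no, sent_no = sent_id.rsplit("=", 3)
    fields = {"docInfo": doc_info, "textNo": text_no,
              "paraNo": para_no, "sentNo": sent_no}
    return fields, sentence

def id_attributes(fields):
    """Format the ID fields as XML attributes for the <s> element."""
    return ('docInfo="{docInfo}" textNo="{textNo}" '
            'paraNo="{paraNo}" sentNo="{sentNo}"').format(**fields)
```

With the sample line above, split_id_line followed by id_attributes gives docInfo="apwsE941117.0373" textNo="1" paraNo="7" sentNo="1".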

C.
A named-entity tagger is a program that takes a sentence as input and identifies (tags) the parts of the sentence that belong to an entity. Entities can be proper names, names of cities and places, names of companies, countries, etc. Here the task is to search the Internet for either an open-source or a commercial named-entity tagger for German. (In an initial phase, this part of the pipeline may be skipped, at the cost that annotators will have to detect entities manually during the annotation process.)

D.
Chunker
Since, for the English FN, Abney's chunker produces output as follows:
[nmess lemma=
h=[nx lemma=
h=[person
[nnp lemma=Prince Prince]
[person lemma=Philip Philip]]]]
[vvd lemma=lament lamented]
[comp lemma=that that]
[nil lemma=`` ``]
[nmess lemma
...

and since our IMS chunker for German produces an output as follows:
<NC>
Eine ART
weitere ADJA
Schwierigkeit NN
</NC>
<VC>
besteht VVFIN
</VC>
darin PAV
, $,
daß KOUS
<NC>
die ART
Kameras NN
</NC>
nur ADV
dann ADV
<NC>
verwertbares ADJA
Bildmaterial NN
</NC>
<VC>
liefern VVFIN
</VC>
, $,
wenn KOUS
<NC>
die ART
See NN
</NC>
einigermaßen ADV
ruhig ADJD
<VC>
ist VAFIN
</VC>
. $.

Here, the task is to take the output of the IMS chunker for German and convert it to the format produced by Abney's chunker for English. Another approach would be to find a chunker for German that already produces output in the same format as Abney's.
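As a starting point for such a converter, the sketch below parses the IMS output into a nested structure and prints it with brackets. It is only loosely modelled on the Abney sample above: the exact Abney format (for example the h= head markers) is not reproduced, the IMS output carries no lemmas, so the word form is used as a placeholder for lemma=, and lowercasing the STTS tags is a guess. All function names are made up.

```python
def parse_ims(lines):
    """Parse IMS chunker lines into a list of nodes.

    A chunk is (label, [children]); a token is (word, tag).
    """
    root = ("ROOT", [])
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            label = line[2:-1]
            if len(stack) == 1 or stack[-1][0] != label:
                raise ValueError("unexpected closing tag: " + line)
            stack.pop()
        elif line.startswith("<") and line.endswith(">"):
            node = (line[1:-1], [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            # Token line: word form, whitespace, part-of-speech tag.
            word, tag = line.rsplit(None, 1)
            stack[-1][1].append((word, tag))
    if len(stack) != 1:
        raise ValueError("unclosed chunk: " + stack[-1][0])
    return root[1]

def to_brackets(nodes, indent=0):
    """Render parsed nodes as indented, bracketed lines."""
    out = []
    pad = "  " * indent
    for first, second in nodes:
        if isinstance(second, list):  # chunk node
            out.append(pad + "[" + first.lower())
            out.extend(to_brackets(second, indent + 1))
            out[-1] += "]"
        else:  # token node; lemma is a placeholder (the word form)
            out.append("{}[{} lemma={} {}]".format(
                pad, second.lower(), first, first))
    return out
```

Calling "\n".join(to_brackets(parse_ims(open(path, encoding="utf-8")))) on a chunked file would give a first bracketed version to compare against what the FN Java classes expect.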

As a last resort, this stage of the pipeline may be skipped initially, but the effort that skipping requires suggests that it is better not to skip it. If it were skipped, we would have to modify the existing Java classes of FrameNet, remove all the calls made in ProcessRules.sh that filter the sentences (given some rules), and keep only the functions that add the (now unfiltered) sentences to the FN database. Again, this seems too complex to be recommended.


E.
Assume we have successfully obtained an XML file containing subcorpora that is ready to be added to FN's database. For this task, we need to use the FarinaImport.sh script.

However, we could not import a sample XML file of subcorpora into the German FN: FarinaImport.sh produced a Java exception whose cause we could not find (see the previous blog entry for details).

Both Collin and Marc Ortega (from the SpanishFN) helped me debug this problem, but we had no success. This is not to say that there is no solution; because of time constraints, we simply did not find it.

Given the following (simplistic) sample XML subcorpus:
<?xml version="1.0" encoding="UTF-8"?>
<subcorpora frame="Adjusting" lexunit="calibrate.v" lemma="calibrate" pos="V">
<annoset-conf classify-type="fn2.farina.classify.FNClassifierPenn">
<annoset-model type="POS">
<layer-model containsPOS="y">PENN</layer-model>
</annoset-model>
<annoset-model type="standard">
<layer-model>Target</layer-model>
<layer-model>FE</layer-model>
<layer-model>GF</layer-model>
<layer-model>PT</layer-model>
<layer-model>Other</layer-model>
<layer-model>Sent</layer-model>
</annoset-model>
</annoset-conf>
<subcorpus scName="02-T-NP-PPto" maxSize="20">
<s tStart="0" tEnd="11" aPos="7382282" corpus="BNC2" docInfo="bncp" textNo="372" paraNo="162" sentNo="9">
<!--
<text>calibrated .</text>
<words>
<w pos="VVN" wf="calibrated" target="y" start="0" end="9">calibrate</w>
<w pos="SENT" wf="." start="11" end="11">.</w>
</words>
-->
<text>.</text>
<words>
<w pos="SENT" wf="." start="0" end="0">.</w>
</words>
<labels>
</labels>
</s>
</subcorpus>
</subcorpora>

This is what we tried:
  • We noted that adding an empty subcorpus worked successfully. That is, FarinaImport.sh was able to properly add a record to the SubCorpus table. This means that FarinaImport.sh successfully communicates with the German FN database.
  • As soon as we included a corpus with at least one "w" tag, the given Java exception was thrown. This is the mysterious part, and we could not figure out the reason for it.
    • We quickly checked the records in the MiscLabel and LabelType tables corresponding to the Penn tagset (yes, our sample file uses the Penn tagset; adding the German STTS tagset to these tables is still pending). At first glance, these tables seemed to contain correct information. However, going over the records of these tables and confirming that they have correct values for the tags involved in the example would be a starting point for debugging.
    • We manually added records to the Corpus and Document tables, as we were unsure whether FarinaImport.sh adds initial records to these tables when they are empty. It made no difference: the Java exception was still thrown.
  • Marc sent me a new version of the client and server parts of the original English FN. I tried the client part and the Java error still appeared. I tried to configure the server part of the new release but could not configure it properly. Elias might know how to do so.
  • I tried adding the sample file to both our English and our German versions of FN, and both threw the same Java exception.
  • From within FNDesktop, I tried assigning different statuses to the given LU, and it made no difference.
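One more check that could be run on an input file before handing it to FarinaImport.sh is whether the start and end attributes of every "w" element really point at the word form inside the sentence text. The sketch below does that with Python's standard XML parser. It assumes that the offsets are character indices with an inclusive end, which is what the commented-out part of the sample suggests ("calibrated" spans 0 to 9), and the function name is made up. It would not explain the Java exception by itself; it only rules out one kind of malformed input.

```python
import xml.etree.ElementTree as ET

def check_word_offsets(xml_text):
    """Return a list of offset problems found in <s> elements.

    Assumption: start/end are character offsets into <text>, with the
    end offset inclusive. An empty list means no mismatch was found.
    XML comments are ignored by the parser.
    """
    problems = []
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        text = s.findtext("text") or ""
        for w in s.iter("w"):
            start = int(w.get("start"))
            end = int(w.get("end"))
            wf = w.get("wf")
            found = text[start:end + 1]
            if found != wf:
                problems.append(
                    "sentNo={}: expected {!r} at {}-{}, found {!r}".format(
                        s.get("sentNo"), wf, start, end, found))
    return problems
```

For the sample file above, the only active "w" element is the period at offset 0, so the check would pass.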

Other parts
Most of the scripts in the remaining part of the diagram have not been tested, and some of them may require changes. It is highly advisable to review the script FN2Import.sh, as it is the "mother" script that calls all the parts in the middle column of the diagram.

Thanks a lot to Collin and Marc for all of their invaluable help and cooperation.

Wednesday, May 17, 2006

CQP, Chunker & FarinaImport

CQP
I have been working on producing a CQP query result for the German corpora in a format similar to this sample for the English corpus (keyword=laments):
1517620: apwsE941117.0373=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office .
1528335: apwsE941117.0393=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office

versus the current German output (keyword=des):
260: überzeugenden Darlegungen Chefs {des} Europäischen W
262: den Darlegungen {des} Chefs Europäischen Währungsins


However, there are a few things that I cannot output in the German CQP query result.
  • Sentence context: While the English sample outputs the entire sentence containing the matched keyword, the German counterpart can only output a fixed number of characters surrounding the keyword, since it has no information about sentence boundaries. Our original German corpus has XML tags that delimit paragraphs but none that delimit sentences; consequently, when imported into CQP, the German corpus has no "s" (sentence) s-attribute defined. Thus, entering set context s; produces an error in CQP, and full sentences cannot be output.
    A possible workaround would be to write a script that inserts "s" XML tags delimiting each sentence, so that CQP has the sentence "s" attribute. Do you know whether CQP can accept a context whose boundary is a string (in this case we could use the period "." as the end-of-sentence boundary)? I have already tried, for instance, set context "."; and received an error.
  • Additional information: I noted that the part of the CQP output consisting of apwsE941117.0373=1=7=1 is eventually transformed into aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1". However, the CQP output produced from our German corpus does not include this information. What does this information represent, and is it necessary that we include it? If so, would a look at the pipeline you use to import the English corpus into CQP help us? (This would also help us check that our pre-CQP pipeline is not missing anything.)
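To compare the two kinds of output mechanically, a small parser for these result lines can help. The sketch below is based only on the sample lines shown in this post: a corpus position followed by a colon, an optional ID of the form name=n=n=n, and the matched keyword in curly braces. The function name and the returned field names are made up, and other CQP print settings would need a different pattern.

```python
import re

# position ':' [optional ID like apwsE941117.0373=1=7=1] context
_LINE = re.compile(r'^\s*(\d+):\s+(?:(\S+=\d+=\d+=\d+)\s+)?(.*)$')
_KEYWORD = re.compile(r'\{([^{}]*)\}')

def parse_kwic(line):
    """Split one query-result line into position, ID, keyword, context."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError("not a query-result line: " + repr(line))
    position, sent_id, context = m.groups()
    kw = _KEYWORD.search(context)
    return {
        "position": int(position),
        "sentence_id": sent_id,  # None when the line has no ID
        "keyword": kw.group(1) if kw else None,
        "context": context,
    }
```

On the English sample, sentence_id comes back as apwsE941117.0373=1=7=1; on the current German output it is None, which is exactly the missing piece discussed above.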
Chunker
Since, for the English FN, Abney's chunker produces output as follows:
[nmess lemma=
h=[nx lemma=
h=[person
[nnp lemma=Prince Prince]
[person lemma=Philip Philip]]]]
[vvd lemma=lament lamented]
[comp lemma=that that]
[nil lemma=`` ``]
[nmess lemma
...

and since our IMS chunker for German produces an output as follows:
<NC>
Eine ART
weitere ADJA
Schwierigkeit NN
</NC>
<VC>
besteht VVFIN
</VC>
darin PAV
, $,
daß KOUS
<NC>
die ART
Kameras NN
</NC>
nur ADV
dann ADV
<NC>
verwertbares ADJA
Bildmaterial NN
</NC>
<VC>
liefern VVFIN
</VC>
, $,
wenn KOUS
<NC>
die ART
See NN
</NC>
einigermaßen ADV
ruhig ADJD
<VC>
ist VAFIN
</VC>
. $.


Is it feasible to use Abney's chunker/parser to parse the German chunked data and produce a format with nested brackets similar to the English counterpart? Would it be better to modify the Java classes of FN to support the current format of the chunked German text? Or is another method more feasible?

FarinaImport.sh
Assuming that the pipeline is complete, I used the file Adjusting.calibrate.v.v.processed in order to test whether FarinaImport.sh works correctly with our German FN. However, after executing this script I obtain the following error:

~/framenet/client/german-client/bin> FarinaImport.sh ~/framenet/collin/Adjusting.calibrate.v.v.processed
[FNProperties] ./..
[FNProperties] loading from file ./../conf/fnclient.properties
[FNProperties] loading from file /u/guajardo/.fnclient.properties
log4j:WARN No appenders could be found for logger (fn2.farina.clients.FNProperties).
log4j:WARN Please initialize the log4j system properly.
[FNProperties] Using server [framenet...]
username:[my user]
password:[my pass]

Importing /u/guajardo/framenet/collin/Adjusting.calibrate.v.v.processed...
Processing on server...Exception in thread "main" fn2.farina.exception.ImportException: Import Exception: javax.ejb.TransactionRolledbackLocalException: Unexpected Error
java.lang.NoClassDefFoundError
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
...

Do you know what might cause this exception? I went over the contents of the MySQL DB for the German FN and noted that the following relevant tables are empty:
  • Corpus
  • Document
  • SubCorpus
  • Genre
  • Paragraph
  • Sentence
  • Annotation Set
Thus, do you have sample initial values for these tables? I tried to infer what some of the initial records for these tables might be, but my attempt was pretty much trial and error. For instance, I added a new record to each of the Corpus, Document and Genre tables, and the aforementioned Java exception was still thrown. Also, do we need to set a particular status in FNDesktop for Adjusting.calibrate.v? I am thinking that a given status might need to be set for FarinaImport.sh to work on that LU.

Wednesday, May 03, 2006

Subcorporation Pipeline


The attached diagram depicts Collin's response in graphical form. I also included the pipeline that we follow to pre-process our German corpus so that it can be imported into CQP; though this pipeline is not yet fully implemented, I know how to implement this portion of the diagram. For the rest of the diagram, however, I still have some questions about how the different pieces fit together. This is just a first draft, and as things progress I will incorporate more detailed information into the diagram.

Questions

The following questions correspond to specific parts of the diagram:

A:

Within FNDesktop, particularly within the Subcorpus Rule-Definition GUI, the following error appeared when trying to save a sample rule: “You must select a corpus before you can save!” I noticed that the pull-down menu for the Corpus field does not show any corpus value. How can we add a corpus to FNDesktop so that we can save the rules?

Also, in the file conf/fnclient.properties, what does the following variable point to? rule_path=/n/jolt/da/aicorpus/fncorp/FErec Is it related to the aforementioned error?

Is it also related to the fact that, within FNDesktop, enabling "Main/Tree Mode/Corpus Mode" makes all the frames in the left column disappear, so that FNDesktop lists no frames at all? Why does this happen when the other Tree Modes (i.e., Corpus, Semantic Type, Inheritance and Using) list all of the frames?

B:
How do we call this script and what command-line argument shall we provide for it?

C:
Will the shell script in B call the CQP engine? Is there a special directory where the CQP engine must reside, and/or any other special configuration for CQP? Also note that we already know how to perform the steps in the block Pre-processing German Corpus.

D:
What format shall the CQP output have? For example, we are able to produce KWIC format from CQP:

260: überzeugenden Darlegungen Chefs des Europäischen W
262: den Darlegungen des Chefs Europäischen Währungsins
373: Partei ` Für Lettland ' ' deutschen Rechtsradikale
510: Partei ` Für Lettland ' ' deutschen Rechtsradikale
530: - Die Zahl der Todesopfer Erdbebens in der westtür
584: ürden in Zelten im Garten Krankenhauses behandelt
952: trierte sich nach Angaben bosnischen Rundfunks auf
968: und Sanski Most im Westen Landes . In allen übrige


E:
What does this output look like? We are able to chunk our sentences as follows:
<s>
<PC>
Im APPRART
Innern NN
<NC>
dieser PDAT
Insel NN
</NC>
</PC>
<NC>
der ART
wenigen PIS
Seligen NN
</NC>
- $(
<NC>
ihre PPOSAT
Familien NN
</NC>
<VC>
hätten VAFIN
</VC>
<NC>
die ART
Kongreßmitglieder NN
</NC>
nicht PTKNEG
<VC>
mitbringen VVINF
dürfen VMINF
</VC>
- $(
<VC>
war VAFIN
</VC>
<NC>
Platz NN
</NC>
<PC>
für APPR
800 CARD
Menschen NN
. $.
</s>

F:
This part is “cloudy”: it is not clear how the pipeline flows up to the point where the subcorpora can be imported into the FrameNet DB.

G:
Does this refer to the FE Classifier mentioned in the Farina Book, section 6.8? How is this Java class invoked, and by whom? Where is the output of this classifier sent: to FarinaProcessRules.sh, FarinaImport.sh, or somewhere else?

H:
How is this script called (e.g., arguments and other required input) and what part of the pipeline does it go into?

Monday, May 01, 2006

Problems with subcorpora creation and import into FN Desktop

Met with Elias and Mario today to discuss further progress. Right now, we are having two major issues that we are trying to resolve:

(1) integrating the FN Desktop with CQP and other parts of "the pipeline" so that we can create subcorpora for import;

(2) importing the subcorpora into the FN Desktop so that we can start annotating.

Both points sound like they should be straightforward, but they turn out to be much harder than we initially thought. Mario will be getting in touch with Collin to sort out these issues.