German FrameNet

Friday, May 18, 2007

File structure & To-do

The file structure of the German Framenet is currently organized as follows:

current/remote_client/heartofgold/hog Contains the files of the Heart of Gold engine. This includes the chunker and tokenizer. It also includes some of the XSL transformation files such as toFrameNet.xsl
current/remote_client/german-client Includes client utils such as FnDesktop
current/german German version of the FN database
current/english English version of the FN database
current/fnSystem Complete FN database and JBoss that Jisup emailed to us
current/corpus corpus files including:

bin utils
cfg some scripts and config files such as header/footers & DTD files
cqp CQP engine version 3.0
doc some documentation drafts
lib includes OpenSP SGML to XML converter
other tar files of the original corpora as they came on the CDs
raw uncompressed corpora (AFP, APWS, DPA)
tagger IMS tree-tagger
txt plain-text version of corpora (i.e., without tags)
xml XML version of the corpora

current/sandbox misc files

Major To-do items:

Incorporate the H.O.G. engine
Solve the FarinaImport.sh error
Create a few more scripts to plug-in different pipeline components

Sunday, January 21, 2007

Pipeline Version 2

This diagram shows a preliminary version of the new conversion pipeline. Compared to the previous version, we do not use Abney's parser as it was not available for German; instead we are using several tools from the Heart of Gold NLP suite developed at The German Research Center for Artificial Intelligence.

The new tools used include:

Jtok - Tokenizer
Chunkie - Chunker
xsltproc - XSL Transformer
HOG engine - Hosts Jtok & Chunkie processes' RPC

As shown in the diagram, there have been significant changes to the pipeline.

Saturday, January 13, 2007

How-to's for Version 2

Converting Entire corpus from SGML to XML
APWS Corpus
XML Location: HOME/current/corpus/xml

How to install SGML conversion engine:

Download OpenSP from http://sourceforge.net/projects/openjade to a local directory, say /tmp
Decompress and install it with the following commands:

# tar zxvf OpenSP-1.5.2.tar.gz
# cd OpenSP-1.5.2
# ./configure --prefix /home/framenet/current/corpus/lib/opensp --disable-doc-build
# make
# make install

Add /home/framenet/current/corpus/lib/opensp/bin (the bin directory under the --prefix given above) to PATH

How to convert SGML corpus into XML:
# cd /home/framenet/current/corpus/cfg
# perl sgml2xml.pl ../raw/apws_ger ../xml/apws_ger

How to trim XML corpus:
# cd /home/framenet/current/corpus/cfg
# perl trim.pl ../xml/apws_ger ../xml/apws_ger_trimmed

How to convert XML corpus into plain-text:
# cd /home/framenet/current/corpus/cfg
# perl ./xml2txt.pl ../xml/apws_ger_trimmed ../txt/apws_ger ./extract-text.xsl ./remove-short-sent.pl

Install TreeTagger
# cd /home/framenet/current/corpus/tagger
Copy the following files from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

tagger-scripts.tar.gz
tree-tagger-linux-3.1.tar.gz
german-chunker-par-linux-3.1.bin.gz
german-par-linux-3.1.bin.gz
install-tagger.sh

# chmod +x install-tagger.sh
# ./install-tagger.sh

You may have to modify the file cmd/filter-chunker-output.perl so that it contains the correct path to perl (obtained by executing the command "which perl").
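The path adjustment described above could also be scripted. The following is only a minimal Python sketch: it assumes that the perl path sits in the script's first line as a "#!" interpreter line, which the post does not state, and the function name fix_perl_shebang is made up for this example.

```python
import shutil


def fix_perl_shebang(script_text, perl_path=None):
    """Return script_text with the interpreter in its '#!' line replaced.

    Assumption: the path to perl is given in a '#!' first line.
    If perl_path is None, it is looked up like 'which perl' would do.
    """
    if perl_path is None:
        perl_path = shutil.which("perl")
    if perl_path is None:
        raise RuntimeError("perl was not found on PATH")
    lines = script_text.splitlines(keepends=True)
    if lines and lines[0].startswith("#!"):
        # Keep any interpreter arguments (e.g. -w) that follow the old path.
        parts = lines[0][2:].strip().split(None, 1)
        args = " " + parts[1] if len(parts) > 1 else ""
        newline = "\n" if lines[0].endswith("\n") else ""
        lines[0] = "#!" + perl_path + args + newline
    return "".join(lines)


fixed = fix_perl_shebang("#!/usr/local/bin/perl -w\nprint 1;\n", "/usr/bin/perl")
```

Text without a "#!" first line is returned unchanged, so the function is safe to run on a file that does not match the assumption.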

Thursday, June 15, 2006

New German Chunker

Hello Mario

Thanks for the input. I noticed some differences, though. You mentioned /home/framenet/may06/sandbox/framenet/collin/Adjusting.calibrate.v.v.chunked as the input for D. However, your pipeline flowchart shows it to be in fact the output file from Abney's chunker.

Anyway, with the new German chunker that I am testing, it appears that such a file matches neither the input nor the output format.
The input format here is one word per line. Each sentence has to be preceded by an <s> tag and an empty line, for example:
<s>
In
den
Großraumduschen
lag
die
Seife
schon
bereit
.
</s>

which is pretty much what we have at the end of the pre-processing stage.
So do we have to go through the IMS Tree Tagger and all in between?

Do let me know what you think.

Thanks
Sumeet

____________________________

Sumeet:

Let us consider the following example extracted from Complaining.lament.v.v.9:

<s aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1">
Prince nnp Prince
Philip person Philip
<target>lamented</target> vvd lament
that comp that
`` nil ``
lots nns lot
of of of
resources nns resource
are ber be
going vvg go
into in into
economic jj economic
development nn development
and cc and
very rb very
little jj little
into in into
conservation nn conservation
of of of
Nature organization Nature
. sent .
'' nil ''
</s>

You can use your NEW German tagger but I am thinking that its input (more precisely, its eventual output) will have to contain extra information such as:

a tagged target sentence word such as <target>lamented</target> (in the original pipeline, target word is given by the CQP output.)
[optionally] the named entities. For example, if you compare the output of the intermediate stages, you will notice that "Nature" was tagged as "organization" and it was tagged not by TreeTagger but by runIdentitiTagger.
and the information in the opening "s" tag, such as aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" which is eventually needed by FN in order to have some sequence number to "control" internal functions.

Thus, by observing the aforementioned example from v.9.9 one will notice that all of this information is present. If with your NEW tagger you are able to somehow incorporate all this information and, in addition, you are able to produce an output with the format that uses nested brackets then you will be able to call abney_to_done.pl and the rest of the pipeline.

Thanks,
Mario
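
As a rough illustration of the format in Mario's example, the following Python sketch reads one <s> block of "word tag lemma" lines and keeps the three pieces of information listed above: the attributes of the opening "s" tag, the tags (including named-entity tags such as "person"), and the <target> marker. The function name parse_tagged_sentence and the layout of the returned dictionary are assumptions made for this example; they are not part of the FN pipeline.

```python
import re

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
TARGET_RE = re.compile(r'^<target>(.*)</target>$')


def parse_tagged_sentence(block):
    """Parse one <s ...> ... </s> block of 'word tag lemma' lines."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    if not lines[0].startswith("<s") or lines[-1] != "</s>":
        raise ValueError("block must start with <s ...> and end with </s>")
    attrs = dict(ATTR_RE.findall(lines[0]))
    tokens = []
    for ln in lines[1:-1]:
        word, tag, lemma = ln.split()
        match = TARGET_RE.match(word)
        tokens.append({
            "word": match.group(1) if match else word,
            "tag": tag,
            "lemma": lemma,
            "is_target": bool(match),
        })
    return {"attrs": attrs, "tokens": tokens}


sample = """<s aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1">
Prince nnp Prince
Philip person Philip
<target>lamented</target> vvd lament
. sent .
</s>"""
parsed = parse_tagged_sentence(sample)
```

A check like this could be run on the output of a new tagger to confirm that the target word and the "s" attributes survived before the file is handed to abney_to_done.pl.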

Wednesday, June 07, 2006

Mario has left Austin to take on an internship in India for the summer. Sumeet (who comes from the same city where Mario is doing his internship (!), Bangalore (India)) is continuing work on the GFN setup where Mario left off: finishing all steps of the pipeline, so that we can start with sample annotations.

Monday, May 22, 2006

How-to's

Configure client:
client_home is the path to the directory containing the client files.

Open the file client_home/bin/RunClass.sh
Replace the top variables with the IP/hostname of the FN server and the path to the java binary file (at the prompt, use the command which java to determine this value.)

Start client:
client_home is the path to the directory containing the client files. At the prompt, execute:
# cd client_home/bin
# ./FNDesktop.sh
Note that the server must be running in order for the client to work. Also, an active FN account is needed in order to use FNDesktop.sh

Start server:
There are two copies of the server files under /home/framenet/current, one for the English and one for the German FrameNet server, running on TCP ports 1098 and 1099. Both copies of the server are identical, except that some tables in the English version contain sample records while the corresponding tables in the German version are empty.

To start either server, follow the instructions given in the file server_home/bin/README

Log into MySQL Database:
At the prompt:
# su
# su framenet
# mysql
[mysql]# use gnframenet; -- there is also an English database
[mysql]# show tables;

General Pipeline | Pending Parts

B.
In order to properly import a corpus into CQP, the corpus needs to have "p" XML tags surrounding every paragraph and "s" tags surrounding every sentence. However, the text in the German corpus contains only "p" tags. Therefore, we have to look for a program that will perform sentence-boundary detection for us in order to add the "s" tags. For the English FN, the program mxterminator is used as the sentence-boundary detector.
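
To make the goal concrete, here is a deliberately naive Python sketch that wraps the sentences of one paragraph in "s" tags. It is not mxterminator and not a replacement for a real sentence-boundary detector: it simply splits after ".", "!" or "?" when whitespace and an uppercase letter follow, so it will break on abbreviations, dates and the like. It also assumes the paragraph is plain (unescaped) text; the function name add_sentence_tags is made up for this example.

```python
import re
from xml.sax.saxutils import escape

# Naive boundary: sentence-final punctuation, whitespace, then an uppercase letter.
BOUNDARY_RE = re.compile(r'(?<=[.!?])\s+(?=[A-ZÄÖÜ])')


def add_sentence_tags(paragraph):
    """Wrap each (naively detected) sentence of a plain-text paragraph in <s> tags."""
    sentences = [s.strip() for s in BOUNDARY_RE.split(paragraph) if s.strip()]
    body = "\n".join("<s>" + escape(s) + "</s>" for s in sentences)
    return "<p>\n" + body + "\n</p>"


tagged = add_sentence_tags("Die Seife lag schon bereit. Niemand kam.")
```

Whatever detector is eventually chosen only has to produce the same kind of output: one "s" element per sentence inside each "p" element.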

A.
Eventually, every sentence imported into FN will need to have an ID assigned to it. However, these IDs are not part of the original corpus; rather, a (Perl) script was written for the English FN which takes an entire corpus and outputs the same corpus with an ID prepended to each sentence. A sample ID along with its sentence looks as follows: apwsE941117.0373=1=7=1 A sample sentence. After being transformed by intermediate scripts, the ID information is written to the final XML file given to FarinaImport.sh like this: docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1" We can either ask Collin to send us this script so that we can reuse it, or write a similar script ourselves.
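
If we end up writing a similar script ourselves, the ID handling might look roughly like the Python sketch below. The mapping of the three numbers after the document name to textNo, paraNo and sentNo is inferred only from the order in the two samples above, and the function names are made up for this example; the original Perl script may work differently.

```python
def make_sentence_id(doc_info, text_no, para_no, sent_no):
    """Build an ID such as 'apwsE941117.0373=1=7=1'."""
    return "=".join([doc_info, str(text_no), str(para_no), str(sent_no)])


def sentence_id_to_attrs(sentence_id):
    """Split an ID into the attributes that end up in the final XML file."""
    doc_info, text_no, para_no, sent_no = sentence_id.split("=")
    return {"docInfo": doc_info, "textNo": text_no,
            "paraNo": para_no, "sentNo": sent_no}


def format_attrs(attrs):
    """Render the attributes the way they appear in the opening 's' tag."""
    return " ".join('%s="%s"' % (key, value) for key, value in attrs.items())


attrs = sentence_id_to_attrs("apwsE941117.0373=1=7=1")
line = make_sentence_id("apwsE941117.0373", 1, 7, 1) + " A sample sentence."
```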

C.
A named-entity tagger is a program that takes a sentence as input and identifies (tags) the parts of the sentence that belong to an entity. Entities can be proper names, names of cities and places, names of companies, countries, etc. Here the task is to search the Internet for either an open-source or a commercial named-entity tagger for German. (In an initial phase, this part of the pipeline may be skipped, at the cost that annotators will have to identify entities manually during the annotation process.)

D.
Chunker
Since for the English FN, Abney's chunker produces an output as follows:
[nmess lemma=
h=[nx lemma=
h=[person
[nnp lemma=Prince Prince]
[person lemma=Philip Philip]]]]
[vvd lemma=lament lamented]
[comp lemma=that that]
[nil lemma=`` ``]
[nmess lemma
...

and since our IMS chunker for German produces an output as follows:
<NC>
Eine ART
weitere ADJA
Schwierigkeit NN
</NC>
<VC>
besteht VVFIN
</VC>
darin PAV
, $,
daß KOUS
<NC>
die ART
Kameras NN
</NC>
nur ADV
dann ADV
<NC>
verwertbares ADJA
Bildmaterial NN
</NC>
<VC>
liefern VVFIN
</VC>
, $,
wenn KOUS
<NC>
die ART
See NN
</NC>
einigermaßen ADV
ruhig ADJD
<VC>
ist VAFIN
</VC>
. $.

Here, the task is to take the output of the IMS chunker for German and convert it to the format produced by Abney's chunker for English. Another approach could be to find a chunker for German that already produces its output in the same format as Abney's.
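
A first step toward such a conversion could look like the Python sketch below. It reads the "word TAG" lines and chunk tags of the IMS chunker output into a nested structure and prints it with nested brackets. Note that the bracket layout produced here is only an approximation made up for illustration: it has none of the "lemma=" and "h=" fields seen in the Abney sample, since the IMS output shown above carries no lemmas, and the real target format would have to be checked against actual Abney output. The function names are also invented for this example.

```python
def parse_chunker_output(text):
    """Parse IMS-chunker-style lines into nested (label, children) nodes.

    Tokens are stored as (word, tag) tuples, chunks as (label, [children]).
    """
    root = ("ROOT", [])
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            if len(stack) > 1:          # tolerate stray closing tags
                stack.pop()
        elif line.startswith("<") and line.endswith(">"):
            node = (line[1:-1], [])     # opening chunk tag, e.g. <NC>
            stack[-1][1].append(node)
            stack.append(node)
        else:
            word, tag = line.split()
            stack[-1][1].append((word, tag))
    return root[1]


def to_brackets(nodes):
    """Render the nested structure as '[label ...]' brackets (illustrative only)."""
    out = []
    for first, second in nodes:
        if isinstance(second, list):    # chunk: (label, children)
            out.append("[%s %s]" % (first.lower(), to_brackets(second)))
        else:                           # token: (word, tag)
            out.append("[%s %s]" % (second.lower(), first))
    return " ".join(out)


sample = "<NC>\nEine ART\nweitere ADJA\nSchwierigkeit NN\n</NC>\n<VC>\nbesteht VVFIN\n</VC>\ndarin PAV\n"
brackets = to_brackets(parse_chunker_output(sample))
```

Using a stack rather than a flat list means that chunks nested inside other chunks are kept as nested brackets as well.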

As a last resort, this stage of the pipeline may be skipped initially, but the effort involved suggests that it is better not to skip it. If it were skipped, we would have to modify the existing Java classes of FrameNet, remove all references invoked in ProcessRules.sh that filter the sentences (given some rules), and keep only the functions that add the (now unfiltered) sentences to the FN database. Again, this seems too complex to be recommended.

E.
Assume we have successfully obtained an XML file containing subcorpora that is ready to be added to FN's database. For this task, we need to use the FarinaImport.sh script.

However, using a sample XML file, we could not import the sample subcorpora into the German FN: FarinaImport.sh produced a Java exception, the cause of which we could not find (see the previous blog entry for details).

Both Collin and Marc Ortega (from the SpanishFN) helped me debug this problem, but we had no success. This is not to say that there is no solution; because of time constraints, we simply did not find it.

Given the following (simplistic) sample XML subcorpus:
<?xml version="1.0" encoding="UTF-8"?>
<subcorpora frame="Adjusting" lexunit="calibrate.v" lemma="calibrate" pos="V">
<annoset-conf classify-type="fn2.farina.classify.FNClassifierPenn">
<annoset-model type="POS">
<layer-model containsPOS="y">PENN</layer-model>
</annoset-model>
<annoset-model type="standard">
<layer-model>Target</layer-model>
<layer-model>FE</layer-model>
<layer-model>GF</layer-model>
<layer-model>PT</layer-model>
<layer-model>Other</layer-model>
<layer-model>Sent</layer-model>
</annoset-model>
</annoset-conf>
<subcorpus scName="02-T-NP-PPto" maxSize="20">
<s tStart="0" tEnd="11" aPos="7382282" corpus="BNC2" docInfo="bncp" textNo="372" paraNo="162" sentNo="9">

<text>.</text>
<words>
<w pos="SENT" wf="." start="0" end="0">.</w>
</words>
<labels>
</labels>
</s>
</subcorpus>
</subcorpora>

This is what we tried:

We noted that adding an empty subcorpus worked successfully. That is, FarinaImport.sh was able to properly add a record to the SubCorpus table. This means that FarinaImport.sh successfully communicates with the German FN database.

As soon as we included a corpus with at least one "w" tag, the given Java exception was thrown. This is the mysterious part, the reason for which we could not figure out.

We quickly checked the records in the MiscLabel and LabelType tables corresponding to the Penn tagset (yes, our sample file uses the Penn tagset; adding the German STTS tagset to these tables is still pending). At first glance, these tables seemed to contain correct information. However, going over the records of these tables and confirming that they have correct values for the tags involved in the example would be a good starting point for debugging.

We manually added records to the tables Corpus and Document, as we were unsure whether FarinaImport.sh adds initial records to these tables when they are empty. It did not make a difference, as the Java exception was still thrown.

Marc sent me a new version of the Client and Server parts of the original English FN. I tried the client part and the Java error still appeared. I tried to configure the server part with the new release but I could not configure it properly. Elias might know how to do so.

I tried adding the sample file to both our English and our German versions of FN, and both threw the same Java exception.

From within FNDesktop, I tried assigning different statuses to the given LU and it made no difference.

Other parts
Most of the scripts in the remaining parts of the diagram have not been tested, and some of them may require changes. It is advisable to review the script FN2Import.sh, as it is the "mother" script that calls all the parts in the middle column of the diagram.

Thanks a lot to Collin and Marc for all of their invaluable help and cooperation.

Wednesday, May 17, 2006

CQP, Chunker & FarinaImport

CQP
I have been working on producing a CQP query result for the German corpora in a format similar to this sample for the English corpus (keyword=laments):
1517620: apwsE941117.0373=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office .
1528335: apwsE941117.0393=1=7=1 `` The most unpleasant thing is that we are attacked by those formerly high officials who insisted on us being a brainwashing center , '' {laments} Perfilov , sitting under the once-obligatory portrait of Lenin in his office

versus the current German output (keyword=des):
260: überzeugenden Darlegungen Chefs {des} Europäischen W
262: den Darlegungen {des} Chefs Europäischen Währungsins

However, there are a few things that I cannot output in the German CQP query result.

Sentence context: While the English sample outputs the entire sentence containing the matched keyword, the German counterpart can only output a fixed number of characters surrounding the keyword, since the corpus contains no information about sentence boundaries. Our original German corpus has XML tags only to delimit paragraphs, not sentences; consequently, when imported into CQP, the German corpus does not have an "s" (sentence) s-attribute defined. Thus, entering set context s; produces an error in CQP, and full sentences cannot be output.
A possible workaround would be to write a script that inserts "s" XML tags delimiting each sentence, so that CQP has the sentence "s" attribute. Do you know if CQP is able to accept a context whose boundary is a string (in this case we could use the period "." as the end-of-sentence boundary)? I have already tried, for instance, set context "."; and received an error.

Additional information: I noted that the part of the CQP output consisting of apwsE941117.0373=1=7=1 is eventually transformed into aPos="2784485" corpus="AP" docInfo="apwsE941123.0183" textNo="1" paraNo="7" sentNo="1". However, the CQP output produced from our German corpus does not include this information. What does this information represent, and is it necessary that we include it? If so, would taking a glance at the pipeline you use to import the English corpus into CQP help us? (This would also help us check that our pre-CQP pipeline is not missing anything.)
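
The difference between the two outputs can be made explicit with a small Python sketch that splits one line of the query result into its parts. The pattern is written only against the sample lines shown in this post (a corpus position, an optional ID of the form name=n=n=n, and the keyword in curly braces); it is an assumption that other CQP output follows the same layout, and the function name parse_kwic_line is made up.

```python
import re

KWIC_RE = re.compile(
    r'^\s*(\d+):\s+'                 # corpus position, e.g. "1517620:"
    r'(?:(\S+=\d+=\d+=\d+)\s+)?'     # optional sentence ID, e.g. "apwsE941117.0373=1=7=1"
    r'(.*?)\{([^}]*)\}(.*)$'         # left context, {keyword}, right context
)


def parse_kwic_line(line):
    """Split one KWIC line into position, sentence ID, contexts and keyword."""
    match = KWIC_RE.match(line)
    if match is None:
        raise ValueError("not a KWIC line: %r" % line)
    position, sentence_id, left, keyword, right = match.groups()
    return {
        "position": int(position),
        "sentence_id": sentence_id,   # None when the line carries no ID
        "left": left.strip(),
        "keyword": keyword,
        "right": right.strip(),
    }


english = parse_kwic_line("1517620: apwsE941117.0373=1=7=1 a brainwashing center , '' {laments} Perfilov .")
german = parse_kwic_line("260: überzeugenden Darlegungen Chefs {des} Europäischen W")
```

For the German line, sentence_id comes back as None, which is exactly the missing information described above.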

Chunker
Since for the English FN, Abney's chunker produces an output as follows:
[nmess lemma=
h=[nx lemma=
h=[person
[nnp lemma=Prince Prince]
[person lemma=Philip Philip]]]]
[vvd lemma=lament lamented]
[comp lemma=that that]
[nil lemma=`` ``]
[nmess lemma
...

and since our IMS chunker for German produces an output as follows:
<NC>
Eine ART
weitere ADJA
Schwierigkeit NN
</NC>
<VC>
besteht VVFIN
</VC>
darin PAV
, $,
daß KOUS
<NC>
die ART
Kameras NN
</NC>
nur ADV
dann ADV
<NC>
verwertbares ADJA
Bildmaterial NN
</NC>
<VC>
liefern VVFIN
</VC>
, $,
wenn KOUS
<NC>
die ART
See NN
</NC>
einigermaßen ADV
ruhig ADJD
<VC>
ist VAFIN
</VC>
. $.

Is it feasible to use Abney's chunker/parser to parse the German chunked data and produce a format with nested brackets similar to the English counterpart? Would it be better to modify the Java classes of FN in order to support the current format of the chunked German text? Or is another method more feasible?

FarinaImport.sh
Assuming that the pipeline is complete, I used the file Adjusting.calibrate.v.v.processed in order to test whether FarinaImport.sh works correctly with our German FN. However, after executing this script I obtain the following error:

~/framenet/client/german-client/bin> FarinaImport.sh ~/framenet/collin/Adjusting.calibrate.v.v.processed
[FNProperties] ./..
[FNProperties] loading from file ./../conf/fnclient.properties
[FNProperties] loading from file /u/guajardo/.fnclient.properties
log4j:WARN No appenders could be found for logger (fn2.farina.clients.FNProperties).
log4j:WARN Please initialize the log4j system properly.
[FNProperties] Using server [framenet...]
username:[my user]
password:[my pass]

Importing /u/guajardo/framenet/collin/Adjusting.calibrate.v.v.processed...
Processing on server...Exception in thread "main" fn2.farina.exception.ImportException: Import Exception: javax.ejb.TransactionRolledbackLocalException: Unexpected Error
java.lang.NoClassDefFoundError
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
...

Do you know what might be the reason for this exception? I went over the contents of the MySQL DB for the German FN and noted that the following relevant tables are empty:

Corpus

Document

SubCorpus

Genre

Paragraph

Sentence

Annotation Set

Thus, do you have sample initial values for these tables? I tried to infer what some of the initial records for these tables might be, but my attempt was pretty much trial and error. For instance, I added a new record to each of the Corpus, Document and Genre tables, and the aforementioned Java exception was still thrown. Also, do we need to set a particular status in FNDesktop for Adjusting.calibrate.v? I am thinking that a given status might need to be set in order for FarinaImport.sh to work on that LU.

Wednesday, May 03, 2006

Subcorporation Pipeline

The attached diagram shows Collin's response in graphical form. I also included the pipeline that we follow to pre-process our German corpus so that it can be imported into CQP; although this pipeline is not yet fully implemented, I know how to implement this portion of the diagram. For the rest of the diagram, however, I still have some questions about how the different pieces fit together. This is just a first draft, and as things progress I will incorporate more detailed information into this diagram.

Questions
The following questions correspond to specific parts of the diagram:

A:
Within FNDesktop, particularly within the Subcorpus Rule-Definition GUI, the following error appeared when trying to save a sample rule: “You must select a corpus before you can save!” I noticed that the pull-down menu for the Corpus field does not show any corpus value. How can we add a corpus to FNDesktop so as to be able to save the rules?

Also, in the file, conf/fnclient.properties, where does the following variable point to? rule_path=/n/jolt/da/aicorpus/fncorp/FErec Is it related to the aforementioned error?

Is it also related to the fact that within FNDesktop, when enabling "Main/Tree Mode/Corpus Mode," all the frames in the left column disappear and FNDesktop lists no frames at all? Why does this happen if the other Tree Modes (i.e., Corpus, Semantic Type, Inheritance and Using) list all of the frames?

B:

How do we call this script and what command-line argument shall we provide for it?

C:

Will the shell script in B call the CQP engine? Is there any special directory where the CQP engine must reside, and/or any other special configuration for CQP? Also note that we already know how to perform the steps in the block Pre-processing German Corpus.

D:

What format shall the CQP output have? For example, we are able to produce KWIC format from CQP:

260: überzeugenden Darlegungen Chefs des Europäischen W
262: den Darlegungen des Chefs Europäischen Währungsins
373: Partei ` Für Lettland ' ' deutschen Rechtsradikale
510: Partei ` Für Lettland ' ' deutschen Rechtsradikale
530: - Die Zahl der Todesopfer Erdbebens in der westtür
584: ürden in Zelten im Garten Krankenhauses behandelt
952: trierte sich nach Angaben bosnischen Rundfunks auf
968: und Sanski Most im Westen Landes . In allen übrige

E:

What does this output look like? We are able to chunk our sentences as follows:

<s>
<PC>
Im    APPRART
Innern    NN
<NC>
dieser    PDAT
Insel    NN
</NC>
</PC>
<NC>
der    ART
wenigen    PIS
Seligen    NN
</NC>
-    $(
<NC>
ihre    PPOSAT
Familien    NN
</NC>
<VC>
hätten    VAFIN
</VC>
<NC>
die    ART
Kongreßmitglieder    NN
</NC>
nicht    PTKNEG
<VC>
mitbringen    VVINF
dürfen    VMINF
</VC>
-    $(
<VC>
war    VAFIN
</VC>
<NC>
Platz    NN
</NC>
<PC>
für    APPR
800    CARD
Menschen    NN
.    $.
</s>

This part is “cloudy,” as it is not very clear how the pipeline flows from here up to the point where the subcorpora can be imported into the FrameNet DB.

G:

Does this refer to the FE Classifier mentioned in the Farina Book, section 6.8? How is this Java class invoked, and by whom? Where is the output of this classifier sent? To FarinaProcessRules.sh, FarinaImport.sh, or somewhere else?

H:

How is this script called (e.g., arguments and other required input) and what part of the pipeline does it go into?

Monday, May 01, 2006

Problems with subcorpora creation and import into FN Desktop

Met with Elias and Mario today to discuss further progress. Right now, we are having two major issues that we are trying to resolve:

(1) integrating the FN Desktop with the CQP and other parts of "the pipeline" so that we can create subcorpora for import;

(2) importing the subcorpora into the FN Desktop so that we can start annotating.

Both points sound like they should be straightforward, but it turns out that it is much harder than we thought initially. Mario will be getting in touch with Collin to sort out these issues.

Friday, April 21, 2006

Procedure to import subcorpora into FrameNet

Collin explained to Elias the overall information flow that is required in order to import subcorpora into FrameNet. Because we are doing a lexicographic project, we will need to create a subcorpus for each lexical unit (LU) associated with a given frame. In order to be able to import subcorpora into FrameNet, we need to create an XML file containing the given subcorpus, the format of which is defined by FrameNet. Collin sent us a sample XML file so that we can see the exact format that FrameNet expects.

As a next step, we plan to import our German corpus into the CQP engine. I know how to do that already, we just haven't had a chance to yet. The way I import our German corpus into CQP is as follows:

The original German corpus is transformed from SGML to XML format.
The paragraph portions of the XML corpus files are combined into a single plain-text file.
The plain-text file, containing German sentences, is tagged using Tree tagger.
The tagged output is imported into CQP using the CQP import and compile tools.
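
As a sketch of what the last two steps hand over to CQP: the CQP import tools read "verticalized" text, that is, one token per line with its annotations separated by tabs, and structural XML tags such as "p" on lines of their own. The Python function below produces such text from already tagged paragraphs. The function name to_vertical and the choice of a single part-of-speech column are assumptions for this example; the columns must match whatever attributes are declared when the corpus is encoded.

```python
def to_vertical(paragraphs):
    """Render tagged paragraphs as verticalized text for CQP.

    paragraphs: list of paragraphs, each a list of (word, pos) tuples.
    """
    lines = []
    for tokens in paragraphs:
        lines.append("<p>")
        for word, pos in tokens:
            lines.append(word + "\t" + pos)   # one token per line, tab-separated
        lines.append("</p>")
    return "\n".join(lines) + "\n"


vertical = to_vertical([[("die", "ART"), ("Seife", "NN")], [("Ende", "NN")]])
```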

There seem to be two different ways of extracting subcorpora using CQP as a query engine.

On one hand, according to the article "FrameNet in Action: The Case of Attaching" there seems to exist a GUI (called Subcorpus Query Definition page) within FrameNet Desktop that allows the user to define CQP queries in order to produce subcorpora; though we have not actually tried it out, it is my understanding that this GUI is able to translate its input parameters into an actual CQP query that will obtain the desired subcorpus.

On the other hand, Elias understood from Collin that there is a process called "farina-import" that forms a pipeline from the larger German corpus using CQP, a named-entity recognizer, the IMS tree tagger, and Steve Abney's chunk parser, to form the desired subcorpora. These subcorpora can then be imported into the server using a feature called import-xml. Apparently farina-import comprises the Berkeley technique for doing this import process; other systems (Spanish FrameNet, Japanese FrameNet) have used other techniques.

The one point to the farina-import system that Elias is not clear on is the creation of chunk rules. Collin said he'd share with us some of the source for the farina-import pipeline system and examples of chunk-rule creation.

So, at the point where we're at, we have five questions:

Is there a relationship between the farina-import process and the SQD page GUI process? We believe Hans is more familiar with the latter process.
How are CQP queries formed by either process? Naturally, we want the right queries to be generated to make our subcorpora.
Similarly, we need to know how chunk rules are created and applied, that is, how that step fits into the process. We may have a conference call on that matter.
What tool or CQP parameter is used to transform the subcorpora from the KWIC format that CQP outputs to the XML format that import-xml seems to require?
Finally, how does import-xml work in getting the fully specified subcorpora XML into the FN system?

Wednesday, April 19, 2006

Creating a German FrameNet

FrameNet is one of the most amazing lexical resources for English (http://framenet.icsi.berkeley.edu). In order to spread knowledge on how to set up FrameNets for other languages we're collecting information on how the set-up of German FrameNet at UT Austin is taking place (see http://gframenet.gmc.utexas.edu). This will hopefully help others with setting up FrameNets for other languages. In addition, we're blogging other FrameNet-related information.

The idea to build a German FrameNet grew out of my stay with the Berkeley FrameNet group from 1999-2001. Since then, I've thought about different ways of creating FrameNets for other languages (see, for example: Hans C. Boas. 2005. Semantic Frames as Interlingual Representations for Multilingual Lexical Databases. In: International Journal of Lexicography 18.4, 445-478).