German FrameNet

Sunday, January 21, 2007

Pipeline Version 2


This diagram shows a preliminary version of the new conversion pipeline. Unlike the previous version, it no longer uses Abney's parser, which was not available for German; instead, we use several tools from the Heart of Gold NLP suite developed at the German Research Center for Artificial Intelligence (DFKI).

The new tools used include:
  • Jtok - Tokenizer
  • Chunkie - Chunker
  • xsltproc - XSL Transformer
  • HOG engine - Hosts the Jtok and Chunkie processes and exposes them over RPC
As the diagram shows, the pipeline has changed significantly.

Saturday, January 13, 2007

How-tos for Version 2

Converting the Entire Corpus from SGML to XML
APWS Corpus
XML Location: HOME/current/corpus/xml

How to install the SGML conversion engine:
  1. Download OpenSP from http://sourceforge.net/projects/openjade to a local directory, say /tmp
  2. Decompress and install it with the following commands:
    • # tar zxvf OpenSP-1.5.2.tar.gz
    • # cd OpenSP-1.5.2
    • # ./configure --prefix /home/framenet/current/corpus/lib/opensp --disable-doc-build
    • # make
    • # make install
  3. Add /home/framenet/current/corpus/lib/opensp/bin (the bin directory under the --prefix used above) to PATH
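
The PATH step can be sketched as follows. This is a minimal example that assumes OpenSP was installed under the --prefix passed to ./configure in step 2; the variable name OPENSP_HOME is only illustrative. The final check looks for osx, the SGML-to-XML converter that ships with OpenSP; whether sgml2xml.pl calls osx directly is an assumption.

```shell
# Assumption: OpenSP lives under the --prefix given to ./configure in step 2.
OPENSP_HOME=/home/framenet/current/corpus/lib/opensp

# Prepend the bin directory only if it is not already on PATH.
case ":$PATH:" in
  *":$OPENSP_HOME/bin:"*) ;;
  *) PATH="$OPENSP_HOME/bin:$PATH" ;;
esac
export PATH

# Report whether osx (OpenSP's SGML-to-XML converter) can now be found.
command -v osx || echo "osx not found on PATH" >&2
```

To make the change permanent, the same lines can go into the shell start-up file of the account that runs the conversion.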
How to convert the SGML corpus into XML:
# cd /home/framenet/current/corpus/cfg
# perl sgml2xml.pl ../raw/apws_ger ../xml/apws_ger

How to trim the XML corpus:
# cd /home/framenet/current/corpus/cfg
# perl trim.pl ../xml/apws_ger ../xml/apws_ger_trimmed

How to convert the XML corpus into plain text:
# cd /home/framenet/current/corpus/cfg
# perl ./xml2txt.pl ../xml/apws_ger_trimmed ../txt/apws_ger ./extract-text.xsl ./remove-short-sent.pl
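
The three conversion steps above can also be chained so that a failure in one step stops the rest. The following is only a sketch: the script names and argument order are copied from the commands above, while the function names run and convert_corpus and the CFG_DIR and DRY_RUN variables are made up for illustration.

```shell
#!/bin/sh
# Sketch: run SGML->XML, trim, and XML->text for one corpus in sequence.
# CFG_DIR and DRY_RUN are illustrative assumptions, not part of the pipeline.
CFG_DIR=${CFG_DIR:-/home/framenet/current/corpus/cfg}

# Print a command, then execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# The body runs in a subshell so the cd does not affect the caller.
convert_corpus() (
    name=$1    # corpus name, e.g. apws_ger
    [ "${DRY_RUN:-0}" = 1 ] || cd "$CFG_DIR" || exit 1
    run perl sgml2xml.pl "../raw/$name" "../xml/$name" &&
    run perl trim.pl "../xml/$name" "../xml/${name}_trimmed" &&
    run perl ./xml2txt.pl "../xml/${name}_trimmed" "../txt/$name" \
        ./extract-text.xsl ./remove-short-sent.pl
)
```

Calling convert_corpus apws_ger runs all three steps; setting DRY_RUN=1 first only prints the commands, which is a cheap way to check the paths before a long run.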


Install TreeTagger
# cd /home/framenet/current/corpus/tagger
Download the following files from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html into this directory:
  • tagger-scripts.tar.gz
  • tree-tagger-linux-3.1.tar.gz
  • german-chunker-par-linux-3.1.bin.gz
  • german-par-linux-3.1.bin.gz
  • install-tagger.sh
# chmod +x install-tagger.sh
# ./install-tagger.sh

You may have to edit the file cmd/filter-chunker-output.perl so that it points to the correct path to perl (run "which perl" to find it).
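
If the perl path in question is the interpreter line at the top of the script (an assumption; check the file first), the edit can be sketched like this. The helper name fix_perl_shebang is hypothetical. It keeps a .bak copy and rewrites only the first line, leaving any flags after "perl" in place.

```shell
# Sketch: point a script's "#!" line at the perl found on this machine.
# Assumes line 1 of the script is an interpreter line that ends in "perl"
# (optionally followed by flags).
fix_perl_shebang() {
    script=$1
    perl_path=$(command -v perl) || { echo "perl not found" >&2; return 1; }
    # Keep a backup, then rewrite line 1 only; other lines pass through unchanged.
    cp "$script" "$script.bak" &&
    sed "1s|^#!.*perl|#!$perl_path|" "$script.bak" > "$script"
}
```

From the tagger directory, this would be called as fix_perl_shebang cmd/filter-chunker-output.perl.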