Converting Entire Corpus from SGML to XML
APWS Corpus
XML Location: HOME/current/corpus/xml
How to install SGML conversion engine:
- Download OpenSP from http://sourceforge.net/projects/openjade to a local directory, say /tmp
- Decompress and install it with the following commands:
- # tar zxvf OpenSP-1.5.2.tar.gz
- # cd OpenSP-1.5.2
- # ./configure --prefix /home/framenet/current/corpus/lib/opensp --disable-doc-build
- # make
- # make install
- Add /home/framenet/current/corpus/lib/opensp/bin (the bin directory under the --prefix given to ./configure) to PATH
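The PATH step can be sketched as follows. This is only an illustration: it assumes the install prefix is the one passed to ./configure above, and that the converter of interest is osx, the SGML-to-XML tool that OpenSP installs. Whether sgml2xml.pl actually calls osx is not stated in these notes.

```shell
# Assumed install prefix, copied from the ./configure line above.
OPENSP_PREFIX=/home/framenet/current/corpus/lib/opensp

# Prepend the bin directory only if it is not already on PATH.
case ":$PATH:" in
  *":$OPENSP_PREFIX/bin:"*) ;;
  *) PATH="$OPENSP_PREFIX/bin:$PATH" ;;
esac
export PATH

# Report whether the OpenSP converter (osx) can now be found.
command -v osx >/dev/null || echo "osx not found; check the install prefix" >&2
```

To make the change permanent, the same lines could go into the shell start-up file of the account that runs the conversion.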
How to convert SGML corpus into XML:
# cd /home/framenet/current/corpus/cfg
# perl sgml2xml.pl ../raw/apws_ger ../xml/apws_ger
How to trim XML corpus:
# cd /home/framenet/current/corpus/cfg
# perl trim.pl ../xml/apws_ger ../xml/apws_ger_trimmed
How to convert XML corpus into plain-text:
# cd /home/framenet/current/corpus/cfg
# perl ./xml2txt.pl ../xml/apws_ger_trimmed ../txt/apws_ger ./extract-text.xsl ./remove-short-sent.pl
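The three conversion steps above could be chained in a small wrapper so that a failure in one step stops the run. This is a sketch only: the function name convert_corpus and the CFG_DIR and CORPUS variables are hypothetical, while the script names and arguments are copied from the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper around the three steps listed above.
CFG_DIR=${CFG_DIR:-/home/framenet/current/corpus/cfg}
CORPUS=${CORPUS:-apws_ger}

# The body runs in a subshell, so the caller's working directory is unchanged.
# Each step runs only if the previous one succeeded.
convert_corpus() (
    cd "$CFG_DIR" &&
    perl sgml2xml.pl "../raw/$CORPUS" "../xml/$CORPUS" &&
    perl trim.pl "../xml/$CORPUS" "../xml/${CORPUS}_trimmed" &&
    perl ./xml2txt.pl "../xml/${CORPUS}_trimmed" "../txt/$CORPUS" \
        ./extract-text.xsl ./remove-short-sent.pl
)

# Run only where the cfg directory actually exists.
if [ -d "$CFG_DIR" ]; then
    convert_corpus
else
    echo "skipping: $CFG_DIR not found" >&2
fi
```

Setting CORPUS to another directory name under ../raw would run the same steps on a different corpus, assuming the scripts accept it.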
Install TreeTagger:
# cd /home/framenet/current/corpus/tagger
Download the following files from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html into this directory:
- tagger-scripts.tar.gz
- tree-tagger-linux-3.1.tar.gz
- german-chunker-par-linux-3.1.bin.gz
- german-par-linux-3.1.bin.gz
- install-tagger.sh
# chmod +x install-tagger.sh
# ./install-tagger.sh
You may have to modify the file cmd/filter-chunker-output.perl to include the correct path to perl (obtained by running the command "which perl").
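If the perl path in question is the one on the script's first (#!) line, which is an assumption since these notes do not say where it appears, the edit could be scripted roughly as below. The helper name fix_perl_shebang is hypothetical, and the script path is taken relative to the tagger directory.

```shell
# Hypothetical helper: rewrite the perl path on a script's #! line.
# A backup copy is kept next to the original as <script>.bak.
fix_perl_shebang() {
    script=$1
    perl_path=$2
    [ -f "$script" ] || { echo "not found: $script" >&2; return 1; }
    cp "$script" "$script.bak" &&
    # Only line 1 is touched; any options after "perl" are kept.
    sed "1s|^#!.*perl|#!$perl_path|" "$script.bak" > "$script"
}

# Same lookup as "which perl"; empty if perl is not installed.
PERL_PATH=$(command -v perl || true)
SCRIPT=cmd/filter-chunker-output.perl

# Apply the fix only when both perl and the script are present.
if [ -n "$PERL_PATH" ] && [ -f "$SCRIPT" ]; then
    fix_perl_shebang "$SCRIPT" "$PERL_PATH" && head -n 1 "$SCRIPT"
fi
```

If the first line does not start with #!, the sed expression leaves the file unchanged, so the script should then be checked by hand.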