Of late, we have made considerable progress in identifying off-the-shelf tools for Sentence Boundary Disambiguation (SBD) and Chunking. We have also found answers to, or workarounds for, some of the issues listed in our earlier posts.
For SBD, we could use either Satz or Uplug. Neither is usable directly, since we need to write a training script and a cross-validation script for our corpus.
For Chunking, we could use the German Chunker from the University of Stuttgart.
Both of these parts of our pipeline currently have open issues, which we list here:
Satz:
1) While creating the cross-validation and training scripts, we do not yet know how to consider "embedded sentences". For example:
<s>Zum Exil der Schrifstellerin Taslima Nasreen, die in ihrer Heimat Bangladesch vom Tode bedroht ist und am Mittwoch nach Schweden ausreiste, schreibt die Wirtschaftszeitung "Les Echos":
"<s>Ministerpräsidentin Khaleda Zia hat sich sicher für das geringere Übel entschieden, als sie die Ausreise von Taslima Nasreen erlaubte.</s>
....
<s>Die Fundamentalisten, die vor weniger als zwei Wochen rund 200.000 Demonstranten auf die Straße brachten, werden ihr jetzt keine Ruhe mehr lassen.</s>"
Here, we face multiple issues.
(i) Should we treat the whole paragraph above as one single sentence [since the ":" is not really a sentence terminator], even though there are multiple sentences within the double quotes?
(ii) Or should we treat the portion within the double quotes as one sentence [since without the opening and closing quotes, the sentence is not grammatically correct]? In that case the format should be
<s>" Ministerpräsidentin... lassen."</s>
For now, we have taken the following approach:
<s>Zum Exil der Schrifstellerin Taslima Nasreen, die in ihrer Heimat Bangladesch vom Tode bedroht ist und am Mittwoch nach Schweden ausreiste, schreibt die Wirtschaftszeitung "Les Echos":
"<s>Ministerpräsidentin Khaleda Zia hat sich sicher für das geringere Übel entschieden, als sie die Ausreise von Taslima Nasreen erlaubte.</s> ...
<s>Die Fundamentalisten, die vor weniger als zwei Wochen rund 200.000 Demonstranten auf die Straße brachten, werden ihr jetzt keine Ruhe mehr lassen.</s>"</s>
(iii) With regard to "paragraphs" such as
<p>
(folgt drei)mf/rom
</p>
<p>
AFP
</p>
we have decided not to treat them as parts of meaningful sentences.
Similarly, we do not treat columnar/tabular data as sentences.
(iv) The training and cross-validation scripts are ready. However, there is a problem with running Satz: it appears to need additional executables that are not shipped in the same package. Yet the source code, as I noticed today, does not indicate any use of external libraries.
(v) Besides, the only reason we need an SBD is to compute structural information in [B]. Jisup, from Berkeley, recently clarified that the <s> tags added by the SBD are removed during CQP processing, and that the one-sentence-per-line input does not contain them. The <s> tags are added again prior to the processing in send_to_schmid.pl, where they are given attributes such as aPos, docInfo, etc.
For example:
<s aPos="1351844" corpus="BNCP" docInfo="default_document" textNo="41" paraNo="178" sentNo="2">
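As an illustration of the attribute format shown above, here is a minimal Python sketch that renders such an opening <s> tag. It is not the logic of send_to_schmid.pl (which we have not reproduced here); the function name make_s_tag and its parameter names are hypothetical, and only the attribute names are taken from the example.

```python
from xml.sax.saxutils import quoteattr


def make_s_tag(a_pos, corpus, doc_info, text_no, para_no, sent_no):
    """Render an opening <s> tag with the sentence attributes.

    Hypothetical helper: the attribute names follow the example tag,
    everything else is an assumption made for illustration.
    """
    attrs = [
        ("aPos", a_pos),
        ("corpus", corpus),
        ("docInfo", doc_info),
        ("textNo", text_no),
        ("paraNo", para_no),
        ("sentNo", sent_no),
    ]
    # quoteattr escapes special characters and adds the surrounding quotes.
    rendered = " ".join(f"{name}={quoteattr(str(value))}" for name, value in attrs)
    return f"<s {rendered}>"


if __name__ == "__main__":
    print(make_s_tag(1351844, "BNCP", "default_document", 41, 178, 2))
```

Joining the attributes with a single space keeps the tag well-formed, which matters if a later step parses it with an XML parser.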
This brings us to the "sentence-count" script, or [B], which is the only part that has been completed successfully so far. Initially, we mistakenly thought that we needed to compute sentence attributes such as aPos, docInfo, etc. here. That was not possible, since aPos depends on the target word. Thanks to Jisup, we sorted this issue out. The script is designed so that a new corpus can be added in the future as a new directory in the parent directory of AFG. For example:
1094991: bncp=33=246=15 These distinctive units were finally {withdrawn} in 1984.
Here, 1094991 is the absolute position of the target word "withdrawn", and bncp=33=246=15 means that this sentence occurs in the BNCP corpus, Text# 33, Para# 246, Sentence# 15. The "bncp=33=246=15" part is what gets added in [B], while the 1094991 is prefixed by the CQP engine.
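To make the line format concrete, the following Python sketch splits such a line into its parts. It is only an illustration based on the single example above: the names CqpLine and parse_cqp_line are hypothetical, and the assumption that every line follows exactly this layout (position, colon, corpus=text=para=sentence, then the sentence with the target word in braces) has not been checked against the full output.

```python
import re
from typing import NamedTuple, Optional


class CqpLine(NamedTuple):
    a_pos: int             # absolute position of the target word (prefixed by CQP)
    corpus: str            # corpus name, e.g. "bncp"
    text_no: int
    para_no: int
    sent_no: int
    sentence: str          # sentence text, target word still in braces
    target: Optional[str]  # word found between { and }, if any


# Assumed layout: "<aPos>: <corpus>=<text>=<para>=<sent> <sentence>"
LINE_RE = re.compile(
    r"^(?P<a_pos>\d+):\s+"
    r"(?P<corpus>[^=\s]+)=(?P<text>\d+)=(?P<para>\d+)=(?P<sent>\d+)\s+"
    r"(?P<sentence>.*)$"
)
TARGET_RE = re.compile(r"\{([^{}]*)\}")


def parse_cqp_line(line):
    """Parse one line; raise ValueError if it does not match the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected line format: {line!r}")
    sentence = match.group("sentence")
    target_match = TARGET_RE.search(sentence)
    return CqpLine(
        a_pos=int(match.group("a_pos")),
        corpus=match.group("corpus"),
        text_no=int(match.group("text")),
        para_no=int(match.group("para")),
        sent_no=int(match.group("sent")),
        sentence=sentence,
        target=target_match.group(1) if target_match else None,
    )


if __name__ == "__main__":
    example = "1094991: bncp=33=246=15 These distinctive units were finally {withdrawn} in 1984."
    print(parse_cqp_line(example))
```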
Moving on to the next module that we have worked on, the German Chunker:
(i) We do not have a comprehensive list of chunker tags that would cover all the cases of noun/adj/adv/verb/preposition chunks.
(ii) We are able to obtain recursive chunks in the chunk-file, but they are not being mapped to the parts of speech.
That is, the contents of the chunk-file look like this:
<chunk pos="NP.Nom">
<chunk pos="NP.Nom">
Platz
</chunk>
für
<chunk pos="NP.Akk">
800
Menschen
</chunk>
</chunk>
whereas the output file looks like this:
<s> [ PPARTADJ.Dat Im Innern ] [ NP.Gen dieser Insel der wenigen Seligen ] - [ NC.Dir ihre Familien ] [ VVFIN hätten ] [ NP.Akk die Kongreßmitglieder ] nicht [ VVINF mitbringen ] [ VMINF dürfen ] - [ VSFIN war ] [ NP.Nom Platz ] für [ NP.Akk 800 Menschen ] . </s>
whereas we want the output file to look like the following [my lack of German knowledge prevents me from specifying the correct output in German; for English, however, it would be as below]:
<s aPos="1351633" corpus="ELNC" docInfo="default_document" textNo="41"
paraNo="177" sentNo="1">
[dt the The]
[rx
[next next next]]
[vx
h=[bez be is]]
[to to to]
[vv calibrate <target>calibrate</target>]
[nmess
h=[nx
[dt the the]
h=[nn gain gain]]]
[cma , ,]
[vvg use using]
[nmess
h=[nx
[dt-a a a]
h=[nn pair pair]]
[pp-of
f=[of of of]
h=[nx
[name
[nnp <unknown> Helmholtz]]
h=[nns coil coils]]]]
[sent . .]
</s>
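One possible way to keep the recursion from the chunk-file is to walk the nested <chunk> elements and emit nested brackets. The Python sketch below does this for the chunk-file excerpt shown earlier. It is an assumption-laden illustration, not the chunker's own method: the function name chunk_to_brackets is hypothetical, it assumes each chunk can be parsed as well-formed XML, and it only carries chunk labels and tokens, since the excerpt contains no part-of-speech tags or lemmas to attach.

```python
import xml.etree.ElementTree as ET


def chunk_to_brackets(elem):
    """Render a <chunk> element and its nested chunks as nested brackets.

    Tokens are taken from the text directly inside the element and from
    the text that follows each nested chunk (its "tail").
    """
    parts = [elem.get("pos", "?")]  # "?" if a chunk has no pos attribute
    if elem.text:
        parts.extend(elem.text.split())
    for child in elem:
        parts.append(chunk_to_brackets(child))
        if child.tail:
            parts.extend(child.tail.split())
    return "[ " + " ".join(parts) + " ]"


if __name__ == "__main__":
    # Excerpt of the chunk-file, one token per line.
    sample = """<chunk pos="NP.Nom">
<chunk pos="NP.Nom">
Platz
</chunk>
für
<chunk pos="NP.Akk">
800
Menschen
</chunk>
</chunk>"""
    print(chunk_to_brackets(ET.fromstring(sample)))
```

For the excerpt this produces a single outer NP.Nom bracket containing the two inner chunks and the token "für" between them, which is the nesting that is lost in the current output file. Adding tags and lemmas per token would still require the tagger output, which this sketch does not touch.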
Discussions with Sabine have reached a dead end, since I had already tried the solutions she proposed in response to my questions.