German FrameNet

Friday, April 21, 2006

Procedure to import subcorpora into FrameNet

Collin explained to Elias the overall information flow required to import subcorpora into FrameNet. Because we are doing a lexicographic project, we will need to create a subcorpus for each lexical unit (LU) associated with a given frame. To import a subcorpus into FrameNet, we need to create an XML file containing that subcorpus, in a format defined by FrameNet. Collin sent us a sample XML file so that we can see the exact format FrameNet expects.

As a next step, we plan to import our German corpus into the CQP engine. I already know how to do that; we just haven't had a chance to yet. The way I import our German corpus into CQP is as follows:
  1. The original German corpus is transformed from SGML to XML format.
  2. The paragraph portions of the XML corpus files are combined into a single plain-text file.
  3. The plain-text file, containing German sentences, is tagged using TreeTagger.
  4. The tagged output is imported into CQP using the CQP import and compile tools.
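
As a rough sketch of the file handling in steps 2 and 4 above: the first function pulls the paragraph text out of an XML corpus file, and the second wraps tagger output in sentence regions so that it is in the one-token-per-line form that the CQP import tools read. Several things here are assumptions rather than facts about our corpus: that paragraphs are marked with a `p` element, that the tagger output has tab-separated token, tag, and lemma columns, and that sentence-final punctuation carries the STTS tag `$.`. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def extract_paragraphs(xml_text, tag="p"):
    """Return the text of every <tag> element, one string per paragraph.

    The element name "p" is an assumption about the corpus markup.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter(tag):
        # Join nested text and collapse runs of whitespace to single spaces.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


def to_vertical(tagged_lines, sentence_end_tag="$."):
    """Wrap tagger output (token<TAB>tag<TAB>lemma) in <s> ... </s> regions.

    A sentence is closed after any token whose tag equals sentence_end_tag.
    """
    out = []
    open_sentence = False
    for line in tagged_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if not open_sentence:
            out.append("<s>")
            open_sentence = True
        out.append(line)
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] == sentence_end_tag:
            out.append("</s>")
            open_sentence = False
    if open_sentence:
        # Close a trailing sentence that lacks final punctuation.
        out.append("</s>")
    return out


sample_xml = "<doc><p>Er befestigt das Bild an der Wand.</p><p>Sie  klebt\n den Zettel an die Tür.</p></doc>"
paragraphs = extract_paragraphs(sample_xml)

sample_tagged = ["Er\tPPER\ter", "schläft\tVVFIN\tschlafen", ".\t$.\t."]
vertical = to_vertical(sample_tagged)
```

Writing `paragraphs` out one per line gives the plain-text file for the tagger, and writing `vertical` out one entry per line gives a file that can be handed to the import step.
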
There seem to be two different ways of extracting subcorpora using CQP as a query engine.

On the one hand, according to the article "FrameNet in Action: The Case of Attaching", there seems to be a GUI (called the Subcorpus Query Definition page) within FrameNet Desktop that allows the user to define CQP queries in order to produce subcorpora. Though we have not actually tried it out, it is my understanding that this GUI translates its input parameters into an actual CQP query that retrieves the desired subcorpus.
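
To make "translating input parameters into a CQP query" concrete, here is a minimal sketch of how a lemma and a part-of-speech pattern could be turned into a CQP token expression. This is only an illustration of CQP's query syntax, not what the Subcorpus Query Definition page actually generates, which we have not seen. The attribute names `lemma` and `pos` are assumptions that depend on how the corpus was encoded, the function names are hypothetical, and the values are assumed to contain no double quotes.

```python
def cqp_token(**constraints):
    """Build one CQP token expression, e.g. [lemma="x" & pos="VV.*"].

    Each keyword is a positional attribute name and each value is a regular
    expression for it. Values are assumed not to contain double quotes.
    """
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def lemma_query(lemma, pos_pattern=None):
    """Return a complete CQP query matching one lemma, optionally limited by POS."""
    if pos_pattern is None:
        return cqp_token(lemma=lemma) + ";"
    return cqp_token(lemma=lemma, pos=pos_pattern) + ";"


# "befestigen" (to attach) restricted to full-verb tags.
query = lemma_query("befestigen", "VV.*")
print(query)
```
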

On the other hand, Elias understood from Collin that there is a process called "farina-import" that forms a pipeline from the larger German corpus, using CQP, a named-entity recognizer, the IMS TreeTagger, and Steve Abney's chunk parser, to produce the desired subcorpora. These subcorpora can then be imported into the server using a feature called import-xml. Apparently farina-import is the Berkeley technique for doing this import process; other projects (Spanish FrameNet, Japanese FrameNet) have used other techniques.

The one part of the farina-import system that Elias is not clear on is the creation of chunk rules. Collin said he would share with us some of the source for the farina-import pipeline and examples of chunk-rule creation.

So, at this point, we have five questions:
  1. Is there a relationship between the farina-import process and the SQD page GUI process? We believe Hans is more familiar with the latter.
  2. How are CQP queries formed by either process? Naturally, we want the right queries to be generated for our subcorpora.
  3. Similarly, how are chunk rules created and applied, and how does that step fit into the process? We may have a conference call on that matter.
  4. What tool or CQP parameter is used to transform the subcorpora from the KWIC format that CQP outputs to the XML format that import-xml seems to require?
  5. Finally, how does import-xml work to get the fully specified subcorpus XML into the FN system?
