Question

I have a corpus (held in a JSerial Datastore) of thousands of annotated documents. I need to divide it into three smaller corpora by picking documents at random. What is the easiest way to do this in GATE?

A piece of running code or a detailed guide would be most welcome!

Solution

I would use the Groovy console for this (load the "Groovy" plugin, then start the console from the Tools menu).

The following code assumes that

  • you have opened the datastore in GATE Developer
  • you have loaded the source corpus, and its name is "fullCorpus"
  • you have created three (or however many you need) other empty corpora and saved them (empty) to the same datastore; these will receive the partitions
  • you have no other corpora open in GATE Developer apart from these four
  • you have no documents open

Then you can run the following in the Groovy console:

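def rnd = new Random()

// The source corpus, and every other open corpus as a target partition
def fullCorpus = corpora.find { it.name == 'fullCorpus' }
def parts = corpora.findAll { it.name != 'fullCorpus' }

fullCorpus.each { doc ->
  // Pick one of the target corpora uniformly at random for this document
  def targetCorpus = parts[rnd.nextInt(parts.size())]
  targetCorpus.add(doc)
  // Unload the document straight away to keep memory use down
  targetCorpus.unloadDocument(doc)
}

// Return null so the console does not echo the result of the loop
return null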

This works by iterating over the documents and picking a target corpus at random for each one. Because every pick is independent, the target sub-corpora should end up roughly (but not necessarily exactly) the same size.

The script does not save the final sub-corpora, so if it messes up you can just close them and re-open them (empty) from the original datastore, then fix and re-run the script. Once you're happy with the final result, right-click on each sub-corpus in turn in the left-hand tree and choose "save to its datastore" to write it all to disk.
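
If the partitions need to be as close to equal in size as possible, one alternative to independent random picks is to shuffle once and then deal the items out round-robin. The following is only a minimal sketch of that idea in plain Groovy: it uses a hypothetical list of document names as a stand-in for the corpus and does not touch GATE at all, and the names docNames, partCount and the seed value are assumptions made for illustration.

```groovy
// Hypothetical stand-ins for the documents in the source corpus
def docNames = (1..10).collect { 'doc' + it }

// Number of partitions wanted (an assumption; the question asks for 3)
def partCount = 3

// A fixed seed makes the split repeatable; drop it for a different split each run
def rnd = new Random(42)

// Shuffle a copy so the original order is left untouched
def shuffled = new ArrayList(docNames)
Collections.shuffle(shuffled, rnd)

// Deal the shuffled items out round-robin, so sizes differ by at most one
def parts = (0..<partCount).collect { [] }
shuffled.eachWithIndex { name, i ->
  parts[i % partCount] << name
}

parts.eachWithIndex { part, i ->
  println "part ${i}: ${part.size()} items"
}
```

In the GATE console the same dealing step could presumably be applied to a shuffled list of positions in fullCorpus rather than to names, but that adaptation is untested here.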
