Question

I just want to follow up on this question.

So, I downloaded the Wikipedia dump of February 2014 and ran WikiExtractor.py on it as suggested:

cat mywiki-pages-articles.xml | python WikiExtractor.py -b 500K -o extracted

However, after it had been running for more than an hour, I got nothing but an empty file named wiki_00.

Do you have any suggestions for this problem?


Solution

OK, so I found the solution to this problem.

The last time I ran the command above, I put "screen" in front of it. In that case, screen just runs cat on the XML file in its own window without piping the output to WikiExtractor.py. The result is therefore an empty file.

I fixed this by putting the above command in a file, making the file executable, and running screen on that file.
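The fix described above could look roughly like the following. This is only a sketch: the script name run_extract.sh is made up for illustration, and it assumes the dump and WikiExtractor.py sit in the current directory, as in the original command.

```shell
#!/bin/sh
# Write the full pipeline into a wrapper script.
# "run_extract.sh" is a hypothetical name chosen for this example.
cat > run_extract.sh <<'EOF'
#!/bin/sh
# The pipe is now created inside the script, so it is set up
# within the screen session instead of by the outer shell.
cat mywiki-pages-articles.xml | python WikiExtractor.py -b 500K -o extracted
EOF

# Make the wrapper script executable.
chmod +x run_extract.sh

# Launch it inside screen (left commented out so this sketch
# does not start a long-running job by itself):
# screen ./run_extract.sh
```

An alternative that should have the same effect without a separate file is to hand the whole pipeline to a shell as a single quoted argument, for example screen sh -c 'cat mywiki-pages-articles.xml | python WikiExtractor.py -b 500K -o extracted'.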

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow