Wikipedia Extractor produces empty file

https://stackoverflow.com/questions/23452118

wikipedia

15-07-2023

Question

I want to follow up on this question.

I downloaded the Wikipedia dump of February 2014 and ran WikiExtractor.py on it as suggested:

cat mywiki-pages-articles.xml | python WikiExtractor.py -b 500K -o extracted

However, after more than an hour of running, I got nothing but an empty file named wiki_00.

Do you have any suggestions for this problem?

Solution

OK, so I found the solution to this problem.

The last time I ran the command above, I put "screen" in front of it. The shell splits the line at the pipe before screen starts, so screen only runs cat on the XML file inside its session and the dump is never piped to WikiExtractor.py. The result is therefore an empty file.

I fixed this by putting the above command in a file, making the file executable, and running screen on that file.
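The fix described above can be sketched roughly as follows. This is only an illustration, not the exact file the answer used: the script name run_extractor.sh is a hypothetical choice, and the pipeline inside it is copied from the question. The point is that the pipe now lives inside the script, so it is evaluated within the screen session instead of by the outer shell.

```shell
#!/bin/sh
# Write the full pipeline into a wrapper script.
# NOTE: run_extractor.sh is a hypothetical name chosen for this sketch.
cat > run_extractor.sh <<'EOF'
#!/bin/sh
# The whole pipeline is inside the script, so the pipe is set up
# by the shell that screen starts, not by the calling shell.
cat mywiki-pages-articles.xml | python WikiExtractor.py -b 500K -o extracted
EOF

# Make the wrapper executable so screen can run it directly.
chmod +x run_extractor.sh

# Start it inside a screen session. Left commented out here because it
# needs screen, the dump file, and WikiExtractor.py to be present.
# screen ./run_extractor.sh
```

An alternative that should behave the same way, without a separate file, is to quote the pipeline and hand it to a shell that screen starts: screen sh -c 'cat mywiki-pages-articles.xml | python WikiExtractor.py -b 500K -o extracted'. The quotes keep the outer shell from interpreting the pipe.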

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow