Question

I'm trying to set up a translation tool for websites. What I want to do is import HTML code and extract all the translatable text from it.

One idea would be to use strip_tags, but that would ignore strings that could be translated, such as alt texts, title texts and probably others that I haven't thought of yet. Is there a clean way to do this?

Solution

In this case you need to parse the HTML and extract the text yourself. As you probably already know, parsing HTML with regular expressions is A Bad Idea (tm). So the only right solution is to parse the DOM of the document. For this step you are free to use any tool, including the standard DOMDocument class.
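
As a rough illustration of the DOM approach, here is a minimal sketch using DOMDocument and DOMXPath. The function name extractTranslatableStrings and the list of attributes (alt, title, placeholder) are my own assumptions, not a complete inventory of translatable attributes; extend the XPath expression to suit your pages.

```php
<?php
// Hypothetical helper (name and attribute list are assumptions for illustration):
// collects text nodes plus selected attribute values from an HTML string.
function extractTranslatableStrings(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Common workaround: the prefix makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $strings = [];

    // Text nodes, skipping the contents of <script> and <style>.
    $textNodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');
    foreach ($textNodes as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $strings[] = $text;
        }
    }

    // Attributes that usually hold human-readable text (assumed, incomplete list).
    $attributes = $xpath->query('//@alt | //@title | //@placeholder');
    foreach ($attributes as $attribute) {
        $text = trim($attribute->nodeValue);
        if ($text !== '') {
            $strings[] = $text;
        }
    }

    return $strings;
}
```

Calling it as print_r(extractTranslatableStrings($html)) gives a flat array of strings. Unlike strip_tags, this keeps attribute values, but a real tool would probably also need to remember which node each string came from so the translation can be written back.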

If you are looking for libraries or scripts to help, I would suggest looking at html2text, which can be used commercially. As far as I can see, it doesn't support attributes of <img> tags, but that is very easy to fix (use its handling of the <a> tag as an example).

If you are looking for automated text extraction, then you should definitely look at something like Boilerpipe.

Other tips

I would personally use the DomCrawler component from Symfony2, which is a nice wrapper around PHP's DOM functions, and start from there.

Licensed under: CC-BY-SA with attribution