I'm trying to build a tool for translating websites. I want to import the HTML code and extract all translatable text from the page.

One idea would be to use strip_tags, but that would discard translatable strings stored in attributes, such as alt text, title text, and probably others I haven't thought of yet. Is there a clean way to do this?


Solution

In this case you need to parse the HTML and extract the text yourself. As you probably already know, parsing HTML with regular expressions is A Bad Idea (tm). So the only right solution is to parse the document into a DOM and walk it. For this step you are free to use any tool, including the standard DOMDocument class.
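The DOM-based approach might be sketched roughly as follows. This is only a minimal illustration, not a complete solution: the function name `extractTranslatableStrings` and the list of attributes (`alt`, `title`, `placeholder`) are assumptions made for the example, and the XML prolog prefix is a commonly used workaround to make `loadHTML` treat the input as UTF-8.

```php
<?php
// Sketch: collect translatable strings from HTML using DOMDocument and DOMXPath.
// The function name and the attribute list are illustrative assumptions.

function extractTranslatableStrings(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    // Prefixing an XML prolog is a common workaround to hint UTF-8 to the parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $strings = [];

    // Text nodes, skipping the contents of <script> and <style> elements.
    $textQuery = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($textQuery) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $strings[] = $text;
        }
    }

    // Attributes that usually hold human-readable text (assumed list, extend as needed).
    foreach ($xpath->query('//@alt | //@title | //@placeholder') as $attr) {
        $value = trim($attr->nodeValue);
        if ($value !== '') {
            $strings[] = $value;
        }
    }

    return $strings;
}

$html = '<p title="Greeting">Hello <img src="a.png" alt="A cat"> world</p>';
print_r(extractTranslatableStrings($html));
```

Unlike strip_tags, this keeps the `alt` and `title` values while ignoring non-text attributes such as `src`. A real tool would also need to remember where each string came from so the translation can be written back into the document.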

If you are looking for libraries or scripts to help, I would suggest looking at html2text, which can be used commercially. As far as I can see, it doesn't handle the attributes of <img> tags, but that is very easy to fix (use its handling of the <a> tag as an example).

If you are looking for automated text extraction, then you should definitely look at something like Boilerpipe.

Other tips

I would personally use the DomCrawler component from Symfony2, which is a nice wrapper around PHP's DOM functions, and start from there.

Licensed under: CC-BY-SA with attribution