Question

I'm working on an AppleScript and I'm stuck. Take this snippet of HTML as an example:

<body><div>Apple don't behave accordingly <a href = "http://apple.com">apple</a></div></body>

What I need is the text without the HTML tags, either by deleting each bracket along with everything inside it, or by some other way of converting HTML to plain text.

The result should be:

Apple don't behave accordingly apple


Solution

How about using textutil?

on run -- example (don't forget to escape quotes)
    removeMarkup from "<body><div>Apple don't behave accordingly <a href = \"http://apple.com\">apple</a></div></body>"
end run

to removeMarkup from someText -- strip HTML using textutil
    set someText to quoted form of ("<!DOCTYPE HTML PUBLIC>" & someText) -- fake an HTML document header so textutil treats the input as HTML
    return (do shell script "echo " & someText & " | /usr/bin/textutil -stdin -convert txt -stdout") -- strip HTML
end removeMarkup
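
The question's other idea, deleting each bracket with everything in it, can be sketched in the same shell that do shell script runs. This is only a rough alternative under stated assumptions, not a replacement for textutil: strip_tags is a hypothetical helper name, the sed expression does not decode entities such as &amp;, and it will misfire on a ">" inside an attribute value or on script and comment blocks.

```shell
#!/bin/sh
# strip_tags is a hypothetical helper name used for illustration only.
# It deletes every "<...>" run and leaves the text between tags untouched.
strip_tags() {
    printf '%s\n' "$1" | sed -e 's/<[^>]*>//g'
}

# The sample from the question (single quote escaped for the shell).
strip_tags '<body><div>Apple don'\''t behave accordingly <a href = "http://apple.com">apple</a></div></body>'
# prints: Apple don't behave accordingly apple
```

Since the sed expression contains no double quotes or backslashes, it could be dropped into the do shell script pipeline in place of the textutil call without extra AppleScript escaping.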

Other tips

I'm adding an extra answer because of a problem I ran into. If you don't want UTF-8 characters to get lost, you need:

set plain_text to do shell script "echo " & quoted form of ("<!DOCTYPE HTML PUBLIC><meta charset=\"UTF-8\">" & html_string) & space & "| textutil  -convert txt  -stdin -stdout"

You basically need to add the <meta charset="UTF-8"> tag (with the quotes escaped inside the AppleScript string) so that textutil treats the input as a UTF-8 document.

on findStrings(these_strings, search_string)
    set the foundList to {}
    repeat with this_string in these_strings
        considering case
            if the search_string contains this_string then set the end of the foundList to this_string
        end considering
    end repeat
    return the foundList
end findStrings

findStrings({"List","Of","Strings","To","find..."}, "...in String to search")
Licensed under: CC-BY-SA with attribution