How to strip all tags from Wikipedia pages or make a page more readable
Question
I want to strip all tags and remove the [show]/[hide] widgets from Wikipedia pages. Alternatively, is there a website that presents pages in a more readable format?
I am aware of Wikipedia's printable version, but I don't want any tags at all, because I have another use for the text. So please answer only the original question: is there a website, web service, or code snippet in PHP/C# that removes the tags from a web page?
Also, when I copy a list from Firefox, it replaces <li> with *. Is it possible to set something in Firefox so that it returns some other non-readable character instead, like some kind of
Solution
You could use an HTML parser, such as BeautifulSoup (Python) or Simple HTML DOM (PHP). Or you could try an XML parser, though that only works if the page is well-formed XML.
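To illustrate the parser approach, here is a minimal sketch that uses Python's standard-library html.parser module as a stand-in for BeautifulSoup, so it runs without installing anything. The names TextExtractor and strip_tags are made up for this example, and the list of block-level tags is an assumption chosen for illustration, not a complete one.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, dropping markup and the contents of <script>/<style>."""

    SKIPPED = {"script", "style"}
    # Assumed, incomplete set of tags that should separate words in the output.
    BLOCK = {"p", "div", "li", "ul", "ol", "br", "tr", "td", "th", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_tags("<ul><li>One</li><li>Two &amp; three</li></ul>"))
```

Because a parser tracks which element it is inside, it can drop script and style contents and decode entities, which a plain search-and-replace cannot do reliably. Removing Wikipedia's [show]/[hide] widgets would need extra rules for the specific elements that carry them, which this sketch does not attempt.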
OTHER TIPS
You can start by taking a look at PHP's strip_tags function.
You should take a look at DBpedia: it is Wikipedia, but just the data.
What about Html Agility Pack, an HTML parser for .NET?
A similar thread is available on Stack Overflow.
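Try this function (VB.NET). The snippet showed only the body, so the name StripTags is a placeholder. The pattern lazily matches from each "<" to the next ">", including across newlines.
Function StripTags(ByVal strHtmlString As String) As String
    Dim pattern As String = "<(.|\n)*?>"
    Return System.Text.RegularExpressions.Regex.Replace(strHtmlString, pattern, String.Empty).Trim()
End Function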
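For comparison, the same regex idea can be sketched in Python. This is an illustration only, not a drop-in port: strip_tags_regex and TAG_PATTERN are names invented here, the pattern <[^>]*> is a swapped-in equivalent that should match the same spans as the lazy pattern above for ordinary tags, and the html.unescape step is an addition the VB.NET snippet does not have.

```python
import html
import re

# Matches from "<" up to the next ">", newlines included.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags_regex(markup):
    """Remove anything that looks like a tag, then decode entities and trim."""
    without_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(without_tags).strip()


if __name__ == "__main__":
    print(strip_tags_regex("  <ul>\n<li>A &amp; B</li>\n</ul> "))
    # A ">" inside an attribute value ends the match early and leaves debris:
    print(strip_tags_regex('<a title="a > b">x</a>'))
```

The second call shows the main weakness of the regex approach: it has no notion of attributes, comments, or script blocks, so unusual markup leaves fragments behind. For pages as complex as Wikipedia's, a real HTML parser is usually the safer choice.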