Text Extraction from HTML Data

Question 1

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

<span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>

from lxml import html

print(html.parse(url).xpath('//p/text()'))

OUTPUT

['Here is the First Paragraph.', 'Here is the second Paragraph.',
'Paragraph Three."']
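If lxml is not available, much the same result can be approximated with the standard library alone. The following is a minimal sketch, not the approach used above: the class name `ParagraphText` and the helper `extract_paragraphs` are invented for this example. Unlike `//p/text()`, it also picks up text inside tags nested within a paragraph.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []     # one string per <p> element
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> elements is ignored.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(markup):
    """Return a list with the text of every <p> element in markup."""
    parser = ParagraphText()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs


sample = (
    '<span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
    '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p>'
    '<span id="midArticle_3"></span><p>Paragraph Three."</p>'
)
print(extract_paragraphs(sample))
```

For the sample markup shown earlier this prints the same list as the lxml one-liner.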

Question 2

One way is to use the BeautifulSoup module to extract all the text from the <p> tags.

Content of script.py:

from bs4 import BeautifulSoup
import sys 

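# Name the parser explicitly and close the input file when done
with open(sys.argv[1], 'r') as infile:
    soup = BeautifulSoup(infile, 'html.parser')

# get_text() also handles <p> tags with nested markup, where .string is None
print(' '.join(p.get_text() for p in soup.find_all('p')))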

Run it like:

python3 script.py infile

That yields:

I have an app which send mail to my defined mail address "myemail@own.com". For this i create my own Custom Email View Which contains check boxes message body and other options. Now i want that when send button is pressed my app should not go to gmail view or other email client view it directly submit the data String recepientEmail = "myemail@own.comm";  // either set to destination email or leave empty but on submit it opens gmail or chooser email client view but i dont want to show gmail view

Question 3

I recently started playing around with Beautiful Soup and found a line of code that was extremely helpful. Here is my entire example.

import requests
from bs4 import BeautifulSoup

r = requests.get("your url")

html_text = r.text

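# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html_text, 'html.parser')

# Join every text node in the document
clean_html = ''.join(soup.find_all(string=True))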

print(clean_html)
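Joining every text node also pulls in the contents of `<script>` and `<style>` elements. One possible refinement, sketched here on a made-up inline document rather than a downloaded page, is to remove those elements first and then let `get_text()` handle the joining:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used in place of a real download.
html_text = """
<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><p>Here is the First Paragraph.</p>
<p>Here is the second Paragraph.</p></body></html>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Remove elements whose contents are code rather than readable text.
for tag in soup(["script", "style"]):
    tag.decompose()

# Strip each text fragment and join the non-empty ones with a single space.
visible_text = soup.get_text(separator=" ", strip=True)
print(visible_text)
```

Whether the page title or other head content should be kept is a judgment call; this sketch leaves it in.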

Hopefully this works for you and answers your question.