Question

I have an HTML file containing many relative href links like this:

<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>

There are also many other http and ftp links in the file.
I need an output text file like this:

14/02/08: station1_140208.txt  
14/02/09: station1_140209.txt  
14/02/10: station1_140210.txt  
14/02/11: station1_140211.txt  
14/02/12: station1_140212.txt  

I tried to write my own, but getting used to Python regex is taking me too long.
I can open the source file, apply a regex (which I haven't been able to figure out yet), and write the result back to disk.

I need your help on the regex side.


Solution 2

pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'

test:

import re

s = """
<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>
<br/>
<a href="data/self/dated/station1_1402010.txt">Saturday, February 10, 2014</a>
<br/>
<a href="data/self/dated/station1_1402012.txt">Saturday, February 12, 2014</a>
<br/>
"""
# Raw string so that \s and \S reach the regex engine unchanged
pattern = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'
# Group 1 is the file name, group 2 is the link text
print(re.findall(pattern, s))

output:

[('station1_140208.txt', 'Saturday, February 08, 2014'), ('station1_1402010.txt', 'Saturday, February 10, 2014'), ('station1_1402012.txt', 'Saturday, February 12, 2014')]
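
To get from these tuples to the `yy/mm/dd: filename` lines the question asks for, one option is to parse the link text with `strptime` and reformat it with `strftime`. This is only a minimal sketch: it assumes every link's text follows the `Saturday, February 08, 2014` layout and that the process runs under an English locale (so `%A` and `%B` match the names). The helper name `to_lines` is hypothetical.

```python
import re
from datetime import datetime

# Same pattern as above: group 1 is the file name, group 2 is the link text
PATTERN = r'href="data/self/dated/([^"]*)"[^>]*>([\s\S]*?)</a>'


def to_lines(html):
    """Return 'yy/mm/dd: filename' strings for every matching link."""
    lines = []
    for filename, text in re.findall(PATTERN, html):
        # Parse link text such as 'Saturday, February 08, 2014'
        date = datetime.strptime(text.strip(), '%A, %B %d, %Y')
        lines.append('{0}: {1}'.format(date.strftime('%y/%m/%d'), filename))
    return lines


sample = '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>'
print(to_lines(sample))  # ['14/02/08: station1_140208.txt']
```

Because the pattern is anchored on `href="data/self/dated/`, absolute http and ftp links elsewhere in the file do not match and are skipped.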

OTHER TIPS

I know it's not exactly what you asked for, but I thought I would show a way of converting the dates from your link text into the format in your desired output (yy/mm/dd). I used BeautifulSoup to read the elements from the HTML.

from bs4 import BeautifulSoup
import datetime as dt
import re

html = '<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a><br/>'

p = re.compile(r'.*/station1_\d+\.txt')  # matches hrefs that end in a station1 file name

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

a_tags = soup.find_all('a', {"href": p})

>>> print(a_tags)  # list of all <a> tags in the HTML whose href matches the pattern
[<a href="data/self/dated/station1_140208.txt">Saturday, February 08, 2014</a>]

names = [str(a.get('href')).split('/')[-1] for a in a_tags]  # str() because the values are unicode under Python 2

dates = [dt.datetime.strptime(str(a.text), '%A, %B %d, %Y') for a in a_tags]  # %d is the day of the month

Both names and dates are built with list comprehensions. strptime turns each date string into a datetime object.

>>> print(names)  # list of all file names from the hrefs
['station1_140208.txt']

>>> print(dates)  # list of all dates as datetime objects
[datetime.datetime(2014, 2, 8, 0, 0)]

toFileData = ["{0}: {1}".format(d.strftime('%y/%m/%d'), n) for d, n in zip(dates, names)]  # pair each date with its own file name

strftime reformats each date into the yy/mm/dd format from your example:

>>> print(toFileData)
['14/02/08: station1_140208.txt']

Then write the entries in toFileData to a file, one per line.
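
That last step might look roughly like the sketch below. The helper name `write_lines`, the variable `to_file_data`, and the output file name `stations.txt` are all hypothetical, and the file is placed in the system temp directory only to keep the example self-contained.

```python
import os
import tempfile


def write_lines(lines, path):
    """Write each entry to `path` on its own line."""
    with open(path, 'w') as out:
        for line in lines:
            out.write(line + '\n')


# Stand-in for the list of 'yy/mm/dd: filename' entries built earlier
to_file_data = ['14/02/08: station1_140208.txt']

# Hypothetical output location; replace with wherever the result should go
out_path = os.path.join(tempfile.gettempdir(), 'stations.txt')
write_lines(to_file_data, out_path)
```

Using `with` closes the file even if a write fails partway through.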

For details on the methods used above, such as soup.find_all() and a.get(), I recommend the BeautifulSoup documentation. Hope this helps.

Licensed under: CC-BY-SA with attribution