parsing a .srt file with regex

Question 1

Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:

an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line

... and repeat. Note the "one or more lines" part: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.

So just take advantage of the structure. That way you can parse everything in a single pass, reading the file one line at a time, while still keeping all the information for each subtitle together.

from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

For example, using the example on the SRT doc page, I get:

res
Out[60]: 
[['1\n',
  '00:02:17,440 --> 00:02:20,375\n',
  "Senator, we're making\n",
  'our final approach into Coruscant.\n'],
 ['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]

And I could further transform that into a list of meaningful objects:

from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')

subs = []

for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

subs
Out[65]: 
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
 Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
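
If the timestamps are needed for arithmetic (shifting or comparing cues), the start and end strings can be converted to datetime.timedelta values. A minimal sketch, assuming well-formed HH:MM:SS,mmm timestamps; parse_timestamp is a hypothetical helper name, not part of the code above.

```python
from datetime import timedelta


def parse_timestamp(ts):
    """Convert an SRT timestamp such as '00:02:17,440' to a timedelta.

    Hypothetical helper; assumes the 'HH:MM:SS,mmm' layout.
    """
    hms, millis = ts.split(',')
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=int(millis))


# Timestamps taken from the first subtitle in the example above
start = parse_timestamp('00:02:17,440')
end = parse_timestamp('00:02:20,375')
duration = end - start  # a timedelta, so cues can be compared or shifted
print(duration.total_seconds())
```

Keeping the strings in the namedtuple and converting only when needed leaves the parsing step itself unchanged.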

Question 2

I disagree with @roippi. Regex is a very nice tool for text matching, and the regex for this problem is not tricky.

import re

# Read the whole file into memory
with open(yoursrtfile) as f:
    content = f.read()
# Find all cues in content.
# The first group (...) captures the timing line, \s+ skips the whitespace
# (the line break) after it, and (.+) captures the text that follows,
# up to the end of that line.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
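
To see what that findall call returns, here is a small self-contained run on the two-cue example quoted earlier in this thread. The sample text is inlined as a string (an assumption, so the snippet runs without a file). Since `.` does not match a newline by default, `(.+)` captures only the first text line of each cue.

```python
import re

# Sample SRT text inlined for illustration (stands in for f.read())
content = (
    "1\n"
    "00:02:17,440 --> 00:02:20,375\n"
    "Senator, we're making\n"
    "our final approach into Coruscant.\n"
    "\n"
    "2\n"
    "00:02:20,476 --> 00:02:22,501\n"
    "Very good, Lieutenant.\n"
)

pattern = r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)"
result = re.findall(pattern, content)

# Each match is a (timing, first_text_line) tuple
for timing, text in result:
    print(timing, '|', text)
```

The second line of the first cue ("our final approach into Coruscant.") does not appear in the result, so multi-line cues need a different pattern or a line-based parser.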

Question 3

Number: ^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
String: *[a-zA-Z]+*

Hope this helps.
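
The number and time patterns can be combined into a small line classifier. A rough sketch: classify_line and its labels are hypothetical names, and any non-blank line that is neither a counter nor a timing line is simply treated as text rather than matched against a letters-only pattern (an assumption, since subtitle text often contains punctuation or non-Latin characters).

```python
import re

NUMBER_RE = re.compile(r'^[0-9]+$')
TIME_RE = re.compile(
    r'^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> '
    r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$'
)


def classify_line(line):
    """Return 'blank', 'time', 'number' or 'text' for one line of an SRT file."""
    line = line.strip()
    if not line:
        return 'blank'
    if TIME_RE.match(line):
        return 'time'
    if NUMBER_RE.match(line):
        return 'number'
    return 'text'  # fallback: anything else is subtitle content


for line in ['1', '00:02:17,440 --> 00:02:20,375', 'Very good, Lieutenant.', '']:
    print(repr(line), '->', classify_line(line))
```

One caveat: a subtitle line consisting only of digits is classified as 'number', so the position of the line within the cue still matters.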

Question 4

Thanks, @roippi, for this excellent parser. It helped me write an SRT-to-STL converter in fewer than 40 lines (in Python 2, though, as it has to fit into a larger project):

from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'  
outputname = 'FR.stl'
stlheader = """
$FontName           = Arial
$FontSize           = 34
$HorzAlign          = Center
$VertAlign          = Bottom

"""
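def converttime(sttime):
    "convert from srt time format (ms, 0...999) to stl one (frames at 25 fps, 0...24)"
    st = sttime.split(',')
    # clamp to 24 so that e.g. 999 ms does not round up to frame 25
    return "%s:%02d" % (st[0], min(24, round(25 * float(st[1]) / 1000)))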

# load
with open(inputname,'r') as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]   # py 2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname,'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n"%(converttime(sub.start), converttime(sub.end), "|".join(sub.content)) )
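
To make the time conversion concrete: the millisecond field (0 to 999) is scaled to a frame count at 25 frames per second and appended after a colon. A small standalone sketch of that arithmetic; srt_to_stl_time and its fps parameter are hypothetical names, and the sketch clamps the frame number to fps - 1 (an assumption about valid frame numbers) so that 999 ms cannot round up to frame 25.

```python
def srt_to_stl_time(srt_time, fps=25):
    """Convert 'HH:MM:SS,mmm' to 'HH:MM:SS:FF' (FF = frame number)."""
    hms, millis = srt_time.split(',')
    # Scale milliseconds to frames, e.g. 440 ms * 25 / 1000 = 11 frames
    frame = int(round(fps * int(millis) / 1000.0))
    # Assumption: valid frame numbers run from 0 to fps - 1
    frame = min(frame, fps - 1)
    return "%s:%02d" % (hms, frame)


print(srt_to_stl_time('00:02:17,440'))  # 00:02:17:11
print(srt_to_stl_time('00:02:22,999'))  # 00:02:22:24
```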

Question 5

For the time:

pattern = r"(\d{2}:\d{2}:\d{2},\d{3}.*)"

Question 6

None of the pure regex solutions above worked on real-life SRT files.

Let's take a look at the following SRT-formatted text:

1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line

2
00:02:20,476 --> 00:02:22,501
as well as a single line

3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは

Note that:

Text may contain Unicode characters.
Text can span several lines.
Every cue starts with an integer and ends with a blank line; both Unix-style (LF) and Windows-style (CR/LF) line endings should be accepted.

Here is the working regex:

\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))

https://regex101.com/r/qICmEM/1
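
A sketch of applying this idea from Python with re.findall. It is a variation on the pattern above, not the exact expression: line breaks are written as `\r?\n` and text lines as `[^\r\n]+`, because in Python's re module `.` also matches a carriage return and `[\r\n]` matches only a single character, so CRLF input needs the stricter form. The names CUE_RE and parse_cues are made up for this example, and the input is assumed to end with a newline.

```python
import re

# Variation of the pattern above, adapted so CRLF input also matches
CUE_RE = re.compile(
    r'\d+\r?\n'                                      # cue counter
    r'(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\r?\n'    # timing line
    r'((?:[^\r\n]+\r?\n)+)'                          # one or more text lines
)


def parse_cues(text):
    """Return a list of (timing, [text lines]) tuples."""
    return [(timing, body.splitlines())
            for timing, body in CUE_RE.findall(text)]


sample_lf = (
    "1\n"
    "00:02:17,440 --> 00:02:20,375\n"
    "Some multi lined text\n"
    "This is a second line\n"
    "\n"
    "2\n"
    "00:02:20,476 --> 00:02:22,501\n"
    "as well as a single line\n"
    "\n"
    "3\n"
    "00:03:20,476 --> 00:03:22,501\n"
    "should be able to parse unicoded text too\n"
    "こんにちは\n"
)
# Same content with Windows-style line endings
sample_crlf = sample_lf.replace('\n', '\r\n')

for timing, lines in parse_cues(sample_crlf):
    print(timing, lines)
```

Both samples yield the same three cues, with the multi-line and Unicode text preserved.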