Question

I am using the following regex to extract the filename from an rfc822 multipart email.

private static Pattern filenamePattern = Pattern.compile("(?<=filename=\").*?(?=\")");

This is able to extract filenames that are quoted (and may contain a space), as in:

Content-Type : application/pdf; name="Key.Enrollment_Final.pdf"

but cannot extract filenames that are not quoted, like:

Content-Type : application/octet-stream;    name=.config

I cannot quite figure out how to get both. For the first quote, I think I can check for (?<=filename=\"?), but how should I check for a space or an end of line or a quote?


Solution

I have only seen the filename attribute specified in the Content-Disposition header, not in the Content-Type header.

Either way, this is a regex that correctly matches the filename attribute according to RFC 1806 (which references RFC 1521 and RFC 822).

"filename=(?:([\\x21-\\x7E&&[^\\Q()<>[]@,;:\\\"/?=\\E]]++)|\"((?:(?:(?:\r\n)?[\t ])+|[^\r\"\\\\]|\\\\[\\x00-\\x7f])*)\")"

Matching is one thing, though: in the second (quoted) case you still have to process the captured file name, at least to unescape the quoted special characters. You also need to collapse linear-white-space, defined in RFC 822 as (?:(?:\r\n)?[\t ])+, to a single space, and replace non-printable characters.
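To make the post-processing step concrete, here is a rough sketch of how the match and the clean-up could fit together. The class and method names (FilenameParam, extract, unquote) are made up for illustration, the tspecials are escaped one by one instead of with \Q...\E, and replacing non-printable characters is left out, so treat it as a starting point rather than a complete implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FilenameParam {
    // Group 1: unquoted token. Group 2: body of a quoted-string (still escaped).
    private static final Pattern FILENAME = Pattern.compile(
        "filename=(?:([\\x21-\\x7E&&[^()<>\\[\\]@,;:\\\\\"/?=]]++)"
        + "|\"((?:(?:(?:\r\n)?[\t ])+|[^\r\"\\\\]|\\\\[\\x00-\\x7f])*)\")");

    // Either a run of linear white space, or a quoted-pair
    // (backslash followed by one ASCII character, captured in group 1).
    private static final Pattern LWSP_OR_PAIR = Pattern.compile(
        "(?:(?:\r\n)?[\t ])+|\\\\([\\x00-\\x7f])");

    // Returns the file name, or null when the header has no filename parameter.
    static String extract(String header) {
        Matcher m = FILENAME.matcher(header);
        if (!m.find()) {
            return null;
        }
        if (m.group(1) != null) {
            return m.group(1);          // unquoted: usable as-is
        }
        return unquote(m.group(2));     // quoted: needs clean-up
    }

    // Single pass, so an escaped space is unescaped rather than collapsed away.
    static String unquote(String raw) {
        Matcher m = LWSP_OR_PAIR.matcher(raw);
        StringBuffer out = new StringBuffer();  // StringBuffer keeps this Java 8 compatible
        while (m.find()) {
            String replacement = m.group(1) != null ? m.group(1) : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, extract("Content-Disposition: attachment; filename=.config") returns .config, and a quoted value such as "my \"big\" file.pdf" comes back with the backslashes removed. Note that the match on filename= is case-sensitive here, as in the pattern above.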

OTHER TIPS

The following pattern works on both of your test cases above. Group 1 contains your filename.

name=\"?([^\"\r\n]*)\"?

I don't know if I understood it correctly, but if you want to keep just the name of the file, this should work:

private static Pattern filenamePattern = Pattern.compile(".*application/.* name=\"?([^ \"]+)\"?");

After a successful find() on the Matcher returned by filenamePattern.matcher(line), group(1) should hold the result.
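In case the Matcher part is unclear, this is roughly how a pattern of that shape could be applied to one header line. It is only a sketch: the class name ContentTypeName and the method nameOf are made up for the example, and the pattern written here makes the quote optional with \"? and keeps the quote out of the captured class so that it does not end up in group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ContentTypeName {
    private static final Pattern filenamePattern =
        Pattern.compile(".*application/.* name=\"?([^ \"]+)\"?");

    // Returns the captured name, or null when the line does not match.
    static String nameOf(String headerLine) {
        Matcher m = filenamePattern.matcher(headerLine);
        return m.find() ? m.group(1) : null;
    }
}
```

Because the capture stops at the first space, a quoted name that contains a space would be cut short with this approach.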

I guess this Regex would serve your purpose:

name\=\"?([\w\.]+)\"?

You can adjust the character class ([\w\.]+) to suit your file names; as written, it matches the two given examples.
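As an example of adjusting the character class, the sketch below compares [\w.]+ with a variant that also allows a hyphen. The class NameClassDemo, its fields and the capture helper are invented for the illustration, and which characters you actually need depends on your own file names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answer.
class NameClassDemo {
    // Word characters and dots only.
    static final Pattern BASIC = Pattern.compile("name=\"?([\\w.]+)\"?");
    // Same idea, with a literal hyphen added at the end of the class.
    static final Pattern WITH_HYPHEN = Pattern.compile("name=\"?([\\w.-]+)\"?");

    // Returns group 1 of the first match, or null if nothing matches.
    static String capture(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}
```

For a line ending in name="my-report.pdf", BASIC stops at the hyphen and captures only my, while WITH_HYPHEN captures my-report.pdf.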

Check this Regex101 fiddle

Licensed under: CC-BY-SA with attribution