Link rendering issue with server-side MarkdownSharp conversion and sanitization - how to get the same output as Pagedown

StackOverflow https://stackoverflow.com/questions/12529057

Question

I'm using the Pagedown editor. The code I'm using for generating the preview is the following:

$(document).ready(function () {
    var previewConverter = Markdown.getSanitizingConverter();
    var editor = new Markdown.Editor(previewConverter);
    editor.run();
});

When I enter some text into the input:

(screenshot: the markdown text entered in the editor)

the dynamically generated output preview is as expected and looks like this:

(screenshot: the rendered preview)

The content (the raw entered text, shown below) is then saved to the database:

"http://www.google.com\n\n<script>alert('hi');</script>\n\n[google][4]\n\n\n  [1]: http://www.google.com"

On the server side, before the page is rendered, I convert this text fetched from the database using the MarkdownSharp library (v1.13.0.0). After conversion, I sanitize the HTML using Jeff Atwood's code, which I found here:

// requires: using System.Text.RegularExpressions;
private static Regex _tags = new Regex("<[^>]*(>|$)",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

private static Regex _whitelist = new Regex(@"
    ^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|
    ^<(b|h)r\s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

private static Regex _whitelist_a = new Regex(@"
    ^<a\s
    href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
    (\stitle=""[^""<>]+"")?\s?>$|
    ^</a>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

private static Regex _whitelist_img = new Regex(@"
    ^<img\s
    src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""
    (\swidth=""\d{1,3}"")?
    (\sheight=""\d{1,3}"")?
    (\salt=""[^""<>]*"")?
    (\stitle=""[^""<>]*"")?
    \s?/?>$",
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);


/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using 
/// a whitelist based approach, leaving the "safe" HTML tags
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
/// </summary>
public static string Sanitize(string html)
{
    if (String.IsNullOrEmpty(html)) return html;

    string tagname;
    Match tag;

    // match every HTML tag in the input
    MatchCollection tags = _tags.Matches(html);
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();

        if(!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
            System.Diagnostics.Debug.WriteLine("tag sanitized: " + tagname);
        }
    }

    return html;
}
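To see the whitelist in action, here is a minimal usage sketch (hypothetical, not part of my actual code; the HTML literal is the converter output quoted further below):

var unsafeHtml = "<p>http://www.google.com</p>\n\n"
               + "<script>alert('hi');</script>\n\n"
               + "<p><a href=\"http://www.google.com\">google</a></p>\n";

// <script> and </script> match none of the three whitelist regexes, so both
// tags are removed; only the tags themselves are stripped, which is why the
// text between them ("alert('hi');") survives in the sanitized output.
var safeHtml = Sanitize(unsafeHtml);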

The conversion and sanitization process is as follows:

    var md = new MarkdownSharp.Markdown();
    var unsafeHtml = md.Transform(content);
    var safeHtml = Sanitize(unsafeHtml);
    return new HtmlString(safeHtml);

unsafeHtml contains:

"<p>http://www.google.com</p>\n\n<script>alert('hi');</script>\n\n<p><a href=\"http://www.google.com\">google</a></p>\n"

safeHtml contains (note that the sanitizer removes only the tags themselves, so the script's text content is left behind):

"<p>http://www.google.com</p>\n\nalert('hi');\n\n<p><a href=\"http://www.google.com\">google</a></p>\n"

This renders to:

(screenshot: the server-rendered output, where the first URL appears as plain text)

So sanitization worked and the second link was converted as expected. Unfortunately, the first link is not a link anymore, just plain text. How can I fix this?

Maybe a better approach is to skip server-side conversion entirely and just use JavaScript to render the markdown text on the page?


Solution

In Markdown.Converter.js we can find the _DoAutoLinks(text) function. It contains a section that automatically adds < and > around unadorned raw hyperlinks and then autolinks anything of the form <http://example.com>. This is why

http://www.google.com

will first be converted to:

<http://www.google.com> 

and then to:

<a href="http://www.google.com">http://www.google.com</a>

My temporary workaround is to do something similar on the C# side:

var unsafeHtml = DoAutolinks(md.Transform(content));

private static string DoAutolinks(string content)
{
    // url pattern - from msdn.microsoft.com/en-us/library/ff650303.aspx
    // (port group corrected to [0-9]; the original pattern has a literal (0-9))
    const string url = @"(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:[0-9]*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&amp;%\$#_]*)?";

    // wrap a paragraph consisting solely of a bare URL in an anchor,
    // mimicking what Pagedown's _DoAutoLinks does on the client
    const string pattern = @"<p>(?<url>" + url + ")</p>";
    var result = Regex.Replace(content, pattern, "<p><a href=\"${url}\">${url}</a></p>");
    return result;
}
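With this in place, the bare-URL paragraph gets wrapped in an anchor before sanitization. A quick sketch, reusing the string from the question:

var linked = DoAutolinks("<p>http://www.google.com</p>");
// linked == "<p><a href=\"http://www.google.com\">http://www.google.com</a></p>"
// this anchor also matches the _whitelist_a regex, so Sanitize keeps it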

Should such functionality, responsible for converting unadorned links, be included in MarkdownSharp?
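As an aside: if your MarkdownSharp build ships the MarkdownOptions class, its AutoHyperlink flag should make the converter hyperlink bare URLs by itself, making the workaround above unnecessary (untested sketch, assuming the option is available in your version):

var md = new MarkdownSharp.Markdown(new MarkdownSharp.MarkdownOptions
{
    AutoHyperlink = true  // auto-hyperlink bare URLs during Transform()
});
var unsafeHtml = md.Transform(content);  // no extra DoAutolinks pass needed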

Licensed under: CC-BY-SA with attribution