Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change: sanitize markdown to prevent JS execution in documentation output #692

Open
MaddyGuthridge opened this issue Sep 8, 2024 · 5 comments
Assignees

Comments

@MaddyGuthridge
Copy link

Is your change request related to a problem? Please describe.

I recently discovered in this mind-bending issue that most Markdown implementations don't sanitize their output, with Python-Markdown being no exception. This means that (contrary to popular belief), Markdown text cannot be trusted to be safe.

While developers should be careful to only accept merges after thoroughly reviewing code, there are lots of ways to subtlely embed JS into documents, which can be easily overlooked (eg using onerror in an <img> tag). I cannot think of any non-malicious to embed executing JS code within documentation markdown, when it is so much easier to bundle additional JS using mkdocs (which reviewers would be much more suspicious of, and therefore much more careful of).

Describe the solution you'd like

Sanitize the Markdown output within documentation to prevent executable JS code from being embedded in the output. Perhaps mozilla/bleach can be used.

Describe alternatives you've considered

Do nothing, but document that docstrings cannot be blindly trusted to be safe, as JS can be embedded within them.

Additional context

@pawamoy
Copy link
Member

pawamoy commented Sep 8, 2024

Hi again @MaddyGuthridge, thanks for the request.

I cannot think of any non-malicious to embed executing JS code within documentation markdown

I often use <script> in my own Markdown pages. Some of these scripts only need to appear in single specific pages, not all pages. A few examples:

To update a few things in my insiders pages:

<script src="../js/insiders.js"></script>
<script>updateInsidersPage('pawamoy');</script>

To make some SVGs zoomable in a specific page:

<script>
    document.addEventListener("DOMContentLoaded", function(){
        const divs = document.getElementsByClassName("interactiveSVG");
        for (let i = 0; i < divs.length; i++) {
            if (!divs[i].firstElementChild.id) {
                divs[i].firstElementChild.id = `interactiveSVG-${i}`
            }
            svgPanZoom(`#${divs[i].firstElementChild.id}`, {});
        }
    });
</script>

Generally speaking, Markdown is a superset of HTML, so I don't find it surprising that HTML (including <script>) is not escaped. The role of Markdown implementations is to convert Markdown to HTML, so keeping the raw HTML as is seems like the correct thing to do. I'd personally be very annoyed if Python-Markdown started escaping <script> or more HTML elements, as I very often rely on HTML when Markdown is insufficient. IMO sanitizing should be done on the final HTML, not during Markdown->HTML conversion.

However I agree that it could make sense to provide an option at the level of mkdocstrings, to escape HTML (or just <script>) in docstrings, since it can sometimes happen that you render docstrings from code you don't have control over. I could argue that you shouldn't render docstrings from untrusted sources, but that wouldn't be very productive.

So the solution I suggest here is to use a Griffe extension that will escape HTML script elements in docstrings. Seems super easy to do with bleach, as you mentioned 🙂

import bleach
import griffe


# I checked bleach's docs, but unfortunately it doesn't look
# like there's a way to only escape script elements.
class EscapeHTMLExtension(griffe.Extension):
    def on_instance(self, obj: griffe.Object, **kwargs) -> None:
        if obj.docstring:
            obj.docstring.value = bleach.clean(obj.docstring.value)

WDYT?

@waylan
Copy link
Contributor

waylan commented Sep 10, 2024

If you are using bleach to sanitize Markdown output, I might suggest yourcelf/bleach-allowlist which contains a comprehensive set of bleach configs.

@pawamoy
Copy link
Member

pawamoy commented Sep 10, 2024

Thanks @waylan!

One thing I'm unsure of, is whether bleach is able to escape Markdown, without breaking it. For example, will the following be left intact, or will <script> within the code block be escaped to &lt;script&gt;?

a = 0
"""Something about `<script>`."""

@waylan
Copy link
Contributor

waylan commented Sep 11, 2024

Bleach does not understand Markdown at all. One should only pass the HTML output of Markdown to Bleach. Of course, if valid HTML code blocks (or spans) contain Markdown, Bleach will understand that those blocks are not HTML and not try to parse them, but I'm not sure what escaping it will do in that scenario. However, if Bleach is passed Markdown which is not in a valid HTML code block or span, then all bets are off. Assume Bleach will not handle it in a sensible way.

@pawamoy
Copy link
Member

pawamoy commented Sep 11, 2024

Thanks. Escaping docstrings with Bleach is probably not a good idea then. Maybe mkdocstrings could allow hooking into its convert_markdown Jinja filter instead, to escape docstrings once they're converted to HTML 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants