
HTML Sanitizer in JavaScript

Creating an HTML sanitizer is crucial for web applications to prevent Cross-Site Scripting (XSS) vulnerabilities. This challenge asks you to build a JavaScript function that takes an HTML string as input and returns a sanitized version, removing potentially harmful elements and attributes while preserving safe content. This is a fundamental security practice for any application accepting user-provided HTML.

Problem Description

You need to implement a JavaScript function called sanitizeHTML that takes an HTML string as input and returns a sanitized version of that string. The sanitizer should remove potentially dangerous HTML elements and attributes, while allowing a predefined set of safe elements and attributes.

What needs to be achieved:

  • The function should parse the input HTML string.
  • It should identify and remove any HTML elements and attributes that are not explicitly allowed.
  • It should return a new HTML string containing only the allowed elements and attributes, with their content preserved.

Key Requirements:

  • Allowed Elements: p, b, i, u, em, strong, a, img, br, ul, ol, li, span, div, h1, h2, h3, h4, h5, h6.
  • Allowed Attributes:
    • For a elements: href (must be a valid URL - see edge cases)
    • For img elements: src (must be a valid URL - see edge cases), alt
    • For all other allowed elements: No attributes are allowed.
  • URL Validation: The href attribute of <a> tags and the src attribute of <img> tags must contain valid URLs. For this challenge, a URL is valid if it starts with http:// or https://. An invalid URL causes the attribute to be removed; the element itself is kept.
  • Attribute Value Sanitization: Attribute values should be escaped to prevent injection attacks. For simplicity, replace < with &lt; and > with &gt; in attribute values.
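
The attribute rules above can be sketched as a few small helpers. This is a minimal illustration, not a reference solution: the names ALLOWED_ATTRIBUTES, URL_ATTRIBUTES, isAllowedUrl, escapeAttributeValue, and keepAttribute are hypothetical and do not come from the challenge, and the URL check is only the simple prefix test described above.

```javascript
// Hypothetical helper names; the rules themselves come from the Key Requirements.

// Attributes permitted per element. Elements not listed here get no attributes.
const ALLOWED_ATTRIBUTES = new Map([
  ['a', ['href']],
  ['img', ['src', 'alt']],
]);

// Attributes whose values must pass the URL check.
const URL_ATTRIBUTES = new Set(['href', 'src']);

// The challenge's "simple check": the value must start with http:// or https://.
function isAllowedUrl(value) {
  return value.startsWith('http://') || value.startsWith('https://');
}

// Replace < and > so an attribute value cannot open or close a tag.
function escapeAttributeValue(value) {
  return value.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Decide whether one attribute survives sanitization on the given element.
function keepAttribute(tagName, attrName, attrValue) {
  const allowed = ALLOWED_ATTRIBUTES.get(tagName) || [];
  if (!allowed.includes(attrName)) return false;
  return !URL_ATTRIBUTES.has(attrName) || isAllowedUrl(attrValue);
}

console.log(keepAttribute('a', 'href', 'https://www.example.com')); // true
console.log(keepAttribute('a', 'href', 'javascript:alert(1)'));     // false
console.log(keepAttribute('p', 'onclick', 'steal()'));              // false
console.log(escapeAttributeValue('a <b> c'));                       // a &lt;b&gt; c
```

A Map is used rather than a plain object so that tag names such as "constructor" cannot accidentally resolve to inherited properties.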

Expected Behavior:

The function should return a string containing only the allowed HTML elements and attributes, with any potentially harmful elements and attributes removed. The content within the allowed elements should be preserved.

Edge Cases to Consider:

  • Empty input string.
  • Input string containing only whitespace.
  • Input string containing only disallowed HTML elements.
  • Input string containing nested HTML elements.
  • Invalid URLs in href and src attributes.
  • HTML entities already present in the input string.
  • Self-closing tags (e.g., <br />).
  • HTML comments (<!-- ... -->), which should be removed from the output.
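
Comment removal is the one edge case above with an explicitly stated outcome, so it makes a convenient small illustration. The following is a rough sketch using a regular expression; the function name stripComments is hypothetical, and treating an unterminated comment as running to the end of the input is an assumption made here to err on the side of removing too much. With a DOM parser, comments would instead show up as comment nodes (nodeType 8) that can simply be skipped.

```javascript
// Hypothetical helper: remove HTML comments from a string.
// Assumption: a comment that is never closed swallows the rest of the input.
function stripComments(html) {
  return html.replace(/<!--[\s\S]*?(?:-->|$)/g, '');
}

console.log(stripComments('<p>Hi<!-- hidden --> there</p>')); // <p>Hi there</p>
console.log(stripComments('<p>Hi</p><!-- never closed'));     // <p>Hi</p>
console.log(stripComments(''));                               // (empty string)
```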

Examples

Example 1:

Input: "<p>This is <b>bold</b> text with a <a href='https://www.example.com'>link</a> and an <img src='https://www.example.com/image.jpg' alt='An image'/>.</p>"
Output: "<p>This is <b>bold</b> text with a <a href='https://www.example.com'>link</a> and an <img src='https://www.example.com/image.jpg' alt='An image'/>.</p>"
Explanation: All elements and attributes are allowed and valid.

Example 2:

Input: "<p>This is <b>bold</b> text with a <a href='javascript:alert(\"XSS\")'>link</a> and an <img src='invalid-url' alt='An image'/>.</p><script>alert('XSS')</script>"
Output: "<p>This is <b>bold</b> text with a <a>link</a> and an <img alt='An image'/>.</p>"
Explanation: The `href` containing the `javascript:` URL and the `src` containing the invalid URL are removed, while the `<a>` and `<img>` elements and the link text are kept. The `<script>` element is removed along with its contents.

Example 3:

Input: "<div><p>Hello</p><script>alert('XSS')</script></div>"
Output: "<div><p>Hello</p></div>"
Explanation: The `<script>` element is removed along with its contents; the surrounding `<div>` and `<p>` elements are kept.

Constraints

  • Input Size: The input HTML string can be up to 10,000 characters long.
  • Performance: The function should complete within 500 milliseconds for typical input strings.
  • Input Format: The input will always be a string.
  • Output Format: The output must be a valid HTML string.

Notes

  • You can use regular expressions or a DOM parser to parse the HTML string. Using a DOM parser (like DOMParser in browsers or a library like jsdom in Node.js) is generally recommended for more robust and accurate parsing.
  • Consider using an allowlist (whitelist) approach: explicitly permit only the desired elements and attributes and reject everything else.
  • Be mindful of HTML entities and ensure they are handled correctly.
  • Thoroughly test your sanitizer with various inputs, including edge cases, to ensure its effectiveness.
  • This is a simplified sanitizer. Real-world sanitizers are significantly more complex and may involve more sophisticated techniques to prevent XSS vulnerabilities.
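
To make the notes above concrete, here is one possible sketch of sanitizeHTML built on a regular-expression tokenizer, chosen only because it runs in plain Node.js without a DOM. It is not a reference solution. Several things are assumptions rather than requirements from the challenge: every name other than sanitizeHTML is hypothetical; script and style elements are dropped with their contents while other disallowed tags are removed but their text is kept; output attributes are always single-quoted (so ' is escaped as well); and a stray < in text is escaped. The tokenizer only approximates HTML parsing and does not repair unbalanced tags, so a DOM parser remains the more robust choice.

```javascript
// A sketch only: names other than sanitizeHTML are hypothetical, and the
// tokenizer below approximates HTML parsing rather than implementing it fully.
const ALLOWED_ELEMENTS = new Set(['p', 'b', 'i', 'u', 'em', 'strong', 'a', 'img',
  'br', 'ul', 'ol', 'li', 'span', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']);
const ALLOWED_ATTRIBUTES = new Map([['a', ['href']], ['img', ['src', 'alt']]]);
const URL_ATTRIBUTES = new Set(['href', 'src']);
const VOID_ELEMENTS = new Set(['br', 'img']);
// Assumption: these elements are dropped together with their contents.
const DROP_WITH_CONTENT = new Set(['script', 'style']);

// One token per match: a comment, a tag, a run of text, or a stray "<".
const TOKEN = /<!--[\s\S]*?(?:-->|$)|<(\/?)([a-zA-Z][^\s\/>]*)((?:[\s\/](?:"[^"]*"|'[^']*'|[^'">])*)?)>|[^<]+|</g;
// name=value pairs with double-quoted, single-quoted, or unquoted values.
const ATTRIBUTE = /([^\s"'<>\/=]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'<>]+))/g;

function sanitizeAttributes(tagName, rawAttributes) {
  const allowed = ALLOWED_ATTRIBUTES.get(tagName) || [];
  const seen = new Set();
  let result = '';
  for (const match of rawAttributes.matchAll(ATTRIBUTE)) {
    const name = match[1].toLowerCase();
    const value = match[2] ?? match[3] ?? match[4];
    if (!allowed.includes(name) || seen.has(name)) continue;
    seen.add(name); // only the first occurrence of an attribute is considered
    if (URL_ATTRIBUTES.has(name) && !/^https?:\/\//.test(value)) continue;
    const escaped = value.replace(/</g, '&lt;').replace(/>/g, '&gt;')
      .replace(/'/g, '&#39;'); // needed because output uses single quotes
    result += ` ${name}='${escaped}'`;
  }
  return result;
}

function sanitizeHTML(input) {
  let output = '';
  let skipUntil = null; // set while inside a dropped <script> or <style>
  for (const [token, slash, rawName, rawAttributes] of input.matchAll(TOKEN)) {
    const name = rawName === undefined ? null : rawName.toLowerCase();
    if (skipUntil !== null) {
      if (slash === '/' && name === skipUntil) skipUntil = null;
      continue;
    }
    if (token.startsWith('<!--')) continue; // comments are removed
    if (name === null) { // a run of text or a stray "<"
      output += token === '<' ? '&lt;' : token;
      continue;
    }
    if (!ALLOWED_ELEMENTS.has(name)) {
      if (slash === '' && DROP_WITH_CONTENT.has(name)) skipUntil = name;
      continue; // the tag is removed; any following text is kept
    }
    if (slash === '/') {
      if (!VOID_ELEMENTS.has(name)) output += `</${name}>`;
    } else {
      const attributes = sanitizeAttributes(name, rawAttributes);
      output += `<${name}${attributes}${VOID_ELEMENTS.has(name) ? '/' : ''}>`;
    }
  }
  return output;
}

console.log(sanitizeHTML("<div><p>Hello</p><script>alert('XSS')</script></div>"));
// <div><p>Hello</p></div>
```

Because the output is assembled only from allowlisted tag names, filtered attributes, and text that contains no unescaped <, removing a tag cannot cause the surrounding text to join up into new markup.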