IDS11-J. Eliminate noncharacter code points before validation
In some versions prior to Unicode 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:
- C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.
According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], Section 3.5, “Deletion of Noncharacters”:
Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say “delete.” If what is passed in is “deXlete,” where X is a noncharacter, the gateway lets it through: The sequence “deXlete” may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.
Any string modification, including the removal or replacement of noncharacter code points, must be performed before the string is validated.
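To make the gateway scenario concrete, the following is a minimal sketch of how noncharacter code points might be detected and replaced rather than deleted. The class and method names (`Noncharacters`, `isNoncharacter`, `replaceNoncharacters`) are hypothetical and not part of this rule or of any library. The sketch relies on the Unicode definition of the 66 noncharacters: U+FDD0 through U+FDEF, plus the last two code points of every plane.

```java
final class Noncharacters {
    private Noncharacters() {}

    /** Returns true if cp is one of the 66 Unicode noncharacter code points. */
    static boolean isNoncharacter(int cp) {
        // U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Replaces each noncharacter with U+FFFD instead of silently deleting it. */
    static String replaceNoncharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp ->
            out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }
}
```

With this approach, an input such as "deXlete" (X being a noncharacter) becomes "de\uFFFDlete", so a later stage cannot collapse it into "delete". Such a replacement would still need to run before any validation, as the rule requires.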
Noncompliant Code Example
This noncompliant code example accepts only valid ASCII characters and deletes any non-ASCII characters. It also checks for the existence of a <script> tag.
Input validation is performed before the non-ASCII characters are deleted. Consequently, an attacker can disguise a <script> tag and bypass the validation checks.
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// U+FDEF is a noncharacter code point (U+FEFF, the byte order mark, is not)
String s = "<scr" + "\uFDEF" + "ipt>";
s = Normalizer.normalize(s, Form.NFKC);

// Input validation
Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ...
}

// Deletes all non-ASCII characters, but only after validation has already run
s = s.replaceAll("[^\\p{ASCII}]", "");
// s now contains "<script>"
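The effect of the ordering can be observed directly. The following is an illustrative sketch only: the class `ValidationOrderDemo` and its helper methods are hypothetical names, and the input assumes a noncharacter (U+FDEF) hidden inside the tag. It applies the same blacklist check before and after the deletion step.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

final class ValidationOrderDemo {
    private static final Pattern SCRIPT_TAG = Pattern.compile("<script>");

    private ValidationOrderDemo() {}

    /** The blacklist check: true if the string contains a script tag. */
    static boolean containsScriptTag(String s) {
        return SCRIPT_TAG.matcher(s).find();
    }

    /** Deletes every non-ASCII character, as the noncompliant code does. */
    static String deleteNonAscii(String s) {
        return s.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // U+FDEF is a noncharacter code point hidden inside the tag.
        String input = Normalizer.normalize("<scr\uFDEFipt>", Form.NFKC);

        // Validation runs first and does not see the tag.
        System.out.println("flagged before deletion: " + containsScriptTag(input));

        // Deleting afterwards reassembles the tag that validation missed.
        String cleaned = deleteNonAscii(input);
        System.out.println("flagged after deletion:  " + containsScriptTag(cleaned));
    }
}
```

The first check reports `false` and the second reports `true`, which is the bypass described in this section: the string that passed validation is not the string that is ultimately used.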
Compliant Solution
This compliant solution replaces each unknown or unrepresentable character with the Unicode replacement character U+FFFD (\uFFFD), which is reserved to denote this condition. The replacement is performed before any other sanitization, in particular before checking for <script>, so malicious input cannot bypass the filter.
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// U+FDEF is a noncharacter code point
String s = "<scr" + "\uFDEF" + "ipt>";
s = Normalizer.normalize(s, Form.NFKC);

// Replaces all non-ASCII characters with U+FFFD before validation
s = s.replaceAll("[^\\p{ASCII}]", "\uFFFD");

// Input validation
Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ...
}
According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], “U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn’t have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available.”
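The caveat about non-Unicode output character sets can be checked at run time. The following is a minimal sketch, assuming the output encoding is known as a `java.nio.charset.Charset`; the class and method names are hypothetical. It uses the standard `Charset.canEncode()` and `CharsetEncoder.canEncode(char)` methods to ask whether U+FFFD is representable.

```java
import java.nio.charset.Charset;

final class ReplacementCharSupport {
    private ReplacementCharSupport() {}

    /** Returns true if the given output charset can represent U+FFFD. */
    static boolean canEncodeReplacementChar(Charset charset) {
        // Charset.canEncode() guards against decode-only charsets, for
        // which newEncoder() would throw UnsupportedOperationException.
        return charset.canEncode() && charset.newEncoder().canEncode('\uFFFD');
    }
}
```

For example, the check succeeds for `StandardCharsets.UTF_8` and fails for `StandardCharsets.US_ASCII` and `StandardCharsets.ISO_8859_1`. What to substitute when it fails is an application decision that this rule does not prescribe.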
Risk Assessment
Deleting noncharacter code points after validation can allow malicious input to bypass validation checks.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level
IDS11-J | high | probable | medium | P12 | L1
Related Guidelines
MITRE CWE | CWE-182. Collapse of data into unsafe value
Bibliography
[API 2006]
[Davis 2008b] | 3.5, Deletion of Noncharacters
[Weber 2009] | Handling the Unexpected: Character-Deletion
[Unicode 2007]
[Unicode 2011]