IDS12-J. Perform lossless conversion of String data between differing character encodings
Performing conversions of String objects between different character encodings may result in loss of data.
According to the Java API [API 2006] documentation for the String.getBytes(Charset) method:
This method always replaces malformed-input and unmappable-character sequences with this charset’s default replacement byte array.
When a String must be converted to bytes (for example, to write it to a file) and the string might contain unmappable character sequences, the conversion must use an encoder that reports such sequences as errors rather than silently replacing them.
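The silent replacement described above can be seen in a minimal sketch. The class and method names (LossyGetBytesDemo, lossyEncode) are hypothetical and chosen only for illustration; the sketch assumes US-ASCII as the target charset, whose replacement byte is '?'.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original rule.
final class LossyGetBytesDemo {
    // Encodes with String.getBytes(Charset), which substitutes the charset's
    // replacement bytes for unmappable characters instead of failing.
    static byte[] lossyEncode(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // U+00E9 has no US-ASCII mapping
        byte[] bytes = lossyEncode(original);
        String roundTrip = new String(bytes, StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(bytes));      // [99, 97, 102, 63]
        System.out.println(original.equals(roundTrip));  // false
    }
}
```

No exception is thrown, so the caller has no indication that the last character was replaced by '?' (byte 63).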
Noncompliant Code Example
This noncompliant code example [Hornig 2007] corrupts the data when the string argument contains characters that cannot be represented in the specified charset.
import java.io.UnsupportedEncodingException;

// Corrupts data on errors
public static byte[] toCodePage_bad(String charset, String string)
    throws UnsupportedEncodingException {
  return string.getBytes(charset);
}

// Fails to detect corrupt data
public static String fromCodePage_bad(String charset, byte[] bytes)
    throws UnsupportedEncodingException {
  return new String(bytes, charset);
}
Compliant Solution
The java.nio.charset.CharsetEncoder class can transform a sequence of 16-bit Unicode characters into a sequence of bytes in a specific Charset, while the java.nio.charset.CharsetDecoder class can reverse the procedure [API 2006]. Also see rule FIO11-J for more information.
This compliant solution [Hornig 2007] uses the CharsetEncoder and CharsetDecoder classes to handle encoding conversions.
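import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public static byte[] toCodePage_good(String charset, String string)
    throws IOException {
  Charset cs = Charset.forName(charset);
  CharsetEncoder coder = cs.newEncoder();
  // encode() throws CharacterCodingException on unmappable input
  ByteBuffer bytebuf = coder.encode(CharBuffer.wrap(string));
  byte[] bytes = new byte[bytebuf.limit()];
  bytebuf.get(bytes);
  return bytes;
}

public static String fromCodePage_good(String charset, byte[] bytes)
    throws CharacterCodingException {
  Charset cs = Charset.forName(charset);
  CharsetDecoder coder = cs.newDecoder();
  // decode() throws CharacterCodingException on malformed input
  CharBuffer charbuf = coder.decode(ByteBuffer.wrap(bytes));
  return charbuf.toString();
}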
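As a hedged illustration of how a caller might observe the failure, the following sketch encodes strictly and reports whether a string is representable. The names StrictEncodeDemo, strictEncode, and isEncodable are hypothetical. The error actions are set to REPORT explicitly; this matches the default for a newly created encoder but makes the intent visible.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class, not part of the original rule.
final class StrictEncodeDemo {
    // Encodes s in the named charset, throwing instead of substituting.
    static byte[] strictEncode(String charsetName, String s)
            throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    // Returns true only if every character of s can be encoded losslessly.
    static boolean isEncodable(String charsetName, String s) {
        try {
            strictEncode(charsetName, s);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("US-ASCII", "cafe"));      // true
        System.out.println(isEncodable("US-ASCII", "caf\u00E9")); // false
        System.out.println(isEncodable("UTF-8", "caf\u00E9"));    // true
    }
}
```

Because the failure surfaces as a CharacterCodingException, the caller can choose a different charset or reject the input rather than store corrupted bytes.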
Noncompliant Code Example
This noncompliant code example [Hornig 2007] attempts to append a string to a text file in the specified encoding. This is erroneous because the string may contain characters that cannot be represented in that encoding, and an OutputStreamWriter constructed from a charset name replaces such characters instead of reporting an error.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

// Corrupts data on errors
public static void toFile_bad(String charset, String filename, String string)
    throws IOException {
  FileOutputStream stream = new FileOutputStream(filename, true);
  OutputStreamWriter writer = new OutputStreamWriter(stream, charset);
  writer.write(string, 0, string.length());
  writer.close();
}
Compliant Solution
This compliant solution [Hornig 2007] uses the CharsetEncoder class to perform the required function.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public static void toFile_good(String filename, String string, String charset)
    throws IOException {
  Charset cs = Charset.forName(charset);
  CharsetEncoder coder = cs.newEncoder();
  FileOutputStream stream = new FileOutputStream(filename, true);
  OutputStreamWriter writer = new OutputStreamWriter(stream, coder);
  writer.write(string, 0, string.length());
  writer.close();
}
Use the FileInputStream and InputStreamReader objects to read the data back from the file. The InputStreamReader constructor accepts an optional CharsetDecoder argument, which must be a decoder for the same charset that was used to write the file.
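The read-back step could look roughly like the following sketch. It is not taken from [Hornig 2007]; the class and method names (StrictFileReadDemo, readStrict) are hypothetical, and the sketch assumes the caller knows which charset was used when the file was written.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// Hypothetical demo class, not part of the original rule.
final class StrictFileReadDemo {
    // Reads the whole file using a decoder for the named charset.
    // A freshly created decoder reports malformed input, so invalid
    // bytes cause a CharacterCodingException (an IOException) rather
    // than being replaced silently.
    static String readStrict(String filename, String charsetName)
            throws IOException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder();
        StringBuilder sb = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new FileInputStream(filename), decoder)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

A call such as readStrict("out.txt", "UTF-8") (file name assumed) returns the decoded text, or fails with an exception if the bytes are not valid in that charset. The try-with-resources statement closes the reader even when decoding fails.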
Risk Assessment
Use of nonstandard methods for performing character-set-related conversions can lead to loss of data.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level
IDS12-J | low | probable | medium | P4 | L3
Related Guidelines
MITRE CWE | CWE-838. Inappropriate encoding for output context
MITRE CWE | CWE-116. Improper encoding or escaping of output
Bibliography
[API 2006] | Class String
[Hornig 2007] | Global Problem Areas: Character Encodings