IDS08-J. Sanitize untrusted data passed to a regex
Regular expressions are widely used to match strings of text. For example, the POSIX grep utility supports regular expressions for finding patterns in the specified text. For introductory information on regular expressions, see the Java Tutorials [Tutorials 08]. The java.util.regex package provides the Pattern class that encapsulates a compiled representation of a regular expression and the Matcher class, which is an engine that uses a Pattern to perform matching operations on a CharSequence.
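As a minimal sketch of how the two classes cooperate, the following example compiles a Pattern once and uses a Matcher to walk through a CharSequence. The class name, the helper method, and the sample pattern are illustrative assumptions, not part of the rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PatternMatcherDemo {
    // A Pattern is the compiled, immutable form of a regex and can be reused.
    private static final Pattern PID = Pattern.compile("public\\[(\\d+)\\]");

    // Returns every process ID that appears as "public[<digits>]" in the text.
    static List<String> publicPids(CharSequence text) {
        List<String> pids = new ArrayList<>();
        Matcher m = PID.matcher(text);   // the Matcher is the engine bound to one input
        while (m.find()) {               // advance to the next match, if any
            pids.add(m.group(1));        // group 1 is the parenthesized digit run
        }
        return pids;
    }

    public static void main(String[] args) {
        String line = "10:47:04 public[48964] Failed to resolve network service";
        System.out.println(publicPids(line)); // prints [48964]
    }
}
```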
Java’s powerful regular expression (regex) facilities must be protected from misuse. An attacker may supply a malicious input that modifies the original regular expression in such a way that the regex fails to comply with the program’s specification. This attack vector, called a regex injection, might affect control flow, cause information leaks, or result in denial-of-service (DoS) vulnerabilities.
Certain constructs and properties of Java regular expressions are susceptible to exploitation:
- Matching flags: Untrusted inputs may override matching options that may or may not have been passed to the Pattern.compile() method.
- Greediness: An untrusted input may attempt to inject a regex that changes the original regex to match as much of the string as possible, exposing sensitive information.
- Grouping: The programmer can enclose parts of a regular expression in parentheses to perform some common action on the group. An attacker may be able to change the groupings by supplying untrusted input.
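Two of these constructs can be illustrated with a small sketch. It assumes a hypothetical helper, firstMatch(), that splices untrusted text into a fixed template without any sanitization; the template and the sample strings are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; deliberately unsafe to show the effect of injection.
class RegexConstructDemo {
    // Builds prefix + "(" + untrusted + "\w*)" and returns group 1 of the
    // first match, or null when nothing matches.
    static String firstMatch(String prefix, String untrusted, String text) {
        Matcher m = Pattern.compile(prefix + "(" + untrusted + "\\w*)").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matching flags: the pattern was compiled case-sensitively, but an
        // embedded (?i) flag in the input switches that off for the group.
        System.out.println(firstMatch("status: ", "ok", "status: OK"));     // prints null
        System.out.println(firstMatch("status: ", "(?i)ok", "status: OK")); // prints OK

        // Greediness: an injected ".*" swallows the rest of the line.
        String line = "name: usr1 ssn: 111223333";
        System.out.println(firstMatch("name: ", "usr", line)); // prints usr1
        System.out.println(firstMatch("name: ", ".*", line));  // prints usr1 ssn: 111223333
    }
}
```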
Untrusted input should be sanitized before use to prevent regex injection. When the user must specify a regex as input, care must be taken to ensure that the original regex cannot be modified without restriction. Whitelisting characters (such as letters and digits) before delivering the user-supplied string to the regex parser is a good input sanitization strategy. A programmer must provide only a very limited subset of regular expression functionality to the user to minimize any chance of misuse.
Regex Injection Example
Suppose a system log file contains messages output by various system processes. Some processes produce public messages and some processes produce sensitive messages marked “private.” Here is an example log file:
    10:47:03 private[423] Successful logout name: usr1 ssn: 111223333
    10:47:04 public[48964] Failed to resolve network service
    10:47:04 public[1] (public.message[49367]) Exited with exit code: 255
    10:47:43 private[423] Successful login name: usr2 ssn: 444556666
    10:48:08 public[48964] Backup failed with error: 19
A user wishes to search the log file for interesting messages but must be prevented from seeing the private messages. A program might accomplish this by permitting the user to provide search text that becomes part of the following regex:
(.*? +public\[\d+\] +.*<SEARCHTEXT>.*)
However, an attacker who can substitute any string for <SEARCHTEXT> can perform a regex injection with the following text:
.*)|(.*
After injection, the regex becomes:
(.*? +public\[\d+\] +.*.*)|(.*.*)
This regex will match any line in the log file, including the private ones.
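The effect can be reproduced with a short sketch that applies the template to the example log file. The class and method names are assumptions made for illustration, and the sketch collects the whole match of each find() call (skipping empty matches) to show which lines the regex accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; deliberately vulnerable to regex injection.
class RegexInjectionDemo {
    static final String LOG =
        "10:47:03 private[423] Successful logout name: usr1 ssn: 111223333\n"
      + "10:47:04 public[48964] Failed to resolve network service\n"
      + "10:47:04 public[1] (public.message[49367]) Exited with exit code: 255\n"
      + "10:47:43 private[423] Successful login name: usr2 ssn: 444556666\n"
      + "10:48:08 public[48964] Backup failed with error: 19\n";

    // Splices the search text into the template with no sanitization.
    static List<String> search(String searchText) {
        String regex = "(.*? +public\\[\\d+\\] +.*" + searchText + ".*)";
        Matcher m = Pattern.compile(regex).matcher(LOG);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            String whole = m.group();   // whole match (group 0)
            if (!whole.isEmpty()) {     // the injected branch can also match ""
                hits.add(whole);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("failed").size());   // prints 1
        System.out.println(search(".*)|(.*").size());  // prints 5
        search(".*)|(.*").forEach(System.out::println); // includes both private lines
    }
}
```

The benign search returns only the public "Backup failed" line, whereas the injected text makes all five lines match.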
Noncompliant Code Example
This noncompliant code example periodically loads the log file into memory and allows clients to obtain keyword search suggestions by passing the keyword as an argument to suggestSearches().
    import java.io.FileInputStream;
    import java.nio.CharBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Keywords {
      private static ScheduledExecutorService scheduler
          = Executors.newSingleThreadScheduledExecutor();
      private static CharBuffer log;
      private static final Object lock = new Object();

      // Map log file into memory, and periodically reload
      static {
        try {
          FileChannel channel = new FileInputStream("path").getChannel();

          // Get the file's size and map it into memory
          int size = (int) channel.size();
          final MappedByteBuffer mappedBuffer = channel.map(
              FileChannel.MapMode.READ_ONLY, 0, size);

          Charset charset = Charset.forName("ISO-8859-15");
          final CharsetDecoder decoder = charset.newDecoder();

          log = decoder.decode(mappedBuffer); // Read file into char buffer

          Runnable periodicLogRead = new Runnable() {
            @Override public void run() {
              synchronized (lock) {
                try {
                  // The previous decode() consumed the buffer; reset its position
                  mappedBuffer.rewind();
                  log = decoder.decode(mappedBuffer);
                } catch (CharacterCodingException e) {
                  // Forward to handler
                }
              }
            }
          };
          scheduler.scheduleAtFixedRate(periodicLogRead, 0, 5, TimeUnit.SECONDS);
        } catch (Throwable t) {
          // Forward to handler
        }
      }

      public static Set<String> suggestSearches(String search) {
        synchronized (lock) {
          Set<String> searches = new HashSet<String>();

          // Construct regex dynamically from user string (noncompliant:
          // the untrusted string is concatenated without sanitization)
          String regex = "(.*? +public\\[\\d+\\] +.*" + search + ".*)";

          Pattern keywordPattern = Pattern.compile(regex);
          Matcher logMatcher = keywordPattern.matcher(log);
          while (logMatcher.find()) {
            String found = logMatcher.group(1);
            searches.add(found);
          }
          return searches;
        }
      }
    }
This code permits a trusted user to search for public log messages such as “error.” However, it also allows a malicious attacker to perform the regex injection previously described.
Compliant Solution (Whitelisting)
This compliant solution filters out nonalphanumeric characters (except space and single quote) from the search string, which prevents the regex injection previously described.
    public class Keywords {
      // ...

      public static Set<String> suggestSearches(String search) {
        synchronized (lock) {
          Set<String> searches = new HashSet<String>();

          StringBuilder sb = new StringBuilder(search.length());
          for (int i = 0; i < search.length(); ++i) {
            char ch = search.charAt(i);
            if (Character.isLetterOrDigit(ch) || ch == ' ' || ch == '\'') {
              sb.append(ch);
            }
          }
          search = sb.toString();

          // Construct regex dynamically from user string
          String regex = "(.*? +public\\[\\d+\\] +.*" + search + ".*)";
          // ...
        }
      }
    }
This solution also limits the set of valid search terms. For instance, a user may no longer search for “name =” because the = character would be sanitized out of the regex.
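To make the trade-off concrete, the following sketch isolates the filtering loop in a hypothetical sanitize() helper (the class and method names are assumptions) and applies it to the injection string and to two ordinary search terms.

```java
// Hypothetical demo class wrapping the whitelist loop from the compliant solution.
class WhitelistDemo {
    // Keeps letters, digits, spaces, and single quotes; drops everything else.
    static String sanitize(String search) {
        StringBuilder sb = new StringBuilder(search.length());
        for (int i = 0; i < search.length(); ++i) {
            char ch = search.charAt(i);
            if (Character.isLetterOrDigit(ch) || ch == ' ' || ch == '\'') {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing spaces visible.
        System.out.println("[" + sanitize(".*)|(.*") + "]"); // prints []
        System.out.println("[" + sanitize("name =") + "]");  // prints [name ]
        System.out.println("[" + sanitize("error") + "]");   // prints [error]
    }
}
```

The injection string is reduced to the empty string, so the resulting regex can match only public lines, while "name =" silently loses its = character.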
Compliant Solution
Another method of mitigating this vulnerability is to filter out the sensitive information prior to matching. Such a solution would require the filtering to be done every time the log file is periodically refreshed, incurring extra complexity and a performance penalty. Sensitive information may still be exposed if the log format changes but the class is not also refactored to accommodate these changes.
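One way such pre-filtering might look is sketched below. It assumes that every public entry has the form "<time> public[<pid>] <message>", as in the example log file; the class name, the method name, and the line pattern are illustrative assumptions rather than part of the rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that strips non-public entries before any user search runs.
class PublicLogFilter {
    // Assumed log format: "<time> public[<pid>] <message>", one entry per line.
    // MULTILINE makes ^ and $ match at the start and end of each line.
    private static final Pattern PUBLIC_LINE =
        Pattern.compile("^\\S+ +public\\[\\d+\\] .*$", Pattern.MULTILINE);

    // Returns only the lines that match the assumed public-entry format.
    static String publicOnly(CharSequence rawLog) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PUBLIC_LINE.matcher(rawLog);
        while (m.find()) {
            sb.append(m.group()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw =
            "10:47:03 private[423] Successful logout name: usr1 ssn: 111223333\n"
          + "10:47:04 public[48964] Failed to resolve network service\n"
          + "10:48:08 public[48964] Backup failed with error: 19\n";
        System.out.print(publicOnly(raw)); // prints only the two public lines
    }
}
```

In this sketch the filter would run each time the log is reloaded, and the user-influenced regex would then be applied only to the filtered text. Because the filter depends on the assumed line format, it illustrates the maintenance risk noted above: a change in the log format requires a matching change to the pattern.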
Risk Assessment
Failing to sanitize untrusted data included as part of a regular expression can result in the disclosure of sensitive information.
| Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
| --- | --- | --- | --- | --- | --- |
| IDS08-J | medium | unlikely | medium | P4 | L3 |
Related Guidelines
| MITRE CWE | CWE-625. Permissive regular expression |
Bibliography
| [Tutorials 08] | Regular Expressions |
| [CVE 05] | CVE-2005-1949 |