Exercises
In many contexts (e.g., in some web forms), users must enter a phone number, and some of these forms irritate users by accepting only one specific format. Write a program that reads U.S. phone numbers, each consisting of a three-digit area code and a seven-digit local number. The numbers should be accepted either as ten consecutive digits or separated into blocks using hyphens or spaces, with the area code optionally enclosed in parentheses. For example, all of these are valid: 555-555-5555, (555) 5555555, (555) 555 5555, and 5555555555. Read the phone numbers from sys.stdin and for each one echo the number in the form “(555) 555 5555” or report an error for any that are invalid.
The regex to match these phone numbers is about eight lines long (in verbose mode) and is quite straightforward. A solution is provided in phone.py, which is about twenty-five lines long.
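As a starting point, the matching step might be sketched as follows. This is not the phone.py solution; the names PHONE_RE and normalize_phone are hypothetical, and the regex is assumed to allow at most one hyphen or whitespace character between blocks.

```python
import re

# Hypothetical sketch, not the book's phone.py.
PHONE_RE = re.compile(r"""
    ^\s*
    (?:
        \((?P<paren_area>\d{3})\)   # area code in parentheses
      |
        (?P<area>\d{3})             # or bare area code
    )
    [-\s]?                          # optional separator (assumed: at most one)
    (?P<exchange>\d{3})             # first three digits of the local number
    [-\s]?                          # optional separator
    (?P<line>\d{4})                 # last four digits of the local number
    \s*$
    """, re.VERBOSE)


def normalize_phone(text):
    """Return the number as '(555) 555 5555', or None if text is invalid."""
    match = PHONE_RE.match(text)
    if match is None:
        return None
    area = match.group("paren_area") or match.group("area")
    return "({0}) {1} {2}".format(
        area, match.group("exchange"), match.group("line"))
```

A complete program would loop over sys.stdin, call normalize_phone() on each line, and print either the normalized number or an error message when the result is None.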
Write a small program that reads an XML or HTML file specified on the command line and for each tag that has attributes, outputs the name of the tag with its attributes shown underneath. For example, here is an extract from the program’s output when given one of the Python documentation’s index.html files:
html
    xmlns = http://www.w3.org/1999/xhtml
meta
    http-equiv = Content-Type
    content = text/html; charset=utf-8
li
    class = right
    style = margin-right: 10px
One approach is to use two regexes, one to capture tags with their attributes and another to extract the name and value of each attribute. Attribute values might be quoted using single or double quotes (in which case they may contain whitespace and the quotes that are not used to enclose them), or they may be unquoted (in which case they cannot contain whitespace or quotes). It is probably easiest to start by creating separate regexes to handle quoted and unquoted values, and then to merge the two into a single regex that covers both cases. It is best to use named groups to make the regex more readable. This is not easy, especially since backreferences cannot be used inside character classes.
A solution is provided in extract_tags.py, which is less than 35 lines long. The tag and attributes regex is just one line. The attribute name–value regex is half a dozen lines and uses alternation, conditional matching (twice, with one nested inside the other), and both greedy and nongreedy quantifiers.
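The two-regex approach can be illustrated with a simplified sketch. It is not the extract_tags.py solution: it uses plain alternation with a backreference rather than conditional matching, and the names TAG_RE, ATTR_RE, and extract_tags are hypothetical. It also assumes that attribute values never contain a ">" character, which real HTML does not guarantee.

```python
import re

# Hypothetical sketch, not the book's extract_tags.py.
# Captures a tag name and the raw text of its attributes.
# Simplifying assumption: no attribute value contains ">".
TAG_RE = re.compile(r"<(?P<tag>[A-Za-z][\w:.-]*)\s+(?P<attrs>[^>]+)>")

# Captures one name=value pair, with the value quoted or unquoted.
ATTR_RE = re.compile(r"""
    (?P<name>[\w:.-]+)           # attribute name
    \s*=\s*
    (?:
        (?P<quote>["'])          # opening quote, single or double
        (?P<qvalue>.*?)          # nongreedy: stop at the first matching quote
        (?P=quote)               # backreference: the same quote must close
      |
        (?P<uvalue>[^\s"'>]+)    # unquoted: no whitespace or quotes
    )
    """, re.VERBOSE | re.DOTALL)


def extract_tags(text):
    """Return a list of (tag, [(name, value), ...]) for tags with attributes."""
    results = []
    for tag_match in TAG_RE.finditer(text):
        attributes = []
        for attr in ATTR_RE.finditer(tag_match.group("attrs")):
            value = attr.group("qvalue")
            if value is None:            # the unquoted alternative matched
                value = attr.group("uvalue")
            attributes.append((attr.group("name"), value))
        if attributes:
            results.append((tag_match.group("tag"), attributes))
    return results
```

Printing each tag name followed by its indented "name = value" pairs would then give output in the style shown above. The backreference works here because it sits outside any character class; the quoted value itself is matched nongreedily rather than with a negated class.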