Exploring Java's Network API: URIs and URLs
- What Are URIs, URLs, and URNs?
- Working with URIs
- Working with URLs
- Review
In 1989, Tim Berners-Lee invented the World Wide Web (WWW). Think of the WWW as a global collection of interconnected physical and abstract resources (entities supplying information on demand) that are accessed over the Internet. Physical resources range from files to people, and abstract resources include database queries. Because resources are identified in various ways (people have nonunique names, whereas computer files can be accessed via unique pathname combinations), a uniform way to identify WWW resources was needed. To address that need, Tim Berners-Lee introduced standardized ways to identify, locate, and name resources: URIs, URLs, and URNs.
NOTE
You can learn more about Tim Berners-Lee (and read a few of his WWW essays and articles) by visiting his Web page at http://www.w3.org/People/Berners-Lee/.
This article, the second in my Network API trilogy, explores URIs and URLs (and, to a lesser extent, URNs). After presenting basic concepts about those entities, the article examines the Network API's URI and URL classes (along with URL-related classes) and shows how to use those classes in your programs. Along the way, you discover the concept of MIME and how that concept relates to URLs.
This article's coverage of URIs, URLs, URNs, and MIME is based on two Request For Comments (RFC) documents. (RFC documents serve as the mechanism by which the Internet's architecture evolves.) The relevant RFC documents are listed here:
RFC 2396: "Uniform Resource Identifiers (URI) Generic Syntax"
NOTE
Version 1.4 (Beta 2) of Sun's Java 2 Standard Edition (J2SE) SDK was used to build this article's programs.
What Are URIs, URLs, and URNs?
URIs, URLs, and URNs relate to each other in a hierarchy. The URI category sits at the top of that hierarchy, while the URL and URN categories sit at the bottom. That arrangement indicates that both URL and URN are subcategories of URI, as Figure 1 illustrates.
Figure 1 URI, URL, and URN form a hierarchical relationship. URL and URN are subcategories of URI.
URI stands for uniform resource identifier, a compact string of characters that identifies a resource in a uniform (standardized) manner. That string typically begins with a scheme (an identifier that names the URI's namespace, a set of related names) and has the following syntax:
[scheme:] scheme-specific-part
The URI optionally begins with a scheme and a colon character. The scheme begins with an uppercase/lowercase letter, followed by zero or more uppercase/lowercase letters, digits, plus sign characters, minus sign characters, and period characters. The colon character separates the scheme from the scheme-specific-part, and the scheme-specific-part's syntax and semantics (meaning) are determined by the URI's namespace. An example of a URI is http://www.cnn.com, in which http is the scheme, //www.cnn.com is the scheme-specific-part, and the scheme and scheme-specific-part are separated by a colon character.
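As a minimal sketch of that split, the following program asks the Network API's java.net.URI class for the scheme and the scheme-specific-part of the example URI. The class name SchemeSplit and the helper split() are hypothetical names chosen for illustration only.

```java
import java.net.URI;

final class SchemeSplit {
    // Hypothetical helper: returns "scheme | scheme-specific-part".
    static String split(String text) {
        // URI.create() throws IllegalArgumentException if text is not a valid URI.
        URI uri = URI.create(text);
        return uri.getScheme() + " | " + uri.getSchemeSpecificPart();
    }

    public static void main(String[] args) {
        System.out.println(split("http://www.cnn.com"));
    }
}
```

Running the program should print http | //www.cnn.com, which matches the breakdown given above: the colon itself belongs to neither part.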
URIs can be categorized as absolute or relative. An absolute URI is a URI that begins with a scheme (followed by a colon character). The earlier http://www.cnn.com is an example of an absolute URI. Other examples include mailto:jeff@javajeff.com, news:comp.lang.java.help, and xyz://whatever. Think of an absolute URI as referring to some resource in a manner that is independent of the context in which that identifier appears. To use a file system analogy, an absolute URI is like a pathname to a file that starts from the root directory. In contrast to an absolute URI, a relative URI is a URI that does not begin with a scheme (followed by a colon character). An example is articles/articles.html. Think of a relative URI as referring to some resource in a manner that is dependent on the context in which that identifier appears. Using the file system analogy, the relative URI is like a pathname to a file that starts from the current directory.
URIs can be further categorized as opaque or hierarchical. An opaque URI is an absolute URI whose scheme-specific-part does not begin with a forward slash (/) character. Examples include news:comp.lang.java and the earlier mailto:jeff@javajeff.com. Opaque URIs are not subject to parsing (beyond identifying the scheme) because the scheme-specific-part does not need to be validated. By contrast, a hierarchical URI is either an absolute URI whose scheme-specific-part begins with a forward slash character, or a relative URI.
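The two categorizations can be checked in code. The sketch below is one possible way to do it, using the isAbsolute() and isOpaque() methods of java.net.URI; the class name UriKinds and the helper classify() are hypothetical.

```java
import java.net.URI;

final class UriKinds {
    // Hypothetical helper: labels a URI as absolute/relative and opaque/hierarchical.
    static String classify(String text) {
        URI uri = URI.create(text);
        String kind = uri.isAbsolute() ? "absolute" : "relative";
        String shape = uri.isOpaque() ? "opaque" : "hierarchical";
        return kind + ", " + shape;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.cnn.com",
            "mailto:jeff@javajeff.com",
            "news:comp.lang.java",
            "xyz://whatever",
            "articles/articles.html"
        };
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + " -> " + classify(samples[i]));
        }
    }
}
```

With these inputs, the mailto and news URIs should be reported as absolute and opaque, the http and xyz URIs as absolute and hierarchical, and articles/articles.html as relative and hierarchical.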
Unlike an opaque URI, a hierarchical URI's scheme-specific-part must be parsed into various components. What components are those? The scheme-specific-part of a common subset of hierarchical URI identifies components according to the following syntax:
[//authority] [path] [?query] [#fragment]
The optional authority component identifies the naming authority for the URI's namespace. If present, that component begins with a pair of forward slash characters, is either server-based or registry-based, and terminates with the next forward slash character, the next question mark character, or the end of the URI (when no more characters remain). Registry-based authority components have scheme-specific syntaxes (and are not discussed in this article because they are not commonly used), whereas server-based authority components tend to have the following syntax:
[userinfo@] host [:port]
According to this syntax, a server-based authority component optionally begins with user information (such as a username) and an "at" (@) character, continues with the name of a host, and optionally concludes with a colon (:) character and a port. For example, jeff@x.com:90 is a server-based authority component, in which jeff comprises the user information, x.com comprises the host, and 90 comprises the port.
The optional path component identifies the location of a resource according to the authority component (if present) or the scheme (if there is no authority component). A path divides into a sequence of path segments, in which each path segment (a portion of the path) is separated from other path segments by a forward slash character. The path is considered to be absolute if the first path segment begins with a forward slash character. Otherwise, the path is considered to be relative. For example, /a/b/c constitutes a path with three path segments: a, b, and c. Furthermore, that path is absolute because a forward slash character prefixes the first path segment (a). (Despite appearances to the contrary, a URI's path and a directory's path are two different things.)
The optional query component identifies data to be passed to the resource. That resource uses the data to obtain or produce other data that passes back to the caller. For example, in http://www.somesite.net/a?x=y, x=y represents a query. According to that query, x=y is data to be passed to a resource: x names some entity and y is the value of that entity.
The final component is fragment. Although that component appears to be part of a URI, it is not. When a URI is used in some kind of retrieval action, the software that performs that action later uses fragment to focus on the part of a resource that is of interest to the software (after the software has successfully retrieved data from the resource).
To put the aforementioned component information into perspective, consider the following URI:
ftp://george@x.com:90/public/notes?text=shakespeare#hamlet
The previous URI identifies ftp as the scheme, george@x.com:90 as the server-based authority (in which george constitutes the user information, x.com constitutes the host, and 90 constitutes the port), /public/notes as the path, text=shakespeare as the query, and hamlet as the fragment. Essentially, a user named george wants to retrieve information on hamlet from the shakespeare text that's located, via the /public/notes path, on port 90 of server x.com. After shakespeare is successfully returned to the program, the program locates the hamlet section and presents that section to the program's user.
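The same breakdown can be reproduced with the accessor methods of java.net.URI. The following is a sketch only; the class name UriComponents and the helper describe() are hypothetical, and the one-line-per-component layout is simply a convenient way to print the results.

```java
import java.net.URI;

final class UriComponents {
    // Hypothetical helper: lists each component of a hierarchical URI, one per line.
    static String describe(String text) {
        URI uri = URI.create(text);
        return "scheme: " + uri.getScheme() + "\n"
             + "authority: " + uri.getAuthority() + "\n"
             + "user info: " + uri.getUserInfo() + "\n"
             + "host: " + uri.getHost() + "\n"
             + "port: " + uri.getPort() + "\n"   // getPort() returns -1 if no port is given
             + "path: " + uri.getPath() + "\n"
             + "query: " + uri.getQuery() + "\n"
             + "fragment: " + uri.getFragment();
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "ftp://george@x.com:90/public/notes?text=shakespeare#hamlet"));
    }
}
```

Components that are absent from a URI come back as null (or -1 for the port), so the same helper can be tried on shorter URIs to see which parts are optional.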
Some URIs contain one or more path segments consisting of single-period characters. Those path segments contribute nothing to the URIs. Other URIs contain path segments consisting of two consecutive period characters, each preceded by a path segment that does not itself consist of two periods. Such a pair cancels out (the double-period segment steps back over the segment before it), so, as with single-period path segments, the pair contributes nothing to the URI. The act of removing unnecessary single-period path segments and unnecessary double-period path segments (plus the immediately preceding non-double-period path segments) is known as normalization.
Normalization can be understood in directory terms. Suppose that directory x exists immediately below the root directory, x contains directories a and b, b contains the file memo.txt, and a is the current directory.
To display the contents of memo.txt (under Microsoft Windows), you could specify type \x\.\b\memo.txt. However, the single-period character accomplishes nothing. You could also specify type \x\a\..\b\memo.txt. In this case, the presence of a and .. are not necessary. Neither directory path is in its simplest form. However, if you specify type \x\b\memo.txt, you are specifying the simplest path, beginning with the root directory, to access memo.txt. That \x\b\memo.txt simplest path is known as a normalized directory path. (The same idea applies to URIs.)
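To see the same idea applied to URIs, the sketch below normalizes the two directory paths from the example using the normalize() method of java.net.URI. The paths are rewritten with forward slashes because the backslash is not a legal URI character; the class name NormalizeDemo and the helper normalize() are hypothetical.

```java
import java.net.URI;

final class NormalizeDemo {
    // Hypothetical helper: returns the normalized form of a URI as a string.
    static String normalize(String text) {
        return URI.create(text).normalize().toString();
    }

    public static void main(String[] args) {
        // Single-period segment: removed outright.
        System.out.println(normalize("/x/./b/memo.txt"));
        // Double-period segment: removed together with the preceding segment (a).
        System.out.println(normalize("/x/a/../b/memo.txt"));
    }
}
```

Both calls should print /x/b/memo.txt, the URI counterpart of the normalized directory path.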
Resources are often accessed via base and relative URIs. A base URI is an absolute URI that uniquely identifies a resource's namespace, whereas a relative URI identifies a resource relative to the base URI. (Unlike a base URI, a relative URI might never need to change in a resource's lifetime.) Because neither the base URI nor the relative URI completely identifies the resource, it is necessary to merge both URIs through a process known as resolution. Conversely, it is possible to extract the relative URI from the merged URI through a process known as relativization (the inverse of resolution).
NOTE
Unlike other URIs, opaque URIs are not subject to normalization, resolution, or relativization.
Suppose that you have x://a/ as a base URI and b/c as a relative URI. Resolving the relative URI against the base URI yields x://a/b/c. Relativizing x://a/b/c against x://a/ yields b/c.
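That round trip can be sketched with the resolve() and relativize() methods of java.net.URI. The class name ResolveDemo and its two helpers are hypothetical names used only for this illustration.

```java
import java.net.URI;

final class ResolveDemo {
    // Hypothetical helper: merges a relative URI into a base URI.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(URI.create(relative)).toString();
    }

    // Hypothetical helper: extracts the part of a full URI that is relative to base.
    static String relativize(String base, String full) {
        return URI.create(base).relativize(URI.create(full)).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("x://a/", "b/c"));           // resolution
        System.out.println(relativize("x://a/", "x://a/b/c"));  // relativization
    }
}
```

The first line of output should be x://a/b/c and the second b/c, mirroring the example above.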
URIs cannot locate and read from/write to resources. That is the job of the uniform resource locator (URL). A URL is a URI whose scheme component is known as a network protocol (protocol, for short), and it combines URI components with a protocol handler (a resource locator and read/write mechanism that communicates with a resource according to strict rules that have been established for the protocol).
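One way to observe the role of the protocol handler is to ask a URI to convert itself into a URL. The sketch below assumes a stock JDK with no custom handlers installed: in that setting, URI's toURL() method should succeed for the http scheme and throw MalformedURLException for a made-up scheme such as xyz. The class name HandlerCheck and the helper hasHandler() are hypothetical, and no network connection is opened.

```java
import java.net.MalformedURLException;
import java.net.URI;

final class HandlerCheck {
    // Hypothetical helper: reports whether this JVM can build a URL from the
    // given absolute URI, which requires a protocol handler for its scheme.
    // (A relative URI makes toURL() throw IllegalArgumentException instead.)
    static boolean hasHandler(String text) {
        try {
            URI.create(text).toURL();
            return true;
        } catch (MalformedURLException e) {
            return false; // no protocol handler is known for the scheme
        }
    }

    public static void main(String[] args) {
        System.out.println("http: " + hasHandler("http://www.cnn.com"));
        System.out.println("xyz:  " + hasHandler("xyz://whatever"));
    }
}
```

Both strings are perfectly good URIs; only the first is also usable as a URL here, because only its scheme names a protocol the JVM knows how to speak.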
It is also true that URIs cannot provide persistent names for resources. That is the job of the uniform resource name (URN). A URN is a URI that is globally unique and persistent, even when a resource ceases to exist or is no longer available. (That is all I have to say about URNs in this article.)