Advertising and Embedded Content
- Cross-Site Tracking
- Advertising
- Advertising Risks
- Other Cross-Site Risks
- Summary
[My web history is] mine - you can’t have it. If you want to use it for something, then you have to negotiate with me. I have to agree, I have to understand what I’m getting in return.1
-- Sir Tim Berners-Lee
Publishing information is the backbone of web content, and bloggers and webmasters frequently rely on embedded content from companies such as Google to enhance the quality of their sites. Unfortunately, embedding third-party content is the equivalent of planting a web bug in web pages,2 alerting the source of the embedded content to a user’s presence on a given site and facilitating logging, profiling, and fingerprinting. More important, the source of the third-party content can aggregate these single instances and track users as they browse the web. This notion deserves restating: The simple act of web browsing across many disparate sites has the potential to generate a continuous stream of information back to the providers of third-party content. The more popular a given third-party service is, the more sites will deploy its content, and the larger the window of visibility on users’ web surfing activity becomes. In the case of advertising networks such as Google/DoubleClick and web-analytics services such as Google Analytics, the risk is large indeed.
This chapter explores the risks associated with embedded content by focusing on Google’s advertising network and Google Analytics, but it also provides an overview of other forms of embedded content that present related risks, such as embedded YouTube videos, maps, and Google’s Chat Back Service.
Cross-Site Tracking
As mentioned in Chapter 3, “Footprints, Fingerprints, and Connections,” many web sites embed third-party content in their pages. Third-party content can take the form of legitimate images and video clips, among other forms of content, but it can also be used to track users as they surf the web. Advertisers and web-analytics services give webmasters enticing analysis tools and advertising profit, simply requiring that, in exchange, webmasters add small snippets of HTML and JavaScript to their pages. Unfortunately, such third-party content is a severe privacy and web-based information-disclosure risk because the user’s web browser automatically visits these third-party servers,3 where the visit is presumably logged and the browser tagged with cookies. More important, the larger the advertising network is, the larger the window a given company has on a user’s online activity. For example, if a user visits 100 different web sites, each containing advertisements from a single advertising service, that service can observe the user as he or she visits each site. Figure 7-1 depicts cross-site tracking via an advertising network. In this figure, a user visits six distinct web sites, each hosting content from a single advertiser. In turn, the user’s visits create one set of log entries on each of the six legitimate servers. However, because each page contained an advertisement from a single advertising network, the advertiser is able to log all six visits.
Figure 7-1 Example of cross-site tracking by an advertising network. When the user visits six distinct sites, he or she generates one set of log entries at each site. However, if each site contains advertisements from a single advertising network, the advertiser is able to record all six visits.
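The scenario in Figure 7-1 can be sketched as a small simulation. This is a minimal illustration, not a description of how any real advertising network operates: the site names, the `ads.example` network, the `Server` class, and the cookie format are all hypothetical. It assumes only that the browser returns a previously set cookie to the ad server and that each ad request reveals which page embedded the ad.

```python
from itertools import count

# Hypothetical sites that all embed content from one hypothetical ad network.
SITES = [f"site{n}.example" for n in range(1, 7)]
AD_NETWORK = "ads.example"


class Server:
    """Keeps a log of (cookie_id, page) pairs, as a web server might."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self._ids = count(1)

    def handle(self, cookie_id, page):
        # Tag a browser that presents no cookie, then log the request.
        if cookie_id is None:
            cookie_id = f"{self.name}-{next(self._ids)}"
        self.log.append((cookie_id, page))
        return cookie_id


def browse(sites, servers):
    """Simulate one browser visiting each site and fetching its embedded ad."""
    cookie_jar = {}  # one cookie per server, as browsers keep them per domain
    for site in sites:
        for host in (site, AD_NETWORK):
            # The ad request reveals which page embedded the ad (like a Referer).
            cookie_jar[host] = servers[host].handle(cookie_jar.get(host), site)


servers = {name: Server(name) for name in SITES + [AD_NETWORK]}
browse(SITES, servers)

for name, server in servers.items():
    print(f"{name}: {len(server.log)} visit(s) logged")

# Every ad-network entry carries the same cookie ID, so the six visits link up.
ad_log = servers[AD_NETWORK].log
print({cookie for cookie, _ in ad_log}, [page for _, page in ad_log])
```

Run as written, each of the six site servers logs one visit, while the ad network logs all six under a single cookie ID.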
Let’s look at a real-world example by visiting a popular web site, MSNBC (see Figure 7-2). As it turns out, the MSNBC web site is laden with third-party content.
Figure 7-2 Analyzing the MSNBC web site demonstrates that it contains a great deal of third-party content. The problem is rampant among other web sites, large and small.
In a world without third-party content, the user would simply receive content from MSNBC’s domain, msnbc.msn.com. In the real world, however, the user visits 16 additional domains from 10 different companies. Two of these domains, DoubleClick and Google Analytics, are owned by Google. The web browser provides the user with little assistance in detecting third-party content. The user simply sees the browser’s status bar rapidly flicker as the browser contacts each new site. To provide a clearer picture, I captured the raw network activity using the Wireshark protocol-analysis tool and created Table 7-1 to detail each of the third-party domains visited.
Table 7-1. Third-Party Sites Visited When Browsing the MSNBC Web Site
| Domain | Notes |
| --- | --- |
| a365.ms.akamai.net, a509.cd.akamai.net | Domain owned by Akamai.com, a mirroring service for media content |
| ad.3ad.doubleclick.net | Digital marketing service, acquired by Google |
| amch.questionmarket.com | Hosting web site where online surveys are posted |
| c.live.com.nsatc.net, c.msn.com.nsatc.net, rad.msn.com.nsatc.net | Registered to Savvis Communications, a networking and hosting provider |
| context3.kanoodle.com | Search-targeted sponsored links service |
| global.msads.net.c.footprint.net, hm.sc.msn.com.c.footprint.net | Registered to Level 3 Communications, a large network provider |
| msnbcom.112.2o7.net | Registered to Omniture, a web analytics and online business optimization provider |
| prpx.service.mirror-image.net, wrpx.service.mirror-image.net | Registered to Mirror Image Internet, a content delivery, streaming media, and web computing service |
| switch.atdmt.com, view.atdmt.com | Registered to aQuantive, parent company to a family of digital marketing companies |
| www-google-analytics.l.google.com | Traffic measurement and interactive reporting service offered by Google |
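One way to summarize a capture like the one behind Table 7-1 is to group the contacted hostnames by their trailing labels. The sketch below is an illustration under stated assumptions, not the method used to build the table. It treats the last two labels of a hostname as its base domain, a naive heuristic that happens to work for these hosts but fails for suffixes such as co.uk; a more careful tool would consult the Public Suffix List. The function names `base_domain` and `group_by_base_domain` are hypothetical.

```python
from collections import defaultdict

FIRST_PARTY = "msnbc.msn.com"

# Hostnames listed in Table 7-1.
CONTACTED = [
    "a365.ms.akamai.net", "a509.cd.akamai.net",
    "ad.3ad.doubleclick.net",
    "amch.questionmarket.com",
    "c.live.com.nsatc.net", "c.msn.com.nsatc.net", "rad.msn.com.nsatc.net",
    "context3.kanoodle.com",
    "global.msads.net.c.footprint.net", "hm.sc.msn.com.c.footprint.net",
    "msnbcom.112.2o7.net",
    "prpx.service.mirror-image.net", "wrpx.service.mirror-image.net",
    "switch.atdmt.com", "view.atdmt.com",
    "www-google-analytics.l.google.com",
]


def base_domain(host):
    """Naive heuristic (an assumption): keep the last two dot-separated labels."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def group_by_base_domain(hosts, first_party):
    """Map each third-party base domain to the hosts seen under it."""
    groups = defaultdict(list)
    for host in hosts:
        if base_domain(host) != base_domain(first_party):
            groups[base_domain(host)].append(host)
    return dict(groups)


groups = group_by_base_domain(CONTACTED, FIRST_PARTY)
print(sum(len(hosts) for hosts in groups.values()), "third-party hosts")
print(len(groups), "base domains")
for domain, hosts in sorted(groups.items()):
    print(f"  {domain}: {len(hosts)}")
```

For the hosts in Table 7-1, this reports 16 third-party hosts under 10 base domains. A base domain says nothing about corporate ownership, so two groups may still belong to the same company.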
Think about it. Simply visiting a single web page from a popular news service informs 16 third-party servers of the visit, a 16-fold magnification of logging. This is not a manufactured example; it is representative of a common practice. Embedding third-party content in web sites is ubiquitous, and so is the problem. The end result is that web surfers are frequently tracked by companies they’ve never even heard of.
It is also worth considering that information sharing via embedded content doesn’t occur only with “third parties.” Sharing can also occur between ostensibly separate entities that are actually owned by the same parent company. For example, the A9 search engine (an Amazon.com company) inserts search term–related Amazon book advertisements adjacent to search results. These advertisements allow Amazon to track what A9 users search for, click through, and possibly buy online. If the user does make a purchase on Amazon.com, Amazon knows that user’s real-world identity, including billing and shipping information. In the case of A9, Amazon makes clear on the A9 site that A9 is an Amazon.com company, but the important idea is that corporate ownership, and hence implicit information sharing, might not be obvious as users browse the web.
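The effect of shared ownership can be sketched by rolling domains up to a parent company before counting who can observe what. The ownership map below contains only the relationships stated in this chapter (DoubleClick and Google Analytics under Google, A9 under Amazon). The browsing history, the publisher site names, the `google-analytics.com` hostname used to stand in for Google Analytics, and the function name `visibility_by_parent` are assumptions made for illustration.

```python
from collections import defaultdict

# Ownership as stated in the text; a real map would need to be researched.
PARENT = {
    "doubleclick.net": "Google",
    "google-analytics.com": "Google",
    "a9.com": "Amazon",
    "amazon.com": "Amazon",
}

# Hypothetical history: (page visited, domains whose content the page embeds).
HISTORY = [
    ("news.example", ["doubleclick.net"]),
    ("blog.example", ["google-analytics.com"]),
    ("a9.com", ["amazon.com"]),
]


def visibility_by_parent(history, parent):
    """Return the set of pages each owner could observe."""
    seen = defaultdict(set)
    for page, embedded in history:
        # The page's own operator sees the visit, as do embedded providers.
        for domain in [page, *embedded]:
            seen[parent.get(domain, domain)].add(page)
    return dict(seen)


for owner, pages in sorted(visibility_by_parent(HISTORY, PARENT).items()):
    print(owner, sorted(pages))
```

In this toy history, no single domain appears on both publisher pages, yet Google is credited with seeing both once its two services are merged under one owner.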