Phishers can make victims visit their sites by spoofing emails
from users known by the victim, or within the same domain as the
victim. Recent experiments by Jagatic et al. [3] indicate that a large fraction of college students would visit a site appearing to be recommended by a friend of theirs, and that a majority of the subjects receiving emails appearing to come from a friend entered their login credentials at the site they were taken to. At the same time, it is worth noticing that a notable fraction of the subjects in a control group also entered their credentials; subjects in the control group received an email appearing to come from an unknown person within the same domain as themselves. Even though the same statistics may not apply to the general population of computer users, it is clear that this is a reasonably successful technique for luring people to sites where their browsers will silently be interrogated and the contents of their caches sniffed.
Once a phisher has created an association between an email address and the contents of the browser cache/history, this can be used to target the users in question with phishing emails that - by means of context - appear plausible to their respective recipients. For example, phishers can infer online banking relationships (as was done in [4]), and later send out emails appearing to come from the appropriate financial institutions. Similarly, phishers can detect possible online purchases and then send notifications stating that the payment did not go through, requesting that the recipient follow the included link to correct the credit card information and the billing address. The victims would be taken to a site looking just like the site at which they recently made a purchase, and might begin by entering the login credentials they use with the real site. A wide variety of such tricks can be used to increase the yield of phishing attacks; all benefit from contextual information that can be extracted from the victim's browser.
There are several possible approaches that can be taken to address the above problem at the root - namely, at the information collection stage. First of all, users could be instructed to clear their browser cache and browser history frequently. However, many believe that any countermeasure that is based on (repeated) actions taken by users is doomed to fail. Moreover, the techniques used in [8,4] will also detect bookmarks on some browsers (such as Safari version 1.2). These are not affected by the clearing of the history or the cache, and may be of equal or higher value to an attacker in comparison to the contents of the cache and history of a given user. A second approach would be to once and for all disable all caching and not keep any history data; this approach, however, is highly wasteful in that it eliminates the significant benefits associated with caching and history files. A third avenue to protect users against invasive browser sniffing is a client-side solution that limits (but does not eliminate) the use of the cache. This would be done based on a set of rules maintained by the user's browser or browser plug-in. Such an approach is taken in the concurrent work by Jackson et al. [1]. Finally, a fourth approach, and the one we propose herein, is a server-side solution that prevents cache contents from being verified by means of personalization. Our solution also allows such personalization to be performed by network proxies, such as Akamai.
It should be clear that client-side and server-side solutions not only address the problem from different angles, but also that these different approaches address slightly different versions of the problem. Namely, a client-side solution protects those users who have the appropriate protective software installed on their machines, while a server-side solution protects all users of a given service (but only against intrusions relating to their use of this service). The two are complementary, in particular in that the server-side approach allows ``blanket coverage'' of large numbers of users that have not yet obtained client-side protection, while the client-side approach secures users in the face of potentially negligent service providers. Moreover, if a caching proxy is employed for a set of users within one organization, then this can be abused to reveal information about the behavioral patterns of users within the group even if these users were to employ client-side measures within their individual browsers; abuse of such information is stopped by a server-side solution, like the one we describe.
From a technical point of view, it is of interest to note that
there are two very different ways in which one can hide the
contents of a cache. According to a first approach, one makes it
impossible to find references in the cache to a visited site,
while according to a second approach, the cache is intentionally
polluted with references to all sites of some class,
thereby hiding the actual references to the visited sites among
these. Our solution uses a combination of these two approaches:
it makes it impossible to find references to all internal
URLs (as well as all bookmarked URLs), while causing pollution of
entrance URLs. Here, we use these terms to mean that an
entrance URL corresponds to a URL a person would typically
type to start accessing a site, while an internal URL is
one that is accessed from an entrance URL by logging in,
searching, or following links. For example, the URL
http://test-run.com is an entrance URL since visitors
are most likely to load that URL by typing it in or following a
link from some other web site. The URL
http://test-run.com/logout.jsp, however, is
internal. This URL is far more interesting to a phisher
than the entrance URL; knowing that a client has been to this internal
URL suggests that the client logged out of the web site -- and thus must have logged in. Our
solution will make it infeasible for an attacker to guess the
internal URLs while also providing some obscurity for the
entrance URLs.
Preliminary numbers support our claims that the solution results in only a minimal overhead on the server side, and an almost unnoticeable overhead on the client side. Here, the former overhead is associated with computing one one-way function per client and session, and with a repeated mapping of URLs in all pages served. The latter overhead stems from a small number of ``unnecessary'' cache misses that may occur at the beginning of a new session. We provide evidence that our test implementation would scale well to large systems without resulting in a bottleneck - whether it is used as a server-side or proxy-side solution.
Felten and Schneider [2] described a timing-based attack that made it possible to determine (with some statistically quantifiable certainty) whether a given user had visited a given site or not - simply by determining the retrieval times of consecutive URL calls in a segment of HTML code.
Securiteam [8] showed a history attack analogous to the timing attack described by Felten and Schneider. The history attack uses Cascading Style Sheets (CSS) to infer whether there is evidence of a given user having visited a given site or not. This is done by utilizing the :visited pseudoclass to determine whether a given site has been visited or not, and later to communicate this information by invoking calls to URLs associated with the different sites being detected; the data corresponding to these URLs is hosted by a computer controlled by the attacker, thereby allowing the attacker to determine whether a given site was visited or not. We note that it is not the domain that is detected, but whether the user has been to a given page or not; this has to match the queried site verbatim in order for a hit to occur. The same attack was recently re-crafted by Jakobsson et al. to show the impact of this vulnerability on phishing attacks; a demo is maintained at [4]. This demo illustrates how simple the attack is to perform and sniffs visitors' history in order to display one of the visitor's recently visited U.S. banking web sites.
Our implementation may rely on either browser cookies or an
HTTP header called referer (sic). Cookies are small
amounts of data that a server can store on a client. These bits
of data are sent from the server to client in HTTP headers -
content that is not displayed. When a client requests a document from a server, it sends along with the request any cookie data previously stored by that server. This transfer is automatic, and so using cookies has negligible overhead. The
HTTP-Referer header is an optional piece of information sent to a
server by a client's browser. The value (if any) indicates where
the client obtained the address for the requested document. In
essence it is the location of the link that the client clicked.
If a client either types in a URL or uses a bookmark, no value
for HTTP-Referer is sent.
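As an illustration of the cookie-based variant, the sketch below shows how a pseudonym might be stored via a Set-Cookie header and recovered from the Cookie header of a later request; the cookie name, helper methods, and example values are hypothetical rather than taken from the paper.

```java
// Sketch only: illustrates how a pseudonym might travel in cookies.
// The cookie name and helper names are ours, not from the paper.
public class CookiePseudonym {
    private static final String PSEUDONYM_COOKIE = "pseudonym";

    // Header a server (or translator) would send once per session.
    static String setCookieHeader(String pseudonym) {
        return "Set-Cookie: " + PSEUDONYM_COOKIE + "=" + pseudonym + "; Path=/";
    }

    // Recover the pseudonym from the raw header block of a later request;
    // the browser returns the cookie automatically, so there is no extra
    // round trip and negligible overhead.
    static String extractPseudonym(String rawHeaders) {
        for (String line : rawHeaders.split("\r\n")) {
            if (line.toLowerCase().startsWith("cookie:")) {
                for (String part : line.substring(7).split(";")) {
                    String[] kv = part.trim().split("=", 2);
                    if (kv.length == 2 && kv[0].equals(PSEUDONYM_COOKIE)) {
                        return kv[1];
                    }
                }
            }
        }
        return null; // no pseudonym yet: first contact in this session
    }

    public static void main(String[] args) {
        String hdrs = "GET /account.jsp HTTP/1.1\r\n"
                    + "Host: test-run.com\r\n"
                    + "Cookie: pseudonym=5f3a9c2d41b87e60\r\n";
        System.out.println(extractPseudonym(hdrs)); // prints 5f3a9c2d41b87e60
    }
}
```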
One particularly aggressive attack (depicted in Figure 1) that we need to be concerned with is one in which the attacker obtains a valid pseudonym from the server, and then tricks a victim into using this pseudonym (e.g., by posing as the service provider in question). Thus, the attacker would potentially know the pseudonym extension of URLs for his victim, and would therefore also be able to query the victim's browser for what it has downloaded.
[Figure 1: An attacker obtains a valid pseudonym from the server and tricks a victim into using it.]
We let A be an adversary controlling any server except the server S to be protected, and interacting with both S and a client C some polynomial number of times in the length of a security parameter k. When interacting with S, A may post arbitrary requests and observe the responses; when interacting with C, it may send any document d to C, forcing C to attempt to resolve this document by performing the associated queries. Here, d may contain any polynomial number of URLs of A's choice. A first goal of A is to output a pair (C, u) such that C has visited the URL u, where u is an internal (or bookmarked) URL associated with S. A second goal of A is to output a pair (C, u) such that C has visited the URL u, where u is an entrance URL indicated by S.
We say that a policy P - the set of rules by which S responds to requests - is perfectly privacy-preserving if A will not attain the first goal except with a probability that is negligible in the length of the security parameter k; the probability is taken over the random coin tosses made by A, S, and C. Similarly, we say that P is n-privacy-preserving (for a parameter n that will correspond to the size of an anonymity set) if A will not attain the second goal except with negligible probability.
Furthermore, we let E be a search engine; this is allowed to interact with S some polynomial number of times in k. For each interaction, E may post an arbitrary request and observe the response. The strategy used by E is independent of P, i.e., E is oblivious of the policy used by S to respond to requests. Thereafter, E receives a query q from C, and has to output a response. We say that P is searchable if and only if E can generate a valid response to the query q, where a response is considered valid if and only if it can be successfully resolved by C.
In the next section, we describe a solution that corresponds to a policy P that is searchable, and which is perfectly privacy-preserving with respect to internal URLs and bookmarked URLs, and n-privacy-preserving with respect to entrance URLs, for a value n corresponding to the maximum anonymity set of the service offered.
At the heart of our solution is a filter associated with a server whose resources and users are to be protected. Similar to how middleware is used to filter calls between application layer and lower-level layers, our proposed filter modifies communication between users/browsers and servers - whether the servers are the actual originators of information, or simply act on behalf of these, as is the case for network proxies.
When interacting with a client (in the form of a web browser), the filter customizes the names of all files (and the corresponding links) in a manner that is unique for the session, and which cannot be anticipated by a third party. Thus, such a third party is unable to verify the contents of the cache/history of a chosen victim; this can only be done by somebody with knowledge of the name of the visited pages.
Pseudonyms and temporary pseudonyms are selected from a sufficiently large space, e.g., of 64-128 bits in length. Temporary pseudonyms include redundancy, allowing verification of validity by parties who know the appropriate secret key; pseudonyms do not need such redundancy, but can be verified to be valid using techniques to be detailed below.
Pseudonyms are generated pseudorandomly each time any visitor starts browsing at a web site. Once a pseudonym has been established, the requested page is sent to the client using the translation methods described next.
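The following sketch illustrates one way to realize this; the 128-bit length, hex encoding, and HMAC-SHA-256 redundancy are our choices, consistent with but not mandated by the description above.

```java
import java.security.SecureRandom;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch of pseudonym generation; the paper only asks for 64-128 bits and
// "redundancy verifiable with a secret key", so the concrete choices here
// (128 bits, HMAC-SHA-256, 64-bit tag) are illustrative assumptions.
public class Pseudonyms {
    private static final SecureRandom RNG = new SecureRandom();

    // Fresh per-session pseudonym: 128 pseudorandom bits, hex-encoded.
    static String newPseudonym() {
        byte[] b = new byte[16];
        RNG.nextBytes(b);
        return hex(b);
    }

    // Temporary pseudonym = random part plus a truncated MAC tag, so any
    // party holding 'key' can check validity without a database lookup.
    static String newTemporaryPseudonym(byte[] key) throws Exception {
        String p = newPseudonym();
        return p + tag(key, p);
    }

    static boolean verifyTemporary(byte[] key, String temp) throws Exception {
        if (temp.length() != 32 + 16) return false;   // 32 hex chars + 16-char tag
        String p = temp.substring(0, 32);
        return tag(key, p).equals(temp.substring(32));
    }

    private static String tag(byte[] key, String data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] full = mac.doFinal(data.getBytes("UTF-8"));
        return hex(full).substring(0, 16);            // 64-bit truncated tag
    }

    private static String hex(byte[] b) {
        StringBuilder s = new StringBuilder();
        for (byte x : b) s.append(String.format("%02x", x));
        return s.toString();
    }
}
```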
HTTP-Referer is an optional header field. Most modern browsers
provide it (IE, Mozilla, Firefox, Safari) but it will not
necessarily be present in case of a bookmark or manually typed in
link. This means that the referer will be within the protected server's domain if the link that was clicked appeared on one of the pages served by that server. This lets us determine whether we can skip the pseudonym generation phase. Thus, one approach to determining the validity of a pseudonym is to accept a pseudonym embedded in a request only when the referer lies within the server's own domain (or when a matching cookie is present), and to generate a fresh pseudonym otherwise.
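A minimal sketch of such a referer test follows; the helper name and the treatment of missing or malformed referers are our choices, not prescribed by the paper.

```java
import java.net.URI;

// Sketch of the referer test described above.
public class RefererCheck {

    // Returns true when the clicked link came from one of serverHost's own
    // pages, i.e. the pseudonym already embedded in the request was issued
    // by us and pseudonym generation can be skipped.
    static boolean refererIsInternal(String refererHeader, String serverHost) {
        if (refererHeader == null || refererHeader.isEmpty()) {
            return false;             // bookmark or typed-in URL: no referer sent
        }
        try {
            String host = new URI(refererHeader).getHost();
            return host != null && host.equalsIgnoreCase(serverHost);
        } catch (Exception e) {
            return false;             // malformed referer: play it safe
        }
    }

    public static void main(String[] args) {
        System.out.println(refererIsInternal(
            "http://test-run.com/index.jsp?p=5f3a9c2d", "test-run.com")); // true
        System.out.println(refererIsInternal(null, "test-run.com"));      // false
    }
}
```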
Robot processes require special treatment. One could - using a whitelist approach - allow certain types of robot processes to obtain data that is not pseudonymized; an example of a process with such permission would be a crawler for a search engine. As an alternative, any search engine may be served data that is customized using temporary pseudonyms - these will be replaced with a fresh pseudonym each time they are accessed. All other processes are served URLs with a pseudo-randomly chosen (and then static) pseudonym, where the exact choice of pseudonym cannot be anticipated by a third party.
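A sketch of such a serving policy, keyed on the User-Agent header, is given below; the whitelist contents and classification rules are illustrative assumptions rather than the paper's prescription.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the serving policy discussed above.
public class RobotPolicy {
    enum Treatment { PLAIN, TEMPORARY_PSEUDONYM, STATIC_PSEUDONYM }

    // Trusted crawlers that may index the non-customized site.
    private static final Set<String> WHITELIST =
        new HashSet<>(Arrays.asList("trusted-crawler"));

    static Treatment classify(String userAgent) {
        String ua = userAgent == null ? "" : userAgent.toLowerCase();
        if (WHITELIST.contains(ua)) {
            return Treatment.PLAIN;               // privacy agreement in place
        }
        if (ua.contains("bot") || ua.contains("crawler") || ua.contains("wget")) {
            return Treatment.TEMPORARY_PSEUDONYM; // replaced on first access
        }
        return Treatment.STATIC_PSEUDONYM;        // ordinary human visitor
    }
}
```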
More particularly, if there is a privacy agreement between the server and the search engine, then the server may allow the search engine to index its site in a non-customized state; upon generating responses to queries, the search engine would customize the corresponding URLs using pseudo-randomly selected pseudonyms. These can be selected in a manner that allows the server to detect that they were externally generated, allowing the server to immediately replace them with freshly generated pseudonyms. In the absence of such arrangements, the indexed site may serve the search engine URLs with temporary pseudonyms (generated and authenticated by itself) instead of non-customized URLs or URLs with (non-temporary) pseudonyms. Note that in this case all users receiving a URL with a temporary pseudonym from the search engine would receive the same pseudonym. This corresponds to a degradation of privacy in comparison to the situation in which there is an arrangement between the search engine and the indexed site, but an improvement compared to a situation in which non-customized URLs are served by the search engine. We note that in either case, the search engine is unable to determine what internal pages on an indexed site a referred user has visited.
The case in which a client-side robot is accessing data corresponds to another interesting situation. Such a robot will not alter the browser history of the client (assuming it is not part of the browser), but will impact the client cache. Thus, such robots should not be exempted from customization, and should be treated in the same way as search engines without privacy arrangements, as described above.
In the implementation section, we describe these (server-side) policies in greater detail. We also note that these issues are orthogonal to the issue of how robots are handled on a given site, were our security enhancement not to be deployed. In other words, at some sites, where robots are not permitted whatsoever, the issue of when to perform personalization (and when not to) becomes moot.
When a client's cache is polluted, the entries must either be chosen at random or be a list of sites that all provide the same pollutants. Say that when Alice accesses a site s1, her cache is polluted with sites s2, s3, and s4. If these are the chosen pollutants each time, the presence of these three sites in Alice's cache is enough to determine that she has visited s1. However, if all four sites s1, s2, s3, and s4 pollute with the same list of sites, no such determination can be made.
If s1 cannot guarantee that all of the sites in its pollutants list will provide the same list, it must randomize which pollutants it provides. Taken from a large list of valid sites, a random set of pollutants essentially acts as a bulky pseudonym that preserves the privacy of the client - which of these randomly provided sites was actually targeted cannot be determined by an attacker.
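A minimal sketch of the randomized variant follows, under the assumption that the decoy sites are drawn from a large list of comparable, valid candidate sites; the class and method names are placeholders.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of randomized pollution: draw a fresh random subset of decoy
// sites for each visitor, so the subset itself identifies nothing.
public class Pollutants {
    private static final SecureRandom RNG = new SecureRandom();

    static List<String> pickPollutants(List<String> candidates, int count) {
        List<String> copy = new ArrayList<>(candidates);
        Collections.shuffle(copy, RNG);
        return copy.subList(0, Math.min(count, copy.size()));
    }
}
```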
It is important to note that the translator should never translate off-site pages; doing so could cause the translator software to start acting as an open proxy. To prevent this, the set of external URLs that it is allowed to serve should be kept small.
Redirection may not be necessary, depending on the trust
relationships between the external sites and the protected
server, although for optimal privacy either redirection should be
implemented or off-site images and URLs should be removed from
internal pages. Assuming that redirection is implemented,
the translator has to modify off-site URLs to redirect through
itself, except in cases in which two domains collaborate and
agree to pseudonyms set by the other, in which case we may
consider them the same domain, for the purposes considered
herein. This allows the opportunity to put a pseudonym in URLs
that point to off-site data. This is also more work for the
translator and could lead to serving unnecessary pages. Because
of this, it is up to the administrator of the translator (and
probably the owner of the server) to set a policy of what should
be redirected through the translator. We refer to this as an off-site redirection
policy. It is worth noting that many sites with a potential
interest in our proposed measure (such as financial institutions)
may never access external pages unless these belong to partners;
such sites would therefore not require off-site redirection
policies.
Similarly, a policy must be set to determine what types of files get translated by the translator. The set of scanned types should be chosen by an administrator; we call this the data replacement policy.
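The two policies can be captured in a small configuration object, as in the sketch below; the class name, fields, and default values are our assumptions rather than anything prescribed by the paper.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative container for the two policies discussed above.
public class TranslatorPolicy {
    // Data replacement policy: only these content types are scanned for URLs.
    final Set<String> translatedTypes =
        new HashSet<>(Arrays.asList("text/html", "text/css"));

    // Off-site redirection policy: external domains that may be redirected
    // through the translator (kept small to avoid acting as an open proxy).
    final Set<String> redirectableDomains =
        new HashSet<>(Arrays.asList("partner.example.com"));

    boolean shouldTranslate(String contentType) {
        return contentType != null && translatedTypes.contains(contentType);
    }

    boolean mayRedirect(String host) {
        return host != null && redirectableDomains.contains(host.toLowerCase());
    }
}
```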
The translator notices the pseudonym on the end of the request, so it removes it, verifies that it is valid (e.g., using cookies or the HTTP-Referer header), and then forwards the request to the server. When a response is given by the server, the translator re-translates the page (following the steps mentioned above) with the same pseudonym, which is obtained from the request.
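A minimal sketch of this round trip is shown below, under the assumption that the pseudonym travels as a query parameter p at the end of the URL; the encoding, helper names, regular expression, and backend stub are ours, and real verification would use the cookie or referer checks sketched earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// High-level sketch of the translator's request/response round trip.
public class TranslatorFlow {
    private static final Pattern PSEUDONYM_SUFFIX =
        Pattern.compile("([?&]p=)([0-9a-f]{32})$");

    interface Backend { String fetch(String path); }   // stands in for the server

    static String handle(String requestedPath, Backend server) {
        Matcher m = PSEUDONYM_SUFFIX.matcher(requestedPath);
        String pseudonym;
        String cleanPath;
        if (m.find() && isValid(m.group(2))) {
            pseudonym = m.group(2);                     // reuse existing pseudonym
            cleanPath = requestedPath.substring(0, m.start());
        } else {
            // Fresh 128-bit pseudonym (UUID used here only for brevity).
            pseudonym = java.util.UUID.randomUUID().toString().replace("-", "");
            cleanPath = requestedPath;
        }
        String page = server.fetch(cleanPath);          // forward stripped request
        return retranslate(page, pseudonym);            // re-attach pseudonym to links
    }

    private static boolean isValid(String p) { return p.length() == 32; }

    private static String retranslate(String html, String pseudonym) {
        // Append the pseudonym to every local href; a fuller implementation
        // would also cover src/action attributes and CSS url() references.
        return html.replaceAll("href=\"(/[^\"]*)\"", "href=\"$1?p=" + pseudonym + "\"");
    }

    public static void main(String[] args) {
        Backend demo = path -> "<a href=\"/logout.jsp\">log out</a>";
        System.out.println(handle("/account.jsp", demo));
    }
}
```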
Distinguishing safe from unsafe sites can be difficult
depending on the content and structure of the server's web site.
Redirecting all URLs that are referenced from the protected server's domain will ensure good privacy,
but this places a larger burden on the translator. Servers that
do not reference offsite URLs from ``sensitive'' portions of
their site could minimize redirections while those that do should
rely on the translator to privatize the clients' URLs.
Since the types of data served by the back-end server are controlled by its administrators (who are in charge of the translator as well), the data types that are translated can easily be set. The people in charge of the server's content can ensure that sensitive URLs are only placed in certain types of files (such as HTML and CSS) - then the translator only has to process those files.
Herein, we argue why our proposed solution satisfies the previously stated security requirements. This analysis is rather straightforward, and only involves a few cases.
It is worth noting that while clients can easily manipulate the pseudonyms, there is no benefit associated with doing this, and what is more, it may have detrimental effects on the security of the client. Thus, we do not need to worry about such modifications since they are irrational.
We implemented a rough prototype translator to estimate ease
of use as well as determine approximate efficiency and accuracy.
Our translator was written as a Java application that sat between a client and the protected site. The translator performed user-agent detection (for identifying robots); pseudonym generation and assignment; translation (as described in Section 4.2); and redirection of external (off-site) URLs. We placed the translator on a separate machine from the protected server in order to get an idea of
the worst-case timing and interaction requirements, although they
were on the same local network. The remote client was set up on
the Internet outside that local network.
In an ideal situation, a web site could be augmented with a translator easily: first, the software serving the site is changed to serve data on the computer's loopback interface (127.0.0.1) instead of through the external network interface; second, the translator is installed so that it listens on the external network interface and forwards requests to the server on the loopback interface. To the outside world it appears that nothing has changed: the translator now listens closest to the clients, at the same address where the server listened before. Additionally, extensions to a web server may make implementing a translator very easy.
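The wiring described above can be illustrated with a minimal TCP relay; this sketch shows only the loopback arrangement (no translation), the port numbers are illustrative, and error handling and socket cleanup are omitted.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of the deployment wiring: accept connections on the
// external port and relay bytes to the real server, which now listens
// only on the loopback interface.
public class LoopbackRelay {
    public static void main(String[] args) throws Exception {
        try (ServerSocket external = new ServerSocket(80)) {
            while (true) {
                Socket client = external.accept();
                Socket server = new Socket("127.0.0.1", 8080);
                pump(client.getInputStream(), server.getOutputStream());
                pump(server.getInputStream(), client.getOutputStream());
            }
        }
    }

    // Copy one direction of the connection on its own thread.
    private static void pump(InputStream in, OutputStream out) {
        new Thread(() -> {
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (Exception ignored) {
            }
        }).start();
    }
}
```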
A client sent requests to our prototype and the URL was scanned for an instance of the pseudonym. If the pseudonym was not present, it was generated for the client as described and then stored only until the response from the server was translated and sent back to the client.
Most of the parsing was done in the header of the HTTP requests and responses. We implemented a simple data replacement policy for our prototype: any value for User-Agent that was not ``robot'' or ``wget'' was assumed to be a human client. This allowed us to easily write a script using the command-line wget tool in order to pretend to be a robot. Any content would simply be served in basic proxy mode if the User-Agent was identified as one of these two.
Additionally, if the content type was not text/html, then the associated data in the data stream was simply forwarded back and forth between client and server in a basic proxy fashion. HTML data was intercepted and parsed to replace URLs in common context locations, such as the href attributes of links and the src attributes of embedded images.
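As an illustration of this parsing step, the sketch below rewrites local URLs found in a few common attribute contexts; the attribute set and the query-parameter encoding of the pseudonym are our assumptions rather than the prototype's exact behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the attribute rewriting step.
public class UrlRewriter {
    // Matches href/src/action attributes that point to local (on-site) paths.
    private static final Pattern LOCAL_URL =
        Pattern.compile("(href|src|action)=\"(/[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String pseudonym) {
        Matcher m = LOCAL_URL.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String sep = m.group(2).contains("?") ? "&" : "?";
            m.appendReplacement(out,
                Matcher.quoteReplacement(
                    m.group(1) + "=\"" + m.group(2) + sep + "p=" + pseudonym + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<a href=\"/logout.jsp\">log out</a> <img src=\"/logo.png\">";
        System.out.println(rewrite(page, "5f3a9c2d41b87e605f3a9c2d41b87e60"));
    }
}
```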
We measured the amount of time it took to completely send the client's request and receive the entire response. This was measured for eight differently sized HTML documents, 1000 times each. As a conservative estimate, we set up the client to load only single HTML pages - in reality, a smaller fraction of requests will require translation, since many requests are for images and other content that is simply passed through the translator in basic proxy mode. Because of this, we can conclude that the actual impact of the translator on a robust web site will be less significant than our findings suggest.
[Figures 4 and 5: measured request/response times for the prototype translator, with translation and in basic proxy mode.]
Our data (Figures 4 and 5) shows that the translation of pages does not create noticeable overhead on top of what it takes for the translator to act as a basic proxy. Moreover, acting as a basic proxy creates so little overhead that delays in transmission via the Internet completely shadow any performance hit caused by our translator (Table 1). We conclude that the use of a translator in the fashion we describe will not cause a major performance hit on a web site.
Cookies that are set or retrieved by external sites (not the translated server) will not be translated by the translator. This is because the translator in effect only represents its server and not any external sites.
[1] C. Jackson, A. Bortz, D. Boneh, and J. C. Mitchell, ``Web Privacy Attacks on a Unified Same-Origin Browser,'' in submission.
[2] E. W. Felten and M. A. Schneider, ``Timing Attacks on Web Privacy,'' in S. Jajodia and P. Samarati, editors, Proceedings of the 7th ACM Conference on Computer and Communications Security (CCS 2000), pp. 25-32.
[3] T. Jagatic, N. Johnson, M. Jakobsson, and F. Menczer, ``Social Phishing,'' 2006.
[4] M. Jakobsson, T. Jagatic, and S. Stamm, ``Phishing for Clues,'' www.browser-recon.info
[5] M. Jakobsson, ``Modeling and Preventing Phishing Attacks,'' Phishing Panel at Financial Cryptography '05, 2005.
[6] B. Grow, ``Spear-Phishers are Sneaking in,'' BusinessWeek, July 11, 2005, no. 3942, p. 13.
[7] M. Jakobsson and S. Myers, ``Phishing and Counter-Measures: Understanding the Increasing Problem of Electronic Identity Theft,'' Wiley-Interscience, July 2006, ISBN 0-4717-8245-9.