Why Third-Party Click Fraud Estimates Don't Add Up - 2

By Shuman Ghosemajumder | Wednesday, January 31, 2007

In part 1, I wrote about the recent press releases from click fraud consulting firms on industry click fraud rates. In this post, I'd like to follow up on some of the issues we covered in our August report, "How Fictitious Clicks Occur in Third-Party Click Fraud Audit Reports", and explain why click fraud firms are still making egregious mistakes in (a) click counting, and even more egregious mistakes in (b) click fraud estimation.

To begin, where do third-party click fraud numbers come from? At Google, whenever we detect malicious activity against an advertiser's account, we mark those clicks as invalid, and thus don't charge the advertiser for them. We utilize a number of different automated techniques and algorithms, as well as proactive manual analysis, to do this, analyzing hundreds of different factors. The analysis that we see from third-party auditing firms (including ClickForensics) seems to essentially rely on just one factor, which we call IP frequency. IP frequency is the number of times an IP address clicks within a certain time window. If it clicks too many times, it could be click fraud. On our end, this is a very simple rule which runs in an automated fashion, protecting Google advertisers 24/7. Third-party firms sometimes find the same suspicious IP frequency patterns that our systems do, and include them in their click fraud reports - leading advertisers to request refunds for clicks they were never charged for in the first place.
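
As a rough illustration of what an IP frequency rule looks like, here is a minimal Python sketch. The function name, the one-hour window and the threshold of three clicks are all assumptions made up for this example; they are not the values used by Google or by any auditing firm.

```python
from collections import defaultdict, deque


def flag_by_ip_frequency(clicks, window_seconds=3600, max_clicks=3):
    """Return the set of IPs with more than max_clicks inside any window.

    clicks: iterable of (timestamp_in_seconds, ip_address) tuples.
    window_seconds and max_clicks are hypothetical thresholds.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in sorted(clicks):
        window = recent[ip]
        window.append(ts)
        # Discard clicks older than the window that ends at this click.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_clicks:
            flagged.add(ip)
    return flagged


sample = [(0, "a"), (10, "a"), (20, "a"), (30, "a"), (0, "b"), (5000, "b")]
print(flag_by_ip_frequency(sample))
```

In this sample, address "a" clicks four times within a minute and is flagged, while "b" clicks twice more than an hour apart and is not. A rule like this is only as good as its input: it flags whatever it is told is a click.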

But that is actually not even the most common problem with their analyses. What is far more common is that the reports we receive from them ask for refunds for clicks which do not even exist. This more serious problem comes from the issues we addressed in our August report on fictitious clicks. In that report, we demonstrated the limits of web log based analysis for any analytics purpose (including click fraud analysis) due to the way Internet Explorer, Firefox and other browsers work. Unfortunately, that was a very technical report, which was difficult for many readers to parse. I'll try to provide a simpler explanation here.

Here's the problem: web logs, whether generated by an advertiser or by third-party code on an advertiser's site, cannot directly track ad clicks. Instead, they track visits to a special landing page URL on the advertiser's site (e.g. http://example.com/?adwords ) as a proxy for how many ad clicks occurred. The assumption they're relying upon is that each visit to that URL corresponds to a unique click, and vice versa. But in practice this is not the case. Once a user visits that page, they often browse through the site, navigating through sub pages, and then return to the original landing page by hitting the back button. When the landing page is reloaded in the browser, it appears in the web log as though additional ad "clicks" are occurring. Google can count ad clicks reliably because a click on a Google ad causes the web browser to contact Google, and we then redirect it to the advertiser's landing page. A reload of the advertiser's landing page does not contact Google again. In addition, the referrer URL passed by the browser when users hit the back button is the cached original referrer URL (which says the page came from an ad click), so no analysis based on logs alone can resolve this. This is where the fictitious clicks come from.

When we analyze data from web logs under these default conditions, we find that on average it leads to a 40% inflation of click estimates. You can think of it this way: if 1000 clicks actually occurred, a log based analysis would estimate on average that there were 1400 clicks, 400 of which are fictitious and did not actually occur.
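
The arithmetic can be sketched in a few lines of Python. The model here is an assumption made for illustration: each real ad click produces one landing page hit in the log, plus one extra hit for every time the user returns to the landing page and it reloads. The reload counts are invented to reproduce the 40% average; they are not measured data.

```python
def log_inflation(reloads_per_click):
    """Compare real ad clicks with the landing page hits a web log records.

    reloads_per_click: one entry per real click, giving how many times
    that visitor reloaded the landing page afterwards (hypothetical input).
    """
    actual_clicks = len(reloads_per_click)
    log_hits = sum(1 + reloads for reloads in reloads_per_click)
    fictitious = log_hits - actual_clicks
    return log_hits, fictitious, fictitious / actual_clicks


# 1000 real clicks, with 400 reloads spread across them.
visits = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0] * 100
hits, fictitious, inflation = log_inflation(visits)
print(hits, fictitious, inflation)  # 1400 400 0.4
```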

Now consider the principal analytical tool of third-party click fraud firms: IP frequency. When they see a user browsing through the site, and reloading the landing page multiple times in a short time window, they will classify it as click fraud - even though those "clicks" do not actually exist. It also results in the misclassification of advertisers' best users (the ones who are spending time browsing through their sites) as "fraudulent".

Thus, while click estimates were inflated by 40% on average, click fraud estimates were inflated by much, much higher amounts. As we detailed in our report, we found cases of firms reporting click fraud rates above 100% in some instances due to this problem. We also found that in other instances, clicks classified as "click fraud" by third-party firms produced sales at the same rate as the "good" clicks. In other words, the identification of click fraud by third-party firms was much worse than imprecise - it was not even in the right ballpark, with nearly all of the "bad" clicks they identified actually being fictitious.

The net result was that advertisers were consistently being given false data in reports they trusted, and acting on that data would actually hurt their advertising campaigns. For example, if an advertiser is told certain keywords have higher "fraud rates", they are likely to change their campaign to eliminate spending on those keywords in favor of others, hurting the performance of their campaigns when this information is false. The damage this can do to advertisers' businesses can be quite large.

So is there a solution to this? Yes. Third-party analytics (not click fraud) firms have been aware of the page reload issue for many years, and generally use redirects (rather than web log based tracking) to avoid it. If one is tied to using web site logs (or landing page code generating logs) however, the only solution is to use the AdWords auto-tagging feature. Auto-tagging has been available since 2005, and is a feature which appends a unique ID to the landing page URL for every click, so that the cases of (a) multiple clicks and (b) multiple reloads of the landing page can be easily distinguished.
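
To show why a per-click ID resolves the ambiguity, here is a minimal Python sketch that counts unique clicks from logged landing page URLs. It assumes the ID arrives in a query parameter named gclid, which is the name AdWords auto-tagging uses (the post itself does not name it). The function name and the ID values are made up for the example.

```python
from urllib.parse import urlparse, parse_qs


def count_unique_clicks(landing_urls, id_param="gclid"):
    """Count distinct click IDs among logged landing page URLs.

    Returns (unique_clicks, untagged_hits). Hits without an ID cannot be
    separated into clicks and reloads, so they are counted apart.
    """
    seen_ids = set()
    untagged_hits = 0
    for url in landing_urls:
        values = parse_qs(urlparse(url).query).get(id_param)
        if values:
            seen_ids.add(values[0])
        else:
            untagged_hits += 1
    return len(seen_ids), untagged_hits


hits = [
    "http://example.com/?gclid=AAA",  # first ad click
    "http://example.com/?gclid=AAA",  # same visitor reloads the page
    "http://example.com/?gclid=BBB",  # a second, separate ad click
    "http://example.com/?adwords",    # untagged hit
]
print(count_unique_clicks(hits))  # (2, 1)
```

The two hits that share an ID collapse into one click, while a genuine second click carries a new ID and is still counted.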

Two of the three firms we identified in our report, AdWatcher and ClickFacts, have not made any changes we're aware of. That's discouraging to say the least. ClickForensics claims to have fixed this problem a couple of months ago by requiring their AdWords clients to use auto-tagging, yet despite such a significant change in methodology, their new numbers are nearly the same as their old numbers. Perhaps it hasn't yet been fully or correctly utilized, so the significant corrective drop in their numbers is yet to come. Or perhaps their network is heavily skewed toward non-Google advertisers, and thus they still cannot correct the problem until Yahoo, MSN and others implement their own versions of auto-tagging. Until then, considering that the total number of clicks they're counting could be off by as much as 40%, and their click fraud estimates could be off by much more, there's very little meaning in a difference of 0.1% from Q2 to Q4 - or in any of their other inferred statistics. But most importantly, the fact that they don't take into account the amount that Google already protects advertisers against means that they're not even trying to measure actual click fraud.

   

Comments

There still needs to be discussion of the types of click fraud that are difficult (if not impossible) to detect, such as what originates from botnets. Focusing on the incorrect methodologies of the click fraud firms just deflects attention from the real issues.

CPCcurmudgeon
January 31, 2007, 8:30PM


Greg, I agree with you. Unfortunately, these flawed estimates attract a great deal of publicity and must be debunked quickly before they mislead advertisers, users, and the industry as a whole.

I'll be answering the questions you and others have asked recently in my following posts. Until then, the short answer is that our click fraud protection systems analyze data in a way that is generally independent of the source or method used for the attack. So we're not looking to identify a botnet or a click farm so much as we're trying to identify potentially malicious or fraudulent clicks.

Click fraud which can be "difficult to detect" can come from many different sources, including botnets or even a well-designed manual scheme. Similarly, botnets can also be poorly designed, and their clicks easily detected and filtered.

The challenge is ultimately how do you detect the hard-to-detect click fraud attempts, regardless of their methods. And that is exactly what our click quality team focuses on, and why it's important to stay ahead of fraudsters in technological sophistication. We think we're doing that very well, and I'll write in more detail about this soon.

Shuman, January 31, 2007, 11:26PM


I posted a lengthy reply to you on Matt Cutts' blog. Not everything I post there makes it into the comments. I've also posted a more brief explanation of my concern on Searchengineland (in a comment -- I don't write for them).

One of my complaints is that your responses to the click fraud reports are not addressing concerns I raised last year (and in previous years). Click manipulation has been around since the mid-1990s and the technology was robust before Google existed.

Whether you've got the problems identified by these third-party auditing services (which don't actually address the core concerns either) is another matter. You confidently predict that ClickForensics numbers will adjust downward or -- if they don't -- must reflect other networks' lack of AdWords-like unique URLs.

Merchants need a better reporting system. Let's all agree on that. And merchants need to know why people are clicking on their links. Perhaps the click audits are not giving sufficient consideration to the tendency of consumers to research.

The problem is that Google is not providing sufficient data to back up its claims (an issue I raised last year as well). Your arguments are not based on a clear presentation of evidence, but rather on your summation of the data combined with your interpretation.

People are not inclined to trust that kind of response regardless of how sensitive the data being evaluated truly is. You may be able to win this war of words through repetition -- psychologists, marketers, and propagandists have known for years that repetition is very convincing. But it would be more comforting -- to me, at least -- to see Google actually make the point by substantiating its sweeping generalizations with some real data.

Michael Martinez
February 01, 2007, 11:43AM


Thanks for your feedback, Michael.

Here's the core issue, and I hope you'll agree that it's not a matter of our opinion, but something easy to demonstrate: counting clicks using just web logs or code on an advertiser's landing page generating web logs, without auto-tagging, is subject to the click inflation problem we identified above.

Since anyone can reproduce the behavior we describe above on their own systems, you can see that this is not at all an accurate way of counting clicks.

I went to your other posts, and here is a quick response: there's no question we're not talking about the specific signals and techniques we use to catch the types of sophisticated attacks you mention. If we made that information public it would open our advertisers up to enormous risk. But we do talk about the overall approach we use to detect click fraud, and I've spoken at length to advertisers, agencies, and reporters about how we utilize statistical anomaly detection along hundreds of different factors in order to do so. I'm planning on writing more about that here too.

You mentioned in your post that there are systems that emulate clickthroughs, staytimes, spoof useragents, and are distributed between multiple C-Blocks. We see systems like that (and even more sophisticated ones) too, on a daily basis, and dealing with those types of attacks is one of the key functions of our click quality team. The goal of fraudsters is to make their traffic look organic. To do so in large volumes, in a way we would not be able to detect, would require the spoofing of not just what you mention above, but hundreds of different factors we analyze - and most of which we keep secret. This is why we can do an effective job of protecting advertisers in this area. Like I said, I'll be writing more about this soon.

Thanks again for your feedback, and I hope that helps.

Shuman, February 01, 2007, 12:02PM


Thanks for the quick reply.

You're in the difficult position of being asked to corroborate what you say with generally acceptable data extracts.

It's not that people cannot confirm or test the type of behavior you're talking about. It's that you have not substantiated your points by showing conclusively that these are the primary causes of misinterpretation of data.

Michael Martinez
February 01, 2007, 3:28PM


I can appreciate that you keep the means by which you detect fraud secret (i.e. not disclosed outside of the company), but does that mean that these means are in fact secret? In other words, is it possible that through study of the Internet architecture, fraudsters can not only discern these means, but devise means of their own of thwarting them? This is why I would like to see panels of Internet technical experts discuss the realities of how secret fraudulent traffic detection/generation is.

On a related issue, it strikes me that it's taken over a decade now for open discussion of these matters, limited as it is. This despite the fact that means of compromising computers on wide scales and getting them to generate fraudulent traffic have been known within the technical Internet community as long as the Internet has existed. It seems to me that this type of discussion needed to be had long ago. I can appreciate how advertisers feel wronged by the fact that it's only in response to challenges that certain facts come to light, rather than as a part of general discourse.

CPCcurmudgeon
February 01, 2007, 4:38PM


Here is why you need to disclose data. You're telling the world that people are misinterpreting user behavior when people click on the BACK button in their browsers.

There are two things a browser can do when you hit the BACK button: look at its cache or refetch the Web page. What do most browsers do?

You're not in a position to know what most browsers do. So your assumptions are unsupported by the fact that you don't have access to the browser settings that people are working with.

For example, here is a set of raw server log entries from my own domain where I visited my page from a Google search, clicked through to a secondary page, and then hit the BACK button (I apologize if the formatting doesn't work in your comments window).

aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:03 -0500] "GET / HTTP/1.1" 200 14460 "http://www.google.com/search?hl=en&q=michael+martinez" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:03 -0500] "GET /pics/xenite_org.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:04 -0500] "GET /pics/sf-fandom.gif HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:04 -0500] "GET /pics/vme_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:04 -0500] "GET /pics/parma_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:04 -0500] "GET /pics/michael_portrait_2.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:04 -0500] "GET /pics/ume_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:45 -0500] "GET /books/understanding_middle-earth.html HTTP/1.1" 200 18168 "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:45 -0500] "GET /pics/xenite_org.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:46 -0500] "GET /pics/sf-fandom.gif HTTP/1.1" 304 - "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:46 -0500] "GET /pics/ume_cover_1.jpg HTTP/1.1" 200 9596 "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:46 -0500] "GET /pics/ume_intro_page.jpg HTTP/1.1" 200 20311 "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:46 -0500] "GET /pics/ume_cover_3.jpg HTTP/1.1" 200 9077 "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:46 -0500] "GET /pics/michael_portrait_2.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:47:47 -0500] "GET /pics/ume_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/books/understanding_middle-earth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

What you do not see here is where I actually hit the BACK button. My browser (Internet Explorer) did not refetch the page. And yet, if I were to go back to Google and click through again, that click through would be recorded - in fact, it was:

aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:49 -0500] "GET / HTTP/1.1" 200 14460 "http://www.google.com/search?hl=en&q=michael+martinez" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:49 -0500] "GET /pics/xenite_org.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:50 -0500] "GET /pics/sf-fandom.gif HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:50 -0500] "GET /pics/michael_portrait_2.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:50 -0500] "GET /pics/vme_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:50 -0500] "GET /pics/parma_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
aaa.bbb.ccc.ddd - - [02/Feb/2007:07:51:50 -0500] "GET /pics/ume_cover.jpg HTTP/1.1" 304 - "http://www.michael-martinez.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

On the basis of this one test alone, I have demonstrated that your premise is false under at least one very common condition.

The ease with which people can test browser-server activity and data relationships underscores why people are not taking your arguments seriously. When it is this simple to show that key statements in your position are wrong, you have a serious credibility issue.

I have omitted nothing from my server log data as posted here, except the IP address from which I visited Google and my Web site. Hitting the BACK button produced no result.

Regardless of whether a browser can be configured to refetch the page automatically, your statements about the implications of hitting the BACK button are not consistent with what is actually happening for many surfers.

Michael Martinez
February 02, 2007, 5:03AM


Michael, first of all, thanks for trying this analysis! I appreciate the time and effort.

Unfortunately you've misinterpreted the point a bit here. The back button itself certainly does not cause a page reload on all web pages. A page reload is caused on many web sites which utilize dynamic pages or nocache directives (which includes many advertising and commercial sites). Perhaps my wording in this post "when the landing page is reloaded in the browser" was a bit ambiguous, but we spell this point out very clearly in our August report.

Code placed on an advertiser's landing page expressly for the purpose of tracking visits to that page usually has a nocache directive to prevent that code (e.g. an image or JavaScript tracker) from getting cached. So in those cases the back button almost always reloads the tracker, and thus generates another entry in the log.

If you'd like to try the actual experiment, do what you did above but with a dynamically generated / nocache page and then see if you can tell the difference in your logs between the original page load (the one that happens after the ad click) and subsequent reloads using any additional browser information. You can't - and that's where the actual tracking problems arise.
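
One way to run that comparison is sketched below in Python: parse two log lines in the same combined log format as the entries pasted above and list the fields that differ. The regular expression, the function names and the two sample lines (a first load and a later reload of a non-cached landing page) are all hypothetical, written only to illustrate the check.

```python
import re

# Fields of one combined-format access log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Split one log line into a dict of named fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    return match.groupdict()


def differing_fields(line_a, line_b):
    """Return the names of the fields whose values differ."""
    a, b = parse(line_a), parse(line_b)
    return sorted(name for name in a if a[name] != b[name])


# Hypothetical entries: the load after the ad click, then a reload.
TEMPLATE = (
    '192.0.2.1 - - [02/Feb/2007:{} -0500] "GET /?adwords HTTP/1.1" 200 512 '
    '"http://www.google.com/search?q=example" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"'
)
first_load = TEMPLATE.format("10:00:00")
reload_hit = TEMPLATE.format("10:03:10")
print(differing_fields(first_load, reload_hit))  # ['time']
```

With entries like these, only the timestamp separates the two hits, and a later timestamp is also exactly what a genuine second click would produce.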

Shuman, February 02, 2007, 10:12AM


The stigma of Click Fraud has become damaging to Internet Marketing. The truth is that advertisers are happy for Internet visitors to click on their ads for any purpose. These clicks have powerful eBranding effects and other advantages even if visitors do not need the products being advertised at that particular time of visit. However, the scare of click fraud in the mind of the Internet visitor, and not knowing what click fraud really is, has had deeply damaging effects. Stigmatized in this way, visitors now believe that if they click and don't buy, they have committed fraud. This is as damaging to Internet Marketing and eMarketing strategies as any past short-sighted thinking. Remember the early times when people were afraid of Electric Poles in their neighborhood? What happened? Who really benefited? Not the electric companies, but the people.

I strongly agree with Google that Click Fraud, at the huge magnitude now being projected, does not exist. Google and Yahoo have developed powerful detection methods that either do not count the extra clicks as click fraud or totally discount the clicks. Those who are trying to make a big deal out of this issue are either just plain stupid or jealous of Google's and Yahoo's success. If you ask me, these are the people who have committed click fraud in the past and now are angry that Google and Yahoo have terminated their accounts.

Corey Katir
March 21, 2007, 1:39AM


There still seems to be a somewhat large disparity between what you are saying here and what third party click fraud consultants are reporting - notably that they de-duplicate results by the click IDs that sites like Google use to distinguish between unique clicks through to the advertiser.

Frankly, the numbers cannot be confirmed until we develop a set of readily available metrics that bring together the analysis from the portal's side together with the analysis from the advertiser's side.

Google and third-party consultants need to realise that you will not convince anyone with proprietary analytics and that arguing over semantics at this point is only damaging the industry's image.

Matt O'Kane

Matthew O'Kane
April 20, 2007, 9:06AM


Thanks for the feedback, Matt!

This article was written more than two months ago, and I've found myself asked about click fraud estimates a lot less since then.

I have actually had very positive discussions with third party click fraud auditing firms in recent months and see them moving in a good direction. Several companies have told us that they recognize the limitations of analyzing click fraud from a third party perspective and would never come out with click fraud estimates because of this. The services they are focusing on providing are also a lot broader than click fraud detection -- basically powerful, ROI-based, web analytics -- which also is the best way of monitoring for any instances, however rare, of undetected click fraud.

There are only a few firms still trying to promote average click fraud numbers, but I'm seeing fewer and fewer folks paying attention to them. I think that advertisers have also learned that averages are not meaningful to them, and are focusing on their own campaigns (for which we provide daily invalid clicks reports) for the real data.

It seems the industry overall is moving past arguing about average click fraud rates and onto focusing on working together to help advertisers -- which I think is positive for everyone.

Shuman, April 20, 2007, 7:18PM



 
Copyright © 2003-2008 Shuman Ghosemajumder. All contents available under a Creative Commons License. Opinions on this web site are the author's own. Generated Friday, May 16th, 06:48:07 PM EST.