In part 1, I wrote about the recent press releases from click fraud consulting
firms on industry click fraud rates. In this post, I'd like to follow up on some
of the issues we covered in our August report, "How Fictitious Clicks Occur in
Third-Party Click Fraud Audit Reports", and explain why click fraud firms are
still making egregious mistakes in (a) click counting, and even more egregious
mistakes in (b) click fraud estimation.
To begin, where do third-party click fraud numbers come from? At Google, whenever
we detect malicious activity against an advertiser's account, we mark those
clicks as invalid, and thus don't charge the advertiser for them. We utilize
a number of different automated techniques and algorithms, as well as proactive
manual analysis, to do this, analyzing hundreds of different factors. The analysis
that we see from third-party auditing firms (including ClickForensics) seems
to essentially rely on just one factor, which we call IP frequency. IP frequency
is the number of times an IP address clicks within a certain time window. If
it clicks too many times, it could be click fraud. On our end, this is a very
simple rule which runs in an automated fashion, protecting Google advertisers
24/7. Third-party firms sometimes find the same suspicious IP frequency patterns
that our systems do, and include them in their click fraud reports - leading
advertisers to request refunds for clicks they were never charged for in the
first place.
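To make the idea of IP frequency concrete, here is a minimal sketch of such a rule in Python. The function name, the one-hour window and the threshold of three clicks are illustrative assumptions only; they are not the values used by Google or by any auditing firm.

```python
from collections import defaultdict, deque


def flag_by_ip_frequency(clicks, window_seconds=3600, max_clicks=3):
    """Return the set of IPs that exceed max_clicks within any sliding window.

    clicks: iterable of (timestamp_seconds, ip) tuples.
    window_seconds and max_clicks are hypothetical values for illustration.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in sorted(clicks):
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the time window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_clicks:
            flagged.add(ip)
    return flagged
```

A rule like this knows nothing about whether each "click" it is fed was a real, billed ad click, which is why the quality of the input data matters so much.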
But that is actually not even the most common problem with their analyses.
What is far more common is that the reports we receive from them ask for refunds
for clicks which do not even exist. This more serious problem comes from the
issues we addressed in our August report on fictitious clicks. In that report,
we demonstrated the limits of web log based analysis for any analytics purpose
(including click fraud analysis) due to the way Internet Explorer, Firefox and
other browsers work. Unfortunately, that was a very technical report, which
was difficult for many readers to parse. I'll try to provide a simpler explanation
here.
Here's the problem: web logs, whether generated by an advertiser, or by third-party
code on an advertiser's site, cannot directly track ad clicks. Instead, they
track visits to a special landing page URL on the advertiser's site (e.g.
http://example.com/?adwords) as a proxy for how many ad clicks occurred. The
assumption they're relying
upon is that each visit to that URL corresponds to a unique click, and vice
versa. But in practice this is not the case. Once a user visits that page, they
often browse through the site, navigating through sub pages, and then return
to the original landing page by hitting the back button. When the landing page
is reloaded in the browser, it appears in the web log as though additional ad
"clicks" are occurring. Google can count ad clicks reliably because a click
on a Google ad causes the web browser to contact Google, which then redirects
it to the advertiser's landing page. A reload of the advertiser's landing page
does not contact Google again. In addition, the referrer URL passed by the
browser when users hit the back button is actually the cached original referrer
URL (which says the page came from an ad click), so no analysis based on logs
alone can resolve this. This
is where the fictitious clicks come from.
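The difference between the two counting methods can be sketched with a toy simulation. Everything here is a simplifying assumption for illustration: the action names, the "/products" sub page, and the premise that every press of the back button makes the browser request the landing page again.

```python
LANDING = "/?adwords"  # the special landing page URL used as a click proxy


def simulate(session):
    """Return (redirect_count, weblog_count) for one visitor's session.

    session: list of actions, each one of 'ad_click', 'subpage' or 'back'.
    These action names are hypothetical, chosen for this sketch.
    """
    redirect_count = 0  # what a redirect-based counter sees
    weblog = []         # what the advertiser's web log records
    for action in session:
        if action == "ad_click":
            redirect_count += 1     # browser contacts the ad network first
            weblog.append(LANDING)  # then lands on the advertiser's page
        elif action == "subpage":
            weblog.append("/products")
        elif action == "back":
            weblog.append(LANDING)  # page reloaded; ad network not contacted
    weblog_count = sum(1 for url in weblog if url == LANDING)
    return redirect_count, weblog_count
```

For a session of one ad click followed by two rounds of browsing a sub page and going back, the redirect counter sees one click while the web log shows three landing page hits.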
When we analyze data from web logs under these default conditions, we find
that on average it leads to a 40% inflation of click estimates. You can think
of it this way: if 1000 clicks actually occurred, a log based analysis
would estimate on average that there were 1400 clicks, 400 of which are fictitious
and did not actually occur.
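The arithmetic above can be written out as a short sketch. The function names are made up for this example, and the 40% figure is the average quoted in this post, not a constant that holds for every site.

```python
def log_based_estimate(actual_clicks, inflation=0.40):
    """Clicks a web log analysis would report, given the average inflation."""
    return actual_clicks * (1 + inflation)


def fictitious_clicks(actual_clicks, inflation=0.40):
    """Reported clicks that never happened."""
    return log_based_estimate(actual_clicks, inflation) - actual_clicks


def fictitious_share_of_log(inflation=0.40):
    """Fraction of logged landing page hits that are not real clicks."""
    return inflation / (1 + inflation)
```

Note that a 40% inflation means 400 out of every 1400 logged hits, or roughly 29% of what the log shows, correspond to no click at all.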
Now consider the principal analytical tool of third-party click fraud firms:
IP frequency. When they see a user browsing through the site, and reloading
the landing page multiple times in a short time window, they will classify it
as click fraud - even though those "clicks" do not actually exist.
It also results in the misclassification of advertisers' best users (the ones
who are spending time browsing through their sites) as "fraudulent".
Thus, while click estimates were inflated by 40% on average, click fraud estimates
were inflated by much, much higher amounts. As we detailed in our report, we
found cases of firms reporting click fraud rates above 100% in some
instances due to this problem. We also found that in other instances, clicks
classified as "click fraud" by third-party firms produced sales at
the same rate as the "good" clicks. In other words, the identification
of click fraud by third-party firms was much worse than imprecise - it was not
even in the right ballpark, with nearly all of the "bad" clicks they
identified actually being fictitious.
The net result was that advertisers were consistently being given false data
from reports they trusted, which would actually hurt their advertising campaigns
if they acted on them. For example, if an advertiser is told certain keywords
have higher "fraud rates", they are likely to change their campaign
to eliminate spending on those keywords in favor of others, hurting the performance
of their campaigns when this information is false. The damage this can do to
advertisers' businesses can be quite large.
So is there a solution to this? Yes. Third-party analytics (not click fraud)
firms have been aware of the page reload issue for many years, and generally
use redirects (rather than web log based tracking) to avoid it. If one is tied
to using web site logs (or landing page code generating logs) however, the only
solution is to use the AdWords auto-tagging feature. Auto-tagging has been
available since 2005, and is
a feature which appends a unique ID to the landing page URL for every click,
so that the cases of (a) multiple clicks and (b) multiple reloads of the landing
page can be easily distinguished.
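As a rough sketch of how a log based tool could use that unique ID, the code below counts distinct IDs instead of raw landing page hits. It assumes the ID arrives in a query parameter named "gclid"; the function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def count_tagged_clicks(logged_urls, param="gclid"):
    """Return (distinct_click_ids, untagged_hits) for a list of logged URLs.

    Reloads of the same landing page carry the same ID, so they collapse
    into a single click. The parameter name is an assumption of this sketch.
    """
    seen_ids = set()
    untagged = 0
    for url in logged_urls:
        ids = parse_qs(urlparse(url).query).get(param)
        if ids:
            seen_ids.add(ids[0])
        else:
            untagged += 1  # no click ID, so reloads cannot be told apart
    return len(seen_ids), untagged
```

With this approach, two hits on http://example.com/?gclid=abc count as one click, while a hit carrying a different ID counts as a second, genuinely separate click.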
Two of the three firms we identified in our report, AdWatcher and ClickFacts,
have not made any changes we're aware of. That's discouraging to say the least.
ClickForensics claims to have fixed this problem a couple of months ago by requiring
their AdWords clients to use auto-tagging, yet despite such a significant change
in methodology, their new numbers are nearly the same as their old numbers.
Perhaps it hasn't yet been fully or correctly utilized, so the significant corrective
drop in their numbers is yet to come. Or perhaps their network is heavily skewed
toward non-Google advertisers, and thus they still cannot correct the problem
until Yahoo, MSN and others implement their own versions of auto-tagging. Until
then, considering that the total number of clicks they're counting could be
off by as much as 40%, and their click fraud estimates could be off by much
more, there's very little meaning in a difference of 0.1% from Q2 to Q4 - or
in any of their other inferred statistics. But most importantly, the fact that
they don't take into account the invalid clicks that Google already detects and
never charges advertisers for means that they're not even trying to measure
actual click fraud.