AOL's search logs: the ultimate “Database Of Intentions”

Google's IMsearch

AOL Labs prompted a weekend of hyperventilation in the ‘blogosphere’ by publishing the search queries from 650,000 users. This mini-scandal may yet prove valuable, however, as it offers an intriguing psychological study of the boundaries of what is considered acceptable privacy.

In his turgid book on Google – one so obsequious and unchallenging that Google bought thousands of copies to give away to its staff – former dot.bust publisher John Battelle enthused about something he called the “database of intentions”. The information collected by search engines, he trumpeted, would be a marketer’s dream, and tell us more about ourselves than we ever realized we could know. AOL’s publication is the first general release of such a database to the public.

But hold on a minute. Is it, really?

AOL’s data was anonymized, with user identification removed. The search logs contained 10.8m normalized queries from 658,086 unique users, collected between March 1 and May 31 this year, amounting to around a third of all queries made by its US users. The data has since been removed, but an AOL research paper which was built on the data can still be found, here [PDF, 228kb]. You may find it about as enlightening as similar studies we’ve covered before (for example, see People more drunk at weekends, researchers discover).

Although the users’ IDs were hidden, and the logs didn’t contain information on what users actually clicked on, some argued that the data permitted personally identifiable information to be inferred from the query logs.

Something called TechCrunch, a weblog devoted to hyping its publisher’s personal investments and companies created by his friends, explained how:

“The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with ‘buy ecstasy’ and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.”

Now, you may be thinking that this only serves people right for conducting vanity searches. But more seriously, there are dangers in following this line of reasoning.

It’s not only individuals who “ego surf”. The person searching on a name could be the individual’s spouse, a member of their family, a colleague, or even their web stalker. (I’ve had a few.)

Similarly, is the query “buy ecstasy” necessarily the intention of a raver, or tweaker? It might be a parent, a neighborhood watch scheme, or a promoter, keen to stamp out drug dealing at his venue before an event.

So the “database of intentions”, then, turns out to be more of a “database of inferences”, as reflective of the inferrer as it is of the web surfer.

And if, as TechCrunch weakly suggests, the act of typing “buy ecstasy” into a search is itself “evidence of a crime”, then there will be a lot of happy policemen out there this evening, for whom the business of catching criminals has just been made a lot easier.

The “precogs” of Philip K Dick’s story Minority Report – who are able to predict crimes before they take place, thus allowing them to be prevented – will no longer be necessary. Plod will simply be able to issue a pre-emptive warrant for a crime that never took place, on the basis of a user’s Google results, no?

So that’s one line of sloppy thinking dealt with. Dwelling on it ignores another, however.

Leave No Trace

Law enforcement agencies, particularly in the US, tend to receive stricter oversight than corporations. The immediate harm for ordinary citizens comes not from paranoid SF fantasies, but from the “database of inferences” being exploited for commercial gain.

More lives are affected every day by the actions of banks, insurance companies and HMOs, than they are by data-mining cops. If your LiveJournal blog contains more frowns than smileys, you may well need to be prescribed a course of an anti-depressant. If your lifestyle involves risky situations in night spots, you may well need to pay a higher insurance premium. Yet such invasive data mining is the inexorable conclusion of overestimating the value of this harvest of so-called “machine wisdom”.

People would rightly be outraged if Big Pharma, banks and the insurance business created “inference profiles” based on one’s data trail. But, wait! That’s what they already do. Human decision-making is playing an ever smaller role in whether credit applications are approved, or what kind of health care is permissible. When corporations do this, they are making implicit moral choices – that one person is more or less deserving than another – but obscuring the decision behind a smokescreen of technology babble.

The addition of an internet clickstream to the mass of data they already possess about you is but a small, incremental step.

So why not tackle this problem at source?

The only solution to the problem of data abuse – and it’s only an inadequate, and very partial answer – is to ensure the data isn’t there to abuse in the first place. If search engines were required to delete their users’ queries as soon as they were made, and to leave no trace, this would greatly diminish the dangers of false inference by law enforcement officials, health companies, banks, HMOs, and anyone else seduced by the lure of a faulty algorithm.

Data that doesn’t exist is also less vulnerable to being stolen.

This would disappoint law enforcement officials, many corporations, and most of all the search engines themselves – Google CEO Eric Schmidt has boasted of building a “Google that knows more about you.”

If it takes a regulatory agency to ensure search engines “Leave No Trace”, so be it. And meanwhile, drooling over such bad metaphors as “Database of Intentions”, or “Collective Intelligence”, is going to make data abuse more, not less, likely.
