The NY Times (and a few others) ran a story entitled Microsoft Finds Cancer Clues in Search Queries:
Microsoft scientists have demonstrated that by analyzing large samples of search engine queries they may in some cases be able to identify internet users who are suffering from pancreatic cancer, even before they have received a diagnosis of the disease.
The scientists said they hoped their work could lead to early detection of cancer.
A little further down you read:
The researchers reported that they could identify from 5 to 15 percent of pancreatic cases with false positive rates of as low as one in 100,000.
Before you run to Bing in the hope of getting some miraculous diagnosis, ask yourself: what's the probability that, having tested "positive" on the Bing test, you actually have pancreatic cancer?
(Spoiler: it's 50%, and in the indented paragraphs below there is some math you can skip.)
There is a simple formula for that, called Bayes' formula (or theorem):
$$P(\text{cancer} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{cancer}) \, P(\text{cancer})}{P(\text{positive})}$$
Here $P(\text{positive})$ is the probability that you score positive on the Bing test, regardless of whether you have the cancer or not. This is the probability that you score positive and have the cancer ("true positive" probability times the probability of having the cancer) plus the probability that you score positive without the cancer ("false positive" probability times 1 - the probability of having the cancer).
The article states that the "false positive rates [are] of as low as one in 100,000". The "true positive" is reported as 5-15%, which I will approximate to 10%.
The probability that a random person will be diagnosed with pancreatic cancer is about 10 out of 100,000 (see cancer.org).
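We now have all the numbers to compute the probability that, given that Microsoft flags the cancer, a user actually has it:
$$P(\text{cancer} \mid \text{positive}) = \frac{0.1 \times 0.0001}{0.1 \times 0.0001 + 0.00001 \times (1 - 0.0001)} \approx 0.5$$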
This is not difficult to understand. It's common sense. Out of 100,000 people, 10 have the cancer, and of these 10 only one (10%) receives the diagnosis, so the probability of being rightfully diagnosed by Bing is 1 out of 100,000. Of the remaining healthy ones (practically 100,000), 1 person is diagnosed by mistake. Therefore, out of 100,000 people, Bing will diagnose 2: one is the "true positive", the other the "false positive". The probability of being a true positive is then 50%.
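If you prefer code to formulas, here is a minimal sketch of the same calculation in Python. The function name `posterior` and its parameter names are my own, chosen for illustration; the numbers are the ones quoted above (false positive rate of 1 in 100,000, prevalence of 10 in 100,000). Since the article reports the true positive rate as a 5-15% range, the loop tries both ends as well as the 10% approximation.

```python
def posterior(true_positive_rate, false_positive_rate, prevalence):
    """Return P(cancer | positive) using Bayes' theorem."""
    # Positive AND has the cancer.
    true_pos = true_positive_rate * prevalence
    # Positive AND does not have the cancer.
    false_pos = false_positive_rate * (1 - prevalence)
    # Share of all positives that are true positives.
    return true_pos / (true_pos + false_pos)


FALSE_POSITIVE_RATE = 1 / 100_000   # "as low as one in 100,000"
PREVALENCE = 10 / 100_000           # about 10 out of 100,000

for tpr in (0.05, 0.10, 0.15):
    p = posterior(tpr, FALSE_POSITIVE_RATE, PREVALENCE)
    print(f"true positive rate {tpr:.0%}: P(cancer | positive) = {p:.1%}")
```

With these inputs the result is roughly 33%, 50% and 60% for true positive rates of 5%, 10% and 15%, so even at the optimistic end of the reported range a positive result is far from certain.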
Now, there is some difference between the original article by the two Microsoft researchers (Ryen White and Eric Horvitz) and the NY Times' one. White and Horvitz are not sensationalist in their publication, and for a reason: in 2008, they wrote an article on "cyberchondria", or "unfounded escalation of concerns about common symptomatology, based on the review of search results and literature on the Web". Their view, I believe, is that the Web is a dangerous place for diagnosing yourself.
Nonetheless, the tone of the Times' article is slightly sensationalist. I do believe that 50% is better than nothing. But I also believe that writing about "false positive rates of as low as one in 100,000" is misleading, particularly when the article does not report that the final confidence of the diagnosis is about 50%.
Articles like this, IMHO, are prone to lead to cyberchondria, and it would be a pity for White and Horvitz to achieve exactly the opposite of the result they (supposedly) had in mind when writing their original piece.
Below are a few books I have read on the topic, from the most technical (a great tutorial on Bayesian statistics) to medicine ("The patient will see you now", on how health is being disrupted by data analysis), and the always enjoyable "Numbers Behind Numb3rs".