How Bing Can Identify Early-Stage Cancers

Screening for Pancreatic Adenocarcinoma Using Signals From Web Search Logs: Feasibility Study and Results

John Paparrizos, MSc,
Ryen W. White, PhD, and
Eric Horvitz, MD, PhD

Author Affiliations

Columbia University, New York, NY; and Microsoft Research, Redmond, WA
Corresponding author: Ryen W. White, PhD, Microsoft Research, One Microsoft Way, Redmond, WA 98052; e-mail: ryenw@microsoft.com.

Introduction

Pancreatic adenocarcinoma poses a difficult and resistant challenge in oncology. It is the fourth leading cause of cancer death in the United States and the sixth leading cause of cancer death in Europe.¹ The illness is frequently diagnosed too late to be treated effectively²,³ and can progress from stage I to stage IV in just over 1 year.⁴ Approximately 75% of patients with pancreatic adenocarcinoma who are not candidates for surgery will die within 1 year of diagnosis, and only 4% will survive for 5 years postdiagnosis.⁵

Early signs and symptoms of pancreatic adenocarcinoma are subtle and often present as nonspecific symptoms that appear and evolve over time. The symptoms often do not become salient until the disease has metastasized. We studied a nontraditional, yet promising direction for the early detection of pancreatic adenocarcinoma. The approach centers on the analysis of signals from Web search logs. Specifically, we examined the feasibility of detecting “fingerprints” of the early rise of pancreatic adenocarcinoma via population-scale statistical analyses of the activity logs of millions of people performing searches on sets of relevant symptoms.

Symptoms and Risk Factors

We reviewed the signs, symptoms, and risk factors associated with pancreatic adenocarcinoma. We developed a symptom set covering the following concerns: yellowing sclera or skin, blood clot, light stool, loose stool, enlarged gall bladder, dark urine, floating stool, greasy stool, dark or tarry stool, high blood sugar, sudden weight loss, taste changes, smelly stool, itchy skin, nausea or vomiting, indigestion, abdominal swelling or pressure, abdominal pain, constipation, and loss of appetite. Synonyms for each symptom were identified (eg, symptom: yellowing sclera or skin, synonym: jaundice; symptom: abdominal pain, synonyms: belly pain, stomach ache). We also identified risk factors (eg, pancreatitis, alcoholism) and their associated synonyms (see Lowenfels and Maisonneuve³⁴), describing attributes or characteristics that may increase the likelihood of developing pancreatic adenocarcinoma. The symptoms and the risk factors were mapped to terms in search queries.
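The mapping from symptoms and synonyms to query terms can be illustrated with a minimal Python sketch. It is not the study's implementation: the dictionary below holds only the two synonym examples given above, and the function name `symptoms_in_query` and the whole-phrase, case-insensitive matching rule are assumptions made for illustration.

```python
import re

# Illustrative subset only: the two examples named in the text.
# The study's full symptom and synonym lists are not reproduced here.
SYMPTOM_SYNONYMS = {
    "yellowing sclera or skin": ["jaundice"],
    "abdominal pain": ["abdominal pain", "belly pain", "stomach ache"],
}


def symptoms_in_query(query):
    """Return the set of canonical symptoms whose phrases appear in the query.

    Matching is case-insensitive and on whole phrases (an assumption).
    """
    text = query.lower()
    found = set()
    for symptom, phrases in SYMPTOM_SYNONYMS.items():
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found.add(symptom)
                break  # one matching phrase is enough for this symptom
    return found
```

Collapsing synonyms onto one canonical symptom means that later counts, such as the number of distinct symptoms searched, do not double-count "belly pain" and "stomach ache".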

Extracting Pancreatic Adenocarcinoma Searchers and Symptom Searchers

To identify positive and negative cases in generating a learned model, we built a data set of searchers from two groups (Fig 1A). Pancreatic adenocarcinoma searchers (A) includes all searchers who inputted one or more queries matching the expression [(pancreas OR pancreatic) AND cancer]. We considered searchers with a diagnosis of pancreatic adenocarcinoma (B) as the subset of searchers (A) who issued one or more experiential diagnostic queries. Symptom searchers (C) includes all searchers with one or more queries related to pancreatic adenocarcinoma symptoms or synonyms (see Symptoms and Risk Factors).
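As a rough illustration, the expression [(pancreas OR pancreatic) AND cancer] could be checked against a query as follows. The function name `is_pancreatic_cancer_query` and the use of case-insensitive whole-word matching are assumptions; the text does not say how the expression was evaluated against the logs.

```python
import re

# Whole-word, case-insensitive patterns (an assumed matching rule).
_PANCREAS = re.compile(r"\b(pancreas|pancreatic)\b", re.IGNORECASE)
_CANCER = re.compile(r"\bcancer\b", re.IGNORECASE)


def is_pancreatic_cancer_query(query):
    """True if the query satisfies (pancreas OR pancreatic) AND cancer."""
    return bool(_PANCREAS.search(query)) and bool(_CANCER.search(query))
```

Under this rule a searcher would belong to group (A) if at least one of their queries returns `True`.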

FIG 1.

(A) Venn diagram depicting the sets of searchers used in the search log analysis: pancreatic adenocarcinoma searchers (A), pancreatic adenocarcinoma searchers with experiential diagnostic queries (B), and those who searched for pancreatic adenocarcinoma symptoms (C). |A ∪ C| (ie, the total number of searchers in our original, prefiltered data set) was 9.2 million. Positives are sourced from B ∩ C and negatives are sourced from C \ A. Relative set sizes in the diagram are not to scale. (B) Schematic illustrating the query timelines used in the selection of positive and negative cases. S₀ refers to the first symptom query and Exp₀ is the first experiential diagnostic query. α is the duration of the symptom lookup period, which is meant to be approximately equal in the aggregate for the positives and negatives. β is the duration of the period of diagnosis, set to 1 week in the current study.

The full search histories of 9.2 million searchers comprise the union of (A) and (C) in Figure 1A. We used a statistical topic classifier developed for use by the Bing search service to identify all health-related queries. We also applied statistical classifiers developed by Bing to make inferences about searchers’ ages and gender. Using these statistical models as filters, we identified searchers for whom > 20% of their queries were health related. We excluded those searchers, given the high likelihood that they were health care professionals.³⁵ A total of 7.4 million searchers remained, among whom 479,787 were pancreatic adenocarcinoma searchers. As additional features for statistical analysis, we used a classifier that provides distributions of topics for queries and clicked results.³⁶ We also considered the dominant geolocation for each searcher using a table that links their Internet Protocol (IP) address to locations.

Positive and Negative Cases

We created query timelines for those labeled as experiential diagnostic searchers and exploratory symptom searchers, and drew sets of observations from these timelines to construct a risk-stratification model. Figure 1B summarizes the strategies for identifying positives and negatives. Query timelines are aligned across searchers based on the point when people issued the first experiential diagnostic query. To ensure sufficient data about each searcher, we removed from the study those with fewer than five search sessions (comprising a sequence of search actions with no more than 30 minutes between actions)¹⁵,¹⁷ spanning five different days. This reduced the population to 6.4 million searchers, with a mean total duration (time between first and last queries) of 210.32 days (standard deviation of 182.93 days and interquartile range of 120 days).
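The session definition and the minimum-history filter can be sketched in Python as below. This is one possible reading, not the authors' code: the names `split_sessions` and `has_enough_history` are hypothetical, and the filter is interpreted as requiring at least five sessions that start on at least five distinct calendar days.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # maximum gap between actions in a session


def split_sessions(timestamps):
    """Group action times into sessions; a gap over 30 minutes starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def has_enough_history(timestamps, min_sessions=5, min_days=5):
    """Assumed reading of the filter: >= 5 sessions starting on >= 5 distinct days."""
    sessions = split_sessions(timestamps)
    start_days = {session[0].date() for session in sessions}
    return len(sessions) >= min_sessions and len(start_days) >= min_days
```

Searchers for whom `has_enough_history` returns `False` would be dropped before any timelines are built.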

Positive cases

To identify experiential pancreatic adenocarcinoma searchers, we defined first-person diagnostic queries for pancreatic adenocarcinoma (Exp₀) based on an exploration of logs. Queries admitted as experiential diagnostic queries included such phrases as “Just diagnosed with pancreatic cancer,” “Why did I get cancer in pancreas,” and “I was told I have pancreatic cancer, what to expect.” From the set of pancreatic adenocarcinoma searchers, 3,203 matched the diagnostic query patterns. Experiential searchers must have searched for at least one symptom before Exp₀. This generated 1,072 query timelines of experiential searchers containing periods of symptom lookup followed by the diagnostic query (33.5% of all experiential diagnostic searchers). The symptom lookup period starts when the first symptom in our symptom set is detected (mean duration [α] = 109.34 days, standard deviation = 49.66 days). For positives, the symptom lookup period terminates at least 1 week before diagnosis (β = 1 week) to reduce the likelihood of overlap between the symptom lookup period and the period of diagnosis (which could add noise to model training and testing), while allowing us to understand predictive performance with minimal lead times.
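A minimal sketch of how the symptom lookup period might be derived for one positive searcher follows. The function name `symptom_lookup_window` is hypothetical, and the handling of edge cases (returning `None` when no symptom query precedes Exp₀ or when the first symptom query falls inside the final week) is an assumption rather than something stated in the text.

```python
from datetime import datetime, timedelta

BETA = timedelta(weeks=1)  # period of diagnosis, 1 week in the study


def symptom_lookup_window(symptom_times, exp0):
    """Return (S0, end) for the symptom lookup period, or None if it is empty.

    S0 is the first symptom query before Exp0; the window ends BETA before Exp0.
    """
    before_exp0 = [t for t in symptom_times if t < exp0]
    if not before_exp0:
        return None  # no symptom search before the diagnostic query
    s0 = min(before_exp0)
    end = exp0 - BETA
    if s0 >= end:
        return None  # assumed: a window with no usable lead time is discarded
    return s0, end
```

The duration α for a searcher is then `end - s0`; only queries inside this window would contribute features.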

Negative cases

To generate a set of searchers we considered negative for pancreatic adenocarcinoma, we sampled from those who searched for pancreatic adenocarcinoma symptoms but who did not search for pancreatic adenocarcinoma directly anywhere in their timeline. We reduced the number of negatives via a sampling procedure to include only those with symptom lookup durations within three standard deviations of the mean of the positives (n = 3,025,046). The resultant positive and negative distributions are statistically indistinguishable using two-sample Kolmogorov-Smirnov tests for temporal duration (D = 0.005, P = .7017) and number of queries (D = 0.003, P = .7681), even though the latter was not a filtering criterion.
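The duration filter and the two-sample Kolmogorov-Smirnov statistic can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, the code computes the D statistic but not the P value, and the study's actual sampling procedure may have involved further steps.

```python
from bisect import bisect_right


def within_three_sd(durations, mean, sd):
    """Keep durations within three standard deviations of the given mean."""
    lo, hi = mean - 3 * sd, mean + 3 * sd
    return [d for d in durations if lo <= d <= hi]


def ks_statistic(a, b):
    """Two-sample KS statistic D: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum gap is attained at an observed value
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

With the positives' mean of 109.34 days and standard deviation of 49.66 days, the upper bound is 109.34 + 3 × 49.66 = 258.32 days; the lower bound is negative, so in practice only long lookup periods are excluded.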

Early Detection

We framed early detection as a binary classification challenge using a statistical classifier. We trained the classifier on features from query timelines of experiential pancreatic adenocarcinoma searchers and symptom-only searchers. Given concerns about false positives and the rarity of pancreatic adenocarcinoma, we focused on maintaining low false-positive rates (FPRs; ie, one wrong prediction in 100,000 correctly identified cases), while retaining a high imbalance ratio of positives and negatives (ie, 1,000 positives v millions of negatives).

The set of observations or features extracted from the symptom lookup period are grouped into five categories as follows: (1) searcher demographic information, including age/sex predictions and dominant location (Demographics); (2) session characteristics, query classes, and URL classes, including activity characteristics and the topics of queries issued and resources accessed (Search Characteristics); (3) characteristics about symptoms searched, including generic symptom searching (eg, number of distinct symptoms; Symptom General) and features for each symptom (Symptom Specific); (4) features that capture the temporal dynamics of the features (eg, increasing/decreasing over time, rate of change; Temporal), and (5) risk factors, including their presence in queries (Risk Factors).

The learned statistical model is based on the gradient boosted trees³⁷ method. Regularization methods were used to minimize the risk of overfitting. See Paparrizos et al³⁸ for details on the construction of the classifier. We used the statistical classifier to study our ability to perform early identification of searchers who would later make experiential diagnostic queries for pancreatic adenocarcinoma. To characterize the predictive power, we used the area under the receiver operating characteristic curve (AUROC) and the recall (true-positive rate [TPR]) at low FPRs as evaluation metrics. Model generalizability was assessed using 10-fold cross validation, stratified by searcher.
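The two evaluation metrics can be illustrated with a small Python sketch that takes classifier scores for positives and negatives. The function names are hypothetical and the code is independent of the gradient boosted model itself; the pairwise AUROC loop is chosen for clarity and would be too slow for millions of negatives.

```python
import math


def auroc(pos_scores, neg_scores):
    """AUROC as the share of positive-negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def tpr_at_fpr(pos_scores, neg_scores, max_fpr):
    """Recall when the threshold is set so that the FPR does not exceed max_fpr."""
    allowed_fp = math.floor(max_fpr * len(neg_scores))
    ranked = sorted(neg_scores, reverse=True)
    # Predict positive only for scores strictly above this negative's score.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)
```

Reporting TPR at a fixed, very low FPR reflects the emphasis on avoiding false alarms: with millions of negatives, even a small FPR corresponds to a large absolute number of people flagged in error.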

Journal Of Oncology Practice, Tuesday, June 14, 2016