Modeling real-world activity using web search data can
provide a number of benefits. First, it can be more timely,
especially when the alternative is not electronically collected.
Influenza surveillance from the United States Centers for
Disease Control and Prevention (CDC) Influenza Sentinel
Provider Surveillance Network (ILINet) has a delay of one to
two weeks1. For economic indicators like unemployment, this
delay is measured in months10. In contrast, search data can
“predict the present” since it is available as the target activity
happens10. Second, query data has good temporal and spatial
resolution. If an indicator of interest is incomplete (missing
time periods or regions, coarser temporal or spatial resolution,
etc.), query data can sometimes be used to fill in the gaps. For
example, influenza rate data from ILINet is only published by
the CDC at the national and regional level and is not published
for the off season13, but models based on query data can be
used to provide estimates year-round and at a state and
sometimes even city level, provided there is sufficient search
activity at that level1,14,15. Third, collecting data for
traditional indicators can incur considerable expense.
Finally, while Internet users do not represent a random sample
of the United States population, the online population has become
less biased over time and now comprises 77% of
the adult population16. Among adults aged 18-29, this figure is
almost 90%. This is in contrast to traditional landline phone
surveys, which must either under-represent this age group or
blend in cell-phone survey data at considerable difficulty and
expense17.