In constructing the Google Flu Trends model, the authors
identified search terms by computing correlations
between influenza-like illness data from the US CDC and
the top 50 million Google search queries performed in the
US over the corresponding period [8]. Such data are not
publicly available, so an alternative approach to identifying
search terms was required; two approaches
Milinovich et al. BMC Infectious Diseases (2014) 14:690 Page 2 of 9
were used. Firstly, terms related to diseases, their aetiological
agents and colloquialisms (such as “hep” for hepatitis or
“flu” for influenza) were manually identified. Secondly,
Google Correlate (www.google.com/trends/correlate) was
queried using monthly surveillance data (described above).
Google Correlate provides a list of up to 100 search terms
that correlate most highly with the query data. To account
for potential language shifts that may have affected search
behaviour [4], this query was performed three times using surveillance
data covering the periods 2004–13, 2007–13 and
2011–13. Up to 300 search terms were downloaded from
Google Correlate for each notifiable disease (100 search
terms per period analysed) and manually sorted; any term
related to the queried notifiable disease was included,
regardless of the nature of the potential association.
Suitable terms were combined with the manually identified
search terms to create a list of search terms (see
Additional file 1). No attempt was made to filter search
terms based upon biological plausibility; any term that
may be perceived to have any association with the
disease of interest was included.
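The term-selection procedure described above can be sketched as follows. This is a minimal illustration only, not the authors' code and not the Google Correlate implementation: it assumes Pearson correlation as the similarity measure, represents each analysis period by a hypothetical start index into the monthly series, and uses made-up toy data. The names `pearson`, `top_terms` and `candidate_terms` are introduced here for illustration. The manual sorting step is not modelled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)


def top_terms(surveillance, queries, start, limit=100):
    """Rank terms by correlation with surveillance data from index `start` on."""
    scores = {
        term: pearson(surveillance[start:], series[start:])
        for term, series in queries.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:limit]


def candidate_terms(surveillance, queries, starts, manual_terms, limit=100):
    """Union of the top terms from each period plus manually identified terms."""
    candidates = set(manual_terms)
    for start in starts:
        candidates.update(top_terms(surveillance, queries, start, limit))
    return sorted(candidates)


# Toy data (hypothetical): six monthly notification counts and three query series.
surveillance = [10, 20, 40, 30, 15, 5]
queries = {
    "flu symptoms": [12, 22, 38, 33, 14, 6],
    "football": [50, 40, 10, 20, 45, 60],
    "cough": [5, 5, 9, 30, 16, 4],
}

# Two overlapping periods (all six months, and the last three), one term per period.
terms = candidate_terms(
    surveillance, queries, starts=[0, 3], manual_terms=["flu"], limit=1
)
print(terms)
```

Repeating the ranking over several periods and taking the union means that a term which correlates well only in recent data (here "cough") is still retained, which is the stated motivation for querying three periods.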
Search frequencies for terms of interest were collected
through Google Trends (www.google.com/trends/). All
data extractions were performed on the 22nd of October,
2013. Google Trends was queried using each of the identified
terms at a national and state/territory level using
the entire time range available (2004–present). Google
Trends presents search frequency as a normalised data
series with values ranging from 0 to 100 (with 100 representing
the point with the highest search frequency and
other points scaled accordingly); functionality for exporting
search frequency data as a .CSV file is provided. For
the purpose of privacy, data are aggregated at a daily,
weekly or monthly level (or are restricted if there is insufficient
search volume). The level of aggregation applied is
determined by the period analysed and the search frequency;
the level of aggregation cannot be specified
by the user. As the notifiable disease surveillance data
used were in monthly format, monthly indices of query
search frequencies were required. Monthly indices are displayed
graphically by Google Trends when querying periods
greater than 36 months; rather than downloading
.CSV files, a script was developed to scrape data from the
Google Trends webpage, allowing the problems associated
with the level of data aggregation to be overcome.
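The 0 to 100 scaling described above can be illustrated with a short sketch. It assumes the index is obtained by dividing each point by the series maximum and multiplying by 100; how Google derives the underlying relative search volume is not described here, so `raw_counts` is a hypothetical stand-in, and the monthly values are invented for the example.

```python
def normalise_to_index(raw_counts):
    """Scale a series so its maximum maps to 100 and other points scale with it."""
    peak = max(raw_counts)
    if peak == 0:
        return [0 for _ in raw_counts]  # no search activity in the whole period
    return [round(100 * value / peak) for value in raw_counts]


# Hypothetical monthly search volumes keyed by month.
monthly_counts = {"2013-07": 120, "2013-08": 300, "2013-09": 150}

index = dict(zip(monthly_counts, normalise_to_index(list(monthly_counts.values()))))
print(index)
```

Because every series is rescaled to its own peak, index values are comparable within a single query and period but not across separately extracted series.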