Note that U(q) should be inferred without actually issuing the
candidate query q to Google. Below we share some insights
on what to consider when inferring such utility.
Insight. Firstly, a retailer does not exist in isolation. There
are often a large number of peer retailers, which can reveal
useful insights about the domain. Imagine we are gathering
reviews for a hair salon (e.g., Salon Vim in Orchard). By
analyzing the domain data (i.e., Web pages) of other hair
salons, we can easily learn many useful patterns such as
salon name + stylist name. We can use these
patterns to guide which queries we choose
(e.g., “Salon Vim in Orchard, Alice”) so as to maximize the
utility. In summary, we propose to learn queries in a domain-aware
manner. We emphasize that such domain data can
be easily obtained in advance; e.g., we can Google salon
name + branch for each hair salon in the domain, fetch
the top 20 result pages, and finally use Y to analyze their content
relevance so as to find the useful patterns.
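
As a rough illustration of how such patterns might surface from domain data, the following minimal sketch scores each attribute template (e.g., salon name + stylist name) by the fraction of matching peer pages that are relevant. It is not the learning method proposed here: the peer salons, page texts, and the `template_scores` helper are hypothetical, matching is plain substring search, and the relevance labels are simply assumed to have been produced by Y beforehand.

```python
from itertools import combinations

# Hypothetical toy domain data: for each peer salon, its attribute values
# and a few fetched pages, each paired with a relevance label that is
# assumed to come from the relevance function Y.
PEERS = [
    {"attrs": {"salon": "salon a", "stylist": "bob", "branch": "bugis"},
     "pages": [("salon a bob review great cut", True),
               ("salon a bugis opening hours", False),
               ("salon a bob bugis review", True)]},
    {"attrs": {"salon": "salon b", "stylist": "carol", "branch": "jurong"},
     "pages": [("salon b carol did my colour review", True),
               ("salon b jurong map", False)]},
]

def template_scores(peers, max_size=2):
    """Score each attribute template by the share of matching pages that are relevant."""
    attr_types = sorted(peers[0]["attrs"])
    scores = {}
    for size in range(1, max_size + 1):
        for template in combinations(attr_types, size):
            hits = relevant = 0
            for peer in peers:
                values = [peer["attrs"][t] for t in template]
                for text, is_relevant in peer["pages"]:
                    # A page matches when it mentions every value in the template.
                    if all(v in text for v in values):
                        hits += 1
                        relevant += is_relevant
            if hits:
                scores[template] = relevant / hits
    return scores

SCORES = template_scores(PEERS)
```

On this toy data, the template (salon, stylist) matches only relevant pages, whereas (branch, salon) mostly matches irrelevant ones, which is the kind of signal that would favor a query such as “Salon Vim in Orchard, Alice” for a new salon.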
Secondly, a query does not exist in isolation. Multiple
queries are needed to gather more target pages. That is,
there exists a context of past queries that were already
fired for the target retailer. Given the time, bandwidth and
sometimes financial costs of querying a commercial
search engine, it is imperative to become context-aware:
accounting for the past queries to eliminate redundancy
between queries. Consider, for example, gathering reviews for hair
salon A. Alice and service are both useful queries
on their own, but their respective top result pages from
Google may overlap. Such redundancy implies that a set
of individually best queries is not necessarily the best set
of queries collectively. Thus, in addition to the candidate
queries themselves, we propose to account for the queries
from previous iterations, in order to capture the redundancy.
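
The redundancy argument can be made concrete with a small sketch. The result sets below are made up, and the greedy marginal-gain rule is only one simple, assumed way of accounting for past queries; it is not presented as the utility inference used here. The helper names (`top_individual`, `greedy_context_aware`, `coverage`) are hypothetical.

```python
# Hypothetical top result pages per candidate query for hair salon A.
# "Alice" and "service" are each useful, but their results overlap.
RESULTS = {
    "Alice": {"p1", "p2", "p3"},
    "service": {"p2", "p3", "p4"},
    "price": {"p5", "p6"},
}

def top_individual(results, k):
    """Pick the k queries with the most result pages, ignoring overlap."""
    return sorted(results, key=lambda q: len(results[q]), reverse=True)[:k]

def greedy_context_aware(results, k):
    """Pick queries one at a time, counting only pages not yet covered."""
    chosen, covered = [], set()
    for _ in range(k):
        remaining = [q for q in results if q not in chosen]
        if not remaining:
            break
        # Marginal gain: pages this query adds beyond the past queries.
        best = max(remaining, key=lambda q: len(results[q] - covered))
        if not results[best] - covered:
            break  # every remaining query is fully redundant
        chosen.append(best)
        covered |= results[best]
    return chosen

def coverage(results, queries):
    """Number of distinct pages gathered by a set of queries."""
    return len(set().union(*(results[q] for q in queries)))

INDIVIDUAL = top_individual(RESULTS, 2)
CONTEXT_AWARE = greedy_context_aware(RESULTS, 2)
```

On this toy data, the two individually largest queries (Alice and service) cover 4 distinct pages, while the context-aware choice (Alice and price) covers 5, even though price returns fewer pages on its own.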