Privacy-Preserving Data Mining as a Service in the Cloud
The discovery of frequent patterns, association rules, and correlation relationships among huge amounts of data is useful to business intelligence.
A typical example of frequent itemset mining is market basket analysis.
This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.
The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items customers frequently purchase together.
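As a minimal sketch of what market basket analysis computes, the following Python snippet counts how often pairs of items appear together and keeps the pairs whose support (the fraction of transactions that contain both items) reaches a threshold. The baskets, the function name `frequent_pairs`, and the threshold are hypothetical and chosen only for illustration; practical miners such as Apriori or FP-growth handle itemsets of any size far more efficiently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; the item names are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def frequent_pairs(transactions, min_support):
    """Return the item pairs whose support is at least min_support."""
    counts = Counter()
    for t in transactions:
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(t), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

result = frequent_pairs(baskets, min_support=0.5)
print(result)
```

With these four baskets, only the pairs (bread, butter) and (bread, milk) appear in at least half of the transactions.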
Over the past decade, there has been growing interest in data mining as a service.
In this paradigm, a company (the data owner) that lacks data storage, computational resources, or expertise stores its data in the cloud and outsources the mining tasks to the cloud service provider (the server).
Without doubt, data mining as a service offers valuable benefits to business intelligence.
However, it also presents a serious privacy problem; that is, the server has access to company data and could learn business secrets from it.
To protect a company’s data privacy and yet enable the server to perform association rule mining on the data in the cloud, a naïve solution is for the data owner to hide the meanings of items in its transaction database by substituting items with unique numbers
(where the same item is substituted by the same number and different items are substituted by different numbers).
This one-to-one substitution approach doesn’t hide the frequencies of items. If the server has some background knowledge (for example, information on the frequencies of some items), it can reidentify items, particularly the most frequent ones.
For example, if bread is the most frequent item in retail transaction databases, the server can conclude that the most frequently occurring number in the transformed database refers to bread.
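The bread example can be sketched in a few lines of Python. The helper names `substitute` and `most_frequent` and the toy database are assumptions made for illustration, and the attacker's background knowledge is reduced to a single fact: bread is the most frequent item.

```python
import random
from collections import Counter

def substitute(transactions, seed=0):
    """One-to-one substitution: map each distinct item to a unique number."""
    items = sorted({i for t in transactions for i in t})
    codes = list(range(len(items)))
    random.Random(seed).shuffle(codes)
    mapping = dict(zip(items, codes))
    encoded = [{mapping[i] for i in t} for t in transactions]
    return encoded, mapping

def most_frequent(transactions):
    """Return the symbol that occurs in the largest number of transactions."""
    counts = Counter(i for t in transactions for i in t)
    return counts.most_common(1)[0][0]

# Hypothetical database in which bread is the most frequent item.
db = [{"bread", "milk"}, {"bread", "butter"}, {"bread"}, {"milk", "eggs"}]
encoded, secret_mapping = substitute(db)

# The server sees only `encoded`, yet frequency counting singles out bread.
guess = most_frequent(encoded)
print(guess == secret_mapping["bread"])
```

Because the substitution preserves every item's frequency, the attacker's guess matches the secret code for bread no matter which numbers the data owner picks.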
To prevent background-knowledge-based attacks, Wai Kit Wong and his colleagues proposed a one-to-n item mapping that transforms transactions nondeterministically.
The basic idea is to add fake items to the transaction database.
However, fabricating false data degrades the accuracy of data analytics, and the proposed method has two weaknesses that can be exploited.
First, each fake item has the same probability of being added to each transaction, and thus all fake items appear with similar frequency when the number of transactions is large.
Second, fake items are added to transactions independently of the items already present. As a result, each fake item is independent of all other items.
This second observation holds even if the frequency of each fake item is different.
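The first weakness can be illustrated with a deliberately simplified model of fake-item insertion. This is not Wong and colleagues' actual algorithm; it assumes only what the text states, namely that every fake item is added to every transaction independently and with the same probability. The names `add_fake_items`, `support`, the probability `p`, and the data are hypothetical.

```python
import random

def add_fake_items(transactions, fake_items, p, seed=0):
    """Add each fake item to each transaction independently with probability p."""
    rng = random.Random(seed)
    return [set(t) | {f for f in fake_items if rng.random() < p}
            for t in transactions]

def support(transactions, item):
    """Fraction of transactions that contain the item."""
    return sum(1 for t in transactions if item in t) / len(transactions)

# Hypothetical database of 20,000 transactions built from four patterns.
real_db = [{"bread", "milk"}, {"bread"}, {"milk", "eggs"}, {"butter"}] * 5000
fakes = ["f1", "f2", "f3"]

noisy_db = add_fake_items(real_db, fakes, p=0.3)
fake_supports = {f: support(noisy_db, f) for f in fakes}
print(fake_supports)
```

With a large number of transactions, the supports of all fake items cluster tightly around `p`, which makes them stand out as a group from the real items.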
Ian Molloy and his colleagues presented a frequency-analysis-based attack on Wong and colleagues’ algorithm.
The attack could remove the independently added fake items by detecting their low correlations with other items, and some of the most frequent items were then reidentified successfully.
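The idea of detecting independence can be sketched with the lift measure, lift(A, B) = supp(A, B) / (supp(A) · supp(B)), which is close to 1 when two items are independent. The code below is a simplified illustration and not Molloy and colleagues' actual attack: it flags every item whose lift with all other items stays near 1. The synthetic data generator, the item `x`, and the tolerance are assumptions made for the example.

```python
import random
from itertools import combinations

def make_db(n, seed=0):
    """Synthetic data: real items are correlated; 'x' is an independent fake."""
    rng = random.Random(seed)
    db = []
    for _ in range(n):
        t = set()
        if rng.random() < 0.6:
            t.add("bread")
        if rng.random() < (0.7 if "bread" in t else 0.1):
            t.add("butter")
        if rng.random() < (0.5 if "butter" in t else 0.2):
            t.add("milk")
        if rng.random() < 0.3:
            t.add("x")  # added regardless of the transaction's other items
        db.append(t)
    return db

def support(db, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(1 for t in db if items <= t) / len(db)

def suspected_fakes(db, tolerance=0.1):
    """Flag items whose lift with every other item is within tolerance of 1."""
    items = sorted({i for t in db for i in t})
    max_dev = {i: 0.0 for i in items}
    for a, b in combinations(items, 2):
        lift = support(db, {a, b}) / (support(db, {a}) * support(db, {b}))
        dev = abs(lift - 1.0)
        max_dev[a] = max(max_dev[a], dev)
        max_dev[b] = max(max_dev[b], dev)
    return {i for i, d in max_dev.items() if d < tolerance}

flagged = suspected_fakes(make_db(20000))
print(flagged)
```

In this synthetic setting, each real item is correlated with at least one other real item, so only the independently added item is flagged. Once such items are removed, the frequencies of the remaining items are exposed again.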