New Research Initiatives:
An Encrypted System for Sensitive Data  Recovering Statistical Histograms Exactly While Preserving Privacy
By Steven Kou (National University of Singapore)
Starting in 2016, RMI will be embarking on three new research initiatives: (1) An encrypted system for sensitive data; (2) A pricing and risk management system for Chinese bonds; (3) Forwardlooking return and volatility. Starting from this newsletter, we will introduce these initiatives sequentially with each new issue.
In the era of big data, collecting and analyzing data while still preserving people¡¯s privacy needs is an important issue. In finance people may not want to share their sensitive data, such as salary information and trading positions. However, the summary statistics of the data, e.g. histograms, perhaps not the original data, may be useful for regulators and for other people in the peer group.
There is extensive literature in cryptography studying multiparty secret sharing, starting from a classical paper by Adi Shamir in 1979, who received a Turing prize for his works. However three main challenges remain in finding a suitable algorithm to collect statistical information from data while preserving people¡¯s privacy: (1) the simplicity of an algorithm; (2) computational errors associated with an algorithm; and (3) the computational time. For example, a simple algorithm, known perhaps even hundreds of years ago, is to add a random noise of mean zero to each sample before collecting the data. This way if the sample size is large enough, e.g. millions, then by adding all collected numbers together the random noises tend to cancel out and the approximate sample mean can be recovered. However, though simple and taking very little computational time, this algorithm does not satisfy the second challenge mentioned above, as it is not easy to quantify the precise error of the algorithm, due to the difficulty in knowing how large the sample size needs to be for the approximation to work well. Furthermore, the above algorithm only recovers the sample mean, and not the whole distribution. To give another example, a multiparty protocol was implemented in 2008 in Denmark, where 1200 farmers participated in an auction to determine the market price of sugarbeet contracts without disclosing their own bid and ask prices. However, the computation took about half an hour in a supercomputer, thus not being entirely satisfactory in terms of the third challenge mentioned above.
In a recent paper by Ning Cai of the Hong Kong University of Science and Technology and Steven Kou of the NUS RMI titled ¡°Recovering Statistical Histograms Exactly While Preserving Privacy¡±, the authors proposed an algorithm that allows each participant to freely share the encrypted numbers of their sensitive data, while the central administration can use the algorithm to recover the whole statistical histogram of the original data exactly. Furthermore, the algorithm guarantees that no one else (even the central administration) can recover the true value of the original data, as the best unencrypted information that can be recovered are the histogram as well as some uniform distributions. Furthermore, the algorithm is easy to implement and only takes several minutes to run in their numerical illustration.
The proposed algorithm provides a way for each participant to freely share the encrypted numbers of their sensitive data. There are at least two potential applications. First, many financial institutions are required by the law to submit sensitive financial information, such as balance sheets and trading positions, to the regulators, who will then report some summary statistics, e.g. 95% quantile. To compute the 95% quantile, one needs more information beyond the mean and variance, and thus their algorithm seems to be useful. Secondly, in many public votes the privacy of participants needs to be preserved. To find out who gets the majority votes, one needs to compute the mode of the histogram not just the sample mean. Thus, this algorithm, which can recover the whole histogram exactly, seems to be applicable.
