Visualizations of anonymized data using privacy-preserving procedures
- Successful participation in Human-Computer Interaction and Data Visualization
- Experience with Java
In the FreeMove project, we are investigating the sharing of movement data to improve our cities and neighbourhoods. However, sharing one's movement data can be a risk to privacy. Therefore, FreeMove investigates and develops methods and recommendations for users, data collectors and data consumers on how to share and use movement data in a privacy-preserving way.
Everybody produces lots of personal data points every day, for example by using mobile phones and surfing the internet. On the one hand, personal data is sensitive and we don't want our private data leaked. On the other hand, personal data (especially movement data) is essential for identifying where traffic infrastructure and public transport need to be improved. To find the patterns that reveal such needs, the data has to be visualized for the data analysts.
To balance these two legitimate interests, data is usually anonymized in some way before it is used. However, simply removing direct identifiers such as names is not sufficient, because an attacker could still identify individuals by combining attributes that remain in the data, such as gender, birth date and zip code, with publicly known information. For this reason there are advanced anonymization methods (such as k-anonymity, l-diversity, t-closeness and differential privacy), which allow the data to be donated with different guarantees about how hard it is to reconstruct information about the original data.
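To make the re-identification risk concrete, the following minimal sketch measures k-anonymity for a toy table and shows how generalizing the attributes raises k. The records, the column names and the helper functions `k_anonymity` and `generalize` are made up for illustration and are not part of any FreeMove code.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records that share all quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalize(record):
    """Coarsen the attributes: keep only the birth year and the first two zip digits."""
    out = dict(record)
    out["birth_date"] = record["birth_date"][:4]
    out["zip_code"] = record["zip_code"][:2] + "***"
    return out


# Toy data (invented); names are already removed, yet every row is unique.
records = [
    {"gender": "f", "birth_date": "1990-03-14", "zip_code": "10115", "income": "high"},
    {"gender": "f", "birth_date": "1990-11-02", "zip_code": "10117", "income": "low"},
    {"gender": "m", "birth_date": "1985-06-30", "zip_code": "10245", "income": "low"},
    {"gender": "m", "birth_date": "1985-01-09", "zip_code": "10247", "income": "high"},
]
QUASI_IDENTIFIERS = ["gender", "birth_date", "zip_code"]

k_raw = k_anonymity(records, QUASI_IDENTIFIERS)
k_generalized = k_anonymity([generalize(r) for r in records], QUASI_IDENTIFIERS)
print("k before generalization:", k_raw)
print("k after generalization:", k_generalized)
```

A table is k-anonymous if every combination of these attribute values occurs at least k times. In the raw toy table each combination occurs only once, so anyone who knows a person's gender, birth date and zip code can pick out that person's row.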
However, these anonymization methods might interfere with the properties of the visualizations. Several different diagrams are used to give a rough overview of a data set, such as histograms, box plots, scatter plots, heat maps and contour plots. Each of these visualizations could be paired with each of the anonymization approaches, and each pairing would have different properties with respect to usefulness and privacy.
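One way to see this interference is to add noise to the counts of a histogram, as a differentially private release might do, and compare the result with the true counts. The sketch below uses the Laplace mechanism and assumes that neighbouring data sets differ by adding or removing one record, so each count has sensitivity 1. The ages are synthetic and the function names are hypothetical.

```python
import random


def histogram(values, bin_edges):
    """Count values per half-open bin [edge_i, edge_i+1)."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_histogram(values, bin_edges, epsilon, rng):
    """Noisy counts; assumes sensitivity 1 (one record added or removed)."""
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in histogram(values, bin_edges)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
ages = [rng.randint(18, 79) for _ in range(500)]  # synthetic data
edges = [18, 30, 40, 50, 60, 70, 80]
true_counts = histogram(ages, edges)

for epsilon in (0.1, 1.0):
    noisy = private_histogram(ages, edges, epsilon, rng)
    mean_error = sum(abs(n - t) for n, t in zip(noisy, true_counts)) / len(true_counts)
    print(f"epsilon={epsilon}: mean absolute error per bin = {mean_error:.1f}")
```

A smaller epsilon gives a stronger privacy guarantee but larger expected noise per bin, so small bins in particular can look very different from the original. Noisy counts can also become negative or non-integer, which a histogram renderer has to handle.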
The objective of this thesis is to investigate the visualization of anonymized data by (1) implementing a reusable tool that combines (a) different anonymization methods with (b) different visualization methods and (2) evaluating the resulting combinations in a user study.
The objectives are flexible, depending on your interest and level.
Possible Procedure
- Research into existing anonymization and visualization methods
  - There are some existing implementations of anonymization methods (6), of which 6b looks most promising
  - As a library for the visualization we suggest Altair (7)
- Implement a prototype to anonymize and visualize data with different methods
  - Similar to Avraam et al. (1)
  - The implementation should be reusable, either as a library or as a tool that can read in arbitrary data
- Apply the prototype to an existing dataset, for example the "Census Income" dataset (8)
- Capture and compare the properties of the output, for example:
  - Which anonymization strategy is suitable for which kind of data?
  - What properties of the data are lost?
  - Which anonymization strategy protects privacy the best?
  - What else is remarkable about the visualizations?
- User tests with the different combinations
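The prototype step above could, for instance, be organized around two registries, one for anonymization methods and one for visualization methods, so that every pairing can be generated from the same input. This is only a sketch under assumptions: all function and registry names are hypothetical, the input is a small invented CSV, and plain-text renderers stand in for real charts. An actual prototype would presumably return Altair charts instead of strings.

```python
import csv
import io
from collections import Counter
from itertools import product


def no_anonymization(rows, column):
    """Baseline: return the rows unchanged (as copies)."""
    return [dict(r) for r in rows]


def generalize_to_decade(rows, column):
    """Replace a numeric value by its decade, e.g. 37 becomes '30-39'."""
    out = []
    for r in rows:
        r = dict(r)
        low = int(r[column]) // 10 * 10
        r[column] = f"{low}-{low + 9}"
        out.append(r)
    return out


def text_histogram(rows, column):
    """Render one line per distinct value, with one '#' per record."""
    counts = Counter(r[column] for r in rows)
    return "\n".join(f"{value:>6} | {'#' * n}" for value, n in sorted(counts.items()))


def frequency_table(rows, column):
    """Render a tab-separated table of value and count."""
    counts = Counter(r[column] for r in rows)
    return "\n".join(f"{value}\t{n}" for value, n in sorted(counts.items()))


# Hypothetical registries; new methods are added by inserting one entry.
ANONYMIZERS = {"none": no_anonymization, "decade": generalize_to_decade}
VISUALIZERS = {"histogram": text_histogram, "table": frequency_table}


def run_all(csv_file, column):
    """Read CSV data and produce every anonymizer/visualizer combination."""
    rows = list(csv.DictReader(csv_file))
    return {
        (a, v): VISUALIZERS[v](ANONYMIZERS[a](rows, column), column)
        for a, v in product(ANONYMIZERS, VISUALIZERS)
    }


sample = io.StringIO("age,income\n23,low\n27,high\n31,low\n38,low\n39,high\n")
results = run_all(sample, "age")
print(results[("decade", "histogram")])
```

Keeping both sides behind the same function signature means that the combinations for the comparison and the user tests come from one loop rather than from hand-written code per pairing.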
(1) Avraam, Demetris, et al. “Privacy Preserving Data Visualizations.” EPJ Data Science, vol. 10, no. 1, SpringerOpen, Dec. 2021, pp. 1–34. epjdatascience.springeropen.com, https://doi.org/10.1140/epjds/s13688-020-00257-4.
(2) Bhattacharjee, Kaustav, et al. “Privacy-Preserving Data Visualization: Reflections on the State of the Art and Research Opportunities.” Comput. Graph. Forum, vol. 39, no. 3, 2020, pp. 675–92, https://doi.org/10.1111/cgf.14032.
(3) Zhang, Dan, et al. “Investigating Visual Analysis of Differentially Private Data.” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 2, Feb. 2021, pp. 1786–96. IEEE Xplore, https://doi.org/10.1109/TVCG.2020.3030369.
(4) Li, Ninghui, et al. “T-Closeness: Privacy Beyond k-Anonymity and l-Diversity.” 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 106–15. IEEE Xplore, https://doi.org/10.1109/ICDE.2007.367856.
(5) LeFevre, Kristen, et al. “Incognito: Efficient Full-Domain K-Anonymity.” Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data - SIGMOD ’05, ACM Press, 2005, p. 49. DOI.org (Crossref), https://doi.org/10.1145/1066157.1066164.
(5a) https://github.com/DunniAdenuga/Incognito (Java implementation of k-anonymity)
(6) “How to Protect Dataset Privacy Using Python and Pandas - DZone Security.” Dzone.Com, https://dzone.com/articles/an-easy-way-to-privacy-protect-a-dataset-using-pyt. Accessed 6 Sept. 2021. (Implementation no longer available)
(6a) https://github.com/qiyuangong/Mondrian (k-anonymity only)
(6b) https://github.com/Nuclearstar/K-Anonymity (k-anonymity, l-diversity and t-closeness)
(6c) https://pypi.org/project/arkhn-arx/ (k-anonymity and l-diversity)
(6d) https://github.com/leo-mazz/crowds (k-anonymity only)
(6e) http://cs.utdallas.edu/dspl/cgi-bin/toolbox/javadoc/index.html (JavaDoc only)
(6f) https://github.com/kanonymity/anonymisation (k-anonymity only, French)
(6g) https://github.com/AXLEproject/axle-ola-prototype (k-anonymity only)
(6h) https://arx.deidentifier.org/ (Java API only)
(7) Basic Statistical Visualization — Altair 4.1.0 Documentation. https://altair-viz.github.io/getting_started/starting.html. Accessed 6 Sept. 2021.
(8) "Census Income" dataset. https://archive.ics.uci.edu/ml/datasets/adult. Accessed 13 Sept. 2021.