This project allows data providers to publish anonymized static knowledge graphs (KGs). It makes three main contributions: the k-Attribute Degree principle, two information loss metrics, and a clusters-based KG anonymization algorithm.
k-Attribute Degree Principle
k-Attribute Degree (k-ad) ensures that no data owner can be re-identified with a confidence higher than 1/k. k-ad can replace the k-anonymity principles designed for relational data and for directed graphs (Paired-k-degree).
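As a rough illustration of the principle, the sketch below checks a toy dataset under the assumption that k-ad holds when every user shares the same attribute values and per-relationship out-/in-degrees with at least k - 1 other users. The record layout (`attrs`, `out_deg`, `in_deg`) and the function names are hypothetical and are not taken from this project's code.

```python
from collections import Counter


def signature(user):
    """Hashable summary of a user's attribute values and out-/in-degrees."""
    return (
        tuple(sorted(user["attrs"].items())),
        tuple(sorted(user["out_deg"].items())),
        tuple(sorted(user["in_deg"].items())),
    )


def satisfies_k_ad(users, k):
    """True if every signature is shared by at least k users (assumed reading of k-ad)."""
    counts = Counter(signature(u) for u in users.values())
    return all(n >= k for n in counts.values())


# Hypothetical toy data: u1 and u2 are indistinguishable, u3 is unique.
users = {
    "u1": {"attrs": {"age": "20-30"}, "out_deg": {"follows": 2}, "in_deg": {"follows": 1}},
    "u2": {"attrs": {"age": "20-30"}, "out_deg": {"follows": 2}, "in_deg": {"follows": 1}},
    "u3": {"attrs": {"age": "30-40"}, "out_deg": {"follows": 1}, "in_deg": {"follows": 1}},
}
```

With this data, `satisfies_k_ad(users, 2)` fails because `u3` has a signature of its own. An adversary who knows that signature could re-identify `u3` with confidence 1, which exceeds 1/2.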
Information Loss Metrics
Information loss metrics evaluate the quality of anonymized KGs.
Attribute and Degree Information Loss Metric (ADM): combines information loss metrics used in relational data anonymization and in graph anonymization.
Attribute Truthfulness and Degree Information Loss Metric (ATDM): evaluates the truthfulness of association rules extracted from anonymized KGs. The truthfulness of a rule is measured by a classifier. I implemented the classifier in PyTorch by combining shallow embeddings of association rules from the raw KGs.
Clusters-Based Knowledge Graph Anonymization
Whereas all previous work uses a fixed clustering algorithm to anonymize data, the clusters-based knowledge graph anonymization algorithm (CKGA) lets data providers use any clustering algorithm to generate anonymized KGs. To this end, CKGA must solve two challenges: (1) how to make any clustering algorithm work with the information loss metrics, and (2) how to make the anonymized KGs satisfy k-ad regardless of which algorithm is used.
I address (1) by modifying node2vec (PyTorch) to generate user vectors such that the Euclidean distance between two vectors approximates the information loss of anonymizing the corresponding users. Since most clustering algorithms accept vectors in Euclidean space as input, CKGA can run them on the generated vectors.
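To make the goal of the embedding step concrete, the sketch below measures how closely a set of user vectors reproduces a table of pairwise information loss values. It does not show the modified node2vec or its training objective, which are not described here. The mean squared gap is only one plausible way to quantify the match, and `embedding_stress`, `vectors`, and `info_loss` are hypothetical names with made-up values.

```python
import math
from itertools import combinations


def embedding_stress(vectors, info_loss):
    """Mean squared gap between Euclidean distance and pairwise information loss.

    vectors:   {user: [float, ...]} embedding of each user
    info_loss: {(u, v): float} with u < v, the loss of anonymizing u and v together
    """
    pairs = list(combinations(sorted(vectors), 2))
    if not pairs:
        raise ValueError("need at least two users")
    gaps = [
        (math.dist(vectors[u], vectors[v]) - info_loss[(u, v)]) ** 2
        for u, v in pairs
    ]
    return sum(gaps) / len(gaps)


# Hypothetical example: three users on a line, 5 units apart.
vectors = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [6.0, 8.0]}
info_loss = {("a", "b"): 5.0, ("a", "c"): 10.0, ("b", "c"): 5.0}
stress = embedding_stress(vectors, info_loss)
```

A stress close to zero means a clustering algorithm that groups nearby vectors is also grouping users whose joint anonymization costs little information, which is the property the paragraph above relies on.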
I address (2) with an algorithm that modifies the clusters produced by the clustering algorithm to make them valid (i.e., each contains at least k users). Finally, I design a generalization algorithm that generalizes users' attributes and relationships so that all users in the same cluster have identical attribute values and relationship out-/in-degrees.
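The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual algorithms: undersized clusters are greedily merged into the cluster with the nearest centroid, attributes are generalized to the set of values present in the cluster, and degrees are raised to the cluster maximum. All function names and the record layout are hypothetical.

```python
import math


def centroid(cluster, vectors):
    """Component-wise mean of the vectors of the users in a cluster."""
    dims = len(vectors[cluster[0]])
    return [sum(vectors[u][d] for u in cluster) / len(cluster) for d in range(dims)]


def repair_clusters(clusters, vectors, k):
    """Merge clusters with fewer than k users into their nearest neighbour (assumed strategy)."""
    clusters = [list(c) for c in clusters if c]
    if sum(len(c) for c in clusters) < k:
        raise ValueError("fewer than k users in total; no valid clustering exists")
    while True:
        small = [c for c in clusters if len(c) < k]
        if not small:
            return clusters
        victim = min(small, key=len)
        clusters = [c for c in clusters if c is not victim]
        here = centroid(victim, vectors)
        target = min(clusters, key=lambda c: math.dist(centroid(c, vectors), here))
        target.extend(victim)


def _max_degrees(members, key):
    """Largest degree per relationship type among the cluster members."""
    rels = {r for m in members for r in m[key]}
    return {r: max(m[key].get(r, 0) for m in members) for r in rels}


def generalize(cluster, users):
    """Give every user in the cluster the same attribute values and degrees.

    Assumes all members carry the same attribute names.
    """
    members = [users[u] for u in cluster]
    attrs = {a: tuple(sorted({m["attrs"][a] for m in members})) for a in members[0]["attrs"]}
    shared = {
        "attrs": attrs,
        "out_deg": _max_degrees(members, "out_deg"),
        "in_deg": _max_degrees(members, "in_deg"),
    }
    return {u: shared for u in cluster}
```

Because every repaired cluster has at least k members and `generalize` makes the members of a cluster indistinguishable, the output meets the k-ad reading used in this sketch whichever clustering algorithm produced the initial clusters. Raising degrees to the maximum implies adding edges, which is a simplification chosen for illustration.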