Anonymizing Temporal Phrases in Natural Language Text to be Posted on Social Networking Services
Hoang, Anh-Tu, Tran, Minh-Triet, Yoshiura, Hiroshi, Sonehara, Noboru, and Echizen, Isao
In Digital-Forensics and Watermarking, pp. 437–451, 2014
Time-related information in text posted online is one type of personal information targeted by attackers, which is one reason that sharing information online can be risky. Time information should therefore be anonymized before it is posted on social networking services (SNSs). One approach to anonymizing information is to replace sensitive phrases with anonymous phrases, but attackers can usually spot such anonymization because the result reads unnaturally. Another approach is to detect and remove temporal passages in the text, but removing these passages can make the meaning of the text unnatural. We have developed an algorithm that anonymizes time-related personal information by removing the temporal passages only when doing so will not change the natural meaning of the message. The temporal phrases are detected by using machine-learned patterns, each of which is represented by a subtree of the sentence parsing tree. The temporal phrases in the parsing tree are distinguished from other parts of the tree by using temporal taggers integrated into the algorithm. In an experiment with 4008 sentences posted on a social network, 84.53% of them were anonymized without changing their intended meaning. This is significantly better than the 72.88% rate of the best previous temporal phrase detection algorithm. Of the learned patterns, the ten most common ones were used to detect 87.78% of the temporal phrases. This means that a small number of the most common patterns is enough to anonymize the temporal phrases in most messages to be posted on an SNS. The algorithm works well not only for temporal phrases in text posted on social networks but also for other types of phrases (such as location and objective ones), other areas (religion, politics, military, etc.), and other languages.
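
The idea of removing a temporal phrase by matching a subtree of the parse tree can be illustrated with a minimal sketch. It is not the authors' implementation: the tuple-based tree format, the `TEMPORAL_WORDS` lexicon (a stand-in for a real temporal tagger), and the single hand-written pattern in `REMOVABLE_LABELS` (a stand-in for the machine-learned patterns) are all assumptions made for illustration, and the sketch does not include the check that the remaining text keeps its natural meaning.

```python
# Toy constituency tree: an internal node is (label, child, child, ...),
# a leaf is (pos_tag, word). All names below are hypothetical.

# Stand-in for a temporal tagger: a tiny lexicon of time words (assumption).
TEMPORAL_WORDS = {"yesterday", "today", "tomorrow", "monday", "morning", "night"}

# Stand-in for learned patterns: phrase labels that may be dropped
# when they contain a temporal word (assumption).
REMOVABLE_LABELS = {"PP", "ADVP"}


def is_leaf(node):
    """A leaf is a (pos_tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words under a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def is_temporal(node):
    """True if any word under the node is in the toy temporal lexicon."""
    return any(word.lower() in TEMPORAL_WORDS for word in leaves(node))


def anonymize(node):
    """Return a copy of the tree with matching temporal subtrees removed."""
    if is_leaf(node):
        return node
    kept = []
    for child in node[1:]:
        if (not is_leaf(child)
                and child[0] in REMOVABLE_LABELS
                and is_temporal(child)):
            continue  # drop the whole temporal phrase
        kept.append(anonymize(child))
    return (node[0], *kept)


# "I met Anna at the station on Monday"
tree = (
    "S",
    ("NP", ("PRP", "I")),
    ("VP",
     ("VBD", "met"),
     ("NP", ("NNP", "Anna")),
     ("PP", ("IN", "at"), ("NP", ("DT", "the"), ("NN", "station"))),
     ("PP", ("IN", "on"), ("NP", ("NNP", "Monday")))),
)

print(" ".join(leaves(anonymize(tree))))
```

In this toy example only the prepositional phrase containing "Monday" is removed, while the location phrase "at the station" is kept because it contains no word from the lexicon.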