Catching Fake Reviews via Linguistic Analysis

Bloomberg Businessweek, in its September 29, 2011 edition, featured research by Yejin Choi, an assistant professor in the Department of Computer Science at Stony Brook who leads the Natural Language Processing (NLP) lab. The research was a collaboration with researchers at Cornell University (Jeff Hancock, Claire Cardie, and Myle Ott) and was presented at the Association for Computational Linguistics (ACL) 2011 conference. The research paper is publicly available from the ACL Anthology.

The core of this research is a set of statistical techniques that distinguish the writing styles of deceptive reviewers from those of truthful reviewers. By analyzing contrastive examples of deceptive and truthful reviews, computers can learn the linguistic patterns that separate the two.
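To make the idea of learning from contrastive examples concrete, here is a minimal sketch of a word-count (unigram) Naive Bayes classifier. It is an illustration only, not the implementation used in the study: Naive Bayes is simply one standard technique for this kind of text classification, the four training reviews are invented for the example, and names such as `NaiveBayes` and `tokenize` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Unigram Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training reviews
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for tok in tokenize(text):
                counts[tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods of known words
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data (NOT from the study): two reviews per class.
train_texts = [
    "The bathroom was small but the location near the river was great",
    "Our room on the fifth floor had a large bathroom and a quiet location",
    "My husband and I had the most amazing vacation and I loved my stay",
    "I took my family on a business trip and my experience was the best",
]
train_labels = ["truthful", "truthful", "deceptive", "deceptive"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("The floor was clean and the bathroom was small"))
print(model.predict("I loved my vacation with my family"))
```

With only four training reviews this shows the mechanics and nothing more; a usable model would need hundreds of labeled examples per class and evaluation on held-out data.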

The first big challenge was that, for statistical techniques to work, we needed contrastive samples of truthful and deceptive reviews. Although a vast array of consumer reviews is available online, those were not immediately useful for this study, because it is nearly impossible for anyone to sit down and accurately determine which reviews are real and which are fake. Reviewers won’t tell us the truth. Therefore, we did what a desperate hotel owner might do: we created our own fake reviews by hiring 400 Amazon Mechanical Turkers to write positive reviews for 20 hotels in Chicago. For truthful reviews, we gathered reviews from the most popular Chicago hotels, which are less likely to be targets of spam.

A startling fact we found is that people are not very good at detecting deceptive reviews. In fact, they couldn’t do much better than chance, while statistical algorithms could identify deceptive reviews with accuracy as high as 90%. This study shows that humans have a strong truth bias; that is, people tend to believe what they see.

Statistical analysis gives us surprising insights into fake reviewers. For instance, we find that truthful reviewers naturally focus on spatial details of their experience (e.g., bathroom, floor, small, location), while deceptive reviewers have difficulty filling in spatial information. As a result, deceptive reviewers focus on other types of information, such as why they went to Chicago (e.g., vacation, business) or whom they went with (e.g., family, husband).

We also found that deceptive reviews show the characteristics of imaginative writing, i.e., frequent use of verbs and adverbs, while truthful reviews show the characteristics of informative writing, i.e., frequent use of nouns and adjectives (except superlatives, which are more common in deceptive reviews because deceptive reviewers tend to exaggerate). After all, deceptive reviewers must make up a description of a hotel they have never been to, so they unconsciously rely on an imaginative writing style.

Another unexpected finding is that fake reviewers tend to overdo “self-referencing”; that is, they overuse words such as “I”, “me”, “my”, and “mine”, as if trying to underline their existence and credibility. Much previous deception research in the socio-cognitive sciences, however, has reported the opposite: liars tend to distance themselves from their lies by avoiding self-references. This research therefore highlights the importance of domain-dependent deceptive cues, rather than universal cues.
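As a rough illustration of how a self-referencing cue could be measured, the sketch below computes the fraction of words in a review that are first-person singular pronouns. This is an assumed, simplified measure for illustration, not the feature set used in the study; the function name `self_reference_rate` and the sample sentences are hypothetical.

```python
import re

# First-person singular pronouns named in the text above.
FIRST_PERSON = {"i", "me", "my", "mine"}


def self_reference_rate(text):
    """Return the share of word tokens that are first-person pronouns.

    Simplification: contractions such as "I'm" are kept as one token
    and therefore are not counted as self-references.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in FIRST_PERSON)
    return hits / len(tokens)


# Invented sample sentences, for illustration only.
print(self_reference_rate("I loved my stay"))
print(self_reference_rate("The room was clean"))
```

A single rate like this is a weak signal on its own; it would normally serve as one feature among many in a classifier, and, as noted above, the direction of the cue may differ from one domain to another.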

This work has also been featured in numerous other media outlets, including ABC News and The New York Times.