There has been some buzz in the last week or so around a paper detecting the credibility of tweets. As Twitter has grown, so has spam. Moreover, during crises such as tsunamis, the Arab spring, the UK riots and so on, one sees deliberate strategies to spread disinformation. The effects of this can range from inconvenience, to wasted time and resources to being downright dangerous. Thus, a system to filter out such disinformation would be valuable for media, first responders and others.
In Credibility Ranking of Tweets during High Impact Events, PSOM 2012, Gupta and Kamaguru attempted to identify features that helped identify credible information versus opinion versus spam. The built a supervised learning model using features of the tweets and the senders and ran those through a regression model and re-ranking scheme that they claim provided a reasonable predictive ability.
To do this, they collected 35 million tweets from 6 million users based on search terms from trending topics during summer 2011. These were high impact events meaning they produced at least 25k tweets and persisted as a trending topic for more than 48 hours. They then selected 14 trending topics---including UK riots, Hurricane Irene, Steve Jobs resignation and the bomb blast in Mumbai---and annotated the associated tweets to flag whether each tweet was credible. In short, a human had to decide whether each tweet was definitely credible / seems credible / definitely incredible / I can't decide and whether it contained information or not, was unrelated to the news event, or was spam. An example of the three classes is shown below.
They found that 30% of tweets provided information, 17% was credible and 14% was spam. They then proceeded to extract a large set of features from the tweets, the sender and their user profile. These features included:
Tweet: number of words, number of characters, whether it was a retweet, whether it contained a URL, whether it contained emoticons, number of hashtags etc.
Sender: age, number of followers, number of friends, description length, whether it was a verified account etc.
They performed a logistic regression (1 = credible, 0 = not credible/spam/opinion) and found the following significant indicators (p < 0.001):
- number of characters and unique characters present in tweet --- likely tweets with hashtags, @mentions and URLs contain more unique characters and so are more informative
- presence of swear words --- indicates opinion
- inclusion of pronouns and presence of sad / happy emoticons --- fact based tweets are less personal
- p<0.01: presence of URL --- often linked to images, video and resources related to the event.
They found that "low number of happy emoticons [:-), :)] and high number of sad emoticons [:-(, :(] act as strong predictors of credibility". Now, most of the high impact (arguably 13/14) events in their list are negative so one could imagine that for a general model that covered both positive and negative events, this feature would be use of emoticons: happy emoticons for happy events, sad emoticons for sad events. However, one could also argue that use of happy emoticons such as :-) could mean a joke, that the comment is not credible. We will see data from a second study below that finds the same results so it is more likely to be the latter: sad emoticons only are strongly linked to credible information.
They also used SVM and pseudo relevance feedback (PRF) to improve results. They ranked documents using RankSVM and using a unigram model identified the top K tweets and reranked them using a BM25 similarity metric. (This metric is a function of TF, IDF, K, tweet length, and some constants. See paper for full description.) To evaluate the results they used the Normalized Discounted Cumulative Gain metric. For the top 25 tweets, they achieved 0.37 NDCG, a statistically significant (p < 0.05) improvement over non ranked data. Unfortunately, there is scant further details about their analysis. I have no idea as to the accuracy, precision, recall or F-scores of their classifier. Saying that an additional layer of analysis improved results is not as useful as detailing fully how good the initial results were.
Gupta and Kamaguru are not the only ones working in this field. Castillo et al. (2011) published a more compelling paper entitled "Information Credibility on Twitter" with a more detailed statistical analysis. They too sucked up tweets from trending topics, this time over a 2 month period. They used Mechanical Turk to annotate the newsworthiness and credibility and asked the evaluators to write a sentence justifying their assessment. The extracted a sets of features, similar but larger and more detailed than above.
They tried a number of learning schemes including SVM, decision trees, decision rules, and Bayes networks. Results across these techniques were comparable with the best coming from a J48 decision tree. Their J48 classifier achieves an accuracy equal to 89%. For detecting newsworthy tweets they achieved an impressive F1 score of 0.924. For predicting credibility they also found that a J48 tree worked best and found the following significant features: presence of questions mark (non credible), sentiment (positive: no credible, negative: credible), presence of negative emoticons (credible). More active users are more credible (I wonder if this is related to professional media organizations such as CNN, Reuters, local news organizatins whose business it is to be both credible and active.) Together, these feature produced a model with 86% accuracy. See table below.
In both these studies, the features make sense. Active users, tweeting frequently with lots of followers, linking to relevant material and who have a reasonably fleshed out user profile. With 86% accuracy from a balanced training set, such features and associated model could form the basis of a reasonably good filter system for a twitter stream and in the case of emergency services could save valuable resources and possibly lives.
P.S. it is worth checking out the truthiness system from http://truthy.indiana.edu/