Location
Room 120, New Computer Science Building
Event Description

Analysis of uniform random sampling algorithm for Twitter
Abstract:
The daily volume of Tweets posted on Twitter is around 500 million, and the impact of this data on applications such as public safety, opinion mining, and news broadcasting is increasing day by day. Analyzing large volumes of Tweets for such applications requires techniques that scale well with the number of Tweets. In this talk we discuss a theoretical formulation for sampling Twitter data. Metrics to quantify the statistical representativeness of Tweet samples are introduced, and results on the number of samples sufficient to obtain highly representative Tweet samples are derived. These statistical metrics quantify the representativeness, or goodness, of a sample in terms of restoring the public sentiments associated with frequent keywords. Sampling a sufficient number of Tweets uniformly at random could serve as a first step before using other sophisticated summarization methods to generate summaries for human use. Experiments conducted on real Twitter data are provided as examples to show how the bounds behave in practice. Moreover, we compare different kinds of random sampling algorithms in these experiments. The bounds derived are attractive since they do not depend on the total number of Tweets in the universe. Although these ideas and techniques are specific to Twitter, they could find applications in other areas as well.

Brief Bio:
Vikas Joshi is a research scientist in the Information Management group of IBM India Research Labs (IRL). He joined IBM IRL in 2012. He works on social media analytics, analyzing multiple streams of social media data to extract relevant insights that help law enforcement agencies. He is interested in the broad areas of machine learning, speech recognition, and text analytics.

Event Title
CSE 600: Vikas Joshi from IBM Research Labs