Project stage I: Building a ground truth dataset

To understand how rumors spread in microblogs, we first need to build a ground truth dataset of known rumors that we can analyze further. The first stage of the project requires you to collect data about valid (truthful) and rumor tweets on different topics, and to classify the tweets as valid or rumor. You need to plan and develop the methodology for building the ground truth dataset. You will use this dataset in the remaining project stages. To get an idea of how this is currently done, take a look at the “Suggested Readings” at the end of this page.

In this stage you have to collect and process tweets about different topics. You have to collect at least 100 rumor tweets (propagating false information) and at least 100 valid tweets (propagating true information) for each of the 3 topics listed below. Feel free to cover up to 5 topics in total, that is, up to 2 additional topics.

Boston bombings: News source
Hurricane Sandy: News source
Facebook shutting down on March 15, 2011: News source

You can use Twitter’s built-in search feature to extract tweets on a topic. It supports search by related keywords, hashtags, etc. The Twitter Search API documentation is available here: https://dev.twitter.com/docs/api/1.1/get/search/tweets.
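As a starting point, a search request can be assembled along the following lines. This is only a sketch: the host name `api.twitter.com` and the parameters `q`, `count`, `result_type`, and `max_id` are given as recalled from the API 1.1 documentation linked above and should be checked there. The helper name `build_search_url` is made up for this example, and the OAuth authentication that the API requires is omitted.

```python
from urllib.parse import urlencode

# Assumed endpoint for GET search/tweets (API 1.1); verify against the docs.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100, max_id=None):
    """Return the GET URL for one page of search results.

    query  -- keywords and/or hashtags, e.g. '#sandy OR "hurricane sandy"'
    count  -- tweets per page (capped at 100 here, the assumed API maximum)
    max_id -- if given, only ask for tweets with an ID at or below this value,
              which is one way to page backwards through older results
    """
    params = {"q": query, "count": min(count, 100), "result_type": "recent"}
    if max_id is not None:
        params["max_id"] = max_id
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url('#bostonbombing OR "boston bombing"'))
```

The resulting URL still has to be sent with a signed (authenticated) request, and the raw JSON responses should be stored unchanged so that the raw data remains available later.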

Once you have collected the tweets on a topic, you need to classify them as rumor tweets or valid tweets. One way to classify tweets is to go over them manually. This can be quite laborious, so we encourage you to come up with automated ways to classify tweets. As these are known rumors, you could rely on crowdsourced information from the microblogging service or on third-party rumor aggregation services. Information contained within the tweets, such as hashtags or particular phrases, can also provide valuable cues for detecting rumor tweets. Please note that while classifying the tweets you can do a finer-grained classification, such as whether a tweet endorses a rumor, refutes it, questions it, or is neutral about it. Such finer-grained classifications will be immensely helpful in the next stages of the project. For the types of finer-grained classification, please go through section 4.1 of the first suggested reading.
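To illustrate how phrases inside a tweet can serve as cues, here is a minimal rule-based sketch that assigns one of the four finer-grained labels mentioned above. The cue phrases and the function name `label_stance` are purely hypothetical placeholders, not a recommended cue list; a real solution would derive its cues from the collected data and validate them against manually labeled tweets.

```python
import re

# Hypothetical cue phrases, for illustration only. Rules are checked in order,
# so a tweet that both refutes and asks a question is labeled "refuting".
CUES = [
    ("refuting", re.compile(r"\b(fake|hoax|false|not true|debunked)\b", re.I)),
    ("questioning", re.compile(r"\?|\b(is it true|really|unconfirmed)\b", re.I)),
    ("endorsing", re.compile(r"\b(confirmed|breaking|omg)\b", re.I)),
]


def label_stance(text):
    """Return 'refuting', 'questioning', 'endorsing', or 'neutral'."""
    for label, pattern in CUES:
        if pattern.search(text):
            return label
    return "neutral"


if __name__ == "__main__":
    for tweet in [
        "This is a hoax, Facebook is not shutting down",
        "Is Facebook shutting down on March 15?",
        "BREAKING: Facebook shutting down March 15",
    ]:
        print(label_stance(tweet), "|", tweet)
```

Rules of this kind are cheap and scale to large collections, but they misfire easily (sarcasm, quoted text), so it is worth checking a random sample of the labels by hand.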

Your solution for this stage will be primarily graded based on:

a. Tweet collection and classification methodology.
b. Efficiency and scalability of your proposed solution.
c. Quantity of tweets collected.
d. Quality of classifications (finer-grained classification means better quality).

Note: Please DO NOT delete the collected raw data after cleaning. You may need this raw data for later stages of the project.

Expected output

The various project milestones for stage 1 are below.

1. Status update presentation
Date of submission : 28th May 2013, 11:59pm

During the class on 29th May 2013, each group must give a short 5-minute status update presentation. Please note that we will cut off each presentation at exactly 5 minutes, and you will have an additional 5 minutes to get feedback and answer questions. The presentations have to be uploaded to the submission site by 28th May at the latest. The presentations should be in ppt/pptx/pdf format and should have the following name: projectPhase1.status.<lastname>.ppt/pptx/pdf

The purpose of this presentation is to give a status update, present your planned methodology for tweet collection and classification, and to get feedback on it.

2. Dataset of valid and rumor tweets
Date of submission : 14th Jun 2013, 11:59pm

For each topic, you have to submit two text data files: one for rumor tweets and the other for valid tweets. Please note the following guidelines:

a. The file name must have the following format: <topic-name-in-one-word>.<rumor/valid>.<Lastname1>.<Lastname2>
b. Each file must contain at least 100 tweets.
c. Each line of the file must have at least the following fields at the beginning (in the same order!): <TweetId> <Tweet> <UserId> <TimeStamp>
You can also add further fields after these if you want, e.g. geo-tag information, retweet information, or any other metadata that comes with the crawled tweet.
d. You must use the following separator for the fields: <<sscs-ss13:??>>
For example:
Filename : bostonBombing.rumor.mondal.kulshrestha.viswanath
File Content: 100762367886723<<sscs-ss13:??>>Boston Bombings are done by CIA<<sscs-ss13:??>>2237832772<<sscs-ss13:??>>2013-03-23 16:47:40
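The line format above can be produced and read back with a few lines of code. The sketch below is one possible way to do it; the helper names `format_line` and `parse_line` are invented for this example, and replacing line breaks inside the tweet text with spaces is an assumption made here so that each tweet stays on a single line (the guidelines do not say how to handle multi-line tweets).

```python
SEP = "<<sscs-ss13:??>>"  # field separator required by guideline d


def format_line(tweet_id, text, user_id, timestamp, *extra):
    """Build one output line: the four mandatory fields, then optional extras."""
    # Assumption: line breaks in the tweet text are replaced by spaces.
    flat_text = str(text).replace("\r", " ").replace("\n", " ")
    fields = [str(tweet_id), flat_text, str(user_id), str(timestamp)]
    fields.extend(str(value) for value in extra)
    return SEP.join(fields)


def parse_line(line):
    """Split one line back into its fields; the first four are mandatory."""
    fields = line.rstrip("\n").split(SEP)
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields, got %d" % len(fields))
    return fields


if __name__ == "__main__":
    line = format_line(100762367886723, "Boston Bombings are done by CIA",
                       2237832772, "2013-03-23 16:47:40")
    print(line)
    print(parse_line(line))
```

Reading every submitted file back with a parser like `parse_line` is a quick way to catch lines that do not follow the required format.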

3. Final Report
Date of submission : 14th Jun 2013, 11:59pm

Please provide a final report describing your method for collecting the final datasets of valid and rumor tweets. Here are some guidelines:

a. The final report should be at least 2 pages (single-spaced, 10-point font, single column).
b. The name of the file should be: <Lastname1>.<Lastname2>.stage1.report.pdf
c. In this report, you must describe the method you used to collect the tweets and your solution for tweet classification. Please make sure to describe why you chose the proposed solution and how it is better than other possible solutions.
d. You must include a section in the report about high level statistics for your collected dataset. The purpose of this section is to comment upon the quality of the tweets collected. For this purpose you could include the following measures:
- i. How large is the tweet set, in terms of both the number of tweets and the number of users?
- ii. How many of the tweets are retweets as opposed to original tweets?
- iii. How many tweets contain URLs?
- iv. Do the tweets mention people or contain small conversation snippets?
- v. How fast did the rumor tweets and truthful tweets appear?
With each of these measures/plots, you must point out the key insight gained from it and why it shows that the quality of the tweets you have collected is good (diverse set of users, diverse set of distinct rumor or valid tweets, etc.).
e. Please cite any other sources you use for tweet collection, cleaning, or classification.
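Several of the measures in item d can be computed directly from the collected tuples. The following is a rough sketch only: the function name `dataset_stats` is made up, treating a tweet that starts with "RT @" as a retweet is a heuristic (retweet metadata from the crawl, if you kept it, is more reliable), and the regular expressions for URLs and mentions are simple approximations. Measure v (how fast tweets appeared) is not covered here; it would need the timestamps binned over time.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # approximate URL matcher
MENTION_RE = re.compile(r"@\w+")       # approximate @mention matcher


def dataset_stats(tweets):
    """Compute simple counts for (tweet_id, text, user_id, timestamp) tuples."""
    tweets = list(tweets)
    return {
        "tweets": len(tweets),
        "users": len({t[2] for t in tweets}),
        # Heuristic: text-style retweets start with "RT @".
        "retweets": sum(1 for t in tweets if t[1].startswith("RT @")),
        "with_urls": sum(1 for t in tweets if URL_RE.search(t[1])),
        # Note: "RT @user" also counts as a mention under this pattern.
        "with_mentions": sum(1 for t in tweets if MENTION_RE.search(t[1])),
    }


if __name__ == "__main__":
    sample = [
        ("1", "RT @a: storm photo http://t.co/x", "u1", "2012-10-29 10:00:00"),
        ("2", "Is this real?", "u2", "2012-10-29 10:05:00"),
        ("3", "@b no, it is fake", "u1", "2012-10-29 10:06:00"),
    ]
    print(dataset_stats(sample))
```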

How to submit

You need to submit a compressed (.tar.gz) folder containing the dataset files and the final report. The submitted file should have the name <Lastname1>.<Lastname2>.stage1.tar.gz

You should upload your submission to the project submission site.

Seminar: Readings in Social Computing Systems

Suggested Readings