Seminar: Readings in Social Computing Systems

Project stage I: Building a ground truth dataset

To understand how rumors spread in microblogs, we first need to build a ground truth dataset of known rumors that we can analyze further. The first stage of the project requires you to collect relevant data about valid (truthful) and rumor tweets on different topics, and to classify the tweets as valid or rumor. You need to plan and develop the methodology for building the ground truth dataset. You will use this dataset for the rest of the project stages. To get an idea of how this is currently done, take a look at the “Suggested Readings” at the end of this page.

In this stage you have to collect and process tweets about different topics. You have to collect at least 100 rumor tweets (propagating false information) and 100 valid tweets (propagating true information) for at least 3 of the topics stated below. However, feel free to select up to 5 different topics in total (that is, up to 2 additional topics).

You can use Twitter’s built-in search feature to extract tweets on a topic. It supports related keyword search, hashtag search, and so on. The Twitter Search API documentation is available at https://dev.twitter.com/docs/api/1.1/get/search/tweets.
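A search request against this endpoint can be sketched as follows. This is only a minimal illustration: the helper name build_search_url and the query "#exampletopic" are hypothetical, the parameters q, count, result_type and max_id are the ones listed in the Search API v1.1 documentation, and the OAuth authentication that the API requires is deliberately left out.

```python
from urllib.parse import urlencode

# Search API v1.1 endpoint (see the documentation link above).
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100, result_type="recent", max_id=None):
    """Build the GET URL for one page of search results.

    Hypothetical helper: it only assembles the URL. Sending the request
    additionally requires OAuth credentials, which are not shown here.
    """
    params = {"q": query, "count": count, "result_type": result_type}
    if max_id is not None:
        # max_id asks for tweets with an ID less than or equal to this value,
        # which is how older pages of results are reached.
        params["max_id"] = max_id
    return SEARCH_URL + "?" + urlencode(params)


# "#exampletopic" is a placeholder, not one of the project topics.
print(build_search_url("#exampletopic"))
```

Whatever client you use to send the request, store the full response as received, since the raw data may be needed in later stages.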

Once you have collected the tweets on a topic, you need to classify them as rumor tweets or valid tweets. One way to classify tweets is to go over them manually. This can be quite laborious, so we encourage you to come up with automated ways of classifying tweets. As these are known rumors, you could rely on crowdsourced information from the microblogging service or on third-party rumor aggregation services. Information contained within the tweets, such as hashtags or particular phrases, could also provide valuable cues for detecting rumor tweets. Please note that while classifying the tweets you can use a finer-grained classification, such as whether a tweet is endorsing a rumor, refuting it, neutral about it, or questioning it. This finer-grained classification will be immensely helpful for the next stages of the project. For the types of finer-grained classification, please go through section 4.1 of the first suggested reading.
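As a starting point, phrase-based cues could be turned into a simple rule-based labeler such as the sketch below. The cue phrases, the order in which the labels are checked, and the function name label_tweet are all assumptions made for illustration. They are not taken from the suggested readings, and a real cue list would have to be built and validated per topic.

```python
import re

# Hypothetical cue phrases per label, checked in this order.
# These are illustrative guesses, not a validated lexicon.
CUES = [
    ("refuting", [r"\bfake\b", r"\bhoax\b", r"\bnot true\b", r"\bdebunked\b"]),
    ("questioning", [r"\?", r"\bis it true\b", r"\bunconfirmed\b"]),
    ("endorsing", [r"\bconfirmed\b", r"\bbreaking\b"]),
]


def label_tweet(text):
    """Return a coarse stance label for a tweet about a known rumor."""
    lowered = text.lower()
    for label, patterns in CUES:
        if any(re.search(pattern, lowered) for pattern in patterns):
            return label
    # Assumption: a tweet with no matching cue is treated as neutral.
    return "neutral"


print(label_tweet("This photo is FAKE, stop sharing it"))
print(label_tweet("Is this real?"))
```

Rules like these are crude, so it is worth checking a manually labeled sample of tweets against the automatic labels before relying on them.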

Your solution for this stage will be primarily graded based on:

Note: Please DO NOT delete the collected raw data after cleaning. You may need this raw data for later stages of the project.

Expected output

The various project milestones for stage 1 are below.

1. Status update presentation
Date of submission : 28th May 2013, 11:59pm

During the class on 29th May 2013, each group must give a short 5-minute status update presentation. Please note that we will cut off each presentation at exactly 5 minutes, and you will have an additional 5 minutes to get feedback and answer any questions. The presentations must be uploaded to the submission site by 28th May at the latest. They should be in ppt, pptx, or pdf format and should have the following name: projectPhase1.status.<lastname>.ppt/pptx/pdf

The purpose of this presentation is to give a status update, present your planned methodology for tweet collection and classification, and to get feedback on it.

2. Dataset of valid and rumor tweets
Date of submission : 14th Jun 2013, 11:59pm

For each topic, you have to submit two text data files: one for rumor tweets and the other for valid tweets. Please note the following guidelines:

3. Final Report
Date of submission : 14th Jun 2013, 11:59pm

Please provide a final report describing your method for collecting the final datasets of valid and rumor tweets. Here are some guidelines:

How to submit

You need to submit a compressed (.tar.gz) folder containing the dataset files and the final report. The submitted file should have the name <Lastname1>.<Lastname2>.stage1.tar.gz
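One possible way to produce such an archive is with Python's standard tarfile module, as sketched below. The function name make_submission and the folder layout are assumptions. Only the archive naming pattern comes from the instructions above.

```python
import tarfile
from pathlib import Path


def make_submission(folder, lastname1, lastname2):
    """Pack `folder` into <Lastname1>.<Lastname2>.stage1.tar.gz next to it.

    Hypothetical helper: `folder` is assumed to already contain the
    dataset files and the final report.
    """
    folder = Path(folder)
    archive = folder.parent / f"{lastname1}.{lastname2}.stage1.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps only the folder name inside the archive,
        # not the full path on the local machine.
        tar.add(folder, arcname=folder.name)
    return archive
```

For example, make_submission("stage1", "Lastname1", "Lastname2") would write Lastname1.Lastname2.stage1.tar.gz next to the stage1 folder.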

You should upload your submission to the project submission site.

Suggested Readings



