Anyone who has talked to me about my research in the last year and a half knows I’m constantly frustrated by the challenges of capturing and storing Twitter data (not to mention sharing – that’s another blog post). I hired a couple of undergrads to help me write scripts to automatically collect data and store it in a relational MySQL database where I can actually use it. We chose to use the streaming API because we limit data by person rather than by content. The Twitter Search API can handle only about 10 names at a time in the “from” or “mentions” query parameters. Since we’re studying over 1500 people, we’d have to run 150 different searches to get data for everyone. Using the Streaming API has its problems too – most notably that any time the script fails, we miss some data.
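To make the arithmetic concrete, here is a rough sketch of why the Search API is awkward for large user lists: you have to batch handles into OR-joined "from:" queries and run one search per batch. The batch size of 10 and the handle names are illustrative, not Twitter's documented limit.

```python
# Sketch: batching user handles into Search API "from:" queries.
# The 10-per-query limit and the handle names are illustrative.

def build_queries(handles, batch_size=10):
    """Join handles into OR'd "from:" clauses, one query per batch."""
    queries = []
    for i in range(0, len(handles), batch_size):
        batch = handles[i:i + batch_size]
        queries.append(" OR ".join("from:" + h for h in batch))
    return queries

# 1500 tracked accounts -> 150 separate searches
handles = ["user%04d" % n for n in range(1500)]
queries = build_queries(handles)
print(len(queries))  # 150
```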
Below, I provide some info and links to two different scripts for collecting data from Twitter. Both are written in Python. One uses the Streaming API and one uses the Search API. Depending on your needs, one will be better than the other. The two store data slightly differently as well. They both parse tweets into relational MySQL databases, but the structure of those databases differs. You’ll have to decide which API gets you the data you need and how you want your data stored.
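Neither script's actual schema is reproduced here, but the general idea of parsing a tweet into relational tables looks something like this sketch. The table and column names are my own invention, and the standard-library sqlite3 module stands in for a MySQL connection; the real pyTwitterCollector and TwitterGoggles schemas differ from this and from each other.

```python
import json
import sqlite3

# Illustrative schema only -- not the schema either script actually uses.
conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
conn.execute("""CREATE TABLE tweet (
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    created_at TEXT,
    text TEXT
)""")

def store_tweet(conn, raw_json):
    """Parse one raw tweet (a JSON string) into the tweet table."""
    t = json.loads(raw_json)
    conn.execute(
        "INSERT INTO tweet (id, user_id, created_at, text) VALUES (?, ?, ?, ?)",
        (t["id"], t["user"]["id"], t["created_at"], t["text"]),
    )

raw = '{"id": 1, "user": {"id": 42}, "created_at": "Mon Jan 01", "text": "hi"}'
store_tweet(conn, raw)
```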
Both options come with all the caveats of open-source software developed within academia. We can’t provide much support, and the software will probably have bugs. Both scripts are still in development, though, so chances are your issue will be addressed (or at least noticed) if you add it to the Issues on GitHub. If you know Python and MySQL and are comfortable setting up and managing cron jobs (and maybe writing shell scripts), you should be able to get one or both of them working for you.
Option 1: pyTwitterCollector and the Streaming API
When to use this option:
- You want to collect data from Twitter Lists (e.g., Senators in the 113th Congress)
- You want data from large groups of specific users
- You want data in real-time and aren’t worried about the past
- You need to run Python 2.7
- You want to cache the raw JSON to re-parse later
What to watch out for:
- Twitter allows only one standing connection per IP, so running multiple collectors is complicated
- You need to anticipate events since the script doesn’t look back in time
pyTwitterCollector, originally written in my lab, uses the Streaming API to capture tweets in real time. You can get the pyTwitterCollector code from GitHub.
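One habit worth highlighting from the list above is caching the raw JSON. The idea, independent of any particular streaming library, is to append each raw line to a cache file as it arrives, so you can re-parse everything later if your database schema changes. This is a minimal sketch; the file layout and field names are illustrative, not pyTwitterCollector's actual cache format.

```python
import json
import os
import tempfile

def cache_line(raw_line, cache_file):
    """Append one raw JSON line from the stream before parsing it."""
    cache_file.write(raw_line.rstrip("\n") + "\n")

def reparse(cache_path):
    """Re-read the cache and parse it again, e.g. after a schema change."""
    with open(cache_path) as f:
        return [json.loads(line) for line in f if line.strip()]

# demo: cache two raw tweets, then re-parse them from disk
fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
with open(path, "w") as f:
    cache_line('{"id": 1, "text": "first"}', f)
    cache_line('{"id": 2, "text": "second"}', f)
tweets = reparse(path)
```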
Option 2: TwitterGoggles and the Search API
When to use this option:
- You want data about specific terms (e.g., Obamacare)
- You want data from before the script starts (how far back you can go changes on Twitter’s whim)
- You can run Python 3.3
What to watch out for:
- Complex queries may need to be broken into more than one job (what counts as complicated is up to Twitter – if a query is too complicated, the search simply fails with no feedback)
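Since a too-complex query fails silently, one workaround is to split an OR-heavy query into several smaller jobs and de-duplicate the results by tweet id afterward. The threshold of three clauses below is arbitrary; Twitter doesn't document the actual limit, so you have to tune it until searches stop failing.

```python
def split_query(query, max_clauses=3):
    """Split an OR-joined query into jobs of at most max_clauses terms.

    max_clauses is a guess -- Twitter does not document the real
    complexity limit, so adjust it until your searches succeed."""
    clauses = [c.strip() for c in query.split(" OR ")]
    return [" OR ".join(clauses[i:i + max_clauses])
            for i in range(0, len(clauses), max_clauses)]

jobs = split_query('obamacare OR #aca OR "affordable care act" OR hhs')
# four clauses become two jobs: one of three clauses, one of one
```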