Posts tagged with: python

Fetching articles from the New York Times API

I’m working on a paper for the Midwest Political Science Association meeting in which we analyze whether policy issues appear first in Congress’s tweets or in the popular press. We’re using all articles from the New York Times, including those from the Associated Press, Reuters, and other providers, as “popular press” content. To collect the articles, I wrote a Python script that loops through the dates we’re interested in, saves the JSON provided by the Times Article API, and finally parses that JSON into a CSV for use in Stata/R/whatever.

You may be wondering why I bother storing the JSON at all. For three reasons, really:

  1. Something could go wrong if I didn’t, and I’d have to fetch it again;
  2. The New York Times is nice enough to allow programmatic access to its articles, but that doesn’t mean I should query the API every time I want data; and,
  3. I may need different subsets or forms of the same “raw” data for different analyses.

So, instead of querying the API and parsing its response all at once, I query it once and cache the raw data, lessening the burden on the Times API and saving a local copy of the “raw” data. Then, I parse that raw data into whatever format I need – in this case a tab-delimited file with only some of the fields – and leave the raw data alone. Next time I have a research question that relies on the same articles, I can just re-parse the stored JSON files into whatever format helps me answer my new question.
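
Here’s a minimal sketch of what that re-parsing step might look like, separate from the full script below. The file pattern and the idea of pulling only headlines are hypothetical examples, not part of the script:

import glob
import json

# sketch: pull just the headlines back out of cached JSON files
# (the *.json pattern assumes the date.page.json naming scheme used by the script below)
def headlinesFromCache(json_file_path):
    headlines = []
    for file_name in glob.glob(json_file_path + "*.json"):
        with open(file_name) as json_file:
            articles = json.load(json_file)
        for article in articles["response"]["docs"]:
            if article.get("headline") and article["headline"].get("main"):
                headlines.append(article["headline"]["main"])
    return headlines

The point is that the cached JSON stays untouched; each new question just gets its own small parser like this one.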

The script is a work in progress. See the repo README on GitHub for information about planned changes.

Why “raw” data? Well, even the JSON that the Times provides has been processed. The Times has chosen some data to give me and some data to keep to itself (e.g., the full text of articles). The data the API returns is raw to me, meaning it’s my starting point. Whether data is ever really raw, or whether that’s even a useful term, I leave up to you for now.

The Code

import urllib2
import json
import datetime
import time
import sys
import argparse
import logging
from urllib2 import HTTPError
 
# helper function to iterate through dates
def daterange( start_date, end_date ):
    if start_date <= end_date:
        for n in range( ( end_date - start_date ).days + 1 ):
            yield start_date + datetime.timedelta( n )
    else:
        for n in range( ( start_date - end_date ).days + 1 ):
            yield start_date - datetime.timedelta( n )
 
# helper function to get json into a form I can work with (converts unicode to UTF-8 byte strings; this script is Python 2)
def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
 
# helpful function to figure out what to name individual JSON files        
def getJsonFileName(date, page, json_file_path):
    json_file_name = ".".join([date,str(page),'json'])
    json_file_name = "".join([json_file_path,json_file_name])
    return json_file_name
 
# helpful function for processing keywords, mostly    
def getMultiples(items, key):
    values_list = ""
    if len(items) > 0:
        num_keys = 0
        for item in items:
            if num_keys == 0:
                values_list = item[key]                
            else:
                values_list =  "; ".join([values_list,item[key]])
            num_keys += 1
    return values_list
 
# get the articles from the NYTimes Article API    
def getArticles(date, api_key, json_file_path):
    # LOOP THROUGH THE 101 PAGES NYTIMES ALLOWS FOR THAT DATE
    for page in range(101):
        try:
            request_string = "http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=" + date + "&end_date=" + date + "&page=" + str(page) + "&api-key=" + api_key
            response = urllib2.urlopen(request_string)
            content = response.read()
            if content:
                articles = convert(json.loads(content))
                # if there are articles here
                if len(articles["response"]["docs"]) >= 1:
                    json_file_name = getJsonFileName(date, page, json_file_path)
                    json_file = open(json_file_name, 'w')
                    json_file.write(content)
                    json_file.close()
                # if no more articles, go to next date
                else:
                    return
            # else:
            #     break
            time.sleep(3)
        except HTTPError as e:
            logging.error("HTTPError on page %s on %s (err no. %s: %s) Here's the URL of the call: %s", page, date, e.code, e.reason, request_string)
        except:
            logging.error("Error on %s page %s: %s", date, page, sys.exc_info()[0])
            continue
 
# parse the JSON files you stored into a tab-delimited file
def parseArticles(date, csv_file_name, json_file_path):
    for file_number in range(101):
        # get the articles and put them into a dictionary
        try:
            file_name = getJsonFileName(date,file_number, json_file_path)
            in_file = open(file_name, 'r')
            articles = convert(json.loads(in_file.read()))
            in_file.close()
        except IOError as e:
			logging.error("IOError in %s page %s: %s %s", date, file_number, e.errno, e.strerror)
			continue
 
        # if there are articles in that document, parse them
        if len(articles["response"]["docs"]) >= 1:  
            # open the CSV for appending
            try:
                out_file = open(csv_file_name, 'ab')
            except IOError as e:
    			logging.error("IOError: %s %s", date, file_number, e.errno, e.strerror)
    			continue
 
            # loop through the articles putting what we need in a CSV   
            try:
                for article in articles["response"]["docs"]:
                    # if (article["source"] == "The New York Times" and article["document_type"] == "article"):
                    keywords = ""
                    keywords = getMultiples(article["keywords"],"value")
 
                    # should probably pull these if/else checks into a module
                    variables = [
                        article["pub_date"], 
                        keywords, 
                        str(article["headline"]["main"]).decode("utf8").replace("\n","") if "main" in article["headline"].keys() else "", 
                        str(article["source"]).decode("utf8") if "source" in article.keys() else "", 
                        str(article["document_type"]).decode("utf8") if "document_type" in article.keys() else "", 
                        article["web_url"] if "web_url" in article.keys() else "",
                        str(article["news_desk"]).decode("utf8") if "news_desk" in article.keys() else "",
                        str(article["section_name"]).decode("utf8") if "section_name" in article.keys() else "",
                        str(article["snippet"]).decode("utf8").replace("\n","") if "snippet" in article.keys() else "",
                        str(article["lead_paragraph"]).decode("utf8").replace("\n","") if "lead_paragraph" in article.keys() else "",
                        ]
                    line = "\t".join(variables)
                    out_file.write(line.encode("utf8")+"\n")
            except KeyError as e:
                logging.error("KeyError in %s page %s: %s %s", date, file_number, e.errno, e.strerror)
                continue
            except (KeyboardInterrupt, SystemExit):
                raise
            except: 
                logging.error("Error on %s page %s: %s", date, file_number, sys.exc_info()[0])
                continue
 
            out_file.close()
        else:
            break
 
# Main function where stuff gets done
 
def main():
    parser = argparse.ArgumentParser(description="A Python tool for grabbing data from the New York Times Article API.")
    parser.add_argument('-j','--json', required=True, help="path to the folder where you want the JSON files stored")
    parser.add_argument('-c','--csv', required=True, help="path to the file where you want the CSV file stored")
    parser.add_argument('-k','--key', required=True, help="your NY Times Article API key")
    # parser.add_argument('-s','--start-date', required=True, help="start date for collecting articles")
    # parser.add_argument('-e','--end-date', required=True, help="end date for collecting articles")
    args = parser.parse_args()
 
    json_file_path = args.json
    csv_file_name = args.csv
    api_key = args.key    
    start = datetime.date( year = 2013, month = 1, day = 1 )
    end = datetime.date( year = 2013, month = 1, day = 1 )
    log_file = "".join([json_file_path,"getTimesArticles_testing.log"])
    logging.basicConfig(filename=log_file, level=logging.INFO)
 
    logging.info("Getting started.") 
    try:
        # LOOP THROUGH THE SPECIFIED DATES
        for date in daterange( start, end ):
            date = date.strftime("%Y%m%d")
            logging.info("Working on %s." % date)
            getArticles(date, api_key, json_file_path)
            parseArticles(date, csv_file_name, json_file_path)
    except:
        logging.error("Unexpected error: %s", str(sys.exc_info()[0]))
    finally:
        logging.info("Finished.")
 
if __name__ == '__main__':
    main()
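
Assuming you save the script as something like getTimesArticles.py (the file name is up to you), you run it by passing the JSON folder, the CSV file, and your API key:

    python getTimesArticles.py -j /path/to/json/ -c /path/to/articles.csv -k YOUR_API_KEY

Note that the JSON folder path is joined to file names by simple concatenation, so include the trailing slash.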

Setting up an EC2 instance for TwitterGoggles

TwitterGoggles requires Python 3.3. I’m new to Python, and 3.3 is (relatively) new to everyone. So, getting help is both necessary and challenging. I want to run TwitterGoggles on Amazon EC2 instances, so I’m setting up an AMI that has all of the requirements:

  • gcc 4.6.3
  • git 1.8.1.4
  • mlocate 0.22.2
  • MySQL 5.5
  • Python 3.3

I started with an Amazon Linux AMI and installed the stuff I needed. You can save yourself some trouble by launching an instance with my AMI: ami-e73b558e.

Install Dependencies

  1. Update the system
    sudo yum update
  2. Install C compiler so we can install Python
    sudo yum install gcc
  3. Install software yum can take care of for us
    sudo yum groupinstall "Development tools"
    sudo yum install -y mysql git mlocate
  4. Update the DB locate uses to find your stuff
    sudo updatedb

Install Python 3.3.1

Here’s the best guide: http://www.unixmen.com/howto-install-python-3-x-in-ubuntu-debian-fedora-centos/

Basically, you have to:

  1. Download the release you want. In my case
    wget http://www.python.org/ftp/python/3.3.1/Python-3.3.1.tgz
  2. Extract the compressed files and change into the directory
    gunzip Python-3.3.1.tgz
    tar xf Python-3.3.1.tar
    cd Python-3.3.1
  3. Configure, compile, and install
    sudo ./configure --prefix=/opt/python3
    sudo make
    sudo make install
  4. Add python3 to your path
    export PATH=$PATH:/opt/python3/bin

Install easy_install-3.3

I ran into some problems with missing “zlib” errors. I reinstalled zlib from source, then reconfigured and reinstalled Python 3.3.1. Once that worked, I was able to install and use easy_install-3.3 for module management.

wget http://pypi.python.org/packages/source/d/distribute/distribute-0.6.39.tar.gz
tar xf distribute-0.6.39.tar.gz
cd distribute-0.6.39
sudo python3 setup.py install


Two Python scripts for gathering Twitter data

Anyone who has talked to me about my research in the last year and a half knows I’m constantly frustrated by the challenges of capturing and storing Twitter data (not to mention sharing it – that’s another blog post). I hired a couple of undergrads to help me write scripts to automatically collect data and store it in a relational MySQL database where I can actually use it. We chose to use the Streaming API because we limit data by person rather than by content. The Twitter Search API can handle only about 10 names at a time in the “from” or “mentions” query parameters. Since we’re studying over 1500 people, we’d have to run 150 different searches to get data for everyone. Using the Streaming API has its problems too – most notably that any time the script fails, we miss some data.
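
To make that arithmetic concrete, here’s a rough sketch (not taken from either script) of how accounts would have to be batched into Search API queries. The handles and the batch size of 10 are made up for illustration:

# sketch: batching accounts into "from:" OR queries for the Search API
# (handles and the batch size of 10 are illustrative, not from either script)
handles = ["senatorsmith", "repjones", "senatordoe"]  # imagine ~1500 of these

batch_size = 10
queries = []
for i in range(0, len(handles), batch_size):
    batch = handles[i:i + batch_size]
    queries.append(" OR ".join("from:" + h for h in batch))

# with 1500 handles and 10 per query, that works out to 150 separate searches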

Below, I provide some info and links to two different scripts for collecting data from Twitter. Both are written in Python. One uses the Streaming API and one uses the Search API. Depending on your needs, one will be better than the other. The two store data slightly differently as well. They both parse tweets into relational MySQL databases, but the structure of those databases differs. You’ll have to decide which API gets you the data you need and how you want your data stored.

Both options come with all the caveats of open-source software developed within academia. We can’t provide much support, and the software will probably have bugs. Both scripts are still in development, though, so chances are your issue will get addressed (or at least noticed) if you add it to the Issues on GitHub. If you know Python and MySQL and are comfortable setting up and managing cron jobs and maybe shell scripts, you should be able to get one or both of them to work for you.

Option 1: pyTwitterCollector and the Streaming API

When to use this option:

  • You want to collect data from Twitter Lists (e.g., Senators in the 113th Congress)
  • You want data from large groups of specific users
  • You want data in real-time and aren’t worried about the past
  • You need to run Python 2.7
  • You want to cache the raw JSON to re-parse later

What to watch out for:

  • Twitter allows only one standing connection per IP, so running multiple collectors is complicated
  • You need to anticipate events since the script doesn’t look back in time

Originally written in my lab, pyTwitterCollector uses the streaming API to capture tweets in real time. You can get the pyTwitterCollector code from GitHub.

Option 2: TwitterGoggles and the Search API

When to use this option:

  • You want data about specific terms (e.g., Obamacare)
  • You want data from before the script starts (how far back you can go changes at Twitter’s whim)
  • You can run Python 3.3

What to watch out for:

  • Complex queries may need to be broken into more than one job (what counts as complicated is up to Twitter – if a query is too complicated, the search just fails with no feedback)

Originally written by Phil Maconi and Sean Goggins, TwitterGoggles uses the search API to gather previously posted tweets. You can get the TwitterGoggles code from GitHub.