Posts tagged with: congress

Who said it first – Congress or the press?

Sometimes Congress, sometimes the press, it turns out. Matt Shapiro and I wrote a paper for this month’s Midwest Political Science Association meeting in which we analyzed the timing of tweets with hashtags and New York Times articles with keywords and found

… news coverage and Twitter activity from the previous day are good predictors of news coverage and Twitter attention on any given day.

We wondered whether political issues popular on Twitter were popular in the press as well and whether issues cropped up among politicians on Twitter or in the press first. So, we retrieved all the articles available from the New York Times Article API for 2013 and all of the tweets Twitter would let us have for members of Congress (see links to code for collecting data below). We focused on hashtags and article keywords for six policy areas: budget, immigration, environment, energy, the Affordable Care Act (ACA), and marginalized groups (e.g., LGBT, military veterans, Latinos, etc.) and compared the timelines of when those issues were referenced in tweets and in articles.
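A minimal sketch of how daily issue timelines could be built from tweets, assuming each tweet is reduced to a date plus its hashtags. The hashtag lists and the `daily_issue_counts` helper are hypothetical and are not the lists or code used in the paper.

```python
from collections import Counter
from datetime import date

# Hypothetical hashtag lists; the study's actual lists are not reproduced here.
ISSUE_TAGS = {
    "budget": {"budget", "sequester"},
    "immigration": {"immigration", "cir"},
}

def daily_issue_counts(tweets, issue_tags):
    """Count tweets per (day, issue); a tweet counts once per issue it matches."""
    counts = Counter()
    for day, hashtags in tweets:
        # Normalize "#Budget" and "budget" to the same token.
        tags = {tag.lower().lstrip("#") for tag in hashtags}
        for issue, wanted in issue_tags.items():
            if tags & wanted:
                counts[(day, issue)] += 1
    return counts

# Made-up tweets, for illustration only.
tweets = [
    (date(2013, 1, 1), ["#Budget", "#jobs"]),
    (date(2013, 1, 1), ["#sequester", "#budget"]),
    (date(2013, 1, 2), ["#CIR"]),
]
counts = daily_issue_counts(tweets, ISSUE_TAGS)
```

The same function would work for articles if each article is reduced to a date plus its keywords, which gives two comparable daily series per issue.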

The tables below show the results of our regressions. Most of the issues were similarly popular in the press and on Twitter on the same day. However, for immigration, the previous day’s Twitter activity is a better predictor of news coverage than the previous day’s news coverage. For marginalized groups, neither prior news nor prior tweets are good predictors of a day’s news, suggesting that attention both in the news and on Twitter is spotty (or bursty) for marginalized groups.

The strong correlations between issues’ Twitter activity and news coverage on the same day (see models labeled “b” in the tables below) suggest, at least, that the press and Congress are giving attention to similar issues.

                         Budget             Immigration        Environment
                         (1a)     (1b)      (2a)     (2b)      (3a)     (3b)
Previous day’s news      .377***  .361***   .202***  .176***   .169***  .169***
Previous day’s tweets    .366***  .114*     .287***  .224***   .111**   .115**
Same day’s tweets                 .351***            .184***            -.012
F-statistic              145.03   119.71    34.29    27.75     7.97     5.31
R²                       0.45     0.50      0.16     0.19      0.04     0.04
N                        364      364       364      364       364      364


                         Energy             ACA                Marginalized
                         (4a)     (4b)      (5a)     (5b)      (6a)     (6b)
Previous day’s news      .171***  .171***   .382***  .296***   .074     .078
Previous day’s tweets    .129**   .115**    .201***  .073      .007     .035
Same day’s tweets                 .042               .313***            -.081
F-statistic              9.45     6.50      64.02    57.87     1.00     1.39
R²                       0.05     0.05      0.26     0.33      0.01     0.01
N                        364      364       364      364       364      364

Note: Each count of articles and tweets is a standard score, and beta coefficients for each predictor are reported. Predictors’ significance is indicated with asterisks, where *, **, and *** represent p<0.1, p<0.05, and p<0.001, respectively.
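The "a" and "b" models can be sketched roughly as follows, assuming the dependent variable is a day's standardized news count and the predictors are the lagged (and, for "b", same-day) standardized counts. This is an illustration in plain Python with synthetic data, not the code or data behind the tables; `lagged_model`, `solve`, and `zscore` are hypothetical helpers, and standardizing before lagging only approximates true beta coefficients.

```python
import random
from statistics import mean, pstdev

def zscore(values):
    """Convert raw daily counts to standard scores."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lagged_model(news, tweets, same_day=False):
    """Regress a day's news on the previous day's news and tweets (model "a"),
    optionally adding the same day's tweets (model "b"). Returns (betas, R^2)."""
    news_z, tweets_z = zscore(news), zscore(tweets)
    y = news_z[1:]  # day t; taking a one-day lag drops the first day
    columns = [news_z[:-1], tweets_z[:-1]]
    if same_day:
        columns.append(tweets_z[1:])
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Ordinary least squares via the normal equations (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return beta[1:], 1 - ss_res / ss_tot

# Synthetic counts for a 365-day year, for illustration only.
random.seed(0)
tweets = [random.randint(0, 40) for _ in range(365)]
news = [10 + 0.5 * t + random.randint(0, 10) for t in tweets]
betas_a, r2_a = lagged_model(news, tweets)
betas_b, r2_b = lagged_model(news, tweets, same_day=True)
```

With 365 daily observations, a one-day lag leaves 364 usable days, which would be consistent with the N reported in the tables.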

Python Code for Collecting the Data

Collecting and Connecting On- and Offline Political Network Data

I gave a talk at the DIMACS Workshop on Building Communities for Transforming Social Media Research Through New Approaches for Collecting, Analyzing, and Exploring Social Media Data at Rutgers University last week. Here are my slides and roughly what I said:

Many of today’s talks are about gathering big social data or automating its analysis. I’m going to focus instead on connecting disparate data sources to increase the impacts of social media research.

Most of my work is about public policy, civic action, and how social media plays a role in each. Today, I’ll talk mainly about a study of how Congress uses Twitter and how that use influences public discussion of policy.

I’m in a department of humanities but have degrees in information and work experience in web development. My position allows me to witness divides in how different disciplines think about data and research, and I’ve included a few comments in my talk about why those divides matter.

Often when we study social media we’re looking at trends or trying to generalize about or understand whole populations, but I’m interested in a specific subset of people, what they do online, and how that online activity influences the offline world. This focus allows me to connect what we know about people offline with what they do online quite reliably. For instance, I can connect data such as party affiliation, geolocation, gender, tenure in Congress, chamber, voting record, campaign contributions, the list goes on, because these values are known for members of Congress. Getting their data from Twitter is trickier. Govtrack does a nice job keeping track of official accounts, but politicians often have 2 or 3 accounts – for campaign messaging, for personal use – and there are many bogus accounts out there. Once you find their accounts, you can use a variety of tools to capture their Twitter data. I wrote my own in Python and MySQL because most other free tools focus on hashtags and the Search API, and those don’t return the Twitter data I need.

So, back to the study. There’s plenty of hype in the popular press about how politicians wield social media influence to impact policy. Look at how Obama used the internet to become President! I wondered whether these claims are true – was social media providing a new route to influence for members of Congress?

Turns out, not so much. The people positioned to exert the most influence online occupy similar positions of power offline.

How can I be sure? Network theory, especially social capital theory, provides tools for making judgments about the relative power of people as a function of their positions in a network. And luckily, I’m not the only one who thinks network theory is a good way to interrogate power in Congress. Other researchers have used network theory to analyze relationships among bill co-sponsors, roll call votes, congressional committees, and press events. I’ll focus just on cosponsorship because it’s the most widely used measure of legislative influence. I compared legislative influence with a person’s ability to control the spread of information.

To do so, I used Jim Fowler’s cosponsorship network data and his measure of influence – connectedness – which is a weighted closeness centrality measure for the network analysts among us.
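For readers who have not met closeness centrality, here is a generic weighted version, sketched under the assumption that a stronger tie means a shorter distance (distance = 1 / weight). It is not Fowler's connectedness measure, whose exact definition is in his papers; `weighted_closeness` and the toy edges are hypothetical.

```python
import heapq

def weighted_closeness(edges, node):
    """Closeness of `node`: (nodes reached) / (sum of shortest distances),
    where each directed edge's distance is 1 / weight."""
    graph = {}
    for (source, target), weight in edges.items():
        graph.setdefault(source, []).append((target, 1.0 / weight))
        graph.setdefault(target, [])
    # Dijkstra's algorithm from `node`.
    dist = {node: 0.0}
    heap = [(0.0, node)]
    while heap:
        d, current = heapq.heappop(heap)
        if d > dist.get(current, float("inf")):
            continue  # stale heap entry
        for neighbor, length in graph.get(current, []):
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# Toy directed network: a -> b (weight 2), b -> c (weight 1).
edges = {("a", "b"): 2, ("b", "c"): 1}
closeness_a = weighted_closeness(edges, "a")
closeness_c = weighted_closeness(edges, "c")
```

In the toy network, "a" can reach everyone and "c" can reach no one, so "a" is better positioned to spread information.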

My data is a network of mentions among members of Congress. I have a few hundred thousand tweets, from 2008 to the present, in which members of Congress mention one another about 75,000 times. Every mention creates an explicit, directed link between two members, and these links form the network I’m interested in.

Fowler’s data is an undirected, implicit network in that members are connected through their affiliation with legislation. To use Fowler’s data, I needed a way to connect members of Congress on Twitter to members in his data, a sort of key, if you will. Keith Poole’s ICPSR ID is a widely used unique identifier for individual members of Congress (that Fowler also uses) so, I developed a mapping of Twitter ID to ICPSR ID.
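A minimal sketch of what such a key makes possible, assuming the mapping is a simple lookup from Twitter user ID to ICPSR ID. All IDs below are made up, and dropping mentions between two accounts of the same member is an assumption of this sketch, not a documented step of the study.

```python
from collections import Counter

# Hypothetical crosswalk: Twitter user ID -> ICPSR ID. Several accounts
# (official, campaign, personal) may map to the same member.
twitter_to_icpsr = {
    "1001": 10001,
    "1002": 10001,  # second account for the same member
    "2001": 20002,
}

def member_edges(mentions, crosswalk):
    """Collapse account-level mentions into member-level directed edge weights."""
    edges = Counter()
    for source, target in mentions:
        src, tgt = crosswalk.get(source), crosswalk.get(target)
        # Skip accounts outside the crosswalk and members mentioning themselves.
        if src is None or tgt is None or src == tgt:
            continue
        edges[(src, tgt)] += 1
    return edges

mentions = [
    ("1001", "2001"),
    ("1002", "2001"),
    ("2001", "1001"),
    ("1001", "1002"),  # same member, two accounts
    ("9999", "2001"),  # not a member of Congress
]
edges = member_edges(mentions, twitter_to_icpsr)
```

Once edges are keyed by ICPSR ID, they can be joined to any other dataset that uses the same identifier.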

So Fowler used bill cosponsorship networks to figure out who wields influence over what legislation gets addressed and eventually passed. Turns out members with high connectedness are more effective at convincing their peers to vote with them. I used the same algorithm to measure who wields influence over the spread of information online. Being able to spread information quickly allows politicians to control the conversation. We know from studies of framing, for instance, that the first frame is likely to get traction and essentially constrain future discussions of an issue. What I found was that people with legislative influence also control information. While this correlation isn’t necessarily causal in either direction, what matters is that the same people wield influence both online and off. I can tell you a little more about those people too, because I was able to connect online behavior data with off-line demographic and political data.

Members of Congress who control the conversation, or at least are in a position to, are male House Republicans. If you’re a female, Democrat, feminist scholar like me, that’s scary. What are some of the implications of that information control? Well, for starters, male House Republicans nearly never talk about issues facing marginalized groups such as pay inequality, discrimination, and poverty. Instead, the online conversation is about the ACA, gun control, and the debt. Whether you think we should be talking more about poverty or the ACA, I expect you’d agree that some diversity in both topics and talkers would be welcome. But that’s not what I’m seeing. Instead, I’m seeing male House Republicans controlling the spread of information and attention through the Congressional network and to all of its followers.

We’ve now taken a whirlwind tour through one of my studies of political communication on Twitter. What did we find that matters? First, social media data is most useful when we connect it with other data. Second, social media is not providing an alternate route to power for members of Congress. Third, maleness and Republicanness are the most reliable routes to influence online. From my view, these results paint a pretty bleak picture where social media doesn’t actually challenge the status quo, and groups that wield disproportionate, and often oppressive, influence offline do so online as well. This isn’t quite the democratizer or equalizer I was hoping for, but as I work to understand what’s happening among citizens, maybe I’ll see something different.

I’d rather not end on a depressing note, so let me end on a call to action instead. The technical expertise required to do this study (to collect Fowler’s data, to collect Twitter data, to do the statistical and network analyses) may seem second nature to many of us here, but it is not to the people best equipped to interrogate this data. Let’s not measure impact by the size of our dataset or the lines of code we had to write to get it. For instance, I have a colleague in political science at IIT who knows much more than I do about legislative influence and political communication. He shouldn’t also be asked to learn R and Python to contribute to the discussion about social media’s role in influencing public policy discussions. I hope we can remove, or at least diminish, the technical barriers for subject matter experts and scientists of other stripes to use [big] [open] [social] data and that we can change graduate education to train students in both social theory and technical tools.

See tweets from the conference: http://seen.co/event/dimacs-workshop-on-building-communities-for-social-media-research-core-building-rutgers-university-new-brunswick-nj-2014-4203

Access Fowler’s papers and data: http://jhfowler.ucsd.edu/cosponsorship.htm

Access Poole’s ICPSR ID data and information: http://www.voteview.com/icpsr.htm

Fetching articles from the New York Times API

I’m working on a paper for the Midwest Political Science Association meeting in which we analyze whether policy issues appear first in Congress’s tweets or in the popular press. We’re using all articles from the New York Times, including those from the Associated Press, Reuters, and other providers, as “popular press” content. In order to collect the articles, I wrote a Python script that loops through dates we’re interested in, saves the JSON provided by the Times Articles API, and finally parses that JSON into a CSV for use in Stata/R/whatever.

You may be wondering why I bother storing the JSON at all. For three reasons, really:

  1. Something could go wrong if I didn’t, and I’d have to fetch it again;
  2. The New York Times is nice enough to allow programmatic access to its articles, but that doesn’t mean I should query the API every time I want data; and,
  3. I may need different subsets or forms of the same “raw” data for different analyses.

So, instead of querying the API and parsing its response all at once, I query it once and cache the raw data, lessening the burden on the Times API and saving a local copy of the “raw” data. Then, I parse that raw data into whatever format I need – in this case a tab-delimited file with only some of the fields – and leave the raw data alone. Next time I have a research question that relies on the same articles, I can just re-parse the stored JSON files into whatever format helps me answer my new question.
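The query-once, parse-many pattern described above can be sketched like this. It is a Python 3 illustration, not the script itself: `cached_json` and `fake_fetch` are hypothetical, and the fetch function stands in for a real API call so the example runs offline.

```python
import json
import os
import tempfile

def cached_json(cache_dir, name, fetch):
    """Return parsed JSON for `name`, calling `fetch()` only on a cache miss."""
    path = os.path.join(cache_dir, name + ".json")
    if not os.path.exists(path):
        raw = fetch()  # the only place the API would be contacted
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(raw)
    # Always parse from the stored copy so the raw data stays untouched.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

calls = []

def fake_fetch():
    """Stand-in for an API request; records how often it is called."""
    calls.append(1)
    return json.dumps({"response": {"docs": [{"web_url": "http://example.com/a"}]}})

with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_json(cache_dir, "20130101.0", fake_fetch)
    second = cached_json(cache_dir, "20130101.0", fake_fetch)
```

The second call is served from disk, so a new research question costs a re-parse rather than another round of API requests.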

The script is a work in progress. See the repo README on Github for info about planned changes.

Why “raw” data? Well, even the JSON that the Times provides has been processed. The Times has chosen some data to give me and some data to keep to itself (e.g., the full text of articles). The data the API returns is raw to me, meaning it’s my starting point. Whether data is ever really raw, or if that’s even a useful term, I leave up to you for now.

The Code

import urllib2
import json
import datetime
import time
import sys
import argparse
import logging
from urllib2 import HTTPError
# helper function to iterate through dates (inclusive, forward or backward)
def daterange( start_date, end_date ):
    if start_date <= end_date:
        for n in range( ( end_date - start_date ).days + 1 ):
            yield start_date + datetime.timedelta( n )
    else:
        for n in range( ( start_date - end_date ).days + 1 ):
            yield start_date - datetime.timedelta( n )
# helper function to get json into a form I can work with
# (Python 2: turns unicode strings into UTF-8 byte strings, recursively)
def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
# helpful function to figure out what to name individual JSON files        
def getJsonFileName(date, page, json_file_path):
    json_file_name = ".".join([date,str(page),'json'])
    json_file_name = "".join([json_file_path,json_file_name])
    return json_file_name
# helpful function for processing keywords, mostly
# (joins the value of `key` from each item into one "; "-separated string)
def getMultiples(items, key):
    values_list = ""
    if len(items) > 0:
        num_keys = 0
        for item in items:
            if num_keys == 0:
                values_list = item[key]
            else:
                values_list =  "; ".join([values_list,item[key]])
            num_keys += 1
    return values_list
# get the articles from the NYTimes Article API
def getArticles(date, api_key, json_file_path):
    # request result pages 0-100 for this date
    for page in range(101):
        try:
            request_string = "http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=" + date + "&end_date=" + date + "&page=" + str(page) + "&api-key=" + api_key
            response = urllib2.urlopen(request_string)
            content = response.read()
            if content:
                articles = convert(json.loads(content))
                # if there are articles here, store the raw JSON
                if len(articles["response"]["docs"]) >= 1:
                    json_file_name = getJsonFileName(date, page, json_file_path)
                    with open(json_file_name, 'w') as json_file:
                        json_file.write(content)
                # if no more articles, go to next date
                else:
                    return
            # else:
            #     break
        except HTTPError as e:
            logging.error("HTTPError on page %s on %s (err no. %s: %s) Here's the URL of the call: %s", page, date, e.code, e.reason, request_string)
        except Exception:
            logging.error("Error on %s page %s: %s", date, page, sys.exc_info()[0])
# parse the JSON files you stored into a tab-delimited file
def parseArticles(date, csv_file_name, json_file_path):
    for file_number in range(101):
        # get the articles and put them into a dictionary
        try:
            file_name = getJsonFileName(date,file_number, json_file_path)
            with open(file_name, 'r') as in_file:
                articles = convert(json.loads(in_file.read()))
        except IOError as e:
            logging.error("IOError in %s page %s: %s %s", date, file_number, e.errno, e.strerror)
            # no stored file for this page, so skip it
            continue
        # if there are articles in that document, parse them
        if len(articles["response"]["docs"]) >= 1:
            # open the CSV for appending
            try:
                out_file = open(csv_file_name, 'ab')
            except IOError as e:
                logging.error("IOError in %s page %s: %s %s", date, file_number, e.errno, e.strerror)
                continue
            # loop through the articles putting what we need in a CSV
            try:
                for article in articles["response"]["docs"]:
                    # if (article["source"] == "The New York Times" and article["document_type"] == "article"):
                    keywords = getMultiples(article["keywords"],"value")
                    # should probably pull these if/else checks into a module
                    # convert() already returned UTF-8 byte strings, so they are written as-is
                    variables = [
                        keywords,
                        str(article["headline"]["main"]).replace("\n","") if "main" in article["headline"] else "",
                        str(article["source"]) if "source" in article else "",
                        str(article["document_type"]) if "document_type" in article else "",
                        article["web_url"] if "web_url" in article else "",
                        str(article["news_desk"]) if "news_desk" in article else "",
                        str(article["section_name"]) if "section_name" in article else "",
                        str(article["snippet"]).replace("\n","") if "snippet" in article else "",
                        str(article["lead_paragraph"]).replace("\n","") if "lead_paragraph" in article else "",
                        ]
                    line = "\t".join(variables)
                    out_file.write(line + "\n")
            except KeyError as e:
                logging.error("KeyError in %s page %s: %s", date, file_number, e)
            except (KeyboardInterrupt, SystemExit):
                raise
            except Exception:
                logging.error("Error on %s page %s: %s", date, file_number, sys.exc_info()[0])
            finally:
                out_file.close()
# Main function where stuff gets done
def main():
    parser = argparse.ArgumentParser(description="A Python tool for grabbing data from the New York Times Article API.")
    parser.add_argument('-j','--json', required=True, help="path to the folder where you want the JSON files stored")
    parser.add_argument('-c','--csv', required=True, help="path to the file where you want the CSV file stored")
    parser.add_argument('-k','--key', required=True, help="your NY Times Article API key")
    # parser.add_argument('-s','--start-date', required=True, help="start date for collecting articles")
    # parser.add_argument('-e','--end-date', required=True, help="end date for collecting articles")
    args = parser.parse_args()
    json_file_path = args.json
    csv_file_name = args.csv
    api_key = args.key
    # dates are hard-coded for now; start and end are both inclusive
    start = datetime.date( year = 2013, month = 1, day = 1 )
    end = datetime.date( year = 2013, month = 1, day = 1 )
    log_file = "".join([json_file_path,"getTimesArticles_testing.log"])
    logging.basicConfig(filename=log_file, level=logging.INFO)
    logging.info("Getting started.")
    try:
        for date in daterange( start, end ):
            date = date.strftime("%Y%m%d")
            logging.info("Working on %s." % date)
            getArticles(date, api_key, json_file_path)
            parseArticles(date, csv_file_name, json_file_path)
    except Exception:
        logging.error("Unexpected error: %s", str(sys.exc_info()[0]))
if __name__ == '__main__' :
    main()

Who in Congress talks to Each Other?

On Twitter, at least, most of the communication is between members of the same party. That’s not all that surprising given the polarized Congress and a slew of recent social science findings about homogeneous connections among users. I still think it’s interesting though.

A couple months ago I blogged about using geometric mean instead of simple edge weight and reciprocation measures, and I put that to use recently on data from Congress’s mentioning on Twitter between March 2012 and October 2012. The images below show the resulting graph using various geometric mean thresholds to determine whether or not an edge should display.
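A minimal sketch of the thresholding idea, assuming the geometric mean is taken over the mention counts in each direction of a pair. The function name and the counts are hypothetical.

```python
import math

def reciprocal_edges(mention_counts, threshold=1.0):
    """Keep undirected pairs whose geometric mean of mentions meets the threshold."""
    kept = {}
    for (source, target), count in mention_counts.items():
        if source == target:
            continue
        reverse = mention_counts.get((target, source), 0)
        # A one-sided relationship has reverse == 0, so its geometric mean is 0.
        score = math.sqrt(count * reverse)
        if score >= threshold:
            kept[tuple(sorted((source, target)))] = score
    return kept

# Made-up directed mention counts.
mention_counts = {
    ("a", "b"): 20, ("b", "a"): 5,
    ("a", "c"): 50,                 # never reciprocated
    ("c", "d"): 1, ("d", "c"): 1,
}
all_reciprocal = reciprocal_edges(mention_counts, threshold=1)
frequent = reciprocal_edges(mention_counts, threshold=10)
```

Because an unreciprocated pair scores zero no matter how many mentions flow one way, a threshold of 1 keeps exactly the reciprocal pairs, and higher thresholds keep only pairs who mention each other often.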

Geometric Mean 1 or Greater

The image above includes all reciprocal relationships, regardless of how one-sided those relationships were. The yellow edges mean that there were mentions across party. We see more here than Adamic and Glance did among political bloggers, but the red (Republican to Republican) and blue (Democrat to Democrat) mentions clearly occur much more frequently.

Geometric mean 10 or greater

In this image, the threshold for display was 10. That means these people are mentioning each other pretty often. A couple things jump out right away – first that the network is quite fragmented. The reciprocal network looks like a single component (I’ll have to check to be sure), but this one clearly has multiple components. Second, there are very few between-party links. Near the bottom, we can see Senators McCain, Graham, Lieberman, and Ayotte. I’d love to hear how Sen. Ayotte ended up in a conversation with those guys. Near the top, there’s another bipartisan conversation between Representatives Yoder and Cleaver from neighboring states; off to the right there’s another between Senators Moran and Warner. I’ll also look into why those guys are chatty.

Those two groups in the middle, where names are overlapping too much to read, have just within-party mentions. One group of Republicans talk amongst themselves, and one group of Democrats do as well. Then, just to the left of center, we see Sen. Grassley talking to himself to/from multiple accounts. That makes me think there’s a problem with the data. But, that’s why I put stuff here first – I can blog while I clean data and before I write the paper. The other groups are mostly representing the same state or from the same party. This exercise definitely presents a whole slew of new interesting questions to ask and answer.

UPDATE: Those two accounts for Sen. Grassley are actually his accounts – ChuckGrassley and GrassleyOffice.

UPDATED: Why didn’t the isolates go away?

I’m giving in. I’m finally learning how to do social network analysis in R. What made me switch (from only UCINet and NodeXL)? Well, all my data lives in a MySQL database, and I have networks with millions of edges. R makes it really easy to connect to MySQL and create a data frame from data found there. That saves me about 20 minutes every time I want to do some analysis. No more selecting and downloading data and crashing UCINet and Excel, just

library(RMySQL)

con <- dbConnect(MySQL(), user="user", password="pass", dbname="TwitterCollector", host="localhost")

mentions <- dbGetQuery(con, "SELECT * FROM tweet_mentions WHERE source_user_id IN (SELECT user_id FROM congress_attributes) AND target_user_id IN (SELECT user_id FROM congress_attributes)")

And I have all of Congress’s mentions of one another ready to go. Phew!

All I did today was get those connections set up, get some data in data frames for R to use, and then draw some rudimentary graphs like this one:

Mentions - Full network

I’m glad to see output, but I’m confused about why my isolate deleting functions didn’t work. Here’s how I tried to delete isolates:

mention_graph_no_iso <- delete.vertices(mention_graph, V(mention_graph)[degree(mention_graph)==0])

But I still see isolates in my graph. In fact, this one is even messier:

Mentions - no isolates

UPDATE:

The isolates weren’t in the graph object, but I forgot to rerun the layout after removing them. So, once I did that, I got a less messy graph (see below). I also cleaned up my code by moving the deleting isolates code to a function. I got the original function online but can’t find the page. Will post the URL here when I do. I made a small change to the function, and here it is:

delete.isolates <- function(graph, mode = 'all') {
  # vertices with degree 0 have no edges at all
  isolates <- which(degree(graph, mode = mode) == 0)
  delete.vertices(graph, isolates)
}


CSCW paper and poster about Congress on Twitter

My colleagues and I will present a paper and a poster at CSCW 2013 in San Antonio in February. Both submissions are based on data we collected from Twitter around politicians and their use of social media.

What’s Congress Doing on Twitter? (paper)

With Jahna Otterbacher and Matt Shapiro, this paper reports our first summary stats about who’s using Twitter and what they’re accomplishing. Using data from 380 members of Congress’ Twitter activity during the winter of 2012, we found that officials frequently use Twitter to advertise their political positions and to provide information but rarely to request political action from their constituents or to recognize the good work of others. We highlight a number of differences in communication frequency between men and women, Senators and Representatives, Republicans and Democrats. We provide groundwork for future research examining the behavior of public officials online and testing the predictive power of officials’ social media behavior.

Read the paper

Read my guest post at Follow the Crowd about the paper

“I’d Have to Vote Against You”: Issue Campaigning via Twitter (poster)

With Andrew Roback, one of my great graduate students, this poster focuses on the citizen side of the Twitter conversation. Specifically, using tweets posted with #SOPA and #PIPA hashtags and directed at members of Congress, we identify six strategies constituents employ when using Twitter to lobby their elected officials. In contrast to earlier research, we found that constituents do use Twitter to try to engage their officials and not just as a “soapbox” to express their opinions.

Read the Extended Abstract