
Looking for (Lesbian) Love: Social Media Subtext Readings of Rizzoli and Isles

Here’s the abstract of my paper that was just accepted to IR16, the annual conference of the Association of Internet Researchers. This will be my first trip to IR, and I’m really excited to participate. See you in Phoenix!

Introduction

Using Fiske’s (1989) semiotic supermarket metaphor, I examine how Twitter users mix and match moments from Rizzoli and Isles to create a coherent lesbian subtext. To do so, I use tweets containing the portmanteau hashtag #Rizzles or the related tag #Gayzzoli posted during two different episodes of the show. Live tweeting affords us an opportunity to eavesdrop on viewers’ listening activities and provides data useful for testing theories about reading/viewing and participation. Here, I demonstrate the utility of analyzing live tweeting and provide examples of how live tweeters publicly read resistant subtexts.

Fiske (1987) argues that readers are able to assemble their own texts from television works by “[listening] more or less attentively to different voices” within the work (p. 95). Though he didn’t introduce the term semiotic supermarket until later (Fiske, 1989), Fiske provides a semiotic framing that is useful for analyzing social media readings of television texts. For instance, he argues that viewers exploit contradictions within a text to locate their own social identities in it (Fiske, 1986).

I argue that we should understand the lesbian subtext reading of Rizzoli and Isles as precisely this kind of polysemic reading. I show how #Rizzles readers locate their own social identities within the text of the show and then use social media to share those locations with others publicly.

Background on the Show

Rizzoli and Isles is a police procedural produced by TNT and based on mystery novels by Tess Gerritsen. The title characters are Detective Jane Rizzoli, played by Angie Harmon, and medical examiner Dr. Maura Isles, played by Sasha Alexander. In both the novels and the show, the characters are written as straight women who are also close friends. The creators[1] and actors[2] have acknowledged the lesbian subtext readings. My analysis covers episodes from the fourth season (“We Are Family”) and the fifth (“The Best Laid Plans”), so please be aware that the remainder of the paper contains spoilers.

Collecting Tweets

I used TwitterGoggles (Maconi, 2013) to collect tweets containing either of the hashtags #Rizzles or #Gayzzoli. I’ve limited my analysis here to tweets posted on the date of the original U.S. broadcast of each of the episodes.

Live Tweeting and the Semiotic Supermarket

#Rizzles and #Gayzzoli viewers mark common lesbian and romantic tropes almost immediately. When we first see the characters together, they are jogging; Maura has just been cleared for physical activity after donating a kidney. Jane is trying to encourage Maura to keep jogging even though Maura doesn’t feel well.

Dialogue from the scene:

RIZZOLI: You’ll feel much better when you get back in shape. Ok? C’mon.

ISLES: Are you saying I’m fat and out of shape?

RIZZOLI: No, I am saying that you have got to stop hoping that they are going to send you some “thank you for your kidney” fruit basket.

Figure 1. Jane encourages Maura to keep going.

Figure 2. Jane reacts to Maura’s question.


Viewers responded with tweets such as:

“Are you saying I’m fat and out of shape? Oh look. First lover’s quarrel of season 4. #rizzoliandisles #gayzzoli” (Mirettesvertes, 2013)

“Where (sic) 90 seconds in and they’re already like an old married couple! #rizzles” (Nate, 2013)

“Nobody does bickering married couple like Rizzoli & Isles. #gayzzoli” (Marie, 2013)

We can already see from this first scene and these three tweets that viewers are assembling a text in which Rizzoli and Isles enact love and marriage, not just friendship. None of these tweeters explicitly mentions Jane’s or Maura’s gender, so they are marking the characters’ behavior not necessarily as lesbian but as romantic. We can also see that viewers are locating not just their own social identities but sometimes, as in Nate’s case, others’ subordinate identities. Nate describes himself in his own Twitter description as “Just your average 35 yr old guy who enjoys Gilmore Girls, HTGAWM, Castle, NCIS & Scandal (& more) and writes fanfic. Feminist. Livetweeter. Liza Weil’s #1 fan,” and he compares Jane and Maura to a married couple even though he doesn’t identify as a lesbian.

I introduced this project by situating it as a polysemic reading in line with Fiske’s Hart to Hart examples (1986). Rizzoli and Isles differs from Fiske’s examples because the show already resists dominant readings by having two female lead characters who have a relationship independent of their relationships to other characters. For example, Rizzoli and Isles passes the Bechdel test[3] each episode – the characters are often talking about their work (solving murders) or their own lives (struggles with their parents) without talking about men. In reading a lesbian subtext of the show, #Rizzles and #Gayzzoli tweeters are not just resisting the text but arguing that the text itself should have done resistance differently. The twin paucities in U.S. television and film of straight female friendships and of loving lesbian relationships involving series leads make Rizzoli and Isles an easy target for subtext readers. The show is susceptible to lesbian subtext readings precisely because viewers don’t see straight female friendships often enough for one to seem like an acceptable canonization.

Conclusion

The live tweets viewers post allow us to watch their readings of television episodes unfold. This data, especially when coupled with the episode text, allows us to test our theories of audience and participation. I demonstrated this approach and provided evidence of resistant readings that publicly mark polysemic moments in the text. These moments in Rizzoli and Isles are especially interesting because the multiple meanings are marked by viewers who don’t share a singular social identity.

References

Bechdel, A. (1988). The Rule. In More Dykes to Watch Out for: Cartoons (p. 22). Firebrand Books. Retrieved from http://dykestowatchoutfor.com/wp-content/uploads/2014/05/The-Rule-cleaned-up.jpg

Fiske, J. (1986). Television: Polysemy and Popularity. Critical Studies in Mass Communication, 3(4), 391–408.

Fiske, J. (1987). Television Culture. Methuen.

Fiske, J. (1989). Understanding Popular Culture. Routledge.

Hickey, W. (2014, April 1). The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women. Retrieved from http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/

Maconi, P. (2013). TwitterGoggles [source code]. Retrieved from https://github.com/pmaconi/TwitterGoggles

Marie [buknerd] (2013, June 25) Nobody does bickering married couple like Rizzoli & Isles. #gayzzoli [Tweet]. Retrieved from https://twitter.com/buknerd/status/349694021876203520

Mirettesvertes [mirettesvertes] (2013, June 25) “Are you saying I’m fat and out of shape?” Oh look. First lover’s quarrel of season 4. #rizzoliandisles #gayzzoli [Tweet]. Retrieved from https://twitter.com/mirettesvertes/status/349693940485734400

Nate [mrschimpf] (2013, June 25) Where 90 seconds in and they’re already like an old married couple! #rizzles [Tweet]. Retrieved from https://twitter.com/mrschimpf/status/349693974237286400

Penguin, Awkward [socawkpenguin78] (2013, June 25) The closer we get to the #RizzoliandIsles premiere, the more nervous I’m getting. What if it’s just a giant beardfest? #gayzzoli #Rizzles [Tweet]. Retrieved from https://twitter.com/socawkpenguin78/status/349625007711862784

[1] http://www.tessgerritsen.com/fanfic-and-rizzles/

[2] https://www.youtube.com/watch?v=CUu27ig9Wgw

[3] A popular tool for measuring gender bias in Hollywood, the “Bechdel test” is named for a comic strip by Alison Bechdel (1988). Her strip’s characters claim a movie passes if it (1) has at least two named women who (2) have a conversation with each other that (3) is not about a man. A recent FiveThirtyEight analysis found that only half of movies pass the test (Hickey, 2014), demonstrating that media “passing the test” is not the norm.



SSH Tunnel for Tableau on a Mac

I use Tableau to explore my data, and usually my data is stored in MySQL on a database server that allows only local connections. Therefore, I need to use SSH tunneling to connect to my data. Tableau doesn’t support tunneling natively, but Macs are equipped with OpenSSH by default and make tunneling easy.

My database server allows only local connections on MySQL’s standard port, via the port and bind-address settings. Here’s the relevant part of my.cnf:
#
# * Basic Settings
#
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address = 127.0.0.1

To use an SSH tunnel and Tableau to connect to a MySQL server:

    1. Open iTerm (or Terminal)
    2. ssh -NC user@host -L 9999:127.0.0.1:3306
    3. Open Tableau
    4. Choose “Connect to data” from the Tableau start screen
    5. Choose “MySQL” from the Connect list
    6. Enter 127.0.0.1 and port 9999 into Tableau’s MySQL login window. Don’t forget to include your database username and password. Click “Connect”!
    7. When you’re done with Tableau, you can kill the tunnel with a Ctrl+C in iTerm.
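
If you open this tunnel a lot, you can script steps 1–2 instead of typing them. Here’s a minimal sketch using only Python’s standard library; user@host is a placeholder for your own server:

import subprocess

# Forward local port 9999 to MySQL's port 3306 on the server.
# -N: don't run a remote command; -C: compress the connection.
tunnel = subprocess.Popen(["ssh", "-NC", "user@host",
                           "-L", "9999:127.0.0.1:3306"])

# ... work in Tableau against 127.0.0.1:9999 ...

tunnel.terminate()  # same effect as the Ctrl+C in step 7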

Troubleshooting

Make sure you have MySQL’s ODBC driver installed first, or no amount of tunneling will help you.

Use “127.0.0.1” as the server, not “localhost”. Many MySQL clients treat “localhost” as a request to connect over the local socket instead of TCP, which bypasses your tunnel.


Analyzing Language for Persuasion Markers in Twitter Discussions

In studying the #GamerGate discussion(s) on Twitter, I’m using a variety of theories including persuasion, community action, and paranoid style in politics. I could use some help making sense of what I’m seeing, so I’ll be blogging as I go. Please contact me or use the comment functions if you have ideas.

First up, language and social influence. Why this approach? The argument about what #GamerGate is – a discussion about ethics in game journalism or a coordinated harassment effort – could be explored in part by examining the language participants use. I’m especially interested in whether users are persuasive in their language – can they convince readers that the discussion is about what they claim it’s about? [The mainstream press says “no“, “no“, “no“, you get the idea.] I’m not advocating for that argument; as you can see from the links in this paragraph, it’s been resolved. I’m curious, rather, about whether the language used by tweeters indicates an argument was even happening, and I’m using the presence/absence of persuasion markers to figure that out.

Some Background Research

A number of language features are connected to social influence, especially in online communication [1]:

  • lexical diversity
  • powerful language
  • language intensity

Lexical diversity refers to a class of measures of the range of vocabulary a writer uses. Often, lexical diversity is measured using a type-token ratio (number of unique words [types] divided by total number of words [tokens]) [2]. Low linguistic diversity leads to low evaluations from readers [3] and “negatively impacts credibility and influence” [1, p. 598]. A number of social factors have been shown to impact lexical diversity, including anxiety [4] and writing apprehension [5], both of which produce writing with low lexical diversity.

Powerful language is defined by what it’s missing, namely linguistic features such as tag questions (e.g., “isn’t it?”), hedges (e.g., “sort of”, “kinda”), hesitations (e.g., “um”), fragments, and intensifiers (e.g., “really”) [6]. Writers who use powerful language are perceived as more competent and authoritative [7], and their arguments are judged as more persuasive [6]. Several studies have found that women tend to use a less powerful language style than men [8, 9, 10].

Many researchers use Bowers’ [11, p. 345] definition of language intensity as “the quality of language which indicates the degree to which the speaker’s attitude toward a concept deviates from neutrality.” Intensity is often conveyed through emotionality [12] and is measured using a scale in which words are labeled according to their intensity; popular scales include Jones and Thurstone’s [13] and Burgoon and Miller’s [14]. Intense language is associated with persuasion [12], resistance to persuasion [14], perceived credibility [12], and attitude-behavior consistency [15]. Receivers have been shown to tolerate more intense messages from men than from women [16].

My Expectations

Based on this research about the relationships between linguistic style and social influence, I expect influential tweeters to have high lexical diversity and to use powerful, intense language. By influential, I mean those tweeters who are able to control the message. I’m thinking retweets are decent proxies for persuasion and influence – I assume people RT persuasive arguments/authors. I also expect those most committed to the ethics-in-gaming argument to use the most intense language – they seem the least likely to be persuaded otherwise, based on mainstream media reports. I’m also wondering whether the “ethics in journalism” argument is failing because its supporters do not use persuasive language – posts like “Actually, it’s about ethics” are not persuasive. So, maybe that argument is losing because (a) there’s a crap-ton of harassment happening that makes it wrong/irrelevant, and (b) even when it’s not harassing, the language used isn’t very persuasive.

My Measures

Lexical diversity was calculated using a standard type:token ratio (unique words divided by total words).

LD = w_u / w_t    (1)

I created a measure for powerful language using a similar ratio approach. First, I calculated the combined ratio of common hedges (e.g., “i feel like”, “probably”) and intensifiers (e.g., “really”) to total words, and I then subtracted that ratio from one, so that hedge-free language scores as more powerful.

P = 1 - (w_h / w_t)    (2)

I used a ratio of high intensity markers (e.g., “very”, “strongly”) to low intensity markers (e.g., “poor”, “mildly”) to measure language intensity.

LI = i_h / i_l    (3)
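
To make the measures concrete, here’s a minimal sketch of all three in Python. The marker lists are illustrative toys, not the lists my actual utilities (linked below) use:

# Toy marker lists for illustration only.
HEDGES = ["i feel like", "probably", "sort of", "kinda"]
INTENSIFIERS = ["really", "very"]
HIGH_INTENSITY = ["very", "strongly"]
LOW_INTENSITY = ["poor", "mildly"]

def count_markers(text, markers):
    # counts substring occurrences; real tokenization would be more careful
    return sum(text.count(marker) for marker in markers)

def measures(tweet):
    text = tweet.lower()
    words = text.split()
    w_t = float(len(words))

    ld = len(set(words)) / w_t                      # (1) LD = w_u / w_t

    w_h = count_markers(text, HEDGES + INTENSIFIERS)
    p = 1 - (w_h / w_t)                             # (2) P = 1 - (w_h / w_t)

    i_h = count_markers(text, HIGH_INTENSITY)
    i_l = count_markers(text, LOW_INTENSITY)
    li = i_h / float(i_l) if i_l else float(i_h)    # (3) LI = i_h / i_l, guarding /0

    return ld, p, li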

My Code

You’re right. This project is also a great excuse for me to learn more Python. If you’ve ever talked to me about code in person, you know how that makes me feel [not awesome]. But, here we are. I’m writing some utilities for automatically analyzing tweets, and the code is available on GitHub. That code assumes you used TwitterGoggles to collect and parse the tweets.

References

[1] D. Huffaker, “Dimensions of Leadership and Social Influence in Online Communities,” Human Communication Research, vol. 36, no. 4, pp. 593–617, Oct. 2010.

[2] J. J. Bradac, J. W. Bowers, and J. A. Courtright, “Three Language Variables in Communication Research: Intensity, Immediacy, and Diversity,” Human Communication Research, vol. 5, no. 3, pp. 257–269, 1979.

[3] J. J. Bradac, C. W. Konsky, and R. A. Davies, “Two Studies of the Effects of Linguistic Diversity Upon Judgments of Communicator Attributes and Message Effectiveness,” Communication Monographs, vol. 43, no. 1, pp. 70–79, Mar. 1976.

[4] S. V. Kasl and G. F. Mahl, “Relationship of disturbances and hesitations in spontaneous speech to anxiety,” Journal of Personality and Social Psychology, vol. 1, no. 5, pp. 425–433, 1965.

[5] J. A. Daly, “The Effects of Writing Apprehension on Message Encoding,” Journalism Quarterly, vol. 54, no. 3, pp. 566–572, Sep. 1977.

[6] T. Holtgraves and B. Lasky, “Linguistic Power and Persuasion,” Journal of Language and Social Psychology, vol. 18, no. 2, pp. 196–205, Jun. 1999.

[7] J. J. Bradac, M. R. Hemphill, and C. H. Tardy, “Language Style on Trial: Effects of ‘Powerful’ and ‘Powerless’ Speech Upon Judgments of Victims and Villains,” Western Journal of Speech Communication: WJSC, vol. 45, no. 4, pp. 327–341, Fall 1981.

[8] F. Crosby and L. Nyquist, “The Female Register: An Empirical Study of Lakoff’s Hypotheses,” Language in Society, vol. 6, no. 3, pp. 313–322, 1977.

[9] V. Savicki, D. Lingenfelter, and M. Kelley, “Gender Language Style and Group Composition in Internet Discussion Groups,” Journal of Computer-Mediated Communication, vol. 2, no. 3, pp. 0–0, 1996.

[10] S. C. Herring, “Gender and power in online communication,” in The Handbook of Language and Gender, J. Holmes and M. Meyeroff, Eds. 2003, pp. 202–228.

[11] J. W. Bowers, “Language intensity, social introversion, and attitude change,” Speech Monographs, vol. 30, no. 4, pp. 345–352, Nov. 1963.

[12] M. A. Hamilton and J. E. Hunter, “The effect of language intensity on receiver evaluations of message, source, and topic,” Persuasion: Advances through meta-analysis, pp. 99–138, 1998.

[13] L. V. Jones and L. L. Thurstone, “The psychophysics of semantics: an experimental investigation,” Journal of Applied Psychology, vol. 39, no. 1, pp. 31–36, 1955.

[14] M. Burgoon and G. R. Miller, “Prior attitude and language intensity as predictors of message style and attitude change following counterattitudinal advocacy.,” Journal of Personality and Social Psychology, vol. 20, no. 2, p. 246, 1971.

[15] P. A. Andersen and T. R. Blackburn, “An Experimental Study of Language Intensity and Response Rate in E-Mail Surveys,” Communication Reports, vol. 17, no. 2, pp. 73–82, Summer 2004.

[16] M. Burgoon, S. B. Jones, and D. Stewart, “Toward a Message-Centered Theory of Persuasion: Three Empirical Investigations of Language Intensity,” Human Communication Research, vol. 1, no. 3, pp. 240–256, 1975.


#GamerGate vs #StopGamerGate2014 By the Numbers – 10/20 edition

Edited on 10/20: Added info about specific users, more numbers.

Carly Kocurek, one of my smart and savvy IIT colleagues, pointed out that the #GamerGate and #StopGamerGate2014 discussions on Twitter are worth examining. So, I fired up a TwitterGoggles instance to track those hashtags and these others she recommended:

#quinnspiracy
#gamergate
#notyourshield
#StopGamerGate2014
#academicANDfeminist
#gamerfruit

I saw @Gaming_Sparrow’s tweet comparing the popularity of the two main hashtags. @ybika asked for a response, so I ran a couple of quick queries on the data I’d collected and found totally different numbers.

From 10/17/2014 – 10/20/2014, I see

#GamerGate

33,039 users

278,548 tweets

and

#StopGamerGate2014

6,303 users

16,099 tweets
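
For the curious, these counts are simple aggregates over the tweets TwitterGoggles stored in MySQL. Here’s a sketch of the kind of query I ran; the table and column names are stand-ins, not TwitterGoggles’ actual schema:

import mysql.connector

# Count distinct users and total tweets for one hashtag over 10/17-10/20.
conn = mysql.connector.connect(host="127.0.0.1", user="me",
                               password="secret", database="twitter")
cur = conn.cursor()
cur.execute("""
    SELECT COUNT(DISTINCT user_id), COUNT(*)
    FROM tweet
    WHERE LOWER(text) LIKE %s
      AND created_at >= '2014-10-17'
      AND created_at <  '2014-10-21'
""", ("%#stopgamergate2014%",))
users, tweets = cur.fetchone()

Note the LOWER() call – my counts are case-insensitive, which matters for the first possibility below.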

A few things could be happening:

  • Keyhole may be using a case-sensitive search, and mine is case-insensitive
  • Keyhole and I are getting different data from Twitter
  • #GamerGate has gained popularity and is now used by every side of the argument

No more time to process this today, but I’ll come back to it. What do you think is going on?


More info added later on 10/20:

I’ve seen a couple of other tweets and posts about the number of users and the distribution of #gamergate tweets (e.g., Waxpancake: 100 people post 24% of the tweets). It’s difficult to compare my data to theirs because I don’t know how Keyhole and Waxpancake are collecting their data. I contribute to TwitterGoggles on GitHub and know it much better. Of course, it still relies on the Twitter Search API, so there’s a lot I don’t know about what’s not in my data. Anyway, here are some things I noticed while looking at my data.

#gamergate dominates other tags

Here’s a quick graph I made using Tableau. In this chart, the x-axis represents time, where each bar is an hour, and the y-axis represents the number of tweets posted. The colors of the stacked bars map to the hashtags that appear in the tweet: blue for #gamergate only, orange for #stopgamergate2014 only, and green for tweets with both tags. This graph isn’t designed to make detailed comparisons easy – it’s just to show how incredibly popular #gamergate is compared to other tags. I also found it interesting that some tweets contain both tags, since the two are mostly at odds. Of course, one of the tweets with both is my own, because I wanted it to show up in both conversations. Though, I may regret posting at all. Isn’t that the problem?

@mfreema55 asked me to post a higher-res image and explain the time info. So, here you go. The hours are GMT – so the graph says when people in the U.S. get off work, they start tweeting about this stuff.

Some of the most active voices change their names

Twitter assigns accounts unique user IDs, but users can change their full names if they’d like. A few accounts in the #gamergate conversation (I use the term broadly to refer to all the data associated with the hashtags above) have changed their names while tweeting. For instance, @nahalennia changed zir* name from “You Didn’t Listen” (160 tweets) to “The Future You Choose” (590 tweets) at some point in the last 3 days. So did @PsychokineticEX, who changed names from ADMIRALOF#GAMERGATE (528 tweets) to THE ADMIRAL (174 tweets). Both accounts are among the top 25 most active.

Users can also change their handles (the part after the @), but that seems far less common in this group. User #2815636153 is an interesting exception. Ze used the names “and_next_name,” “my_next_name,” “need_next_name,” “the_next_name,” “their_next_name,” and “your_next_name” this weekend.

Skewed distribution of tweets/user

Like much of online activity, a few people are responsible for most of the content. This isn’t the most skewed distribution I’ve ever seen, but it’s definitely skewed. Or, it has a long tail. Depends on how you look at it. I haven’t normalized this (for anything, including how many tweets an account usually posts), but that would be interesting too. I.e., maybe @SomeKindaBoogin just tweets constantly, so it’s not surprising that ze tweeted a lot in this conversation. Again, this graph isn’t about details. It’s unreadable at that level because I wanted to show you how incredibly long this tail is. Even if just a few people are incredibly active, there are still thousands of people engaging at some level. That’s exciting.
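
Getting this distribution from the raw data is simple. A sketch, assuming a list of (user, tweet) rows pulled from the collection:

from collections import Counter

# Toy sample; in practice each row comes from the collected tweets.
rows = [("a", "..."), ("b", "..."), ("a", "...")]

tweets_per_user = Counter(user for user, _ in rows)
# Sort users from most to least active to see the long tail.
distribution = tweets_per_user.most_common()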

Tweets per user

* I’m using gender-neutral pronouns since I don’t know who these accounts belong to or whether they are owned by a person or a group, and because it makes sense to use gender-neutral pronouns when talking about harassment and safety.


Summary Stats about #StoptheNSA Twitter Activity

I gave a talk at Social Media Week Chicago with Prof. Ed Lee from IIT Chicago-Kent College of Law this week. We are studying a number of online political protests including the February 11, 2014 #StoptheNSA protest spearheaded by the Day We Fight Back. Here are the summary statistics about that day on Twitter:

Total Tweets: 98,515 (2.8x as many tweets about NSA as usual)
Original tweets N: 48,374 (49%)
Retweets N: 50,141 (51%)

We collected over 2M tweets with the hashtags #StoptheNSA, #NSA, and #daywefightback from February 7, 2014 to April 8, 2014 and found that the protest did

  • create high volume activity spikes
  • involve many and diverse users
  • reach huge audiences
  • generate attention

The graph below plots the number of tweets with any of those hashtags by day. You can see a great spike on February 11, the actual day of the protest, and another spike on March 25, the day President Obama gave public remarks about the NSA at The Hague. You can also see that on the day of the protest, the #StoptheNSA hashtag was quite popular, but it had mostly disappeared by the time President Obama spoke at The Hague. The general #NSA hashtag, though, received continued attention throughout this time period. Even though the protest’s own hashtag died out, the protest was likely able to generate additional, lasting interest in the NSA.

#StoptheNSA Tweets by Day

The number of tweets posted using the #StoptheNSA, #NSA, or #daywefightback hashtags.

You can learn more about our findings from the slides for our talk. If you’d like to monitor and understand your own Twitter campaign, please contact me.


Who said it first – Congress or the press?

Sometimes Congress, sometimes the press, it turns out. Matt Shapiro and I wrote a paper for this month’s Midwest Political Science Association meeting in which we analyzed the timing of tweets with hashtags and New York Times articles with keywords and found

… news coverage and Twitter activity from the previous day are good predictors of news coverage and Twitter attention on any given day.

We wondered whether political issues popular on Twitter were popular in the press as well and whether issues cropped up among politicians on Twitter or in the press first. So, we retrieved all the articles available from the New York Times Article API for 2013 and all of the tweets Twitter would let us have for members of Congress (see links to code for collecting data below). We focused on hashtags and article keywords for six policy areas: budget, immigration, environment, energy, the Affordable Care Act (ACA), and marginalized groups (e.g., LGBT, military veterans, Latinos, etc.) and compared the timelines of when those issues were referenced in tweets and in articles.

The tables below show the results of our regressions. For most of the issues, they were similarly popular in the press and on Twitter on the same day. However, for immigration, Twitter activity in the past is a better predictor of news coverage than prior news coverage. For marginalized groups, neither prior news nor prior tweets are good predictors of a day’s news, suggesting that attention both in the news and on Twitter is spotty (or bursty) for marginalized groups.

The strong correlations between issues’ Twitter activity and news coverage on the same day (see models labeled “b” in the tables below) suggest, at least, that the press and Congress are giving attention to similar issues.

                       Budget            Immigration       Environment
                       (1a)     (1b)     (2a)     (2b)     (3a)     (3b)
Previous day’s news    .377***  .361***  .202***  .176***  .169***  .169***
Previous day’s tweets  .366***  .114*    .287***  .224***  .111**   .115**
Same day’s tweets               .351***           .184***           -.012
F-statistic            145.03   119.71   34.29    27.75    7.97     5.31
R²                     0.45     0.50     0.16     0.19     0.04     0.04
N                      364      364      364      364      364      364

                       Energy            ACA               Marginalized
                       (4a)     (4b)     (5a)     (5b)     (6a)     (6b)
Previous day’s news    .171***  .171***  .382***  .296***  .074     .078
Previous day’s tweets  .129**   .115**   .201***  .073     .007     .035
Same day’s tweets               .042              .313***           -.081
F-statistic            9.45     6.50     64.02    57.87    1.00     1.39
R²                     0.05     0.05     0.26     0.33     0.01     0.01
N                      364      364      364      364      364      364

Note: Each count of articles and tweets is a standard score, and beta coefficients for each predictor are reported. Predictors’ significance is indicated with asterisks, where *, **, and *** represent p<0.1, p<0.05, and p<0.001, respectively.
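
For anyone who wants to reproduce the structure of the “b” models, here’s a sketch for a single issue using pandas and statsmodels. The file and column names are illustrative, not our actual data:

import pandas as pd
import statsmodels.api as sm

# One row per day with standardized article and tweet counts.
df = pd.read_csv("budget_daily.csv")  # illustrative; columns: news, tweets

X = pd.DataFrame({
    "prev_news": df["news"].shift(1),      # previous day's news
    "prev_tweets": df["tweets"].shift(1),  # previous day's tweets
    "same_tweets": df["tweets"],           # same day's tweets
}).dropna()
y = df["news"].loc[X.index]

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())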

Python Code for Collecting the Data


Collecting and Connecting On- and Offline Political Network Data

I gave a talk at the DIMACS Workshop on Building Communities for Transforming Social Media Research Through New Approaches for Collecting, Analyzing, and Exploring Social Media Data at Rutgers University last week. Here are my slides and roughly what I said:

Many of today’s talks are about gathering big social data or automating its analysis. I’m going to focus instead on connecting disparate data sources to increase the impacts of social media research.

Most of my work is about public policy, civic action, and how social media plays a role in each. Today, I’ll talk mainly about a study of how Congress uses Twitter and how that use influences public discussion of policy.

I’m in a department of humanities but have degrees in information and work experience in web development. My position allows me to witness divides in how different disciplines think about data and research, and I’ve included a few comments in my talk about why those divides matter.

Often when we study social media we’re looking at trends or trying to generalize about or understand whole populations, but I’m interested in a specific subset of people, what they do online, and how that online activity influences the offline world. This focus allows me to connect what we know about people offline with what they do online quite reliably. For instance, I can connect data such as party affiliation, geolocation, gender, tenure in Congress, chamber, voting record, campaign contributions, the list goes on, because these values are known for members of Congress. Getting their data from Twitter is trickier. Govtrack does a nice job keeping track of official accounts, but politicians often have 2 or 3 accounts – for campaign messaging, for personal use – and there are many bogus accounts out there. Once you find their accounts, you can use a variety of tools to capture their Twitter data. I wrote my own in Python and MySQL because most other free tools focus on hashtags and the Search API, and those don’t return the Twitter data I need.

So, back to the study. There’s plenty of hype in the popular press about how politicians wield social media influence to impact policy. Look at how Obama used the internet to become President! I wondered whether these claims are true – was social media providing a new route to influence for members of Congress?

Turns out, not so much. The people positioned to exert the most influence online occupy similar positions of power offline.

How can I be sure? Network theory, especially social capital theory, provides tools for making judgments about the relative power of people as a function of their positions in a network. And luckily, I’m not the only one who thinks network theory is a good way to interrogate power in Congress. Other researchers have used network theory to analyze relationships among bill co-sponsors, roll call votes, congressional committees, and press events. I’ll focus just on cosponsorship because it’s the most widely used measure of legislative influence. I compared legislative influence with a person’s ability to control the spread of information.

To do so, I used Jim Fowler’s cosponsorship network data and his measure of influence – connectedness – which is a weighted closeness centrality measure for the network analysts among us.

My data is a network of mentions among members of Congress. I have a few hundred thousand tweets from 2008 to the present in which members of Congress mention one another about 75,000 times. Every mention creates an explicit, directed link between two members, and these links form the network I’m interested in.
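
To make that concrete, here’s a sketch of building the mention network and computing a closeness-style centrality with networkx. It illustrates the approach; it is not Fowler’s exact connectedness measure:

import networkx as nx

# Toy sample; in my data each pair is (tweeting member, mentioned member).
mentions = [("@RepA", "@RepB"), ("@RepA", "@RepC"), ("@RepB", "@RepA")]

G = nx.DiGraph()
for author, mentioned in mentions:
    if G.has_edge(author, mentioned):
        G[author][mentioned]["weight"] += 1  # repeated mentions strengthen the tie
    else:
        G.add_edge(author, mentioned, weight=1)

# Closeness centrality as a rough analogue of Fowler's connectedness.
centrality = nx.closeness_centrality(G)
most_central = sorted(centrality, key=centrality.get, reverse=True)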

Fowler’s data is an undirected, implicit network in that members are connected through their affiliation with legislation. To use Fowler’s data, I needed a way to connect members of Congress on Twitter to members in his data – a sort of key, if you will. Keith Poole’s ICPSR ID is a widely used unique identifier for individual members of Congress (one that Fowler also uses), so I developed a mapping of Twitter ID to ICPSR ID.

So Fowler used bill cosponsorship networks to figure out who wields influence over what legislation gets addressed and eventually passed. Turns out members with high connectedness are more effective at convincing their peers to vote with them. I used the same algorithm to measure who wields influence over the spread of information online. Being able to spread information quickly allows politicians to control the conversation. We know from studies of framing, for instance, that the first frame is likely to get traction and essentially constrain future discussions of an issue. What I found was that people with legislative influence also control information. While this correlation isn’t necessarily causal in either direction, what matters is that the same people wield influence both online and off. I can tell you a little more about those people too, because I was able to connect online behavior data with off-line demographic and political data.

Members of Congress who control the conversation, or at least are in a position to, are male House Republicans. If you’re a female, Democrat, feminist scholar like me, that’s scary. What are some of the implications of that information control? Well, for starters, male House Republicans nearly never talk about issues facing marginalized groups such as pay inequality, discrimination, and poverty. Instead, the online conversation is about the ACA, gun control, and the debt. Whether you think we should be talking more about poverty or the ACA, I expect you’d agree that some diversity in both topics and talkers would be welcome. But that’s not what I’m seeing. Instead, I’m seeing male House Republicans controlling the spread of information and attention through the Congressional network and to all of its followers.

We’ve now taken a whirlwind tour through one of my studies of political communication on Twitter. What did we find that matters? First, social media data is most useful when we connect it with other data. Second, social media is not providing an alternate route to power for members of Congress. Third, maleness and Republicanness are the most reliable routes to influence online. From my view, these results paint a pretty bleak picture where social media doesn’t actually challenge the status quo, and groups that wield disproportionate, and often oppressive, influence offline do so online as well. This isn’t quite the democratizer or equalizer I was hoping for, but as I work to understand what’s happening among citizens, maybe I’ll see something different.

I’d rather not end on a depressing note, so let me end on a call to action instead. The technical expertise required to do this study – to collect Fowler’s data, to collect Twitter data, to do the statistical and network analyses – may seem second nature to many of us here, but it is not to the people best equipped to interrogate this data. Let’s not measure impact by the size of our dataset or the lines of code we had to write to get it. For instance, I have a colleague in political science at IIT who knows much more than I do about legislative influence and political communication. He shouldn’t also be asked to learn R and Python to contribute to the discussion about social media’s role in influencing public policy discussions. I hope we can remove, or at least diminish, the technical barriers for subject matter experts and scientists of other stripes to use [big] [open] [social] data and that we can change graduate education to train students in both social theory and technical tools.

See tweets from the conference: http://seen.co/event/dimacs-workshop-on-building-communities-for-social-media-research-core-building-rutgers-university-new-brunswick-nj-2014-4203

Access Fowler’s papers and data: http://jhfowler.ucsd.edu/cosponsorship.htm

Access Poole’s ICPSR ID data and information: http://www.voteview.com/icpsr.htm


Fetching articles from the New York Times API

I’m working on a paper for the Midwest Political Science Association meeting in which we analyze whether policy issues appear first in Congress’s tweets or in the popular press. We’re using all articles from the New York Times, including those from the Associated Press, Reuters, and other providers, as “popular press” content. In order to collect the articles, I wrote a Python script that loops through the dates we’re interested in, saves the JSON provided by the Times Article API, and finally parses that JSON into a CSV for use in Stata/R/whatever.

You may be wondering why I bother storing the JSON at all. For three reasons, really:

  1. Something could go wrong if I didn’t, and I’d have to fetch it again;
  2. The New York Times is nice enough to allow programmatic access to its articles, but that doesn’t mean I should query the API every time I want data; and,
  3. I may need different subsets or forms of the same “raw” data for different analyses.

So, instead of querying the API and parsing its response all at once, I query it once and cache the raw data, lessening the burden on the Times API and saving a local copy of the “raw” data. Then, I parse that raw data into whatever format I need – in this case a tab-delimited file with only some of the fields – and leave the raw data alone. Next time I have a research question that relies on the same articles, I can just re-parse the stored JSON files into whatever format helps me answer my new question.

The script is a work in progress. See the repo README on Github for info about planned changes.

Why “raw” data? Well, even the JSON that the Times provides has been processed. The Times has chosen some data to give me and some data to keep to itself (e.g., the full text of articles). The data the API returns is raw to me, meaning it’s my starting point. Whether data is ever really raw, or whether that’s even a useful term, I leave up to you for now.

The Code

import urllib2
import json
import datetime
import time
import sys
import argparse
import logging
from urllib2 import HTTPError
 
# helper function to iterate through dates
def daterange( start_date, end_date ):
    if start_date <= end_date:
        for n in range( ( end_date - start_date ).days + 1 ):
            yield start_date + datetime.timedelta( n )
    else:
        for n in range( ( start_date - end_date ).days + 1 ):
            yield start_date - datetime.timedelta( n )
 
# helper function to get json into a form I can work with       
def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
 
# helpful function to figure out what to name individual JSON files        
def getJsonFileName(date, page, json_file_path):
    json_file_name = ".".join([date,str(page),'json'])
    json_file_name = "".join([json_file_path,json_file_name])
    return json_file_name
 
# helpful function for processing keywords, mostly    
def getMultiples(items, key):
    values_list = ""
    if len(items) > 0:
        num_keys = 0
        for item in items:
            if num_keys == 0:
                values_list = item[key]                
            else:
                values_list =  "; ".join([values_list,item[key]])
            num_keys += 1
    return values_list
 
# get the articles from the NYTimes Article API    
def getArticles(date, api_key, json_file_path):
    # LOOP THROUGH THE 101 PAGES NYTIMES ALLOWS FOR THAT DATE
    for page in range(101):
        try:
            request_string = "http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=" + date + "&end_date=" + date + "&page=" + str(page) + "&api-key=" + api_key
            response = urllib2.urlopen(request_string)
            content = response.read()
            if content:
                articles = convert(json.loads(content))
                # if there are articles here
                if len(articles["response"]["docs"]) >= 1:
                    json_file_name = getJsonFileName(date, page, json_file_path)
                    json_file = open(json_file_name, 'w')
                    json_file.write(content)
                    json_file.close()
                # if no more articles, go to next date
                else:
                    return
            # else:
            #     break
            time.sleep(3)
        except HTTPError as e:
            logging.error("HTTPError on page %s on %s (err no. %s: %s) Here's the URL of the call: %s", page, date, e.code, e.reason, request_string)
        except: 
            logging.error("Error on %s page %s: %s", date, file_number, sys.exc_info()[0])
            continue
 
# parse the JSON files you stored into a tab-delimited file
def parseArticles(date, csv_file_name, json_file_path):
    for file_number in range(101):
        # get the articles and put them into a dictionary
        try:
            file_name = getJsonFileName(date,file_number, json_file_path)
            in_file = open(file_name, 'r')
            articles = convert(json.loads(in_file.read()))
            in_file.close()
        except IOError as e:
			logging.error("IOError in %s page %s: %s %s", date, file_number, e.errno, e.strerror)
			continue
 
        # if there are articles in that document, parse them
        if len(articles["response"]["docs"]) >= 1:  
            # open the CSV for appending
            try:
                out_file = open(csv_file_name, 'ab')
            except IOError as e:
    			logging.error("IOError: %s %s", date, file_number, e.errno, e.strerror)
    			continue
 
            # loop through the articles putting what we need in a CSV   
            try:
                for article in articles["response"]["docs"]:
                    # if (article["source"] == "The New York Times" and article["document_type"] == "article"):
                    keywords = ""
                    keywords = getMultiples(article["keywords"],"value")
 
                    # should probably pull these if/else checks into a module
                    variables = [
                        article["pub_date"], 
                        keywords, 
                        str(article["headline"]["main"]).decode("utf8").replace("\n","") if "main" in article["headline"].keys() else "", 
                        str(article["source"]).decode("utf8") if "source" in article.keys() else "", 
                        str(article["document_type"]).decode("utf8") if "document_type" in article.keys() else "", 
                        article["web_url"] if "web_url" in article.keys() else "",
                        str(article["news_desk"]).decode("utf8") if "news_desk" in article.keys() else "",
                        str(article["section_name"]).decode("utf8") if "section_name" in article.keys() else "",
                        str(article["snippet"]).decode("utf8").replace("\n","") if "snippet" in article.keys() else "",
                        str(article["lead_paragraph"]).decode("utf8").replace("\n","") if "lead_paragraph" in article.keys() else "",
                        ]
                    line = "\t".join(variables)
                    out_file.write(line.encode("utf8")+"\n")
            except KeyError as e:
                logging.error("KeyError in %s page %s: %s %s", date, file_number, e.errno, e.strerror)
                continue
            except (KeyboardInterrupt, SystemExit):
                raise
            except: 
                logging.error("Error on %s page %s: %s", date, file_number, sys.exc_info()[0])
                continue
 
            out_file.close()
        else:
            break
 
# Main function where stuff gets done
 
def main():
    parser = argparse.ArgumentParser(description="A Python tool for grabbing data from the New York Times Article API.")
    parser.add_argument('-j','--json', required=True, help="path to the folder where you want the JSON files stored")
    parser.add_argument('-c','--csv', required=True, help="path to the file where you want the CSV file stored")
    parser.add_argument('-k','--key', required=True, help="your NY Times Article API key")
    # parser.add_argument('-s','--start-date', required=True, help="start date for collecting articles")
    # parser.add_argument('-e','--end-date', required=True, help="end date for collecting articles")
    args = parser.parse_args()
 
    json_file_path = args.json
    csv_file_name = args.csv
    api_key = args.key    
    start = datetime.date( year = 2013, month = 1, day = 1 )
    end = datetime.date( year = 2013, month = 1, day = 1 )
    log_file = "".join([json_file_path,"getTimesArticles_testing.log"])
    logging.basicConfig(filename=log_file, level=logging.INFO)
 
    logging.info("Getting started.") 
    try:
        # LOOP THROUGH THE SPECIFIED DATES
        for date in daterange( start, end ):
            date = date.strftime("%Y%m%d")
            logging.info("Working on %s." % date)
            getArticles(date, api_key, json_file_path)
            parseArticles(date, csv_file_name, json_file_path)
    except:
        logging.error("Unexpected error: %s", str(sys.exc_info()[0]))
    finally:
        logging.info("Finished.")
 
if __name__ == '__main__' :
    main()

My Personal Wayback Machine: While Reading “Tricks of the Trade”

I haven’t blogged about my own research process in a while. Tonight I thought I might, and I found this incomplete post in my “Drafts” folder. I originally drafted it in the fall of 2009. Rather than expand it now, I thought I’d just release it into the wild. I’m not editing it because I like the hopeful, curious tone it takes. I haven’t felt like this about research in a while, but maybe I will again having read my own thoughts. Find the original draft after the jump.
