Apr 28, 2010

Social Network Analysis using R and Gephi

After learning the basics of R, I decided to learn something harder last week. I picked Social Network Analysis (SNA) so I could learn SNA concepts and more R at the same time. My primary interest in SNA is visual exploration of networks, so I needed to find a tool first.

Which tool to use for visual SNA? Features needed:
1) graphical representation of network
2) visually navigate the graph (zoom in/out, drag) to explore large graphs
3) manipulate the graph (filter nodes, edit/delete/group nodes and same for edges)
4) free, preferably open source. I'm not currently interested in commercial tools (can't justify their steep price tag for my experiments).

I found out that R has good libraries for social network analysis, like sna (check out Drew Conway's tutorial) and igraph (see this tutorial). However, they struggle with large graphs (more than about 200 nodes or 500 edges seemed to make the process slow and the plots unusable), and they don't let you navigate or manipulate the graph visually. All I could do was plot simple networks (shown below).


So I continued my hunt for a good tool for visual SNA and discovered Gephi, an open source app for visual exploration. Think "Photoshop for graphs".

WARNING: SNA with Gephi is addictive. I've had my share of sleepless nights and dreams of nodes and edges.




After you download Gephi, check out the Gephi quick start guide to get your bearings.

I played with Gephi for several hours to learn it (it's kewl) and to impress my daughter (her dad is no fireman who saves people, but he can do nifty things with a computer ;). I was able to discover interesting facts from the data by looking at statistics such as:

- Average degree: the average number of edges/connections attached to a node
- Network diameter: the longest of the shortest paths between any two nodes in the graph
- Average path length: the number of steps it takes, on average, to reach any node from any other node in the graph
- Degree power law: the higher this number, the more unequal the distribution of connections within the network, which means that some nodes are very well connected and some hardly at all
- Average clustering coefficient: shows how well the nodes are embedded in their neighborhood, i.e. how likely a node's neighbors are to be connected to each other. A high value combined with a short average path length points to a "small world" effect within the network
- Modularity: the higher this parameter, the more clearly defined the communities within the network. A result of 0.4 or more is usually considered meaningful
- Betweenness centrality: calculated for each node, it shows how often the node appears on the shortest path between any two nodes in the network. The higher this parameter, the more influential the node is. The nodes with high betweenness centrality are not necessarily the ones with the most connections, and don't have to be the most "popular" ones

Here's a video of Gephi features (older version):



Gephi Features Tour from gephi on Vimeo.
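To make the statistics listed above more concrete, here is a minimal Python sketch that computes four of them (average degree, diameter, average path length and average clustering coefficient) on a tiny hand-made graph. It is not Gephi's implementation and not the R code from this post; the toy edge list and all function names are made up for illustration, and the graph is assumed to be undirected and connected.

```python
from collections import deque

# Hypothetical toy graph (undirected): a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

def build_adjacency(edge_list):
    """Turn an edge list into {node: set of neighbors}."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def average_degree(adj):
    return sum(len(nbs) for nbs in adj.values()) / len(adj)

def path_stats(adj):
    """Return (diameter, average path length); assumes a connected graph."""
    lengths = []
    for source in adj:
        dist = bfs_distances(adj, source)
        lengths.extend(d for target, d in dist.items() if target != source)
    return max(lengths), sum(lengths) / len(lengths)

def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbs = list(adj[node])
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbs[j] in adj[nbs[i]])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

adj = build_adjacency(edges)
diameter, avg_path = path_stats(adj)
print("average degree:", average_degree(adj))
print("network diameter:", diameter)
print("average path length:", avg_path)
print("average clustering coefficient:", round(average_clustering(adj), 3))
```

Running one breadth-first search per node is fine for a toy graph, but it gets slow on networks with tens of thousands of nodes, which is one reason to leave these calculations to a dedicated tool.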


I was also able to discover interesting patterns in the data, like the communities that emerge, the popular people and the connectors. In the graph below, node color shows the community, node size shows popularity, and most connections between the two communities flow through a few nodes (the connectors).
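The idea of communities and connectors can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not what Gephi runs internally: the toy network, the node names and the hand-picked split into two communities are all hypothetical. It computes betweenness centrality with Brandes' algorithm and the modularity score of the given split (Gephi searches for a good split itself; here the split is supplied by hand).

```python
from collections import deque

# Hypothetical toy network: two tightly knit groups (a1..a4 and b1..b4)
# that are linked only through a single go-between node "x".
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3", "b4"]
edges = [(u, v) for g in (group_a, group_b)
         for i, u in enumerate(g) for v in g[i + 1:]]
edges += [("a1", "x"), ("x", "b1")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def betweenness(adj):
    """Unnormalized betweenness (Brandes' algorithm), undirected and unweighted."""
    score = {n: 0.0 for n in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {n: [] for n in adj}   # predecessors on shortest paths from s
        sigma = {n: 0 for n in adj}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in adj}
        while stack:                   # walk back from the farthest nodes
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both ends, so halve the totals.
    return {n: c / 2 for n, c in score.items()}

def modularity(adj, communities):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = sum(len(nbs) for nbs in adj.values()) / 2
    q = 0.0
    for members in communities:
        members = set(members)
        internal = sum(1 for u in members for v in adj[u] if v in members) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q

scores = betweenness(adj)
q = modularity(adj, [group_a + ["x"], group_b])
for node in sorted(adj, key=lambda n: -scores[n]):
    print(node, "degree:", len(adj[node]), "betweenness:", scores[node])
print("modularity:", round(q, 3))
```

In this toy network the go-between "x" has the fewest connections (2) but the highest betweenness (16), ahead of a1 and b1 (4 connections, betweenness 15), which is the "connector" pattern described above. The modularity of the two-community split comes out at about 0.43.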




Web rendition of these graphs is also possible. Check out this visualization of Perl authors (you can drag the graph by holding the left mouse button and moving the mouse, and you can zoom in/out with your mouse wheel).

Lessons learned so far:
- I quickly realized that you need a good machine for using Gephi effectively (a good video card, enough memory and fast CPU)
- There is a great, active community behind Gephi, so expect frequent releases that resolve critical issues and add new features. I'm waiting for this month's release that fixes some issues I've faced :-)

If you know of other tools to visually explore graphs, please leave a comment.

Which network data did I use? There's a wealth of network data available today, including data from social networking sites, phone logs, work history, chat logs, email logs etc. I decided to create test data for email traffic to test the hypothesis that "who we send emails to or receive emails from" is a good indicator of our social network. My test data has 50,000+ nodes and 150,000+ edges.

I used R to create the data for the graph. R is good at handling millions of rows of data and is powerful for data manipulation (cleaning, creating edge lists, adjacency matrices etc.). I used it to format the raw email traffic test data into graph formats. It took me a couple of hours to write code for creating the data set to feed into Gephi. Using R to solve a real need has been a good learning experience so far.
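For readers who want a feel for this data preparation step, here is a rough Python sketch of the same idea (the post's actual code is in R and is not reproduced here). It assumes the raw data is a list of (sender, recipient) pairs, one per email; the addresses, function names and the choice of CSV output are all assumptions for illustration. The "Source", "Target" and "Weight" column headers are the ones Gephi's spreadsheet import recognizes for edge tables.

```python
import csv
import io
from collections import Counter

# Hypothetical raw email log: one (sender, recipient) pair per message.
email_log = [
    ("ann@example.com", "bob@example.com"),
    ("ann@example.com", "bob@example.com"),
    ("bob@example.com", "ann@example.com"),
    ("bob@example.com", "cat@example.com"),
    ("cat@example.com", "ann@example.com"),
]

def weighted_edge_list(log):
    """Collapse repeated messages into (sender, recipient, count) rows."""
    counts = Counter((s, r) for s, r in log if s != r)  # drop self-mails
    return sorted((s, r, w) for (s, r), w in counts.items())

def adjacency_matrix(edge_rows):
    """Dense directed matrix: matrix[i][j] = emails from node i to node j."""
    nodes = sorted({n for s, r, _ in edge_rows for n in (s, r)})
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for s, r, w in edge_rows:
        matrix[index[s]][index[r]] = w
    return nodes, matrix

def edges_to_csv(edge_rows):
    """Render the edge list as CSV text with Source/Target/Weight headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edge_rows)
    return buffer.getvalue()

edge_rows = weighted_edge_list(email_log)
nodes, matrix = adjacency_matrix(edge_rows)
csv_text = edges_to_csv(edge_rows)
print(csv_text)
for name, row in zip(nodes, matrix):
    print(name, row)
```

For a network of 50,000+ nodes, a dense adjacency matrix would need about 2.5 billion cells, so the edge list is the more practical of the two formats at that size.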

Leave a comment if you're interested in seeing the code.





Update: I've posted the code in this post

15 comments:

  1. Great to see you are so enthusiastic about R and Gephi for analyzing and visualizing complex network data.
    Although I am working on a different domain (molecular nutrition) I am facing the same challenges you were able to tackle. I would appreciate if you could share (parts of) your code, especially how to transform data structures in R such way it can be handled by Gephi.
    Best wishes,
    Guido

    guido[ dot ]hooiveld[ at ]wur[ dot ]nl

  2. Thanks Guido. I've posted the code I used in a new post -> http://prasoon.blogspot.com/2010/06/data-preparation-code-for-social.html

  3. Anonymous, July 19, 2010

    "It took me a couple of hours to write code for creating the data set to feed into Gephi."

    Does that mean you have an export routine from R to gexf? that would be very helpful for a lot of people (i guess)

  4. Sorry for the delayed response to the last comment. Yes, I have an R routine to write data in gexf format. I've posted it in a new post -> http://www.rcasts.com/2010/06/data-preparation-code-for-social.html

  5. Hi. I am interested in including one of your graphs in a presentation I am working on (with proper attribution of course).
    Would that be possible ?

  6. Sure Laurent (@lalquier). Pls feel free to refer to any material on my blog. Thanks

  7. Hi,
    Great post shared by author. I appreciate it. Thanks for sharing this informative post. Keep posting updates.

  8. Hey by using your all charts and graphs we are getting much more benefit.....Thank you

  9. Very interesting, thanks for sharing

  10. Hey Its really good job buddy you are giving the clear cut ideas on Social Network Analysis which has become the backbone of our society .
    Its also giving the opportunity to create communities and through this people are increasing the memberlist of their community

  11. A very interesting presentation. I had no idea that social network analysis actually exists, and now I am thinking: this is what a professional writes about. I am specialized on SEO, trying to become an inbound marketeer, and that means i must improve and learn more about promoting a business on social media sites, along with another 20 domains i must cover. Since I am more of a mathematical person, I love to learn and read this kind of data. I hope I can start to analyze social network influential members soon, as I could reach them to promote my recent project - Hotel Cluj
    Thank you again, and I will follow you.

  12. So very interested to see the code! Please share :)

  13. Hi , can i use gephi program if I Am not a programmer ? and from where i can find data base that i want to analyse?
    thank you.

  14. Hi I would like to see your code. Is it possible to see your code?

  15. Hi,

    I have been trying to plot my data using igraph but most times R exits or takes a really long time. The graphs are not even visually appealing.
    My data consists of 50,000+ nodes and about 4Million+ edges. Would it be possible to do it with gephi?
    Let me know. Thanks
