Apr 28, 2010

Social Network Analysis using R and Gephi

After learning the basics of R, I decided to learn something harder last week. I picked Social Network Analysis (SNA) so I could learn its concepts and practice R at the same time. My primary interest in SNA is visual exploration of networks, so I needed to find a tool first.

Which tool to use for visual SNA? Features needed:
1) graphical representation of network
2) visually navigate the graph (zoom in/out, drag) to explore large graphs
3) manipulate the graph (filter nodes, edit/delete/group nodes and same for edges)
4) free, preferably open source. I'm not currently interested in commercial tools (can't justify their steep price tag for my experiments).

I found out that R has good libraries for social network analysis, like sna (check out Drew Conway's tutorial) and igraph (see this tutorial). However, they struggle with large graphs (more than about 200 nodes or 500 edges seemed to make the process slow and the plots unusable), and they lack features to navigate and manipulate the graph visually. All I could do was plot simple networks (shown below).

So I continued my hunt for a good tool for visual SNA and discovered Gephi, an open source app for visual exploration. Think "Photoshop for graphs".

WARNING: SNA with Gephi is addictive. I've had my share of sleepless nights and dreams of nodes and edges.

After you download Gephi, check out the Gephi quick start guide to get your bearings.

I played with Gephi for several hours to learn it (it's kewl) and to impress my daughter (her dad is no fireman who saves people, but he can do nifty things with a computer ;). I was able to discover interesting facts from the data, including:

- Avg. degree: the average number of edges/connections attached to a node
- Network diameter: the longest of the shortest paths between any two nodes in the graph
- Average path length: in how many steps (on avg.) one can reach any node from any other node in the graph
- Degree power law: the higher this number, the more unequal the distribution of connections within the network, which means that some nodes are very well-connected and some are not at all
- Average clustering coefficient: shows how well the nodes are embedded in their neighborhood, i.e. whether there is a "small world" effect within the network
- Modularity: the higher this parameter, the more defined the communities within the network are. A result of 0.4 or more is usually considered meaningful
- Betweenness centrality: calculated for each node, it shows how often the node appears on the shortest paths between pairs of other nodes in the network. The higher this parameter, the more influential the node is. Nodes with high betweenness centrality are not necessarily the ones that have the most connections and don't have to be the most "popular" ones

Here's a video of Gephi features (older version):

Gephi Features Tour from gephi on Vimeo.
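
To make a few of the metrics listed above concrete, here is a minimal sketch in plain Python (not R, and not Gephi's own implementation). It assumes a small, connected, undirected toy graph; the edge list and all function names are made up for illustration. It covers average degree, diameter, average path length and average clustering coefficient; modularity and the degree power law are left out.

```python
from collections import deque

# Hypothetical toy network: a triangle (a, b, c) with a tail (c - d - e).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

def build_graph(edge_list):
    """Return an undirected adjacency map {node: set of neighbours}."""
    graph = {}
    for a, b in edge_list:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def bfs_distances(graph, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def average_degree(graph):
    """Mean number of connections per node."""
    return sum(len(nbs) for nbs in graph.values()) / len(graph)

def diameter_and_average_path_length(graph):
    """Longest shortest path and mean shortest path (connected graph assumed)."""
    longest, total, pairs = 0, 0, 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, total / pairs

def average_clustering(graph):
    """Mean share of each node's neighbour pairs that are themselves linked."""
    coeffs = []
    for nbs in graph.values():
        k = len(nbs)
        if k < 2:
            coeffs.append(0.0)  # fewer than two neighbours: no pairs to link
            continue
        links = sum(1 for a in nbs for b in nbs if a < b and b in graph[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

graph = build_graph(edges)
avg_degree = average_degree(graph)
diameter, avg_path = diameter_and_average_path_length(graph)
avg_clustering = average_clustering(graph)
print(avg_degree, diameter, avg_path, round(avg_clustering, 3))
```

Running one breadth-first search per node is fine for a toy graph, but it is far too slow for a network with tens of thousands of nodes, which is part of why a dedicated tool helps.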

I was also able to discover interesting patterns in the data, like the communities that emerge, the popular people and the connectors. The graph below shows nodes with different colors (communities) and sizes (popularity), and how most connections between the two communities flow through a few nodes (connectors).
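
The "connector" idea can be illustrated with a small sketch in plain Python. It assumes a made-up network of two tightly knit groups joined by a single bridge node "x", and scores every node with Brandes' algorithm for betweenness centrality (unnormalized, undirected, unweighted). This is only an illustration of the concept, not the code Gephi runs, and the node names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical network: two fully connected groups joined through node "x".
left = ["a", "b", "c", "d"]
right = ["e", "f", "g", "h"]
edges = (list(combinations(left, 2)) + list(combinations(right, 2))
         + [("d", "x"), ("x", "e")])

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []                         # nodes in order of distance from s
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        sigma = {v: 0 for v in graph}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:          # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1: # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                       # accumulate from the farthest node back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both ends, so halve the totals.
    return {v: b / 2 for v, b in score.items()}

scores = betweenness(graph)
degrees = {v: len(nbs) for v, nbs in graph.items()}
top_connector = max(scores, key=scores.get)
print(top_connector, scores[top_connector], degrees[top_connector])
```

In this toy network "x" has only two connections, fewer than any other node, yet it has the highest betweenness because every path between the two groups passes through it.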

Web rendition of these graphs is also possible. Check out this visualization of Perl authors (you can drag the graph by holding the left mouse button and moving the mouse; you can zoom in/out with your mouse wheel).

Lessons learned so far:
- I quickly realized that you need a good machine for using Gephi effectively (a good video card, enough memory and fast CPU)
- There is a great, active community behind Gephi, so expect frequent releases that resolve critical issues and add new features. I'm waiting for this month's release, which fixes some issues I've faced :-)

If you know of other tools to visually explore graphs, please leave a comment.

Which network data did I use? There's a wealth of network data available today, including social networking sites, phone logs, work history, chat logs, email logs etc. I decided to create test data for email traffic to test the hypothesis that who we send emails to and receive emails from is a good indicator of our social network. My test data has 50,000+ nodes and 150,000+ edges.

I used R to create the data for the graph. R is good at handling millions of rows of data and is powerful for data manipulation (cleaning, creating edge lists, adjacency matrices etc.). I used it to convert the raw email traffic test data into graph formats. It took me a couple of hours to write the code for creating the data set to feed into Gephi. Using R to solve a real need has been a good learning experience so far.
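
As a rough idea of what that data preparation involves, here is a minimal sketch in plain Python (not the R code mentioned above). It assumes a hypothetical email log of (sender, recipient) pairs and turns it into a weighted edge list, an adjacency matrix, and CSV text. The addresses and function names are invented for illustration, and the `Source`, `Target`, `Weight` header is an assumption about what your Gephi version expects when importing a spreadsheet of edges.

```python
import csv
import io
from collections import Counter

# Hypothetical email log: one (sender, recipient) pair per message.
emails = [
    ("ann@example.com", "bob@example.com"),
    ("ann@example.com", "bob@example.com"),
    ("bob@example.com", "ann@example.com"),
    ("bob@example.com", "cat@example.com"),
    ("cat@example.com", "cat@example.com"),  # mail to self, dropped below
]

def to_edge_list(messages):
    """Count messages per directed (sender, recipient) pair, skipping self-mail."""
    counts = Counter((s, r) for s, r in messages if s != r)
    return sorted((s, r, n) for (s, r), n in counts.items())

def to_adjacency_matrix(edge_list):
    """Return (node labels, matrix) where matrix[i][j] is the weight of i -> j."""
    nodes = sorted({n for s, r, _ in edge_list for n in (s, r)})
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for s, r, w in edge_list:
        matrix[index[s]][index[r]] = w
    return nodes, matrix

def edge_list_to_csv(edge_list):
    """Serialize the edge list as CSV text with an assumed header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edge_list)
    return buf.getvalue()

edge_list = to_edge_list(emails)
nodes, matrix = to_adjacency_matrix(edge_list)
csv_text = edge_list_to_csv(edge_list)
print(csv_text)
```

For a network with 50,000+ nodes the edge list is the practical format, since a dense adjacency matrix would need billions of cells.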

Leave a comment, if you're interested in seeing the code.

Update: I've posted the code in this post


  1. Great to see you are so enthusiastic about R and Gephi for analyzing and visualizing complex network data.
    Although I am working on a different domain (molecular nutrition) I am facing the same challenges you were able to tackle. I would appreciate if you could share (parts of) your code, especially how to transform data structures in R such way it can be handled by Gephi.
    Best wishes,

    guido[ dot ]hooiveld[ at ]wur[ dot ]nl

  2. Thanks Guido. I've posted the code I used in a new post -> http://prasoon.blogspot.com/2010/06/data-preparation-code-for-social.html

  3. Anonymous, July 19, 2010

    "It took me a couple of hours to write code for creating the data set to feed into Gephi."

    Does that mean you have an export routine from R to gexf? that would be very helpful for a lot of people (i guess)

  4. Sorry for the delayed response to the last comment. Yes, I've a R routine to write data in gexf format. I've posted it in a new post -> http://www.rcasts.com/2010/06/data-preparation-code-for-social.html

  5. Hi. I am interested in including one of your graphs in a presentation I am working on (with proper attribution of course).
    Would that be possible ?

  6. Sure Laurent (@lalquier). Pls feel free to refer to any material on my blog. Thanks

  7. Hi,
    Great post shared by author. I appreciate it. Thanks for sharing this informative post. Keep posting updates.

  8. Hey by using your all charts and graphs we are getting much more benefit.....Thank you

  9. Very interesting, thanks for sharing

  10. Hey Its really good job buddy you are giving the clear cut ideas on Social Network Analysis which has become the backbone of our society .
    Its also giving the opportunity to create communities and through this people are increasing the memberlist of their community

  11. A very interesting presentation. I had no idea that social network analysis actually exists, and now I am thinking: this is what a professional writes about. I am specialized on SEO, trying to become an inbound marketeer, and that means i must improve and learn more about promoting a business on social media sites, along with another 20 domains i must cover. Since I am more of a mathematical person, I love to learn and read this kind of data. I hope I can start to analyze social network influential members soon, as I could reach them to promote my recent project - Hotel Cluj
    Thank you again, and I will follow you.

  12. So very interested to see the code! Please share :)

  13. Hi , can i use gephi program if I Am not a programmer ? and from where i can find data base that i want to analyse?
    thank you.

  14. Hi I would like to see your code. Is it possible to see your code?

  15. Hi,

    I have been trying to plot my data using igraph but most times R exits or takes a really long time. The graphs are not even visually appealing.
    My data consists of 50,000+ nodes and about 4Million+ edges. Would it be possible to do it with gephi?
    Let me know. Thanks

  16. Hi, kindly send me code for data!

  17. i like your codes thanks for sharing Documents Translation & webdevelopment Services best web development and translation service

  18. I am grateful to you and expect more number of posts like these. Thank you very much.

  19. Thanks for sharing your work, Really a hard work...

  20. Thanks for sharing the post.. parents are worlds best person in each lives of individual..they need or must succeed to sustain needs of the family.

  21. can you please send me to code? I think this is super usefull!!!
    thanks a lot!

  24. For the Virtual SNA this was the best thing to do, furthermore they have discovered such more ideas which are even giving the so true and fair story after all, will of course look forward to have such more ideas for the future. essay writing service


  25. R is a free software environment for statistical computing and graphics. It is called "The R Project for Statistical Computing". It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

  26. My spouse and i shocked while using investigation anyone created to choose this certain release outstanding. Amazing task!

  27. This comment has been removed by the author.

  28. Your environment and explaining too good now i easily run this educational software thanks for share it reword this for me .


  29. Excellent post!!! The strategy you have posted on this technology helped me to get into the next level and had lot of information in it.

  30. Wonderful Post. With one of a kind substance, I truly motivate enthusiasm to peruse this post. I trust this article help huge numbers of them who looking this pretty data.

  31. Very good website you have here but I was wondering if you knew of any message boards
    that cover the same topics discussed in this article? I'd really
    love to be a part of online community where I can get opinions from other
    knowledgeable individuals that share the same interest.
    If you have any suggestions, please let me know.
    Many thanks!

  32. Hi there. Does Gephi allow us to create a network that has clickable nodes which link us to particular website?

  33. Thanks for sharing this with us it is a worth read. xcellent post!!! Our Digital Marketing Training is tailored for beginners who want to learn how to stand out digitally, whether it is for their own business or a personal brand.
