Jun 2, 2010

Data preparation for Social Network Analysis using R and Gephi

I want to share my experience in generating the data for social network analysis using R and analyzing it using Gephi...

I quickly realized that edge lists and adjacency matrices become hard to work with as graph size increases. So I needed an alternative graph format that was efficient (for storage) and flexible enough to capture details like edge weight. I chose Gephi's GEXF file format because it can handle large graphs and supports dynamic and hierarchical structures. Check out the GEXF comparison with other formats for details.
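To make the format concrete, here is a minimal sketch of writing a weighted, directed edge list to a GEXF file using only base R. The data frame, the column names, the xml_escape helper and the output file name are all made up for illustration, and the 1.2draft namespace is an assumption (older Gephi releases used an earlier draft). This is not the code used for the original analysis.

```r
# Hypothetical weighted edge list: who emailed whom, and how often.
edges <- data.frame(
  source = c("alice", "alice", "bob"),
  target = c("bob", "carol", "carol"),
  weight = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Node list derived from the endpoints of the edges.
nodes <- unique(c(edges$source, edges$target))

# Hypothetical helper: escape characters that are special in XML attributes.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

# One line of XML per node and per edge; sprintf() is vectorized.
gexf_lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">',
  '  <graph mode="static" defaultedgetype="directed">',
  '    <nodes>',
  sprintf('      <node id="%s" label="%s"/>',
          xml_escape(nodes), xml_escape(nodes)),
  '    </nodes>',
  '    <edges>',
  sprintf('      <edge id="%d" source="%s" target="%s" weight="%s"/>',
          seq_len(nrow(edges)) - 1L,
          xml_escape(edges$source), xml_escape(edges$target),
          as.character(edges$weight)),
  '    </edges>',
  '  </graph>',
  '</gexf>'
)

# Written to the session's temporary directory to keep the example tidy.
out <- file.path(tempdir(), "emails.gexf")
writeLines(gexf_lines, out)
```

Each edge carries its weight as an attribute, which is the kind of detail that is awkward to keep in a plain adjacency matrix.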

As I tried to process millions of rows of email log to derive the edgelist, I realized a couple of things...

1) R cannot handle data larger than my computer's RAM, so I had to look for a way to use R with large data sets. R packages like RMySQL and sqldf came in handy for this. By default, sqldf uses SQLite with an in-memory database. If your data cannot fit into RAM, you can instruct SQLite to use a persistent (on-disk) store instead.
Note: There are many other ways to handle large data in R effectively, e.g. R multicore package for parallel processing, R on MapReduce/Hadoop, etc. Check out the presentation on high performance computing in R for other techniques like ff and bigmemory. Please shout if there are other ways that you used...

2) Some operations are better suited to a database/RDBMS: I offloaded RDBMS-suited tasks to SQLite, the default database used by sqldf.

3) Learn memory management in R:
- By default R allocates ~1.5GB of memory for its use. I allocated more memory for R to handle larger objects using the command "memory.limit(size=3000)" (the size is in MB, and this function is specific to R on Windows)
- Remove unwanted objects from the R session e.g.
rm(raw_emails, emails, to_nodes,from_nodes,all_nodes, unique_nodes)
gc() # call garbage collection explicitly
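The three points above can be sketched together. This is an illustration under stated assumptions rather than the code used for the analysis: it assumes the sqldf package is installed, and the emails data frame and its column names are hypothetical. Passing dbname = tempfile() asks sqldf to build its SQLite database in a file on disk rather than in memory.

```r
library(sqldf)  # assumes sqldf (and its RSQLite dependency) is installed

# Hypothetical slice of an email log; the real one would have millions of rows.
emails <- data.frame(
  sender    = c("alice", "alice", "bob", "alice"),
  recipient = c("bob", "bob", "carol", "carol"),
  stringsAsFactors = FALSE
)

# The GROUP BY runs inside SQLite, not in R. dbname = tempfile() puts the
# SQLite database in a temporary file on disk instead of in memory.
edges <- sqldf(
  "SELECT sender AS source, recipient AS target, COUNT(*) AS weight
     FROM emails
    GROUP BY sender, recipient
    ORDER BY sender, recipient",
  dbname = tempfile()
)

# Drop the large intermediate object and trigger garbage collection.
rm(emails)
invisible(gc())
```

Note that this sketch still starts from a data frame that is already in R. For a log file that does not fit in RAM at all, sqldf also provides read.csv.sql, which loads a file into SQLite and returns only the query result to R.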

Gephi wasn't able to handle very large graph files (e.g. for files > 500MB size, Gephi was either too slow or stopped responding). So I had to do a couple of things...

1) Increase the amount of memory Gephi allocates for the JVM at startup: By default Gephi allocates 512MB memory for JVM. This wasn't enough to load the large graph file, so I increased the max. memory Gephi allocated for JVM to 1.4GB.

Edit {gephi_home}\etc\gephidesktop.conf file (e.g. C:\Program Files\Gephi-0.7\etc\gephidesktop.conf) and change the line
default_options="--branding gephidesktop -J-Xms64m -J-Xmx512m" to
default_options="--branding gephidesktop -J-Xms64m -J-Xmx1400m"

2) Decrease the file size by reducing the text in the graph file e.g. use shorter node_ids, edge_ids etc.
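One way to shorten the ids, sketched in base R, is to replace each long address with a small integer and keep a lookup table for the labels. The addresses and column names here are hypothetical.

```r
# Hypothetical edge list keyed by full email addresses.
edges <- data.frame(
  source = c("alice@example.com", "alice@example.com", "bob@example.com"),
  target = c("bob@example.com", "carol@example.com", "carol@example.com"),
  stringsAsFactors = FALSE
)

# Lookup table: one short integer id per distinct address.
nodes <- data.frame(
  label = unique(c(edges$source, edges$target)),
  stringsAsFactors = FALSE
)
nodes$id <- seq_len(nrow(nodes))

# Replace the long addresses in the edge list with the short ids.
edges_short <- data.frame(
  source = nodes$id[match(edges$source, nodes$label)],
  target = nodes$id[match(edges$target, nodes$label)]
)
```

The full address then only needs to appear once per node (as its label) rather than once per edge, which matters when the graph has far more edges than nodes.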

Also, Gephi complained about incorrect file format (it expects UTF-8 encoded XML files). I fixed this simply by opening the graph file generated by R in Textpad and saving it in UTF-8 format before feeding it to Gephi.
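The manual re-save step can probably be avoided by writing the file as UTF-8 from R in the first place. The following is a minimal base-R sketch, not what I did at the time; the sample lines and the file name are hypothetical.

```r
# Hypothetical lines of a graph file, including a non-ASCII label.
graph_lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<node id="1" label="Hj\u00f8rdis"/>'
)

out <- file.path(tempdir(), "graph_utf8.txt")

# enc2utf8() converts the strings to UTF-8, and useBytes = TRUE tells
# writeLines() to write those bytes as-is rather than re-encoding them
# into the platform's native encoding.
con <- file(out, open = "w")
writeLines(enc2utf8(graph_lines), con, useBytes = TRUE)
close(con)
```

This matters mostly on systems whose native encoding is not UTF-8, which was typically the case on Windows.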

1) R is more than a statistical tool. I was able to manipulate and clean large data sets (500+ million rows) easily. I will continue learning it. It's fun and rewarding.

2) There are other sophisticated tools for visual SNA (social network analysis), like Network Workbench. I will explore it for heavy analysis, but Gephi is very easy to use and continues to be my favorite.

3) Use a machine with a lot of RAM: both Gephi and R are memory hungry.

By the way, here's the R code I used for preparing the graph from email logs for social network analysis using R and Gephi. I'm sure there are better ways to accomplish this. Please shout if you notice any.


  1. Excellent work! I also want to use both R and Gephi to visualize graphs and stats together to explore a very large dataset, and your message gave me some clues on how the two might work together. I basically have the same problems you experienced.

  2. Thanks so much for this!

  3. Thanks for the tutorial. You can also use fread from data.table to quickly load big data in R (~10 times faster than read.csv). The dplyr package also has really good features for dealing with big data. See http://handsondatascience.com/BigDataO.pdf
    The rgexf package is also really interesting for your purpose. It lets you create .gexf files directly in R. See http://www.vesnam.com/Rblog/viznets2/
