Jun 2, 2010

Data preparation for Social Network Analysis using R and Gephi

I want to share my experience generating data for social network analysis with R and analyzing it in Gephi...


WHICH DATA STRUCTURE TO USE FOR LARGE GRAPHS?
I quickly realized that using edge lists and adjacency matrices gets difficult as graph size increases. So I needed an alternative graph format that was efficient for storage and flexible enough to capture details like edge weight. I chose Gephi's GEXF file format as it can handle large graphs and supports dynamic and hierarchical structures. Check out the GEXF comparison with other formats for details.
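
For reference, here's roughly what a minimal GEXF file looks like (a hand-written sketch for illustration; the node labels and ids are made up):

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.1draft" version="1.1">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="1" label="alice@example.com"/>
      <node id="2" label="bob@example.com"/>
    </nodes>
    <edges>
      <edge id="1" source="1" target="2" weight="3"/>
    </edges>
  </graph>
</gexf>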


HOW TO HANDLE LARGE DATA SETS IN R?
As I tried to process millions of rows of email logs to derive the edge list, I realized a few things...

1) R holds its data in memory, so it cannot handle data sets larger than my computer's RAM. I had to look for a way to use R with larger-than-memory data, and packages like RMySQL and sqldf came in handy. sqldf uses SQLite, which runs in-memory by default; if your data cannot fit into RAM, you can instruct SQLite to use a persistent on-disk store instead (see the sketch after the note below).
Note: There are many other ways to handle large data in R effectively, e.g. the R multicore package for parallel processing, R on MapReduce/Hadoop, etc. Check out the presentation on high performance computing in R for other techniques like ff and bigmemory. Please shout if there are other ways that you have used...
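
As a minimal sketch, staging a large CSV in an on-disk SQLite database via sqldf might look like this (the file name and column names are assumptions for illustration):

library(sqldf)

# read.csv.sql loads the file into SQLite instead of reading it all into R;
# passing a dbname makes SQLite use a persistent on-disk database, not RAM
emails <- read.csv.sql("email_log.csv",
                       sql = "select sender, recipient from file",
                       dbname = "emails.db")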

2) Some operations are better suited for a database/RDBMS: I offloaded RDBMS-suited tasks like joins and aggregations to SQLite, the default database used by sqldf.
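
For example, you can let SQLite do the grouping to compute edge weights (a sketch, assuming the emails data frame from the previous step):

# derive a weighted edge list: SQLite does the grouping, R gets the result
edges <- sqldf("select sender, recipient, count(*) as weight
                from emails group by sender, recipient")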

3) Learn memory management in R:
- By default R allocates ~1.5GB of memory for its use. I allocated more so R could handle larger objects using the command memory.limit(size=3000) (size is in MB; this works on Windows).
- Remove unwanted objects from the R session once you are done with them, e.g.
rm(raw_emails, emails, to_nodes, from_nodes, all_nodes, unique_nodes)
gc() # call garbage collection explicitly to release the freed memory
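
To spot which objects are worth removing, a quick helper like this lists the session's objects by size (not from the original post, just something handy):

# list objects in the current session, largest first (sizes in bytes)
sizes <- sapply(ls(), function(x) object.size(get(x)))
sort(sizes, decreasing = TRUE)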



LOADING THE GRAPH IN GEPHI
Gephi wasn't able to handle very large graph files (for files larger than about 500MB, Gephi was either too slow or stopped responding). So I had to do a couple of things...

1) Increase the amount of memory Gephi allocates to the JVM at startup: By default Gephi allocates 512MB of memory to the JVM. This wasn't enough to load the large graph file, so I increased the maximum memory Gephi allocates to the JVM to 1.4GB.

Edit the {gephi_home}\etc\gephidesktop.conf file (e.g. C:\Program Files\Gephi-0.7\etc\gephidesktop.conf) and change the line
default_options="--branding gephidesktop -J-Xms64m -J-Xmx512m" to
default_options="--branding gephidesktop -J-Xms64m -J-Xmx1400m"

2) Decrease the file size by reducing the amount of text in the graph file, e.g. use shorter node ids, edge ids etc.
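
One easy way to get short ids is to replace the long email addresses with integer indexes before writing the file. A sketch, assuming the edge list with sender/recipient columns from earlier:

# map each distinct address to a small integer id to shrink the file
addresses <- unique(c(edges$sender, edges$recipient))
edges$source <- match(edges$sender, addresses)
edges$target <- match(edges$recipient, addresses)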

Also, Gephi complained about an incorrect file format (it expects UTF-8 encoded XML files). I fixed this simply by opening the graph file generated by R in TextPad and saving it in UTF-8 format before feeding it to Gephi.
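
Alternatively, you can probably skip the editor round-trip by writing UTF-8 directly from R. A sketch, assuming the GEXF content is held in a character vector gexf_lines:

# open the connection with an explicit encoding so the file is written as UTF-8
con <- file("emails.gexf", open = "w", encoding = "UTF-8")
writeLines(gexf_lines, con)
close(con)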


LESSONS LEARNED
1) R is more than a statistical tool. I was able to manipulate and clean large data sets (500+ million rows) easily. I will continue learning it. It's fun and rewarding.

2) There are other sophisticated tools for visual SNA (social network analysis), like Network Workbench. I will explore them for heavier analysis, but Gephi is very easy to use and continues to be my favorite.

3) Use a machine with a lot of RAM - both Gephi and R are memory hungry.


MY CODE FOR GENERATING THE GRAPH
By the way, here's the R code I used for preparing the graph from email logs for social network analysis using R and Gephi. I'm sure there are better ways to accomplish this. Please shout if you notice any.
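
The original embedded script isn't reproduced here, but a minimal end-to-end sketch of the flow described above might look like this (the column names, file names, and the lack of XML escaping are simplifying assumptions):

library(sqldf)

# 1) load the raw email log via SQLite (columns sender/recipient assumed)
emails <- read.csv.sql("email_log.csv",
                       sql = "select sender, recipient from file",
                       dbname = "emails.db")

# 2) build a weighted edge list: one edge per sender/recipient pair
edges <- sqldf("select sender, recipient, count(*) as weight
                from emails group by sender, recipient")

# 3) assign short integer ids to the nodes to keep the file small
addresses <- unique(c(edges$sender, edges$recipient))
nodes <- data.frame(id = seq_along(addresses), label = addresses,
                    stringsAsFactors = FALSE)
edges$source <- match(edges$sender, addresses)
edges$target <- match(edges$recipient, addresses)

# 4) emit the GEXF XML by pasting strings together
# (real code should XML-escape the labels)
node_xml <- sprintf('      <node id="%d" label="%s"/>', nodes$id, nodes$label)
edge_xml <- sprintf('      <edge id="%d" source="%d" target="%d" weight="%d"/>',
                    seq_len(nrow(edges)), edges$source, edges$target,
                    as.integer(edges$weight))
gexf_lines <- c('<?xml version="1.0" encoding="UTF-8"?>',
                '<gexf xmlns="http://www.gexf.net/1.1draft" version="1.1">',
                '  <graph defaultedgetype="directed">',
                '    <nodes>', node_xml, '    </nodes>',
                '    <edges>', edge_xml, '    </edges>',
                '  </graph>', '</gexf>')

# 5) write the file as UTF-8 so Gephi accepts it
con <- file("emails.gexf", open = "w", encoding = "UTF-8")
writeLines(gexf_lines, con)
close(con)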
