<<<<<<< HEAD
GraphChi  diskbased largescale graph computation
NOTE: This project has been recently moved from Google Code, and some of the wiki pages might be partly broken.
MIT Technology Review article about GraphChi: "Your laptop can now analyze big data"
NEW: Improved performance.
(October 21, 2013) Inedges are now loaded in parallel, improving performance on multicore machines significantly.
NEW: Graph contraction algorithms
Read about graph contraction technique, which we used to implemented efficient minimum spanning forest computation Graph Contraction Algorithms .
Discussion group
http://groups.google.com/group/graphchidiscuss
License
GraphChi is licensed under the Apache License, Version 2.0. Each source code file has full license information.
Introduction
GraphChi is a spinoff of the GraphLab ( http://www.graphlab.org ) project from the Carnegie Mellon University. It is based on research by Aapo Kyrola ( http://www.cs.cmu.edu/~akyrola/) and his advisors.
GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertexcentric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertexcentric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph. Section 'Performance' contains some examples of applications implemented for GraphChi and their running times on GraphChi.
The promise of GraphChi is to bring webscale graph computation, such as analysis of social networks, available to anyone with a modern laptop. It saves you from the hassle and costs of working with a distributed cluster or cloud services. We find it much easier to debug applications on a single computer than trying to understand how a distributed algorithm is executed.
In some cases GraphChi can solve bigger problems in reasonable time than many other available distributed frameworks. GraphChi also runs efficiently on servers with plenty of memory, and can use multiple disks in parallel by striping the data.
Even if you do require the processing power of highperformance clusters, GraphChi can be an excellent tool for developing and debugging your algorithms prior to deploying them to the cluster. For highperformance graph computation in the distributed setting, we direct you to GraphLab's new version (v2.1), which can now handle large graphs in astonishing speed. GraphChi supports also most of the new GraphLab v2.1 API (with some restrictions), making the transition easy.
GraphChi is implemented in plain C++, and available as opensource under the flexible Apache License 2.0.
Java version
Javaversion of GraphChi: https://github.com/GraphChi/graphchijava
Publication
GraphChi is part of the OSDI'12 proceedings. PDF of the paper can be downloaded here: http://select.cs.cmu.edu/publications/paperdir/osdi2012kyrolablellochguestrin.pdf
Slides (OSDI talk): http://www.cs.cmu.edu/~akyrola/files/osditalkgraphchi.pptx
Collaborative Filtering Toolkit
Danny Bickson has ported several collaborative filtering algorithms from GraphLab to GraphChi: http://bickson.blogspot.com/2012/12/collaborativefilteringwithgraphchi.html
Features

Vertexcentric computation model (similar to GraphLab, Pregel or Giraph) ** Wrapper for GraphLab 2.1 API (GatherApplyScatter model)

Asynchronous, parallel execution, with (optional) deterministic scheduling (see semantics section at CreatingGraphChiApplications )

Can run graphs with billions of edges, with linear scalability, on a standard consumer grade machine ** Can also utilize large amounts of memory by preloading (caching), making it competitive on large servers: UsingMoreMemory

Multidisk striping  RAIDstyle operation, see MultipleDisksSupports

Works well on both harddrive and SSD.

Evolving graphs, streaming graph updates (can add and delete edges): https://github.com/GraphChi/graphchicpp/wiki/EvolvingAndStreamingGraphs

Easy to install, headersonly, no dependencies.

Tested on Mac OS X and Linux
Getting Started
Best way to get started is to start from the ExampleApps page. Prior to that, you need to download the source code (no configuration or installation is required).
For an introduction on writing your own applications, read CreatingGraphChiApplications.
Problems compiling on Mac
If compiler complains about missing "omp.h" (OpenMP library), here is a way you can install it:
(Contributed by Jose Pablo Gonzalez):
 Install homebrew ( http://brew.sh )
 brew tap homebrew/versions
 Install a compiler: brew install applegcc42
 Modify the Makefile to use the new compiler: CPP = g++4.2
 make
How GraphChi works
GraphChi is based on the Parallel Sliding Windows method which allows efficient asynchronous processing of mutable graphs from disk. See IntroductionToGraphChi for description.
Performance
NOTE: Following performance figures are somewhat outdated. Current performance should be better.
In the table below, we have picked some recent running time results for largescale graph problems from the literature, and run the same experiment using GraphChi. For GraphChi, we used a Mac Mini (2012 model), with 8 gigabytes of RAM and 256 gigabyte SSD drive.
While distributed clusters can solve the same problems faster than GraphChi on a single computer, for many purposes GraphChi's performance should be adequate. The numbers below do not include time for transferring the input to cloud or cluster, and usually do not include the graph loading time. GraphChi's running times include loading the graph and saving the results, but not preprocessing time. Preprocessing needs to be done only once per graph (you can run many different algorithms on the same preprocessed graph). The preprocessing times are listed in a separate table below.
Application  Input graph  Graph size  Comparison  GraphChi on Mac Mini (SSD)  Ref 
Pagerank  3 iterations  twitter2010  1.5B edges  Spark, 50 machines, 8.1 min  13 min  1 
Pagerank  100 iterations  ukunion  3.8B edges  Stanford GPS (Pregel), 30 machines, 144 min  581 min  2 
Webgraph Belief Propagation (1 iter.)  yahooweb  6.7B edges  Pegasus, 100 machines, 22 min  27 min  3 
Matrix factorization (ALS), 10 iters  Netflix  99M edges  GraphLab, 8core machine, 4.7 min  9.8 min  4 
Triangle counting  twitter2010  1.5B edges  Hadoop, 1636 machines, 423 mins  55 min  5 
Performance on hard drive: We repeated the same experiments on harddrive, and the running times are roughly double compared to SSD. The harddrive performance is thus sufficient for many purposes.
Configuration: membudget_mb 3000 execthreads 2 loadthreads 2 niothreads 2 io.blocksize 1048576
1  I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. 2012. 
2  S.Salihoglu and J.Widom. GPS: A Graph Processing System. Technical Report, pages 1–32, Apr. 2012. 
3  U. Kang, D. H. Chau, and C. Faloutsos. Inference of Beliefs on BillionScale Graphs. KDDLDMTA’10, pages 1–7, June 2010. 
4  http://graphlab.org/datasets.html (Retrieved June 30, 2012) 
5  S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th international conference on World wide web, pages 607–614. ACM, 2011. 
Comparison to Giraph
Apache Giraph is an opensource implementation of the Pregel graph engine, built on top of Hadoop. Based on a recent talk by the main developer of Giraph ( http://www.youtube.com/watch?v=b5Qmz4zPjM ), Giraph running with 20 workers can run five iterations of !PageRank on a graph with 5 billion edges in approx. 75 minutes. Estimated from our results above, GraphChi can execute similar task in roughly the same time on just one machine. The structure of the input graph affects the runtime, so a direct comparison is not possible without using the same input.
Preprocessing times
Graph  Preprocessing time 
Netflix  1 min 
Twitter2010  10 min 
ukunion  33 min 
yahooweb  37 min 
======= graphchicpp
Clone of Graphchi, modified for Caltech CS156
53a7944cb28175c75bbfbab1da2b86d57e99bd82