Semi-structured data and P2P graph databases

In a previous post I introduced the Plasma graph query engine that I’ve been working on as part of my thesis project. With Plasma you can declaratively define queries and evaluate them against a graph database. The heart of the system is a library of dataflow query operators, and on top of them sits a fairly simplistic query “language”. (I put it in quotes because in a lisp based language like Clojure the line between a mini-language and an API gets blurry.) In this post I’ll write a bit about why I think graph databases could be an interesting foundation for next generation P2P networks, and then I’ll give some examples of performing distributed graph queries using Plasma. First I think it is important to motivate the use of a graph database though. While most of the marketing speak on the web regarding graph databases is all about representing social network data, this is just one of many potential applications.

Semi-structured data

Graph databases are interesting because they allow you to represent semi-structured data. For some of the original insight that led to this concept I recommend checking out Querying Semi-Structured Data by Serge Abiteboul. In short, he argues that there are many instances where a well defined schema cannot be known in advance. Semi-structured data, he argues, has an irregular structure that is implicitly defined by the data. Imagine, for example, that you would like to create a next generation wiki that allows you to organize and relate information in arbitrary ways. Often times a wiki will start with little to no structure, but over time you would like to regularize and structure parts of the data to make it searchable, sortable, and queryable. In a relational system this kind of evolution with the data is painful to impossible. Migrating from one schema to the next can be laborious and difficult, and often times you only know that you need to store a new kind of data at the moment the data is available. If at each such moment you have to re-design the schema and create a migration it will severely limit the evolution of the database structure. This is where I think graph databases will shine in the future. Think of them more like a file-system++, where you can gradually add more and more structure to your data, and at any time you can create new kinds of relations or start storing new types of data. As we form richer connections to each other over the Internet, I think this kind of structural information will become even more important.

A data driven P2P substrate

Currently P2P applications tend to operate in isolation. BitTorrent, Naptser, Gnutella, eDonkey and the other file sharing systems each live(d) in their own walled garden. There are many advantages to a system that provides a common substrate on top of which a variety of P2P apps can be built though. Besides the ability to make better use of network and compute resources, the major advantage is with regard to the data. If P2P apps/algorithms can query semi-structured data on a peer then we can create evolvable, content driven P2P networks. For example, peers can be clustered based on common interests, and bits of data from many peers can be gathered to create new views and new applications on existing data. In some ways this is similar to what the semantic-web is attempting to do: represent data in a machine readable fashion so we can create software to perform automated “reasoning” across the web. I just don’t think any kind of top-down, global ontology is realistic or practical though. In the long term we can agree to use some common schemas to better share data, and for this the semantic web efforts will be very useful, but in the meantime I think most data is and will be semi-structured. Every community, application and user will have their own ways of thinking about and relating information, and this needs to be taken into account. Furthermore, I want to live in a world where you have full control over your own data, and everything you care about can be local.

The cloud is great for many applications, especially big-data processing, but it also has many disadvantages that people tend to gloss over. For one, the speed of light is not going to increase with Moore’s law. No matter what we do cloud based systems will put hard limits on the latency for accessing data. In many instances this doesn’t matter, but for creating collaborative applications where we are editing media or hacking on code, I don’t think it will cut it. (Sure, some companies will have the resources to distribute replicas to nearby datacenters around the world, but do we want to limit ourselves to a few applications only offered by cloud-based mega corps?) More importantly, there are serious privacy and censorship problems with cloud based systems. I tend to trust Google, (and Facebook slightly less), but I still don’t like the idea that they have access to all of my personal communication and social network in plaintext. In the west these issues are currently more of a moral/intellectual conundrum, but for citizens of Iran, China, Egypt or Libya, this is a far more important issue that can literally mean prison or death. On the order of 1/4 to 1/5 of the world cannot safely access social networks. So whether it be for functional, moral or political reasons, I think there is still a huge and interesting space of fully distributed, P2P applications that has yet to arise.

P2P social networking

Think about a P2P rather than web based social network. Facebook (or Google+, which I’ve yet to get access to) would be even better if your whole social graph and all the data it entailed sat encrypted on your hard drive. Everything would be always available, low latency, and accessible in a way that lets users think of new ways to make use of their data, rather than waiting around for a new feed or API to get at images, updates, preferences, birthdays, or whatever you care about. You could make friends off the grid by sharing keys with people you meet, and whenever you hop online your local data would be synced with other peers. Cloud based proxy peers could be used to improve availability, but the primary source of data would always be at your fingertips. On top of such a system user communities could decide to store all kinds of interesting information pertinent to their interests, and they could make use of it in the ways they choose. The Clojure community, for example, could store code snippets, git repositories, interesting papers, book reviews, and hammock designs in their local data stores. People could then write new “apps” by issuing graph queries against their peer’s data, and creating new views on top of it. How does someone’s reading list correspond to their code? How about popping up an ad-hoc chat room with everyone currently using clojure.contrib.probabilities.monte-carlo to ask for help or find some example code. This is where I think the future of P2P could and should go. It is far more resistant to failure, difficult to impossible for governments and companies to censor and control, and much more empowering and interesting for the user.

Distributed Plasma

All that said, Plasma is a first experiment in trying to sew the seeds for such a platform. The idea is that each peer will store their data in a local graph database, and then they can choose what data to make available to the P2P network. Applications will access data on the network by issuing graph queries, which can transparently cross network boundaries to gather data in a declarative way, freeing the developer from the pain and suffering of network programming against a churning mass of peers. This is very much an experimental research project and there are already a number of things I can see where graph queries don’t cut it, but I think it does provide a lot of desired functionality. This post is long enough for today though, so if you got this far tune in tomorrow for some actual distributed graph query examples.

Facebook Vibes

So today Facebook’s video chat went live. I saw the headline on Hacker News and minutes later my mom popped up on Facebook chat, so what better time to give it a whirl. I hit download to give it a try, and after grabbing a jar file and running an install procedure we were up and running. The chat went pretty well, although it did hang and jitter a couple times. It got me wondering though, is this using my installed skype, or did it install a second copy of skype somewhere, or are they bundling the skype protocols in some kind of library? As any curious geek would, I started checking things out.

The download link grabs a file called FacebookVideoCalling.jar that’s only 27840 bytes in size, so it seemed highly unlikely that this was the entire codec. After scoping out the jar file a bit I found that it uses LiveConnect to let the java applet communicate with the javascript dom in the browser.

  protected JSObject util;
  // ...
  this.window = JSObject.getWindow(this);
  // ...
  String str1 = getParameter("appid");
  String str2 = getParameter("userid");

This way they can get your Facebook user ID, and interestingly they also grab an application ID. With this information they formulate a signed request to download the application. The interesting thing to note is that this installer supports two applications:

    if (paramString.equals("com.facebook.peep"))
      return this.window.getMember("VideoChatPlugin");
    if (paramString.equals("com.facebook.vibes")) {
      return this.window.getMember("MusicDownloadDialog");
    }

The video chat plugin, called peep, is what is downloaded now. At some point in the future they seem to be prepared to download another app though, called Facebook Vibes. I searched around to see what this is all about, and it seems that this is an unannounced feature that has yet to be released. The vibes app connects with a music download dialog in the page though, so I’m guessing that with this release we are seeing the seeds for Facebook’s upcoming music offering.

It’s too late to look into this further and I still haven’t found the codecs, but anyways, you heard it here first: Facebook Vibes.

The Plasma graph query engine

The Neo4J team recently blogged about a graph query language called Cypher, and reading their post got me motivated to write about something I’ve been working on for a while. Plasma is a graph query engine written in Clojure. Currently it sits on top of the Jiraph graph database that was written by Justin Balthrop and Lance Bradley for a geneology website, Geni. (It would be less than a days work to get it running on top of another graph database though.) The query engine is built using a library of query operators that are combined to form dataflow graphs, and it uses Zach Tellman’s asynchronous events library, Lamina, to provide concurrent, non-blocking execution of queries.

Query language examples

These query examples will be running against the test graph shown below (click for a big version). This graph represents the ACME company, which has three Swiss offices in Lugano, Geneva and Zurich. Each office has a manager, and the company produces a set of components which are combined to form widgets. (Have thoughts on more interesting example graphs? Please leave them in the comments because I’d like some more.)

ACME example graph

The full code for these examples, including the graph construction, is available in the github project here.

In the examples below I’ll be using the plasma.query namespace:

(require '[plasma.query :as query])

The functions in the query API operate on and produce a query plan, which can then be executed any number of times by passing it to the query function.

Select nodes with path expressions

; select all component nodes using a basic path expression
(query/path [:product :component])

; same query, using a regular expression for the component edge
(query/path [:product #"comp.*"])

; same query, with bindings to the product and component nodes
(query/path [product [:product]
             component [product :component]]))

A path expression starts at the root node of the graph by default, but you can start from another node by passing an argument to the query function when executed or by using the starting node’s id as the first element of the path.

The elements of a path expression are edge predicates. Keyword and regexp predicates will be matched against the :label property of an edge, and a map of predicates can be used to compare multiple edge properties.

Executing the queries above will return a list of maps containing the ids of the component nodes.

(query (query/path [:product :component]))

; result:
({:id "UUID:6df19085-0342-41a8-90b9-4d23f076e29d"}
 {:id "UUID:16e2c756-2f44-4414-a007-417a19ba71a0"}
 {:id "UUID:16e2c756-2f44-4414-a007-417a19ba71a0"}
 {:id "UUID:ae990134-acfd-40f0-aef5-413be7b1002b"}
 {:id "UUID:6729a33b-6cb7-44ad-921b-9a2ed0fe2bfc"})

Project node properties

(-> (query/path [comp [:product :component]])
  (query/project ['comp :label :cost]))

; result:
({:cost 24.0, :label :component-1}
 {:cost 8.0, :label :component-2}
 {:cost 8.0, :label :component-2}
 {:cost 75.0, :label :component-3}
 {:cost 120.0, :label :component-4})

You can also project multiple nodes along the path, and optionally rename properties in case of conflicts, like the :label property here.

(-> (query/path [p [:product] c [p :component]])
  (query/project ['p [:label :as :product-label]] ['c :label :cost]))

; result:
({:cost 24.0, :label :component-1, :product-label :alpha}
 {:cost 8.0, :label :component-2, :product-label :alpha}
 {:cost 8.0, :label :component-2, :product-label :beta}
 {:cost 75.0, :label :component-3, :product-label :beta}
 {:cost 120.0, :label :component-4, :product-label :beta})

Filter using the where clause

; Select the components that would cost more than $500 if using a 5x markup
(-> (query/path [component [:product :component]])
    (query/where (> (* 5 (:cost 'component)) 500))
    (query/project ['component :label :cost]))

; result:
({:cost 120.0, :label :component-4})

Note that the where form supports arbitrary expressions involving node properties. All arithmetic functions, boolean logic, bit operations and many java.lang.Math functions (e.g. trig, log, etc..) are supported and its easy to add more.

The where filter doesn’t always have to apply to the last node of a path expression. Here is a query that will select products with components made in Geneva.

(->
  (query/path [product   [:product]
               component [product :component]
               location  [component :made-in]])
  (query/where (= (:label 'location) :geneva-office))
  (query/distinct* 'product)
  (query/project ['product :label]))

; result:
({:label :beta})

Sort with order-by, limit, and choose results

; Get the most expensive product
(-> (query/path [product [:product]])
  (query/order-by 'product :price :desc)
  (query/project ['product :label :price])
  (query/limit 1))

; result:
({:price 600.0, :label :beta})

The choose operator can be used to select a random subset of the results.

; Get a random manager node
(-> (query/path [:location :managed-by])
    (query/choose 1))

; result:
({:id "UUID:8f53c132-4f7e-4ef9-a192-2ff069d4ac61"})

and more…

In addition you can compute aggregate values over specific properties using query/min, query/max, query/avg, and query/count*. There is also a basic graph construction API to make it easier to store sub-graphs programmatically. Checkout query.clj on github for examples.

Distributing Graph Queries

The Plasma engine has also been designed to support distributed queries, where a path expression can cross from one graph to another on the local machine, or from one machine to another over the network. In a future post I’ll show an example graph that contains proxy nodes, which act as bridges to remote graphs. Distributed path expressions can be used not only to query large graph databases spread across many servers, but they can also be the basis for a new kind of distributed system. I’m not sure what the right role for such a distributed graph network is yet, but I think it has some interesting possibilities for supporting a peer-to-peer infrastructure.

Overtone demo video

Sam put together a great demo video of live-coding with Overtone. It shows both programming real-time synthesizers and also making music in about 4 minutes. (His pimped out emacs config is also getting a bit of attention…)

Quick Intro to Live Programming with Overtone

For more join the hacker news discussion or checkout the project on github

This is the future of music. Well, one future anyways :-)

Debugging Clojure

I’ve been debugging some clojure code lately so I thought I’d gather some information here for others who might be doing the same. First, you should know about a couple tools that are available:

  • Clojure Debug Toolkit

    which attaches to a running JVM, and then lets you set breakpoints and evaluate code in the scope of a suspended stack frame

  • JSwat Debugger

    a full fledged Java debugger that lets you set breakpoints in Java and/or Clojure, view bytecode, variables, stacks, etc.

    use with Clojure described here

While I’m happy these tools are available, they don’t really jive with my development style so I’ve been looking for other strategies. My day-to-day setup is inside of gvim using the vimclojure plugin, and normally when I’m debugging it is in the midst of developing some code. As I grow my program I’m writing functions, interactively testing them on the repl, and occasionally adding logging when I need to follow more complicated program flow. So having to restart the JVM with debug options or connect externally with a separate program is too heavyweight. In fact, I’m looking for lighter-weight debug strategies, not heavier weight.

clojure.contrib.trace

One great tool I recently started using is clojure.contrib.trace. If you don’t use it, I recommend taking it out for a spin. In short, it lets you follow the program flow by printing input arguments and return values as functions get called.

Here, give it a try. Paste this implementation of the Collatz conjecture into the repl. (Inspired by the Aleph pipeline docs…) Collatz conjectured that if you take any number and then recursively divide by two when even or multiply by three and add one if it is odd, you will always arrive at 1. Seems pretty useless, although it is interesting to think about the sort of expansion and compression phases of the number as it makes its way to zero. I wonder if there are any natural processes that have a similar form? Anyway, we were tracing.

(use 'clojure.contrib.trace)

(defn up
  [v]
  (+ 1 (* 3 v)))

(defn down
  [v]
  (/ v 2))

(defn collatz
  [v]
  (cond
    (= 1 v) 1
    (even? v) (recur (down v))
    :else (recur (up v))))

Now instead of inserting println statements to follow the program flow, trace it like this:

collatz.core=> (dotrace [up down] (collatz 8))
TRACE t1012: (down 8)
TRACE t1012: => 4
TRACE t1013: (down 4)
TRACE t1013: => 2
TRACE t1014: (down 2)
TRACE t1014: => 1

Form debugging macro

Another helpful little tool I ran into somewhere is this debug macro:

(defmacro dbg
  [x]
  `(let [x# ~x] (println "dbg:" '~x "=" x#) x#))

It lets you easily wrap a form in your code, and then print the form and it’s result whenever it gets evaluated. For example, redefine the down function like this:

(defn down [v]
  (dbg (/ v 2)))

And now when run on the repl it will produce:

collatz.core=> (collatz 8)
dbg: (/ v 2) = 4
dbg: (/ v 2) = 2
dbg: (/ v 2) = 1
1
collatz.core=>

Tracking down wild threads

For one really frustrating bug I had recently in the Plasma query engine, I had a thread that was running wild and I couldn’t figure out either what thread it was or where it was hanging up. For tracking down threads I’ve got a couple tools for you. First, I found this top threads jconsole plugin very helpful. It lets you see all the running threads sorted by CPU usage like unix top, and if you select a thread you can view the current stackframe.

While that is definitely a nice tool, again I wanted something quick and easy to run on the repl. After looking into the JVM APIs used to get at this info, I put together a couple functions for my user.clj:

(import 'java.lang.management.ManagementFactory)

(defn threads
  "Get a seq of the current threads."
  []
  (let [grp (.getThreadGroup (Thread/currentThread))
        cnt (.activeCount grp)
        ary (make-array Thread cnt)]
    (.enumerate grp ary)
    (seq ary)))

(defn thread-grep
  "Find a thread by name."
  [name]
  (let [ptn (re-pattern name)]
    (filter #(re-find ptn (.getName %)) (threads))))

(defn get-thread
  "Lookup a thread by its numeric ID."
  [id]
  (first (filter #(= id (.getId %)) (threads))))

(defn thread-top
  "Return a seq of threads sorted by their total userland CPU usage."
  []
  (let [mgr (ManagementFactory/getThreadMXBean)
        cpu-times (map (fn [t]
                         [(.getThreadCpuTime mgr (.getId t)) t])
                    (threads))]
    (map
      (fn [[cpu t]] [cpu (.getName t) (.getId t) t])
      (reverse (sort-by first cpu-times)))))

This way you can quickly see which thread is pegging the cpu, or lookup a thread if you’ve assigned it a name. I haven’t added any helpers yet, but if you inspect the thread object with clojure.contrib.repl-utils/show you’ll see the methods you can use to view the current stacktrace or (danger!) stop the thread. Might be nice to get some tools like this into contrib.