Semi-structured data and P2P graph databases
In a previous post I introduced the Plasma graph query engine that I’ve been working on as part of my thesis project. With Plasma you can declaratively define queries and evaluate them against a graph database. The heart of the system is a library of dataflow query operators, and on top of them sits a fairly simple query “language”. (I put it in quotes because in a Lisp-based language like Clojure the line between a mini-language and an API gets blurry.) In this post I’ll write a bit about why I think graph databases could be an interesting foundation for next-generation P2P networks. Actual examples of distributed graph queries in Plasma will follow in the next post. First, though, I think it is important to motivate the use of a graph database. While most of the marketing speak on the web regarding graph databases is about representing social network data, this is just one of many potential applications.
Semi-structured data
Graph databases are interesting because they allow you to represent semi-structured data. For some of the original insight that led to this concept I recommend checking out Querying Semi-Structured Data by Serge Abiteboul. In short, he argues that there are many instances where a well-defined schema cannot be known in advance, and that semi-structured data has an irregular structure that is implicitly defined by the data itself. Imagine, for example, that you would like to create a next-generation wiki that allows you to organize and relate information in arbitrary ways. Often a wiki will start with little to no structure, but over time you would like to regularize and structure parts of the data to make it searchable, sortable, and queryable. In a relational system this kind of evolution with the data is painful to impossible. Migrating from one schema to the next can be laborious and difficult, and often you only know that you need to store a new kind of data at the moment the data is available. If at each such moment you have to redesign the schema and create a migration, it will severely limit the evolution of the database structure. This is where I think graph databases will shine in the future. Think of them more like a file-system++, where you can gradually add more and more structure to your data, and at any time you can create new kinds of relations or start storing new types of data. As we form richer connections to each other over the Internet, I think this kind of structural information will become even more important.
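To make the wiki example a little more concrete, here is a minimal sketch in Python of a toy in-memory property graph. It is only an illustration of the idea, not how Plasma or any real graph database is implemented; the `Graph` class, the node ids, and the edge labels are all made up for this example. The point is that new properties and new kinds of relations can be added at any time without a schema migration.

```python
from collections import defaultdict


class Graph:
    """Toy in-memory property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # (source id, label) -> target ids

    def add_node(self, node_id, **props):
        # Creating a node and adding properties to an existing one are the
        # same operation: there is no fixed set of columns to migrate.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, src, label, dst):
        # Any label can be used, so new kinds of relations need no setup.
        self.edges[(src, label)].append(dst)

    def neighbors(self, src, label):
        return list(self.edges.get((src, label), []))


def with_property(graph, key):
    """Return the ids of all nodes that carry the given property."""
    return sorted(n for n, props in graph.nodes.items() if key in props)


# Day one: the wiki is just pages that link to each other.
wiki = Graph()
wiki.add_node("page:plasma", title="Plasma")
wiki.add_node("page:clojure", title="Clojure")
wiki.add_edge("page:plasma", "links-to", "page:clojure")

# Later: some pages gain more structure, and a new relation type appears.
wiki.add_node("page:plasma", kind="project", started=2011)
wiki.add_node("paper:qssd", title="Querying Semi-Structured Data")
wiki.add_edge("page:plasma", "cites", "paper:qssd")
```

Only the pages that have been given a `kind` show up in `with_property(wiki, "kind")`; the rest of the wiki stays as loosely structured as it was on day one.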
A data driven P2P substrate
Currently P2P applications tend to operate in isolation. BitTorrent, Napster, Gnutella, eDonkey and the other file sharing systems each live(d) in their own walled garden. There are many advantages, though, to a system that provides a common substrate on top of which a variety of P2P apps can be built. Besides the ability to make better use of network and compute resources, the major advantage is with regard to the data. If P2P apps and algorithms can query semi-structured data on a peer, then we can create evolvable, content-driven P2P networks. For example, peers can be clustered based on common interests, and bits of data from many peers can be gathered to create new views and new applications on existing data. In some ways this is similar to what the semantic web is attempting to do: represent data in a machine-readable fashion so we can create software to perform automated “reasoning” across the web. I just don’t think any kind of top-down, global ontology is realistic or practical. In the long term we can agree to use some common schemas to better share data, and for this the semantic web efforts will be very useful, but in the meantime I think most data is and will be semi-structured. Every community, application and user will have their own ways of thinking about and relating information, and this needs to be taken into account. Furthermore, I want to live in a world where you have full control over your own data, and everything you care about can be local.
The cloud is great for many applications, especially big-data processing, but it also has many disadvantages that people tend to gloss over. For one, the speed of light is not going to increase with Moore’s law. No matter what we do, the distance to the datacenter puts a hard lower bound on the latency of accessing data in cloud-based systems. In many instances this doesn’t matter, but for creating collaborative applications where we are editing media or hacking on code, I don’t think it will cut it. (Sure, some companies will have the resources to distribute replicas to nearby datacenters around the world, but do we want to limit ourselves to a few applications only offered by cloud-based mega corps?) More importantly, there are serious privacy and censorship problems with cloud-based systems. I tend to trust Google (and Facebook slightly less), but I still don’t like the idea that they have access to all of my personal communication and social network in plaintext. In the west these issues are currently more of a moral/intellectual conundrum, but for citizens of Iran, China, Egypt or Libya, this is a far more important issue that can literally mean prison or death. On the order of a fifth to a quarter of the world’s population cannot safely access social networks. So whether it be for functional, moral or political reasons, I think there is still a huge and interesting space of fully distributed, P2P applications that has yet to arise.
P2P social networking
Think about a P2P rather than web-based social network. Facebook (or Google+, which I’ve yet to get access to) would be even better if your whole social graph and all the data it entailed sat encrypted on your hard drive. Everything would be always available, low latency, and accessible in a way that lets users think of new ways to make use of their data, rather than waiting around for a new feed or API to get at images, updates, preferences, birthdays, or whatever you care about. You could make friends off the grid by sharing keys with people you meet, and whenever you hop online your local data would be synced with other peers. Cloud-based proxy peers could be used to improve availability, but the primary source of data would always be at your fingertips. On top of such a system, user communities could decide to store all kinds of interesting information pertinent to their interests, and they could make use of it in the ways they choose. The Clojure community, for example, could store code snippets, git repositories, interesting papers, book reviews, and hammock designs in their local data stores. People could then write new “apps” by issuing graph queries against their peers’ data and creating new views on top of it. How does someone’s reading list correspond to their code? How about popping up an ad-hoc chat room with everyone currently using clojure.contrib.probabilities.monte-carlo to ask for help or find some example code? This is where I think the future of P2P could and should go. It is far more resistant to failure, difficult to impossible for governments and companies to censor and control, and much more empowering and interesting for the user.
Distributed Plasma
All that said, Plasma is a first experiment in trying to sow the seeds for such a platform. The idea is that each peer will store their data in a local graph database, and then they can choose what data to make available to the P2P network. Applications will access data on the network by issuing graph queries, which can transparently cross network boundaries to gather data in a declarative way, freeing the developer from the pain and suffering of network programming against a churning mass of peers. This is very much an experimental research project, and I can already see a number of cases where graph queries don’t cut it, but I think it does provide a lot of desired functionality. This post is long enough for today though, so if you got this far tune in tomorrow for some actual distributed graph query examples.
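In the meantime, here is a rough Python sketch of the general idea of a query that crosses peer boundaries. To be clear, this is not Plasma’s API or implementation: the `Peer` class, the `shared` set, and the convention that an edge target written as a `(peer name, node)` tuple acts as a proxy for a remote node are all assumptions made up for this illustration, and the “network” is just a dictionary in a single process.

```python
class Peer:
    """Toy peer holding a local graph (hypothetical, for illustration only)."""

    def __init__(self, name, network):
        self.name = name
        self.network = network  # shared dict: peer name -> Peer
        self.edges = {}         # (node, label) -> list of targets
        self.shared = set()     # node ids this peer chooses to expose
        network[name] = self

    def add_edge(self, src, label, dst):
        # dst is either a local node id or a (peer name, node id) proxy.
        self.edges.setdefault((src, label), []).append(dst)

    def query(self, start, path, remote=False):
        """Follow the edge labels in `path` from `start`.

        Returns a list of (peer name, node id) pairs for the end nodes.
        Queries arriving from another peer only see shared nodes.
        """
        if remote and start not in self.shared:
            return []  # unshared data stays private
        if not path:
            return [(self.name, start)]
        results = []
        for target in self.edges.get((start, path[0]), []):
            if isinstance(target, tuple):
                # Proxy edge: hand the rest of the path to the remote peer.
                peer_name, node = target
                peer = self.network[peer_name]
                results += peer.query(node, path[1:], remote=True)
            else:
                results += self.query(target, path[1:], remote)
        return results


network = {}
alice = Peer("alice", network)
bob = Peer("bob", network)

alice.add_edge("me", "friend", ("bob", "me"))
bob.add_edge("me", "reading", "book:sicp")
bob.add_edge("me", "reading", "book:diary")
bob.shared.update({"me", "book:sicp"})  # the diary is not shared

friends_reading = alice.query("me", ["friend", "reading"])
```

From Alice’s point of view the query is just a path of edge labels; the hop over to Bob happens inside `query`, and Bob’s unshared node never leaves his peer. A real system would of course have to deal with latency, churn, and failures, which this sketch ignores entirely.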
Jul 04 2011
