Tuesday, December 22, 2009

Using GIT as a hierarchical database

A quick look into the 'plumbing'' of git (the Pro Git book is good) reveals that it can easily store (versioned) data that is not based on the contents of files checked into the repository. This means that it can be used as a database that is based on trees rather than tables or key:value pairs.

Git stores four kinds of objects: BLOBs, trees, commits, and tags. A BLOB is raw data : it is identified by the SHA1 digest of its contents, so the same data will never be stored twice. A tree is a hierarchy: it can contain BLOBs or other tree objects. A commit object is basically a snapshot of a hierarchy (tree) at a point in time, and a tag assigns a name to a BLOB, tree, or commit.

In terms of using git as a database, these components can be considered as follows:

BLOB: raw data (values).
Tree: Parent-child relations between data, and a mapping of a database entries with a raw value.
Commit: A version of the data.
Tag: A label, e.g. the name of a table, or a name for a particular version of the data.

Consider the expression 'x * y + 2'. It can be expressed as a tree:

+(* ( x, y), 2)

Another way to express this is in a filesystem-like tree:

/root/
/root/type # 'operator'
/root/value # '+'
/root/left/type # 'operator'
/root/left/value # '*'
/root/left/left/
/root/left/left/type # 'variable'
/root/left/left/value # 'x'
/root/left/right/
/root/left/right/type # 'variable'
/root/left/right/value # 'y'
/root/right/
/root/right/type # 'integer'
/root/right/value # '2'
 

In this schema, entries named 'left' and 'right' are trees, while entries named 'type' and 'value' are BLOBS. Each tree contains at least a type and value entry, with left and right entries existing only if children are present.

To begin, create a Git repo:

# mkdir /tmp/expr
# cd /tmp/expr
# git init

Next, enter the raw data into git using git-hash-object:

# echo 'operator' | git-hash-object -w --stdin
e196e4b5d49f41231b184a124b3ff52bb00ef9f6
# echo 'variable' | git-hash-object -w --stdin
6437a1374bfa97cfdac6611fc5d2d650abaf782b
# echo 'integer' | git-hash-object -w --stdin
82539ed1cd5cbb2c26a500a228a865d8fbbbcd02
# echo '*' | git-hash-object -w --stdin
72e8ffc0db8aad71a934dd11e5968bd5109e54b4
# echo '+' | git-hash-object -w --stdin
fd38861632cb2360cb5acb6253e7e281647f034a
# echo 'x' | git-hash-object -w --stdin
587be6b4c3f93f93c489c0111bba5596147a26cb
# echo 'y' | git-hash-object -w --stdin
975fbec8256d3e8a3797e7a3611380f27c49f4ac
# echo '2' | git-hash-object -w --stdin
0cfbf08886fca9a91cb753ec8734c84fcbe52c9f

Note that git-hash-object can be run without the -w (write) option in order to generate the SHA1 digest of the data without adding it to git.

The tree hierarchy can be created with git-update-index:

# # Root: Operator '+'
#git-update-index --add --cacheinfo 100644 e196e4b5d49f41231b184a124b3ff52bb00ef9f6 'root/type'
# git-update-index --add --cacheinfo 100644 fd38861632cb2360cb5acb6253e7e281647f034a 'root/value'

# # Root Left Child: Operator '*'
# git-update-index --add --cacheinfo 100644 e196e4b5d49f41231b184a124b3ff52bb00ef9f6 'root/left/type'
# git-update-index --add --cacheinfo 100644 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 'root/left/value'

# # * Left Child: Variable 'x'
# git-update-index --add --cacheinfo 1006446437a1374bfa97cfdac6611fc5d2d650abaf782b 'root/left/left/type'
# git-update-index --add --cacheinfo 100644 587be6b4c3f93f93c489c0111bba5596147a26cb 'root/left/left/value'

# # * Right Child: Variable 'y'
# git-update-index --add --cacheinfo 1006446437a1374bfa97cfdac6611fc5d2d650abaf782b 'root/left/right/type'
# git-update-index --add --cacheinfo 100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 'root/left/right/value'

# # Root Right Child: Integer '2'
# git-update-index --add --cacheinfo 100644 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 'root/right/type'
# git-update-index --add --cacheinfo 100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 'root/right/value'

# git-write-tree
7fe6589700060de813d529c48937b7678c297d42
# git-ls-tree 7fe6589700060de813d529c48937b7678c297d42
040000 tree b430c300c59dd80764892644c637ba6e29785c09 root
# git-ls-tree -r 7fe6589700060de813d529c48937b7678c297d42
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb root/left/left/value
100644 blob 975fbec8256d3e8a3797e7a3611380f27c49f4ac root/left/right/value
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 root/left/type
100644 blob 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 root/left/value
100644 blob 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 root/right/type
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f root/right/value
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 root/type
100644 blob fd38861632cb2360cb5acb6253e7e281647f034a root/value
# git-ls-files
root/left/left/value
root/left/right/value
root/left/type
root/left/value
root/right/type
root/type
root/value
# git-ls-files --stage \*/value | cut -d ' ' -f 2 | git-cat-file --batch
587be6b4c3f93f93c489c0111bba5596147a26cb blob 2
x

975fbec8256d3e8a3797e7a3611380f27c49f4ac blob 2
y

72e8ffc0db8aad71a934dd11e5968bd5109e54b4 blob 2
*

0cfbf08886fca9a91cb753ec8734c84fcbe52c9f blob 2
2

fd38861632cb2360cb5acb6253e7e281647f034a blob 2
+

#
git-ls-files --stage \*/value | cut -d ' ' -f 2 | xargs git-show
x
y
*
2
+

An ls of the repository will show only the .git directory: the data is stored as BLOBs and paths in the GIT database, not as a directory tree on the filesystem.

In the first attempt, git-ls-files returns the database entries lexically ordered by path. A more database-like approach would be to store the nodes of the expression in a single directory (i.e. a table) and linking them via parent and child references. This highlights a small flaw in the plan: it is not possible to get the SHA1 of a tree without writing the index to disk.


# # Remove previous attempt
# rm -r .git
# git init

# # Node 1 Operator '+'

# git-update-index --add --cacheinfo 100644 `echo '+' | git-hash_object -w --stdin` 'nodes/1/value'
# git-update-index --add --cacheinfo 100644 `echo 'operator' | git-hash_object -w --stdin` 'nodes/1/type'

# # Node 2: Operator '*'
# git-update-index --add --cacheinfo 100644 `echo '*' | git-hash-object -w --stdin` 'nodes/2/value'
# git-update-index --add --cacheinfo 100644 `echo 'operator' | git-hash-object -w --stdin` 'nodes/2/type'

# # Node 3: Variable 'x'
# git-update-index --add --cacheinfo 100644 `echo 'x' | git-hash-object -w --stdin` 'nodes/3/value'
# git-update-index --add --cacheinfo 100644 `echo 'variable' | git-hash-object -w --stdin` 'nodes/3/type'

# # Node 4: Variable 'y'
# git-update-index --add --cacheinfo 100644 `echo 'y' | git-hash-object -w --stdin` 'nodes/4/value'
# git-update-index --add --cacheinfo 100644 `echo 'variable' | git-hash-object -w --stdin` 'nodes/4/type'

# # Node 5: Integer '2'
# git-update-index --add --cacheinfo 100644 `echo '2' | git-hash-object -w --stdin` 'nodes/5/value'
# git-update-index --add --cacheinfo 100644 `echo 'integer' | git-hash-object -w --stdin` 'nodes/5/type'

# git write-tree
d986912eb66eb446ddaaa188a5cc1068c8cf951b
# git ls-tree `git ls-tree d986912eb66eb446ddaaa188a5cc1068c8cf951b | grep nodes | awk '{print $3}'`
040000 tree 945999b7c15c9b745333847bf3d341b7f74a784f 1
040000 tree 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 2
040000 tree 9a4b76bbc0b9a20185333ca44321ba438eb6d71c 3
040000 tree f556930b3cd058453894e0cc386fe6ca10908abd 4
040000 tree 36fde2340dd25814793c1346ebbb0175049665dd 5

# # Set Node 1 children
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/1/left'
# git-update-index --add --cacheinfo 100644 36fde2340dd25814793c1346ebbb0175049665dd 'nodes/1/right'

# # Set Node 2 parent and children
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/2/parent'
# git-update-index --add --cacheinfo 100644 f556930b3cd058453894e0cc386fe6ca10908abd 'nodes/2/left'
# git-update-index --add --cacheinfo 100644 f556930b3cd058453894e0cc386fe6ca10908abd 'nodes/2/right'

# # Set Node 3 parent
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/3/parent'

# # Set Node 4 parent
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/4/parent'

# # Set Node 5 parent
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/5/parent'

# git-write-tree
e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
# git ls-tree -r e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/1/left
100644 blob 36fde2340dd25814793c1346ebbb0175049665dd nodes/1/right
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 nodes/1/type
100644 blob fd38861632cb2360cb5acb6253e7e281647f034a nodes/1/value
100644 blob f556930b3cd058453894e0cc386fe6ca10908abd nodes/2/left
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/2/parent
100644 blob f556930b3cd058453894e0cc386fe6ca10908abd nodes/2/right
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 nodes/2/type
100644 blob 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 nodes/2/value
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/3/parent
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b nodes/3/type
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb nodes/3/value
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/4/parent
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b nodes/4/type
100644 blob 975fbec8256d3e8a3797e7a3611380f27c49f4ac nodes/4/value
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/5/parent
100644 blob 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 nodes/5/type
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f nodes/5/value


Of course, pulling the contents of these files from the DB results in a list of expression tokens ordered by node number:

# git ls-files --stage '*/value' | awk '{print $2}' | xargs git-show
+
*
x
y
2

While correct, this is not useful for building an infix expression. A new tree is needed that acts as an index for the first tree, so that listing the new tree will return the tokens in infix order. This is accomplished by using the names 'left', 'operator', and 'right' for node contents, so that lexical ordering of the index-tree will return tokens in that order.


# # Root: Operator '+'
# git-update-index --add --cacheinfo 100644 `echo '+' | git-hash-object -w --stdin` 'infix/operator'

# # Root Left Child: Operator '*'
# git-update-index --add --cacheinfo 100644 `echo '*' | git-hash-object -w --stdin` 'infix/left/operator'

# # * Left Child: Variable 'x'
# git-update-index --add --cacheinfo 100644 `echo 'x' | git-hash-object -w --stdin` 'infix/left/left/variable'

# # * Right Child: Variable 'y'
# git-update-index --add --cacheinfo 100644 `echo 'y' | git-hash-object -w --stdin` 'infix/left/right/variable'

# # Root Right Child: Integer '2'
# git-update-index --add --cacheinfo 100644 `echo '2' | git-hash-object -w --stdin` 'infix/right/value'

# git-ls-files --stage infix/\* | cut -d ' ' -f 2 | xargs git-show | awk '{printf $0 " "}'
x * y + 2


Here, the 'infix' tree re-creates the hierarchy of the original expression, much as the first attempt did -- only this time, the names of the files and directories are chosen to force an infix ordering when sorted by name. Only the files containing the values to display are included in this tree, and these map to the same SHA1 digest as those in the 'nodes' tree. Note that git only stores one copy of each BLOB, so the infix tree only uses as much space as is required to represent the tree: the value BLOBs are shared with the original table.

Listing of subtrees correctly returns sub-expressions:

# git-ls-files --stage infix/left/\* | cut -d ' ' -f 2 | xargs git-show | awk '{printf $0 " "}'
x * y

Once the data is has been created, the database can be committed, creating a base version:

# echo 'Node database created' | git commit-tree e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e

Searching the database relies on git-grep, using the --cached option to search the BLOBs instead of the filesystem. The search may be limited to specific paths:

# git-grep --cached 'operator' -- root 
root/left/type:operator
root/type:operator
# git-grep --cached '*' -- expr
expr/left/operator:*
# git-grep --cached 'x' -- '*/variable'
expr/left/left/variable:x
# git-grep --cached '*''*/left/*'
expr/left/operator:*
root/left/value:*

Removing entries involves removing the index entries, not the BLOB:

# git-update-index --force-remove 'infix/operator'
# git-ls-files --stage infix\*
100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0 infix/left/left/variable
100644 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 0 infix/left/operator
100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 0 infix/left/right/variable
100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 0 infix/right/value

This means that the values stay in the database, so that changes can be reversed. The actual 'database' therefore is nothing but a collection of named links to BLOBs. Searching and browsing of the database is done by operating on paths, the 'live' data, while inserting, updating, and deleting database entries requires changing what blobs the paths point to.



It goes without saying that the git commands provided above can be wrapped with shell functions (or Ruby, or Python, or any other scripting language) that create database entries, populate indexes, and commit changes to the data. With a small amount of work, a reasonable version-controlled hierarchical database can be put together, all contained in the .git dir.

More complex projects could mix files managed by git and the above tree-only approach; this is especially useful when storing binary content, which is awkward to deal with when stored in BLOBs (generally requires unzipping). In such cases, finding the top level (root) of the Git repository is useful, and git makes this easy as well:

# mkdir stuff
# cd stuff
# git-rev-parse --git-dir
/tmp/expr/.git

9 comments:

  1. This is a fascinating concept. Thanks for sharing. Are you aware of any project that actually uses this idea and wraps real world code around it?

    ReplyDelete
  2. This grew out of a project I was working on last year, where one of the feature requests was to make the backend database "versioned, like Git".

    Implementing that in the DB schema proved to be far more trouble than it was worth. One day I turned the problem on its head, deciding to use Git as the database (instead of using the database as Git).

    Unfortunately, I did not have time to finish the project before real-world work kicked into high gear. Perhaps I'll finish it this summer.

    ReplyDelete
  3. Did you ever make any headway on this? I'm drawn to the same solution ; have some heinous requirements involving the distributed versioning of objects sets with counts in the ~10M region, and really don't want to map one object to one file because that will kill NTFS stone dead. I've been considering how to get this done by implementing a Key/Value storage backend for git, perhaps starting with the jgit sources. Then the KV store can be both the "working tree" and the git storage.

    ReplyDelete
  4. I have a working implementation based on Grit (a ruby half-implementation of Git).

    Basically, I extended the Grit Index and Repo objects, then created mixins (what in Java would be additional base classes) for Repo-backed objects. The Repo-backed objects generate paths for themselves and their properties; each class determines whether itself and its properties are on-fs or in-db.

    Object idents therefore boil down to fs-style paths that are fed to the staging index on the repo. This makes it easy to do lightweight objects, where the object as instantiated just stores its path, then reads to/writes from the index when accessors are called.

    I haven't stress-tested things yet. This is a back-burner project, and I want to get it functional before fast - if Git turns out to be too slow as a backend, I'll return to using SQLite.

    So far, so good though -- the wrappers make it easy to create new classes, the path-based idents limit the redunancy, and I've got version control on creation of/changes to all in-db objects.

    The Git approach may not work well for simple key-value pairs, though. The issue is that Git stores the path to each object as an object in its db.

    If your object tree ends up looking like this:

    objects/key1 = "value1"
    objects/key1 = "value2"
    objects/key3 = "value1"

    ...you don't save a lot, because 3 tree objects and 2 BLOB objects are created as files under .git.

    If you are using key-value pairs in a file to represent properties of an object, and many of the objects have identical properties, then you will reduce the number of on-disk files. For example,

    objects/type_a/ident1 = "p1 = 0; p2 = 1"
    objects/type_a/ident2 = "p1 = 1; p2 = 1"
    objects/type_a/ident3 = "p1 = 0; p2 = 1"
    objects/type_a/ident4 = "p1 = 0; p2 = 1"
    objects/type_b/ident1 = "name=this; val=0"
    objects/type_b/ident2 => "name=that; val=1"

    ...will create 6 tree objects and 4 BLOB objects.

    Where the savings come is when there is redundancy among the subtrees:

    /objects/type_a/ident_1/date = '1/1/10'
    /objects/type_a/ident_1/order_details -> /objects/orders/1
    /objects/type_a/ident_2/date = '2/2/10'
    /objects/type_a/ident_2/order_details -> /objects/orders/1
    /objects/type_b/ident_1/date = '3/3/10'
    /objects/type_b/ident_1/order_details -> /objects/orders/2
    /objects/type_b/ident_2/date = '3/3/10'
    /objects/type_b/ident_2/order_details -> /objects/orders/1
    /objects/orders/1/customer = "somebody"
    /objects/orders/1/address = "101 main st, ..."
    /objects/orders/1/items/item1 = "nuts"
    /objects/orders/1/items/item2 = "bolts"
    /objects/orders/2/customer = "nobody"
    /objects/orders/2/address = "nowhere"
    /objects/orders/2/items/item1 = "nachos"

    Here, there are 4 tree objects and 4 BLOB objects created for orders/1. Each time orders/1 is referenced by an object, a single tree object is added that points to it. Compare this to 4 tree objects and 4 BLOB objects being created if that information was "inline", or 4 files if that information was stored as files on the fs without using Git. Basically the cost is 8 + n objects versus 4 * n objects for n copies of orders/1, so at 3 copies it starts to provide savings.

    ReplyDelete
  5. Ended up having to generalize the proof-of-concept implementation. GitDB (renamed to GitDS as there is already a GitDB gem apparently) is now on Rubyforge:

    http://rubyforge.org/projects/git-db/

    ReplyDelete
  6. any new updates on this?

    I want a versioned database that supports diffs and querying and of course is performant... the querying could even be a little weak.

    I've been pulling out my hair trying to find and existing solution.

    ReplyDelete
  7. > any new updates on this?

    I have been using the GitDS gem mentioned above on a research project.

    I had to add a few tweaks (e.g. bulk-inserts) for performance, and querying is strictly by-path -- so it may not be what you're looking for. I have found it quite workable, but I am not using it in production code.

    https://rubygems.org/gems/git-ds/

    ReplyDelete
  8. It seems it is just what I need.
    I need to store a hierarchy. Lets say ACME Corporation is subdivided into Divisions, some Divisions organize themselves into Regions, but some other may use Product-family hierarchy. Regions may subdivide into Countries, and these into Departments (or something else).
    I need to be able to tag some resources with the group within the hierarchy that they belong to. But the hierarchy is not static. We expect 20% of it to change over a year. Some changes may be simple as renaming the group: "USA" -> United States, but other may be merging of moving a sub-hierarchy under another parent.
    We may have search associated with some resources based on the hierarchy, but we also want to be able to return results from older versions of the hierarchy.

    ReplyDelete
    Replies
    1. That should work. Because Git separates trees and blobs, you can rearrange the tree structure pretty easily (ala `git move`) while leaving the data (BLOBs) intact. If you are indexing the data, you would need to update the index every time you change the tree. Searching previous versions is easiest if you have a tag or a branch as a reference, so use tags to identify official versions of the database. Contact me via the git-ds project if you run into problems.

      Delete