Tuesday, December 22, 2009

Using GIT as a hierarchical database

A quick look into the 'plumbing' of git (the Pro Git book is good) reveals that it can easily store (versioned) data that is not based on the contents of files checked into the repository. This means that it can be used as a database that is based on trees rather than tables or key:value pairs.

Git stores four kinds of objects: BLOBs, trees, commits, and tags. A BLOB is raw data: it is identified by the SHA1 digest of its contents, so the same data will never be stored twice. A tree is a hierarchy: it can contain BLOBs or other tree objects. A commit object is basically a snapshot of a hierarchy (tree) at a point in time, and a tag assigns a name to a BLOB, tree, or commit.

In terms of using git as a database, these components can be considered as follows:

BLOB: raw data (values).
Tree: Parent-child relations between data, and a mapping of database entries to raw values.
Commit: A version of the data.
Tag: A label, e.g. the name of a table, or a name for a particular version of the data.

Consider the expression 'x * y + 2'. It can be expressed as a tree:

+(*(x, y), 2)

Another way to express this is in a filesystem-like tree:

/root/
/root/type # 'operator'
/root/value # '+'
/root/left/type # 'operator'
/root/left/value # '*'
/root/left/left/
/root/left/left/type # 'variable'
/root/left/left/value # 'x'
/root/left/right/
/root/left/right/type # 'variable'
/root/left/right/value # 'y'
/root/right/
/root/right/type # 'integer'
/root/right/value # '2'
 

In this schema, entries named 'left' and 'right' are trees, while entries named 'type' and 'value' are BLOBs. Each tree contains at least a type and a value entry, with left and right entries existing only if children are present.

To begin, create a Git repo:

# mkdir /tmp/expr
# cd /tmp/expr
# git init

Next, enter the raw data into git using git-hash-object:

# echo 'operator' | git-hash-object -w --stdin
e196e4b5d49f41231b184a124b3ff52bb00ef9f6
# echo 'variable' | git-hash-object -w --stdin
6437a1374bfa97cfdac6611fc5d2d650abaf782b
# echo 'integer' | git-hash-object -w --stdin
82539ed1cd5cbb2c26a500a228a865d8fbbbcd02
# echo '*' | git-hash-object -w --stdin
72e8ffc0db8aad71a934dd11e5968bd5109e54b4
# echo '+' | git-hash-object -w --stdin
fd38861632cb2360cb5acb6253e7e281647f034a
# echo 'x' | git-hash-object -w --stdin
587be6b4c3f93f93c489c0111bba5596147a26cb
# echo 'y' | git-hash-object -w --stdin
975fbec8256d3e8a3797e7a3611380f27c49f4ac
# echo '2' | git-hash-object -w --stdin
0cfbf08886fca9a91cb753ec8734c84fcbe52c9f

Note that git-hash-object can be run without the -w (write) option in order to generate the SHA1 digest of the data without adding it to git.
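The digests can also be used to read the data back out of the object database, via git-cat-file:

# git-cat-file blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6
operator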

The tree hierarchy can be created with git-update-index:

# # Root: Operator '+'
# git-update-index --add --cacheinfo 100644 e196e4b5d49f41231b184a124b3ff52bb00ef9f6 'root/type'
# git-update-index --add --cacheinfo 100644 fd38861632cb2360cb5acb6253e7e281647f034a 'root/value'

# # Root Left Child: Operator '*'
# git-update-index --add --cacheinfo 100644 e196e4b5d49f41231b184a124b3ff52bb00ef9f6 'root/left/type'
# git-update-index --add --cacheinfo 100644 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 'root/left/value'

# # * Left Child: Variable 'x'
# git-update-index --add --cacheinfo 100644 6437a1374bfa97cfdac6611fc5d2d650abaf782b 'root/left/left/type'
# git-update-index --add --cacheinfo 100644 587be6b4c3f93f93c489c0111bba5596147a26cb 'root/left/left/value'

# # * Right Child: Variable 'y'
# git-update-index --add --cacheinfo 100644 6437a1374bfa97cfdac6611fc5d2d650abaf782b 'root/left/right/type'
# git-update-index --add --cacheinfo 100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 'root/left/right/value'

# # Root Right Child: Integer '2'
# git-update-index --add --cacheinfo 100644 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 'root/right/type'
# git-update-index --add --cacheinfo 100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 'root/right/value'

# git-write-tree
7fe6589700060de813d529c48937b7678c297d42
# git-ls-tree 7fe6589700060de813d529c48937b7678c297d42
040000 tree b430c300c59dd80764892644c637ba6e29785c09 root
# git-ls-tree -r 7fe6589700060de813d529c48937b7678c297d42
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b root/left/left/type
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb root/left/left/value
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b root/left/right/type
100644 blob 975fbec8256d3e8a3797e7a3611380f27c49f4ac root/left/right/value
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 root/left/type
100644 blob 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 root/left/value
100644 blob 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 root/right/type
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f root/right/value
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 root/type
100644 blob fd38861632cb2360cb5acb6253e7e281647f034a root/value
# git-ls-files
root/left/left/type
root/left/left/value
root/left/right/type
root/left/right/value
root/left/type
root/left/value
root/right/type
root/right/value
root/type
root/value
# git-ls-files --stage \*/value | cut -d ' ' -f 2 | git-cat-file --batch
587be6b4c3f93f93c489c0111bba5596147a26cb blob 2
x

975fbec8256d3e8a3797e7a3611380f27c49f4ac blob 2
y

72e8ffc0db8aad71a934dd11e5968bd5109e54b4 blob 2
*

0cfbf08886fca9a91cb753ec8734c84fcbe52c9f blob 2
2

fd38861632cb2360cb5acb6253e7e281647f034a blob 2
+

# git-ls-files --stage \*/value | cut -d ' ' -f 2 | xargs git-show
x
y
*
2
+

An ls of the repository will show only the .git directory: the data is stored as BLOBs and paths in the GIT database, not as a directory tree on the filesystem.
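Indeed:

# ls -A
.git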

In the first attempt, git-ls-files returns the database entries lexically ordered by path. A more database-like approach would be to store the nodes of the expression in a single directory (i.e. a table) and link them via parent and child references. This highlights a small flaw in the plan: it is not possible to get the SHA1 of a tree without writing the index to disk.


# # Remove previous attempt
# rm -r .git
# git init

# # Node 1: Operator '+'

# git-update-index --add --cacheinfo 100644 `echo '+' | git-hash-object -w --stdin` 'nodes/1/value'
# git-update-index --add --cacheinfo 100644 `echo 'operator' | git-hash-object -w --stdin` 'nodes/1/type'

# # Node 2: Operator '*'
# git-update-index --add --cacheinfo 100644 `echo '*' | git-hash-object -w --stdin` 'nodes/2/value'
# git-update-index --add --cacheinfo 100644 `echo 'operator' | git-hash-object -w --stdin` 'nodes/2/type'

# # Node 3: Variable 'x'
# git-update-index --add --cacheinfo 100644 `echo 'x' | git-hash-object -w --stdin` 'nodes/3/value'
# git-update-index --add --cacheinfo 100644 `echo 'variable' | git-hash-object -w --stdin` 'nodes/3/type'

# # Node 4: Variable 'y'
# git-update-index --add --cacheinfo 100644 `echo 'y' | git-hash-object -w --stdin` 'nodes/4/value'
# git-update-index --add --cacheinfo 100644 `echo 'variable' | git-hash-object -w --stdin` 'nodes/4/type'

# # Node 5: Integer '2'
# git-update-index --add --cacheinfo 100644 `echo '2' | git-hash-object -w --stdin` 'nodes/5/value'
# git-update-index --add --cacheinfo 100644 `echo 'integer' | git-hash-object -w --stdin` 'nodes/5/type'

# git write-tree
d986912eb66eb446ddaaa188a5cc1068c8cf951b
# git ls-tree `git ls-tree d986912eb66eb446ddaaa188a5cc1068c8cf951b | grep nodes | awk '{print $3}'`
040000 tree 945999b7c15c9b745333847bf3d341b7f74a784f 1
040000 tree 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 2
040000 tree 9a4b76bbc0b9a20185333ca44321ba438eb6d71c 3
040000 tree f556930b3cd058453894e0cc386fe6ca10908abd 4
040000 tree 36fde2340dd25814793c1346ebbb0175049665dd 5

# # Set Node 1 children
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/1/left'
# git-update-index --add --cacheinfo 100644 36fde2340dd25814793c1346ebbb0175049665dd 'nodes/1/right'

# # Set Node 2 parent and children
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/2/parent'
# git-update-index --add --cacheinfo 100644 9a4b76bbc0b9a20185333ca44321ba438eb6d71c 'nodes/2/left'
# git-update-index --add --cacheinfo 100644 f556930b3cd058453894e0cc386fe6ca10908abd 'nodes/2/right'

# # Set Node 3 parent
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/3/parent'

# # Set Node 4 parent
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/4/parent'

# # Set Node 5 parent
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/5/parent'

# git-write-tree
e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
# git ls-tree -r e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/1/left
100644 blob 36fde2340dd25814793c1346ebbb0175049665dd nodes/1/right
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 nodes/1/type
100644 blob fd38861632cb2360cb5acb6253e7e281647f034a nodes/1/value
100644 blob 9a4b76bbc0b9a20185333ca44321ba438eb6d71c nodes/2/left
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/2/parent
100644 blob f556930b3cd058453894e0cc386fe6ca10908abd nodes/2/right
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 nodes/2/type
100644 blob 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 nodes/2/value
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/3/parent
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b nodes/3/type
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb nodes/3/value
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/4/parent
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b nodes/4/type
100644 blob 975fbec8256d3e8a3797e7a3611380f27c49f4ac nodes/4/value
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/5/parent
100644 blob 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 nodes/5/type
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f nodes/5/value


Of course, pulling the contents of these files from the DB results in a list of expression tokens ordered by node number:

# git ls-files --stage '*/value' | awk '{print $2}' | xargs git-show
+
*
x
y
2

While correct, this is not useful for building an infix expression. A new tree is needed that acts as an index for the first tree, so that listing the new tree will return the tokens in infix order. This is accomplished by using the names 'left', 'operator', and 'right' for node contents, so that lexical ordering of the index-tree will return tokens in that order.


# # Root: Operator '+'
# git-update-index --add --cacheinfo 100644 `echo '+' | git-hash-object -w --stdin` 'infix/operator'

# # Root Left Child: Operator '*'
# git-update-index --add --cacheinfo 100644 `echo '*' | git-hash-object -w --stdin` 'infix/left/operator'

# # * Left Child: Variable 'x'
# git-update-index --add --cacheinfo 100644 `echo 'x' | git-hash-object -w --stdin` 'infix/left/left/variable'

# # * Right Child: Variable 'y'
# git-update-index --add --cacheinfo 100644 `echo 'y' | git-hash-object -w --stdin` 'infix/left/right/variable'

# # Root Right Child: Integer '2'
# git-update-index --add --cacheinfo 100644 `echo '2' | git-hash-object -w --stdin` 'infix/right/value'

# git-ls-files --stage infix/\* | cut -d ' ' -f 2 | xargs git-show | awk '{printf $0 " "}'
x * y + 2


Here, the 'infix' tree re-creates the hierarchy of the original expression, much as the first attempt did -- only this time, the names of the files and directories are chosen to force an infix ordering when sorted by name. Only the files containing the values to display are included in this tree, and these map to the same SHA1 digest as those in the 'nodes' tree. Note that git only stores one copy of each BLOB, so the infix tree only uses as much space as is required to represent the tree: the value BLOBs are shared with the original table.
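This is easy to verify: the 'x' value in the 'nodes' table and in the 'infix' index resolve to the same BLOB:

# git-ls-files --stage nodes/3/value infix/left/left/variable
100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0 infix/left/left/variable
100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0 nodes/3/value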

Listing of subtrees correctly returns sub-expressions:

# git-ls-files --stage infix/left/\* | cut -d ' ' -f 2 | xargs git-show | awk '{printf $0 " "}'
x * y

Once the data has been created, the database can be committed, creating a base version:

# echo 'Node database created' | git commit-tree e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e
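To keep the new commit reachable -- and to put the tag object mentioned earlier to use -- a branch or tag can be pointed at it (the names here are arbitrary):

# git-update-ref refs/heads/master f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e
# git-tag base-version f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e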

Searching the database relies on git-grep, using the --cached option to search the BLOBs instead of the filesystem. The search may be limited to specific paths:

# git-grep --cached 'operator' -- nodes
nodes/1/type:operator
nodes/2/type:operator
# git-grep --cached '*' -- infix
infix/left/operator:*
# git-grep --cached 'x' -- '*/variable'
infix/left/left/variable:x
# git-grep --cached '*' -- '*/value'
nodes/2/value:*

Removing entries involves removing the index entries, not the BLOBs:

# git-update-index --force-remove 'infix/operator'
# git-ls-files --stage infix\*
100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0 infix/left/left/variable
100644 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 0 infix/left/operator
100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 0 infix/left/right/variable
100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 0 infix/right/value

This means that the values stay in the database, so that changes can be reversed. The actual 'database' is therefore nothing but a collection of named links to BLOBs. Searching and browsing the database is done by operating on paths, the 'live' data, while inserting, updating, and deleting database entries means changing which BLOBs the paths point to.
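For example, updating node 5 to a new (hypothetical) value of '3' is a one-liner:

# git-update-index --add --cacheinfo 100644 `echo '3' | git-hash-object -w --stdin` 'nodes/5/value'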



It goes without saying that the git commands provided above can be wrapped with shell functions (or Ruby, or Python, or any other scripting language) that create database entries, populate indexes, and commit changes to the data. With a small amount of work, a reasonable version-controlled hierarchical database can be put together, all contained in the .git dir.
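A minimal sketch of such a wrapper (function names invented, error handling omitted):

db_set() {
    # point a database path at a (possibly new) value BLOB
    git update-index --add --cacheinfo 100644 \
        `echo "$2" | git hash-object -w --stdin` "$1"
}

db_get() {
    # print the value BLOB behind a database path
    git cat-file blob `git ls-files --stage "$1" | cut -d ' ' -f 2`
}

db_commit() {
    # snapshot the current state of the index; prints the commit SHA1
    echo "$1" | git commit-tree `git write-tree`
}

# e.g.: db_set 'nodes/5/value' '3' && db_commit 'bumped node 5'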

More complex projects could mix files managed by git with the tree-only approach above; this is especially useful when storing binary content, which is awkward to deal with when stored in BLOBs (it generally requires unzipping). In such cases, finding the top level (root) of the git repository is useful, and git makes this easy as well:

# mkdir stuff
# cd stuff
# git-rev-parse --git-dir
/tmp/expr/.git

Monday, December 21, 2009

The importance of documentation

Went back to Python after three or so years of Ruby, and found immediately that loading modules from . is now broken.

Checked the docs, which are in a horrible state (it's in the Language Ref; no wait, it's in the Library Ref even though it's a language feature; no wait, it's in a PEP; no wait, that's out of date and the real documentation has been posted to a mailing list), and found stuff like this:

From Python Docs: The Module Search Path:

"When a module named spam is imported, the interpreter searches for a file named spam.py in the current directory, and then in the list of directories specified by the environment variable PYTHONPATH. This has the same syntax as the shell variable PATH, that is, a list of directory names. When PYTHONPATH is not set, or when the file is not found there, the search continues in an installation-dependent default path; on Unix, this is usually .:/usr/local/lib/python.

"Actually, modules are searched in the list of directories given by the variable sys.path which is initialized from the directory containing the input script (or the current directory), PYTHONPATH and the installation- dependent default."

Well, that's the theory at any rate. Let's test it with actual code:

# mkdir /tmp/py-test
# cd /tmp/py-test
# mkdir -p snark/snob
# touch snark/__init__.py snark/snob/__init__.py snark/snob/stuff.py
# echo -e '#!/usr/bin/env python2.5\nimport os\nprint(os.getcwd())\nimport snark.snob.stuff\n' > a.py
# chmod +x a.py
# mkdir bin
# cp a.py bin/b.py


What would you expect to happen when this is run? Surely having the module in the current directory means that running either a.py or b.py from . will work, right?

# ./a.py
/tmp/py-test
# ./bin/b.py
/tmp/py-test
Traceback (most recent call last):
  File "./bin/b.py", line 4, in <module>
    import snark.snob.stuff
ImportError: No module named snark.snob.stuff


Nope! And look at that -- according to os.getcwd(), the current working directory (as mentioned in their docs) is the same for both scripts!

Explicitly setting PYTHONPATH fixes this:

# PYTHONPATH=. bin/b.py
/tmp/py-test


...but really, shouldn't the docs be a bit less misleading?

And, for that matter, why the hell is the location of the script used instead of the working directory anyways? Either be sane and use the current working directory, or be slightly less sane and check both.

The whole reason for using Python on this project in the first place was to revert to a language and interpreter that is more stable and more professional (in terms of management style) than Ruby, but this kind of crap certainly gives one pause. If thirty minutes with the language turns up something as obviously broken as the module loader, one shudders to think what is going to come up over the course of the project.

Sad days for the Python boys. The interest of Fowler and his crew must have really given Ruby a leg up in terms of quality.

Sunday, December 6, 2009

'standalone' rattle

Rattle is a decent data-mining utility built on top of GNU R, with one small problem: it is impossible to run outside of the R terminal (e.g. from the shell or a desktop icon/menu).

What this means, on Linux, is that to run Rattle one must start R in a terminal and enter the following commands:

> library(rattle)
Loading required package: pmml
Loading required package: XML
Rattle: Graphical interface for data mining using R.
Version 2.5.3 Copyright (C) 2006-2009 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle()

A Ctrl-D to log out of R is also required.

The big problem (ignoring R's inconsistent handling of stdin/stdout, which can mostly be solved by using littler) is that the R Gtk module starts a separate GUI thread, and quitting R does not wait on child threads -- it just kills them.

A workaround to launch rattle from outside of R is to provide a simple wrapper script:

#!/usr/bin/env ruby

require 'open3'

r_prog = `which R`.chomp

Open3.popen3( "#{r_prog} --interactive --no-save --no-restore --slave" ) do |stdin, stdout, stderr|
  stdin.puts "library(rattle)"
  stdin.puts "rattle()"
  puts stdout.read
end


This runs R and launches rattle(), but leaves an R process open in the background once rattle() has exited.
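A roughly equivalent plain-shell version (an untested sketch) carries the same caveat:

#!/bin/sh
# feed the two commands to an interactive R session on stdin
printf 'library(rattle)\nrattle()\n' | R --interactive --no-save --no-restore --slave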

Adding a q() call to the script merely terminates the Rattle GUI child thread. In addition, rattle provides no indication that it is running:

> ls()
character(0)
> library(rattle)
Loading required package: pmml
Loading required package: XML
Rattle: Graphical interface for data mining using R.
Version 2.5.3 Copyright (C) 2006-2009 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> ls()
[1] "Global_rattleGUI" "crs"
[3] "crv" "on_aboutdialog_response"
[5] "rattleGUI" "suppressRattleWelcome"
[7] "viewdataGUI"
> rattle()
> ls()
[1] "Global_rattleGUI" "crs"
[3] "crv" "on_aboutdialog_response"
[5] "rattleGUI" "suppressRattleWelcome"
[7] "viewdataGUI"
# Use Ctrl-Q to exit the Rattle GUI
> ls()
[1] "Global_rattleGUI" "crs"
[3] "crv" "on_aboutdialog_response"
[5] "rattleGUI" "suppressRattleWelcome"
[7] "viewdataGUI"

Once the Rattle library has loaded, there is no way to tell from within R whether the Rattle GUI is running or not. Modifying Rattle to create a variable when the GUI loads successfully, and to remove it when the GUI quits, would undoubtedly fix this.

Wednesday, November 25, 2009

Too clever by half

It's always frustrating to come across a build error in allegedly-portable code, then to discover that the root of it is someone complicating things in a misguided attempt to improve portability -- usually by introducing complexity instead of removing it.

The macports project in particular is rife with these problems, undoubtedly because most of the "cross-platform" open source projects it ports never considered OS X as a build target.

KDE4 has refused to build on a particular OS X box (Quad G5, 10.5.8, XCode 3.1.4) via macports for quite some time. Currently, the build fails in kdebase4-runtime due to an error in the fishProtocol ctor in fish.cpp:


/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp: In constructor 'fishProtocol::fishProtocol(const QByteArray&, const QByteArray&)':
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp:275: error: cannot convert 'const char* (*)()' to 'const char*' for argument '1' to 'size_t strlen(const char*)'
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp: In member function 'void fishProtocol::manageConnection(const QString&)':
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp:1079: error: no matching function for call to 'fishProtocol::writeChild(const char* (&)(), int&)'
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp:561: note: candidates are: void fishProtocol::writeChild(const char*, KIO::fileoffset_t)
make[2]: *** [kioslave/fish/CMakeFiles/kio_fish.dir/fish.o] Error 1
make[1]: *** [kioslave/fish/CMakeFiles/kio_fish.dir/all] Error 2
make: *** [all] Error 2


Little did I know at the time that the actual cause for the error was further up:

[ 57%] Generating fishcode.h
cd /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish && /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/generate_fishcode.sh /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl /opt/local/bin/md5 /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/build/kioslave/fish/fishcode.h -f\ 4
sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unescaped newline inside substitute pattern



The line in question (fish.cpp:275) does a strlen on fishCode, which is not defined anywhere, though there is an #include for fishcode.h.

I searched around, found fishcode.h, and it was close to empty:

#define CHECKSUM "'fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'"
static const char
*fishCode('
');


Something had gone horribly wrong, somewhere -- fishCode is empty, though that shouldn't cause the error being displayed. Still, builds get wacky, and it's a lead.

Googling around for a copy of fishcode.h turns up nothing, which is no surprise as it is obviously autogenerated by the build process. What does turn up, however, are the commands used to generate it:

SUM=`/usr/bin/md5sum
/usr/src/packages/BUILD/kdebase-683581/kioslave/fish/fish.pl | cut -d ' ' -f 1`;
echo '#define CHECKSUM "'$SUM'"' > fishcode.h; echo 'static const char
*fishCode(' >> fishcode.h; sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'
/usr/src/packages/BUILD/kdebase-683581/kioslave/fish/fish.pl >> fishcode.h; echo
');' >> fishcode.h;


Time to set SUM in the shell and start playing with the sed command, which is obviously what's failing:

bash-3.2$ SUM='fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unbalanced brackets ([])


Something is going wrong; set SUM to a more sane value:

bash-3.2$ SUM='fb18e850532f42b49f1034b4e17a4cdc'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unbalanced brackets ([])

bash-3.2$ SUM='/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unbalanced brackets ([])


Try replacing the embedded newline with a good old '\n':

bash-3.2$ SUM='/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[\n]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": bad flag in substitute command: 'o'


Sed doesn't like the embedded newline; not a surprise. Now it seems like SUM is being interpreted as part of the expression. Let's see what's really going on:

bash-3.2$ SUM='fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ echo sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;' sed -e s/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl/;


Argh! Someone's single quotes aren't properly escaped! Right there at '$SUM'!

Simple enough to fix, though. Fortunately, using sed to generate fishcode.h manually allows the kdebase4-runtime build to continue, which saves the time of tracking down where in the build process that sed command is.
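For reference, a more defensive version of the generation snippet would grab only the digest (BSD md5 supports -q for exactly this; availability on the build host assumed) and double-quote the expansion so whitespace in SUM can't split the expression (remaining sed expressions elided):

SUM=`/opt/local/bin/md5 -q fish.pl`
sed -e 's/{CHECKSUM}/'"$SUM"'/;' fish.pl >> fishcode.h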

It does raise the question, though: was any of this autogeneration really necessary? And if so, wouldn't a perl script (or even a one-liner) be more reliable?

One can argue that perl does not ship on as many OSes as sed (debatable), but the fact that this build worked on a maintainer's machine and not on an end-user's machine (running an OS highly regulated by the manufacturer, no less) pretty much blows the portability argument out of the water: if that's what the maintainer is truly after, well, their methods aren't working either.

OK, end angry-at-overly-convoluted-open-source-code-again rant. The moral: verify your goddamn shell commands!

Saturday, September 12, 2009

Ubuntu, bluetooth, and the E70

A few quick notes on connecting a phone (in this case the venerable Nokia E70, the last with the excellent gull-wing keyboard design) to ubuntu via bluetooth.

Assuming the kernel modules and all necessary utilities (bluetooth, bluez, obexftp, obexfs, obexpushd, obex-data-server, irda-utils, cobex, gammu, kbluetooth) are installed, the first step is to create an rfcomm device:

mknod /dev/rfcomm0 c 216 0

Next, bind the device to the phone's bluetooth MAC address:

sudo rfcomm bind 0 00:12:D1:AD:FD:5E

The MAC address can be obtained using

hcitool scan

Note that the device creation and the bind have to be performed at each boot; the file /etc/bluetooth/rfcomm.conf can be modified to do this automatically (it contains a sufficiently clear example).
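The entry should look something along these lines (the channel may differ per phone):

rfcomm0 {
    bind yes;
    device 00:12:D1:AD:FD:5E;
    channel 1;
    comment "Nokia E70";
}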

Once these steps are performed, utilities like cobex and kbluetooth should just work.

With the E70, though, they don't, so it's necessary to use Obex to get data to and from the device.

The standard, longhand way to do this is with obexftp:

obexftp -b $MAC_ADDR -B 11 -l

The -b option is the bluetooth MAC address; the -B option is the channel (usually 10 or 11), and the -l option is the command to execute (ls). The man page lists commands.
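For example (the remote file names here are made up):

obexftp -b $MAC_ADDR -B 11 -l Images
obexftp -b $MAC_ADDR -B 11 -g Images/photo.jpg
obexftp -b $MAC_ADDR -B 11 -p note.txt

-l lists a folder, -g fetches a file, and -p uploads one.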

Better than obexftp is obexfs, which mounts the phone as a filesystem:

mkdir ~/e70
obexfs -b $MAC_ADDR -B 11 ~/e70

This mounts the phone at the mount point ~/e70, where its phone and MMC memory can be accessed directly.

Tuesday, September 8, 2009

Doxygen gotchas

Finally took an afternoon to learn doxygen and have been commenting up the latest project for the past 24 or so hours.

Doxygen is easy to get up to speed with, but there are a couple of gotchas that aren't made clear in the documentation. These are listed below in no particular order.


Standalone pages

This is the first thing one notices when mucking about with doxygen: how can the index.html (aka "Main Page") be modified?

The answer is just as quick to find: create a document with a \mainpage tag. But what document? Surely adding a header file just to create doxygen tags is a bit silly?

Indeed. Instead, create a directory (e.g. doc) and use it for standalone doxygen files. A standalone file is basically a text file consisting of a single C++ comment, e.g.:

/*!
 \mainpage The Main Page
 \section sec_1 Section 1
 ...section 1 stuff...
 \section sec_2 Section 2
 ... section 2 stuff...

<HR>
<b>\ref todo</b>
*/


Name this something like 'main.dox', and add the .dox extension to the FILE_PATTERNS definition in the project Doxyfile:

FILE_PATTERNS += *.dox

These standalone doxygen files are incredibly useful for stuff like installation instructions, HowTos, and FAQs.

They are also useful for defining groups. A file groups.dox can contain group definitions like the following:

/*!                                                                         
\defgroup outer_group "Outer Group"
This module contains outer-group stuff.

\defgroup inner_group "Inner Group"
\ingroup outer_group
These things are in a group in a group.
*/


This provides a single, easy-to-maintain place for group definitions, so that header files and such just have to use "\ingroup". It seems like pretty good practice to put most files and classes in groups -- it makes for more expedient browsing.


Global namespace

Ok, there's a nice 'Namespaces' folder in the tree pane, and what does it contain? The project namespace. What about all those singletons ill-advisably implemented as globals (purely for narrative effect)?

Turns out there is no way to list the global namespace. This seems like a huge oversight -- if there's one thing you want to know about a project, it's how many lame globals the developers used.

Adding a 'globals' page seems like a good way to circumvent it, except for one slight problem -- if you create a standalone doc with "\page globals Global Namespace" in it, doxygen creates a page in the tree called "Global Namespace" ... with its own internal(?) version of the global namespace in it. This means, basically, that globals defined in your project are not there -- it only contains stuff like (for example) qRegisterMetaType invocations. It looks like 'globals' is an undocumented, and not particularly working, doxygen 'special command'.

A workaround is to use xrefitems. Add a line like the following to the Doxyfile (remember, 'globals' is not an allowed name):

ALIASES                +=  "globalfn=\xrefitem global \"Functions\" \"Globals\""

Now, use code like the following to document your global function:

/*! \fn void GlobalSingletonFactoryMess( bool suck_less )
       \globalfn void GlobalSingletonFactoryMess( bool suck_less )
       \param suck_less : Make singletons suck less
*/
void GlobalSingletonFactoryMess( bool );

A page called 'Globals' will appear under 'Related Pages', and will contain the function prototype, with a link to its documentation in the appropriate .h file. Of course, it's called a 'Member', but one can't have everything.


Xrefitems

Speaking of xrefitems, see that second argument in the ALIASES line above? The one that's set to "Functions", and is supposedly the header for the section that contains the xref items?

Yeh, that argument does nothing. Doxygen ignores it. Go on, try it. Set it to "Doxygen, please format my hard drive". Or have it make fun of your boss or users. It makes no difference. That text will never appear.

Multiple namespaces in header files

This one is just plain wacky. Or rather, it's just plain lazy. Of the doxygen parser.

Let's say you have a header file where you declare an interface and an abstract base class implementing that interface (never mind why, it's an example):

namespace Mine {
    class MyInterface {
    public:
        virtual ~MyInterface() {}
        virtual void doStuff() = 0;
    };
} /* close namespace */

/* interface must be registered in global namespace */
Q_DECLARE_INTERFACE( Mine::MyInterface, "com.me.Mine.MyInterface/1.0");

namespace Mine {
    class MyClass : public QObject, public MyInterface {
        Q_OBJECT
        Q_INTERFACES(Mine::MyInterface)
    public:
        MyClass( QObject *parent=0 );
        virtual void doStuff();
    };
}


Guess what happens? MyClass never appears in the documentation. The second namespace block leaves the doxygen parsers as befuddled as ... your favorite befuddlement simile.

The solution of course is to put the interface class in its own header file, which is no big deal ... but really, you shouldn't *have* to.


Random \example tag links

OK, there's an example in docs/examples/, that dir is safely added to the Doxyfile EXAMPLE_PATH, and the file shows up where it's supposed to in the doc tree under Examples.

Looking at the class for which it is an example, though, you see nothing -- or maybe it gets linked to a few rather arbitrary methods, or to methods outside of the class altogether.

What's going on?

Well, it turns out that doxygen decides to out-clever you, and only puts a link to the example file in elements that appear in the example. Thus, if your class never directly appears as a type in the file (e.g. 'PluginManager::plugin("MyPlugin").status()' appears instead of 'Plugin p(PluginManager::plugin("MyPlugin")); p.status()'), then the documentation for your class will not be linked to the example, no matter how close to the \class tag you put the \example tag. Doxy knows best, eh? Surely you have no idea where you want your examples linked from.

The fix is to rewrite the example so that the class/function/whatever appears clearly and distinctly.

Tuesday, August 11, 2009

Using qmake without a build target

One of the greatest weaknesses of Qt's qmake utility has been its over-reliance on templates. The available templates are standard stuff for Qt projects: app, lib, and subdirs (new with Qt4, I believe). These are all well and good when one is just slapping together a bunch of Qt code, but what about when interfacing with other build systems? A command template would be quite useful.

The following project file is a workaround. The lib template is used (although app could be used as well) with empty HEADERS and SOURCES variables. The qmake variables that specify which build tools to use are set to a NOP (no-operation) tool such as echo or true. An "extra" target is created for the command to be executed, and set to be a pre-dependency of the main target.

Background: There is a config.py in . that has sip generate all C++ code, as well as the Makefile, in the directory 'sip'. The python module (sip_test.so) is output to the same directory. This project file is invoked, along with many others, from a subdirs project file in the parent directory.


-----------------------------------------------------
# sip_test.pro
TEMPLATE = lib
TARGET = sip_test

# Check that sip exists
unix {
    !system(sip -V > /dev/null) {
        error( "SIP does not exist!" )
    }
}

# sip build command and makefile targets
sip.commands = python ./config.py && cd sip && make
QMAKE_EXTRA_TARGETS += sip
PRE_TARGETDEPS = sip

# compiler toolchain override
unix {
    # 'true' should exist on Unix and OS X. Win32 folk can
    # use TYPE.EXE or something.
    NOP_TOOL = true
}

QMAKE_CC = $$NOP_TOOL
QMAKE_CXX = $$NOP_TOOL
QMAKE_LINK = $$NOP_TOOL
QMAKE_LN_SHLIB = $$NOP_TOOL

# Copy the .so created by sip.commands to the current directory
QMAKE_POST_LINK = cp sip/$${TARGET}.so .
-----------------------------------------------------

With this project file, the standard "qmake && make" command works as expected.

Saturday, May 30, 2009

Toshiba R500 Backlight Button

Keep forgetting to upload this script for toggling the R500 backlight using toshset:

#!/bin/sh

TOSHSET=/usr/local/bin/toshset

STATE=`sudo $TOSHSET -trmode 2>/dev/null | grep -a transreflective | \
    cut -f 3 -d ' '`

if [ "$STATE" = "on" ]
then
    NEW_STATE="off"
else
    NEW_STATE="on"
fi

sudo $TOSHSET -trmode $NEW_STATE


Xev shows the keycode for the backlight (and the 'i' internet/information/idiot key, so they'll both do the same thing) to be 180, which is already taken by XF86HomePage in Ubuntu. If for some reason this key is undefined (i.e. nothing shows up on `xmodmap -pke | grep '^keycode 180'`), add the following line to ~/.Xmodmap (which might need to be symlinked to ~/.xmodmaprc on non-Ubuntu Linuxes):

keycode 180 = XF86HomePage

In E17, edit the keyboard bindings using Settings-> Settings Panel -> Input -> Key Bindings. For XF86HomePage, select the Launch : Defined Command option, and point it to the above shell script (e.g. in ~/bin).

Note that the script uses sudo to launch toshset, so you will either need to change this to invoke a GUI sudo program like gtksu, or add a line like the following to /etc/sudoers:

$USER ALL=(root) NOPASSWD: /usr/local/bin/toshset

where $USER is your username. Note that this allows any toshset command to be run as root without a password from this user's session, which could be exploited if someone were sufficiently motivated.

Thursday, March 26, 2009

Linking static libraries into a shared library

The ability to link static libraries into a shared library -- e.g.

gcc -shared -o libstuff.so lib1.a lib2.a lib3.a

-- disappeared in one of the 3.x series. This broke my tendency to build applications as static libraries, linked into a shared library, linked to by an executable consisting of little more than a main().

Turns out all that was missing was a GNU ld flag to prevent ld from optimizing the unused object code out of the shared library.

The solution is to wrap the .a files in --whole-archive flags, e.g.

gcc -shared -o libstuff.so -Wl,--whole-archive lib1.a lib2.a lib3.a -Wl,--no-whole-archive

Friday, March 6, 2009

Insipid Ibex

Finally upgraded (K)Ubuntu to 8.10. Thanks for breaking my xorg config, guys, by assuming that HAL could actually do a better job than my hand-picked settings. This is easily fixed by uncommenting all of the lines commented out by HAL, but it's amazing that it had to be done at all, since the HAL change made X flat-out stop working. Also, for some reason, ~/.xsession no longer works -- one has to use ~/.xsessionrc instead. Those wacky Ubuntites.

Upgraded E17 shortly afterwards, using the packages at

deb http://greenie.sk/ubuntu intrepid e17

The new version of E has compiz support (which has to almost entirely be disabled for it to run at E17's proper speed) and looks nice. The modules all seem to work, for once.

But it crashes!

The first one came immediately, and was fixed by wiping ~/.e, which contained all of my old config settings. Apparently E17 can't reliably be upgraded with an existing config. Thanks for moving to a binary config file, guys, so that this is actually a far larger problem than it should be. Really, how hard is it to use the binary config only at runtime, and compile from text to binary whenever a config change is made?

Some of the other crashes appear to be from the OpenGeu inclusion of compiz. After removing the calls to emerald and ecomorph from enlightenment_startup.sh, and stripping out the -evil command line option (which, along with -good and -psychotic, did not seem to make things stable), it's become pretty stable, if a little buggy. Dialog boxes occasionally do not disappear when closed, requiring a restart of E17.

Also, some E17 idiocies remain -- things that seemed to be oversights before, but now (having been a pretty stable WM for the past three years) appear to be truly asinine design decisions.

Specifically:

* There is no way to modify the properties of an application (short of adding it to iBar, or modifying the desktop file directly). A dialog exists for this (it's used by ibar and when creating a new application), so it can't be much extra work.

* One cannot cut/copy/paste between E17 widgets and X. This is nice to be able to do when, say, setting an icon file path (since the desktop files for KDE applications do not seem to be handled correctly by E17, and therefore have no icon files). Hope you like typing.

* The binary only config, mentioned above. Really, there is no excuse for this -- the rest of the industry learned that lesson in the 90s. Even Windows allows its abhorred registry to be read from and written to text files.

Aside from that, Ibex works well, and the upgrade was generally painless. The only sticky points were the Xorg conf file (their mistake), E17 (my fault for running it), and the patch I needed to add to get toshset supported by the latest kernel (the maintainer's fault, for not getting it merged after 2 years of reliable performance).

Wednesday, February 11, 2009

more ffox plugins

These are more for the UI folk:

Colorzilla : Look up a pixel's color value

Unicode lookup/converter : Character lookup utility

MeasureIt : Ruler widget for measuring distance in pixels

Tuesday, February 3, 2009

Vim links

People either love vi or hate it; if you hate it, sod off and take your parentheses with you.

It's good to have a printout of the vi reference card handy, and a cheat sheet or two saved locally:

Vi Reference Card (PDF)
Graphical Vim Cheat Sheet (GIF)
Vim Cheat Sheet

In support of the "learn a new vim command every day" effort, both Vim Tips and the Vim Cookbook have proved interesting, as has the piping feature.
Speaking of pipes, pv is quite an interesting utility. So is histring, but it doesn't seem to be available (or maintained) except in rpm.

Vim has a ton of plugins worth checking out, including:
FuzzyFinder
OmniComplete
TagList
RenameWithCScope
C Call Tree Explorer
SourceCodeObedience
VimDebug
MiniBufferExplorer
SnippetsEmu
AllFold
Fly
SessionManager
Alternate
RunView
Scratch*
VcsCommand
MultVals
GitCommit
GitDiff
GitFile
Git Branch Info
There is also NERDTree, but its moderate utility does not justify its horrible name.

Most languages have syntax and snippet plugins, such as:
C
Python
Ruby/Rails
Perl
SQL
x86 and x86-64 asm
XML/HTML


Finally, to forestall the inevitable requests, here is a sample .vimrc .

DNS

DNS timeouts have been pretty bad lately; decided to take action and make resolv.conf actually usable. Of course, it gets overwritten every time DHCP is used to configure an interface, so the name servers have to be added to dhclient.conf.

There are quite a lot of reliable servers to choose from:

OpenDNS
TechFAQ public dns server list
dnsserverlist.org
dnsserverlist.org organized by round trip time

The last two are particularly useful: dnsserverlist.org recommends three DNS servers based on your current IP, and provides an option for sorting their list by round-trip-time.

"Permanent" DNS servers can be added by editing /etc/dhcp3/dhclient.conf and adding 'prepend domain-name-servers' lines:

prepend domain-name-servers 216.224.112.14;
prepend domain-name-servers 67.198.198.213;
prepend domain-name-servers 69.111.95.106;
prepend domain-name-servers 208.67.222.222;
prepend domain-name-servers 208.67.220.220;
prepend domain-name-servers 4.2.2.1;
prepend domain-name-servers 4.2.2.2;
prepend domain-name-servers 151.197.0.38;
prepend domain-name-servers 151.202.0.84;
prepend domain-name-servers 151.203.0.84;

Remember to list these in reverse order: the last server prepended will be the first line in resolv.conf.

The ultimate solution, of course, is to set up a local caching DNS server.
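For example, with dnsmasq (the stock Ubuntu package is assumed here), resolv.conf only ever needs to point at the local cache:

sudo apt-get install dnsmasq
# then make the local cache the first nameserver:
prepend domain-name-servers 127.0.0.1;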

Friday, January 30, 2009

Unfortunate words

leverage: To utilize or exploit in an unspecified, hand-wavy, and generally unplanned manner.
We leverage the core competency of our acquisitions in this sector to maximise ROI throughout the project lifecycle.

pundit: A cheerleader in the sport of politics.
Pundits are claiming that the candidate's gaffe is an opportunity to appeal to blue-collar voters, in stark contrast to his opponent's overly intellectual demeanor.

punditry: The presentation of half-considered opinion as research- or poll-backed fact. Generally considered the province of pundits and journalists.
(See above)

blogosphere: Amateur hour in the world of publish-or-perish.
Lo, it is easier for a shark to stop swimming, than for a blogger to refrain from banal commentary.

Wednesday, January 28, 2009

Productive Programmer

Just tore through Neal Ford's Productive Programmer (978-0-596-51978-0), and ended up with mixed feelings about it.

The first part of the book is pretty decent, with pointers to some good tools. This is also a weakness: the book is not going to be a timeless classic, and will likely be of half utility in two to five years.

The second part has good advice, but is mostly a rehash of agile methodologies. Agile has (or rather, has appropriated) a lot of good ideas, but it is ultimately a methodology, and methodologies lead to two things: zealotry and incomplete projects.

The advice given tends towards the obvious ("use the command line", "learn keyboard shortcuts", "make macros and scripts"), and UNIX is presented as something used by Other People.

In fact the book seems to assume that the reader is encumbered by the two most enfeebling technologies out there: Windows and Java. Windows is inescapable as a platform, but one always has the choice not to code like a Windows programmer (excuse me, "developer"). Java seems to be the language of choice in... well, in companies that I have no interest in working for, so nothing lost there. Buying into the Windows Way or the Java Way means you've created a lot of your own problems, and if this book helps you solve some of them, more power to you.

For the non-Windows-and/or-Java Way programmer, stick with the classics by Brooks, Hunt/Thomas, Kernighan, Plauger, Bentley, Fowler. Maybe some Beck, but you have to be careful with evangelicals.

Tuesday, January 27, 2009

Free as in...

Finally got sick of building custom Linux kernels just to change CONFIG_HZ from the inane default of 250 to something that makes the 10ms timer in my (proprietary) device driver work reliably (instead of firing 83 times a second). Switched to hi-resolution timers, and immediately encountered this:

FATAL: modpost: GPL-incompatible module my_module.ko uses GPL-only symbol 'hrtimer_cancel'


Turns out, after reading stuff like http://www.linuxdevices.com/articles/AT9161119242.html and http://www.linuxdevices.com/articles/AT5041108431.html, that the lunatics have taken over the asylum: specific kernel subsystems can be restricted so that only GPL modules can call them.

This is ludicrous: why should GPL code care who calls it? This has nothing to do with distribution; this is *usage*.

Not to mention entirely unlike the "free as in speech" ethos that the FSF claims to support. "Free speech" generally doesn't imply "say anything you like, just don't say it in kernel mode!".

Between this GPL nonsense and the changing of the CONFIG_HZ default value in a minor release, the Linux kernel is proving itself quite unreliable. I may have to move this entire product over to FreeBSD.