Sunday, August 15, 2010

Examining .gem files

The gem(1) utility has a lot of useful commands for managing gem repositories. It only provides two commands for manipulating .gem files, however: build and install.

When building your own gems, or examining a gem before install, it is useful to be able to take a look at the contents without extracting the gem to a temporary directory.

The structure of the gem is simple: a tar file containing a tarball of the gem contents, and a compressed file containing the metadata:

bash# tar -tf /tmp/test.gem
tar: Record size = 19 blocks
data.tar.gz
metadata.gz

Both tar and gzip support input and output on STDOUT, so it is easy enough to write shell functions to access the contents and metadata:

gem_contents () {
  # list the files packaged in data.tar.gz
  tar -xOf "$1" data.tar.gz | tar -ztf -
}

gem_metadata () {
  # print the gem metadata (a gzipped YAML document)
  tar -xOf "$1" metadata.gz | gzip -d
}

This provides quick access to the contents and metadata of a gem file:

gem_contents /tmp/test.gem
gem_metadata /tmp/test.gem
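
The same approach extends to extracting a single file from a gem without unpacking anything to disk. A sketch, assuming GNU tar (gem_extract_file is a name invented here, and the second argument must be a path listed by gem_contents):

gem_extract_file () {
  # $1: the .gem file; $2: path of a file inside data.tar.gz
  tar -xOf "$1" data.tar.gz | tar -zxOf - "$2"
}

# e.g. (hypothetical path): gem_extract_file /tmp/test.gem lib/example.rb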

Unit Testing in C

There are a number of unit test frameworks for C, but they all seem to lack something.

Sure, each can be used to perform automated testing, and some will even generate test stubs for you. But all of them are limited to API testing; that is, the only functions that can be unit-tested are the public API exported in the header files.

C differs from object-oriented languages in that most of the work -- the units to be tested, as it were -- is done in static functions that are not exported. This makes unit-testing, as it stands, all but useless from a C programmer's perspective; there is no compelling reason to use unit tests over the usual "test programs running on test data and returning 0 or 1 to the makefile" approach.

With some small work, however, it is possible to apply a unit test framework to a C project, and to perform actual unit tests (i.e. on static functions). It is a bit ugly, it is a bit invasive (naturally each .c file must contain its own unit tests), but it is scalable and maintainable.


Unit Testing with CHECK

check is a unit testing framework for C (see its manual). It is primarily designed to be used with the GNU autotools (autoconf, automake, etc.), and the documentation is not clear on integrating check with a standard Makefile-based project. The unit tests developed here will serve as an example.

To begin, assume a project with the following directory structure:

Makefile
stuff.c
stuff.h
tests/
tests/Makefile

The main product, the shared library libstuff.so, has its API in stuff.h and its code in stuff.c, and is built by the Makefile. The unit tests run by check will be in the tests directory.

stuff.h

#ifndef STUFF_H
#define STUFF_H

double do_stuff( double i );

#endif

stuff.c

#include <math.h>
#include "stuff.h"

static double step1( double i ) { return i * i; }
static double step2( double i ) { return i + i; }
static double step3( double i ) { return pow( i, i ); }

double do_stuff( double i ) { return step3( step2( step1( i ) ) ); }

Makefile

# Library Makefile
# -------------------------------------------------------------------
NAME        =    stuff
LIBNAME     =    lib$(NAME).so
ARCHIVE     =    lib$(NAME).a

DEBUG       =    -DDEBUG -ggdb
OPT         =    -O2
ERR         =    -Wall
INC_PATH    =    -I.
LIB_PATH    =   

#----------------------------------------------------------------
CC          =     gcc
LD          =    ld
AR          =    ar rc
RANLIB      =    ranlib
RM          =    rm -f

LIBS        =    -lm
CC_FLAGS    =     $(INC_PATH) $(DEBUG) $(OPT) $(ERR) -fPIC
LD_FLAGS    =    $(LIB_PATH) $(LIBS) -shared -soname=$(LIBNAME)
           

SRC         =     stuff.c

OBJ         =     stuff.o

#---------------------------------------------------------- Targets
all: $(LIBNAME)
.PHONY: all clean check

$(ARCHIVE): $(OBJ)
    $(AR) $(ARCHIVE) $^
    $(RANLIB) $(ARCHIVE)

$(LIBNAME): $(ARCHIVE)
    $(LD) $(LD_FLAGS) --whole-archive $< --no-whole-archive -o $@

# suffix rule (suffix rules must not have prerequisites)
.c.o:
    $(CC) $(CC_FLAGS) -o $@ -c $<

clean:
    $(RM) $(LIBNAME) $(ARCHIVE) $(OBJ)
    cd tests && $(MAKE) clean

check: $(LIBNAME)
    cd tests && $(MAKE) && $(MAKE) check

So far, this is all pretty straightforward stuff. The Makefile in the tests directory will contain all of the check-specific settings.

tests/Makefile

# Unit-test Makefile
#--------------------------------------------------------- Definitions
TGT_NAME    =    stuff
TGT_SRC     =    ../stuff.c

OPT         =    -O2 -fprofile-arcs -ftest-coverage
ERR         =    -Wall
INC_PATH    =    -I. -I..
LIB_PATH    =    -L..
LD_PATH     =     ..

#---------------------------------------------------------
CC          =     gcc
RM          =    rm -f

# NOTE: check libs must be enclosed by --whole-archive directives
CHECK_LIBS  =    -Wl,--whole-archive -lcheck -Wl,--no-whole-archive
LIBS        =    -lm $(CHECK_LIBS)  

# NOTE: UNIT_TEST enables the static-function test case in stuff.c
CC_FLAGS    =     $(INC_PATH) $(OPT) $(ERR) -DUNIT_TEST
LD_FLAGS    =    $(LIB_PATH)


# Test Definitions (to be added later)
TESTS       = 

#---------------------------------------------------------- Targets
all: $(TESTS)
.PHONY: all clean check


clean:
    $(RM) $(TESTS) *.gcno *.gcda


check:
    @for t in $(TESTS); do \
        LD_LIBRARY_PATH='$(LD_PATH)' ./$$t; \
    done


There are a few things to take note of here.

First, the compiler options -fprofile-arcs and -ftest-coverage cause gcc to instrument the tests for coverage profiling; the results can be examined with gcov. See the check manual for details.

Next, libcheck.a (there is no .so) is added to the linker options, enclosed in --whole-archive directives so that the linker will not discard what it thinks are unused object files. This last point is important; when linking a static library into a dynamic library, --whole-archive must be used.
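
To see the difference, build libstuff.so both ways and check the exported symbols with nm (a sketch; nm -D lists the dynamic symbol table):

bash# ld -lm -shared -soname=libstuff.so libstuff.a -o libstuff.so
bash# nm -D libstuff.so | grep do_stuff
bash# ld -lm -shared -soname=libstuff.so --whole-archive libstuff.a --no-whole-archive -o libstuff.so
bash# nm -D libstuff.so | grep do_stuff

Without --whole-archive, no undefined references force any member of the archive to be pulled in, so the first grep should print nothing; the second should list do_stuff as an exported text (T) symbol.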

Finally, the preprocessor definition UNIT_TEST is added to the compiler options. This will be used to ensure that unit tests do not get built into a distribution version of libstuff.so.

Simple Unit Test

The first unit test will be a simple one that ensures the library links with no errors. This is a common problem when developing a shared library; unresolved symbols will not be reported until an executable is linked to the library.

tests/test_link.c

#include <stuff.h>

int main(void) { return (int) do_stuff( 32.0 ); }

Add a make target for this first test.

tests/Makefile

# Test 1 : Simple test to ensure that linking against the library succeeds
TEST1        =    test_link
TEST1_SRC    =    test_link.c

TEST1_FLAGS  =    $(CC_FLAGS) $(LD_FLAGS)
TEST1_LIBS   =    $(LIBS) -l$(TGT_NAME)

TESTS        =    $(TEST1)

# ...

$(TEST1): $(TEST1_SRC)
    $(CC) $(TEST1_FLAGS) -o $@ $^ $(TEST1_LIBS)

Note that the lines before # ... go in the definitions (top) part of the Makefile, while the lines after it go in the targets (bottom) part of it.

This first test can now be run:

bash# make check
gcc -I.  -DDEBUG -ggdb -O2  -Wall -fPIC -o stuff.o -c stuff.c
ar rc libstuff.a stuff.o
ranlib libstuff.a
ld  -lm  -shared -soname=libstuff.so --whole-archive libstuff.a --no-whole-archive -o libstuff.so
cd tests && make && make check
make[1]: Entering directory `stuff/tests'
gcc -I. -I.. -O2 -fprofile-arcs -ftest-coverage -Wall -DUNIT_TEST -L.. -Wl,--whole-archive -lstuff -lcheck -Wl,--no-whole-archive  -o test_link test_link.c
make[1]: Leaving directory `stuff/tests'
make[1]: Entering directory `stuff/tests'
make[1]: Leaving directory `stuff/tests'

Success! Note that the link test does not involve check; make will error out if the linking fails. Pedantic unit testers may want to take the route of making the test fail first (by invoking, say, do_nothing() in test_link.c) in order to convince themselves that it works.
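One quick way to do that (a sketch, assuming GNU sed; do_nothing() is deliberately left undefined, so gcc will warn about the implicit declaration and the link will fail):

bash# sed -i -e 's/return/do_nothing(); return/' tests/test_link.c
bash# make check

make should now abort with an 'undefined reference' link error for do_nothing; restore test_link.c afterwards.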


Testing Static Functions

Testing a static function requires embedding the test code in the file that contains the function. The UNIT_TEST preprocessor directive will prevent the unit test from being compiled in non-unit-test binaries.

First, append the test code to the end of the library code.

stuff.c

/* ================================================================= */
/* UNIT TESTS */
#ifdef UNIT_TEST
#include <check.h>
#include <stdlib.h>    /* for rand() */

START_TEST (test_step1)
{
    double d = (double) rand();
    fail_unless( step1(d) == (d * d), "Step 1 does not square" );
}
END_TEST

START_TEST (test_step2)
{
    double d = (double) rand();
    fail_unless( step2(d) == (d + d), "Step 2 does not double" );
}
END_TEST

START_TEST (test_step3)
{
    double d = (double) rand();
    fail_unless( step3(d) == pow(d, d), "Step 3 does not exponentiate" );
}
END_TEST

TCase * create_static_testcase(void) {
    TCase * tc = tcase_create("Static Functions");
    tcase_add_test(tc, test_step1);
    tcase_add_test(tc, test_step2);
    tcase_add_test(tc, test_step3);
    return tc;
}

#endif

The START_TEST and END_TEST macros are provided by check, as are the tcase_ routines. The strategy here is to define a single exported function, create_static_testcase, which the unit-test binaries will link to.

The next step is to add a unit test suite to the test directory.

tests/test_stuff.c

#include <check.h>
#include <math.h>
#include <stdlib.h>

#include <stuff.h>

#define SUITE_NAME "Stuff"

/* ----------------------------------------------------------------- */
/* TESTS */

START_TEST (test_stuff_diff)
{
    double d = (double) rand();
    fail_unless ( do_stuff(d) != d, "do_stuff doesn't!" );
}
END_TEST

START_TEST (test_stuff_rand)
{
    double d = (double) rand();
    double dd = d * d;
    double sumdd = dd + dd;
    fail_unless ( do_stuff(d) == pow(sumdd, sumdd), "Incorrect result" );
}
END_TEST

/* ----------------------------------------------------------------- */
/* SUITE */

extern TCase * create_static_testcase(void);

Suite * create_suite(void) {
    Suite *s = suite_create( SUITE_NAME );

    /* Create test cases against library API */
    TCase *tc_core = tcase_create ("Core");
    tcase_add_test(tc_core, test_stuff_diff);
    tcase_add_test(tc_core, test_stuff_rand);
    suite_add_tcase(s, tc_core);

    /* Create test cases against static functions */
    suite_add_tcase( s, create_static_testcase() );

    return s;
}

int main( void ) {
    int num_fail;
    Suite *s = create_suite();
    SRunner *sr = srunner_create(s);
    srunner_run_all (sr, CK_NORMAL);
    num_fail = srunner_ntests_failed (sr);
    srunner_free (sr);
    return (num_fail == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}

This file creates a test suite that contains two tests of the library API (defined in test_stuff.c) and three tests of the static functions (defined in stuff.c). The entire suite will be run when the unit-test binary is executed.

Finally, the new test must be added to tests/Makefile, as discussed earlier.

tests/Makefile

# Test 2 : Actual unit tests against source code
# NOTE: this cannot link against the library as it incorporates the source code
TEST2        =    test_$(TGT_NAME)
TEST2_SRC    =     $(TEST2).c \
                   $(TGT_SRC)
TEST2_FLAGS  =    $(CC_FLAGS) $(LD_FLAGS)

TEST2_LIBS   =    $(LIBS) 

TESTS        =    $(TEST1) $(TEST2)

# ...

$(TEST2): $(TEST2_SRC)
    $(CC) $(TEST2_FLAGS) -o $@ $^ $(TEST2_LIBS)

Now all of the tests will be run when the check target is invoked:

bash# make check
gcc -I.  -DDEBUG -ggdb -O2  -Wall -fPIC -o stuff.o -c stuff.c
ar rc libstuff.a stuff.o
ranlib libstuff.a
ld  -lm  -shared -soname=libstuff.so --whole-archive libstuff.a --no-whole-archive -o libstuff.so
cd tests && make && make check
make[1]: Entering directory `stuff/tests'
gcc -I. -I.. -O2 -fprofile-arcs -ftest-coverage -Wall -DUNIT_TEST -L.. -Wl,--whole-archive -lstuff -lcheck -Wl,--no-whole-archive  -o test_link test_link.c
gcc -I. -I.. -O2 -fprofile-arcs -ftest-coverage -Wall -DUNIT_TEST -L.. -Wl,--whole-archive -lstuff -lcheck -Wl,--no-whole-archive  -o test_stuff test_stuff.c ../stuff.c
make[1]: Leaving directory `stuff/tests'
make[1]: Entering directory `stuff/tests'
Running suite(s): Stuff
100%: Checks: 5, Failures: 0, Errors: 0
make[1]: Leaving directory `stuff/tests'

Wednesday, March 17, 2010

Up and running with Ruport

The need to revamp an aging report generator at work led to a bit of research and thus to Ruport, an impressive Ruby framework for report generation.

Ruport has suffered the same fate as many a young open source project: fast and frenzied development ("release early and often" and all that cal) has resulted in enough API changes to render many of the examples and much of the documentation useless. The best info to date is in the Ruport Book, but even this book does not get one up and running with anything more than the most trivial projects (e.g. printing reports for a CSV table).

Let's define "up and running" as "generating reports from data in an existing database" -- the problem that most people are likely trying to solve, but which somehow doesn't appear as a ready example in most of the Ruport documentation. What follows is a quick demonstration of getting Ruport "up and running".

First, install the requisite gems. Note: PostgreSQL is being used as the database in this example instead of MySQL, both because it is a better RDBMS and because the world has enough MySQL examples already.

gem install ruport      
# DBI support                                                      
gem install ruport-util
gem install pg
gem install dbi
gem install dbd-pg
# ActiveRecord and AAR support
gem install activerecord
gem install acts_as_reportable
# Extras - not covered in this example
gem install gruff
gem install documatic

The database is assumed to have the following configuration:

* host: db
* port: 15434
* database: stuff
* user: dba
* password: secret

The 'stuff' database has the table 'complaint', from which the reports will pull their data. Note that the name is not 'complaints', as ActiveRecord would expect (and create) -- the point here is to connect to an existing database, most likely managed by someone else (who, even more likely, is neither a Rails developer nor a fan of the 'database is subservient to the code' approach).

Generating a Report Using the Ruby DBI

Connecting to a database table via the Ruby DBI requires the Ruport::Query class, which has been moved to the ruport-util gem. Ruport::Query maintains a list of named database sources, with the default source being named, unsurprisingly, :default.

Generating a report therefore becomes a matter of adding a database connection via Ruport::Query.add_source, then instantiating a Ruport::Query and printing its result:

require 'rubygems'
require 'ruport'
require 'ruport/util'

Ruport::Query.add_source( :default,
                          :user     => 'dba',
                          :password => 'secret',
                          :dsn      => 'dbi:Pg:database=stuff;host=db;port=15434' )

puts Ruport::Query.new( [['select * from complaint']] ).result

All in all, nice and straightforward, except for that argument to Ruport::Query.new. According to the docs, this should be as simple as

Ruport::Query.new( 'select * from complaint' )

...but this fails, and a look at the source shows that Ruport::Query invokes the each() method of the sql argument (which, being a String, has no each() method). To make matters worse, if sql is an array, then the each() method of sql.first() is invoked, so the nested array is needed. This is very likely a bug, and may be fixed in future versions, rendering this example and caveat obsolete.

Generating a Report Using ActiveRecord

The ActiveRecord support in Ruport is a bit more complicated. In particular, the acts_as_reportable (AAR) mixin must be added to every ActiveRecord table class.

Note that with ActiveRecord, a table must be wrapped by a class derived from ActiveRecord::Base. For existing databases, this class can be empty -- although it may be necessary to override the table_name method if the database schema does not conform to ActiveRecord's expectations.

require 'rubygems'
require 'ruport'
require 'ruport/acts_as_reportable'

ActiveRecord::Base::establish_connection( :adapter  => 'postgresql',
                                          :host     => 'db',
                                          :port     => 15434,
                                          :database => 'stuff',
                                          :username => 'dba',
                                          :password => 'secret' )

class Complaint < ActiveRecord::Base
  acts_as_reportable

  # table_name is a class method; the default name would be 'complaints'
  def self.table_name
    'complaint'
  end
end

puts Complaint.report_table


The report_table method is what AAR exposes to generate a Ruport report for the table.

These quick examples should suffice to get Ruport connected to an existing database without spelunking through the docs and gem source. Once connected, it's simply a matter of playing with Ruport and following the (rest of the) docs.

Tuesday, December 22, 2009

Using GIT as a hierarchical database

A quick look into the 'plumbing' of git (the Pro Git book is good) reveals that it can easily store versioned data that is not based on the contents of files checked into the repository. This means that it can be used as a database that is based on trees rather than tables or key:value pairs.

Git stores four kinds of objects: BLOBs, trees, commits, and tags. A BLOB is raw data: it is identified by the SHA1 digest of its contents, so the same data will never be stored twice. A tree is a hierarchy: it can contain BLOBs or other tree objects. A commit object is basically a snapshot of a hierarchy (tree) at a point in time, and a tag assigns a name to a BLOB, tree, or commit.

In terms of using git as a database, these components can be considered as follows:

BLOB: raw data (values).
Tree: Parent-child relations between data, and a mapping of database entries to raw values.
Commit: A version of the data.
Tag: A label, e.g. the name of a table, or a name for a particular version of the data.

Consider the expression 'x * y + 2'. It can be expressed as a tree:

+( *(x, y), 2 )

Another way to express this is in a filesystem-like tree:

/root/
/root/type # 'operator'
/root/value # '+'
/root/left/type # 'operator'
/root/left/value # '*'
/root/left/left/
/root/left/left/type # 'variable'
/root/left/left/value # 'x'
/root/left/right/
/root/left/right/type # 'variable'
/root/left/right/value # 'y'
/root/right/
/root/right/type # 'integer'
/root/right/value # '2'
 

In this schema, entries named 'left' and 'right' are trees, while entries named 'type' and 'value' are BLOBs. Each tree contains at least a type and a value entry, with left and right entries existing only if children are present.

To begin, create a Git repo:

# mkdir /tmp/expr
# cd /tmp/expr
# git init

Next, enter the raw data into git using git-hash-object:

# echo 'operator' | git-hash-object -w --stdin
e196e4b5d49f41231b184a124b3ff52bb00ef9f6
# echo 'variable' | git-hash-object -w --stdin
6437a1374bfa97cfdac6611fc5d2d650abaf782b
# echo 'integer' | git-hash-object -w --stdin
82539ed1cd5cbb2c26a500a228a865d8fbbbcd02
# echo '*' | git-hash-object -w --stdin
72e8ffc0db8aad71a934dd11e5968bd5109e54b4
# echo '+' | git-hash-object -w --stdin
fd38861632cb2360cb5acb6253e7e281647f034a
# echo 'x' | git-hash-object -w --stdin
587be6b4c3f93f93c489c0111bba5596147a26cb
# echo 'y' | git-hash-object -w --stdin
975fbec8256d3e8a3797e7a3611380f27c49f4ac
# echo '2' | git-hash-object -w --stdin
0cfbf08886fca9a91cb753ec8734c84fcbe52c9f

Note that git-hash-object can be run without the -w (write) option in order to generate the SHA1 digest of the data without adding it to git.
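
For example, hashing without storing, then verifying one of the blobs created above with git-cat-file:

# echo 'operator' | git-hash-object --stdin
e196e4b5d49f41231b184a124b3ff52bb00ef9f6
# git-cat-file -t e196e4b5d49f41231b184a124b3ff52bb00ef9f6
blob
# git-cat-file blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6
operator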

The tree hierarchy can be created with git-update-index:

# # Root: Operator '+'
# git-update-index --add --cacheinfo 100644 e196e4b5d49f41231b184a124b3ff52bb00ef9f6 'root/type'
# git-update-index --add --cacheinfo 100644 fd38861632cb2360cb5acb6253e7e281647f034a 'root/value'

# # Root Left Child: Operator '*'
# git-update-index --add --cacheinfo 100644 e196e4b5d49f41231b184a124b3ff52bb00ef9f6 'root/left/type'
# git-update-index --add --cacheinfo 100644 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 'root/left/value'

# # * Left Child: Variable 'x'
# git-update-index --add --cacheinfo 100644 6437a1374bfa97cfdac6611fc5d2d650abaf782b 'root/left/left/type'
# git-update-index --add --cacheinfo 100644 587be6b4c3f93f93c489c0111bba5596147a26cb 'root/left/left/value'

# # * Right Child: Variable 'y'
# git-update-index --add --cacheinfo 100644 6437a1374bfa97cfdac6611fc5d2d650abaf782b 'root/left/right/type'
# git-update-index --add --cacheinfo 100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 'root/left/right/value'

# # Root Right Child: Integer '2'
# git-update-index --add --cacheinfo 100644 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 'root/right/type'
# git-update-index --add --cacheinfo 100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 'root/right/value'

# git-write-tree
7fe6589700060de813d529c48937b7678c297d42
# git-ls-tree 7fe6589700060de813d529c48937b7678c297d42
040000 tree b430c300c59dd80764892644c637ba6e29785c09 root
# git-ls-tree -r 7fe6589700060de813d529c48937b7678c297d42
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b root/left/left/type
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb root/left/left/value
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b root/left/right/type
100644 blob 975fbec8256d3e8a3797e7a3611380f27c49f4ac root/left/right/value
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 root/left/type
100644 blob 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 root/left/value
100644 blob 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 root/right/type
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f root/right/value
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 root/type
100644 blob fd38861632cb2360cb5acb6253e7e281647f034a root/value
# git-ls-files
root/left/left/type
root/left/left/value
root/left/right/type
root/left/right/value
root/left/type
root/left/value
root/right/type
root/right/value
root/type
root/value
# git-ls-files --stage \*/value | cut -d ' ' -f 2 | git-cat-file --batch
587be6b4c3f93f93c489c0111bba5596147a26cb blob 2
x

975fbec8256d3e8a3797e7a3611380f27c49f4ac blob 2
y

72e8ffc0db8aad71a934dd11e5968bd5109e54b4 blob 2
*

0cfbf08886fca9a91cb753ec8734c84fcbe52c9f blob 2
2

fd38861632cb2360cb5acb6253e7e281647f034a blob 2
+

# git-ls-files --stage \*/value | cut -d ' ' -f 2 | xargs git-show
x
y
*
2
+

An ls of the repository will show only the .git directory: the data is stored as BLOBs and paths in the GIT database, not as a directory tree on the filesystem.
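
A quick check confirms this:

# ls -A /tmp/expr
.git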

In the first attempt, git-ls-files returns the database entries lexically ordered by path. A more database-like approach would be to store the nodes of the expression in a single directory (i.e. a table) and link them via parent and child references. This highlights a small flaw in the plan: it is not possible to get the SHA1 of a tree without writing the index to disk.


# # Remove previous attempt
# rm -r .git
# git init

# # Node 1 Operator '+'

# git-update-index --add --cacheinfo 100644 `echo '+' | git-hash-object -w --stdin` 'nodes/1/value'
# git-update-index --add --cacheinfo 100644 `echo 'operator' | git-hash-object -w --stdin` 'nodes/1/type'

# # Node 2: Operator '*'
# git-update-index --add --cacheinfo 100644 `echo '*' | git-hash-object -w --stdin` 'nodes/2/value'
# git-update-index --add --cacheinfo 100644 `echo 'operator' | git-hash-object -w --stdin` 'nodes/2/type'

# # Node 3: Variable 'x'
# git-update-index --add --cacheinfo 100644 `echo 'x' | git-hash-object -w --stdin` 'nodes/3/value'
# git-update-index --add --cacheinfo 100644 `echo 'variable' | git-hash-object -w --stdin` 'nodes/3/type'

# # Node 4: Variable 'y'
# git-update-index --add --cacheinfo 100644 `echo 'y' | git-hash-object -w --stdin` 'nodes/4/value'
# git-update-index --add --cacheinfo 100644 `echo 'variable' | git-hash-object -w --stdin` 'nodes/4/type'

# # Node 5: Integer '2'
# git-update-index --add --cacheinfo 100644 `echo '2' | git-hash-object -w --stdin` 'nodes/5/value'
# git-update-index --add --cacheinfo 100644 `echo 'integer' | git-hash-object -w --stdin` 'nodes/5/type'

# git write-tree
d986912eb66eb446ddaaa188a5cc1068c8cf951b
# git ls-tree `git ls-tree d986912eb66eb446ddaaa188a5cc1068c8cf951b | grep nodes | awk '{print $3}'`
040000 tree 945999b7c15c9b745333847bf3d341b7f74a784f 1
040000 tree 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 2
040000 tree 9a4b76bbc0b9a20185333ca44321ba438eb6d71c 3
040000 tree f556930b3cd058453894e0cc386fe6ca10908abd 4
040000 tree 36fde2340dd25814793c1346ebbb0175049665dd 5

# # Set Node 1 children
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/1/left'
# git-update-index --add --cacheinfo 100644 36fde2340dd25814793c1346ebbb0175049665dd 'nodes/1/right'

# # Set Node 2 parent and children
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/2/parent'
# git-update-index --add --cacheinfo 100644 9a4b76bbc0b9a20185333ca44321ba438eb6d71c 'nodes/2/left'
# git-update-index --add --cacheinfo 100644 f556930b3cd058453894e0cc386fe6ca10908abd 'nodes/2/right'

# # Set Node 3 parent
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/3/parent'

# # Set Node 4 parent
# git-update-index --add --cacheinfo 100644 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 'nodes/4/parent'

# # Set Node 5 parent
# git-update-index --add --cacheinfo 100644 945999b7c15c9b745333847bf3d341b7f74a784f 'nodes/5/parent'

# git-write-tree
e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
# git ls-tree -r e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/1/left
100644 blob 36fde2340dd25814793c1346ebbb0175049665dd nodes/1/right
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 nodes/1/type
100644 blob fd38861632cb2360cb5acb6253e7e281647f034a nodes/1/value
100644 blob 9a4b76bbc0b9a20185333ca44321ba438eb6d71c nodes/2/left
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/2/parent
100644 blob f556930b3cd058453894e0cc386fe6ca10908abd nodes/2/right
100644 blob e196e4b5d49f41231b184a124b3ff52bb00ef9f6 nodes/2/type
100644 blob 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 nodes/2/value
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/3/parent
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b nodes/3/type
100644 blob 587be6b4c3f93f93c489c0111bba5596147a26cb nodes/3/value
100644 blob 86a40cb9ccbc91a00ce06c9e9ca3e132f5e22ab6 nodes/4/parent
100644 blob 6437a1374bfa97cfdac6611fc5d2d650abaf782b nodes/4/type
100644 blob 975fbec8256d3e8a3797e7a3611380f27c49f4ac nodes/4/value
100644 blob 945999b7c15c9b745333847bf3d341b7f74a784f nodes/5/parent
100644 blob 82539ed1cd5cbb2c26a500a228a865d8fbbbcd02 nodes/5/type
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f nodes/5/value


Of course, pulling the contents of these files from the DB results in a list of expression tokens ordered by node number:

# git ls-files --stage '*/value' | awk '{print $2}' | xargs git-show
+
*
x
y
2

While correct, this is not useful for building an infix expression. A new tree is needed that acts as an index for the first tree, so that listing the new tree will return the tokens in infix order. This is accomplished by using the names 'left', 'operator', and 'right' for node contents, so that lexical ordering of the index-tree will return tokens in that order.


# # Root: Operator '+'
# git-update-index --add --cacheinfo 100644 `echo '+' | git-hash-object -w --stdin` 'infix/operator'

# # Root Left Child: Operator '*'
# git-update-index --add --cacheinfo 100644 `echo '*' | git-hash-object -w --stdin` 'infix/left/operator'

# # * Left Child: Variable 'x'
# git-update-index --add --cacheinfo 100644 `echo 'x' | git-hash-object -w --stdin` 'infix/left/left/variable'

# # * Right Child: Variable 'y'
# git-update-index --add --cacheinfo 100644 `echo 'y' | git-hash-object -w --stdin` 'infix/left/right/variable'

# # Root Right Child: Integer '2'
# git-update-index --add --cacheinfo 100644 `echo '2' | git-hash-object -w --stdin` 'infix/right/value'

# git-ls-files --stage infix/\* | cut -d ' ' -f 2 | xargs git-show | awk '{printf $0 " "}'
x * y + 2


Here, the 'infix' tree re-creates the hierarchy of the original expression, much as the first attempt did -- only this time, the names of the files and directories are chosen to force an infix ordering when sorted by name. Only the files containing the values to display are included in this tree, and these map to the same SHA1 digest as those in the 'nodes' tree. Note that git only stores one copy of each BLOB, so the infix tree only uses as much space as is required to represent the tree: the value BLOBs are shared with the original table.

Listing of subtrees correctly returns sub-expressions:

# git-ls-files --stage infix/left/\* | cut -d ' ' -f 2 | xargs git-show | awk '{printf $0 " "}'
x * y

Once the data has been created, the database can be committed, creating a base version:

# echo 'Node database created' | git commit-tree e6b41858f6e7350914dbf1ddbdc2e457ee9bb4d3
f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e
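
The commit is only reachable by its SHA1 at this point; pointing a ref or a tag at it gives the version a name and protects it from garbage collection (a sketch, using the commit created above):

# git-update-ref refs/heads/master f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e
# git-tag base-version f1b038fb0ea444f7a05ef9c48d74aef6ac3d8b7e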

Searching the database relies on git-grep, using the --cached option to search the BLOBs instead of the filesystem. The search may be limited to specific paths:

# git-grep --cached 'operator' -- root 
root/left/type:operator
root/type:operator
# git-grep --cached '*' -- infix
infix/left/operator:*
# git-grep --cached 'x' -- '*/variable'
infix/left/left/variable:x
# git-grep --cached '*' -- '*/left/*'
infix/left/operator:*
root/left/value:*

Removing entries involves removing the index entries, not the BLOB:

# git-update-index --force-remove 'infix/operator'
# git-ls-files --stage infix\*
100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0 infix/left/left/variable
100644 72e8ffc0db8aad71a934dd11e5968bd5109e54b4 0 infix/left/operator
100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 0 infix/left/right/variable
100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 0 infix/right/value

This means that the values stay in the database, so that changes can be reversed. The actual 'database' therefore is nothing but a collection of named links to BLOBs. Searching and browsing of the database is done by operating on paths, the 'live' data, while inserting, updating, and deleting database entries requires changing what blobs the paths point to.



It goes without saying that the git commands provided above can be wrapped with shell functions (or Ruby, or Python, or any other scripting language) that create database entries, populate indexes, and commit changes to the data. With a small amount of work, a reasonable version-controlled hierarchical database can be put together, all contained in the .git dir.
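
As a minimal sketch of such a wrapper (the function names are invented here; error handling, parent commits, and refs are omitted):

# store a value as a blob and point a database path at it
db_set () {
  local sha=`echo "$2" | git-hash-object -w --stdin`
  git-update-index --add --cacheinfo 100644 $sha "$1"
}

# snapshot the current index as a new version, printing the commit SHA1
db_commit () {
  echo "$1" | git-commit-tree `git-write-tree`
}

# e.g.: db_set 'nodes/6/value' '3' && db_commit 'added node 6'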

More complex projects could mix files managed by git and the above tree-only approach; this is especially useful when storing binary content, which is awkward to deal with when stored in BLOBs (generally requires unzipping). In such cases, finding the top level (root) of the Git repository is useful, and git makes this easy as well:

# mkdir stuff
# cd stuff
# git-rev-parse --git-dir
/tmp/expr/.git

Monday, December 21, 2009

The importance of documentation

Went back to Python after three or so years of Ruby, and found immediately that loading modules from . is now broken.

Checked the docs, which are in a horrible state (it's in the Language Ref, no wait it's in the Library Ref even though it's a language feature, no wait it's in a PEP, no wait that is out of date and the real documentation has been posted to a mailing list), and found stuff like this:

From Python Docs: The Module Search Path:

"When a module named spam is imported, the interpreter searches for a file named spam.py in the current directory, and then in the list of directories specified by the environment variable PYTHONPATH. This has the same syntax as the shell variable PATH, that is, a list of directory names. When PYTHONPATH is not set, or when the file is not found there, the search continues in an installation-dependent default path; on Unix, this is usually .:/usr/local/lib/python.

"Actually, modules are searched in the list of directories given by the variable sys.path which is initialized from the directory containing the input script (or the current directory), PYTHONPATH and the installation- dependent default."

Well, that's the theory at any rate. Let's test it with actual code:

# mkdir /tmp/py-test
# cd /tmp/py-test
# mkdir -p snark/snob
# touch snark/__init__.py snark/snob/__init__.py snark/snob/stuff.py
# echo -e '#!/usr/bin/env python2.5\nimport os\nprint(os.getcwd())\nimport snark.snob.stuff\n' > a.py
# chmod +x a.py
# mkdir bin
# cp a.py bin/b.py


What would you expect to happen when this is run? Surely having the module in the current directory means that running either a.py or b.py from . will work, right?

# ./a.py
/tmp/py-test
# ./bin/b.py
/tmp/py-test
Traceback (most recent call last):
File "./bin/b.py", line 4, in

import snark.snob.stuff
ImportError: No module named snark.snob.stuff


Nope! And look at that -- according to Python's own os.getcwd(), the current working directory (as mentioned in their docs) is the same for both scripts!

Explicitly setting PYTHONPATH fixes this:

# PYTHONPATH=. bin/b.py
/tmp/py-test


...but really, shouldn't the docs be a bit less misleading?

And, for that matter, why the hell is the location of the script used instead of the working directory anyways? Either be sane and use the current working directory, or be slightly less sane and check both.

The whole reason for using Python on this project in the first place was to revert to a language and interpreter that is more stable and more professional (in terms of management style) than Ruby, but this kind of crap certainly gives one pause. If thirty minutes with the language turns up something as obviously broken as the module loader, one shudders to think what is going to come up over the course of the project.

Sad days for the Python boys. The interest of Fowler and his crew must have really given Ruby a leg up in terms of quality.

Sunday, December 6, 2009

'standalone' rattle

Rattle is a decent data-mining utility built on top of GNU R, with one small problem: it is impossible to run it outside of the R terminal (e.g. from the shell or a desktop icon/menu).

What this means, on Linux, is that to run Rattle one must start R in a terminal and enter the following commands:

> library(rattle)
Loading required package: pmml
Loading required package: XML
Rattle: Graphical interface for data mining using R.
Version 2.5.3 Copyright (C) 2006-2009 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle()

A Ctrl-D to log out of R is also required.

The big problem (ignoring R's inconsistent handling of stdin/stdout, which can mostly be solved by using littler) is that the R Gtk module starts a separate GUI thread, and quitting R does not do a wait on child threads -- it just kills them.

A workaround to launch rattle from outside of R is to provide a simple wrapper script:

#!/usr/bin/env ruby

require 'open3'

r_prog = `which R`.chomp

Open3.popen3( "#{r_prog} --interactive --no-save --no-restore --slave" ) do |stdin, stdout, stderr|
  stdin.puts "library(rattle)"
  stdin.puts "rattle()"
  puts stdout.read
end


This runs R and launches rattle(), but leaves an R process open in the background once rattle() has exited.

Adding a q() call to the script merely terminates the Rattle GUI child thread. In addition, rattle provides no indication that it is running:

> ls()
character(0)
> library(rattle)
Loading required package: pmml
Loading required package: XML
Rattle: Graphical interface for data mining using R.
Version 2.5.3 Copyright (C) 2006-2009 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> ls()
[1] "Global_rattleGUI" "crs"
[3] "crv" "on_aboutdialog_response"
[5] "rattleGUI" "suppressRattleWelcome"
[7] "viewdataGUI"
> rattle()
> ls()
[1] "Global_rattleGUI" "crs"
[3] "crv" "on_aboutdialog_response"
[5] "rattleGUI" "suppressRattleWelcome"
[7] "viewdataGUI"
# Use Ctrl-Q to exit the Rattle GUI
> ls()
[1] "Global_rattleGUI" "crs"
[3] "crv" "on_aboutdialog_response"
[5] "rattleGUI" "suppressRattleWelcome"
[7] "viewdataGUI"

Once the Rattle library has loaded, there is no way to tell from within R if the Rattle GUI is running or not. Modifying Rattle to create a variable when the GUI loads successfully, and to remove it when the GUI quits, would undoubtedly fix this.

Wednesday, November 25, 2009

Too clever by half

It's always frustrating to come across a build error in allegedly-portable code, then to discover that the root of it is someone complicating things in a misguided attempt to improve portability -- usually by introducing complexity instead of removing it.

The macports project in particular is rife with these problems, undoubtedly because most of the "cross-platform" open source projects it ports never considered OS X as a build target.

KDE4 has refused to build on a particular OS X box (Quad G5, 10.5.8, XCode 3.1.4) via macports for quite some time. Currently, the build fails in kdebase4-runtime due to an error in the fishProtocol ctor in fish.cpp:


/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp: In constructor 'fishProtocol::fishProtocol(const QByteArray&, const QByteArray&)':
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp:275: error: cannot convert 'const char* (*)()' to 'const char*' for argument '1' to 'size_t strlen(const char*)'
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp: In member function 'void fishProtocol::manageConnection(const QString&)':
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp:1079: error: no matching function for call to 'fishProtocol::writeChild(const char* (&)(), int&)'
/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.cpp:561: note: candidates are: void fishProtocol::writeChild(const char*, KIO::fileoffset_t)
make[2]: *** [kioslave/fish/CMakeFiles/kio_fish.dir/fish.o] Error 1
make[1]: *** [kioslave/fish/CMakeFiles/kio_fish.dir/all] Error 2
make: *** [all] Error 2


Little did I know at the time that the actual cause for the error was further up:

[ 57%] Generating fishcode.h
cd /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish && /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/generate_fishcode.sh /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl /opt/local/bin/md5 /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/build/kioslave/fish/fishcode.h -f\ 4
sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unescaped newline inside substitute pattern



The line in question (fish.cpp:275) does a strlen on fishCode, which is not defined anywhere, though there is an #include for fishcode.h.

I searched around, found fishcode.h, and it was close to empty:

#define CHECKSUM "'fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'"
static const char
*fishCode('
');


Something had gone horribly wrong, somewhere -- fishCode is empty, though that shouldn't cause the error being displayed. Still, builds get wacky, and it's a lead.

Googling around for a copy of fishcode.h turns up nothing, which is not a surprise as it is obviously autogenerated by the build process. What does turn up, however, are the commands used to generate it:

SUM=`/usr/bin/md5sum
/usr/src/packages/BUILD/kdebase-683581/kioslave/fish/fish.pl | cut -d ' ' -f 1`;
echo '#define CHECKSUM "'$SUM'"' > fishcode.h; echo 'static const char
*fishCode(' >> fishcode.h; sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'
/usr/src/packages/BUILD/kdebase-683581/kioslave/fish/fish.pl >> fishcode.h; echo
');' >> fishcode.h;


Time to set SUM in the shell and start playing with the sed command, which is obviously what's failing:

bash-3.2$ SUM='fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unbalanced brackets ([])


Something is going wrong; set SUM to a more sane value:

bash-3.2$ SUM='fb18e850532f42b49f1034b4e17a4cdc'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unbalanced brackets ([])

bash-3.2$ SUM='/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": unbalanced brackets ([])


Try replacing the embedded newline with a good old '\n':

bash-3.2$ SUM='/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[\n]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;'

sed: 1: "s/\\/\\\\/g;s/"/\\"/g;s ...": bad flag in substitute command: 'o'


Sed doesn't like the embedded newline; not a surprise. Now it seems like SUM is being interpreted as part of the expression. Let's see what's really going on:

bash-3.2$ SUM='fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl'

bash-3.2$ echo sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[> ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'$SUM'/;' sed -e s/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[
]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/fb18e850532f42b49f1034b4e17a4cdc /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_kde_kdebase4-runtime/work/kdebase-runtime-4.3.3/kioslave/fish/fish.pl/;


Argh! Someone's single quotes aren't properly escaped! Right there at '$SUM'!

Simple enough to fix, though. Fortunately, using sed to generate fishcode.h manually allows the kdebase4-runtime build to continue, which saves the time of tracking down where in the build process that sed command is.
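
For reference, a sketch of the generation step with the quoting repaired (paths shortened, and md5 -q -- which prints only the digest -- standing in for the md5sum | cut pipeline). The key change is '"$SUM"' in place of '$SUM', so that whitespace in SUM can no longer split the sed expression; the literal newline inside the bracket expression is also dropped, since a sed pattern space never contains one anyway:

SUM=`/opt/local/bin/md5 -q fish.pl`
echo '#define CHECKSUM "'"$SUM"'"' > fishcode.h
echo 'static const char *fishCode(' >> fishcode.h
sed -e 's/\\/\\\\/g;s/"/\\"/g;s/^[ ]*/"/;/^"# /d;s/[ ]*$/\\n"/;/^"\\n"$/d;s/{CHECKSUM}/'"$SUM"'/;' fish.pl >> fishcode.h
echo ');' >> fishcode.h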

It does raise the question, though: was any of this autogeneration really necessary? And if so, wouldn't a perl script (or even a one-liner) be more reliable?

One can argue that perl does not ship on as many OSes as sed (debatable), but the fact that this build worked on a maintainer's machine and not on an end-user's machine (running an OS highly regulated by the manufacturer, no less) pretty much blows the portability argument out of the water: if that's what the maintainer is truly after, well, their methods aren't working either.

OK, end angry-at-overly-convoluted-open-source-code-again rant. The moral: verify your goddamn shell commands!