Tuesday, July 24, 2012

Ruby module inheritance

Another frequently-used Ruby technique.

When building up a data model, it is not unusual to generate two modules that represent the abstract ModelItem (base) class: a ModelItemClass module to be extended by all ModelItem (derived) classes, and a ModelItemObject module to be included by all ModelItem (derived) classes.

But what happens when these ModelItem modules themselves need to mix-in other modules? For example, to support serialization, the ModelItemClass and ModelItemObject might need to include the modules JsonClass and JsonObject, respectively.

When the abstract base classes are already represented by distinct Class and Object modules, extending them becomes trivial: a module can be included in another module, and its (instance) methods will be added when that module is subsequently included or extended by a derived class.

Some quick experimentation in irb will demonstrate that this works as expected:

module AClass
  def a; puts "a class method"; end
end

module AObject
  def a; puts "a instance method"; end
end

module BClass
  include AClass
  def b; puts "b class method"; end
end

module BObject
  include AObject
  def b; puts "b instance method"; end
end

class B
  extend BClass
  include BObject
end

B.a      # -> "a class method"
B.b      # -> "b class method"
B.new.a  # -> "a instance method"
B.new.b  # -> "b instance method"
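
The serialization case mentioned above can be sketched the same way. This is only an illustration: the from_json and to_json method names, the Person class, and its to_h method are assumptions made for the example, not part of any existing data model.

```ruby
require 'json'

# Hypothetical serialization mixins (method names are illustrative).
module JsonClass
  # Class-level: build an instance from a JSON string.
  def from_json(str)
    new(JSON.parse(str))
  end
end

module JsonObject
  # Instance-level: serialize via the object's own to_h.
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# The ModelItem modules mix in the serialization modules.
module ModelItemClass
  include JsonClass
end

module ModelItemObject
  include JsonObject
end

# A derived class only needs to know about the ModelItem modules.
class Person
  extend ModelItemClass
  include ModelItemObject

  attr_reader :name

  def initialize(attrs)
    @name = attrs['name']
  end

  def to_h
    { 'name' => name }
  end
end

person = Person.from_json('{"name":"Ada"}')
puts person.name      # -> Ada
puts person.to_json   # -> {"name":"Ada"}
```

Person never mentions JsonClass or JsonObject directly; both arrive through the ModelItem modules.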

Monday, July 23, 2012

Format 212

Physionet has a number of databases that consist of .dat signal-capture files in their "format212" format.

The signal(5) manpage provides the following less-than-lucid explanation:

Each sample is represented by a 12-bit two's complement amplitude. The first sample is obtained from the 12 least significant bits of the first byte pair (stored least significant byte first). The second sample is formed from the 4 remaining bits of the first byte pair (which are the 4 high bits of the 12-bit sample) and the next byte (which contains the remaining 8 bits of the second sample). The process is repeated for each successive pair of samples.

Why write a line of code when 100 words of plaintext will suffice?

Basically, 3 bytes (A, B, C) encode 2 data points (x, y) as follows:
  x = ((B & 0x0F) << 8) | A
  y = ((B & 0xF0) << 4) | C
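
Note that the expressions above yield raw 12-bit values in the range 0..4095. Since the manpage describes the samples as two's complement, negative amplitudes need a sign-extension step. A minimal sketch of that step follows; the helper names sign_extend12 and decode212 are made up for this example.

```ruby
# Interpret a raw 12-bit value (0..4095) as a two's complement number.
def sign_extend12(v)
  v >= 0x800 ? v - 0x1000 : v
end

# Decode one 3-byte group (a, b, c) into two signed samples.
def decode212(a, b, c)
  x = ((b & 0x0F) << 8) | a
  y = ((b & 0xF0) << 4) | c
  [sign_extend12(x), sign_extend12(y)]
end

p decode212(0xE3, 0x33, 0xF3)   # -> [995, 1011]
p decode212(0xFF, 0x0F, 0x00)   # -> [-1, 0]
```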

A throwaway Ruby script to convert a "format212" file to an array of data-pairs:

#!/usr/bin/env ruby

def fmt212_to_a(buf)
  # each 3-byte group encodes two 12-bit samples
  buf.bytes.each_slice(3).collect do |data|
    a = ((data[1] & 0x0F) << 8) | data[0]
    b = ((data[1] & 0xF0) << 4) | data[2]
    [a, b]
  end
end

if __FILE__ == $0
  ARGV.each do |path|
    File.open(path, 'rb') do |f|
      fmt212_to_a(f.read).each_with_index { |(a,b),x|
         # generate gnuplot-friendly output
         puts x.to_s + "\t" + a.to_s + "\t" + b.to_s
      }
    end
  end
end

Some equally throwaway gnuplot code to plot an output file signal.dat:

# plot entire file
plot 'signal.dat' using 1:2 title 'lead 1' with lines, 'signal.dat' using 1:3 title 'lead 2' with lines

# plot first 1024 data points of potentially huge files
plot '< head -1024 signal.dat' using 1:2 title 'lead 1' with lines, '< head -1024 signal.dat' using 1:3 title 'lead 2' with lines

Tuesday, July 17, 2012

The DRb JRuby Bridge

This is a technique that has proved useful when a Ruby application needs to call some Java code, but writing the application entirely in JRuby is not appropriate. Typical reasons:

* A required Ruby module is not compatible with JRuby (e.g. Qt)
* The Java code is only needed by optional parts of the system such as plugins

The solution is to use Distributed Ruby (DRb) to bridge between a Ruby application and a JRuby process. A JRuby application is written that contains a DRb service which acts as a facade for the Java code. A Ruby class provides an API for managing the JRuby process. The Ruby application uses the API to start the JRuby process, connect to the facade object via DRb, and stop the JRuby process when it is no longer needed.
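
The core of the bridge is small enough to sketch in plain Ruby. In the snippet below both ends live in one process for brevity, whereas in the actual technique the serving side is a separate JRuby process; EchoFacade and its shout method are hypothetical stand-ins for a facade over Java code.

```ruby
require 'drb'

# Hypothetical stand-in for the Java-backed facade object.
class EchoFacade
  def shout(str)
    str.upcase
  end
end

# "JRuby side": publish the facade. Port 0 lets the OS pick a free port.
DRb.start_service('druby://localhost:0', EchoFacade.new)
uri = DRb.uri

# "Ruby side": obtain a proxy for the facade and call through it.
proxy = DRb::DRbObject.new_with_uri(uri)
result = proxy.shout('hello')
puts result   # -> HELLO

DRb.stop_service
```

The client only ever sees the proxy; everything Java-specific stays behind the facade.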

The following example is a simple implementation that gives a Ruby application access to the Apache Tika library. It is, of course, provided only for illustration purposes; Tika comes with a much more useful command-line utility, and a (J)ruby-tika project already exists.

The directory structure for this example is as follows:


The JRuby application, tika_service_jruby, requires the java module and the tika jar. It also uses the tika_service module in order to define default port numbers and such.

#!/usr/bin/env jruby

raise ScriptError.new("Tika requires JRuby") unless RUBY_PLATFORM =~ /java/

require 'java'
require 'tika-app-1.1.jar'
require 'tika_service'

# =============================================================================
module Tika

  # ------------------------------------------------------------------------
  # Namespaces for Tika plugins

  module ContentHandler
    Body = Java::org.apache.tika.sax.BodyContentHandler
    Boilerpipe = Java::org.apache.tika.parser.html.BoilerpipeContentHandler
    Writeout = Java::org.apache.tika.sax.WriteOutContentHandler
  end

  module Parser
    Auto = Java::org.apache.tika.parser.AutoDetectParser
  end

  module Detector
    Default = Java::org.apache.tika.detect.DefaultDetector
    Language = Java::org.apache.tika.language.LanguageIdentifier
  end

  Metadata = Java::org.apache.tika.metadata.Metadata

  class Service
    # ----------------------------------------------------------------------
    # JRuby Bridge

    # Number of clients connected to TikaServer
    attr_reader :usage_count

    def initialize
      @usage_count = 0
    end

    def inc_usage; @usage_count += 1; end

    def dec_usage; @usage_count -= 1; end

    def stop_if_unused; DRb.stop_service if (usage_count <= 0); end

    def self.drb_start(port)
      port ||= DEFAULT_PORT

      DRb.start_service "druby://localhost:#{port.to_i}", self.new
      puts "tika daemon started (#{Process.pid}). Connect to #{DRb.uri}"
      trap('HUP') { DRb.stop_service; Tika::Service.drb_start(port) }
      trap('INT') { puts 'Stopping tika daemon'; DRb.stop_service }

      # block until the DRb service is stopped; otherwise the script exits
      DRb.thread.join
    end

    # ----------------------------------------------------------------------
    # Tika Facade

    def parse(str)
      input = java.io.ByteArrayInputStream.new(str.to_java.get_bytes)
      content = Tika::ContentHandler::Body.new(-1)
      metadata = Tika::Metadata.new

      Tika::Parser::Auto.new.parse(input, content, metadata)
      # identify the language from the extracted text (calling to_string on
      # the input stream would only yield the Java object description)
      lang = Tika::Detector::Language.new(content.to_string)

      { :content => content.to_string,
        :language => lang.getLanguage(),
        :metadata => metadata_to_hash(metadata) }
    end

    def metadata_to_hash(mdata)
      h = {}
      Metadata.constants.each do |name|
        begin
          val = mdata.get(Metadata.const_get name)
          h[name.downcase.to_sym] = val if val
        rescue NameError
          # nop
        end
      end
      h
    end
  end
end

# ----------------------------------------------------------------------
# main()
Tika::Service.drb_start ARGV.first if __FILE__ == $0

The details of the Tika Facade are not of interest here. What is important for the technique is the if __FILE__ == $0 line, the drb_start class method, and the inc_usage, dec_usage, and stop_if_unused instance methods. These will be used by the Ruby tika_service module to manage the Tika::Service instance.

When run, this application starts a DRb instance on the requested port number, starts a Tika::Service instance, and returns a DRb Proxy object for that instance when a DRb client connects.

The Ruby module, tika_service.rb, uses fork-exec to launch a JRuby process running the tika_service_jruby application. Note that the port number for the DRb service to listen on is passed as an argument to tika_service_jruby.

#!/usr/bin/env ruby

require 'drb'

module Tika

  class Service
    DAEMON = File.join(File.dirname(__FILE__), 'tika_service_jruby')
    DEFAULT_PORT = 44344
    DEFAULT_URI = "druby://localhost:#{DEFAULT_PORT}"
    TIMEOUT = 300    # in 100-ms increments

    # Return command to launch JRuby interpreter
    def self.get_jruby
      # 1. detect system JRuby
      jruby = `which jruby`
      return jruby.chomp if (! jruby.empty?)

      # 2. detect RVM-managed JRuby
      return nil if (`which rvm`).empty?
      jruby = `rvm list`.split("\n").select { |rb| rb.include? 'jruby' }.first
      return nil if (! jruby)

      "rvm #{jruby.strip.split(' ').first} do ruby "
    end

    # Replace current process with JRuby running Tika Service
    def self.exec(port)
      jruby = get_jruby
      Kernel.exec "#{jruby} #{DAEMON} #{port || ''}" if jruby

      $stderr.puts "No JRUBY found!"
      return 1
    end

    def self.start
      return @pid if @pid
      @pid = Process.fork do
        exit(::Tika::Service::exec DEFAULT_PORT)
      end

      connected = false
      TIMEOUT.times do
        begin
          # assumed probe: any remote call raises DRbConnError until the
          # JRuby daemon is listening
          DRb::DRbObject.new_with_uri(DEFAULT_URI).usage_count
          connected = true
          break
        rescue DRb::DRbConnError
          sleep 0.1
        end
      end
      raise "Could not connect to #{DEFAULT_URI}" if ! connected
    end

    def self.stop
    end

    # this will return a new Tika DRuby connection
    def self.service_send(method, *args)
      begin
        obj = DRb::DRbObject.new_with_uri(DEFAULT_URI)
        obj.send(method, *args)
      rescue DRb::DRbConnError => e
        $stderr.puts "Could not connect to #{DEFAULT_URI}"
        raise e
      end
    end

    def self.connect
    end

    def self.disconnect
    end
  end
end

The API provided by this module is straightforward: an application uses the start/stop class methods to execute or terminate the JRuby process as-needed, and the connect/disconnect methods to obtain (and free) a DRb Proxy object for the "remote" Tika::Service instance.
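
One way the connect/disconnect/stop bookkeeping could fit together is sketched below. This is an assumption-laden illustration, not the module's actual code: FakeService stands in for the remote Tika::Service, its stopped flag replaces the real DRb.stop_service call so the outcome can be inspected, and ServiceClient is a hypothetical wrapper.

```ruby
require 'drb'

# Stand-in for the remote service; only the bookkeeping methods are modelled.
class FakeService
  attr_reader :usage_count, :stopped

  def initialize
    @usage_count = 0
    @stopped = false
  end

  def inc_usage; @usage_count += 1; end
  def dec_usage; @usage_count -= 1; end

  # The real service would call DRb.stop_service here.
  def stop_if_unused; @stopped = true if usage_count <= 0; end
end

# Hypothetical client-side wrapper around the service URI.
class ServiceClient
  def initialize(uri)
    @uri = uri
  end

  # Obtain a proxy and register this client with the service.
  def connect
    proxy = DRb::DRbObject.new_with_uri(@uri)
    proxy.inc_usage
    proxy
  end

  # Unregister a client; the proxy itself needs no explicit cleanup.
  def disconnect(proxy)
    proxy.dec_usage
  end

  # Ask the service to shut down, which it does only when unused.
  def stop
    DRb::DRbObject.new_with_uri(@uri).stop_if_unused
  end
end

service = FakeService.new
DRb.start_service('druby://localhost:0', service)
client = ServiceClient.new(DRb.uri)

first  = client.connect
second = client.connect
client.disconnect(first)
client.stop                       # one client remains: keep running
still_running = !service.stopped
client.disconnect(second)
client.stop                       # no clients remain: stop
puts "running with one client: #{still_running}, stopped at end: #{service.stopped}"

DRb.stop_service
```

Counting clients on the service side lets several applications share one JRuby process, with the last one out shutting it down.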

The only complications are in the detection of the JRuby interpreter (including support for RVM-managed interpreters) and the timeout while waiting for JRuby to initialize (which can take many seconds).
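
The wait-for-startup logic is a plain retry loop, and can be sketched independently of DRb. The helper name wait_until_ready and its parameters are invented for this example; in the module above the block would be a remote call and the error class DRb::DRbConnError.

```ruby
# Hypothetical helper: retry the block until it stops raising `error`,
# sleeping `interval` seconds between attempts. Returns true on success,
# false once all attempts are used up.
def wait_until_ready(attempts, interval, error = StandardError)
  attempts.times do
    begin
      yield
      return true
    rescue error
      sleep interval
    end
  end
  false
end

# Usage with a fake service that becomes available on the third try.
tries = 0
ready = wait_until_ready(5, 0.01, IOError) do
  tries += 1
  raise IOError, 'not up yet' if tries < 3
end
puts "ready=#{ready} after #{tries} tries"   # -> ready=true after 3 tries
```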

The test_app.rb example application is a simple proof-of-concept. It takes any number of filenames as arguments, and uses Tika to analyze the contents of each file. The results are printed to STDOUT via inspect.

#!/usr/bin/env ruby
require 'tika_service'

  tika = Tika::Service.connect
  ARGV.each { |x| File.open(x, 'rb') {|f| puts tika.parse(f.read).inspect} } if tika

The real meat of this technique lies in the tika_service module. This contains a Service class that will manage a JRuby application in a manner that conforms to an abstract Service API (start/stop, connect/disconnect) and which can be generalized to support any number of specific JRuby-based services.

Update: A generalized (but entirely untested) version is up on GitHub.