Tuesday, July 24, 2012

Ruby module inheritance

Another frequently-used Ruby technique.

When building up a data model, it is not unusual to define two modules that represent the abstract ModelItem (base) class: a ModelItemClass module to be extended by all ModelItem (derived) classes, and a ModelItemObject module to be included by all ModelItem (derived) classes.

But what happens when these ModelItem modules themselves need to mix-in other modules? For example, to support serialization, the ModelItemClass and ModelItemObject might need to include the modules JsonClass and JsonObject, respectively.

When the abstract base classes are already represented by distinct Class and Object modules, it becomes trivial to extend them: a module can be included in another module, and its (instance) methods will be added when that module is subsequently included or extended by a derived class.

Some quick experimentation in irb will demonstrate that this works as expected:


module AClass
  def a; puts "a class method"; end
end


module AObject
  def a; puts "a instance method"; end
end


module BClass
  include AClass
  def b; puts "b class method"; end
end


module BObject
  include AObject
  def b; puts "b instance method"; end
end


class B
  extend BClass
  include BObject
end


B.a   # -> "a class method"
B.b  # -> "b class method"
B.new.a # -> "a instance method"
B.new.b #-> "b instance method"
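
To tie this back to the serialization case, here is a minimal sketch, assuming the JsonClass and JsonObject modules do nothing more than wrap Ruby's json library. Widget and its attributes hash are invented for illustration; the ancestors checks at the end show where the nested modules end up.

```ruby
require 'json'

# Hypothetical serialization mixins (names follow the text above).
module JsonClass
  # Class-level: build an instance from a JSON string.
  def from_json(str)
    new(JSON.parse(str))
  end
end

module JsonObject
  # Instance-level: serialize the attributes hash.
  def to_json(*args)
    attributes.to_json(*args)
  end
end

# The ModelItem modules mix in the serialization modules.
module ModelItemClass
  include JsonClass
end

module ModelItemObject
  include JsonObject
end

# Widget is an invented derived class.
class Widget
  extend ModelItemClass
  include ModelItemObject

  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end
end

w = Widget.from_json('{"name":"gear"}')
puts w.to_json   # -> {"name":"gear"}

# The nested modules appear in the respective ancestor chains.
p Widget.ancestors.include?(JsonObject)                   # -> true
p Widget.singleton_class.ancestors.include?(JsonClass)    # -> true
```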

Monday, July 23, 2012

Format 212

Physionet has a number of databases that consist of .dat signal-capture files in their "format212" format.

The signal(5) manpage provides the following less-than-lucid explanation:

Each sample is represented by a 12-bit two's complement amplitude. The first sample is obtained from the 12 least significant bits of the first byte pair (stored least significant byte first). The second sample is formed from the 4 remaining bits of the first byte pair (which are the 4 high bits of the 12-bit sample) and the next byte (which contains the remaining 8 bits of the second sample). The process is repeated for each successive pair of samples.

Why write a line of code when 100 words of plaintext will suffice?

Basically, 3 bytes (A, B, C) encode 2 data points (x, y) as follows:
  x = ((B & 0x0F) << 8) | A
  y = ((B & 0xF0) << 4) | C
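
A worked example may help. The sketch below decodes one byte triple, with byte values made up for illustration, and then interprets each raw 12-bit value as two's complement, as the manpage describes. The helper names signed12 and decode_triple are hypothetical.

```ruby
# Interpret a raw 12-bit value as a two's complement number.
def signed12(v)
  v >= 0x800 ? v - 0x1000 : v
end

# Decode one 3-byte group (A, B, C) into its two raw 12-bit samples.
def decode_triple(a, b, c)
  x = ((b & 0x0F) << 8) | a   # low nibble of B supplies the high 4 bits of x
  y = ((b & 0xF0) << 4) | c   # high nibble of B supplies the high 4 bits of y
  [x, y]
end

# Illustrative bytes only, not taken from a real record.
x, y = decode_triple(0x34, 0xF2, 0xAB)
puts format("x = 0x%03X -> %d", x, signed12(x))   # x = 0x234 -> 564
puts format("y = 0x%03X -> %d", y, signed12(y))   # y = 0xFAB -> -85
```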

A throwaway Ruby script to convert a "format212" file to an array of data-pairs:

#!/usr/bin/env ruby

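# Decode format-212 data: each 3-byte group holds two 12-bit samples.
def fmt212_to_a(buf)
  # samples are 12-bit two's complement, so sign-extend the raw values
  signed = lambda { |v| v >= 0x800 ? v - 0x1000 : v }
  # skip an incomplete trailing group, if any
  buf.bytes.each_slice(3).select { |data| data.size == 3 }.collect do |data|
    a = ((data[1] & 0x0F) << 8) | data[0]
    b = ((data[1] & 0xF0) << 4) | data[2]
    [signed.call(a), signed.call(b)]
  end
end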


if __FILE__ == $0
  ARGV.each do |path| 
    File.open(path, 'rb') do |f| 
      fmt212_to_a(f.read).each_with_index { |(a,b),x|
         # generate gnuplot-friendly output
         puts x.to_s + "\t" + a.to_s + "\t" + b.to_s
      }
    end
  end
end

Some equally throwaway gnuplot code to plot an output file signal.dat:

# plot entire file
plot 'signal.dat' using 1:2 title 'lead 1' with lines, 'signal.dat' using 1:3 title 'lead 2' with lines

# plot first 1024 data points of potentially huge files
plot '< head -1024 signal.dat' using 1:2 title 'lead 1' with lines, '< head -1024 signal.dat' using 1:3 title 'lead 2' with lines

Tuesday, July 17, 2012

The DRb JRuby Bridge

This is a technique that has proved useful when a Ruby application needs to call some Java code, but writing the application entirely in JRuby is not appropriate.

Examples:
* A required Ruby module is not compatible with JRuby (e.g. Qt)
* The Java code is only needed by optional parts of the system such as plugins

The solution is to use Distributed Ruby (DRb) to bridge between a Ruby application and a JRuby process. A JRuby application is written that contains a DRb service which acts as a facade for the Java code. A Ruby class provides an API for managing the JRuby process. The Ruby application uses the API to start the JRuby process, connect to the facade object via DRb, and stop the JRuby process when it is no longer needed.


The following example is a simple implementation that provides access to the Apache Tika library. It is, of course, for illustration only; Tika comes with a much more useful command-line utility, and a (J)ruby-tika project already exists.

The directory structure for this example is as follows:

  bin/test_app.rb
  lib/tika-app-1.1.jar
  lib/tika_service.rb
  lib/tika_service_jruby


The JRuby application, tika_service_jruby, requires the java module and the tika jar. It also uses the tika_service module in order to define default port numbers and such.


#!/usr/bin/env jruby


raise ScriptError.new("Tika requires JRuby") unless RUBY_PLATFORM =~ /java/


require 'java'
require 'tika-app-1.1.jar'
require 'tika_service'


# =============================================================================
module Tika


  # ------------------------------------------------------------------------
  # Namespaces for Tika plugins


  module ContentHandler
    Body = Java::org.apache.tika.sax.BodyContentHandler
    Boilerpipe = Java::org.apache.tika.parser.html.BoilerpipeContentHandler
    Writeout = Java::org.apache.tika.sax.WriteOutContentHandler
  end


  module Parser
    Auto = Java::org.apache.tika.parser.AutoDetectParser
  end


  module Detector
    Default = Java::org.apache.tika.detect.DefaultDetector
    Language = Java::org.apache.tika.language.LanguageIdentifier
  end


  Metadata = Java::org.apache.tika.metadata.Metadata


  class Service
    # ----------------------------------------------------------------------
    # JRuby Bridge


    # Number of clients connected to TikaServer
    attr_reader :usage_count


    def initialize
      @usage_count = 0
      Tika::Detector::Language.initProfiles
    end


    def inc_usage; @usage_count += 1; end


    def dec_usage; @usage_count -= 1; end


    def stop_if_unused; DRb.stop_service if (usage_count <= 0); end


    def self.drb_start(port)
      port ||= DEFAULT_PORT


      DRb.start_service "druby://localhost:#{port.to_i}", self.new
      puts "tika daemon started (#{Process.pid}). Connect to #{DRb.uri}"
     
      trap('HUP') { DRb.stop_service; Tika::Service.drb_start(port) }
      trap('INT') { puts 'Stopping tika daemon'; DRb.stop_service }


      DRb.thread.join
    end


    # ----------------------------------------------------------------------
    # Tika Facade


    def parse(str)
      input = java.io.ByteArrayInputStream.new(str.to_java.get_bytes)
      content = Tika::ContentHandler::Body.new(-1)
      metadata = Tika::Metadata.new


      Tika::Parser::Auto.new.parse(input, content, metadata)
      # run language detection on the extracted text, not on the input stream
      lang = Tika::Detector::Language.new(content.to_string)


      { :content => content.to_string, 
        :language => lang.getLanguage(),
        :metadata => metadata_to_hash(metadata) }
    end


    def metadata_to_hash(mdata)
      h = {}
      Metadata.constants.each do |name| 
        begin
          val = mdata.get(Metadata.const_get name)
          h[name.downcase.to_sym] = val if val
        rescue NameError
          # nop
        end
      end
      h
    end


  end
end


# ----------------------------------------------------------------------
# main()
Tika::Service.drb_start ARGV.first if __FILE__ == $0


The details of the Tika Facade are not of interest here. What is important for the technique is the if __FILE__ == $0 line, the drb_start class method, and the inc_usage, dec_usage, and stop_if_unused instance methods. These will be used by the Ruby tika_service module to manage the Tika::Service instance.

When run, this application starts a DRb instance on the requested port number, starts a Tika::Service instance, and returns a DRb Proxy object for that instance when a DRb client connects.
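
The same handshake can be tried without JRuby or Tika. This is a minimal sketch in plain Ruby, assuming a hypothetical EchoService in place of the Tika facade and an OS-assigned port (port 0) in place of the default; server and client share one process here purely for convenience.

```ruby
require 'drb'

# Hypothetical stand-in for Tika::Service: a facade plus usage bookkeeping.
class EchoService
  attr_reader :usage_count

  def initialize
    @usage_count = 0
  end

  def inc_usage; @usage_count += 1; end
  def dec_usage; @usage_count -= 1; end

  # Facade method (stands in for parse)
  def shout(str)
    str.upcase
  end
end

def echo_demo
  # Server side: publish one EchoService instance as the DRb front object.
  # Port 0 asks the OS for a free port; DRb.uri reports the one in use.
  DRb.start_service('druby://localhost:0', EchoService.new)

  # Client side: the proxy forwards method calls to the front object.
  proxy = DRb::DRbObject.new_with_uri(DRb.uri)
  proxy.inc_usage
  result = [proxy.shout('hello'), proxy.usage_count]
  proxy.dec_usage
  result
ensure
  DRb.stop_service
end

p echo_demo   # -> ["HELLO", 1]
```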


The Ruby module, tika_service.rb, uses fork-exec to launch a JRuby process running the tika_service_jruby application. Note that the port number for the DRb service to listen on is passed as an argument to tika_service_jruby.

#!/usr/bin/env ruby

require 'drb'


module Tika


  class Service
    DAEMON = File.join(File.dirname(__FILE__), 'tika_service_jruby')
    DEFAULT_PORT = 44344
    DEFAULT_URI = "druby://localhost:#{DEFAULT_PORT}"
    TIMEOUT = 300    # in 100-ms increments


    # Return command to launch JRuby interpreter
    def self.get_jruby
      # 1. detect system JRuby
      jruby = `which jruby`
      return jruby.chomp if (! jruby.empty?)


      # 2. detect RVM-managed JRuby
      return nil if (`which rvm`).empty?
      jruby = `rvm list`.split("\n").select { |rb| rb.include? 'jruby' }.first
      return nil if (! jruby)


      "rvm #{jruby.strip.split(' ').first} do ruby "
    end


    # Replace current process with JRuby running Tika Service
    def self.exec(port)
      jruby = get_jruby
      Kernel.exec "#{jruby} #{DAEMON} #{port || ''}" if jruby


      $stderr.puts "No JRUBY found!"
      return 1
    end


    def self.start
      return @pid if @pid
      @pid = Process.fork do
        exit(::Tika::Service::exec DEFAULT_PORT)
      end
      Process.detach(@pid)


      connected = false
      TIMEOUT.times do
        begin
          DRb::DRbObject.new_with_uri(DEFAULT_URI).to_s
          connected = true
          break
        rescue DRb::DRbConnError
          sleep 0.1
        end
      end
      raise "Could not connect to #{DEFAULT_URI}" if ! connected
    end


    def self.stop
      service_send(:stop_if_unused)
    end


    # this will return a new Tika DRuby connection
    def self.service_send(method, *args)
      begin
        obj = DRb::DRbObject.new_with_uri(DEFAULT_URI)
        obj.send(method, *args)
        obj
      rescue DRb::DRbConnError => e
        $stderr.puts "Could not connect to #{DEFAULT_URI}"
        raise e
      end
    end


    def self.connect
      service_send(:inc_usage)
    end


    def self.disconnect
      service_send(:dec_usage)
    end


  end
end


The API provided by this module is straightforward: an application uses the start/stop class methods to execute or terminate the JRuby process as needed, and the connect/disconnect methods to obtain (and free) a DRb Proxy object for the "remote" Tika::Service instance.

The only complications are in the detection of the JRuby interpreter (including support for RVM-managed interpreters) and the timeout while waiting for JRuby to initialize (which can take many seconds).
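
One possible variation on the wait loop is to bound it by elapsed time rather than by attempt count, since a failed connection attempt can itself take a while. The following is only a sketch: wait_for_drb and its parameters are hypothetical and not part of tika_service.rb.

```ruby
require 'drb'

# Poll a DRb URI until the remote front object answers or the deadline passes.
# Returns true on success, false on timeout. At least one attempt is always made.
def wait_for_drb(uri, timeout = 30.0, delay = 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      # to_s is forwarded to the remote object, so it fails until the service is up
      DRb::DRbObject.new_with_uri(uri).to_s
      return true
    rescue DRb::DRbConnError
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep delay
    end
  end
end
```

A start method could then end with something like raise "Could not connect" unless wait_for_drb(DEFAULT_URI).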



The test_app.rb example application is a simple proof-of-concept. It takes any number of filenames as arguments, and uses Tika to analyze the contents of each file. The results are printed to STDOUT via inspect.


#!/usr/bin/env ruby
require 'tika_service'


Tika::Service.start
begin
  tika = Tika::Service.connect
  ARGV.each { |x| File.open(x, 'rb') {|f| puts tika.parse(f.read).inspect} } if tika
ensure
  Tika::Service.disconnect
  Tika::Service.stop
end


The real meat of this technique lies in the tika_service module. This contains a Service class that will manage a JRuby application in a manner that conforms to an abstract Service API (start/stop, connect/disconnect) and which can be generalized to support any number of specific JRuby-based services.
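
As a rough idea of what such a generalization might look like, here is a sketch of a base class whose subclasses supply only a daemon path and a port. It is an assumption-laden outline: JRubyService and TikaService are hypothetical names, Process.spawn is swapped in for the fork-exec used above, jruby is assumed to be on the PATH, and the startup wait is left out for brevity.

```ruby
require 'drb'

# Hypothetical base class: subclasses define DAEMON and PORT constants.
class JRubyService
  def self.uri
    "druby://localhost:#{self::PORT}"
  end

  # Command used to launch the daemon (assumes `jruby` is on the PATH).
  def self.command
    ['jruby', self::DAEMON, self::PORT.to_s]
  end

  # Launch the JRuby process once per subclass; @pid is per-class state.
  def self.start
    return @pid if @pid
    @pid = Process.spawn(*command)
    Process.detach(@pid)
    @pid
  end

  # Obtain a proxy for the remote facade and register as a client.
  def self.connect
    obj = DRb::DRbObject.new_with_uri(uri)
    obj.inc_usage
    obj
  end

  def self.disconnect
    DRb::DRbObject.new_with_uri(uri).dec_usage
  end

  # Ask the remote service to shut down if no clients remain.
  def self.stop
    DRb::DRbObject.new_with_uri(uri).stop_if_unused
    @pid = nil
  end
end

# A concrete service then only needs configuration.
class TikaService < JRubyService
  DAEMON = File.join(File.dirname(__FILE__), 'tika_service_jruby')
  PORT = 44344
end
```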


Update: A generalized (but entirely untested) version is up on GitHub.