Another frequently-used Ruby technique.
When building up a data model, it is not unusual to define two modules that together represent the abstract ModelItem (base) class: a ModelItemClass module to be extended by all ModelItem (derived) classes, and a ModelItemObject module to be included by all ModelItem (derived) classes.
But what happens when these ModelItem modules themselves need to mix in other modules? For example, to support serialization, ModelItemClass and ModelItemObject might need to include the modules JsonClass and JsonObject, respectively.
When the abstract base classes are already represented by distinct Class and Object modules, extending them becomes trivial: a module can be included in another module, and its (instance) methods will be added when that module is subsequently included or extended by a derived class.
Some quick experimentation in irb will demonstrate that this works as expected:
module AClass
  def a; puts "a class method"; end
end
module AObject
  def a; puts "a instance method"; end
end
module BClass
  include AClass
  def b; puts "b class method"; end
end
module BObject
  include AObject
  def b; puts "b instance method"; end
end
class B
  extend BClass
  include BObject
end
B.a     # -> "a class method"
B.b     # -> "b class method"
B.new.a # -> "a instance method"
B.new.b # -> "b instance method"
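Applied to the serialization case mentioned above, the same arrangement might look like the following sketch. The method names from_json and to_json, their bodies, and the Person class are assumptions made up for illustration; only the module names come from the text.

```ruby
require 'json'

# Hypothetical serialization mix-ins; the method bodies are illustrative only.
module JsonClass
  # Build an instance from a JSON object of attribute/value pairs.
  def from_json(str)
    obj = new
    JSON.parse(str).each { |k, v| obj.instance_variable_set("@#{k}", v) }
    obj
  end
end

module JsonObject
  # Serialize all instance variables as a JSON object.
  def to_json(*_args)
    h = {}
    instance_variables.each do |iv|
      h[iv.to_s.sub('@', '')] = instance_variable_get(iv)
    end
    JSON.generate(h)
  end
end

# The abstract "base class" modules mix in the serialization modules.
module ModelItemClass
  include JsonClass
end

module ModelItemObject
  include JsonObject
end

# A derived class (hypothetical) picks up both layers at once.
class Person
  extend ModelItemClass
  include ModelItemObject
  attr_accessor :name
end

person = Person.from_json('{"name":"Ada"}')
puts person.name     # Ada
puts person.to_json  # {"name":"Ada"}
```

Person never mentions JsonClass or JsonObject directly; it gets them through the ModelItem modules, which is the point of the technique.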
Tuesday, July 24, 2012
Monday, July 23, 2012
Format 212
Physionet has a number of databases that consist of .dat signal-capture files in their "format212" format.
The signal (5) manpage provides the following less-than-lucid explanation:
Each sample is represented by a 12-bit two's complement amplitude. The first sample is obtained from the 12 least significant bits of the first byte pair (stored least significant byte first). The second sample is formed from the 4 remaining bits of the first byte pair (which are the 4 high bits of the 12-bit sample) and the next byte (which contains the remaining 8 bits of the second sample). The process is repeated for each successive pair of samples.
Why write a line of code when 100 words of plaintext will suffice?
Basically, 3 bytes (A, B, C) encode 2 data points (x, y) as follows:
x = ((B & 0x0F) << 8) | A
y = ((B & 0xF0) << 4) | C
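As a quick sanity check of these two formulas, here is a small round trip. The helper names pack212 and unpack212 are made up for this sketch, and the packing side is simply the inverse implied by the formulas above rather than anything taken from the manpage.

```ruby
# Pack two 12-bit values into three bytes (inverse of the formulas above).
def pack212(x, y)
  a = x & 0xFF                                # low 8 bits of x
  b = ((x >> 8) & 0x0F) | ((y >> 4) & 0xF0)   # high nibbles of x and y
  c = y & 0xFF                                # low 8 bits of y
  [a, b, c]
end

# Unpack three bytes into two raw 12-bit values.
def unpack212(a, b, c)
  [((b & 0x0F) << 8) | a, ((b & 0xF0) << 4) | c]
end

bytes = pack212(0x234, 0x156)
puts bytes.map { |v| format('0x%02X', v) }.join(' ')               # 0x34 0x12 0x56
puts unpack212(*bytes).map { |v| format('0x%03X', v) }.join(' ')   # 0x234 0x156
```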
A throwaway Ruby script to convert a "format212" file to an array of data-pairs:
#!/usr/bin/env ruby
# Unpack format212 data: every 3 bytes hold two raw 12-bit samples.
def fmt212_to_a(buf)
  # ignore a truncated group at the end of the buffer
  buf.bytes.each_slice(3).select { |data| data.size == 3 }.collect do |data|
    a = ((data[1] & 0x0F) << 8) | data[0]
    b = ((data[1] & 0xF0) << 4) | data[2]
    [a,b]
  end
end
if __FILE__ == $0
  ARGV.each do |path|
    File.open(path, 'rb') do |f|
      fmt212_to_a(f.read).each_with_index { |(a,b),x|
        # generate gnuplot-friendly output
        puts x.to_s + "\t" + a.to_s + "\t" + b.to_s
      }
    end
  end
end
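Note that the script emits the raw 12-bit patterns (0 to 4095), while the manpage describes the samples as two's complement. If signed amplitudes are wanted, a sign-extension step along these lines could be applied to each value. This is a sketch; sign_extend_12 and decode_triple are hypothetical names, not part of the script above.

```ruby
# Interpret a raw 12-bit value (0..4095) as a two's complement integer.
def sign_extend_12(raw)
  raw >= 0x800 ? raw - 0x1000 : raw
end

# Decode one 3-byte group into two signed samples.
def decode_triple(a, b, c)
  x = ((b & 0x0F) << 8) | a
  y = ((b & 0xF0) << 4) | c
  [sign_extend_12(x), sign_extend_12(y)]
end

p decode_triple(0xFF, 0x0F, 0x01)  # [-1, 1]
p decode_triple(0x00, 0x80, 0x00)  # [0, -2048]
```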
Some equally throwaway gnuplot code to plot an output file signal.dat:
# plot entire file
plot 'signal.dat' using 1:2 title 'lead 1' with lines, 'signal.dat' using 1:3 title 'lead 2' with lines
# plot first 1024 data points of potentially huge files
plot '< head -1024 signal.dat' using 1:2 title 'lead 1' with lines, '< head -1024 signal.dat' using 1:3 title 'lead 2' with lines
Labels: physionet, ruby, signal analysis
Tuesday, July 17, 2012
The DRb JRuby Bridge
This is a technique that has proved useful when a Ruby application needs to call some Java code, but writing the application entirely in JRuby is not appropriate.
Examples:
* A required Ruby module is not compatible with JRuby (e.g. Qt)
* The Java code is only needed by optional parts of the system such as plugins
The solution is to use Distributed Ruby (DRb) to bridge between a Ruby application and a JRuby process. A JRuby application is written that contains a DRb service which acts as a facade for the Java code. A Ruby class provides an API for managing the JRuby process. The Ruby application uses the API to start the JRuby process, connect to the facade object via DRb, and stop the JRuby process when it is no longer needed.
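For readers who have not used DRb, the core of the mechanism fits in a few lines. The sketch below is plain Ruby with no Java involved: EchoService and its shout method are hypothetical stand-ins for a facade object, and both ends run in one process purely so the snippet is self-contained. Port 0 asks DRb to pick any free port.

```ruby
require 'drb'

# Stand-in for a facade object (hypothetical; a real one would wrap Java code).
class EchoService
  attr_reader :usage_count
  def initialize; @usage_count = 0; end
  def inc_usage; @usage_count += 1; end
  def dec_usage; @usage_count -= 1; end
  def shout(str); str.upcase; end
end

# "Server" side: publish a single front object at a druby:// URI.
DRb.start_service('druby://localhost:0', EchoService.new)
uri = DRb.uri

# "Client" side: obtain a proxy for the front object and call it.
client = DRb::DRbObject.new_with_uri(uri)
client.inc_usage
puts client.shout('hello')   # HELLO
client.dec_usage
puts client.usage_count      # 0

DRb.stop_service
```

In the bridge described here, the server half lives in the JRuby process and the client half in the Ruby application; the proxy calls look the same either way.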
The following example is a simple implementation that provides access to the Apache Tika library. It is, of course, for illustration purposes only; Tika comes with a much more useful command-line utility, and a (J)ruby-tika project already exists.
The directory structure for this example is as follows:
bin/test_app.rb
lib/tika-app-1.1.jar
lib/tika_service.rb
lib/tika_service_jruby
The JRuby application, tika_service_jruby, requires the java module and the tika jar. It also uses the tika_service module in order to define default port numbers and such.
#!/usr/bin/env jruby
raise ScriptError.new("Tika requires JRuby") unless RUBY_PLATFORM =~ /java/
require 'java'
require 'tika-app-1.1.jar'
require 'tika_service'
# =============================================================================
module Tika
  # ------------------------------------------------------------------------
  # Namespaces for Tika plugins
  module ContentHandler
    Body = Java::org.apache.tika.sax.BodyContentHandler
    Boilerpipe = Java::org.apache.tika.parser.html.BoilerpipeContentHandler
    Writeout = Java::org.apache.tika.sax.WriteOutContentHandler
  end
  module Parser
    Auto = Java::org.apache.tika.parser.AutoDetectParser
  end
  module Detector
    Default = Java::org.apache.tika.detect.DefaultDetector
    Language = Java::org.apache.tika.language.LanguageIdentifier
  end
  Metadata = Java::org.apache.tika.metadata.Metadata
  class Service
    # ----------------------------------------------------------------------
    # JRuby Bridge
    # Number of clients connected to TikaServer
    attr_reader :usage_count
    def initialize
      @usage_count = 0
      Tika::Detector::Language.initProfiles
    end
    def inc_usage; @usage_count += 1; end
    def dec_usage; @usage_count -= 1; end
    def stop_if_unused; DRb.stop_service if (usage_count <= 0); end
    def self.drb_start(port)
      port ||= DEFAULT_PORT
      DRb.start_service "druby://localhost:#{port.to_i}", self.new
      puts "tika daemon started (#{Process.pid}). Connect to #{DRb.uri}"
      trap('HUP') { DRb.stop_service; Tika::Service.drb_start(port) }
      trap('INT') { puts 'Stopping tika daemon'; DRb.stop_service }
      DRb.thread.join
    end
    # ----------------------------------------------------------------------
    # Tika Facade
    def parse(str)
      input = java.io.ByteArrayInputStream.new(str.to_java.get_bytes)
      content = Tika::ContentHandler::Body.new(-1)
      metadata = Tika::Metadata.new
      Tika::Parser::Auto.new.parse(input, content, metadata)
      # identify the language from the extracted text; input.to_string would
      # only describe the stream object, not its contents
      lang = Tika::Detector::Language.new(content.to_string)
      { :content => content.to_string,
        :language => lang.getLanguage(),
        :metadata => metadata_to_hash(metadata) }
    end
    def metadata_to_hash(mdata)
      h = {}
      Metadata.constants.each do |name|
        begin
          val = mdata.get(Metadata.const_get name)
          h[name.downcase.to_sym] = val if val
        rescue NameError
          # nop
        end
      end
      h
    end
  end
end
# ----------------------------------------------------------------------
# main()
Tika::Service.drb_start ARGV.first if __FILE__ == $0
The details of the Tika Facade are not of interest here. What is important for the technique is the if __FILE__ == $0 line, the drb_start class method, and the inc_usage, dec_usage, and stop_if_unused instance methods. These will be used by the Ruby tika_service module to manage the Tika::Service instance.
When run, this application starts a DRb instance on the requested port number, starts a Tika::Service instance, and returns a DRb Proxy object for that instance when a DRb client connects.
The Ruby module, tika_service.rb, uses fork-exec to launch a JRuby process running the tika_service_jruby application. Note that the port number for the DRb service to listen on is passed as an argument to tika_service_jruby.
#!/usr/bin/env ruby
require 'drb'
module Tika
  class Service
    DAEMON = File.join(File.dirname(__FILE__), 'tika_service_jruby')
    DEFAULT_PORT = 44344
    DEFAULT_URI = "druby://localhost:#{DEFAULT_PORT}"
    TIMEOUT = 300 # in 100-ms increments
    # Return command to launch JRuby interpreter
    def self.get_jruby
      # 1. detect system JRuby
      jruby = `which jruby`
      return jruby.chomp if (! jruby.empty?)
      # 2. detect RVM-managed JRuby
      return nil if (`which rvm`).empty?
      jruby = `rvm list`.split("\n").select { |rb| rb.include? 'jruby' }.first
      return nil if (! jruby)
      "rvm #{jruby.strip.split(' ').first} do ruby "
    end
    # Replace current process with JRuby running Tika Service
    def self.exec(port)
      jruby = get_jruby
      Kernel.exec "#{jruby} #{DAEMON} #{port || ''}" if jruby
      $stderr.puts "No JRUBY found!"
      return 1
    end
    def self.start
      return @pid if @pid
      @pid = Process.fork do
        exit(::Tika::Service::exec DEFAULT_PORT)
      end
      Process.detach(@pid)
      connected = false
      TIMEOUT.times do
        begin
          DRb::DRbObject.new_with_uri(DEFAULT_URI).to_s
          connected = true
          break
        rescue DRb::DRbConnError
          sleep 0.1
        end
      end
      raise "Could not connect to #{DEFAULT_URI}" if ! connected
    end
    def self.stop
      service_send(:stop_if_unused)
    end
    # this will return a new Tika DRuby connection
    def self.service_send(method, *args)
      begin
        obj = DRb::DRbObject.new_with_uri(DEFAULT_URI)
        obj.send(method, *args)
        obj
      rescue DRb::DRbConnError => e
        $stderr.puts "Could not connect to #{DEFAULT_URI}"
        raise e
      end
    end
    def self.connect
      service_send(:inc_usage)
    end
    def self.disconnect
      service_send(:dec_usage)
    end
  end
end
The API provided by this module is straightforward: an application uses the start/stop class methods to execute or terminate the JRuby process as needed, and the connect/disconnect methods to obtain (and free) a DRb Proxy object for the "remote" Tika::Service instance.
The only complications are in the detection of the JRuby interpreter (including support for RVM-managed interpreters) and the timeout while waiting for JRuby to initialize (which can take many seconds).
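The polling loop in start counts iterations, so the real wait also depends on how long each failed connection attempt takes. One possible variation, sketched here with a hypothetical wait_for_drb helper that is not part of the module above, is to express the timeout in wall-clock seconds using a monotonic clock. The demonstration at the end runs against a throwaway service in the same process (port 0 means any free port), just so the snippet is runnable on its own.

```ruby
require 'drb'

# Hypothetical helper: poll a DRb URI until it answers or `timeout` seconds pass.
def wait_for_drb(uri, timeout = 30.0, interval = 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      # Any call will do; to_s is forwarded to the service's front object.
      DRb::DRbObject.new_with_uri(uri).to_s
      return true
    rescue DRb::DRbConnError
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep interval
    end
  end
end

# Demonstration against a service started in this process.
DRb.start_service('druby://localhost:0', Object.new)
puts wait_for_drb(DRb.uri, 1.0)   # true
DRb.stop_service
```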
The test_app.rb example application is a simple proof-of-concept. It takes any number of filenames as arguments, and uses Tika to analyze the contents of each file. The results are printed to STDOUT via inspect.
#!/usr/bin/env ruby
require 'tika_service'
Tika::Service.start
begin
  tika = Tika::Service.connect
  ARGV.each { |x| File.open(x, 'rb') {|f| puts tika.parse(f.read).inspect} } if tika
ensure
  Tika::Service.disconnect
  Tika::Service.stop
end
The real meat of this technique lies in the tika_service module. This contains a Service class that will manage a JRuby application in a manner that conforms to an abstract Service API (start/stop, connect/disconnect) and which can be generalized to support any number of specific JRuby-based services.
Update: A generalized (but entirely untested) version is up on GitHub.