Wednesday, December 12, 2012

malware in the wild: an interesting diversion

This little number provided some measure of entertainment this evening.

Ostensibly a piece of JavaScript malware "caught in the wild", it consists largely of an incomprehensible HTML I tag full of what appears to be encoded data:

17="=0g1ekd4=0f3f6ek=4e844e5=8g77gcf=37mdd60=1eee2f6..."


The nature of the obfuscation can be determined fairly quickly by looking at the two SCRIPT tags:

dd="i";ss=String.fromCharCode;pp="eIn";gg="getElem"+"entsByTagName";zx="al";

and

if(document.getElementsByTagName("div")[0].style.display==""){a=document[gg](dd)[0];

s=new String();
for(i=0;;i++){
        r=a.getAttribute(i);
        if(r){s=s+r;}else break;
}
a=s;
s=new String();
zx="ev"+zx;
e=window[zx];
p=parseInt;
for(i=0;i<a.length;i+=2){
        if(a["su"+"bs"+"tr"](i,1)!="=")
        s=s.concat(ss(p(a.substr(i,2),23)/3));
}
c=s;
e(c)}

It only takes putting them side by side to realize that the first contains string substitutions for the second.

Performing the substitutions manually, we get:

if(document.getElementsByTagName("div")[0].style.display==""){

  # a)
  a=document["getElementsByTagName"]("i")[0];
  s=new String();
  for(i=0;;i++){
    r=a.getAttribute(i);
    if(r){s=s+r;}else break;
  }
  a=s;
  s=new String();
  e=window["eval"];
  # b)
  for(i=0;i<a.length;i+=2){
    # c)
    if(a["substr"](i,1)!="=")
    #d)
    s=s.concat(String.fromCharCode(parseInt(a.substr(i,2),23)/3));
  }
  c=s;
  # e)
  e(c)}

By now the operation of the main encoding scheme is apparent. The code at a) reads the (numeric) attributes of the "I" tag from 0 to 28 (the largest attribute # in the I tag), and assembles it into a string. The characters of this string are iterated over in pairs at b), and each pair beginning with '=' is skipped at c). Finally, at d), a command string is generated by:

  * reading each pair of characters as a base-23 number
  * dividing the resulting number by 3
  * converting that result into an ASCII character

At e), fittingly, this command string is evaluated.
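The steps b) through d) can be sketched in a few lines of Ruby. This is only an illustration of the scheme as described above, not the malware's own code; the method name decode_payload is made up for the example, and the input is assumed to be the already-concatenated attribute string.

```ruby
# Hypothetical helper: decode a string of two-character pairs.
# Pairs beginning with '=' are padding and are skipped; every other pair is
# a base-23 number which, divided by 3, yields an ASCII character code.
def decode_payload(encoded)
  encoded.scan(/../)                               # walk the string in pairs
         .reject { |pair| pair.start_with?('=') }  # step c): skip '=' pairs
         .map { |pair| (Integer(pair, 23) / 3).chr } # step d): base 23, /3, chr
         .join
end

puts decode_payload('=6f9cfek')   # prints "var"
```

Feeding it the first few groups of attribute 0 is a quick way to check the interpretation before bothering with the whole payload.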


The code in a) shows that the numeric attributes in the I tag are displayed out of order, in another crude attempt at obfuscation. The first attribute must be zero (according to the loop), and its contents are:

0="=6f9cfek=344aae2=2f6dadg=5e88kd4=2f3d4cl=2f37mg1=3f9d4ek=2f0dgeb=8e87d4a=9666074=7607a4a=15he8cf..."

Manually converting this in ruby shows that the above interpretation of the obfuscation is indeed correct:

%w{ f9 cf ek 44 aa e2 f6 da dg e8 8k d4 f3 d4 cl f3 7m g1 f9 d4 ek f0 dg eb e8 7d 4a 66 60 74 60 7a 4a 5h e8 cf }.map { |x| (Integer(x, 23) / 3).chr }.join
 => "var PluginDetect={version:\"0.7.9\",na" 

From here on it is a simple matter of decoding. Extract the contents of the I tag into a file called "lines.dat". 

 dat = File.open('lines.dat', 'r') { |f| f.read }

A quick perusal shows that the attributes are separated by spaces, and a quick experiment verifies this:

 dat.split(' ').length
 => 29 

Enumerable#inject proves a nice way to turn this into a hash:

h = dat.split(' ').inject({}) { |h, attr| k,v,jnk = attr.split('="'); h[Integer(k)] = v.split(/=[[:alnum:]]/)[1..-1].map{ |s| [s[0,2], s[2,2], s[4,2]] }.flatten ; h}

Some commentary is perhaps in order here. The inject block splits each attribute on '="', resulting in a [name, value] pair such as ["0", "=6f9cfek=344aa..."]. The first half of the pair, called k for key, is converted to a Fixnum and serves as a sort of line number for the command.

The tricky bit is the handling of v (for value, of course). This is split on a regex consisting of an equals sign and an alphanumeric character;  for attribute 0, this creates the array ["", "f9cfek", "44aae2", "f6dadg", "e88kd4", ... ]. The empty first element is discarded, then the strings in the array are manually divided into their three numeric components, resulting in an array such as ["f9", "cf", "ek", "44", "aa", "e2", "f6", "da", "dg", "e8", "8k", "d4", ... ].

All that remains now is to order the attributes by "line number", convert each encoded number from a base 23 String representation to a Fixnum, divide it by three, and convert the result to a character. Simple enough:

h.keys.sort.map{ |i| h[i].map{ |x| (Integer(x, 23) / 3).chr }.join }.join

The result is a rather long payload which the reader is welcome to reconstruct themselves or peruse. To give a taste, here is the very beginning and the very end:

"var PluginDetect={version:\"0.7.9\",name:\"PluginDetect\",handler:function(c,b,a){return function(){c(b,a)}},openTag:\"
...
ss=setTimeout;var res=ar[arcalli]();arcalli++;if(res&&window.document){ss(function(){arcall()},5509);}else{arcall();}};arcall();}$$[\"onDetec\"+\"tionDone\"](\"Ja\"+\"va\", svwrbew6436b, \"../legs/getJavaInfo.jar\");"

Tuesday, October 16, 2012

Static DNS servers with Connman

For some reason, Connman uses dnsmasq for name resolution, resulting in extremely slow hostname lookups. This can be verified using the cmcc tool:

bash$ for i in `cmcc services | grep -e '^*' | awk '{ print $2 }'`; do cmcc show $i | grep Nameservers; done
Nameservers = [ 127.0.0.1 ]
Nameservers.Configuration = [  ]


The quick fix is to use cmcc to manually specify name servers:

bash$ for i in `cmcc services | grep -e '^*' | awk '{ print $2 }'`; do cmcc edit $i nameservers 8.8.8.8 8.8.4.4; done
unknown property: 8.8.8.8
bash$ for i in `cmcc services | grep -e '^*' | awk '{ print $2 }'`; do cmcc show $i | grep Nameservers; done
Nameservers = [ 8.8.8.8 8.8.4.4 ]
Nameservers.Configuration = [ 8.8.8.8 8.8.4.4 ]


Alternatively, the default DNS server supplied by the router can be used by specifying auto as the nameserver:

bash$ for i in `cmcc services | grep -e '^*' | awk '{ print $2 }'`; do cmcc edit $i nameservers auto; done
unknown property: auto
bash$ for i in `cmcc services | grep -e '^*' | awk '{ print $2 }'`; do cmcc show $i | grep Nameservers; done
Nameservers = [ 192.168.0.1 ]
Nameservers.Configuration = [  ]

Note that the cmcc tool has a bug in its command-line parsing ("argv" is passed to service.SetProperty in cmd_edit_nameservers, but the nameserver arguments are not popped from argv and thus are treated as additional properties to be parsed once cmd_edit_nameservers returns) -- but as cmcc show proves, the changes have been made.

Thursday, October 4, 2012

"losetup -e blowfish" broken across distributions

Apparently this has been an issue for a long time:

  1. create an encrypted disk image with "losetup -e blowfish"
  2. ...years pass...
  3. move the encrypted disk image to a machine with a
      different distribution of Linux
     -or-
     upgrade the installed version of Linux
  4. discover that all access to the encrypted data is lost!

This apparently has to do with the hash being used in blowfish changing, usually for security reasons and usually without a backwards-compatibility option, as documented in loop-AES.README.

This is very bad news if you've been backing up your encrypted disk image. Better stick with a losetup-mount-tar-pgp backup procedure instead:

# to backup:
bash$ cd /media/encrypted_fs
bash$ tar cf - * | gpg --output /home/backup/encrypted_fs.gpg -r you@email.address -e - 
# to restore:
bash$ cd /media/encrypted_fs
bash$ gpg -d /home/backup/encrypted_fs.gpg | tar xf -

Tuesday, September 18, 2012

Returning to Pre-Framebuffer console fonts

At some point in time, the various Linuxes out there decided that the large, blocky, easy-to-read console (as in Virtual Console, not Terminal Emulator) fonts should be replaced with tiny, hard-to-read fonts that make it a chore to edit standard 80-column scripts/source/config files in vim.

Experimenting with various grub and kernel options to change the resolution, fix i915.modeset, and so forth makes for a lot of rebooting to get things just right.

Enter the Debian console-setup package. This provides a wizard that ultimately calls setfont and stty to configure the Linux Virtual Consoles.

The dialog-based wizard is invoked via sudo dpkg-reconfigure console-setup .  Each screen is simple enough to navigate on its own; however, the following selections will configure the consoles to display an old, 90s-to-early-oughts style of font:

1. UTF-8 [default, just press ENTER]
2. Combined - Latin; Slavic Cyrillic; Greek [default]
3. VGA
4. 28x16

The result should produce a nice 80 column x 25 row vim display that is visible from halfway across the room (and to old geezers a few inches away). Chuck it on a MacBook Air to really get some double-takes from passers-by.

Tuesday, July 24, 2012

Ruby module inheritance

Another frequently-used Ruby technique.

When building up a data model, it is not unusual to generate two modules that represent the abstract ModelItem (base) class : a ModelItemClass module to be extended by all ModelItem (derived) classes, and a ModelItemObject module to be included by all ModelItem (derived) classes.

But what happens when these ModelItem modules themselves need to mix-in other modules? For example, to support serialization, the ModelItemClass and ModelItemObject might need to include the modules JsonClass and JsonObject, respectively.

When the abstract base classes are already represented by distinct Class and Object modules, it becomes trivial to extend these : a module can be included in a module, and its (instance) methods will be added when that module is subsequently included or extended by a derived class.

Some quick experimentation in irb will demonstrate that this works as expected:


module AClass
  def a; puts "a class method"; end
end


module AObject
  def a; puts "a instance method"; end
end


module BClass
  include AClass
  def b; puts "b class method"; end
end


module BObject
  include AObject
  def b; puts "b instance method"; end
end


class B
  extend BClass
  include BObject
end


B.a   # -> "a class method"
B.b  # -> "b class method"
B.new.a # -> "a instance method"
B.new.b #-> "b instance method"
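Applied to the ModelItem and Json modules described earlier, the same pattern might look something like the following. This is only a sketch: the from_json and to_json method bodies, the Person class, and its to_h method are invented for illustration, and only the module names come from the text above.

```ruby
require 'json'

# Serialization support, split into class-level and instance-level halves.
module JsonClass
  def from_json(str)
    new(JSON.parse(str))
  end
end

module JsonObject
  def to_json(*args)
    to_h.to_json(*args)
  end
end

# The abstract ModelItem "base class", also split in two. Each half simply
# includes the matching serialization module.
module ModelItemClass
  include JsonClass
end

module ModelItemObject
  include JsonObject
end

# A hypothetical derived class: extend the Class module, include the Object one.
class Person
  extend ModelItemClass
  include ModelItemObject

  attr_reader :name

  def initialize(attrs)
    @name = attrs['name']
  end

  def to_h
    { 'name' => name }
  end
end

puts Person.from_json('{"name":"Ada"}').name      # prints Ada
puts Person.new({ 'name' => 'Ada' }).to_json      # prints {"name":"Ada"}
```

Person never mentions JsonClass or JsonObject directly; it picks them up through the ModelItem modules.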

Monday, July 23, 2012

Format 212

Physionet has a number of databases that consist of .dat signal-capture files in their "format212" format.

The signal (5) manpage provides the following less-than-lucid explanation:

Each sample is represented by a 12-bit two's complement amplitude. The first sample is obtained from the 12 least significant bits of the first byte pair (stored least significant byte first). The second sample is formed from the 4 remaining bits of the first byte pair (which are the 4 high bits of the 12-bit sample) and the next byte (which contains the remaining 8 bits of the second sample). The process is repeated for each successive pair of samples.

Why write a line of code when 100 words of plaintext will suffice?

Basically, 3 bytes (A, B, C) encode 2 data points (x, y) as follows:
  x = ((B & 0x0F) << 8) | A
  y = ((B & 0xF0) << 4) | C

A throwaway Ruby script to convert a "format212" file to an array of data-pairs:

#!/usr/bin/env ruby

def fmt212_to_a(buf)
  buf.bytes.each_slice(3).collect do |data|
    a = ((data[1] & 0x0F) << 8) | data[0]
    b = ((data[1] & 0xF0) << 4) | data[2]
    [a,b]
  end
end


if __FILE__ == $0
  ARGV.each do |path| 
    File.open(path, 'rb') do |f| 
      fmt212_to_a(f.read).each_with_index { |(a,b),x|
         # generate gnuplot-friendly output
         puts x.to_s + "\t" + a.to_s + "\t" + b.to_s
      }
    end
  end
end
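Note that the script above leaves each sample as an unsigned 12-bit value (0 to 4095), while the manpage describes the amplitudes as two's complement. If signed values are wanted, a sign-extension step could be added along these lines. The helper names sign_extend12 and fmt212_to_signed are made up for this sketch, and dropping a trailing partial triple is an assumption about how to treat truncated files.

```ruby
# Interpret a 12-bit value as two's complement: 0x800..0xFFF are negative.
def sign_extend12(value)
  value >= 0x800 ? value - 0x1000 : value
end

# Same unpacking as fmt212_to_a, but returning signed sample pairs.
def fmt212_to_signed(buf)
  buf.bytes.each_slice(3)
     .select { |triple| triple.size == 3 }   # ignore an incomplete tail
     .map do |a, b, c|
       [sign_extend12(((b & 0x0F) << 8) | a),
        sign_extend12(((b & 0xF0) << 4) | c)]
     end
end

p fmt212_to_signed([0x01, 0x00, 0x02].pack('C*'))   # prints [[1, 2]]
p fmt212_to_signed([0xFF, 0xFF, 0xFF].pack('C*'))   # prints [[-1, -1]]
```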

Some equally throwaway gnuplot code to plot an output file signal.dat:

# plot entire file
plot 'signal.dat' using 1:2 title 'lead 1' with lines, 'signal.dat' using 1:3 title 'lead 2' with lines
# plot first 1024 data points of potentially huge files
plot '< head -1024 signal.dat' using 1:2 title 'lead 1' with lines, '< head -1024 signal.dat' using 1:3 title 'lead 2' with lines

Tuesday, July 17, 2012

The DRb JRuby Bridge

This is a technique that has proved useful when a Ruby application needs to call some Java code, but writing the application entirely in JRuby is not appropriate.

Examples:
* A required Ruby module is not compatible with JRuby (e.g. Qt)
* The Java code is only needed by optional parts of the system such as plugins

The solution is to use Distributed Ruby (DRb) to bridge between a Ruby application and a JRuby process. A JRuby application is written that contains a DRb service which acts as a facade for the Java code. A Ruby class provides an API for managing the JRuby process. The Ruby application uses the API to start the JRuby process, connect to the facade object via DRb, and stop the JRuby process when it is no longer needed.
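Stripped of the JRuby process, the facade-plus-proxy arrangement can be sketched in plain Ruby. Everything here is a stand-in: EchoFacade and its shout method are invented for illustration, and both ends live in a single process purely so the sketch can be run as-is (DRb may well dispatch such a call locally rather than over the socket). In the real bridge the facade object would live in the JRuby process and wrap Java calls.

```ruby
require 'drb'

# Stand-in for the facade object exposed by the service process.
class EchoFacade
  attr_reader :usage_count

  def initialize
    @usage_count = 0
  end

  def inc_usage
    @usage_count += 1
  end

  def dec_usage
    @usage_count -= 1
  end

  # Placeholder for a call that would be forwarded to Java code.
  def shout(str)
    str.upcase
  end
end

# "Service" side: publish the facade. Port 0 asks for any free port.
DRb.start_service('druby://127.0.0.1:0', EchoFacade.new)
uri = DRb.uri

# "Client" side: obtain a proxy for the facade and call it like a local object.
facade = DRb::DRbObject.new_with_uri(uri)
facade.inc_usage
result = facade.shout('hello')
facade.dec_usage
remaining = facade.usage_count

puts result      # prints HELLO
DRb.stop_service
```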


The following example provides a simple implementation which provides access to the Apache Tika library. This is, of course, provided only for illustration purposes; Tika comes with a much more useful command-line utility, and a (J)ruby-tika project already exists.

The directory structure for this example is as follows:

  bin/test_app.rb
  lib/tika-app-1.1.jar
  lib/tika_service.rb
  lib/tika_service_jruby


The JRuby application, tika_service_jruby, requires the java module and the tika jar. It also uses the tika_service module in order to define default port numbers and such.


#!/usr/bin/env jruby


raise ScriptError.new("Tika requires JRuby") unless RUBY_PLATFORM =~ /java/


require 'java'
require 'tika-app-1.1.jar'
require 'tika_service'


# =============================================================================
module Tika


  # ------------------------------------------------------------------------
  # Namespaces for Tika plugins


  module ContentHandler
    Body = Java::org.apache.tika.sax.BodyContentHandler
    Boilerpipe = Java::org.apache.tika.parser.html.BoilerpipeContentHandler
    Writeout = Java::org.apache.tika.sax.WriteOutContentHandler
  end


  module Parser
    Auto = Java::org.apache.tika.parser.AutoDetectParser
  end


  module Detector
    Default = Java::org.apache.tika.detect.DefaultDetector
    Language = Java::org.apache.tika.language.LanguageIdentifier
  end


  Metadata = Java::org.apache.tika.metadata.Metadata


  class Service
    # ----------------------------------------------------------------------
    # JRuby Bridge


    # Number of clients connected to TikaServer
    attr_reader :usage_count


    def initialize
      @usage_count = 0
      Tika::Detector::Language.initProfiles
    end


    def inc_usage; @usage_count += 1; end


    def dec_usage; @usage_count -= 1; end


    def stop_if_unused; DRb.stop_service if (usage_count <= 0); end


    def self.drb_start(port)
      port ||= DEFAULT_PORT


      DRb.start_service "druby://localhost:#{port.to_i}", self.new
      puts "tika daemon started (#{Process.pid}). Connect to #{DRb.uri}"
     
      trap('HUP') { DRb.stop_service; Tika::Service.drb_start(port) }
      trap('INT') { puts 'Stopping tika daemon'; DRb.stop_service }


      DRb.thread.join
    end


    # ----------------------------------------------------------------------
    # Tika Facade


    def parse(str)
      input = java.io.ByteArrayInputStream.new(str.to_java.get_bytes)
      content = Tika::ContentHandler::Body.new(-1)
      metadata = Tika::Metadata.new


      Tika::Parser::Auto.new.parse(input, content, metadata)
      # identify the language from the extracted text, not from the stream object
      lang = Tika::Detector::Language.new(content.to_string)


      { :content => content.to_string, 
        :language => lang.getLanguage(),
        :metadata => metadata_to_hash(metadata) }
    end


    def metadata_to_hash(mdata)
      h = {}
      Metadata.constants.each do |name| 
        begin
          val = mdata.get(Metadata.const_get name)
          h[name.downcase.to_sym] = val if val
        rescue NameError
          # nop
        end
      end
      h
    end


  end
end


# ----------------------------------------------------------------------
# main()
Tika::Service.drb_start ARGV.first if __FILE__ == $0


The details of the Tika Facade are not of interest here. What is important for the technique is the if __FILE__ == $0 line, the drb_start class method, and the inc_usage, dec_usage, and stop_if_unused instance methods. These will be used by the Ruby tika_service module to manage the Tika::Service instance.

When run, this application starts a DRb instance on the requested port number, starts a Tika::Service instance, and returns a DRb Proxy object for that instance when a DRb client connects.


The Ruby module, tika_service.rb, uses fork-exec to launch a JRuby process running the tika_service_jruby application. Note that the port number for the DRb service to listen on is passed as an argument to tika_service_jruby.

#!/usr/bin/env ruby

require 'drb'


module Tika


  class Service
    DAEMON = File.join(File.dirname(__FILE__), 'tika_service_jruby')
    DEFAULT_PORT = 44344
    DEFAULT_URI = "druby://localhost:#{DEFAULT_PORT}"
    TIMEOUT = 300    # in 100-ms increments


    # Return command to launch JRuby interpreter
    def self.get_jruby
      # 1. detect system JRuby
      jruby = `which jruby`
      return jruby.chomp if (! jruby.empty?)


      # 2. detect RVM-managed JRuby
      return nil if (`which rvm`).empty?
      jruby = `rvm list`.split("\n").select { |rb| rb.include? 'jruby' }.first
      return nil if (! jruby)


      "rvm #{jruby.strip.split(' ').first} do ruby "
    end


    # Replace current process with JRuby running Tika Service
    def self.exec(port)
      jruby = get_jruby
      Kernel.exec "#{jruby} #{DAEMON} #{port || ''}" if jruby


      $stderr.puts "No JRUBY found!"
      return 1
    end


    def self.start
      return @pid if @pid
      @pid = Process.fork do
        exit(::Tika::Service::exec DEFAULT_PORT)
      end
      Process.detach(@pid)


      connected = false
      TIMEOUT.times do
        begin
          DRb::DRbObject.new_with_uri(DEFAULT_URI).to_s
          connected = true
          break
        rescue DRb::DRbConnError
          sleep 0.1
        end
      end
      raise "Could not connect to #{DEFAULT_URI}" if ! connected
    end


    def self.stop
      service_send(:stop_if_unused)
    end


    # this will return a new Tika DRuby connection
    def self.service_send(method, *args)
      begin
        obj = DRb::DRbObject.new_with_uri(DEFAULT_URI)
        obj.send(method, *args)
        obj
      rescue DRb::DRbConnError => e
        $stderr.puts "Could not connect to #{DEFAULT_URI}"
        raise e
      end
    end


    def self.connect
      service_send(:inc_usage)
    end


    def self.disconnect
      service_send(:dec_usage)
    end


  end
end


The API provided by this module is straightforward: an application uses the start/stop class methods to execute or terminate the JRuby process as-needed, and the connect/disconnect methods to obtain (and free) a DRb Proxy object for the "remote" Tika::Service instance.

The only complications are in the detection of the JRuby interpreter (including support for RVM-managed interpreters) and the timeout while waiting for JRuby to initialize (which can take many seconds).
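The wait-for-startup loop in start is generic enough that it could be pulled out into a small helper when more than one service needs it. The following is a sketch only; the name wait_until and its keyword arguments are made up here, and the defaults simply mirror TIMEOUT (300 attempts at 100 ms each).

```ruby
# Hypothetical helper: call the block until it returns a truthy value.
# Exceptions of class `ignore` are treated as "not ready yet".
# Returns true on success, false once the attempts are used up.
def wait_until(attempts: 300, interval: 0.1, ignore: StandardError)
  attempts.times do
    begin
      return true if yield
    rescue ignore
      # service not ready yet; fall through to the sleep
    end
    sleep interval
  end
  false
end

# Example: a block that fails twice before succeeding.
tries = 0
ready = wait_until(attempts: 5, interval: 0) do
  tries += 1
  raise IOError, 'not yet' if tries < 3
  true
end
puts "ready=#{ready} after #{tries} tries"   # prints ready=true after 3 tries
```

In the Tika case the block would attempt the DRbObject call and ignore would be DRb::DRbConnError.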



The test_app.rb example application is a simple proof-of-concept. It takes any number of filenames as arguments, and uses Tika to analyze the contents of each file. The results are printed to STDOUT via inspect.


#!/usr/bin/env ruby
require 'tika_service'


Tika::Service.start
begin
  tika = Tika::Service.connect
  ARGV.each { |x| File.open(x, 'rb') {|f| puts tika.parse(f.read).inspect} } if tika
ensure
  Tika::Service.disconnect
  Tika::Service.stop
end


The real meat of this technique lies in the tika_service module. This contains a Service class that will manage a JRuby application in a manner that conforms to an abstract Service API (start/stop, connect/disconnect) and which can be generalized to support any number of specific JRuby-based services.


Update: A generalized (but entirely untested) version is up on Github.

Tuesday, February 14, 2012

Ruby: Make any program a Quine

One of the amusing diversions available to writers of compiled-language programs was the challenge of writing a quine (after W.V.O., of course): a program which, when executed, prints its own source code.

Attempts in C range from the ugly but obvious (repeating the source code as a string, and essentially printing the string twice) to the downright obfuscated.

Things are different in dynamic languages, of course, and there's a reason why this problem was never a RubyQuiz (UPDATE: it has! with a necessary extra 'piping' condition): any ruby program can be turned into a quine with the following code:


if __FILE__ == $0
  puts File.open(__FILE__, 'r') { |f| f.read }
end

But perhaps something more can be done? Perhaps the source code and the bytecode can be displayed.
It turns out that the RubyVM object in YARV-based Ruby interpreters makes this possible:

#!/usr/bin/env ruby

def no_op
  $some_variable ||= 1024
end

if __FILE__ == $0
  buf = File.open(__FILE__, 'r') { |f| f.read }
  code = RubyVM::InstructionSequence.compile(buf)
  puts buf
  puts code.disasm
end    

Running this under ruby-1.9.2-head produces the following output:

bash$ rvm ruby-1.9.2-head do ruby quine.rb
#!/usr/bin/env ruby

def no_op
  $some_variable ||= 1024
end

if __FILE__ == $0
  buf = File.open(__FILE__, 'r') { |f| f.read }
  code = RubyVM::InstructionSequence.compile(buf)
  puts buf
  puts code.disasm
end

== disasm: <RubyVM::InstructionSequence:<compiled>@<compiled>>==========
== catch table
| catch type: break  st: 0029 ed: 0046 sp: 0000 cont: 0046
|------------------------------------------------------------------------
local table (size: 3, argc: 0 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 3] buf        [ 2] code       
0000 trace            1                                               (   3)
0002 putspecialobject 1
0004 putspecialobject 2
0006 putobject        :no_op
0008 putiseq          no_op
0010 send             :"core#define_method", 3, nil, 0, <ic:0>
0016 pop              
0017 trace            1                                               (   7)
0019 putstring        "<compiled>"
0021 getglobal        $0
0023 opt_eq           <ic:9>
0025 branchunless     100
0027 trace            1                                               (   8)
0029 getinlinecache   36, <ic:2>
0032 getconstant      :File
0034 setinlinecache   <ic:2>
0036 putstring        "<compiled>"
0038 putstring        "r"
0040 send             :open, 2, block in <compiled>, 0, <ic:3>
0046 setlocal         buf
0048 trace            1                                               (   9)
0050 getinlinecache   59, <ic:4>
0053 getconstant      :RubyVM
0055 getconstant      :InstructionSequence
0057 setinlinecache   <ic:4>
0059 getlocal         buf
0061 send             :compile, 1, nil, 0, <ic:5>
0067 setlocal         code
0069 trace            1                                               (  10)
0071 putnil           
0072 getlocal         buf
0074 send             :puts, 1, nil, 8, <ic:6>
0080 pop              
0081 trace            1                                               (  11)
0083 putnil           
0084 getlocal         code
0086 send             :disasm, 0, nil, 0, <ic:7>
0092 send             :puts, 1, nil, 8, <ic:8>
0098 leave                                                            (   7)
0099 pop              
0100 putnil                                                           (  11)
0101 leave            
== disasm: <RubyVM::InstructionSequence:no_op@<compiled>>===============
0000 trace            8                                               (   3)
0002 trace            1                                               (   4)
0004 putnil           
0005 defined          7, :$some_variable, false
0009 branchunless     17
0011 getglobal        $some_variable
0013 dup              
0014 branchif         22
0016 pop              
0017 putobject        1024
0019 dup              
0020 setglobal        $some_variable
0022 trace            16                                              (   5)
0024 leave                                                            (   4)
== disasm: <RubyVM::InstructionSequence:block in <compiled>@<compiled>>=
== catch table
| catch type: redo   st: 0000 ed: 0011 sp: 0000 cont: 0000
| catch type: next   st: 0000 ed: 0011 sp: 0000 cont: 0011
|------------------------------------------------------------------------
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1] s3)
[ 2] f<Arg>     
0000 trace            1                                               (   8)
0002 getdynamic       f, 0
0005 send             :read, 0, nil, 0, <ic:0>
0011 leave            

Saturday, February 4, 2012

Web page mirroring (wget) in Ruby

Doing a basic search for 'wget in ruby' or 'web page mirroring in ruby' generally turns up a number of unimaginative (not to mention lazy) examples of invoking wget from a sub-shell. While these solutions often meet the immediate needs of their authors, they provide no guidance for those who need to implement the mirroring of remote documents as part of a larger data processing or data management application.

The following are the results of implementing web page mirroring (as performed by non-recursive wget, or by Firefox's Save As...[Web Page, Complete] feature) in such an application. The provided code will download an HTML document, parse it for JS, CSS, and image files (referred to as resources), download those files to the destination directory, rewrite the references to those resources so they refer to the local (downloaded) files, then write the updated HTML document to the destination directory.

The bulk of the work is done by Nokogiri, which parses the HTML document and supports XPath queries on its elements. Fetching remote documents is handled by Ruby's built-in net/http module.


#!/usr/bin/env ruby
# faux-wget.rb : example of performing wget-style mirroring
require 'nokogiri'
require 'net/http'
require 'fileutils'
require 'uri'

The implementation uses a single RemoteDocument class which takes a URI in its constructor and provides the high-level method mirror which downloads and localizes the document at the URI.

The mirror method invokes four helper methods:

  • html_get will download the remote document
  • Nokogiri::HTML parses the downloaded document
  • process_contents extracts resource tags from the document
  • save_locally localizes all resources

The document parsing is straightforward: Nokogiri does all the work, and process_contents simply performs XPath queries to build lists of CSS, JS and IMG tags. The document fetching is also pretty straightforward; support for response codes in the 3xx (redirect) and 4xx-5xx series (not found, unauthorized, internal error, etc) adds only a couple of lines of code.

The save_locally method begins by creating the destination directory if it does not exist. If the document contains a BASE tag, it is removed - which makes the mirrored page behave better in most web browsers. Next, the localize method is invoked on all of the resource tags. Finally, the document is written to the destination directory.

The localize method performs all of the interesting work.

It begins with a random delay of up to ten milliseconds, which is intended to keep web servers from blocking the barrage of requests when downloading the resource files. The resource URL is obtained from the HREF attribute of CSS tags, or from the SRC attribute of JS and IMG tags.

This URL can be absolute or relative (e.g. the file "http://foo.com/images/bar.jpg" may be referenced as "images/bar.jpg"), so an absolute URL is generated from the document URI if needed. The URL is then converted to a local path, meaning that the downloaded resource can end up in the directory "images" inside the destination directory.
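As an aside, Ruby's standard library can do the relative-to-absolute step on its own. The sketch below uses URI.join, which is not what the url_for method in this code does (that one uses File.join), and the helper name absolute_url is made up for the example.

```ruby
require 'uri'

# Hypothetical helper (not part of RemoteDocument): resolve a tag's HREF or
# SRC value against the URI of the document it was found in.
def absolute_url(document_uri, ref)
  URI.join(document_uri.to_s, ref).to_s
end

doc = URI.parse('http://foo.com/docs/index.html')
puts absolute_url(doc, 'images/bar.jpg')              # http://foo.com/docs/images/bar.jpg
puts absolute_url(doc, '/images/bar.jpg')             # http://foo.com/images/bar.jpg
puts absolute_url(doc, 'http://cdn.foo.com/bar.jpg')  # http://cdn.foo.com/bar.jpg
```

URI.join raises on references that are not valid URIs (unescaped spaces, for instance), so real-world pages would need a rescue around it.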

The remote resource file is then downloaded to this path, and the URL in the tag attribute (which is part of the original document's source) is replaced with the local path to the downloaded resource.

The local path generation logic in localize_url is a bit simplistic (resulting in ugly directory trees), and the mirroring process itself is non-recursive (i.e. HREF attributes in A tags are detected but ignored). The caller can easily use the RemoteDocument object to provide additional services: the parsed contents of the HTML document are available in the contents property, the metadata tags (e.g. keywords) are available in the meta property, and all links found in the document are available in the links property.


=begin rdoc
Wrap a URI and provide methods for download, parsing, and mirroring of remote HTML document.
=end
class RemoteDocument
  attr_reader :uri
  attr_reader :contents
  attr_reader :css_tags, :js_tags, :img_tags, :meta, :links


  def initialize(uri)
    @uri = uri
  end


=begin rdoc
Download, parse, and save the RemoteDocument and all resources (JS, CSS, 
images) in the specified directory.
=end
  def mirror(dir)
    source = html_get(uri)
    @contents = Nokogiri::HTML( source )
    process_contents
    save_locally(dir)
  end


=begin rdoc
Extract resources (CSS, JS, Image files) from the parsed html document.
=end
  def process_contents
    @css_tags = @contents.xpath( '//link[@rel="stylesheet"]' )
    @js_tags = @contents.xpath('//script[@src]')
    @img_tags = @contents.xpath( '//img[@src]' )
    # Note: meta tags and links are unused in this example
    find_meta_tags
    find_links
  end


=begin rdoc
Extract contents of META tags to @meta Hash.
=end
  def find_meta_tags
    @meta = {}
    @contents.xpath('//meta').each do |tag|
      last_name = name = value = nil
      tag.attributes.each do |key, attr|
        if attr.name == 'content'
          value = attr.value
        elsif attr.name == 'name'
          name = attr.value
        else
          last_name = attr.value
        end
      end
      name = last_name if not name
      @meta[name] = value if name && value
    end
  end


=begin rdoc
Generate a Hash URL -> Title of all (unique) links in document.
=end
  def find_links
    @links = {}
    @contents.xpath('//a[@href]').each do |tag| 
      @links[tag[:href]] = (tag[:title] || '') if (! @links.include? tag[:href])
    end
  end


=begin rdoc
Generate a local, legal filename for url in dir.
=end
  def localize_url(url, dir)
    path = url.gsub(/^[|[:alpha:]]+:\/\//, '')
    path.gsub!(/^[.\/]+/, '')
    path.gsub!(/[^-_.\/[:alnum:]]/, '_')
    File.join(dir, path)
  end


=begin rdoc
Construct a valid URL for an HREF or SRC parameter. This uses the document URI
to convert a relative URL ('/doc') to an absolute one ('http://foo.com/doc').
=end
  def url_for(str)
    return str if str =~ /^[|[:alpha:]]+:\/\//
    File.join((uri.path.empty?) ? uri.to_s : File.dirname(uri.to_s), str)
  end


=begin rdoc
Send GET to url, following redirects if required.
=end
  def html_get(url)
    resp = Net::HTTP.get_response(url)
    if ['301', '302', '307'].include? resp.code
      url = URI.parse resp['location']
    elsif resp.code.to_i >= 400
      $stderr.puts "[#{resp.code}] #{url}"
      return
    end
    Net::HTTP.get url
  end


=begin rdoc
Download a remote file and save it to the specified path
=end
  def download_resource(url, path)
    FileUtils.mkdir_p File.dirname(path)
    the_uri = URI.parse(url)
    if the_uri
      data = html_get the_uri
      File.open(path, 'wb') { |f| f.write(data) } if data
    end
  end


=begin rdoc
Download resource for attribute 'sym' in 'tag' (e.g. :src in IMG), saving it to
'dir' and modifying the tag attribute to reflect the new, local location.
=end
  def localize(tag, sym, dir)
    delay
    url = tag[sym]
    resource_url = url_for(url)
    dest = localize_url(url, dir)
    download_resource(resource_url, dest)
    tag[sym.to_s] = dest.partition(File.dirname(dir) + File::SEPARATOR).last
  end


=begin rdoc
Attempt to "play nice" with web servers by sleeping for a few ms.
=end
  def delay
    sleep(rand / 100)
  end


=begin rdoc
Download all resources to destination directory, rewriting in-document tags
to reflect the new resource location, then save the localized document.
Creates destination directory if it does not exist.
=end
  def save_locally(dir)
    Dir.mkdir(dir) if (! File.exist? dir)
   
    # remove HTML BASE tag if it exists
    @contents.xpath('//base').each { |t| t.remove }


    # save resources
    @img_tags.each { |tag| localize(tag, :src, File.join(dir, 'images')) }
    @js_tags.each { |tag| localize(tag, :src, File.join(dir, 'js')) }
    @css_tags.each { |tag| localize(tag, :href, File.join(dir, 'css')) }


    save_path = File.join(dir, File.basename(uri.to_s))
    save_path += '.html' if save_path !~ /\.((html?)|(txt))$/
    File.open(save_path, 'w') { |f| f.write(@contents.to_html) }
  end
end
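The html_get method above follows a single redirect and issues a second GET for every document. A variant that follows a bounded chain of redirects and reuses the first successful response might look like the sketch below. The name fetch_body, the limit of 5, and the injectable fetcher argument (there only so the logic can be exercised without a network) are all assumptions of this sketch rather than part of the original class.

```ruby
require 'net/http'
require 'uri'

# Hypothetical replacement for html_get: return the response body for uri,
# following up to `limit` redirects. Returns nil on errors.
# `fetcher` must respond to call(uri) and return an object with
# code, body and [] (header lookup), as Net::HTTP.get_response does.
def fetch_body(uri, limit: 5, fetcher: Net::HTTP.method(:get_response))
  limit.times do
    resp = fetcher.call(uri)
    code = resp.code.to_i
    return resp.body if code.between?(200, 299)

    location = resp['location']
    unless code.between?(300, 399) && location
      $stderr.puts "[#{resp.code}] #{uri}"
      return nil
    end
    # Location may be relative, so resolve it against the current URI
    uri = URI.join(uri.to_s, location)
  end
  $stderr.puts "Too many redirects: #{uri}"
  nil
end

# Usage (needs network access):
#   body = fetch_body(URI.parse('http://example.com/'))
```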


Finally, the source code file needs a driver method to make it useful from the command line. This one takes a URL and a destination directory as its parameters:


# ----------------------------------------------------------------------
if __FILE__ == $0
  if ARGV.count < 2
    $stderr.puts "Usage: #{$0} URL DIR"
    exit 1
  end


  url = ARGV.shift
  dir = ARGV.shift
  doc = RemoteDocument.new(URI.parse(url))
  doc.mirror(dir)
end