The Sorry Scheme of Things Entire: February 2012

One of the amusing diversions available to writers of compiled-language programs was the challenge of writing a quine (after W.V.O., of course): a program which, when executed, prints its own source code.

Attempts in C range from the ugly but obvious (repeating the source code as a string, and essentially printing the string twice) to the downright obfuscated.
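For comparison, the string-repetition trick carries over to Ruby in a single line. The following is a minimal sketch, not taken from any of the C attempts mentioned here; the names program, captured, and original_stdout are purely illustrative. It stores a one-line quine as a string, runs it with standard output redirected, and checks that what it printed matches its own source.

```ruby
require 'stringio'

# A one-line quine: the string s is a template for the whole program.
# Formatting s with itself replaces %p with s.inspect (s in double quotes)
# and turns %% back into a single %.
program = 's = "s = %p; puts s %% s"; puts s % s'

# Run the program with $stdout redirected so that its output can be
# compared against its own source text.
captured = StringIO.new
original_stdout = $stdout
begin
  $stdout = captured
  eval(program)
ensure
  $stdout = original_stdout
end

# Report whether the output (minus the trailing newline) equals the source.
puts captured.string.chomp == program
```

Note that the one-liner only works without comments or extra whitespace: anything added to the source must also be reproduced by the template.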

Things are different in dynamic languages, of course, and there's a reason why this problem was never a RubyQuiz (UPDATE: it was! with a necessary extra 'piping' condition): any Ruby program can be turned into a quine with the following code:

if __FILE__ == $0
  puts File.open(__FILE__, 'r') { |f| f.read }
end

But perhaps something more can be done? Perhaps the source code and the bytecode can be displayed.

It turns out that the RubyVM object in YARV-based Ruby interpreters makes this possible:

#!/usr/bin/env ruby

def no_op
  $some_variable ||= 1024
end

if __FILE__ == $0
  buf = File.open(__FILE__, 'r') { |f| f.read }
  code = RubyVM::InstructionSequence.compile(buf)
  puts buf
  puts code.disasm
end

Running this under ruby-1.9.2-head produces the following output:

bash$ rvm ruby-1.9.2-head do ruby quine.rb

#!/usr/bin/env ruby

def no_op
  $some_variable ||= 1024
end

if __FILE__ == $0
  buf = File.open(__FILE__, 'r') { |f| f.read }
  code = RubyVM::InstructionSequence.compile(buf)
  puts buf
  puts code.disasm
end

== disasm: <RubyVM::InstructionSequence:<compiled>@<compiled>>==========
== catch table
| catch type: break st: 0029 ed: 0046 sp: 0000 cont: 0046
|------------------------------------------------------------------------
local table (size: 3, argc: 0 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 3] buf [ 2] code
0000 trace 1 ( 3)
0002 putspecialobject 1
0004 putspecialobject 2
0006 putobject :no_op
0008 putiseq no_op
0010 send :"core#define_method", 3, nil, 0, <ic:0>
0016 pop
0017 trace 1 ( 7)
0019 putstring "<compiled>"
0021 getglobal $0
0023 opt_eq <ic:9>
0025 branchunless 100
0027 trace 1 ( 8)
0029 getinlinecache 36, <ic:2>
0032 getconstant :File
0034 setinlinecache <ic:2>
0036 putstring "<compiled>"
0038 putstring "r"
0040 send :open, 2, block in <compiled>, 0, <ic:3>
0046 setlocal buf
0048 trace 1 ( 9)
0050 getinlinecache 59, <ic:4>
0053 getconstant :RubyVM
0055 getconstant :InstructionSequence
0057 setinlinecache <ic:4>
0059 getlocal buf
0061 send :compile, 1, nil, 0, <ic:5>
0067 setlocal code
0069 trace 1 ( 10)
0071 putnil
0072 getlocal buf
0074 send :puts, 1, nil, 8, <ic:6>
0080 pop
0081 trace 1 ( 11)
0083 putnil
0084 getlocal code
0086 send :disasm, 0, nil, 0, <ic:7>
0092 send :puts, 1, nil, 8, <ic:8>
0098 leave ( 7)
0099 pop
0100 putnil ( 11)
0101 leave
== disasm: <RubyVM::InstructionSequence:no_op@<compiled>>===============
0000 trace 8 ( 3)
0002 trace 1 ( 4)
0004 putnil
0005 defined 7, :$some_variable, false
0009 branchunless 17
0011 getglobal $some_variable
0013 dup
0014 branchif 22
0016 pop
0017 putobject 1024
0019 dup
0020 setglobal $some_variable
0022 trace 16 ( 5)
0024 leave ( 4)
== disasm: <RubyVM::InstructionSequence:block in <compiled>@<compiled>>=
== catch table
| catch type: redo st: 0000 ed: 0011 sp: 0000 cont: 0000
| catch type: next st: 0000 ed: 0011 sp: 0000 cont: 0011
|------------------------------------------------------------------------
local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1] s3)
[ 2] f<Arg>
0000 trace 1 ( 8)
0002 getdynamic f, 0
0005 send :read, 0, nil, 0, <ic:0>
0011 leave
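
The same API can be tried on a much smaller input. This is a minimal sketch, assuming an MRI (YARV-based) interpreter, since RubyVM is not guaranteed to exist in other Ruby implementations; the variable names iseq, listing, and result are illustrative only.

```ruby
# RubyVM is specific to MRI/YARV; this sketch assumes it is available.
iseq = RubyVM::InstructionSequence.compile('1 + 2')

listing = iseq.disasm  # human-readable bytecode listing, returned as a String
result  = iseq.eval    # executes the compiled instruction sequence

puts listing
puts result
```

The exact instructions in the listing differ between Ruby versions, so the output will probably not match the 1.9.2 disassembly shown above line for line.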

Web page mirroring (wget) in Ruby

Doing a basic search for 'wget in ruby' or 'web page mirroring in ruby' generally turns up a number of unimaginative (not to mention lazy) examples of invoking wget from a sub-shell. While these solutions often meet the immediate needs of their authors, they provide no guidance for those who need to implement the mirroring of remote documents as part of a larger data processing or data management application.

The following are the results of implementing web page mirroring (as performed by non-recursive wget, or by Firefox's Save As...[Web Page, Complete] feature) in such an application. The provided code will download an HTML document, parse it for JS, CSS, and image files (referred to as resources), download those files to the destination directory, rewrite the references to those resources so they refer to the local (downloaded) files, then write the updated HTML document to the destination directory.

The bulk of the work is done by Nokogiri, which parses the HTML document and supports XPath queries on its elements. Fetching remote documents is handled by Ruby's built-in net/http module.

#!/usr/bin/env ruby
# faux-wget.rb : example of performing wget-style mirroring
require 'nokogiri'
require 'net/http'
require 'fileutils'
require 'uri'

The implementation uses a single RemoteDocument class which takes a URI in its constructor and provides the high-level method mirror which downloads and localizes the document at the URI.

The mirror method performs four steps:

html_get downloads the remote document
Nokogiri::HTML parses the downloaded document
process_contents extracts resource tags from the document
save_locally localizes all resources

The document parsing is straightforward: Nokogiri does all the work, and process_contents simply performs XPath queries to build lists of CSS, JS and IMG tags. The document fetching is also straightforward; support for response codes in the 3xx (redirect) and 4xx-5xx series (not found, unauthorized, internal error, etc.) adds only a few lines of code.
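
The response handling can be illustrated without touching the network. The sketch below is an assumption-laden illustration rather than the code used in RemoteDocument: classify_response is a hypothetical helper, and the response objects are constructed by hand instead of being received from a server.

```ruby
require 'net/http'

# Hypothetical helper: decide what to do with a response.
# Net::HTTP maps status codes to classes, so 2xx responses are
# Net::HTTPSuccess and 3xx responses are Net::HTTPRedirection.
def classify_response(resp)
  case resp
  when Net::HTTPSuccess     then [:ok, nil]
  when Net::HTTPRedirection then [:redirect, resp['location']]
  else                           [:error, resp.code]
  end
end

# Hand-built responses stand in for real server replies.
ok      = Net::HTTPOK.new('1.1', '200', 'OK')
found   = Net::HTTPFound.new('1.1', '302', 'Found')
found['location'] = 'http://foo.com/new'
missing = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

p classify_response(ok)
p classify_response(found)
p classify_response(missing)
```

Matching on the response class rather than on a list of code strings means that less common redirects (303, 308) are handled the same way as 301 and 302.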

The save_locally method begins by creating the destination directory if it does not exist. If the document contains a BASE tag, it is removed - which makes the mirrored page behave better in most web browsers. Next, the localize method is invoked on all of the resource tags. Finally, the document is written to the destination directory.

The localize method performs all of the interesting work.

It begins with a short random delay (up to ten milliseconds), which is intended to keep web servers from blocking the barrage of requests when downloading the resource files. The resource URL is obtained from the HREF attribute of CSS tags, or from the SRC attribute of JS and IMG tags.
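
The size of that delay follows from the arithmetic in the delay method, which calls sleep(rand / 100). As a rough sketch (delay_seconds and its max_ms parameter are hypothetical names, not part of the class), the same calculation can be checked without actually sleeping:

```ruby
# rand returns a Float in the range [0, 1), so scaling it by max_ms / 1000
# yields a duration between zero and max_ms milliseconds, in seconds.
# With the default of 10 this is the same as rand / 100.
def delay_seconds(max_ms = 10)
  rand * max_ms / 1000.0
end

samples = Array.new(1000) { delay_seconds }
puts "shortest: #{samples.min} s"
puts "longest:  #{samples.max} s"
```

A larger max_ms would be kinder to the remote server at the cost of a slower mirror; the right value depends on the site being fetched.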

This URL can be absolute or relative (e.g. the file "http://foo.com/images/bar.jpg" may be referenced as "images/bar.jpg"), so an absolute URL is generated from the document URI if needed. The URL is then converted to a local path, meaning that the downloaded resource can end up in the directory "images" inside the destination directory.
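
Both steps can be sketched with the standard library alone. The helpers below are hypothetical (absolute_url and local_path are not the names used in the class) and assume that the reference is a well-formed URL; URI.join is used here as one possible way to resolve a reference against the document's URL.

```ruby
require 'uri'

# Hypothetical helper: resolve a reference found in a document against
# the URL of the document itself. Absolute references pass through.
def absolute_url(document_url, ref)
  URI.join(document_url, ref).to_s
end

# Hypothetical helper: turn a URL into a file path below dir.
def local_path(url, dir)
  path = url.sub(%r{\A[[:alpha:]][[:alnum:]+.\-]*://}, '')  # drop the scheme
  path = path.sub(%r{\A[./]+}, '')                          # drop leading ./ or /
  path = path.gsub(%r{[^[:alnum:]_./\-]}, '_')              # replace unsafe characters
  File.join(dir, path)
end

doc = 'http://foo.com/index.html'

relative = absolute_url(doc, 'images/bar.jpg')
rooted   = absolute_url('http://foo.com/a/b.html', '/images/bar.jpg')
external = absolute_url(doc, 'http://cdn.example.org/lib.js')

puts relative, rooted, external
puts local_path(relative, 'mirror')
```

Keeping the host name in the local path avoids collisions between resources that share a file name on different servers, which is also why the resulting directory trees are not pretty.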

The remote resource file is then downloaded to this path, and the URL in the tag attribute (which is part of the original document's source) is replaced with the local path to the downloaded resource.

The local path generation logic in localize_url is a bit simplistic (resulting in ugly directory trees), and the mirroring process itself is non-recursive (i.e. HREF attributes in A tags are detected but ignored). The caller can easily use the RemoteDocument object to provide additional services: the parsed contents of the HTML document are available in the contents property, the metadata tags (e.g. keywords) are available in the meta property, and all links found in the document are available in the links property.

=begin rdoc
Wrap a URI and provide methods for downloading, parsing, and mirroring a remote HTML document.
=end
class RemoteDocument
  attr_reader :uri
  attr_reader :contents
  attr_reader :css_tags, :js_tags, :img_tags, :meta, :links

  def initialize(uri)
    @uri = uri
  end

=begin rdoc
Download, parse, and save the RemoteDocument and all resources (JS, CSS,
images) in the specified directory.
=end
  def mirror(dir)
    source = html_get(uri)
    @contents = Nokogiri::HTML(source)
    process_contents
    save_locally(dir)
  end

=begin rdoc
Extract resources (CSS, JS, Image files) from the parsed html document.
=end
  def process_contents
    @css_tags = @contents.xpath('//link[@rel="stylesheet"]')
    @js_tags = @contents.xpath('//script[@src]')
    @img_tags = @contents.xpath('//img[@src]')
    # Note: meta tags and links are unused in this example
    find_meta_tags
    find_links
  end

=begin rdoc
Extract contents of META tags to @meta Hash.
=end
  def find_meta_tags
    @meta = {}
    @contents.xpath('//meta').each do |tag|
      last_name = name = value = nil
      tag.attributes.each do |_key, attr|
        if attr.name == 'content'
          value = attr.value
        elsif attr.name == 'name'
          name = attr.value
        else
          # e.g. http-equiv or property: used as the key if no name is present
          last_name = attr.value
        end
      end
      name ||= last_name
      @meta[name] = value if name && value
    end
  end

=begin rdoc
Generate a Hash URL -> Title of all (unique) links in document.
=end
  def find_links
    @links = {}
    @contents.xpath('//a[@href]').each do |tag|
      # keep the first title seen for each URL
      @links[tag[:href]] = (tag[:title] || '') unless @links.include?(tag[:href])
    end
  end

=begin rdoc
Generate a local, legal filename for url in dir.
=end
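  def localize_url(url, dir)
    # strip the scheme ('http://'), then any leading './' or '/'
    path = url.gsub(/^[|[:alpha:]]+:\/\//, '')
    path.gsub!(/^[.\/]+/, '')
    # replace anything that is not safe in a filename
    path.gsub!(/[^-_.\/[:alnum:]]/, '_')
    File.join(dir, path)
  end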

=begin rdoc
Construct a valid URL for an HREF or SRC parameter. This uses the document URI
to convert a relative URL ('/doc') to an absolute one ('http://foo.com/doc').
=end
  def url_for(str)
    return str if str =~ /^[|[:alpha:]]+:\/\//
    # URI.join resolves 'doc', '/doc' and '../doc' against the document URI
    URI.join(uri.to_s, str).to_s
  end

=begin rdoc
Send GET to url, following redirects if required.
=end
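  def html_get(url, limit = 5)
    resp = Net::HTTP.get_response(url)
    case resp
    when Net::HTTPRedirection
      # give up on redirect loops and on redirects without a Location header
      return if limit <= 0 || resp['location'].nil?
      html_get(URI.join(url.to_s, resp['location']), limit - 1)
    when Net::HTTPSuccess
      resp.body
    else
      $stderr.puts "[#{resp.code}] #{url}"
      nil
    end
  end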

=begin rdoc
Download a remote file and save it to the specified path
=end
  def download_resource(url, path)
    FileUtils.mkdir_p File.dirname(path)
    the_uri = URI.parse(url)
    if the_uri
      data = html_get the_uri
      File.open(path, 'wb') { |f| f.write(data) } if data
    end
  end

=begin rdoc
Download resource for attribute 'sym' in 'tag' (e.g. :src in IMG), saving it to
'dir' and modifying the tag attribute to reflect the new, local location.
=end
  def localize(tag, sym, dir)
    delay
    url = tag[sym]
    resource_url = url_for(url)
    dest = localize_url(url, dir)
    download_resource(resource_url, dest)
    # store the path relative to the directory holding the saved document
    tag[sym.to_s] = dest.partition(File.dirname(dir) + File::SEPARATOR).last
  end

=begin rdoc
Attempt to "play nice" with web servers by sleeping for a few ms.
=end
  def delay
    sleep(rand / 100)
  end

=begin rdoc
Download all resources to destination directory, rewriting in-document tags
to reflect the new resource location, then save the localized document.
Creates destination directory if it does not exist.
=end
  def save_locally(dir)
    Dir.mkdir(dir) unless File.exist?(dir)

    # remove HTML BASE tag if it exists
    @contents.xpath('//base').each { |t| t.remove }

    # save resources
    @img_tags.each { |tag| localize(tag, :src, File.join(dir, 'images')) }
    @js_tags.each { |tag| localize(tag, :src, File.join(dir, 'js')) }
    @css_tags.each { |tag| localize(tag, :href, File.join(dir, 'css')) }

    save_path = File.join(dir, File.basename(uri.to_s))
    save_path += '.html' if save_path !~ /\.((html?)|(txt))$/
    File.open(save_path, 'w') { |f| f.write(@contents.to_html) }
  end
end

Finally, the source code file needs a driver to make it useful from the command line. This one takes a URL and a destination directory as its parameters:

# ----------------------------------------------------------------------
if __FILE__ == $0
  if ARGV.count < 2
    $stderr.puts "Usage: #{$0} URL DIR"
    exit 1
  end

  url = ARGV.shift
  dir = ARGV.shift
  doc = RemoteDocument.new(URI.parse(url))
  doc.mirror(dir)
end

Tuesday, February 14, 2012

Ruby: Make any program a Quine

Saturday, February 4, 2012

Web page mirroring (wget) in Ruby