How a Perl script helped me learn Ruby

December 6, 2011

Back when Rails was all the rage, I wanted to learn Ruby. But it wasn’t Rails that taught me Ruby.

It was a Perl-based password generator.

I don’t remember the circumstances (it could be that pwgen wasn’t available on OS X), but for whatever reason I chose to use a script named SopPasswd.pl, by someone named Mark A. Pors. I tried to pull up his old website, but it redirects to average-photography.com now.

Here’s the script, reproduced from this link without permission:

#!/usr/bin/perl -w

# SopPasswd: A generator for Sort-of-pronounceable passwords.
# Version: 0.1
# Author: Mark A. Pors, mark@dreamzpace.com, www.dreamzpace.com
# License: GPL

use strict;

my $dict = '/usr/share/dict/words';       # path to dict file
my $wordlen = 8;                    # desired length of the password
my $numwords = 10;                 # number of passwords to print
my $sublen = 3;                     # length of the word chunks that create the password
my $sep = "\n";                     # how to separate the words

my @dict;

$wordlen >= $sublen || die "Error: The word length should be equal or larger than the length of the 'chunks'\n";

open (DICT, "<$dict") || die ("Cannot open dict: $!");
while (<DICT>) {
    chomp;
    push (@dict, $_);
}

while (1) {

    my @sub = ();
    my $word;
    my $parts = int ($wordlen/$sublen);

    for (1 .. $parts) {
        my $try = $dict[rand @dict];
        redo if length($try) < $sublen;
        $word .= lc substr($try, 0, $sublen);
    }

    my @chars = split(m{}xms, $word);
    my $upper = rand @chars;
    $chars[$upper] = uc $chars[$upper];
    $word = join(q{}, @chars);

    my $left = $wordlen % $sublen;
    $word .= substr (int rand (10**($wordlen - 1)), 0, $left);

    print $word . $sep;
    chomp (my $exit = <STDIN>);

}

That’s fairly readable for Perl, but at the time I tended to think of Perl as line noise, and this looks a little different from what I remember.

About that time, Rails was all the rage. I had an internal project at work I wanted to work on, and Rails seemed to be a good fit. I’d implemented an earlier version in Zope, but just didn’t have a machine I could dedicate to it.

One minor problem: I knew nothing about Ruby. Sure, the “blog in 10 minutes” videos were sexy, but it was clear that they were by people who knew Rails intimately.

I had checked out various tutorials, and at the time, _why’s tutorials were among the best. The thing is, _why’s style was, well, odd. It just didn’t click with me. I had put the project on the back burner for a while.

I looked at the script, realized I wanted to keep using it, but wanted it in something I was comfortable working with. And thus, SopPasswd.rb became my pet project.

What it does

All it does is read in /usr/share/dict/words, chop the words into chunks of $sublen characters, and string the chunks together into a word that’s approximately $wordlen long, padded with random digits from 0 to 9.
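To make the arithmetic concrete, here’s a minimal sketch of how a target length splits into chunks and padding digits. The helper name chunk_plan is made up for illustration; it isn’t in either script.

```ruby
# Hypothetical helper (not from the original scripts): work out how a
# target password length splits into dictionary chunks and padding digits.
def chunk_plan(wordlen, sublen)
  raise ArgumentError, "sublen must be positive" if sublen <= 0
  raise ArgumentError, "wordlen must be >= sublen" if wordlen < sublen

  {
    parts:  wordlen / sublen,  # integer division: number of chunks
    digits: wordlen % sublen   # remainder: number of padding digits
  }
end

plan = chunk_plan(8, 3)
puts "#{plan[:parts]} chunks of 3 letters + #{plan[:digits]} digits"
# prints "2 chunks of 3 letters + 2 digits"
```

So with the Perl script’s settings of 8 and 3, each password is two three-letter chunks followed by two digits.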

Now, unfortunately, I didn’t keep all my code, but I do have an old version I just resurrected from my git repo. With that in mind, I’ll revisit the train wreck of the original, as I remember it, and what I did instead.

What I did was as straight a port of the original script as I could manage, at work on a slow day. I was working on a Mac G4 Gigabit Ethernet machine at the time, which iirc ran at a blazing 400MHz. After a few minutes of thinking and poring over the Pickaxe book, I came up with a script, and ran it…

…and 30 seconds later, I had a list of passwords.

What had gone wrong?

Perl is fast.

The thing is, I did a straight port. One thing to keep in mind about Perl is that it’s fast. Blazingly fast. Asking whether it’s interpreted or compiled isn’t even a simple question: perl compiles the whole script to an internal representation at startup and then runs that. For all its warts, it’s an amazing language.

Ruby, like Python, is an interpreted language. Both languages tend to have a “fast enough” philosophy, meaning that they haven’t put the time in to make their languages as fast as Perl, preferring instead to provide ways of speeding up your code.

So, where this takes no time at all:

open (DICT, "<$dict") || die ("Cannot open dict: $!");
while (<DICT>) {
    chomp;
    push (@dict, $_);
}

This is a major bottleneck on older hardware:

dict = []
begin
    d = File.open(DICT,"r")
    d.each {|f| dict << f.chomp }
rescue
    raise RuntimeError, "#{DICT} not found"
end

Ruby is slow, but we can work around that

Before I started to learn about Ruby, I had been learning about Python, and if there was one thing I retained about Python, it was that part of the “fast enough” philosophy was knowing when to offload computation onto a built-in function. With that in mind, I pored over the Pickaxe book again, and came up with this:

begin
    dict = File.readlines(DICT)
rescue
    raise RuntimeError, "#{DICT} not found"
end

Now, readlines retains newlines, which the Perl script fixed by invoking chomp on each line. For the heck of it, I decided to try adding this:

dict.each {|f| f.chomp!}

That brought back the bottleneck.
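If you want to measure this on your own machine, here’s a rough sketch using Ruby’s standard Benchmark module. The method names and the generated sample file are assumptions for illustration, and the numbers on modern hardware will look nothing like a 400MHz G4. The third variant uses the chomp: true option to File.readlines, which only arrived in Ruby 2.4, long after I wrote the script.

```ruby
require 'benchmark'
require 'tempfile'

# Line-by-line read with chomp, like the straight port.
def load_each(path)
  dict = []
  File.open(path, "r") { |f| f.each_line { |line| dict << line.chomp } }
  dict
end

# One built-in call; every line keeps its trailing "\n".
def load_readlines(path)
  File.readlines(path)
end

# Ruby 2.4+ only: strip line endings while reading.
def load_readlines_chomp(path)
  File.readlines(path, chomp: true)
end

# Write a throwaway word list so the sketch doesn't depend on
# /usr/share/dict/words being present.
def with_sample_dict(words = 50_000)
  Tempfile.create("words") do |f|
    words.times { |i| f.puts "word#{i}" }
    f.flush
    yield f.path
  end
end

with_sample_dict do |path|
  Benchmark.bm(18) do |x|
    x.report("each + chomp")     { load_each(path) }
    x.report("readlines")        { load_readlines(path) }
    x.report("readlines chomp:") { load_readlines_chomp(path) }
  end
end
```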

So, what do we do here? Well, the ultimate question is, do we need to chomp() all the lines? Not really; when I got right down to it, when I did the sanity check for line length, I just took the word.length() and subtracted one.

So we started with this:

for (1 .. $parts) {
    my $try = $dict[rand @dict];
    redo if length($try) < $sublen;
    $word .= lc substr($try, 0, $sublen);
}

Here was the ultimate result:

parts.times do
    word = Iconv.conv("UTF-8", "ISO_8859-1", dict[rand(dictlen).to_i])
    wordlen = word.length - 1
    redo if wordlen < sublen
    wrange = rand(wordlen-sublen).to_i
    myword = word[wrange..wrange+sl]
    redo if myword =~ badmatch or not myword =~ vowel
    print myword.downcase
end

I know that’s a lot longer, but bear with me.

This line:

word = Iconv.conv("UTF-8", "ISO_8859-1", dict[rand(dictlen).to_i])

was necessary because I was getting an error when I moved this from Ruby 1.8 to Ruby 1.9. I haven’t tried it lately (more on that below), but hopefully that kludge isn’t necessary anymore; it converts ISO_8859-1 text to UTF-8.
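For what it’s worth, Iconv was dropped from Ruby’s standard library in 2.0, so on a current Ruby the same conversion would have to go through String#encode. This is a minimal sketch, assuming the dictionary really is Latin-1; the helper name latin1_to_utf8 is hypothetical.

```ruby
# Hypothetical helper: treat the incoming bytes as ISO-8859-1 and
# transcode them to UTF-8, like Iconv.conv("UTF-8", "ISO_8859-1", str).
def latin1_to_utf8(bytes)
  bytes.encode("UTF-8", "ISO-8859-1")
end

raw  = "caf\xE9\n".b           # "café" plus newline, as Latin-1 bytes
utf8 = latin1_to_utf8(raw)

puts utf8.encoding             # UTF-8
puts utf8.valid_encoding?      # true
```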

wordlen = word.length - 1

Instead of running chomp(), I subtract one from the line length, which has the same effect.
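A quick sketch of that shortcut, with a made-up helper name. One caveat: it assumes every line ends in exactly one "\n". A final line with no newline, or a file with "\r\n" endings, would be off by one.

```ruby
# Hypothetical helper: length of a dictionary line without its newline,
# computed without allocating a chomped copy.
def effective_length(line)
  line.length - 1
end

line = "password\n"
puts effective_length(line)   # 8
puts line.chomp.length        # 8, same answer via chomp
```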

redo if wordlen < sublen

This goes back to the top of the loop if the chosen word is shorter than sublen.

wrange = rand(wordlen-sublen).to_i

This chooses a random starting offset: an integer from 0 up to (but not including) wordlen minus sublen.

myword = word[wrange..wrange+sl]

If you’re not familiar with Ruby, this takes the characters from index wrange through wrange + sl, inclusive. Since sl is sublen - 1, that’s a chunk of sublen characters.
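Here’s a small sketch of that slice next to the start-plus-length form, which gives the same chunk without needing sl. Both helper names are invented for the example.

```ruby
# Inclusive range, the way the script does it (sl = sublen - 1).
def chunk_by_range(word, offset, sublen)
  sl = sublen - 1
  word[offset..offset + sl]
end

# Equivalent start + length form.
def chunk_by_length(word, offset, sublen)
  word[offset, sublen]
end

puts chunk_by_range("dictionary", 2, 3)    # cti
puts chunk_by_length("dictionary", 2, 3)   # cti
```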

redo if myword =~ badmatch or not myword =~ vowel

There are two regular expressions defined before that:

badmatch = /\W|\'/
vowel = /^[aeiou]/i

badmatch checks for non-word characters, including apostrophes. vowel is a case-insensitive check that the chunk starts with a vowel. I decided that requiring a vowel in each chunk led to more usable passwords.
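To see what the two patterns accept and reject, here’s a minimal sketch that wraps them in a predicate. The method name usable_chunk? is hypothetical, and match? needs Ruby 2.4 or later. Note that an apostrophe is already a non-word character, so the second alternative in badmatch is redundant but harmless.

```ruby
BADMATCH = /\W|\'/       # any non-word character (apostrophe included)
VOWEL    = /^[aeiou]/i   # chunk must begin with a vowel, any case

# Hypothetical predicate: true when the script would keep the chunk,
# false when it would hit `redo`.
def usable_chunk?(chunk)
  chunk.match?(VOWEL) && !chunk.match?(BADMATCH)
end

%w[ant Ora cat o'c a-b].each do |chunk|
  puts "#{chunk}: #{usable_chunk?(chunk)}"
end
```

Only "ant" and "Ora" pass: "cat" doesn’t start with a vowel, and the other two contain non-word characters.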

And, the random numbers

So then this

my $left = $wordlen % $sublen;
$word .= substr (int rand (10**($wordlen - 1)), 0, $left);

Became this:

num.times {print rand(10).to_s}

and num was calculated like this:

num = wordlen % sublen

The Perl users probably noticed the original script builds each password up in a string, while mine just sends stuff to stdout. Building strings tended to be another bottleneck on the old machine, and print worked fine for me.
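If you’d rather have the padding as a string than print it digit by digit, a minimal sketch could look like this. The method name padding_digits is made up, and rand here is Ruby’s ordinary pseudo-random generator, same as in the script.

```ruby
# Hypothetical helper: return `num` random decimal digits as one string.
def padding_digits(num)
  Array.new(num) { rand(10) }.join
end

wordlen = 8
sublen  = 3
puts padding_digits(wordlen % sublen)   # two random digits
```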

The whole thing:

So to keep this a little shorter: I had started on this toy problem with the goal of getting the Ruby version as close to Perl performance as possible, while keeping the line count close too. Since I was using it all the time but wanted different options, I eventually changed the word length, chunk length, and number of words from hardcoded values to command-line options with sane defaults.

So, here’s the whole thing:

#!/usr/bin/env ruby1.9
# Port of the Perl script SopPassword, now ported to Ruby 1.9
def usage
  puts <<END
SopPasswd: A generator for Sort-of-pronounceable passwords.
Version: 0.9
Author of Perl version: Mark A. Pors <mark@dreamzpace.com>
Ruby port:  Shane Simmons <regeya@earthlink.net>
License: GPL
Usage: SopPasswd [OPTION]
   -h, --help:       this message
   -l, --wordlen:    Change the length of the word (default: 8)
   -s, --sublen:     Change the length of word segments (default: 3)
   -n, --numwords:   Change the number of words (default: 20)
   -d, --dictionary: Full path to dict/words (default: /usr/share/dict/words)
END
  Process.exit
end

require 'getoptlong'
require 'iconv'

fn = '/usr/share/dict/words'

wordlen = 8
numwords = 20
sublen = 3
word = ""
list = []

opts = GetoptLong.new(
     ["--wordlen", "-l", GetoptLong::REQUIRED_ARGUMENT],
     ["--numwords", "-n", GetoptLong::REQUIRED_ARGUMENT],
     ["--sublen", "-s", GetoptLong::REQUIRED_ARGUMENT],
     ["--help","-h", GetoptLong::NO_ARGUMENT],
     ["--dictionary","-d", GetoptLong::REQUIRED_ARGUMENT])

opts.each do |opt, arg|
    case opt
        when "--help" then usage
        when "--wordlen" then wordlen = arg.to_i
        when "--numwords" then numwords = arg.to_i
        when "--sublen" then sublen = arg.to_i
        when "--dictionary" then fn = arg
    end
end

parts = (wordlen/sublen).to_i
sl = sublen - 1
badmatch = /\W|\'/
vowel = /^[aeiou]/i
dict = File.readlines(fn)
dictlen = dict.length
num = wordlen % sublen

numwords.times do
    parts.times do
        word = Iconv.conv("UTF-8", "ISO_8859-1", dict[rand(dictlen).to_i])
        wordlen = word.length - 1
        redo if wordlen < sublen
        wrange = rand(wordlen-sublen).to_i
        myword = word[wrange..wrange+sl]
        redo if myword =~ badmatch or not myword =~ vowel
        print myword.downcase
    end
    num.times {print rand(10).to_s}
    puts
end

So, that’s still working for you?

Do I still use it? Did I switch to pwgen? Nah, I decided to switch toy problems and write a new password generator as a Markov chain generator. It’s almost instantaneous, doesn’t have to know anything about the language of the dictionary file, and doesn’t have to read in the dictionary every time. The output is identical.
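That generator isn’t shown here, so what follows is only a rough sketch of the general idea, a character-level Markov chain, and not the actual code. Every name in it is hypothetical, and the tiny WORDS list stands in for a real dictionary file.

```ruby
# Stand-in for a dictionary file; any word list would do.
WORDS = %w[password generator dictionary pronounce random letter
           markov chain simple example railway mountain].freeze

# Map each `order`-character prefix to the characters seen right after it.
def build_chain(words, order = 2)
  chain = Hash.new { |h, k| h[k] = [] }
  words.each do |w|
    w = w.downcase
    (0...(w.length - order)).each do |i|
      chain[w[i, order]] << w[i + order]
    end
  end
  chain
end

# Walk the chain: start from a random prefix, then keep appending a
# character that followed the current last `order` characters somewhere
# in the word list.
def markov_word(chain, length, order = 2, rng = Random.new)
  word = chain.keys.sample(random: rng).dup
  while word.length < length
    followers = chain.fetch(word[-order, order], nil)
    if followers.nil?
      # Dead end (prefix only seen at the end of a word): splice in a
      # fresh random prefix and carry on.
      word << chain.keys.sample(random: rng)
    else
      word << followers.sample(random: rng)
    end
  end
  word[0, length]
end

chain = build_chain(WORDS)
5.times { puts markov_word(chain, 8) }
```

Random is not a cryptographic generator, so treat this as a toy. The chain itself is just a hash, which is what would make it possible to build it once and save it rather than rereading the word list on every run.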