A Simple Ruby NGram Generator

I was playing around with Ruby the other night and wrote a simple n-gram generator. In case anyone is interested, here is the script:

#!/usr/bin/env ruby -w
# r_ngram.rb
# Counts the bigrams and trigrams in a text file and prints the most frequent ones to STDOUT
# Usage: ruby r_ngram.rb file.txt
# To save the output to a file: ruby r_ngram.rb file.txt > output.txt

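abort "Usage: ruby #{$PROGRAM_NAME} file.txt" if ARGV.empty?

# Read the file, lowercase it, and split it into runs of ASCII letters.
words = File.read(ARGV[0]).downcase.scan(/[a-z]+/)

bi_grams = Hash.new(0)
tri_grams = Hash.new(0)

# each_cons(n) yields every run of n adjacent words, so the final
# bigram of the text is counted too.
words.each_cons(2) { |pair| bi_grams[pair.join(' ')] += 1 }
words.each_cons(3) { |triple| tri_grams[triple.join(' ')] += 1 }

# Number of trigram positions; the report prints the top tenth of this.
num = [words.length - 2, 0].max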

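# Print the most frequent n-grams: a tenth of num, capped by first() at the
# number of distinct entries so a repetitive text cannot index past the end.
top = [num / 10, 0].max

puts "## -- bi-grams -- ##"
bi_grams.sort_by { |_, count| -count }.first(top).each { |gram, count| puts "#{gram} : #{count}" }
puts "\n"
puts "## -- tri-grams -- ##"
tri_grams.sort_by { |_, count| -count }.first(top).each { |gram, count| puts "#{gram} : #{count}" }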
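
The same counting works for any n, not just 2 and 3. Here is a minimal sketch of that idea, assuming the same tokenization as the script (lowercased runs of a-z). The method names `ngram_counts` and `top_ngrams` are made up for illustration and are not part of the script above. Breaking ties alphabetically is also my own choice, made only so the output order is predictable.

```ruby
# Hypothetical helper (not in the original script): count n-grams of any size.
def ngram_counts(text, n)
  words = text.downcase.scan(/[a-z]+/)
  counts = Hash.new(0)
  # each_cons(n) yields every run of n adjacent words, including the last one.
  words.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

# Hypothetical helper: the `limit` most frequent n-grams as [gram, count] pairs.
# Ties are broken alphabetically (an assumption, for stable output).
def top_ngrams(counts, limit)
  counts.sort_by { |gram, count| [-count, gram] }.first(limit)
end

if __FILE__ == $PROGRAM_NAME
  sample = "the cat sat on the mat and the cat slept"
  top_ngrams(ngram_counts(sample, 2), 3).each do |gram, count|
    puts "#{gram} : #{count}"
  end
end
```

To run it on a file instead of the sample string, something like `ngram_counts(File.read("file.txt"), 4)` should give four-word counts.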

April 24, 2012 @jaheppler