FREQr Version 2

Late last year I released FREQr, a word frequency generator, into the wild on GitHub. I’ve since updated the program to version 2.

The original version of FREQr generated word frequencies from a text file. The new version can read files off the web, which makes the program much more useful. It also writes the frequencies to an HTML file for display within digital history projects. Finally, I’ve added a new feature: word clouds. FREQr now outputs a word cloud based on the frequencies, also formatted for display on the web.
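
The core idea (clean the text, drop stopwords, count what is left) can be sketched in a few lines. This is a minimal illustration rather than the program itself: it works on an inline sample string instead of a URL so it runs offline, and `SAMPLE_STOPWORDS`, `clean_words`, and `word_frequencies` are hypothetical names that do not appear in FREQr.

```ruby
# Hypothetical, shortened stopword list for illustration only.
SAMPLE_STOPWORDS = %w{a and of the to}

# Strip HTML tags, digits, and punctuation, lower-case the text,
# split it into words, and drop the stopwords.
def clean_words(text)
  text.gsub(/<\/?[^>]*>/, "")
      .gsub(/[0-9]+/, "")
      .gsub(/[(,?!'":.)]/, "")
      .downcase
      .split(" ") - SAMPLE_STOPWORDS
end

# Count how often each remaining word occurs.
def word_frequencies(words)
  frequency = Hash.new(0)
  words.each { |word| frequency[word] += 1 }
  frequency
end

sample = "<p>The history of the West, and the West of 1890.</p>"
freqs = word_frequencies(clean_words(sample))

# Print the most frequent words first, as "count, word".
freqs.sort_by { |_word, count| -count }.each do |word, count|
  puts "#{count}, #{word}"
end
```

Using `Hash.new(0)` gives every unseen word a starting count of zero, so the counting loop needs no special case for the first occurrence.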

Here’s the full code, or you can download a copy from GitHub:

#!/usr/bin/ruby -w

# FREQr.rb
# Written by Jason A. Heppler
# This program is free software.
# You can distribute/modify this program under the terms of
# the GNU Lesser General Public License version 2.1.
# Last Modified: Sun Feb 12 16:08:06 CST 2011
require 'open-uri'
STOPWORDS = %w{a about above across after again against all am an and any are arent as at be because been before being below between both but by cant cannot could couldnt did didnt do does doesnt doing dont down during each few for from further had hadnt has hasnt have havent having he her here heres hers herself him himself his how i id ill im ive if in into is isnt it its itself lets me more most mustnt my myself no nor not of off on once only or other ought our ours ourselves out over own same shant she should shouldnt so some such than that the their theirs them themselves then there these they this those through to too under until up very was we were what when where which while who why with would you your yours yourself yourselves}

# Tag Cloud class for generating a word cloud from a word frequency
# =================================================================
class TagCloud
    attr_accessor :word_class

    def initialize(words)
        @wordcount = count_words(words)
    end

    # Count occurrences of each non-empty word
    def count_words(words)
        wordcount = {}
        words.each do |word|
            if word.strip.size > 0
                unless wordcount.key?(word.strip)
                    wordcount[word.strip] = 0
                end
                wordcount[word.strip] = wordcount[word.strip] + 1
            end
        end
        wordcount
    end

    # Points of font size per occurrence, spread over an 18pt range
    def font_ratio(wordcount={})
        min, max = 100000, -1000000
        wordcount.each_key do |word|
            max = wordcount[word] if wordcount[word] > max
            min = wordcount[word] if wordcount[word] < min
        end
        # Avoid dividing by zero when every word has the same count
        return 0.0 if max == min
        18.0 / (max - min)
    end

    # Build the cloud as a string of <span> elements
    def build
        cloud = String.new
        ratio = font_ratio(@wordcount)
        @wordcount.each_key do |word|
            font_size = (9 + (@wordcount[word] * ratio))
            cloud << %Q{<span#{" class=\"" + word_class + "\"" unless word_class.nil? } style="font-size:#{font_size}pt;">#{word}</span> }
        end
        cloud
    end
end

# Strip out HTML tags, digits, and punctuation, then
# lower-case all words, split the words apart, and remove stopwords
# ===================================================================
def readFile(url)
    # URI.open comes from open-uri; plain open(url) no longer fetches URLs on Ruby 3.0+
    uri_file = URI.open(url).read.gsub(/<\/?[^>]*>/, "").gsub(/&quot;*/, "").gsub(/[0-9]+/, "").gsub(/[(,?!'":.)]/, '').downcase.split(' ') - STOPWORDS
    return uri_file
end

# Create a dictionary of n-grams
# ==============================
url = ARGV[0]
uri_file = readFile(url)

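# Save output to HTML
# ===================
File.open("output.html", "w") do |output|
    # Default each count to 0 so unseen words can be incremented directly
    frequency = Hash.new(0)
    uri_file.each { |word| frequency[word] += 1 }
    # Write the most frequent words first
    frequency.sort_by { |_word, count| count }.reverse.each do |w, f|
        output.write "<p>#{f}, #{w}</p>\n"
    end
end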

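# Generate a word cloud and save as HTML
# ======================================
File.open("wordcloud.html", "w") do |output|
    cloud = TagCloud.new(uri_file)
    cloud.word_class = "freq-cloud-css"
    # Write the generated <span> markup to the file
    output.write cloud.build
end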

# Give the user an exported-to message
# ====================================
puts "\nFile exported to #{Dir.pwd}.\n"
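
The font scaling inside TagCloud is plain arithmetic: the ratio is 18.0 divided by the gap between the highest and lowest counts, and each word is sized at 9 plus its count times that ratio. Here is a standalone sketch of that calculation. The `font_sizes` helper and the sample counts are made up for illustration, and the equal-counts guard is an assumption added to avoid dividing by zero.

```ruby
# Hypothetical helper mirroring the arithmetic in TagCloud#font_ratio and TagCloud#build.
def font_sizes(wordcount)
  min, max = wordcount.values.minmax
  # Assumption: fall back to a ratio of 0 when all counts are equal (or the hash is empty).
  ratio = max == min ? 0.0 : 18.0 / (max - min)
  wordcount.transform_values { |count| 9 + count * ratio }
end

# Made-up counts, not taken from any real text.
sizes = font_sizes("west" => 10, "history" => 4, "railroad" => 1)
sizes.each { |word, size| puts "#{word}: #{size}pt" }
```

With these counts the ratio is 2.0, so the sizes span exactly 18 points between the least and the most frequent word.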