Merging Adobe PDFs and generating a table of contents on the fly using Ruby

So let's say you have some random PDFs and what you want is one PDF that includes all of the original PDF files plus a table of contents listing each file with its proper page number. In Ruby this is not too hard to put together. There is a wealth of plugins, gems, and other Ruby software available for manipulating and creating PDFs (a thorough list can be found here - http://wiki.rubyonrails.org/rails/pages/HowtoGeneratePDFs). To get this project up and running we are going to use two of them: PDF::Writer (http://rubyforge.org/projects/ruby-pdf/) and PDFTK (http://www.accesspdf.com/pdftk/). If you want to get fancier and also include text, HTML, or XML documents you can use PDF::Htmldoc (http://htmldoc.rubyforge.org/), which requires Htmldoc to be installed. Before I get started, though, I also have to give thanks to George Anderson over at Benevolent Code, who wrote a lot of similar code on the project and provided me with some great examples.

In the project that I wrote this for, we began by creating a wrapper class for the PDFTK gem mentioned earlier. The basic shell of the method we will be using accepts a title for the new PDF and a list of files, and it begins and ends by creating and removing a temp directory, respectively.

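require 'fileutils'
require 'pdf/writer'

class PdfTool < PDF::Toolkit
  def self.concat_and_make_table_of_contents(title, files)
    begin
      # make_temp is not shown here; it is assumed to return a fresh temp path
      tmpdir = make_temp('')
      FileUtils.mkdir(tmpdir)

      # the rest of the code is going here

    ensure
      # remove the temp directory and everything in it, even after an error
      FileUtils.rm_rf(tmpdir) if tmpdir
    end
  end
end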
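
The create-then-clean-up pattern in that shell can also be sketched with Ruby's standard library alone. This is only a sketch under assumptions, not the project's code: it swaps the make_temp helper for Dir.mktmpdir, whose block form creates the directory and removes it again when the block exits, even if an exception is raised. The name with_scratch_dir is made up for illustration.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: yields a fresh temporary directory and removes it
# afterwards. Dir.mktmpdir (stdlib) handles both creation and cleanup.
def with_scratch_dir(prefix = 'pdf-merge')
  Dir.mktmpdir(prefix) do |dir|
    yield dir
  end
end

# Usage: the directory exists inside the block and is gone afterwards.
seen = nil
existed_inside = false
with_scratch_dir do |dir|
  seen = dir
  File.write(File.join(dir, 'toc.pdf'), 'placeholder')
  existed_inside = File.directory?(dir)
end
```

With this approach there is no separate ensure clause to forget, which is the main reason to consider it.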

Following this we are going to have to set up PDF::Writer for creating the table of contents. Unfortunately PDF::Writer is a bit messy to get going, but we are only creating a table of contents with the simplest format possible, so it will not be too bad. Of course, if you do not want this hard coded, you could easily rewrite concat_and_make_table_of_contents so that it takes a hash of formatting variables. Here is the setup of PDF::Writer as well as the printing of the title.

      @text_font = "Times-Roman"
      @title_font_size = 17
      @title_vertical_spacing = 15
      @page_font_size = 12
      @page_vertical_spacing = 8

      print_pdf = PDF::Writer.new(:paper => "A4")
      print_pdf.margins_pt(36)
      @top_heading_font_position = 10

      print_pdf.select_font @text_font
      print_pdf.text("<b>#{title}</b>", :font_size => @title_font_size, :justification => :center, :left => @top_heading_font_position)
      print_pdf.move_pointer(@title_vertical_spacing)
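
The hash of formatting variables mentioned above could look something like the following. This is a rough sketch, not part of the original project: DEFAULT_TOC_FORMAT and toc_format are hypothetical names, and the default values are simply the hard-coded numbers from the setup code.

```ruby
# Defaults copied from the hard-coded values in the setup code.
DEFAULT_TOC_FORMAT = {
  :text_font              => 'Times-Roman',
  :title_font_size        => 17,
  :title_vertical_spacing => 15,
  :page_font_size         => 12,
  :page_vertical_spacing  => 8
}.freeze

# Hypothetical helper: merges caller overrides over the defaults and
# rejects unknown keys so that typos fail loudly.
def toc_format(overrides = {})
  unknown = overrides.keys - DEFAULT_TOC_FORMAT.keys
  raise ArgumentError, "unknown format keys: #{unknown.inspect}" unless unknown.empty?
  DEFAULT_TOC_FORMAT.merge(overrides)
end

# Usage: override one value and keep the rest.
fmt = toc_format(:title_font_size => 20)
```

The method would then read fmt[:title_font_size] and friends instead of the instance variables.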

Next we need to go through the files and generate the table of contents. First we filter the array so that it only contains files that exist and are PDFs (this will need to change if you decide to add HTML, etc. conversion). Then we start counting, assuming the first page of the first document in the list is page 1. (If you want to include the table of contents in the page count, you will most likely have to generate the table of contents once, figure out how many pages it takes up, and then regenerate it with that offset, but we are keeping things simple for this example.) Following that we loop through all of the files, determine each title from the metadata, and print the title and the starting page number for each file. We finish up by saving the table of contents to a temp file. (Note: don't be confused, we are keeping track of two PDF variables: pdf, which corresponds to PDFTK instances, and print_pdf, which corresponds to PDF::Writer instances.)

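      # keep only files that exist and are PDFs
      # (File.mime_type? is not core Ruby; it is assumed to come from a MIME type extension loaded by the project)
      files = files.reject { |f| !File.exist?(f) || File.mime_type?(f) != 'application/pdf' }
      current_page = 1
      files.each do |f|
        pdf = open(f)
        # fall back from the Title metadata to the File entry to the file name
        page_title = pdf[:Title] || pdf[:File] || File.basename(f)
        print_pdf.text("#{page_title} - #{current_page}", :font_size => @page_font_size, :justification => :left)
        print_pdf.move_pointer(@page_vertical_spacing)
        current_page += pdf.pages
      end

      print_pdf.save_as(File.join(tmpdir, 'toc.pdf'))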
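
The page arithmetic in that loop, and the two-pass offset idea mentioned earlier, can be tried out without any PDF library. The sketch below is illustrative only: Doc and toc_entries are hypothetical names, the page counts are made-up sample data, and the one-page table of contents is an assumption standing in for a value you would measure after the first render.

```ruby
# Each document is described by a title and a page count. In the real
# method these would come from pdf[:Title] and pdf.pages.
Doc = Struct.new(:title, :pages)

# Returns [title, first_page] pairs. offset is the number of pages that
# precede the first document (0 when the table of contents is not counted).
def toc_entries(docs, offset = 0)
  current_page = 1 + offset
  docs.map do |doc|
    entry = [doc.title, current_page]
    current_page += doc.pages
    entry
  end
end

docs = [Doc.new('Intro', 3), Doc.new('Report', 10), Doc.new('Appendix', 2)]

first_pass  = toc_entries(docs)             # table of contents not counted
toc_pages   = 1                             # assumed: measured after the first render
second_pass = toc_entries(docs, toc_pages)  # every entry shifted by the offset
```

Here first_pass starts the three documents on pages 1, 4, and 14, and second_pass on pages 2, 5, and 15.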

You may have noticed a couple of calls in the previous code that need some fleshing out. PDFTK provides accessors for all of the metadata that can be found in the PDF: you simply pass a symbol or a string to the [] method to retrieve a value. You can also use PDFTK to get the number of pages in the file by calling pages.
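
To make the title lookup concrete, here is a small sketch of the fallback chain on its own. It is not PDFTK code: a plain Hash stands in for the PDFTK object (anything that answers [] with a symbol would do), and title_for is a hypothetical helper. Unlike a bare || chain, it also skips blank titles, which is a design choice of this sketch.

```ruby
# Hypothetical helper. metadata is anything that responds to [] with a
# symbol key; a plain Hash stands in for the PDFTK object here.
def title_for(metadata, path)
  candidates = [metadata[:Title], metadata[:File], File.basename(path)]
  # pick the first candidate that is neither nil nor blank
  candidates.find { |value| !value.to_s.strip.empty? }
end

with_title    = title_for({ :Title => 'Annual Report' }, '/docs/report.pdf')
blank_title   = title_for({ :Title => '  ' }, '/docs/report.pdf')
without_title = title_for({}, '/docs/report.pdf')
```

The first call gives 'Annual Report'; the other two fall through to the file name, 'report.pdf'.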

For the last step we are going to concatenate the table of contents with all of the files that were passed in the array. Again PDFTK will come in handy here. There is a class method, pdftk, which accepts the arguments of any command that PDFTK can run as a list. So all we need to do is have PDFTK run its cat operation. Following that we just read the data of the generated file and pass it back before the cleanup code at the end of the method runs. This portion is as follows:

      # pdftk toc.pdf file1.pdf file2.pdf ... cat output outfile.pdf
      command = [File.join(tmpdir, 'toc.pdf')] + files + ['cat', 'output', File.join(tmpdir, 'outfile.pdf')]
      raise IOError, "Unknown PDFTK Error command: pdftk #{command.join(' ')}" unless pdftk(*command)

      # read in binary mode so the PDF bytes are returned unchanged
      data = File.open(File.join(tmpdir, 'outfile.pdf'), 'rb') { |f| f.read }

      return data
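
If you want to see what is being handed to PDFTK, the argument list can be built and inspected on its own. The following is a sketch under assumptions rather than the wrapper's actual internals: cat_arguments and run_pdftk are hypothetical helpers, and run_pdftk assumes a pdftk binary is on the PATH (it is defined here but not called).

```ruby
# Hypothetical helper: builds the argument list for
#   pdftk toc.pdf a.pdf b.pdf cat output merged.pdf
def cat_arguments(toc_path, files, output_path)
  [toc_path] + files + ['cat', 'output', output_path]
end

# Hypothetical runner: passing the arguments as separate strings to
# Kernel#system avoids shell quoting problems with spaces in file names.
def run_pdftk(arguments)
  system('pdftk', *arguments) or raise IOError, "pdftk failed: pdftk #{arguments.join(' ')}"
end

# Usage: build the list without running anything.
args = cat_arguments('/tmp/toc.pdf', ['a.pdf', 'my file.pdf'], '/tmp/out.pdf')
```

The table of contents comes first in the list, so it ends up at the front of the merged file.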

So now we have a simple method for concatenating several PDF files and creating a table of contents for them.