Converting a Word Doc and PDF into a Drupal Book for web display

Word 2003 / PDF to Drupal Book (Drupal 5.x)

This page describes how I converted a long MS Word document with embedded graphics and references into a hierarchical Book structure in the Drupal 5 CMS. [2009 update: this site is now at Drupal 6.x and the book is still working.]

The original document was my 945-page PhD dissertation, with hundreds of Endnote references and figures. It began as an MS Word 2003 document with external (linked) graphics, primarily in EPS and JPG formats. The resulting PDF was submitted to ProQuest and is served online here (PDF). However, it was not readily accessible to browsers and search engines, so I converted it to an HTML display format, a Drupal Book structure. It is now a book with 499 content nodes (plus the Biblio nodes) that can be viewed here.

Summary

The principal challenges were batch-processing the following steps:

  • converting the structure to a web-based hierarchy (Drupal Book)
  • converting EPS graphics to web-viewable JPG format
  • converting bibliographical references to web-viewable hyperlink format

These were addressed by getting the document out of Word into XHTML, converting the figures to placed JPG files using regular-expression searches in Dreamweaver (or a text editor like Notepad++ [Windows] or Smultron [Mac]), and exporting the Endnote references to the Drupal Biblio module with a CiteID key as the basis for the URL alias.

 

Adding tags to the document and formatting for the web

The original document had hierarchical title styles in Word that were turned into tags, specified through stylesheets, upon conversion to XHTML. I had used Word's default Heading styles while writing the dissertation, and these map to heading tags (e.g., <h1>, <h2>).

  1. In Drupal 5, install the following modules: Book (core), HTML2Book, HTML Corrector, and HTML Tidy, as described in the HTML2Book module description.
  2. Embedded graphics export: MS Word supports many embedded graphics formats that are not converted adequately to web-viewable formats. In particular, EPS files were garbled in the output. I used a Photoshop droplet to batch-convert all placed EPS files to 144 dpi JPG files with the same filenames, so that the JPG would be referenced instead of the EPS file.
      Using regular expressions in the text editor, I converted tags that resembled
      <IMG src="/diss_images/filename.EPS" border="0" width="565" height="865">
      to something more like the following
      <a href="sites/all/files/diss_images/filename.JPG"><IMG src="/diss_images/filename.JPG" border="1" width="500"></a>
      Note that in addition to changing the EPS to JPG and the path to a Drupal sites/all path, the image is now displayed inside a hyperlink, using an IMG tag that shows it at only 500 px wide (scaled proportionally). If the user clicks the image, they are shown the larger version. The page is more readable (no giant graphics), but larger views are available on demand.
      The only drawback is that the large image is loaded on every view even though it is displayed smaller. There are a number of good alternatives for image display in Drupal (ideally, using ImageCache to scale images to multiple sizes), but this was the most expedient for this project.
      Update (March 2009): This turns out to have been a good decision, because now that I've updated the whole dissertation-as-Drupal-book to Drupal 6.10 I can use Quicksketch's Image Resize Filter module. This module caches smaller versions of images for quick display; the cached size is determined by the size specified in the IMG tags.
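The search-and-replace on the IMG tags was done in a text editor, but the same substitution can be sketched as a short Python script. This is only an illustration under stated assumptions: the function name link_eps_images and the exact pattern are mine, and the pattern assumes every EPS tag follows the shape of the example above (a src under /diss_images/ ending in .EPS).

```python
import re

# Assumed tag shape, taken from the example above:
# <IMG src="/diss_images/filename.EPS" border="0" width="565" height="865">
EPS_IMG = re.compile(
    r'<IMG\s+src="/diss_images/(?P<name>[^"/]+)\.EPS"[^>]*>',
    re.IGNORECASE,
)


def link_eps_images(html, display_width=500):
    """Swap each EPS <IMG> tag for a JPG shown at a fixed width inside a link."""

    def replace(match):
        name = match.group("name")
        return (
            f'<a href="sites/all/files/diss_images/{name}.JPG">'
            f'<IMG src="/diss_images/{name}.JPG" border="1" '
            f'width="{display_width}"></a>'
        )

    return EPS_IMG.sub(replace, html)


sample = '<IMG src="/diss_images/filename.EPS" border="0" width="565" height="865">'
print(link_eps_images(sample))
```

Dropping the height attribute is what lets the browser scale the image proportionally to the 500 px width.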
     

Formatting in-text citations for linking out to Biblio nodes

As of Endnote X2, the program still cannot produce hypertext links to a bibliographical view. I worked around that by modifying the in-text citation style to include Endnote's unique ID number (Ref#) to use as a CiteKey in Biblio. Then I used a regular-expression search (in Notepad++ or Dreamweaver) to convert each in-text citation to a hyperlink.

  1. The formatted Endnote tags, with the Endnote reference number included, were derived from the American Antiquity journal style and looked like this:
    (<a href= "/biblio/ref_custom1">Author Year</a>)
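As a rough sketch of that regular-expression step, the Python below turns a marked-up citation into a link of the form shown above. The intermediate marker format "(Author Year #123)" is purely hypothetical, since the exact text my modified Endnote style emitted is not reproduced here; the function name link_citations is likewise only illustrative. Adjust the pattern to whatever your citation style actually outputs.

```python
import re

# Hypothetical marker format: "(Author Year #RecNumber)", e.g. "(Smith 2005 #123)".
CITATION = re.compile(r'\((?P<label>[^()#]+?)\s+#(?P<rec>\d+)\)')


def link_citations(text):
    """Wrap each marked citation in a link to its /biblio/ref_<number> alias."""
    return CITATION.sub(
        lambda m: f'(<a href="/biblio/ref_{m.group("rec")}">{m.group("label")}</a>)',
        text,
    )


sample = "as argued earlier (Smith 2005 #123)."
print(link_citations(sample))
```

This sketch handles one reference per set of parentheses; grouped citations such as "(Smith 2005 #123; Jones 2006 #45)" would need a more elaborate pattern.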


Process for Import to Drupal from Word filtered .HTML

  1. Select all or part of the text in Word (I went chapter by chapter, so 50-200 pages at a time). Make sure the graphics have loaded. If they appear as an empty box with a red X, try reloading them in Word by selecting them and hitting F9, or by toggling the code display with Alt-F9 (Windows).
  2. If you also have a PDF of the document, you can try exporting the PDF to XML in Acrobat Pro; you'll then get an "images" folder full of the embedded graphics.
  3. Save this selection out to an HTML file using Save As... with the "Web Page, Filtered...(HTM)" format.
  4. In Drupal 5, create the first page of your Book so the top level of the hierarchy is started. Create a secondary page and make sure the PathAuto and Tokens-derived URL works. I used [bookpath-raw]/[title-raw].
  5. Open the HTML file saved out from Word and paste it into Drupal. There are different ways to do this, but one that worked for me was to turn off rich text editing (FCKEditor or other) and paste in the straight HTML as source, then allow HTML Tidy to work.
  6. I had to strip out other tags, including replacing "<p class = c1>" with <p>, correcting the "-" symbol in places, and turning my custom <quote> style tag from Word into <blockquote> and </blockquote>. Dreamweaver has a "Find/Replace Tags" command that makes this easy.
  7. Bring in your Book, probably removing the topmost <h1> tag. That is to say, the Book page you're creating already has a title, so the topmost tag should not reproduce it.
  8. Note that HTML Tidy (5-1.x-dev, 19 June 2007) has conflicts with CCK custom node types, so I turn it off when not in use.
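The tag clean-up in the steps above was done with Dreamweaver's Find/Replace, but it could also be scripted. The sketch below is a minimal Python version under my own assumptions: the function name clean_word_html and the drop_first_h1 option are hypothetical, and the patterns cover only the two tag substitutions and the topmost <h1> removal mentioned above, not the "-" symbol correction.

```python
import re


def clean_word_html(html, drop_first_h1=False):
    """Apply the tag clean-up described above to Word's filtered HTML."""
    if drop_first_h1:
        # The Book page already has a title, so remove the first <h1>...</h1>.
        html = re.sub(r'<h1\b[^>]*>.*?</h1>\s*', '', html, count=1,
                      flags=re.IGNORECASE | re.DOTALL)
    # <p class = c1> (with or without quotes around c1) becomes a plain <p>.
    html = re.sub(r'<p\s+class\s*=\s*"?c1"?\s*>', '<p>', html,
                  flags=re.IGNORECASE)
    # The custom <quote> ... </quote> pair becomes <blockquote> ... </blockquote>.
    html = re.sub(r'<(/?)quote\s*>', r'<\1blockquote>', html,
                  flags=re.IGNORECASE)
    return html


sample = '<h1>Chapter 1</h1><p class = c1>Text</p><quote>Quoted passage</quote>'
print(clean_word_html(sample, drop_first_h1=True))
```

Because the <quote> pattern requires "quote" to follow "<" or "</" directly, running the function a second time leaves existing <blockquote> tags untouched.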

Bibliographical entries: Endnote db to Biblio via XML

Exporting from Endnote X1

Settings for Export from Endnote to Biblio 5.x

    • The XML format should include the tag <rec-number>.
    • Export the file to XML using File > Export..., then Save as Type: XML. Output Style: I think any will do, as long as <rec-number> is included in the XML.
    • Import to Biblio.
    • Note: rec-numbers are specific to each database (but cumulative), so they aren't the best unique ID#. That's just what worked for me.
      It's probably better to make use of the digital object identifier (DOI) and the Biblio citekey functionality.

Bibliographical data into Biblio 5.x as EndNote 8+ XML

Taxonomy settings: I used a separate Drupal Taxonomy Vocabulary for Pubs so I could hide that entire Vocabulary in the Book and the Biblio using the Taxonomy Hide module (because the default Taxonomy teaser view seems useless).

  • use a text editor to replace <rec-number></rec-number> tags with <custom1></custom1>
  • upload the XML file to Biblio using Endnote8+ XML format.
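The tag rename in the first bullet is a plain find-and-replace, so any text editor will do. For anyone who prefers to script it, here is a minimal Python sketch; the function name rec_number_to_custom1 is my own, and it assumes the exported records do not already contain a <custom1> element (otherwise each record would end up with two).

```python
import re


def rec_number_to_custom1(xml_text):
    """Rename every <rec-number> element to <custom1>, keeping its content."""
    return re.sub(r'<(/?)rec-number>', r'<\1custom1>', xml_text)


sample = "<record><rec-number>123</rec-number></record>"
print(rec_number_to_custom1(sample))
```

To use it on a real export, read the XML file into a string, pass it through the function, and write the result to a new file before uploading to Biblio.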

 

I had trouble with the upload timing out, so I had to cut the file into smaller XML chunks during the export from Endnote.

    • Due to memory limitations, I had to reduce the file size and import in batches of roughly 100 references or fewer. I created "Groups" in Endnote (available as of Endnote X) alphabetically by author name: an A-B group of 80 references, a C-D group of 92 references, and so on. These smaller files can be brought from Endnote into Biblio one by one. Discussion of issue.
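I did the batching with Endnote Groups, but an alternative (not what I did) would be to export everything once and split the XML afterwards. The Python sketch below shows that idea. It assumes the export has the layout <xml><records><record>...</record></records></xml>, which you should check against your own file; the function name split_records is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def split_records(xml_text: str, batch_size: int = 100) -> List[str]:
    """Split one exported XML document into several smaller documents."""
    root = ET.fromstring(xml_text)
    # Assumed layout: the root element holds <records>, which holds <record> items.
    records = root.findall("./records/record")
    chunks = []
    for start in range(0, len(records), batch_size):
        out_root = ET.Element(root.tag)
        out_records = ET.SubElement(out_root, "records")
        out_records.extend(records[start:start + batch_size])
        chunks.append(ET.tostring(out_root, encoding="unicode"))
    return chunks


# Tiny made-up example: five records split into batches of two.
sample = "<xml><records>" + "".join(
    f"<record><rec-number>{i}</rec-number></record>" for i in range(1, 6)
) + "</records></xml>"
chunks = split_records(sample, batch_size=2)
print(len(chunks))
```

Each returned string can be written to its own file and uploaded to Biblio separately, which keeps every import within the memory and timeout limits.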


Then I went into the PathAuto settings (v5.x) and changed the node path setting for the Biblio type, "Pattern for all biblio paths:", to biblio/ref_[biblio_custom1]. This also required modifying the Biblio.module code to add support for the Custom1 field in the PathAuto + Token system for URL aliasing, as described in
http://drupal.org/node/89038#comment-869934

I used Custom1 instead of CiteKey because Custom1 is imported by the original Endnote8 XML parser (where CiteKey is not provided). The "unique ID" number used in the URL is relatively arbitrary: it is the <rec-number> in the Endnote bibliography that I used in graduate school. If these records are brought into another DB, or even saved into a subset Endnote .ENL file, the ID numbers change.