Posted: July 27, 2005

Preparing to Blog – Word Docs to HTML II

Earlier, I had documented my experience in testing Word docs to HTML conversion. As the time approached to actually do that with a huge 42-page post, I began to put that process in play. I again discovered it was kludgy, went back to the drawing board, and refined my process and steps. This post documents the new process. Though it has many steps, it works and is quite clean.

When to Convert

As the earlier post noted, this procedure should ONLY be used for longer, complicated documents that have already been created in Word. (It applies to Excel spreadsheets, as well, whether straight from Excel or embedded in Word.)  There are thus four ways I use to create content for AI3:

  • Large, existing Word documents (most often produced for other purposes), for which I use the process documented here.  The reference document used in this example is DocName.doc
  • Short, existing Word documents, which I save as text and then final format within WordPress using the Xinha WYSIWYG editor
  • Large, new postings (most with complicated things like tables, etc.), which I compose offline using the Nvu WYSIWYG HTML editor (see below), and then import under the Prepare to Post sections below, and
  • Short, new postings, which I compose directly within the WordPress administration center using Xinha.

Thus, assuming we are in the first or third categories, here are the revised steps for Word doc conversions and postings.

Create Original Word Document

  1. Create the Word doc as normal
  2. When completely finalized, create a “special” HTML-ready version. That is, reduce unnecessary styles, move footnotes to endnotes, remove repeated page headers and footers, and eliminate anything else extraneous that may make sense in a multi-page Word or printed document, but not as a Web page
  3. Save the document with Save As, using the Web Page, Filtered option. Give the document a logical name similar to the original, but include HTML in the name to distinguish it (for example, DocNameHTML.html). This is also the version you may need to return to should you have to go back to square one in this conversion process.

Cleanup Word HTML

You are now ready to stage the document for posting. The first phase is to clean up the ugly HTML created by Word.

  1. Create an account with Textism for the Word HTML Cleaner. The service charge is minimal (~$6 for 24 hrs or ~$25 per year) and it does a fantastic job of stripping down to essential HTML and formatting the code for readability
  2. Submit the file to the Textism service
  3. Cut-and-paste the resulting clean version to your local drive; give the new document a logical name similar to the original, but include Clean or some other standard designator to distinguish it (for example, DocNameClean.html).

Edit HTML into Final Form

Assuming you want to make final formatting changes for what appears on your site, such as final table formatting and the like, you now need to edit the document into its final presentation look. You will need to use a WYSIWYG HTML editor or composer. In my previous post on this subject, I used Mozilla’s Composer. Most recently, I have been trying out the Nvu (“N-view”) editor, which is a new branch from Composer by Daniel Glazman for Linspire. (There are a couple of frustrations with the product, such as using body text rather than paragraph as the default style and the insidious problem of inserted line breaks that it shares with Composer. Still, it is an advancement from Composer that looks to be heading in a great direction, and the Nvu user guide is also much better than that for Composer.) Thus the edit steps are:

  1. Use the editor to make all final HTML and formatting changes. Because the steps that follow are a pain to repeat, remain at this step until the document looks exactly as you want it with final edits. If, perchance, you look at your document at this stage in WordPress, you may see funky line breaks and some other minor problems (addressed below). Don’t worry about them now; they will be cleaned up in the last steps
  2. Besides formatting, you may want to look for and remove items such as excess anchor tags (Word cross-referencing can introduce more than desired), caption and other style-specific entries, etc.
  3. You will also need to provide image references relative to your site. Click on each image and in the image dialog provide a link path similar to (for my site), and make sure the “URL is relative to page location” checkbox, if it is there, is checked. Of course, using this approach means you have already placed the referenced images in the path location so indicated
  4. When editing is complete, save the document using a logical name similar to the original, but include Format or some other standard designator to distinguish it (for example, DocNameFormat.html)
  5. And, because of the next step, you may need a text version. Thus, after saving the HTML version, save another version using the same name but with the .txt extension (for example, DocNameFormat.txt)
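
For documents with many images, clicking through each image dialog gets tedious. As a rough alternative, the relinking in step 3 could be scripted. The sketch below is only an illustration: the function name relink_images and the folder images/docname are hypothetical, and it assumes the src attributes are double-quoted.

```python
import posixpath
import re


def relink_images(html, image_dir):
    """Point every <img> src at image_dir, keeping only the file name.

    image_dir is a site-relative folder chosen by the caller; the value
    used in the usage note is hypothetical, not the author's real path.
    """
    def swap(match):
        # Keep just the file name, whatever folder Word exported it to.
        filename = posixpath.basename(match.group(2).replace("\\", "/"))
        return match.group(1) + posixpath.join(image_dir, filename) + match.group(3)

    # Only double-quoted src attributes are handled in this sketch.
    pattern = r'(<img\b[^>]*?\bsrc=")([^"]*)(")'
    return re.sub(pattern, swap, html, flags=re.IGNORECASE)
```

Calling relink_images(text, "images/docname") on the contents of DocNameFormat.html would rewrite a reference such as DocNameHTML_files/image001.gif to images/docname/image001.gif. As with the manual approach, the images must already be in that location on the site.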

Prepare to Post

We are now at the final steps, which remove some of the problems created by quirks in the previous tools. If you use a different set of tools or perhaps emphasize different elements (such as forms) in your HTML, you may have slightly different steps than what I outline below. Also, I use MS Word as the final cleanup editor because 1) we began with Word doc problems; and 2) Word has the global search-and-replace (S & R) capabilities and tab and paragraph recognition abilities needed for these steps. Any other text editor that has these capabilities may be substituted. (Note: we needed to name our file with a .txt extension because Word infers formatting based on extension; other text editors can handle .html extensions without this problem.) Thus the final preparation steps are:

  1. Open Word and make sure you turn on the show paragraph (¶) option; it acts to display paragraph, tab and spaces, useful for determining S & R patterns
  2. S & R first step for line breaks:  Replace </p>^p –>  </p>@ (the </p> is the actual HTML paragraph end or break, the ^p is the line break symbol, the @ sign is used temporarily as a pattern for later replacement)
  3. S & R second step for line breaks:  Replace ^p –> [single space] (don’t enter the square brackets, simply the single space. This step removes the insidious inserted line breaks and re-instates proper spacing)
  4. S & R third step for line breaks: Replace </p>@ –> </p>^p^p (this now re-instates the line break at the correct paragraph end and adds an extra break for code readability)
  5. S & R for bullet replacements may need to find HTML patterns created in previous steps that introduce funky characters. In my documents, I often see this symbol (§) as a replacement for the Word bullet, which appears in the HTML as &sect;. Since the pattern I see is &sect;[single space], I replace that with nothing (an empty replacement). Depending on the bullet type you use in Word, you may see different patterns
  6. S & R other changes as necessary
  7. Then, save the document under a logical name similar to the original, but add Post or some other standard designator (for example, DocNamePost.txt).
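
For anyone who would rather script these replacements than run them in Word, the three line-break steps and the bullet cleanup can be sketched in Python. This is a minimal illustration, not the Word procedure itself: the function name unwrap_html is made up, "\n" stands in for Word's ^p, and it assumes the text does not already contain the </p>@ pattern.

```python
def unwrap_html(text, marker="@"):
    """Apply the three line-break replacements, then the bullet cleanup."""
    # Normalize Windows line endings so "\n" plays the role of Word's ^p.
    text = text.replace("\r\n", "\n")
    # Step 1: protect real paragraph ends with a temporary marker.
    text = text.replace("</p>\n", "</p>" + marker)
    # Step 2: turn every remaining (inserted) line break into a single space.
    text = text.replace("\n", " ")
    # Step 3: restore the paragraph ends, plus an extra break for readability.
    text = text.replace("</p>" + marker, "</p>\n\n")
    # Bullet cleanup: drop the "&sect; " pattern left behind by Word bullets.
    text = text.replace("&sect; ", "")
    return text
```

The order matters for the same reason it does in Word: the marker has to be in place before the blanket line-break replacement, or the real paragraph ends are lost along with the unwanted breaks.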

Post Final

You are now ready to post to WordPress. To do so:

  1. Remove the opening and closing lines from your DocNamePost.txt file. These are the HTML header and body statements. At the top of the file, remove everything up to and including <body>; at the bottom of the file, remove everything after and including </body>
  2. Select all and copy
  3. Finally, within the WordPress Write section, paste into the text area. If you are using a WYSIWYG editor, make sure you are using code view (< >) when you paste
  4. Publish or save as a draft.
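
The trimming in step 1 can also be done with a few lines of Python rather than by hand. The sketch below is an assumption-laden illustration (the function name body_only is hypothetical): it keeps only what sits between the opening <body> tag and the last </body> tag, and leaves the text alone if either tag is missing.

```python
import re


def body_only(html):
    """Return the markup between <body ...> and the final </body>."""
    start = 0
    end = len(html)
    # Drop everything up to and including the opening <body> tag.
    opening = re.search(r"<body\b[^>]*>", html, flags=re.IGNORECASE)
    if opening:
        start = opening.end()
    # Drop the last </body> tag and everything after it.
    closings = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
    if closings:
        end = closings[-1].start()
    return html[start:end].strip()
```

The result is what gets pasted into the WordPress text area in code view.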

Voilà, you are done.

Note that the various versions created during this process enable you to return to any part of the process and make revisions from there. However, should you need to make major changes to the content after you have posted it to WordPress, you may be better off deleting the entire post and creating a new one rather than updating it in place. This saves update time for some reason.

Author’s Note: I actually decided to commit to a blog on April 27, 2005, and began recording soon thereafter my steps in doing so. Because of work demands and other delays, the actual site was not released until July 18, 2005. To give my ‘Prepare to Blog …’ postings a more contemporaneous feel, I arbitrarily changed posting dates on this series one month forward, which means some aspects of the actual blog were better developed than some of these earlier posts indicate. However, the sequence and the content remain unchanged. A re-factored complete guide will be posted at the conclusion of the ‘Prepare to Blog …’ series, targeted for release about August 18, 2005. mkb
