Now that I’m getting closer to going live, I have begun testing moving MS Word docs directly into the site. Most of my original research and analysis stuff is done in Word or Excel because I’m pretty much a power user in quick writing, assembly and formatting. If I could convert docs in a relatively straightforward way to AI3, that would be a real boon. However, as I discovered, there are major, major problems and issues with moving Word documents.
The Big Transfer
My first test involved a beast of an analysis piece I had done on the $3 trillion value of U.S. enterprise document assets. (It was eventually posted here.) In its long form, it is 42 pages long, with many tables, a few figures and more than 100 citations and links. It is over 1 MB (1070 KB) in its original form.
Most of us have created Web pages directly from a Word document, and I tried this first. I did the conversion with the filter option, pasted the result directly into the editor, and attempted to update the site. The transfer seemed to take forever and then the server hung. My suspicion was that the Word HTML code was too complex.
Cleaning the Word HTML
Before packing it in and splitting the original doc into multiple pieces so that my site could choke it down, I decided to do a bit of investigation on alternative cleanup utilities and approaches. One good review I found was by Laurie Rowell on ‘Clean HTML from Word: Can it be Done?’. I recommend the four-parter.
Laurie’s review suggested that third-party tools could improve somewhat on the output of the MS Web page with filtering option, but the improvements did not appear sufficient to let me proceed without splitting my files. Nonetheless, I followed that advice, mostly relying on the MS Web page with filtering, and again attempted a transfer, with the same results as before.
After this initial cleaning, I used both Word and Composer (the Mozilla HTML editor) to do some search-and-replace removal of further HTML tags. Using pattern replacements, especially with Word, it is also possible to replace line breaks and tab characters, so long as the base file does not have an expected Word extension such as .doc, .rtf or .html (by convention, I always use .txt). This was going well in reducing the sizes of the files (the best I was able to achieve on the 1 MB file was about 450 KB), but the process was laborious even with global search-and-replace. Furthermore, WordPress was still not choking down even the smaller files. And, unbeknownst to me at this stage, other issues were being introduced into the files that would make later steps even more difficult.
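The kind of pattern-based removal described above can be sketched in a few lines of JavaScript. This is only an illustration, not the actual search-and-replace steps used here: the function name `cleanWordHtml` is hypothetical, and the patterns cover just a handful of common Word leftovers (comments, Office namespace tags such as `<o:p>`, `Mso` classes, inline styles, and `<span>`/`<font>` wrappers).

```javascript
// Hypothetical helper: strips a few common Word-specific leftovers from
// HTML exported by Word. The patterns are illustrative, not exhaustive.
function cleanWordHtml(html) {
  return html
    // Drop HTML comments, including Word's conditional comments
    .replace(/<!--[\s\S]*?-->/g, "")
    // Drop Office namespace tags such as <o:p> and </o:p>
    .replace(/<\/?[a-z]+:[^>]*>/gi, "")
    // Drop class attributes that carry Word's "Mso..." class names
    .replace(/\s+class=("[^"]*Mso[^"]*"|'[^']*Mso[^']*'|Mso\w+)/gi, "")
    // Drop quoted inline style attributes
    .replace(/\s+style=("[^"]*"|'[^']*')/gi, "")
    // Unwrap <span> and <font> tags, keeping the text inside them
    .replace(/<\/?(span|font)\b[^>]*>/gi, "")
    // Turn tabs and line breaks into single spaces
    .replace(/[\t\r\n]+/g, " ")
    .replace(/ {2,}/g, " ")
    .trim();
}

const sample =
  "<p class=MsoNormal style='margin:0in'><span lang=EN-US>Hello\tworld<o:p></o:p></span></p>";
console.log(cleanWordHtml(sample));
```

Because the last steps collapse all tabs and line breaks, a sketch like this would also flatten preformatted (`<pre>`) content, so it suits ordinary prose better than code listings.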
Splitting Files
I then split the document into six parts, the largest being about 250 KB. As before, they were cut-and-pasted into the editor and then posted. I again got server errors and time-outs. Kevin Klawonn of BrightPlanet was able to determine that the Apache server was timing out after 30 seconds. With a minor parameter change, we were able to get all files uploaded.
However, while the system was now choking down the files, they looked terrible! Line breaks were totally messed up, and editing them within the Xinha editor was close to hopeless. Clearly, and unfortunately, the code was still not clean enough.
Problems with Composer
A natural assumption was that these open source editors were "buggy" and unable to handle more commercial strength requirements. As a natural response, I turned to Composer, a standard HTML utility in my Mozilla browser.
I had not used Composer much before, but found much to like. It has nice toggles between HTML source and WYSIWYG. It offers menu options for most "standard" activities I would undertake with a Web page. In short, I liked working with it and thought it might become an offline (at least from my blog) standard for doing HTML WYSIWYG editing of large imports. I actually started becoming familiar with the app and its controls and features.
However, upon actual incorporation of the results, I found a nasty truth. Composer introduces forced line breaks at about column 70. As a basis to incorporate into other apps (all of which need to work nicely together), this was fatal. So much for Composer. I was sad …
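For files that have already picked up such forced breaks, one possible repair is to rejoin the wrapped lines. The sketch below is an assumption-laden illustration rather than a step from the process described here: the function name `unwrapHardBreaks` is hypothetical, and it assumes that blank lines separate blocks and that any single line break inside a block is an unwanted hard wrap.

```javascript
// Hypothetical helper: rejoins lines that an editor hard-wrapped.
// Assumes blank lines separate blocks and that every single line break
// inside a block is an unwanted wrap (so <pre> content would be damaged).
function unwrapHardBreaks(text) {
  return text
    .replace(/\r\n?/g, "\n")             // normalize line endings
    .split(/\n{2,}/)                     // blank lines separate blocks
    .map(block =>
      block
        .split("\n")
        .map(line => line.trim())
        .filter(line => line.length > 0)
        .join(" ")                       // rejoin wrapped lines with a space
    )
    .filter(block => block.length > 0)   // drop empty blocks
    .join("\n\n");
}

const wrapped =
  "<p>This is a long\nparagraph wrapped\nby the editor.</p>\n\n<p>Second\nblock.</p>";
console.log(unwrapHardBreaks(wrapped));
```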
Problems with Xinha
I have observed line breaks being introduced by Xinha, but have not been able to reproduce the actual steps. In general, the system seems to be OK about not introducing spurious breaks, including when editing moves from full to smaller screen.
Cleaning the Word HTML II – Textism
The entire process of using files in multiple applications with multiple behaviors had worked to create a total HTML nightmare in my test file baseline. Remembering one of the options in Laurie Rowell’s piece (above), I decided to break my normal rule against paid products and check out the Textism site using the Word Clean utility. Dean Allen actually has an interesting pricing model which mixes aspects of free, seduction and low cost.
This is a superior and professional offering:
In short, a total recommendation. Any user needing to move a few files per month from Word to their blog should definitely consider this service.
Final Adopted Process
Depending on the length of the original Word document and its complexity, I recommend one of two approaches given current tools (at least the ones I have tested.)
For shorter Word documents, those with little complexity, or internal or external references:
For Word documents that do not meet these conditions, the path is tortuous and onerous:
I know this sounds like a pain, and it is. You should also keep saved versions of interim steps above to have fallbacks if necessary.
Note: There are instances when the size of the file and the degree of final HTML editing and clean-up may suggest offline editing because server-side editing is slow, updated posts may take forever or experience server time-outs, or they may simply crash the server. If offline editing is necessary, do make sure to use an HTML editor that does not insert those insidious line breaks. If it does, you will spend hours of frustration trying to get everything clean again.
Author’s Note: I actually decided to commit to a blog on April 27, 2005, and began recording soon thereafter my steps in doing so. Because of work demands and other delays, the actual site was not released until July 18, 2005. To give my ‘Prepare to Blog …’ postings a more contemporaneous feel, I arbitrarily changed posting dates on this series one month forward, which means some aspects of the actual blog were better developed than some of these earlier posts indicate. However, the sequence and the content remain unchanged. A re-factored complete guide will be posted at the conclusion of the ‘Prepare to Blog …’ series, targeted for release about August 18, 2005. mkb
I finally decided to bite the bullet and give some concerted attention to filling in some of the "standard" background material I have been contemplating for the site. I’d done some earlier work pulling together bio and mission material; I spent most of today (and productively so!) completing those items and some of the other linkage glue.
I’m pretty pleased with the near-readiness of the site for at least initial release. I have to take the plunge and just do it, and then see how exposure and a "live" environment require new efforts.
I remain very disappointed with how difficult it is to compose in a free-flowing manner without having to worry about formatting and HTML. The WYSIWYG editor I’m currently using is not available for both posts and pages, and copying and pasting between environments does really screwy things with word wraps and with whether embedded HTML is picked up (when it is not, pretty painstaking editing is required).
But, all in all, good progress. I’m quite close to releasing, but may delay somewhat with a week of business travel coming up next week. [As noted elsewhere, this actual post is one month out of sync from the actual release date.]
Two things broke when I added the WordPress permalink feature. This post describes each problem and the fixes I attempted, followed by my current resolution and approach.
Why Permalinks?
Rather than use the ?page_id=num and ?p=num internal references for postings, WordPress provides a permalink feature that converts these ID references into URL strings that help in search engine indexing. Here is the permalink structure I settled upon for my site:
/index/ai3/%year%/%postname%/
This produces a URL that contains a truncation of the post name title, plus other relevant information. That is well and good and the search engines would love me, but turning this feature on caused two problems: 1) images were lost due to reference changes; and 2) Jonathan Foucher’s ‘Popular Posts’ plugin ceased displaying.
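To make the structure tags concrete, here is a rough JavaScript sketch of how such a pattern could expand into a URL path. It is not WordPress's own code: `slugify` and `buildPermalink` are hypothetical names, the post object is made up for illustration, and the slug rule is a simplified stand-in for however WordPress derives the post name.

```javascript
// Hypothetical, simplified slug rule: lowercase, with runs of characters
// other than letters and digits collapsed into single hyphens.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical expansion of a permalink structure for one post.
// Only %year% and %postname% are handled; other tags are left untouched.
function buildPermalink(structure, post) {
  const values = {
    "%year%": String(post.date.getFullYear()),
    "%postname%": slugify(post.title),
  };
  return structure.replace(/%[a-z_]+%/g, tag =>
    tag in values ? values[tag] : tag
  );
}

// Made-up post used only for illustration
const post = { title: "Why Permalinks?", date: new Date(2005, 5, 20) };
console.log(buildPermalink("/index/ai3/%year%/%postname%/", post));
```

Note that nothing in the resulting path carries the numeric post ID, which is the detail that matters for the plugin problem described later in this post.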
Lost Images
The first problem is that all of my site images no longer could be found. According to the last entries in the WordPress support file, I needed to add these lines of code to my main site index file:
<?php $basehref = "http://".$_SERVER['SERVER_NAME'].($_SERVER['SCRIPT_NAME']); ?>
<base href="<?php echo"$basehref"; ?>">
As well, I needed to preface my internal image references by ‘/index.php/’.
These changes again allowed my images to be properly displayed, but I guess I’m not sure why all of the pieces worked. With this first problem fixed, I could now address the second.
Popular Links
I use Jonathan Foucher’s ‘most popular’ plugin, which broke when I introduced a permalink without an ID field. This plugin itself relies on Randy Peterson’s StatTraq statistical reporting plugin.
In researching this problem, I came across a posting by Jonathan confirming that the issue does indeed arise when no post ID is included in the permalink structure. Because the post ID is not added to the ‘article_id’ field of the StatTraq table, the ‘page views’ StatTraq report shows only ‘Mixed’ page views, not the actual posts viewed by visitors. I tried his suggested resolution, which changes these lines (25 and 26) in the stattraq.php file:
if (($p != '')){
$p = intval($p);
With this replacement:
if (($post->ID != '')){
$p = intval($post->ID);
Unfortunately, that did not fix my problem.
Resolution
Since I could not get images and popular links to work simultaneously, I decided to pass. I suspect that later updates to StatTraq and WordPress will better address these problems (Randy has announced a pending update for StatTraq). While I like the fact these tools are extensible and many discuss successful hacks, it does concern me that hacking code unnecessarily might make installing later upgrades and bug fixes even more problematic.
So, in thinking about the fact that AI3 is likely to be very content-filled anyway, I decided to put off resolving the permalink issue until another day. I’d already spent too many hours on a dead-end.
After some days of delays doing my normal day job, I was able to return to getting the site ready.
The key things for today were setting up the relationships, linkages and accounts with entities like Blogstreet, etc. I was very impressed with how easy it was to add these systems; their Web sites and on-site instructions with accompanying HTML and JavaScript (if used) were excellent.
I set up links for:
I will continue on getting this prep stuff done, but upcoming business travel may push out my desired site release a few days further.
After deciding that Xinha would be the WYSIWYG editor of choice (at least for the time being), I asked Kevin Klawonn, BrightPlanet’s most capable sys admin, to tackle the integration task. Here is his report on his experience; I’m sure later posts will update this matter:
Incorporating Xinha into WordPress has been more than a frustrating experience. I was hoping that because of the apparent widespread use of WordPress and Xinha, the integration would be seamless. At a bare minimum, I was hoping that someone else had achieved and documented exactly what I was trying to accomplish. After looking into how WordPress handles plugins, my goal was to configure Xinha so that it would be similar to the existing plugins. I wanted to be able to drop the Xinha folder into the plugins folder, click the activate button in the admin console, and have Xinha "magically" appear in the editing boxes of WordPress. However, this was not the case.
What I found was that for Xinha to work (sort of), I needed to do just as the instructions said and do some hardwiring into the existing pages of the blog. Once I had the new editor working on one page, I went back to looking into making the integration simpler. After scouring several sites for information and spending quite some time on the task of making the implementation of Xinha simple, I am back to the idea that I just need to make Xinha work. Now that I am back on the track of just getting it to work, I am still confronted with some problems.
In the Newbie Guide for Xinha, it says that the textareas that are to be converted to Xinha editors need to have their names listed in the my_config.js file. The two text areas, the ones on the edit-page-form.php and the edit-form-advanced.php pages, both have the id of "contents". So in the my_config.js file I entered "contents" in the appropriate list. I checked the two pages, and only the editor on the edit-form-advanced.php is using Xinha. Why would one work and not the other?
I read every entry I could on the Xinha website. The only thing that came remotely close to helping was a debunked entry saying to change the:
"window.onload = xinha_init;" line at the bottom of the my_config.js file to read "window.onload = xinha_init();"
Immediately following that entry on the website was another saying to NOT do that because it completely changes the meaning of that line and can have unforeseen consequences. I tried it anyway. Both editors appeared correctly. Unfortunately, if a person refreshed the webpage, the Xinha editors were replaced with the regular text areas. So adding the "()" to that line did not work correctly anyway.
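The difference between the two forms of that line can be shown outside a browser. In the sketch below, the plain objects `deferred` and `immediate` stand in for the browser's `window`, and `xinhaInit` stands in for Xinha's `xinha_init`; all three are hypothetical stand-ins, not the real objects.

```javascript
// Stand-in for xinha_init: records each run and returns nothing.
const log = [];
function xinhaInit() {
  log.push("init ran");
}

// Form 1: `onload = xinhaInit` stores the function itself.
// Nothing runs until the load event later calls the handler.
const deferred = {};
deferred.onload = xinhaInit;
const runsBeforeLoad = log.length;   // still 0 at this point
deferred.onload();                   // simulates the load event firing

// Form 2: `onload = xinhaInit()` calls the function immediately and
// stores its return value (undefined), so no handler is registered.
const immediate = {};
immediate.onload = xinhaInit();

console.log(runsBeforeLoad, typeof deferred.onload, immediate.onload);
```

With the parentheses, initialization happens as soon as the script line executes rather than when the page has finished loading, and nothing is left registered for the load event. That may be related to the inconsistent behavior seen on refresh, though the report does not establish the cause.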
Another problem that I have encountered so far with the Xinha editor is that when I do get it to work on a page, it only appears correctly when I use Firefox v1.0.4. If I use Opera v8.0 Xinha does not appear at all. If I use IE v6 I only get a partial view of Xinha. In my research I have found no one else that has these problems. So, once I finally get Xinha working on all the appropriate pages, I then need to find out why it is not working for certain browsers.
Our objective in this continuing effort is to get Xinha integrated as a true plug-in, allow it to be switched on or off alongside the existing minimal editor, ensure it works across browsers, and address other things we have not yet discovered.
Nonetheless, Kevin has done a great job in areas I can’t even fathom. It is clear that some blog tasks should not be tackled by mere mortals.