Posted: September 28, 2005

Multi-part HTML Postings from Word

An earlier posting described a step-by-step process for converting a Word doc to clean HTML for posting on your site. Today’s posting updates that information, with specific reference to creating multi-part HTML postings.

A multi-part posting may make sense when the original document is too long for a single posting on your site, or if you wish to serialize its presentation over postings on multiple days.

Multi-part HTML postings differ from a single-page posting in three main ways:

Dealing with multiple internal document cross-references: not only those for the table of contents, but also any Word doc cross-references (Insert –> Reference –> Cross-reference), such as those to internal headers, figures, and tables,
Organizing and splitting the table of contents (TOC) itself, and
Naming and referencing images.

So, how does one proceed with a multi-part HTML conversion in preparation for posting?

Specific Conversion Steps

The first requirement is that you create your baseline Word document with a table of contents (TOC) (Insert –> Reference –> Index and Tables –> Table of Contents). Take great care with the construction and organization of the TOC, because it will dictate your eventual multi-part HTML pages and splits.
When the Word doc is absolutely complete (and only then!), follow the steps in the earlier posting on Word docs to HTML to get as clean an HTML code base as possible. Include all of the global search and replace (S & R) operations that the earlier post describes. UNTIL THE VERY LAST SPECIFIC CONVERSION STEP (#6) BELOW, YOU WILL CONTINUE TO WORK WITH THIS SINGLE HTML DOCUMENT! For example, you may end up with clean HTML code for your TOC such as the following:

<a href="#_Toc106767203">EXECUTIVE SUMMARY. 1</a>

<a href="#_Toc106767204">I. INTRODUCTION. 3</a>

<a href="#_Toc106767205">Knowledge Economy. 3</a>

<a href="#_Toc106767206">Corporate Intellectual Assets. 4</a>

<a href="#_Toc106767207">Huge Implications. 4</a>

<a href="#_Toc106767208">Data Warehousing?. 6</a>

<a href="#_Toc106767209">Connecting the Dots. 6</a>

<a href="#_Toc106767210">II. INTERNAL DOCUMENTS. 7</a>

<a href="#_Toc106767211">‘Valuable’ Documents. 7</a>

<a href="#_Toc106767212">‘Costs’ to Create. 8</a>

<a href="#_Toc106767213">‘Cost’ to Modify. 9</a>

<a href="#_Toc106767214">‘Cost’ of a Missed. 9</a>

<a href="#_Toc106767215">Other Document ‘Cost’. 9</a>

<a href="#_Toc106767216">Archival Lifetime. 10</a>

<a href="#_Toc106767217">III. WEB DOCUMENTS AND SEARCH. 10</a>

<a href="#_Toc106767218">Time and Effort for Search. 11</a>

<a href="#_Toc106767219">Lost Searches. 11</a>

<a href="#_Toc106767220">‘Cost’ of a Portal. 14</a>

<a href="#_Toc106767221">‘Cost’ of Intranets. 16</a>

<a href="#_Toc106767222">IV. OPPORTUNITIES AND THREATS. 18</a>

<a href="#_Toc106767223">‘Costs’ of Proposals. 18</a>

<a href="#_Toc106767224">‘Costs’ of Regulation. 21</a>

<a href="#_Toc106767225">‘Cost’ of Misuse. 24</a>

<a href="#_Toc106767226">V. CONCLUSIONS. 25</a>

Do a global S & R on the TOC references, replacing each with an internal page link (e.g., “./ …”) reference. For the Intro, for example, href="#_Toc106767204" becomes href="./intro.html".

There will need to be as many S & R replacements throughout the document as there are entries in the TOC. You should be careful to name your internal pages according to your anticipated final published structure for the multi-part HTML pages. Upon completion of the global S & R, you should then remove earlier Word doc page numbers and clean up spaces or other display issues. Thus, using the example above, you could end up with revised code for the TOC as follows:

<a href="./summary.html">EXECUTIVE SUMMARY</a>

<a href="./intro.html">I. INTRODUCTION</a>

<a href="./intro.html#knowledge">Knowledge Economy</a>

<a href="./intro.html#assets">Corporate Intellectual Assets</a>

<a href="./intro.html#huge">Huge Implications</a>

<a href="./intro.html#data">Data Warehousing?</a>

<a href="./intro.html#dots">Connecting the Dots</a>

<a href="./internal.html">II. INTERNAL DOCUMENTS</a>

<a href="./internal.html#docs">‘Valuable’ Documents</a>

<a href="./internal.html#create">‘Costs’ to Create</a>

<a href="./internal.html#modify">‘Cost’ to Modify</a>

<a href="./internal.html#missed">‘Cost’ of a Missed</a>

<a href="./internal.html#etc">Other Document ‘Cost’</a>

<a href="./internal.html#archive">Archival Lifetime</a>

<a href="./web.html">III. WEB DOCUMENTS AND SEARCH</a>

<a href="./web.html#time">Time and Effort for Search</a>

<a href="./web.html#lost">Lost Searches</a>

<a href="./web.html#portal">‘Cost’ of a Portal</a>

<a href="./web.html#intranets">‘Cost’ of Intranets</a>

<a href="./opps.html">IV. OPPORTUNITIES AND THREATS</a>

<a href="./opps.html#proposals">‘Costs’ of Proposals</a>

<a href="./opps.html#regs">‘Costs’ of Regulation</a>

<a href="./opps.html#misuse">‘Cost’ of Misuse</a>

<a href="./conclusion.html">V. CONCLUSIONS</a>
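For those who would rather script this S & R than run it by hand in an editor, here is a minimal Python sketch. It assumes the HTML has already been cleaned so that attributes use plain straight quotes. The TOC_LINKS mapping and the relink_toc name are my own illustrations, not part of any tool, and only the first three TOC entries are filled in; you would add one entry per TOC line.

```python
import re

# Hypothetical mapping: Word-generated anchor -> final page link.
# Add one entry for every line in your TOC.
TOC_LINKS = {
    "#_Toc106767203": "./summary.html",
    "#_Toc106767204": "./intro.html",
    "#_Toc106767205": "./intro.html#knowledge",
}

def relink_toc(html, links):
    """Swap Word anchor hrefs for page links, then drop the old page numbers."""
    for old, new in links.items():
        # Match the whole attribute so one anchor id cannot hit part of another.
        html = html.replace(f'href="{old}"', f'href="{new}"')
    # Remove the ". 3" style page number that Word leaves just before </a>.
    return re.sub(r"\.\s*\d+\s*(?=</a>)", "", html)

sample = '<a href="#_Toc106767204">I. INTRODUCTION. 3</a>'
print(relink_toc(sample, TOC_LINKS))
```

Note that the page-number cleanup touches every link that ends in a period and a number, so check the result by eye if your document has other links of that shape.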

You may also need to do additional code cleanup. For example, in the snippet below, the first anchor corresponds to the TOC entry that is replaced via steps #3 and #6. The second anchor, however, is an internal cross-reference from another location (not the TOC) in the Word doc. For these additional cross-references, you will need to choose either to keep them and rename them logically with S & R, or to remove them. (Generally, since you are already splitting a long Word doc into multiple HTML pages, such additional cross-references are excessive and unnecessary, and you can likely remove them.)

<h1><a name="_Toc106767204"></a><a name="_Toc90884898"> I. INTRODUCTION</a></h1>

How many documents does your organization create each year? What effort does this represent in terms of total staffing costs? Etc., etc.
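The keep-and-rename or remove choice for these anchors can also be sketched in Python. This is only an illustration under some assumptions: the anchors use straight quotes, they are not nested inside one another, and the tidy_anchors name and the sample h2 line (including its second anchor id) are hypothetical.

```python
import re

# Matches Word's generated anchors, e.g. <a name="_Toc106767204">...</a>
WORD_ANCHOR = re.compile(r'<a name="(_Toc\d+)">(.*?)</a>', re.DOTALL)

def tidy_anchors(html, renames):
    """Rename the anchors listed in `renames`; unwrap (remove) all the others."""
    def fix(match):
        old_name, inner = match.group(1), match.group(2)
        if old_name in renames:
            return f'<a name="{renames[old_name]}">{inner}</a>'
        return inner  # keep the heading text, drop the anchor itself
    return WORD_ANCHOR.sub(fix, html)

# Hypothetical sub-heading: keep the TOC anchor under a logical name,
# and drop the extra cross-reference anchor.
heading = '<h2><a name="_Toc106767205"></a><a name="_Toc90884899">Knowledge Economy</a></h2>'
print(tidy_anchors(heading, {"_Toc106767205": "knowledge"}))
```

The logical names should match the # fragments you chose for the TOC links, so that a link such as ./intro.html#knowledge still lands on its heading.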

You will then need to use global S & R to rename your images, which were given sequential image numbers (not logical names) in the Word doc to HTML conversion. For example, you may have an image named:

<img width="664" height="402" src="Document_files/image001.jpg">

You will need to give that image a better logical name, and perhaps put it into its own image subdirectory, like the following:

<img width="664" height="402" src="./images/CostChart1.jpg">
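With many images it is easy to miss one. As a rough sketch (again assuming straight-quoted attributes), the Python below applies a rename mapping and then reports any sequentially numbered images that are still left over. The rename_images name, the IMAGE_NAMES mapping and the image002.gif entry are hypothetical. The image files themselves still have to be renamed or copied on disk separately.

```python
import re

# Hypothetical mapping: Word's sequential name -> logical name and location.
IMAGE_NAMES = {
    "Document_files/image001.jpg": "./images/CostChart1.jpg",
}

def rename_images(html, names):
    """Rewrite img src values; also list sequential names not yet renamed."""
    for old, new in names.items():
        html = html.replace(f'src="{old}"', f'src="{new}"')
    # Anything still called imageNNN.ext has not been given a logical name.
    leftovers = re.findall(r'src="([^"]*image\d+\.\w+)"', html)
    return html, leftovers

page = ('<img width="664" height="402" src="Document_files/image001.jpg">'
        '<img src="Document_files/image002.gif">')
renamed, leftovers = rename_images(page, IMAGE_NAMES)
print(renamed)
print(leftovers)
```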

Finally, your HTML is now fully prepped for splitting into multiple pages. You need to do three more things in this last step.

First, via cut-and-paste, take your TOC and any intro text from the main HTML document and place them into an index.html document. The directory holding index.html should also be the parent directory for all of your subsequent split pages. Thus, in our example, you would have a directory structure that looks like:

MAIN (where index.html is located)

Summary

Intro

Internal

Web

Opps

Conclusion

Second, cut-and-paste the HTML sections from the main HTML document that correspond to the six specific split pages (summary.html to conclusion.html) and place each into its own named, empty HTML shell with header information, etc. The pasted portions thus generally correspond to the <body> . . . </body> portion of the HTML. This is how the various subpart .html pages get created.
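The cut-and-paste into shells can be approximated in Python, as sketched below. The assumptions here are mine: each page starts at a heading string that appears exactly once, the sections are listed in document order, and the SHELL template is a bare-bones stand-in for whatever header information your site really needs. The split_pages name and the tiny sample document are hypothetical.

```python
# Bare-bones stand-in for your own empty HTML shell.
SHELL = """<html>
<head><title>{title}</title></head>
<body>
{body}
</body>
</html>
"""

def split_pages(html, sections):
    """sections: (filename, title, start_marker) tuples, in document order."""
    starts = [html.index(marker) for _, _, marker in sections]
    ends = starts[1:] + [len(html)]  # each page ends where the next begins
    pages = {}
    for (filename, title, _), start, end in zip(sections, starts, ends):
        pages[filename] = SHELL.format(title=title, body=html[start:end].strip())
    return pages

main = ('<h1>EXECUTIVE SUMMARY</h1><p>Summary text.</p>'
        '<h1>I. INTRODUCTION</h1><p>Intro text.</p>')
sections = [
    ("summary.html", "Executive Summary", "<h1>EXECUTIVE SUMMARY</h1>"),
    ("intro.html", "Introduction", "<h1>I. INTRODUCTION</h1>"),
]
pages = split_pages(main, sections)
print(pages["intro.html"])
```

Each value in pages can then be written out to its filename in the same directory as index.html, so that links such as ./intro.html resolve.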

Third, and last, delete each of the main page cross-references changed during global S & R (these are all of the references without internal anchor # tags); these references are now being handled directly via the multiple, split HTML page documents. For clarity, these deleted references are thus for our example:

<a href="./summary.html">EXECUTIVE SUMMARY</a>

<a href="./intro.html">I. INTRODUCTION</a>

<a href="./internal.html">II. INTERNAL DOCUMENTS</a>

<a href="./web.html">III. WEB DOCUMENTS AND SEARCH</a>

<a href="./opps.html">IV. OPPORTUNITIES AND THREATS</a>

<a href="./conclusion.html">V. CONCLUSIONS</a>

Voilà. You now have multiple HTML pages from a Word document!

Posted: September 27, 2005

TBL on the Semantic Web

Though it has been out since June, I came across just today an interview with Tim Berners-Lee on the Semantic Web, conducted by Andrew Updegrove for the Consortium Standards Bulletin. I highly recommend this piece to anyone interested in an insider’s view of the creation and use of the semantic Web. Here are some highlights; all are direct quotes from Berners-Lee.

First, on the vision of the semantic Web:

The goal of the Semantic Web initiative is to create a universal medium for the exchange of data where data can be shared and processed by automated tools as well as by people. The Semantic Web is designed to smoothly interconnect personal information management, enterprise application integration, and the global sharing of commercial, scientific and cultural data.

Many large-scale benefits are, not surprisingly, evident for enterprise level applications. The benefits of being able to reuse and repurpose information inside the enterprise include both for savings and new discoveries. And of course, more usable data brings about a new wave of software development for data analysis, visualization, smart catalogues… not to mention new applications development. The point of the Semantic Web is in the potential for new uses of data on the Web, much of which we haven’t discovered yet.

As for status of the initiative, Berners-Lee directly addresses some critics by emphasizing the importance of automated tools and not author tagging:

It’s not about people encoding web pages; it’s about applications generating machine-readable data on an entirely different scale. Were the Semantic Web to be enacted on a page-by-page basis in this era of fully functional databases and content management systems on the Web, we would never get there. What is happening is that more applications — authoring tools, database technologies, and enterprise-level applications — are using the initial W3C Semantic Web standards for description (RDF) and ontologies (OWL).

Berners-Lee goes on to say:

One of the criticisms I hear most often is, “The Semantic Web doesn’t do anything for me I can’t do with XML”. This is a typical response of someone who is very used to programming things in XML, and never has tried to integrate things across large expanses of an organization, at short notice, with no further programming. One IT professional who made that comment around four years ago, said a year ago words to the effect, “After spending three years organizing my XML until I had a heap of home-made programs to keep track of the relationships between different schemas, I suddenly realized why RDF had been designed. Now I used RDF and its all so simple — but if I hadn’t have had three years of XML hell, I wouldn’t ever have understood.”

Many of the criticisms of the Semantic Web seems (to me at least!) the result of not having understood the philosophy of how it works. A critical part, perhaps not obvious from the specs, is the way different communities of practice develop independently, bottom up, and then can connect link by link, like patches sewn together at the edges. So some criticize the Semantic Web for being a (clearly impossible) attempt to make a complete top-down ontology of everything.

Others criticize the Semantic Web because they think that everything in the whole Semantic Web will have to be consistent, which is of course impossible. In fact, the only things I need to be consistent are the bits of the Semantic Web I am using to solve my current problem.

The web-like nature of the Semantic Web sometimes comes under criticism. People want to treat it as a big XML document tree so that they can use XML tools on it, when in fact it is a web, not a tree. A semantic tree just doesn’t scale, because each person would have their own view of where the root would have to be, and which way the sap should flow in each branch. Only webs can be merged together in arbitrary ways. I think I agree with criticisms of the RDF/XML syntax that it isn’t very easy to read. This raises the entry threshold. That’s why we wrote N3 and the N3 tutorial, to get newcomers on board with the simplicity of the concepts, without the complexity of that serialization.

Other insights from the interview are that early adoption is likely to come internally, from enterprises on their intranets; that there will definitely be first-mover advantages for software applications that embrace RDF and OWL; and that a more widely embraced rules-based language (think of a successor to Prolog) is likely to emerge.

Highly recommended reading!

Posted: September 19, 2005

Comprehensive Guide to a Professional Blog Site

Author’s Note: I am pleased to offer this comprehensive guide prepared from my “Preparing to Blog” series covering the first four months of learning in creating this AI3 blog site.

The citation for this effort is:

Michael K. Bergman, “Comprehensive Guide to a Professional Blog Site: A WordPress Example,” A Guide Book from the AI3 Blog Site, September 2005, 80 pp.

Click here to obtain a copy of this free guide (80 pp, 1016 K)

Gone beyond Blogger? Want to really be aggressive in functionality and scope of content for your personal, professional or corporate blog? If so, this Comprehensive Guide to a Professional Blog Site may be useful to you.

This Guide is the result of 350 hrs of learning and experimentation to test the boundaries of blog functionality, scope and capabilities. I myself began this process as a total newbie about six months ago — which likely shows in gaps and naïveté — but I have been aggressive in documenting as I have gone. The learning from my professional blog journey, still ongoing, is reflected in these pages.

This Guide addresses about 100 individual “how to” blogging topics and lessons, all geared to the content-focused and not occasional blogger. More than 140 citations from more than 80 experts provide additional guidance. The Guide itself occupies 80 pages. It is all free with no sign up required.

But there is hopefully more than one pony under the pile for those needing to join the “1% club” of purposeful, content-oriented, professional bloggers. In this Guide you will find discussion of these useful topics:

How to choose blogging software and add-on tools
Taking control of the blogging process by hosting your own site
Getting your blog to display and perform right
Effective techniques for converting existing documents to your blog site HTML
Being efficient in posting, organizing and work-flowing to allow your diarist activities to flow naturally and productively
Keeping the blog site pump primed with fresh and relevant content.

I created this Guide as a discipline in learning how to be a diarist or journalist, akin to the heyday era of “persons of letters” prior to the telegraph. In part, I undertook this discipline to rekindle those daily journal skills of the past. But, for the most part, I undertook the effort because I believe a fundamentally new means and mechanism for adaptive advantage is being created with social computing, of which blogging is a part.

Enjoy! And I welcome your corrections or suggestions for improvements.

Posted: September 18, 2005

Semantic Web and Ontology Tools

This AI3 blog maintains Sweet Tools, the largest available listing of semantic Web and related tools, with about 800 entries. Most are open source. Click here to see the current listing!

My current research efforts involve the semantic Web and ontologies. Under the semantic Web I include that topic itself, plus the related technologies and standards of metadata, ontologies, taxonomies, thesauri, controlled vocabularies, XML, RDF and OWL.

A good starting point on tools comes from Michael Denny, in an update of his 2002 ontology editor survey. Other tool surveys include a 2003 HP review from the SIMILE research program on metadata and thesaurus tools. The Semantic Web has a listing of about 245 tools on its beta Web site, and the W3C, as might be expected given its role in RDF and related standards, has an excellent starting point for developer resources, including entries for related standards and technologies.

The ONTOLOG community also lists some tools resources but, more importantly, has an excellent recommended reading compendium. These links are essential starting points for anyone beginning their investigations into the semantic Web.

Finally, Kendall Clark, editor of XML.com, just posted a fascinating piece on SPARQL 2.0, a possible query language for the semantic Web, and a longer article on the possible convergence of Web 2.0 and the semantic Web. As he puts it, “I’m starting to catch the scent of one of those big convergence things just possibly starting to happen. It smells like money!”

Posted: September 15, 2005

Getting Listed on Google Blog Search

Google announced its new beta blog search service this week, and I immediately went to check it out. To my dismay, none of my AI3 blog posts were listed! $#%&*#

My first hint of what to do came from the About Google Blog Search page, which indicated that while Google does not yet have a submission form for submitting pings, the new service does monitor updating services, specifically mentioning Weblogs.com. I then tried to access this site, which was slower than molasses and I timed out many times (I suspect many others were following the same path I was).

That got me into a whole investigation of ping and ping success in general with my WordPress installation. (See my earlier post on Pings and Trackbacks). I was alarmed to discover that many of my ping locations had not been updating well, for reasons that still remain somewhat murky (though others have noted sporadic miscues by WordPress in ping updates, not to mention some of the ping update sites recommended for it such as Ping-o-matic).

The WordPress dashboard suggested that Google was using Ping-o-matic as one of its update services for new listings, so I manually submitted my site again to Ping-o-matic and waited to see the results. Voila! After a reasonable delay of an hour or so, I found my posts and site now listed on the Google blog search service and other locations.

Thus, in the interim before Google completes its submission expansions, I recommend that WordPress bloggers who are not yet listed in the Google blog search:

Occasionally ping Ping-o-matic manually rather than relying on your updates being handled automatically (but DON’T do it too frequently, since that can be interpreted as spamming behavior)
On a one-time basis, raise your syndication feeds limit on the Options-Reading-Syndication Feeds panel in the dashboard so that it is large enough to include all of your desired recent postings
Manually submit an update at Ping-o-matic, and
Return the syndicated feeds number to your original amount in your dashboard.

With this simple approach, I can now happily report that all of the AI3 listings are now in the new Google blog search service, and so can yours!
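For the curious, a manual ping is just a small XML-RPC call. The Python sketch below builds the standard weblogUpdates.ping request and, separately, shows how it could be sent. Treat the endpoint address as an assumption on my part and check the ping service's own documentation before relying on it; the function names and the example blog name and URL are hypothetical. The same caution applies here as above: don't ping too frequently.

```python
import xmlrpc.client

# Assumed XML-RPC endpoint for Ping-o-matic; verify before use.
PING_ENDPOINT = "http://rpc.pingomatic.com/"

def build_ping(blog_name, blog_url):
    """Return the XML body of a weblogUpdates.ping request (nothing is sent)."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")

def send_ping(blog_name, blog_url, endpoint=PING_ENDPOINT):
    """Actually send the ping; needs network access, so it is not run here."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    return proxy.weblogUpdates.ping(blog_name, blog_url)

# Hypothetical blog name and address, used only to show the request body.
payload = build_ping("Example Blog", "http://example.com/blog/")
print(payload)
```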
