Html2Wiki


This extension officially lives at https://www.mediawiki.org/wiki/Extension:Html2Wiki. It is a MediaWiki extension used to import HTML content (including images) into the wiki.

Imagine having dozens, hundreds, maybe thousands of pages of HTML. And you want to get that into your wiki. Maybe you've got a website, or perhaps a documentation system that is in HTML format. You'd love to be able to use your wiki platform to edit, annotate, organize, and publish this content. That's where the Html2Wiki extension comes into play. You simply install the extension in your wiki, and then you are able to import entire zip files containing all the HTML + image content. Instead of months of work, you could be done in minutes.

Importing a file works like this:

                     Select
                       |
                       v
                    Upload
                       |
                       v
       Tidy ----->  Normalize
                       |
                       v
   QueryParse----->  Clean
                       |
                       v
       Pandoc ----> Convert
                       |
                       v
                     Save

Requirements or Dependencies

This extension was built on MediaWiki version 1.25alpha. It may not be compatible with earlier releases since there are a number of external libraries such as jQuery which have changed over time. Contact Us if you have version compatibility issues.

Since parsing the DOM is problematic when using PHP's native DOM manipulation (which is itself based on libxml), we use the QueryPath project to provide a more flexible parsing platform. The best tutorial on QueryPath is this IBM DeveloperWorks article; the most recent list of documentation for QueryPath is at this bug: https://github.com/technosophos/querypath/issues/151 (the API docs contain a CSS selector reference).
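
As an illustration only (not the extension's actual code), here is a minimal QueryPath sketch that strips boilerplate from an uploaded page with CSS selectors; the selector names and variables are assumptions:

  // Parse the uploaded HTML and drop elements we never keep.
  require 'vendor/autoload.php';           // QueryPath via Composer

  $qp = htmlqp( $rawHtml );                // $rawHtml: the uploaded page
  $qp->find( 'script' )->remove();         // scripts are not carried over
  $qp->top( 'div.navigation' )->remove();  // hypothetical navigation wrapper
  $cleanHtml = $qp->top( 'body' )->innerHTML();  // what remains goes to conversion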

Html2Wiki can import entire document sets and maintain a hierarchy of those documents. The $wgNamespacesWithSubpages variable will allow you to create a hierarchy in your wiki's 'main' namespace; and even automatically create navigation links to parent article content. Taking this further, the SubPageList extension creates navigation blocks for subpages.
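
For example, a minimal LocalSettings.php sketch (standard MediaWiki configuration) that turns on subpages in the main namespace:

  # LocalSettings.php: allow subpages in the main namespace so imported
  # collections such as MyCollection/Intro form a hierarchy with parent links.
  $wgNamespacesWithSubpages[NS_MAIN] = true;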

The document sets we were importing were based on generated source code documentation (coming from an open source documentation generator called Natural Docs) which creates DHTML "mouseovers" for glossary terms. To create similar functionality in the wiki environment, we will rely on the Lingo extension to create a Glossary of terms.

Usage

System Elements

Once installed, the Html2Wiki extension makes a new form available to Administrators of your wiki. Simply choose a file, click import and watch as your HTML is magically transformed into Wiki text.

You access the import HTML form at the Special:Html2Wiki page (similar to Special:Upload for regular media). The Html2Wiki extension also adds a convenient Import HTML link to the Tools panel of your wiki for quick easy access to the importer.

Single File

Enter a comment in the Comment field; it is logged in Recent Changes as well as at Special:Log.

The upload is automatically categorized according to the content provided.

You can optionally specify a "Collection Name" for your content. The Collection Name represents where this content is coming from (e.g. the book or website). Any unique identifier will do. The "Collection Name" is used to tag (categorize) all content that is part of that collection, and all content that is part of a Collection will be organized "under" that Collection Name in a hierarchy. This lets you have two or more articles in your wiki named "Introduction" if they belong to separate Collections. Specifying an existing Collection Name + article title will update the existing content. In fact, to reimport a single file and maintain its 'position' in a collection, you would specify the full path to the file.

Zip File

Choose a zip file to import. The zip file can contain any type of file, but only HTML and image files will be processed.


Mechanics

Zip archive handling

In order to handle the zip upload, we'll have to traverse all files and index hrefs as they exist. We'll need to map those to safe titles and rewrite the source to use those safe URLs. This has to be done for both anchors and images.

Practically speaking, MediaWiki is probably more flexible than we need, but we'll want to check:

[legaltitlechars] =>  %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+

Since MediaWiki uses (by default) First-letter capitals, you would normally need to account for that in rewriting all hrefs within the source. However, in practice, we use a Collection Name as the first path element, and MediaWiki will seamlessly redirect foo to Foo.
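
A rough sketch of this step (the variable names, collection name, and sanitizing regex are assumptions, not the extension's actual code):

  // Walk the zip upload and build a map from original paths to wiki-safe
  // titles, prefixed with the Collection Name to keep the hierarchy.
  $collection = 'MyDocs';                   // hypothetical Collection Name
  $zip = new ZipArchive();
  $zip->open( '/tmp/upload.zip' );          // hypothetical temporary upload
  $titleMap = [];
  for ( $i = 0; $i < $zip->numFiles; $i++ ) {
      $path = $zip->getNameIndex( $i );
      if ( !preg_match( '/\.html?$/i', $path ) ) {
          continue;                         // only HTML pages get article titles
      }
      // Squash anything outside a conservative subset of $wgLegalTitleChars
      $safe = preg_replace( '/[^A-Za-z0-9\/\-._]/', '_', $path );
      $titleMap[ $path ] = $collection . '/' . $safe;
  }
  $zip->close();
  // $titleMap is later used to rewrite anchor hrefs and image srcs in the source.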


Styles and Scripts

Cascading Style Sheets (CSS) and JavaScript (JS) are not kept as part of the transformation, although we are working on including CSS.

Wiki Text markup

The fundamental requirement for this extension is to transform input (HTML) into Wiki Text (see http://www.mediawiki.org/wiki/Help:Formatting) because that is the format stored by the MediaWiki system. Originally, it was envisioned that we would make API calls to the Parsoid service which is used by the Visual Editor extension. However, Parsoid is not very flexible in the HTML that it will handle. To get a more flexible converter, we use the Pandoc project which is able to (read and) write to MediaWiki Text format.
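
As a sketch of how such a conversion can be driven from PHP (illustrative only; the input path is hypothetical, and this is not necessarily how the extension invokes Pandoc):

  // Convert an HTML fragment to MediaWiki markup by piping it through pandoc.
  $html = file_get_contents( '/tmp/h2w/chapter1.html' );  // hypothetical path
  $spec = [ 0 => [ 'pipe', 'r' ], 1 => [ 'pipe', 'w' ] ];
  $proc = proc_open( 'pandoc -f html -t mediawiki', $spec, $pipes );
  fwrite( $pipes[0], $html );
  fclose( $pipes[0] );
  $wikiText = stream_get_contents( $pipes[1] );  // the article body to save
  fclose( $pipes[1] );
  proc_close( $proc );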

For each source type (i.e. UVM, InfoHub) we will need to survey the content to identify the essential content, and remove navigation, JavaScript, presentational graphics, etc. We should have a "fingerprint" that we can use to sniff out the type of document set that the user is uploading to the wiki.

As a result of sniffing the source type, we can properly index and import the content only, while discarding the dross. We can likewise apply the correct transformation to the source. For example, there is a bunch of Verilog source in UVM that should be converted to GeSHi <source> tags, while there may not be any in the InfoHubs.

Form file content is saved to the server (tmp), and that triggers a conversion attempt. A title is proposed from the text (and checked in the database), and the user can override the naming. The HTML is converted to wiki text for the content of the article.

Image references are assumed to be either relative, e.g. src="../images/foo.jpg", and contained in the zip file, or absolute (a full http:// URL), in which case they are not local to the wiki.

Want to check your source for a list of image files?

grep -P -r --only-matching '(?<=<img src=["'\''])([^"'\'']*)' ./my/html/files/

For each of the image files (png, jpg, gif) contained in the zip archive, the image asset is saved into the wiki with automatic file naming based on the "Collection Name" + path in the zip file.

Also, each image is tagged with the collection name for easier identification.

Image references in the HTML source are automatically updated to reference the in-wiki images.
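
A sketch of that rewrite with QueryPath, reusing the $titleMap idea from the zip-handling sketch above (the Special:FilePath target and variable names are assumptions, not necessarily what the extension emits):

  // Point each <img> at its in-wiki file; $titleMap maps original src values
  // to the "Collection Name" + path file titles described above.
  $qp = htmlqp( $html );
  foreach ( $qp->find( 'img' ) as $img ) {
      $src = $img->attr( 'src' );
      if ( isset( $titleMap[ $src ] ) ) {
          $img->attr( 'src', '/wiki/Special:FilePath/' . $titleMap[ $src ] );
      }
  }
  // The modified DOM is then serialized back to HTML for conversion.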

@todo document the $wgEliminateDuplicateImages option

Database

The extension currently does not make any schema changes to the MediaWiki system.

What, if any, additional tables could we want in the database? [1]

We may need to store checksums for zip uploads, because we don't want to store the zip itself, but we may want to recognize a re-upload attempt?
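
For instance, a one-line sketch (an assumption about how this could work, not current behavior):

  // Fingerprint the uploaded archive so a later re-upload can be recognized
  // without keeping the zip itself; $tmpZipPath is the temporary upload path.
  $checksum = sha1_file( $tmpZipPath );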

Logging

Logging is provided at Special:Log/html2wiki. The facility for logging will tap into LogEntry as outlined at https://www.mediawiki.org/wiki/Manual:Logging_to_Special:Log

Interestingly, SpecialUpload must call LogEntry from its hooks; SpecialImport calls LogPage, which itself invokes LogEntry (see includes/logging).
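
Following that manual page, a minimal sketch of writing an entry to Special:Log/html2wiki (the log type and subtype names, and the variables, are illustrative):

  // Record an import in the html2wiki log (per Manual:Logging_to_Special:Log).
  $logEntry = new ManualLogEntry( 'html2wiki', 'import' );  // type / subtype
  $logEntry->setPerformer( $user );     // the importing user
  $logEntry->setTarget( $title );       // the article that was created
  $logEntry->setComment( $comment );    // the Comment field from the form
  $logId = $logEntry->insert();
  $logEntry->publish( $logId );         // also surfaces in Recent Changes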



In order to use Parsoid at all, we need to have the content conform to the MediaWiki DOM spec, which is based on HTML5 and RDFa: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Ref_and_References


We need to parse the incoming content, validate it, possibly transform the document type to HTML5 (<!DOCTYPE html>), and then transform the HTML5 to the MediaWiki DOM spec (see http://www.w3.org/TR/html5/syntax.html#html-parser).
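
A sketch of that validate/normalize step using PHP's tidy extension (the option choices and variable names are assumptions; the HTML5 doctype is re-emitted by hand):

  // Repair the incoming markup with tidy, then force an HTML5 doctype.
  $config = [
      'output-xhtml' => true,   // normalize to well-formed markup
      'doctype'      => 'omit', // drop the legacy doctype; we add our own below
      'wrap'         => 0,
  ];
  $tidy = new tidy();
  $tidy->parseString( $rawHtml, $config, 'utf8' );
  $tidy->cleanRepair();
  $normalized = "<!DOCTYPE html>\n" . tidy_get_output( $tidy );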

UVM content is assumed to be HTML 4 Transitional (as some of it expressly declares). @todo verify this with Tidy

The InfoHubs content is produced by Quadralay WebWorks AutoMap 2003 for FrameMaker 8.0.2.1385 and outputs XHTML 1.0

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

@link http://www.w3.org/TR/html5/obsolete.html


Parsoid offers an API with basically two actions: POST and GET. You can test the API at http://parsoid-lb.eqiad.wikimedia.org/_html/

You can also test it locally on the VM through port 8000.


We're converting the Mentor Graphics Documentation System:

  • mgc_html v3.2_2.03
  • infohub_core v3.2_2.02


Variables we care about

  1. We probably want a variable that can interact with the max upload size (see the configuration sketch after this list)
  2. $wgMaxUploadSize[*] = 104857600 bytes (100 MB)
  3. $wgFileBlacklist we don't care about because we use our own file upload and mime detection
  4. $wgVisualEditorParsoidURL we can use for API requests to Parsoid
  5. $wgLegalTitleChars we use to check for valid file naming
  6. $wgMaxArticleSize default is 2048 KB, which may be too small?
  7. $wgMimeInfoFile we don't yet use
  8. Also, how do imagelimits come into play? http://localhost:8080/w/api.php?action=query&meta=siteinfo&format=txt
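
A LocalSettings.php sketch covering the size-related settings above (the values are examples, not recommendations):

  # LocalSettings.php: raise the limits that matter for large HTML imports.
  $wgMaxUploadSize['*'] = 100 * 1024 * 1024;  # 100 MB overall upload ceiling
  $wgMaxArticleSize = 4096;                   # in KB; the 2048 default may be too small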


Features

Add a link to the sidebar for our extension. $wgUploadNavigationUrl is for overriding the regular 'upload' link (not what we want).

Instead, we have to edit MediaWiki:Common.js; see https://www.mediawiki.org/wiki/Manual:Interface/Sidebar

Internationalization

http://localhost:8080/wiki/Special:Html2Wiki?uselang=qqx shows the interface messages. You can see most of the messages in Special:AllMessages if you filter by the prefix 'Html2Wiki'.


Error handling

  1. Submitting the form with no file: "There was an error handling the file upload: No file sent."
  2. Choosing a file that is too big: the limit is set to 100 MB.
  3. Choosing a file of the wrong type: "There was an error handling the file upload: Invalid file format."
  4. Choosing a file that has completely broken HTML: you could end up with no wiki markup, but the importer tries hard to be generous.


Developing

This extension was originally written by and is maintained by Greg Rundlett of eQuality Technology. Additional developers, testers, documentation helpers, and translators welcome!

The project code is hosted on both GitHub and Wikimedia Foundation servers on the Html2Wiki Extension page. You should use git to clone the project and submit pull requests. The code is simultaneously updated on MediaWiki servers and GitHub, so feel free to fork, or pull it from either location.

git clone https://gerrit.wikimedia.org/r/p/mediawiki/

or (with gerrit auth)

git clone ssh://USERNAME@gerrit.wikimedia.org:29418/mediawiki/services/parsoid

The best way to setup a full development environment is to use MediaWiki Vagrant. This handy bit of wizardry will create a full LAMP stack for you and package it into a VirtualBox container (among others).

<References>