Html2Wiki

This extension to MediaWiki is used to import HTML content (including images) into the wiki.  It officially lives at https://www.mediawiki.org/wiki/Extension:Html2Wiki


Imagine having dozens, hundreds, maybe thousands of pages of HTML.  And you want to get that into your wiki.  Maybe you've got a website, or perhaps a documentation system that is in HTML format.  You'd love to be able to use your wiki platform to edit, annotate, organize, and publish this content.  That's where the '''Html2Wiki''' extension comes into play.  You simply install the extension in your wiki, and then you are able to import entire zip files containing all the HTML + image content.  Instead of months of work, you could be done in minutes.


== Requirements or Dependencies ==


This extension was built on MediaWiki version 1.25alpha.  It may not be compatible with earlier releases since there are a number of external libraries such as [https://www.mediawiki.org/wiki/JQuery jQuery] which have changed over time.  Contact Us if you have version compatibility issues.


It may depend on a Parsoid service, which is used to transform an HTML DOM into wikitext.  More at [http://www.mediawiki.org/wiki/Parsoid http://www.mediawiki.org/wiki/Parsoid]

Since parsing the DOM is problematic when using PHP's native DOM manipulation (which is itself based on libxml), we use the [http://querypath.org/ QueryPath project] to provide a more flexible parsing platform.  The best tutorial on QueryPath is this [http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html IBM DeveloperWorks article], and the most recent list of QueryPath documentation is tracked at https://github.com/technosophos/querypath/issues/151
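
For orientation, here is a minimal QueryPath sketch (not the extension's actual code), assuming QueryPath is installed via Composer: load an HTML file, strip <code>script</code> tags, and list the image references.
<source lang="php">
<?php
// Minimal QueryPath sketch: load an HTML file, drop <script> tags,
// and list the image references it contains.
require_once 'vendor/autoload.php'; // assumes QueryPath installed via Composer

$qp = htmlqp( 'docs/page.html' );   // htmlqp() is tolerant of real-world HTML
$qp->find( 'script' )->remove();    // strip JavaScript before any conversion

foreach ( $qp->top()->find( 'img' ) as $img ) {
    echo $img->attr( 'src' ) . "\n"; // each match is itself queryable
}

$bodyHtml = $qp->top()->find( 'body' )->html(); // cleaned markup for conversion
</source>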


Html2Wiki can import entire document sets and maintain a hierarchy of those documents.  The [http://www.mediawiki.org/wiki/Manual:$wgNamespacesWithSubpages $wgNamespacesWithSubpages] variable will allow you to create a hierarchy in your wiki's 'main' namespace; and even automatically create navigation links to parent article content.  Taking this further, the [https://www.mediawiki.org/wiki/Extension:SubPageList SubPageList] extension creates navigation blocks for subpages.
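
For example, subpages in the main namespace are enabled with one line in <code>LocalSettings.php</code>:
<source lang="php">
// In LocalSettings.php: allow subpages (e.g. Foo/Bar/Baz) in the main namespace,
// so an imported collection can keep its directory-like hierarchy.
$wgNamespacesWithSubpages[NS_MAIN] = true;
</source>
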
<pre>[legaltitlechars] =>  %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+</pre>


Since MediaWiki uses (by default) First-letter capitals, you would normally need to account for that in rewriting all hrefs within the source.  However, in practice, we use a Collection Name as the first path element, and MediaWiki will seamlessly redirect foo to Foo.






The fundamental requirement for this extension is to transform input (HTML) into Wiki Text (see [http://www.mediawiki.org/wiki/Help:Formatting http://www.mediawiki.org/wiki/Help:Formatting]) because that is the format stored by the MediaWiki system.  Originally, it was envisioned that we would make API calls to the Parsoid service which is used by the Visual Editor extension.  However, Parsoid is not very flexible in the HTML that it will handle.  To get a more flexible converter, we use the [https://github.com/jgm/pandoc Pandoc] project which is able to (read and) [https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Writers/MediaWiki.hs write to MediaWiki Text format].
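
As a rough sketch of that step (not necessarily how the extension invokes it), the conversion boils down to something like:
<source lang="php">
// Rough sketch of the conversion step, assuming pandoc is installed on the
// server; the function name is illustrative, not the extension's API.
function convertHtmlToWikiText( $htmlFile ) {
    $cmd = 'pandoc -f html -t mediawiki ' . escapeshellarg( $htmlFile );
    return shell_exec( $cmd ); // wikitext on success, null on failure
}

$wikiText = convertHtmlToWikiText( '/tmp/html2wiki/page.html' );
</source>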


For each source type (e.g. UVM, InfoHub) we will need to survey the content to identify what is essential, and remove navigation, JavaScript, presentational graphics, etc.  We should have a "fingerprint" that we can use to sniff out the type of document set that the user is uploading to the wiki.
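
A hypothetical sniffing routine might just look for distinctive strings in the markup; the markers below are illustrative guesses, not the extension's actual rules.
<source lang="php">
// Hypothetical fingerprinting sketch; the markers are illustrative guesses,
// not the extension's actual detection rules.
function sniffCollectionType( $html ) {
    if ( strpos( $html, 'Generated by Natural Docs' ) !== false ) {
        return 'UVM';
    }
    if ( stripos( $html, 'infohub' ) !== false ) {
        return 'InfoHub';
    }
    return 'generic';
}
</source>
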
Form file content is saved to the server (tmp), and that triggers a conversion attempt. A title is proposed from the text (and checked in the database), which the user can override. The HTML is converted to wiki text for the content of the article.
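
The title check can lean on core's <code>Title</code> class; a sketch with illustrative variable names:
<source lang="php">
// Sketch of proposing and checking a title; $collectionName and $relativePath
// are illustrative variables, not the extension's actual fields.
$proposed = Title::newFromText( $collectionName . '/' . $relativePath );
if ( $proposed === null ) {
    // the proposed name contains illegal title characters
} elseif ( $proposed->exists() ) {
    // an article already exists; let the user rename or overwrite
}
</source>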


Image references are either assumed to be relative, e.g. <code>src="../images/foo.jpg"</code>, and contained in the zip file, or absolute, e.g. <code>src="http://example.com/images/foo.jpg"</code>, in which case they are not local to the wiki.
 
Want to check your source for a list of image files?
<source lang="bash">
grep -P -r --only-matching '(?<=<img src=["'\''])([^"'\'']*)' ./my/html/files/
</source>
Or you could use PHPDOM, but it's awfully finicky and under-documented.


For each of the image files (png, jpg, gif) contained in the zip archive, the image asset is saved into the wiki with automatic file naming based on the "Collection Name" + path in the zip file.
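
Illustratively, the target file title could be assembled along these lines (the exact naming scheme is determined by the extension, not this sketch):
<source lang="php">
// Illustrative naming sketch: "Collection Name" + path inside the zip,
// flattened into a legal file title.
$collection = 'uvm-1.1d';
$zipPath    = 'docs/html/images/foo.gif';
$fileTitle  = Title::makeTitleSafe( NS_FILE, $collection . '-' . str_replace( '/', '-', $zipPath ) );
</source>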


Also, each image is tagged with the collection name for easier identification.


Image references in the HTML source are automatically updated to reference the in-wiki images.
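
A sketch of that rewrite, done here with QueryPath on the HTML before conversion (the helper name is hypothetical):
<source lang="php">
// Sketch: point each img src at its in-wiki file name before conversion.
// mapToWikiFileName() is a hypothetical helper, not part of the extension.
$qp = htmlqp( $rawHtml );                // $rawHtml: the page's HTML source
foreach ( $qp->find( 'img' ) as $img ) {
    $img->attr( 'src', mapToWikiFileName( $img->attr( 'src' ) ) );
}
$rewrittenHtml = $qp->top()->html();
</source>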


@todo document the $wgEliminateDuplicateImages option (eliminate duplicate images based on checksum?)


== Database ==
The extension currently does not make any schema changes to the MediaWiki system.


What, if any, additional tables could we want in the database? <ref>https://www.mediawiki.org/wiki/Manual:Hooks/LoadExtensionSchemaUpdates</ref>


We may want to store checksums for zip uploads: we don't want to store the zip itself, but we may want to recognize a re-upload attempt.
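
If we do, the checksum itself is cheap to compute at upload time; a sketch:
<source lang="php">
// Sketch only: a checksum of the uploaded zip would let us recognize a
// re-upload of the same archive without storing the zip itself.
$checksum = sha1_file( $tmpZipPath ); // $tmpZipPath: the temporary upload path
</source>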


== Logging ==
Logging is provided at [[Special:Log/html2wiki]].  The facility for logging taps into <code>LogEntry</code> as outlined at https://www.mediawiki.org/wiki/Manual:Logging_to_Special:Log
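
Concretely, that means core's <code>ManualLogEntry</code> class; a sketch (the action name and comment are illustrative):
<source lang="php">
// Sketch of writing an entry to Special:Log/html2wiki via ManualLogEntry.
// The action name ('import') and comment text are illustrative.
$logEntry = new ManualLogEntry( 'html2wiki', 'import' );
$logEntry->setPerformer( $user );   // the User doing the import
$logEntry->setTarget( $title );     // the Title of the imported article
$logEntry->setComment( 'Imported from ' . $collectionName );
$logId = $logEntry->insert();
$logEntry->publish( $logId );
</source>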


Interestingly, SpecialUpload must call <code>LogEntry</code> from its hooks, while SpecialImport calls <code>LogPage</code>, which itself invokes <code>LogEntry</code> (see includes/logging).


More information can be found at https://www.mediawiki.org/wiki/Extension:Html2Wiki


In order to use Parsoid at all, we need to have the content conform to the


== Developing ==
This extension was originally written by, and is maintained by, Greg Rundlett of [http://eQuality-Tech.com eQuality Technology]. Additional developers, testers, documentation helpers, and translators are welcome!


The project code is hosted on both [https://github.com/freephile/Html2Wiki GitHub] and WikiMedia Foundation servers on the [https://www.mediawiki.org/wiki/Extension:Html2Wiki Html2Wiki Extension page].  You should use git to clone the project and submit pull requests. The code is simultaneously updated on MediaWiki servers and GitHub, so feel free to fork, or pull it from either location.
<source lang="bash">
git clone https://gerrit.wikimedia.org/r/p/mediawiki/
</source>
or (with gerrit auth)
<source lang="bash">
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/mediawiki/services/parsoid
</source>


The best way to setup a full development environment is to use [https://www.mediawiki.org/wiki/MediaWiki-Vagrant MediaWiki Vagrant].  This handy bit of wizardry will create a full LAMP stack for you and package it into a VirtualBox container (among others).


 
== Uploads in MediaWiki ==
How does upload work in MediaWiki?
 
Parsoid.hooks.php has a hook <code>onFileUpload</code>, which refers to <code>filerepo/file/LocalFile.php</code>.
 
If I'm reading it correctly, LocalFile.php is mostly called from a Repo object.
 
<code>LocalFile::upload()</code> and <code>recordUpload()</code> are informative.
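
For comparison, a programmatic upload with those classes looks roughly like this (in the spirit of <code>maintenance/importImages.php</code>; the argument list is abbreviated and the variables are illustrative):
<source lang="php">
// Rough sketch of a programmatic file upload using core classes; argument
// list abbreviated, variable names illustrative.
$title = Title::makeTitleSafe( NS_FILE, $wikiFileName );
$file  = wfLocalFile( $title );    // LocalFile from the local repo
$file->upload( $tmpPath, 'Imported by Html2Wiki', $pageText );
</source>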
 
{{References}}