This extension to MediaWiki is used to import HTML content into the wiki.

__NOTOC__ __NOEDITSECTION__

<!-- {{Feature
|explains= Special:Html2Wiki
|description=Convert web pages, Google Docs, or entire websites (including images) to your wiki
|notes=Authored by [[User:Freephile|Greg Rundlett]]
|tests=Interesting test case: http://howtoreallypronouncegif.com/
|examples=
}} -->
This extension officially lives at [[mw:Extension:Html2Wiki]].
  
== Requirements or Dependencies ==
See the documentation there, since it is maintained with the software.

This extension was built on MediaWiki version 1.25alpha, and is likely not compatible with earlier releases, since a number of external libraries, such as [https://www.mediawiki.org/wiki/JQuery jQuery], have changed over time.

This site may host development ideas or interesting examples/demos.

It may depend on a Parsoid service, which is used to transform an HTML DOM into wikitext. More at [http://www.mediawiki.org/wiki/Parsoid http://www.mediawiki.org/wiki/Parsoid].

Since parsing the DOM is problematic when using PHP's native DOM manipulation (which is itself based on libxml), we use the [http://querypath.org/ QueryPath project] to provide a more flexible parsing platform.

== Other conversion tools ==
Html2Wiki relies on <code>pandoc</code> to do format conversion. Here are some other approaches to doing conversions.

=== LibreOffice ===

LibreOffice Writer can connect to a wiki, allowing you to edit and save articles in the wiki.

* Make sure your LibreOffice can export MediaWiki markup directly from any format that LibreOffice can read, by installing the wiki publisher package:

  sudo apt-get install libreoffice-wiki-publisher

With this library installed, you can now export documents straight out of LibreOffice.
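
The conversion can also be done headlessly. A minimal sketch, assuming the export filter registered by the wiki publisher package is named <code>MediaWiki</code> (the name could differ across LibreOffice versions) and reusing the sample file from the two-step example below:

<source lang="bash">
# Produce awk.cheat.sheet.txt containing MediaWiki markup in the current directory.
libreoffice --headless --convert-to txt:MediaWiki /tmp/awk.cheat.sheet.doc
</source>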
  
=== Two-step conversion ===

This isn't really better than using LibreOffice directly, but it is an option that at least lets you compare the output (for example, when direct export gives a bad result).

* Convert a doc to MediaWiki markup by converting to HTML first, and then using <code>pandoc</code> to convert the HTML:

 libreoffice --headless --convert-to html /tmp/awk.cheat.sheet.doc && \
 pandoc awk.cheat.sheet.html -o awk.cheat.sheet.mw -f html -t mediawiki

=== Online conversion ===

https://devotter.com/converter is a web-form interface to pandoc.

Note: [[MediaWiki/Toolbox]] explains how we add a custom link to the "toolbox" element of this site.

# {{@todo}} [https://phabricator.wikimedia.org/project/board/1094/ Html2Wiki workboard]
# {{@todo}} upgrade extension to work with new loading mechanism
# {{@todo}} Create a service to import Google Docs to wiki
# {{@todo}} Expand Html2Wiki to include anything that pandoc supports
  
== Usage ==
Html2Wiki can import entire document sets and maintain a hierarchy of those documents. The [http://www.mediawiki.org/wiki/Manual:$wgNamespacesWithSubpages $wgNamespacesWithSubpages] variable allows you to create a hierarchy in your wiki's 'main' namespace, and even automatically create navigation links to parent article content. Taking this further, the [https://www.mediawiki.org/wiki/Extension:SubPageList SubPageList] extension creates navigation blocks for subpages.

The document sets we were importing were based on generated source-code documentation (coming from an open source documentation generator called [http://naturaldocs.org/ Natural Docs]), which creates DHTML "mouseovers" for glossary terms. To create similar functionality in the wiki environment, we rely on the [https://www.mediawiki.org/wiki/Extension:Lingo Lingo] extension to create a glossary of terms.

Select a file using the import HTML form found on the Special:Html2Wiki page (similar to Special:Upload for regular images). When Special:Html2Wiki is installed, it adds a link to the 'Tools' section of your wiki for quick, easy access to the importer.

Enter a comment in the Comment field; the comment is logged in the 'Recent Changes' content as well as in the Special:Log area.

The upload is automatically categorized according to the content provided.

You can optionally specify a "Collection Name" for your content. The Collection Name represents where this content is coming from (e.g. the book or website). Any unique identifier will do. The "Collection Name" is used to tag (categorize) all content that is part of that collection, and all content that is part of a Collection is organized "under" that Collection Name in a hierarchy. This lets you have two or more articles in your wiki named "Introduction" if they belong to separate Collections. Specifying an existing Collection Name + article title will update the existing content. In fact, to reimport a single file and maintain its 'position' in a collection, you would specify the full path to the file.
 
 
 
== Uploads in MediaWiki ==
 
How does upload work?
 
 
 
Parsoid.hooks.php has a hook <code>onFileUpload</code>, which refers to <code>filerepo/file/LocalFile.php</code>.
 
 
 
If I'm reading it correctly, LocalFile.php is mostly called from a Repo object.
 
 
 
<code>LocalFile::upload()</code> and <code>recordUpload()</code> are informative.
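
To trace that reading of the code, a quick search sketch over a MediaWiki checkout (run from the MediaWiki root; the pattern is just an illustration):

<source lang="bash">
# Find where upload() and recordUpload() are defined or called in the file repo code.
grep -rn --include='*.php' -E 'function (upload|recordUpload)|->recordUpload\(' includes/filerepo/
</source>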
 
 
 
 
 
== Zip archive handling ==
 
 
 
 
 
In order to handle the zip upload, we'll have to traverse all files and index hrefs as they exist.  We'll need to map those to safe titles and rewrite the source to use those safe URLs.  This has to be done for both anchors and images.
 
 
 
Practically speaking, MediaWiki is probably more flexible than we need, but we'll want to check:
 
 
 
<pre>[legaltitlechars] =>  %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+</pre>
 
 
 
Since the wiki uses first-letter capitalization for titles, we'll need to account for that when rewriting all hrefs.
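
A rough sketch of that indexing step, assuming the uploaded zip has been unpacked to a hypothetical <code>/tmp/collection</code> directory; the resulting list is what we would then map to safe, first-letter-capitalized titles:

<source lang="bash">
# Unpack the upload and list every href/src value that will need rewriting.
unzip -o /tmp/upload.zip -d /tmp/collection
grep -r -h -o -P '(?<=href=")[^"]*|(?<=src=")[^"]*' --include='*.html' \
    /tmp/collection | sort -u > /tmp/collection.hrefs
</source>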
 
 
 
 
 
== Styles and Scripts ==
 
 
 
Cascading Style Sheets (CSS) and JavaScript (JS) are not kept as part of the transformation, although we are working on including CSS.
 
 
 
== Wiki Text markup ==
 
 
 
The fundamental requirement for this extension is to transform input (HTML) into Wiki Text (see [http://www.mediawiki.org/wiki/Help:Formatting http://www.mediawiki.org/wiki/Help:Formatting]) because that is the format stored by the MediaWiki system.  Originally, it was envisioned that we would make API calls to the Parsoid service which is used by the Visual Editor extension.  However, Parsoid is not very flexible in the HTML that it will handle.  To get a more flexible converter, we use the [https://github.com/jgm/pandoc Pandoc] project which is able to (read and) [https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Writers/MediaWiki.hs write to MediaWiki Text format].
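
As a rough idea of that conversion step, here is a plain pandoc invocation with hypothetical file names (not the extension's internal call):

<source lang="bash">
# Convert an HTML page to MediaWiki wiki text.
pandoc -f html -t mediawiki input.html -o output.wiki
</source>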
 
 
 
 
 
For each source type (e.g. UVM, InfoHub) we will need to survey the content to identify the essential content, and remove navigation, JavaScript, presentational graphics, etc. We should have a "fingerprint" that we can use to sniff out the type of document set that the user is uploading to the wiki.

As a result of sniffing the source type, we can properly index and import only the content, while discarding the dross. We can likewise apply the correct transformation to the source. For example, there is a bunch of Verilog source in UVM that should be converted to GeSHi <nowiki><source></nowiki> tags, while there may not be any in the InfoHubs.
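
A crude sketch of such a fingerprint, assuming the generator leaves an identifiable string in its output (Natural Docs and WebWorks generally do) and a hypothetical unpacked upload directory:

<source lang="bash">
# Sniff the document-set type from generator strings in the uploaded HTML.
if grep -riq 'Natural Docs' /tmp/collection; then
    echo "Looks like a Natural Docs (UVM-style) document set"
elif grep -riq 'WebWorks' /tmp/collection; then
    echo "Looks like a WebWorks (InfoHub-style) document set"
fi
</source>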
 
 
 
The uploaded file content is saved to the server (tmp), and that triggers a conversion attempt. A title is proposed from the text (and checked in the database), and the user can override the naming. The HTML is converted to wiki text for the content of the article.

Image references are extracted from the source:
 
<source lang="bash">
 
grep -P -r --only-matching '(?<=<img src=["'\''])([^"'\'']*)' ./data/uvm-1.1d/docs/html/files/
 
</source>
 
Or you could use PHP's DOM (PHPDOM), but it's awfully finicky and under-documented.
 
 
 
 
 
For each of those images, the image asset is retrieved and uploaded, with automatic file naming based on the "prefix" + src attribute.

Also, each image is tagged with the collection name for easier identification.
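
A minimal sketch of that retrieval/upload step, reusing the document path from the grep example above, a hypothetical list file of src values, and MediaWiki's stock <code>importImages.php</code> maintenance script (run from the MediaWiki root); tagging with the collection category would be a follow-up edit:

<source lang="bash">
# Copy every referenced image out of the document set, then bulk-upload them.
mkdir -p /tmp/uvm-images
while read -r src; do
    cp "./data/uvm-1.1d/docs/html/files/$src" /tmp/uvm-images/
done < /tmp/collection.images
php maintenance/importImages.php --comment="Imported for collection uvm-1.1d" /tmp/uvm-images
</source>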
 
 
 
Once all the images are contained in the wiki, the wiki markup for the article can be updated to reference those images. In other words, it may be possible to upload an HTML source file and batch a job to import all images and update the article source to use the images... Or, since we know ahead of time what the image file name will be, we can just reference the non-existent images in the article. They will exist in the wiki after a short delay required to fetch and process the image files.
 
 
 
(Eliminate duplicate images based on checksum?)
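
That could be a pre-pass over the image directory before uploading; a quick sketch using checksums (directory name as in the sketch above):

<source lang="bash">
# List image files that share an MD5 checksum (candidates for de-duplication).
find /tmp/uvm-images -type f -exec md5sum {} + | sort | uniq --check-chars=32 --all-repeated=separate
</source>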
 
 
 
What, if any, additional tables do we need in the database? <ref>https://www.mediawiki.org/wiki/Manual:Hooks/LoadExtensionSchemaUpdates</ref>
 
 
 
We may need to store checksums for zip uploads: we don't want to store the zip itself, but we may want to recognize a re-upload attempt.
 
 
 
Logging is provided at [[Special:Log/html2wiki]]. The facility for logging taps into <code>LogEntry</code>, as outlined at https://www.mediawiki.org/wiki/Manual:Logging_to_Special:Log
 
 
 
Interestingly, SpecialUpload must call <code>LogEntry</code> from its hooks. SpecialImport calls <code>LogPage</code>, which itself invokes <code>LogEntry</code> (see includes/logging).
 
 
 
 
 
More information can be found at https://www.mediawiki.org/wiki/Extension:Html2Wiki
 
 
 
The code is simultaneously updated on MediaWiki servers and GitHub, so feel free to fork, or pull it from either location.
 
 
 
* @todo Publish the extension upstream, first with the @link http://www.mediawiki.org/wiki/Template:Extension
 
 
 
<source lang="bash">
 
git clone https://gerrit.wikimedia.org/r/p/mediawiki/
 
</source>
 
 
 
or (with gerrit auth)
 
<source lang="bash">
 
git clone ssh://USERNAME@gerrit.wikimedia.org:29418/mediawiki/services/parsoid
 
</source>
 
 
 
In order to use Parsoid at all, we need to have the content conform to the MediaWiki DOM spec, which is based on HTML5 and RDFa: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Ref_and_References
 
 
 
 
 
We need to parse the incoming content, validate it, possibly transform the document type to HTML5 (<code><!DOCTYPE html></code>), and then transform the HTML5 to the MediaWiki DOM spec. See http://www.w3.org/TR/html5/syntax.html#html-parser
 
 
 
UVM content is assumed to be HTML4 Transitional (as some of it expressly declares). @todo verify this with Tidy.
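
A quick way to do that verification from the command line (the sample file name is hypothetical; the directory comes from the grep example above):

<source lang="bash">
# Show the declared doctype, then let tidy report markup problems.
grep -i -m1 '<!DOCTYPE' ./data/uvm-1.1d/docs/html/files/index.html
tidy -quiet -errors ./data/uvm-1.1d/docs/html/files/index.html
</source>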
 
 
 
The InfoHubs content is produced by Quadralay WebWorks AutoMap 2003 for FrameMaker 8.0.2.1385 and outputs XHTML 1.0:
 
<pre>
 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 
</pre>
 
 
 
@link http://www.w3.org/TR/html5/obsolete.html
 
 
 
 
 
[https://www.mediawiki.org/wiki/Parsoid/API Parsoid offers an API] with basically two actions: POST and GET. You can test the API at [http://parsoid-lb.eqiad.wikimedia.org/_html/ http://parsoid-lb.eqiad.wikimedia.org/_html/]
 
 
 
You can also test it locally on the VM through port 8000.
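
For example, something along these lines posts HTML to the local service and returns wikitext. This is a sketch against the Parsoid v3 HTTP API with an assumed domain of <code>localhost</code>; the exact route depends on the Parsoid version and its configured domain:

<source lang="bash">
# Convert an HTML fragment to wikitext via the local Parsoid service on port 8000.
curl -s -X POST --data-urlencode 'html=<p>Hello <b>world</b></p>' \
    http://localhost:8000/localhost/v3/transform/html/to/wikitext/
</source>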
 
 
 
 
 
We're converting the Mentor Graphics Documentation System:
 
* mgc_html v3.2_2.03
 
* infohub_core v3.2_2.02
 
 
 
 
 
== Variables we care about ==
 
 
 
# We probably want a variable that can interact with the max upload size
# [https://www.mediawiki.org/wiki/Manual:$wgMaxUploadSize $wgMaxUploadSize][*] = 104857600 bytes (100 MB)
# [https://www.mediawiki.org/wiki/Manual:$wgFileBlacklist $wgFileBlacklist] we don't care about, because we use our own file upload and MIME detection
# [https://www.mediawiki.org/wiki/Extension:VisualEditor $wgVisualEditorParsoidURL] we can use for API requests to Parsoid
# [https://www.mediawiki.org/wiki/Manual:$wgLegalTitleChars $wgLegalTitleChars] we use to check for valid file naming
# [https://www.mediawiki.org/wiki/Manual:$wgMaxArticleSize $wgMaxArticleSize] default is 2048 KB, which may be too small?
# [https://www.mediawiki.org/wiki/Manual:$wgMimeInfoFile $wgMimeInfoFile] we don't yet use
# Also, how do imagelimits come into play? http://localhost:8080/w/api.php?action=query&meta=siteinfo&format=txt
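
The same siteinfo query can be scripted; JSON is easier to inspect than the txt format in the URL above:

<source lang="bash">
# Query site info; the default 'general' properties include limits such as maxuploadsize.
curl -s 'http://localhost:8080/w/api.php?action=query&meta=siteinfo&format=json'
</source>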
 
 
 
 
 
== Features ==
 
 
 
Add a link to the sidebar for our extension.
 
$wgUploadNavigationUrl is for overriding the regular 'upload' link (not what we want).
 
 
 
Instead, we have to edit MediaWiki:Common.js; see [https://www.mediawiki.org/wiki/Manual:Interface/Sidebar https://www.mediawiki.org/wiki/Manual:Interface/Sidebar]
 
 
 
== Internationalization ==
 
 
 
[http://localhost:8080/wiki/Special:Html2Wiki?uselang=qqx http://localhost:8080/wiki/Special:Html2Wiki?uselang=qqx] shows the interface messages.

You can see most of the messages in Special:AllMessages if you filter by the prefix 'Html2Wiki'.
 
 
 
 
 
== Error handling ==
 
 
 
# Submitting the form with no file: "There was an error handling the file upload: No file sent."
# Choosing a file that is too big (the limit is set to 100 MB; not tested so far)
# Choosing a file of the wrong type: "There was an error handling the file upload: Invalid file format."
# Choosing a file that has completely broken HTML
 
 
 
 
 
== Developing ==
 
 
 
The project code is hosted on both [https://github.com/freephile/Html2Wiki GitHub] and Wikimedia Foundation servers on the [https://www.mediawiki.org/wiki/Extension:Html2Wiki Html2Wiki Extension page]. You should use git to clone the project and submit pull requests.
 
 
 
The best way to set up a full development environment is to use [https://www.mediawiki.org/wiki/MediaWiki-Vagrant MediaWiki Vagrant]. This handy bit of wizardry will create a full LAMP stack for you and package it into a VirtualBox container (among others).
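
Roughly, the setup looks like the following; this is a hedged outline from memory, so check the MediaWiki-Vagrant page for the current clone URL and steps:

<source lang="bash">
# Clone MediaWiki-Vagrant and bring up a LAMP stack in VirtualBox.
git clone https://gerrit.wikimedia.org/r/p/mediawiki/vagrant mediawiki-vagrant
cd mediawiki-vagrant
./setup.sh   # one-time provisioning helper shipped with MediaWiki-Vagrant
vagrant up
</source>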
 
 
 
{{References}}
 
