Format conversion
See wp:File_format for information about file formats.
This page is about tools for converting content from one document/file format to another. We examine mostly office documents and wiki file formats. There are many other conversions we could discuss when including multimedia formats etc.
There are very useful tools and discussions about converting between MediaWiki wikitext and other wiki syntaxes, XML, XHTML, DocBook, OpenDocument, Portable Document Format (PDF) and more.
Office Suites and Formats
Converting Microsoft formats
The OpenOffice suite has built-in conversion tools to read and write a wide variety of Microsoft's proprietary file formats. However, since the Microsoft formats are proprietary, even the best conversion routines will fail somewhere. This is usually not a problem for everyday files. Conversion becomes more error prone when a document uses advanced features such as special effects, transitions, macros, or advanced object linking and embedding (OLE). Still, the built-in conversion abilities of OpenOffice make the majority of cases practical. Users of OpenOffice will usually be able to read documents sent to them by users of Microsoft Office products.
Conversion is routinely done by the OpenOffice suite when opening or saving a document. You can use the "File -> Save As" menu option to convert your document to the format of your choice.
Programmatic / Automated Conversion
You can do bulk conversions of entire collections of documents by hooking into the conversion capabilities of the OpenOffice suite. In this way, you could produce PDF output for your entire collection of marketing materials.
Following the information at http://www.xml.com/lpt/a/1638, create a local macro and run it from the command line:
ooffice2 -invisible "macro:///Standard.MyConverters.SaveAsOOO(/home/greg/projects/slides/executive.ppt)"
(Note that OOO is just shorthand for OpenOffice.Org)
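Since the macro receives the file name as a plain string, it is probably safer to hand it an absolute path. The following is a minimal sketch, assuming the same Standard.MyConverters.SaveAsOOO macro as above. The helper names macro_url and convert_one are hypothetical, and OFFICE_BIN is an assumed override for the ooffice2 launcher name used on this page.

```shell
#!/bin/bash
# Launcher name used on this page; override OFFICE_BIN if yours differs (assumption).
OFFICE_BIN=${OFFICE_BIN:-ooffice2}

# Build the macro URL for one file, making the path absolute first.
macro_url() {
    local file=$1
    case $file in
        /*) ;;                       # already absolute
        *)  file="$(pwd)/$file" ;;   # prefix the current directory
    esac
    printf 'macro:///Standard.MyConverters.SaveAsOOO(%s)\n' "$file"
}

# Run the conversion macro for one file.
convert_one() {
    "$OFFICE_BIN" -invisible "$(macro_url "$1")"
}
```

With these definitions loaded, convert_one slides/executive.ppt would launch the macro on the absolute form of that path.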
However, that process failed to create a good document. When I tried to open the converted document, OpenOffice hung. After I killed OpenOffice and restarted it, it would try to recover the document but fail to bring it up after the recovery. Searching the web, it turns out that exporting the following environment variable heals Impress (the snippet below adds it to your bash configuration file for future logins and also to your current environment):
echo 'export MALLOC_CHECK_=2' >> ~/.bashrc && source ~/.bashrc
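If you would rather not change every future shell session, the variable can also be set for a single command only. A minimal sketch; the function name with_malloc_check is hypothetical:

```shell
#!/bin/bash
# Run one command with MALLOC_CHECK_=2 in its environment only,
# leaving the calling shell's environment untouched.
with_malloc_check() {
    env MALLOC_CHECK_=2 "$@"
}
```

For example, with_malloc_check ooffice2 -invisible "macro:///Standard.MyConverters.SaveAsOOO(/home/greg/projects/slides/executive.ppt)" applies the setting to that one run.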
The following bash script does an interactive batch conversion of a given directory, finding all Microsoft PowerPoint, Word and Excel files. Pass dry-run as the second argument to list the files without converting them.
#!/bin/bash
# Usage: script [directory] [dry-run]
# set up an option where the user can abort
END_CONDITION=quit
# Defaults to the current working directory if not otherwise specified.
directory=${1:-$(pwd)}
# use an override during development
# directory='/home/greg/Documents'
echo "Using converter to process $directory"
# The -name tests are grouped so that -type f applies to all of them.
# find output is NUL-separated and read on file descriptor 3, so file names
# with spaces survive and the prompt below can still read from the keyboard.
while IFS= read -r -d '' file <&3
do
    if [ "$2" = "dry-run" ]
    then
        echo "found: $file"
        continue
    fi
    echo "Do you want to convert?"
    echo "$file"
    echo "(type '$END_CONDITION' to abort processing; press [enter] to continue)"
    read -r var1
    if [ "$var1" = "$END_CONDITION" ]
    then
        break
    fi
    # to do: add a check for the existence of the target file and skip processing
    echo "processing... "
    ooffice2 -invisible "macro:///Standard.MyConverters.SaveAsOOO($file)"
done 3< <(find "$directory" -type f \( -name '*.ppt' -o -name '*.doc' -o -name '*.xls' \) -print0)
exit 0
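The to-do noted in the script (skipping files whose target already exists) might look like the sketch below. It assumes the macro writes the converted file next to the source with an OpenDocument extension. The mapping from .ppt/.doc/.xls to .odp/.odt/.ods and the function names target_for and needs_conversion are assumptions, so adjust them to whatever your macro actually produces.

```shell
#!/bin/bash
# Print the assumed target name for a Microsoft Office source file.
target_for() {
    case $1 in
        *.ppt) printf '%s\n' "${1%.ppt}.odp" ;;
        *.doc) printf '%s\n' "${1%.doc}.odt" ;;
        *.xls) printf '%s\n' "${1%.xls}.ods" ;;
        *)     return 1 ;;   # not a file type this sketch knows about
    esac
}

# Succeed only if the file is a known type and its target does not exist yet.
needs_conversion() {
    local target
    target=$(target_for "$1") || return 1
    [ ! -e "$target" ]
}
```

Inside the conversion loop, a line such as needs_conversion "$file" || continue would then skip anything already converted.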
I believe the converter will work in tandem with a PHP script that I'm developing, although there is no guarantee that the resulting file will be usable or faithful to the original. To get the PHP script to work you need to daemonize OpenOffice, which means that you have to give it a virtual frame buffer. Additionally, you need the Python interpreter and the Python UNO bridge installed:
sudo apt-get update && sudo apt-get install xvfb python python-uno
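Once those packages are installed, starting the daemon could look roughly like the sketch below. It is an illustration under assumptions rather than a tested recipe: the display number 99, port 8100, the screen geometry, and the function names are arbitrary choices, and OFFICE_BIN again stands in for the ooffice2 launcher. The -accept string is the usual way to make the office process listen for UNO connections from a client such as a Python script.

```shell
#!/bin/bash
# Assumed values: display number, port and launcher name are illustrative.
DISPLAY_NUM=${DISPLAY_NUM:-99}
UNO_PORT=${UNO_PORT:-8100}
OFFICE_BIN=${OFFICE_BIN:-ooffice2}

# Print the option that makes the office process listen for UNO connections.
accept_string() {
    printf -- '-accept=socket,host=localhost,port=%s;urp;\n' "$UNO_PORT"
}

# Start the virtual frame buffer, then the office process attached to it.
start_office_daemon() {
    Xvfb ":$DISPLAY_NUM" -screen 0 1024x768x24 &
    sleep 2   # crude wait for the X server to come up
    DISPLAY=":$DISPLAY_NUM" "$OFFICE_BIN" -invisible "$(accept_string)" &
}
```

Calling start_office_daemon leaves both processes running in the background; a client would then connect to localhost on the chosen port.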
Wiki formats
MediaWiki DTD
http://meta.wikimedia.org/wiki/Wikipedia_DTD
pandoc
Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can read markdown and (subsets of) reStructuredText, HTML, and LaTeX, and it can write markdown, reStructuredText, HTML, LaTeX, ConTeXt, DocBook XML, OpenDocument XML, GNU Texinfo, RTF, ODT, MediaWiki markup, groff man pages, and S5 HTML slide shows.
Or more simply, Pandoc rocks the free world! Because Pandoc writes MediaWiki markup, we used it in the Html2Wiki extension.
To convert an HTML document to MediaWiki syntax, you can simply issue a command like:
pandoc --from html --to mediawiki foo.html --output foo.wiki.txt
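The same command extends naturally to a whole directory of HTML files. A minimal sketch, using only the pandoc options shown above; the function names wiki_name and convert_html_dir and the PANDOC_BIN override are hypothetical:

```shell
#!/bin/bash
# Pandoc executable; override PANDOC_BIN if it lives elsewhere (assumption).
PANDOC_BIN=${PANDOC_BIN:-pandoc}

# foo.html -> foo.wiki.txt, following the naming used in the command above.
wiki_name() {
    printf '%s\n' "${1%.html}.wiki.txt"
}

# Convert every .html file directly inside the given directory.
convert_html_dir() {
    local dir=${1:-.} file
    for file in "$dir"/*.html; do
        [ -e "$file" ] || continue   # the glob matched nothing
        "$PANDOC_BIN" --from html --to mediawiki "$file" --output "$(wiki_name "$file")"
    done
}
```

Running convert_html_dir ./export would write a .wiki.txt file next to each .html file in that directory.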
Wiki To PDF
Wikis have gone print-on-demand, as announced in 2008 (http://wikimediafoundation.org/wiki/Press_releases/Wikis_Go_Printable). Using the Collection extension, you can create "Books" (collections of wiki articles) that you can share, convert to PDF, and even have printed on demand by a high-quality press. On each article on this wiki you should see a "PDF version" link in the toolbox and a "Create A Book" section in the navigation bar. More information is available at Collections.
Wiki To XML
There is a tool created by Magnus Manske (lead/core developer of MediaWiki) that converts MediaWiki documents into XML and a variety of other file formats. Since MediaWiki has an XML DTD, it may well prove to be 100% XML based. The tool offers the following output formats:
- XML
- Plain text (options: use *_/ markup; put ? before internal links)
- Plain text, Google-translated to a selected language (works only for wikipedia/wikibooks; probably depends on a Google API key)
- XHTML
- DocBook XML
- DocBook PDF
- DocBook HTML
- OpenOffice XML
- OpenOffice ODT
Developer info: that tool is in fact one of many wiki worker tools that Magnus provides, so if you plan to author, or already author, in a wiki environment, you might want to check out the tools. http://tools.wikimedia.de/~magnus/
The converter itself is at Special:Wiki2XML
DocBook to MediaWiki
Apparently the Blender project is converting its internal documentation to the MediaWiki format and has developed some useful PHP and Python scripts for doing this. The Python one seemed more polished than the PHP version at the time I looked at it. I reference it here out of curiosity more than anything; I do not know of a current need for this particular conversion. http://mediawiki.blender.org/index.php/Meta/DocBook_to_Wiki
See also http://meta.wikimedia.org/wiki/DocBook_XML_export
Resources and external efforts
The Hula project has a lot of information on wiki format conversion: http://www.hula-project.org/Wiki_Conversion
Various people are coordinating an effort to enable PDF and ODF export of wikis: http://wikimediafoundation.org/wiki/Wikis_Go_Printable
OpenOffice Writer has an export filter that lets you author in OpenOffice and then save your document in wiki format.
Other
html to pdf
wkhtmltopdf is an LGPLv3 tool that renders HTML into PDF and various image formats using the Qt WebKit rendering engine.
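In its simplest form, wkhtmltopdf takes an input document followed by an output file name. A minimal sketch wrapping that call; the function name html_to_pdf and the WKHTMLTOPDF_BIN override are hypothetical, and the default output name (input with .html replaced by .pdf) is just a convenience chosen here:

```shell
#!/bin/bash
# wkhtmltopdf executable; override WKHTMLTOPDF_BIN if needed (assumption).
WKHTMLTOPDF_BIN=${WKHTMLTOPDF_BIN:-wkhtmltopdf}

# Render one HTML file to PDF; the output defaults to the input name with .pdf.
html_to_pdf() {
    local input=$1
    local output=${2:-${input%.html}.pdf}
    "$WKHTMLTOPDF_BIN" "$input" "$output"
}
```

For example, html_to_pdf report.html would produce report.pdf in the same directory.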