Format conversion

Revision as of 14:05, 8 August 2015 by Freephile (talk | contribs) (moves pandoc to the top of the list)

See wp:File_format for information about File Formats

This page is about tools for converting content from one document/file format to another. We examine mostly office documents and wiki file formats. There are many other conversions we could discuss when including multimedia formats etc.

There are very useful tools and discussions about converting between MediaWiki wikitext and other wiki syntaxes, XML, XHTML, DocBook, OpenDocument, Portable Document Format (PDF) and more.

Office Suites and Formats

Converting Microsoft formats

The OpenOffice suite has built-in conversion tools to read and write a wide variety of Microsoft's proprietary file formats. Because those formats are proprietary, even the best conversion routines will fail somewhere. This is usually not a problem for everyday files. Conversion becomes more error-prone when a document uses advanced features such as special effects, transitions, macros, or advanced Object Linking and Embedding (OLE). Still, the built-in conversion handles the majority of cases well enough to be practical, and users of OpenOffice will usually be able to read documents sent to them by users of Microsoft Office products.

Conversion is routinely done by the OpenOffice suite when opening or saving a document. You can use the "File -> Save As" menu option to convert your document to the format of your choice.

Programmatic / Automated Conversion

You can do bulk conversions of entire collections of documents by hooking into the conversion capabilities of the OpenOffice suite. In this way, you could produce PDF output for your entire collection of marketing materials.

Following the information at http://www.xml.com/lpt/a/1638, create a local macro and run it from the command line:

ooffice2 -invisible "macro:///Standard.MyConverters.SaveAsOOO(/home/greg/projects/slides/executive.ppt)"

(Note that OOO is just shorthand for OpenOffice.org.)

However, that process failed to create a good document. When I tried to open the result, OpenOffice hung. After I killed and restarted OpenOffice, it tried to recover the document but still failed to bring it up. A web search suggests that exporting the following environment variable heals Impress (the snippet below adds it to your current environment and also to your bash configuration file for future logins):

echo 'export MALLOC_CHECK_=2' >> ~/.bashrc && source ~/.bashrc

The following bash script does an interactive batch conversion of a given directory, finding all Microsoft PowerPoint, Word and Excel files. Pass "dry-run" as the second argument to list the files without converting them.

#!/bin/bash

# Typing this word at the prompt aborts the run
END_CONDITION=quit

# Defaults to the current working directory if not otherwise specified.
directory=${1:-$PWD}

# use an override during development
# directory='/home/greg/Documents'

echo "Using converter to process $directory"

# The -name tests are grouped so that -type f applies to all of them.
# -print0 with read -d '' keeps file names containing spaces intact.
# The file list arrives on descriptor 3 so that stdin stays free for the prompt.
while IFS= read -r -d '' file <&3
do
    if [ "$2" = "dry-run" ]
    then
        echo "found: $file"
        continue
    fi
    echo "Do you want to convert?"
    echo "$file"
    echo "(type '$END_CONDITION' to abort processing; press [enter] to continue)"
    read -r var1
    if [ "$var1" = "$END_CONDITION" ]
    then
        break
    fi
    # to do: check for the existence of the target file and skip processing
    echo "processing... "
    ooffice2 -invisible "macro:///Standard.MyConverters.SaveAsOOO($file)"
done 3< <(find "$directory" -type f \( -name '*ppt' -o -name '*doc' -o -name '*xls' \) -print0)

exit 0
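
The to-do noted in the script (skip files whose target already exists) could be sketched roughly as follows. This is only an illustration: the helper names target_for and needs_conversion are hypothetical, and the sketch assumes that the macro writes an OpenDocument file next to the original (ppt to odp, doc to odt, xls to ods), which depends on how your SaveAsOOO macro is written.

```shell
#!/bin/bash

# Map a Microsoft Office file name to the OpenDocument name we ASSUME
# the macro writes next to it (ppt -> odp, doc -> odt, xls -> ods).
target_for() {
    local src=$1
    case "$src" in
        *.ppt) printf '%s\n' "${src%.ppt}.odp" ;;
        *.doc) printf '%s\n' "${src%.doc}.odt" ;;
        *.xls) printf '%s\n' "${src%.xls}.ods" ;;
        *)     return 1 ;;   # not a file type this sketch knows about
    esac
}

# Succeed (exit status 0) only when the assumed target does not exist yet.
needs_conversion() {
    local target
    target=$(target_for "$1") || return 1
    [ ! -e "$target" ]
}
```

Inside the conversion loop, a line such as `needs_conversion "$file" || continue` placed before the prompt would then skip files that appear to be converted already.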

I believe the converter will work in tandem with a PHP script that I'm developing, although there is no guarantee that the resulting file will be usable or faithful to the original. To get the PHP script to work you need to daemonize OpenOffice, which means that you have to give it a virtual frame buffer. You also need the Python interpreter and the Python UNO bridge installed:

sudo apt-get update && sudo apt-get install xvfb python python-uno
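
Once those packages are installed, starting OpenOffice as a background listener might look something like the sketch below. Treat it as an assumption-laden starting point, not a tested recipe for this setup: the binary name ooffice2 is taken from the commands above, port 8100 is only a common convention, the function names are hypothetical, and the single-dash options (-headless, -norestore, -accept) are the older OpenOffice.org spelling, so check them against your installed version.

```shell
#!/bin/bash

# Assumed defaults: adjust the binary name and port for your install.
OFFICE_BIN=${OFFICE_BIN:-ooffice2}
UNO_PORT=${UNO_PORT:-8100}

# Build the UNO "accept" string that tells OpenOffice to listen on a socket.
accept_string() {
    printf 'socket,host=localhost,port=%s;urp;' "$1"
}

# Start OpenOffice in the background under a virtual frame buffer
# (xvfb-run comes with the xvfb package), listening for UNO clients.
start_office_daemon() {
    xvfb-run --auto-servernum "$OFFICE_BIN" -headless -norestore \
        "-accept=$(accept_string "$UNO_PORT")" &
}
```

Calling `start_office_daemon` leaves the office process running in the background; a client using the Python UNO bridge would then connect to the same host and port.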


Wiki formats

Mediawiki DTD

http://meta.wikimedia.org/wiki/Wikipedia_DTD

pandoc

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can read markdown and (subsets of) reStructuredText, HTML, and LaTeX, and it can write markdown, reStructuredText, HTML, LaTeX, ConTeXt, Docbook XML, OpenDocument XML, GNU Texinfo, RTF, ODT, MediaWiki markup, groff man pages, and S5 HTML slide shows.

Or more simply, Pandoc rocks the free world! Because Pandoc does MediaWiki format, we used it in the Html2Wiki extension.

To convert an HTML document to MediaWiki syntax, you can simply issue a command like

pandoc --from html --to mediawiki foo.html --output foo.wiki.txt
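
To apply the same command to a whole directory of HTML files, a small wrapper along these lines may help. It is a minimal sketch: the function names wiki_name and html_dir_to_wiki are made up for illustration, and the output naming simply follows the foo.html to foo.wiki.txt pattern used above.

```shell
#!/bin/bash

# Derive the output name used above: foo.html -> foo.wiki.txt
wiki_name() {
    printf '%s\n' "${1%.html}.wiki.txt"
}

# Convert every *.html file in a directory (default: current directory).
html_dir_to_wiki() {
    local dir=${1:-.} src
    for src in "$dir"/*.html
    do
        [ -e "$src" ] || continue   # the glob matched nothing
        pandoc --from html --to mediawiki "$src" --output "$(wiki_name "$src")"
    done
}
```

For example, `html_dir_to_wiki ~/export` would write one .wiki.txt file next to each .html file in that directory.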

Wiki To PDF

Wikis have gone print-on-demand, as announced in 2008: http://wikimediafoundation.org/wiki/Press_releases/Wikis_Go_Printable. Using the Collection extension, you can create "Books" (collections of wiki articles) that you can share, convert to PDF, and even print on demand at a high-quality press. On each article on this wiki you should see a "PDF version" link in the toolbox and also a "Create A Book" section in the navigation bar. More information is available at Collections.

Wiki To XML

There is a tool created by Magnus Manske (lead/core developer of MediaWiki) that converts MediaWiki documents into XML and a variety of other file formats. Since MediaWiki has an XML DTD, it may well prove to be 100% XML based. The tool offers these output formats:

  • XML
  • Plain text (options: use *_/ markup; put ? before internal links)
  • Plain text, google-translated to a chosen language (works only for wikipedia/wikibooks; probably depends on Google API key)
  • XHTML
  • DocBook XML
  • DocBook PDF
  • DocBook HTML
  • OpenOffice XML
  • OpenOffice ODT

Developer info: that tool is one of many wiki worker tools that Magnus provides, so if you plan to author in a wiki environment, or already do, you might want to check out the tools: http://tools.wikimedia.de/~magnus/

The converter itself is at Special:Wiki2XML


DocBook to Mediawiki

Apparently the Blender project is converting its internal documentation to the MediaWiki format, and it has developed some useful PHP and Python scripts for doing this. The Python one seemed more polished than the PHP version at the time I looked at it. I reference it here out of curiosity more than anything; I do not know of a current need for this particular conversion. http://mediawiki.blender.org/index.php/Meta/DocBook_to_Wiki

See also http://meta.wikimedia.org/wiki/DocBook_XML_export

Resources and external efforts

The Hula project has a lot of information on wiki format conversion: http://www.hula-project.org/Wiki_Conversion

Various people are coordinating an effort to make PDF and ODF export of wikis possible: http://wikimediafoundation.org/wiki/Wikis_Go_Printable

The OpenOffice Writer has an export filter that allows you to author in OpenOffice and then save your document in wiki format.

Other

html to pdf

wkhtmltopdf is an LGPLv3 tool that renders HTML into PDF and various image formats using the Qt WebKit rendering engine.
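
In its simplest form the tool takes an input (a local HTML file or a URL) followed by the output file name. The wrapper below is a minimal sketch of that call; the function name html_to_pdf and the default output name page.pdf are placeholders chosen for this example.

```shell
#!/bin/bash

# Render an HTML file or URL to PDF. The output name defaults to page.pdf,
# a placeholder chosen for this sketch.
html_to_pdf() {
    local src=$1 out=${2:-page.pdf}
    wkhtmltopdf "$src" "$out"
}
```

For example, `html_to_pdf foo.html foo.pdf` would render a local file to foo.pdf.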

  • todo: update this page since it was last touched in 2009