From Word to Markdown to InDesign

Fully automated typesetting

Word to InDesign

If you’ve ever imported MS Word documents into InDesign, you know how troublesome that can be. With File > Place, it should be a straightforward process, at least in theory. By default, text placed in InDesign is embedded rather than linked to the original document. The import happens only once, and any edits made to the external document afterwards (from within Word) will not carry over to InDesign, unless you explicitly select the link option in the File Handling preferences. The linked file then appears in the Links panel, from where it can be easily updated,[1] very much like you would do with linked graphical assets such as external .ai, .jpg, .png, .psd, and .tiff image files.[2]

InDesign happily supports various text import formats, including MS Word’s .docx as well as .html and .xml. This matters because importing plain text from a file is child’s play, while preserving the local formatting and styles of “rich text” requires that InDesign knows the external file’s document format (its domain, model or schema) and how to convert its markup. When it does, you can leave it to InDesign to do the heavy lifting of converting any markup it finds in the original document (like Word styles) to the paragraph and character styles you have prepared in your InDesign document.[3]

In practice, however, it can all become very cumbersome, because InDesign’s linked file placement workflow assumes that the original document has been correctly formatted to start with. Few Word users know of its styles (heading, subheading, emphasis, etc.), let alone apply them consistently. Instead, people do all sorts of weird stuff with Word’s wysiwyg[4] click-a-button tricks (bold, italics, tabs, indentation, font sizes, colors, “word art”). Rather than structuring the meaning of their texts, they visually pimp the looks, as a poor proxy for the intent they want to convey; poor indeed, usually both as regards the underlying markup and from a graphic design perspective. On screen, or when printed directly from Word, the document may look the way the author thinks it should, but under the hood it has become a markup mess. Direct, inline styling devastates the integrity of a document’s intended semantic structure and cripples its portability: without serious manual clean-up, the document can never be properly exported to other file formats, published on the Web, or put into print.

Few authors use Word’s styles to sensibly format and structure their documents. Most users apply visual inline styling instead, resulting in unusable markup.
Direct styling is bad practice. Use “semantic” styles instead.

The hidden defects of sloppy markup, generated as the debris of graphical live-preview text editing, become apparent as soon as the Word file is placed in InDesign. Then the heterogeneous cruft and clutter of .docx tag soup bubbles up, forcing the designer to weed out duplicate and redundant styles using “Customized Style Import” and the “Style Mapping” dialog box, at the peril of misinterpreting the author’s intention. (Authors surely don’t like it when the DTP operator takes on the role of an editor.)[5]

InDesign’s Style Mapping dialog box requires manual clean-up of messy Word styles.

Not only is this an error-prone process, it can quickly grow into an expensive workflow, too. With each round of corrections or copy edits made on the original Word document, the overhead costs multiply. Not to mention that such a workflow neither scales nor lends itself to automation.

Plain text as a default

If it were entirely for me to decide,[6] I would ditch Word from the workflow altogether and force all collaborators and team members on a project to work with plain text files[7] exclusively. That way, I would not only enjoy a much more agreeable life as a designer, I could also spend my time more productively and focus on the things that matter, rather than on managing file formats. In the end, I could bill my clients fewer hours, as well. Moreover, clients, publishers and authors would be better assured that no conversion errors creep in during the back-and-forth, so they would have less proofreading to do and enjoy a speedier production process.

The truth is, clients will keep sending .docx files. The vast majority of professional writers still prefer authoring and editing their copy with MS Word (Google Docs is no different). And honestly, we can’t blame them. When used properly, MS Word is still a fine all-purpose word processor.

It all depends on the scale and complexity of a project, but quite frankly, when you don’t get plain text files from the client, it is often more efficient to just copy-paste a raw, unformatted dump of text into an InDesign text frame, and then re-apply those few italicized strings by hand using your character styles. It’s the pragmatic thing to do for small-scale, one-off page layout projects.

With larger projects, however, such duplication of effort is just not feasible. Then, unfortunately, either you are forced to go through the pain of weeding out styles via File > Place after all, or you lose all significant markup in the copy-pasted, stripped-down plain text. Surely, you do want to preserve at least some of the styling applied by the author, particularly all markup which denotes semantics, structure and document organization. I don’t believe anyone would be dumb enough to recreate footnotes, mathematical equations, lists and tables by hand in InDesign (going through yet another painful point-and-click experience) just to keep avoiding the messy file placement dialog. Even something as silly as italicizing bits of text becomes tedious when there are hundreds, or even just a dozen, of them.

Fortunately, there is a solution which gives you the best of both worlds.

Markdown to the rescue

Markdown is a relatively new document format with an easy-to-learn syntax. Compared to html or xml, Markdown is a lightweight markup language, which can be easily read and written with only a few marker characters to remember. With Markdown, you simply type _underscores_ and **asterisks**, for example, instead of lengthy html tags like <em>emphasis</em> and <strong>strong <em>emphasis</em></strong>.[8]
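To make the correspondence between the two notations concrete, here is a minimal sketch in Python that rewrites just those two markers into their html equivalents. It is a deliberately naive, regex-based illustration (the function name `inline_markdown_to_html` is made up for this example) and covers nothing else of Markdown; a real converter should be used for actual work.

```python
import re


def inline_markdown_to_html(text: str) -> str:
    """Convert **strong** and _emphasis_ markers to html tags (naive sketch)."""
    # Handle the double asterisks first, so nested emphasis ends up inside <strong>.
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    # Treat underscores as emphasis only when they are not inside a word.
    text = re.sub(r"(?<!\w)_(.+?)_(?!\w)", r"<em>\1</em>", text)
    return text


print(inline_markdown_to_html("_underscores_ and **asterisks**"))
# <em>underscores</em> and <strong>asterisks</strong>
```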

When I first discovered Markdown, I at once became an enthusiastic user.[9] To me, editing Markdown is much easier than the cumbersome select-point-and-click experience of a wysiwyg editor like Word, which disrupts the flow of my writing. But tastes may differ.

Compared to the bloat of wysiwyg word processors, editing plain text with Markdown’s easy-to-learn syntax really is refreshing!

As much as possible, I also make a habit of storing the master copies of all text (on each and every project, personal and client alike) as plain text files marked up with Markdown, regardless of the application they were originally written in.

Steadily, Markdown is becoming the de facto standard in web-based publishing, and I happily jumped on that bandwagon for the web design projects I do for clients. In web development, there’s a vast ecosystem for working with Markdown, with dozens and dozens of open source software libraries to make it all happen. Unfortunately, print lags behind.

Markdown to InDesign

I just needed to enable Markdown in my print design workflow too, the sooner the better. Surely, I wasn’t the first to desperately want this, so I was confident at least some people might have come up with workable solutions, which I could hack on. And they did.

I found two little scripts which basically offer the same functionality (with a similar approach in implementation): markdown.jsx (2011) and markdownID.jsx (2012). Both are InDesign scripts of the JavaScript flavor (rather than AppleScript, which is a benefit when it comes to portability). Add the one you like to your InDesign scripts library (using the Scripts panel as usual) and happily convert Markdown markup in an InDesign text frame to the proper paragraph and character styles! But mind you: both scripts support only a small subset of Markdown elements (no images and no footnotes, though you can have tables).[10] In many cases, though, these small utilities do their little trick just fine.
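The gist of such a script can be sketched outside InDesign, too. The Python snippet below is not taken from either script; it only illustrates the general idea of looking at the marker that opens a line and deciding which paragraph style to apply. The style names ("Heading 1", "Bullet List", and so on) and the function name `paragraph_style` are hypothetical; in practice they would have to match whatever your InDesign document defines.

```python
import re
from typing import Tuple

# Hypothetical style names; replace them with the ones in your own template.
HEADING_STYLES = {1: "Heading 1", 2: "Heading 2", 3: "Heading 3"}


def paragraph_style(line: str) -> Tuple[str, str]:
    """Return (paragraph style name, text stripped of its Markdown marker)."""
    heading = re.match(r"^(#{1,3})\s+(.*)$", line)
    if heading:
        return HEADING_STYLES[len(heading.group(1))], heading.group(2).strip()
    bullet = re.match(r"^[-*+]\s+(.*)$", line)
    if bullet:
        return "Bullet List", bullet.group(1).strip()
    quote = re.match(r"^>\s?(.*)$", line)
    if quote:
        return "Block Quote", quote.group(1).strip()
    # Anything without a recognized marker is treated as running text.
    return "Body Text", line.strip()


for line in ["## Markdown to the rescue", "- first item", "Just a paragraph."]:
    print(paragraph_style(line))
```

An InDesign script would then apply the returned style to the paragraph and delete the marker characters, which is roughly what a find/change with GREP does.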

From these examples, I started crafting my own take on it. Meanwhile, my version has grown into a real workhorse, doing lots of the tedious formatting I formerly had to apply by hand. It features dozens of built-in character and paragraph styles, and relies on powerful “automagical” formatting additions using InDesign’s GREP and nested styles. I may have grown used to it by now, but it really is a joy each time I run it.

Dump a Markdown file into the system, and get a fully styled InDesign document in return!

Printmaster.scpt (2011) takes another approach: it’s a stand-alone AppleScript which you can run outside of InDesign. It assumes you can convert a Markdown file to xml first (html will do, with a proper declaration on top, and the contents of your <body> wrapped in a <root> element). From that xml, the script creates a fresh InDesign document by copying an InDesign template (.indt) you’ve set up for the purpose, importing the xml, and mapping the template’s styles to the (html) tags found in the xml.
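The wrapping step described above is simple enough to sketch. The following Python snippet is an illustration only, not part of Printmaster: it assumes that the “proper declaration” is a standard xml declaration and that the html body is already well-formed (closed tags, no html-only entities such as `&nbsp;`). The function name `wrap_body_as_xml` is invented for the example.

```python
import xml.etree.ElementTree as ET


def wrap_body_as_xml(body_html: str) -> str:
    """Put an xml declaration on top and wrap the body contents in <root>."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<root>\n" + body_html.strip() + "\n</root>\n"
    )


xml_text = wrap_body_as_xml("<h1>Title</h1>\n<p>Some <em>text</em>.</p>")

# Parsing the result is a cheap check that the xml is well-formed.
root = ET.fromstring(xml_text.encode("utf-8"))
print(root.tag, [child.tag for child in root])
# root ['h1', 'p']
```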

This is great, too, because if we could mash up both approaches and first do an automated Markdown→xml conversion, then pipe the resulting output to the auto-creation of a new InDesign document, based off a template, we would truly have an automated build process in place, which we could run as a background process, locally or on a server! Just dump a Markdown file into the system, and get a fully styled InDesign document in return!

Yet we can push things even a little further. After all, we’re still assuming that we receive Markdown-formatted plain text files to start with. Which we won’t any time soon, at least not from clients, because that would require the entire publishing industry to switch from MS Word (or Google Docs, and wysiwyg word processing in general) to Markdown. Obviously that’s wishful thinking for the foreseeable future. So, unless there were an automatable way to obtain fully formatted Markdown files from Word documents, our entire endeavor to improve the print production flow using Markdown would be in vain. The good news is that this solution already exists, too!

Pandoc to the rescue

Pandoc is the powerhouse we were looking for. It’s an open-source “universal document converter”[11] which takes all sorts of text document formats as input, parses them into a normalized internal representation of their contents, and can then write that to many supported output formats. Among other things, Pandoc reads html, xml (DocBook, ePub, OPML), LaTeX, Markdown, and MS Word, and it can write to Markdown, html (xhtml and html5), xml (DocBook, ePub 2 and 3, OpenDocument), TeX (LaTeX and ConTeXt), MS Word .docx, OpenOffice/LibreOffice .odt, and any custom format you would want to code a writer for yourself. It’s a powerful utility indeed.[12]

Pandoc truly harnesses the power of Markdown. (In fact, Pandoc and its author indeed favor Markdown as the preferred, canonical format for storing and authoring text.) With Pandoc, we can now automatically convert the .docx files we receive from clients to those flat plain text files we wanted, with all original markup preserved in Markdown syntax. A simple command on the terminal suffices to convert a messy Word document into tidy plain text, which can then be further cleaned-up, edited and enriched with sound, uniform markup:

pandoc -f docx -t markdown -o MS-Word.md MS-Word.docx --extract-media=images \
  --no-wrap --normalize --smart
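When the conversion has to run from a script rather than by hand, it helps to build the command programmatically. The sketch below is one possible way to do that in Python, not a fixed recipe. It assumes a later Pandoc release: to my knowledge, `--no-wrap` was superseded by `--wrap=none`, and `--normalize` and `--smart` were dropped as command-line options in the 2.x series, so they are left out here. The function name `docx_to_markdown_cmd` and the default `images` directory are choices made for this example.

```python
from pathlib import Path
from typing import List


def docx_to_markdown_cmd(docx_file: str, media_dir: str = "images") -> List[str]:
    """Build the pandoc command that converts one Word file to Markdown."""
    source = Path(docx_file)
    target = source.with_suffix(".md")  # same name, Markdown extension
    return [
        "pandoc", str(source),
        "--from=docx", "--to=markdown",
        "--wrap=none",                    # keep each paragraph on a single line
        f"--extract-media={media_dir}",   # save embedded images next to the text
        f"--output={target}",
    ]


print(" ".join(docx_to_markdown_cmd("MS-Word.docx")))
```

With Pandoc installed, the returned list can be handed to `subprocess.run(cmd, check=True)` to perform the actual conversion.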

And there’s more! As of version 1.13.2 (?) (May 2014), Pandoc ships with a writer able to produce an ICML file from any Pandoc-supported input file format.[13] ICML is the document format of Adobe InCopy, the forgettable companion word processor to InDesign. (If you remember why Adobe introduced InCopy in the first place, Pandoc’s new support for ICML might become a disruptive game-changer: just keep on using MS Word, in tandem with InDesign, much like you would with InCopy.) It’s as easy as running:

pandoc -s -f markdown -t icml -o my.icml my.md

Once you have ICML files from Markdown input, they can be easily integrated into InDesign with File > Place, just like you would place an external Word document or html file, but without the hassle of having InDesign do the conversion. With such a workflow,[14] you are effectively outsourcing the heavy lifting and normalization of document conversion to Pandoc’s robust and tested library, instead of relying on the buggy and limited built-in file conversion of InDesign. At last, no more tedious pointing-and-clicking through an endless list of conflicting styles each time you update your client’s copy!

Typesetting as a web service

The real game-changer though, is that we can now pipe the whole lot in one smooth automated process. With Pandoc we convert all MS Word documents we get from clients to clean plain text files in Markdown, and have the option to work on them directly. With Git, we create a repository and track all changes made on the Markdown files, providing us with a solid backup history. Then we use Pandoc again to convert the Markdown files to ICML, and have a small shell script import those into an InDesign template and export to an InDesign document. From InDesign, finally, we will create a print-ready, ISO-compliant PDF/X-4:2010 to be sent off to the printer.

The scripts and libraries discussed above proved to be of immense value to me while hacking together my automated print production workflow. I coded up a versatile Makefile,[15] which I can easily call from my local system’s shell to perform all of the needed operations in one go. A simple make all command reads every .docx (and .md) file in the directory (hundreds of them, if need be), and outputs a fully styled InDesign document. In addition, I added a bunch of other scripts into the mix, which do some automated copy editing, too (like orthotypographical corrections of punctuation, and syntax clean-up).
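What makes Make a good fit for this job is its rebuild rule: a target is regenerated only when it is missing or older than its source. The Makefile itself is not reproduced here, but the rule can be sketched in a few lines of Python. The helper names `is_stale` and `stale_sources` are hypothetical, and the sketch only looks at the Markdown-to-ICML step.

```python
from pathlib import Path
from typing import List


def is_stale(source: Path, target: Path) -> bool:
    """True when target is missing or older than source (Make's rebuild rule)."""
    if not target.exists():
        return True
    return target.stat().st_mtime < source.stat().st_mtime


def stale_sources(directory: Path) -> List[str]:
    """Names of the Markdown files whose .icml counterpart needs rebuilding."""
    return sorted(
        md.name
        for md in directory.glob("*.md")
        if is_stale(md, md.with_suffix(".icml"))
    )


print(stale_sources(Path(".")))
```

Running the conversion only for the files returned by `stale_sources` keeps repeated builds fast, even with hundreds of source files in the directory.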

Now, the real goal is of course to port this Makefile proof-of-concept to an even more solid procedure, using a modern build tool which can be run on a server. Let’s say, something like an npm module. Well, actually, we already started building the thing!

We’re developing a web service for automated cross-media publishing, based on the file conversion routines discussed in this article.

Based on the ideas and tools discussed in this article, we’re developing a web service for automated cross-media publishing. The idea is simple enough: with our service, you will just dump some .docx (or markdown) files into your favorite cloud storage bucket (say Dropbox) and get .indd files in return, fully compatible with your already existing InDesign templates and styles, which you can then further tweak. Using our service, you could have your authors and editors collaborate on the copy, right from within Word, just like they used to. Any edits and changes are automatically pulled into the InDesign layout on which the designers or prepress people are working. The whole process will run transparently in the background, without any human intervention required any longer!

What about LaTeX?

In this article we focused on a workflow in which Adobe InDesign is the environment from which the print-ready pdf will eventually be generated. InDesign is arguably the print industry’s de facto standard for laying out and typesetting well-crafted books, brochures, magazines and newspapers. As a page layout application, InDesign (or, for that matter, QuarkXPress) obviously allows for the manual interventions of a skilled designer who puts in her creative problem-solving know-how: a valuable asset that cannot be programmed into an algorithm any time soon.

In reality, however, lots of publishers and printing shops have viable alternative production flows in place, using typesetting systems based on (La)TeX, troff/groff or XSL-FO. Others still go to plate directly from MS Word. Such production flows are not necessarily old-fashioned or bad. On the contrary: when you do not need InDesign’s advanced layout features, as you do with highly graphical publications, then fully automated typesetting really is the better solution. In those circumstances where the intervention of a graphic designer is unnecessary overhead, the production process should be automated as much as possible, yet without giving up on the typographical quality of the typesetting.

Academic journals and scholarly papers, college syllabi, text books, manuals, technical reports, legal codes, and so forth, can all be typeset fully automatically, without any manual intervention, from a single source of documents containing the manuscript, along with a template and/or stylesheet. In these cases, too, existing production flows can benefit greatly from the automated file conversions described in this article, leveraging the advantages that come with plain text in Markdown.

In the system we are building, instead of (or in addition to) converting Word documents to InDesign, we can just as well go directly to TeX and generate the pdf from there. Alternatively, you could go to MS Word again (converting .docx to an automatically cleaned-up .docx) using your in-house templates, and “print” the pdf from within Word.
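Seen from the Markdown master copy, these routes differ only in the Pandoc writer that is selected. The sketch below shows how such a choice could be expressed in Python; it does not describe the system under development. The route names (`indesign`, `tex`, `word`), the lookup tables and the function name `export_cmd` are invented for the example, while `icml`, `latex` and `docx` are existing Pandoc output formats.

```python
from typing import List

# Hypothetical route names mapped to real Pandoc writers and file extensions.
PANDOC_WRITERS = {"indesign": "icml", "tex": "latex", "word": "docx"}
EXTENSIONS = {"icml": ".icml", "latex": ".tex", "docx": ".docx"}


def export_cmd(markdown_file: str, route: str) -> List[str]:
    """Build the pandoc command for one production route."""
    writer = PANDOC_WRITERS[route]  # raises KeyError for an unknown route
    stem = markdown_file.rsplit(".", 1)[0]
    return [
        "pandoc", markdown_file,
        "--standalone",
        "--from=markdown", f"--to={writer}",
        f"--output={stem}{EXTENSIONS[writer]}",
    ]


for route in PANDOC_WRITERS:
    print(" ".join(export_cmd("book.md", route)))
```

Templates and reference documents are deliberately left out of this sketch; how they are passed to Pandoc depends on the output format and the Pandoc version in use.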

‘Printing’ to the Web

Now that we have come so far as to put web technologies like Markdown to good use in the print production workflow, it is obvious that we can use the exact same procedures to also produce online publications from the same source we use for the print edition. That is, “printing” an interactive edition of your printed book or magazine to the Web is going to be almost trivial. Going online at the same time the sheets roll off the presses will soon be a feasible and cheap option, which publishers should not forgo.

In fact, the first thing we’re tackling while developing our automated typesetting service is exactly the cross-media aspect of it all. We have already been putting tremendous effort into the automated creation of dynamic and fully functional online publications, i.e. responsive, cross-browser websites, with a particular focus on typography, usability and navigation. We’re confident we can go public with a demo soon…

If you’re a publisher trying to get your cost structure down and are intrigued by the solutions for automated typesetting we recommend in this article, then do get in touch: I’m available for consulting.

If you’re a fellow designer/developer, thrilled at the thought of switching from Word to Markdown in your InDesign workflow, and if you’re up for hacking together your own workflow, then check out the resources discussed in this article. If you’d rather not mess around with the command line, then know that we are considering open-sourcing our shell scripts. As soon as we’ve got something more substantial, we’ll probably do a follow-up post. Do remind me on Twitter @rhythmvs!

Either way, you should follow me, and get notified as soon as we are launching our fully-automated web-based publishing service!


  1. For detailed instructions, see Adobe’s online help page.

  2. Well, that’s not entirely correct. Image bitmaps are stored in binary files (except for .svg, which InDesign doesn’t support anyway), while text should never be obfuscated in binaries, which is the whole argument of this article. You can’t edit images from within InDesign, but must use an external image editor, like Photoshop, instead. This workflow is beneficial indeed, as it implements a strict separation of concerns: once you have placed a linked image, updates must be done on the link’s source, and go one way. If InDesign allowed direct image manipulation, then, one way or another, these edits would have to propagate back upstream into the linked file, resulting in a bidirectional syncing nightmare of versioning, concurrency management and state keeping, diffing and merge conflict resolution, especially when you’re working with a team of collaborators on the same files. Unfortunately, this is all true for text: obviously you can edit text in InDesign, and that’s not a bug or an oversight, but a core feature. InDesign is a text layout application, after all. But then again, that very ability calls for a disciplined workflow in which you deprive yourself of direct text editing: do not touch the contents of a linked text frame but, as you would with images, make edits on the original source document only.

  3. See the relevant online help page.

  4. WYSIWYG is an acronym for “What You See Is What You Get”. It’s the currently ruling paradigm for text editing applications, wherein users directly write and style text inline with a real-time updated preview — with all the associated disadvantages as regards the underlying markup, as we argue in this article. As opposed to WYSIWYG, in Web design, the shift has been made towards “WYSIWYM” (What You See Is What You Mean), which places semantics over visuals, separates the process of writing and editing a document from its presentation, and delegates the styling of the document to a design or typesetting post-process.

  5. The separation of content and presentation as the subject of two individual processes that should not be mixed, not only adheres to a sound programming philosophy which dictates the separation of concerns between computer programs, it also furthers task distribution amongst real people, with different skill sets, domain knowledge, and, consequently, accountability.

  6. When I send quotes to clients and prospects, I already require .txt or .md files as the only accepted input formats.

  7. Plain text files are the files that make Notepad or TextEdit fire up. They have the .txt file extension, and you surely know them from the README files that come with some software bundles. They’re the opposite of “rich text files” (.rtf). They surely do not look pretty when seen through the lens of your operating system’s bundled text editor, but they can be prettified.

  8. If you’re not familiar with Markdown already, then have a look at one of these tutorials and cheat sheets.

  9. An inveterate researcher at heart myself, I dove into the matter, started collecting a trove of documentation, and published it in an open data repository on GitHub, which has enjoyed quite a bit of attention from the developer crowd. In the Land of Markdown, the state of the union is evolving very fast, with briefly ruling princes abdicating as quickly as they are crowned. I try to keep up while updating my repo, but with the advent of the CommonMark initiative the conversation on Markdown’s future is thriving on specialized fora.

  10. The implementation of both scripts is “naive” in that it relies heavily on a swath of regexes to do the transformations. If you need something performant and extensible, you instead want a full-blown PEG parser. With a true parser, we might get to the point of covering the whole spec of Markdown, whatever that may be in the case of, well, “Standard” Markdown…

  11. Pandoc is written in Haskell, a much undervalued yet powerful programming language with solid foundations in lambda calculus and category theory. Its author is John MacFarlane, a bright programming enthusiast who also happens to be a prolific Berkeley professor of philosophy.

  12. Compared with a home-grown, regex-based, ad-hoc script, Pandoc’s robust, full-blown parser is a far more reliable tool, which provides you with deterministic, normalized and predictable output.

  13. If you’re not scared away by the symbolic logic of program code, then do have a look at the source of Pandoc’s new ICML writer.

  14. The process is somewhat better explained here and here.

  15. Time and again, the old-timer workhorse GNU Make still proves to be a great all-round automation tool. Compared to modern build tools like Rake or Grunt, Make’s syntax is hard to digest — especially for newbies. Online documentation is terse, but Mike Bostock has a valuable primer (2013); for a more in-depth overview, check this one (2015).