This is a design document for building a test of exporting from Author to JATS, for easy and robust ingest into ACM TAPS. Google Gemini provided notes for this.
1. Overview
This guide describes how to export Author’s .liquid document format to JATS XML (ANSI/NISO Z39.96-2024, version 1.4) using the Article Authoring Tag Set — the JATS variant designed for authors creating new content.
JATS is the internal XML format used by ACM (via TAPS/Atypon Literatum), Elsevier, Springer Nature, Wiley, and virtually every major academic publisher. A valid JATS export from Author would allow direct submission to publisher pipelines, bypassing the Word/LaTeX conversion step entirely.
Target spec: JATS Article Authoring 1.4
- DTD: https://jats.nlm.nih.gov/articleauthoring/1.4/JATS-articleauthoring1-4.dtd
- Tag Library (element reference): https://jats.nlm.nih.gov/articleauthoring/tag-library/1.4/index.html
- Full standard (PDF): https://groups.niso.org/higherlogic/ws/public/download/31415/ANSI-NISO-z39.96-2024.pdf
- Schemas (DTD, RNG, XSD): https://public.nlm.nih.gov/projects/jats/articleauthoring/1.4/
2. What Makes JATS Different from EPUB/HTML
JATS is not a presentation format. It is a semantic description of a scholarly article’s structure and metadata. This distinction affects every aspect of the conversion.
| Aspect | EPUB/HTML | JATS XML |
|---|---|---|
| Headings | <h2>The Origami Approach</h2> | <sec id="sec-3"><title>The Origami Approach</title> |
| Bold text | <strong>PDF</strong> | <bold>PDF</bold> |
| Italic text | <em>Phaedrus</em> | <italic>Phaedrus</italic> |
| Citation in text | (Halevi 2015) as text | <xref ref-type="bibr" rid="ref-Halevi2015">Halevi 2015</xref> |
| Reference list | JSON sidecar or HTML list | Deeply structured <ref-list> with <element-citation> trees |
| Metadata | OPF Dublin Core or JSON | Extensive <front> block with <article-meta>, contributors, affiliations, abstract, keywords, permissions |
| Paragraphs | <p>text</p> | <p>text</p> (same, but must be inside <sec> or <body>) |
| Lists | <ul><li> | <list list-type="bullet"><list-item><p> |
| URLs | <a href="..."> | <ext-link ext-link-type="uri" xlink:href="..."> |
The fundamental challenge: Author’s .liquid stores content as styled RTF (visual formatting) and citations as UUID-keyed plists (Author’s own schema). JATS requires every element to be semantically classified and every citation to be decomposed into JATS-specific child elements with specific ordering constraints enforced by a DTD.
3. JATS Article Structure
A JATS Article Authoring document has exactly this top-level structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Article Authoring DTD v1.4 20241031//EN"
"JATS-articleauthoring1-4.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
article-type="research-article"
xml:lang="en">
<front>
<!-- Article metadata: title, authors, abstract, keywords, permissions -->
</front>
<body>
<!-- Article content: sections, paragraphs, figures, tables -->
</body>
<back>
<!-- Back matter: acknowledgements, references, appendices -->
</back>
</article>
These children of <article> must appear in this order, and the DTD enforces it. <back> is optional in the content model, but any article with references needs one.
4. The <front> Block: Metadata
The <front> block is the most structurally demanding part of the conversion. It contains article metadata that publishers require but that Author’s .liquid format does not fully capture. The developer must handle both “data we have” and “data we must allow the user to provide.”
4.1 — Minimal valid <front>
<front>
<article-meta>
<title-group>
<article-title>Origami Text</article-title>
<subtitle>Minimal EPUB, Rich Metadata</subtitle>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Hegland</surname>
<given-names>Frode Alexander</given-names>
</name>
<xref ref-type="aff" rid="aff-1"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Cerf</surname>
<given-names>Vinton G.</given-names>
</name>
<xref ref-type="aff" rid="aff-2"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Serageldin</surname>
<given-names>Ismail</given-names>
</name>
<xref ref-type="aff" rid="aff-3"/>
</contrib>
</contrib-group>
<aff id="aff-1">University of Southampton</aff>
<aff id="aff-2">Google</aff>
<aff id="aff-3">Library of Alexandria</aff>
<!-- Every rid used in <contrib> must match an <aff> id -->
<abstract>
<p>This paper proposes Origami Text, a disciplined approach to
scholarly document formatting based on simplified EPUB with
enhanced metadata...</p>
</abstract>
<kwd-group>
<kwd>EPUB</kwd>
<kwd>scholarly publishing</kwd>
<kwd>metadata</kwd>
<kwd>document formats</kwd>
<kwd>accessibility</kwd>
</kwd-group>
</article-meta>
</front>
4.2 — Mapping from .liquid to <front>
| JATS element | Source in .liquid | Notes |
|---|---|---|
| <article-title> | RTF: first \f0\fs50 text block | Parse the title from the largest font size |
| <subtitle> | RTF: first \f2\i (italic) text after title | The italic line immediately following the title |
| <contrib> (first author) | Author.plist → firstName, middleName, lastName | Only the first author is in Author.plist |
| <contrib> (co-authors) | NOT IN .liquid | Must be added via UI before export |
| <aff> | Author.plist → institution | Only one institution currently stored |
| <abstract> | NOT IN .liquid | Must be entered by user or generated |
| <kwd-group> | NOT IN .liquid | Must be entered by user |
Critical gap: Author.plist stores only one author (the document owner). For a multi-author paper, the co-author information must come from somewhere. Options:
- Option 1: Add a co-authors field to .liquid. This is the preferred long-term solution. Add a CoAuthors.plist or extend Author.plist with an array of contributor records.
- Option 2: Prompt at export time. Present a dialog where the user adds co-authors, affiliations, abstract, and keywords before JATS export proceeds.
- Option 3: Accept incomplete metadata. Export what you have and mark missing fields with XML comments like <!-- TODO: add co-author affiliations -->. Publishers will request the missing data anyway.
Recommendation: Option 2 for the initial implementation, with Option 1 on the roadmap. The export dialog should collect: co-authors (name + affiliation each), abstract text, and keywords.
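The data the export dialog collects could be held in a small record type and turned into <contrib-group> markup. What follows is a minimal sketch, not Author's implementation: the `Contributor` struct, `escapeXML`, and `contribGroupXML` are hypothetical names, and it assumes one affiliation per author.

```swift
import Foundation

// Hypothetical record that the export dialog could collect per author.
struct Contributor {
    let givenNames: String
    let surname: String
    let affiliation: String
}

// Escapes the three characters that are unsafe in XML text content.
func escapeXML(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "&", with: "&amp;")
        .replacingOccurrences(of: "<", with: "&lt;")
        .replacingOccurrences(of: ">", with: "&gt;")
}

// Emits <contrib-group> followed by one <aff> per distinct affiliation,
// so two authors at the same institution share one aff ID.
func contribGroupXML(_ authors: [Contributor]) -> String {
    var affIDs: [String: String] = [:]
    var affLines: [String] = []
    var lines = ["<contrib-group>"]
    for author in authors {
        let affID: String
        if let existing = affIDs[author.affiliation] {
            affID = existing
        } else {
            affID = "aff-\(affIDs.count + 1)"
            affIDs[author.affiliation] = affID
            affLines.append("<aff id=\"\(affID)\">\(escapeXML(author.affiliation))</aff>")
        }
        lines.append("<contrib contrib-type=\"author\">")
        lines.append("<name><surname>\(escapeXML(author.surname))</surname>"
            + "<given-names>\(escapeXML(author.givenNames))</given-names></name>")
        lines.append("<xref ref-type=\"aff\" rid=\"\(affID)\"/>")
        lines.append("</contrib>")
    }
    lines.append("</contrib-group>")
    return (lines + affLines).joined(separator: "\n")
}
```

Generating the aff IDs in the same pass as the contributors keeps every rid pointing at an <aff> that actually exists.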
4.3 — Optional but publisher-expected <front> elements
These are not required by the DTD but most publishers will want them:
<!-- Article identifiers -->
<article-id pub-id-type="doi">10.XXXX/XXXXX</article-id>
<!-- Permissions / licensing -->
<permissions>
<copyright-statement>Copyright © 2026 Hegland, Cerf, Serageldin</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access"
xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a Creative Commons
Attribution 4.0 International License.</license-p>
</license>
</permissions>
<!-- Funding -->
<funding-group>
<funding-statement>This work was not externally funded.</funding-statement>
</funding-group>
For the initial implementation, include <permissions> with a sensible default (the user can change it in the export dialog) and leave <funding-group> optional.
5. The <body> Block: Content
The body is where the RTF content must be transformed into semantic JATS markup. Every section heading creates a <sec> container that wraps all content until the next heading of the same or higher level.
5.1 — Section structure
The RTF uses font/size to distinguish headings. The mapping:
| RTF pattern | JATS element |
|---|---|
| \f0\fs50 (Baskerville 25pt) | <article-title> in <front> — NOT in <body> |
| \f0\fs38 (Baskerville 19pt) | <sec><title> |
| \f1\fs34 (TimesNewRoman 17pt) body text | <p> inside the current <sec> |
The nesting rule: Each <sec> contains a <title> followed by the block elements (paragraphs, lists, etc.) that belong to that section. If the document had sub-headings (e.g. an \fs30 level), those would be nested <sec> elements within the parent <sec>. The sample document has only one heading level below the title, so all sections are siblings.
<body>
<!-- Opening paragraphs before the first heading go directly in <body> -->
<p>We have long celebrated the pen but we have paid less
attention to what it writes on...</p>
<p>In Socrates' framing, the soul is the true substrate...</p>
<p>Digital text holds the promise of something greater...</p>
<sec id="sec-1">
<title>Why Current Scholarly Formats Fall Short</title>
<p>To solve the urgent, complex problems facing our world...</p>
<list list-type="bullet">
<list-item>
<p><bold>PDF</bold> provides stability, portability, and
robustness, but it fails on interactivity...</p>
</list-item>
<list-item>
<p><bold>HTML</bold> delivers exceptional interactivity,
but it fails on portability...</p>
</list-item>
</list>
<p>EPUB has existed as an open standard for decades...</p>
</sec>
<sec id="sec-2">
<title>The Origami Approach</title>
<p>A suitable knowledge format must be robust but not static...</p>
</sec>
<sec id="sec-3">
<title>The Three Pillars</title>
<p>Our approach addresses the legacy limitations...</p>
<list list-type="bullet">
<list-item>
<p><bold>Minimalist Formatting:</bold> We strip back...</p>
</list-item>
<list-item>
<p><bold>High-Resolution Addressing:</bold> To resolve...</p>
</list-item>
<list-item>
<p><bold>Rich Structural Metadata:</bold> The EPUB format...</p>
</list-item>
</list>
</sec>
<!-- Continue for all sections... -->
</body>
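The nesting rule from this section can be sketched as a stack-based builder. This is an illustration under stated assumptions rather than Author's parser: the `Block` enum, the numeric heading levels, and the `sec-N` counter are hypothetical, and the text passed in is assumed to be XML-escaped already.

```swift
import Foundation

// Hypothetical intermediate form produced by an earlier RTF pass.
enum Block {
    case heading(level: Int, text: String)  // level 1 = \fs38, level 2 = a sub-heading
    case paragraph(String)
}

// Wraps each heading and the blocks that follow it in <sec>, closing any
// open section when a heading of the same or a higher level arrives.
func buildBody(_ blocks: [Block]) -> String {
    var out = ["<body>"]
    var openLevels: [Int] = []  // stack of currently open <sec> levels
    var counter = 0
    for block in blocks {
        switch block {
        case .heading(let level, let text):
            while let last = openLevels.last, last >= level {
                out.append("</sec>")
                openLevels.removeLast()
            }
            counter += 1
            out.append("<sec id=\"sec-\(counter)\">")
            out.append("<title>\(text)</title>")
            openLevels.append(level)
        case .paragraph(let text):
            out.append("<p>\(text)</p>")
        }
    }
    // Close whatever is still open at the end of the document.
    for _ in openLevels { out.append("</sec>") }
    out.append("</body>")
    return out.joined(separator: "\n")
}
```

Paragraphs that arrive before the first heading fall straight into <body>, which matches the opening paragraphs in the example above.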
5.2 — Inline formatting conversion
| RTF | JATS | Notes |
|---|---|---|
| \b or \f3 (Bold font) | <bold>text</bold> | |
| \i or \f2 (Italic font) | <italic>text</italic> | |
| \b \i (Bold italic) | <bold><italic>text</italic></bold> | Nesting order doesn’t matter |
| \'91 / \'92 | ‘ / ’ (Unicode curly single quotes) | Convert to actual Unicode characters |
| \'93 / \'94 | “ / ” (Unicode curly double quotes) | Convert to actual Unicode characters |
| \'95 (bullet) | Start of <list-item> | Context-dependent |
| \'97 (em dash) | — (Unicode em dash) | Direct character replacement |
| HYPERLINK field | <ext-link ext-link-type="uri" xlink:href="URL">text</ext-link> | Requires xlink namespace |
| \cf4 (key sentence colour) | <!-- key-sentence --> comment or custom attribute | No JATS native equivalent; see 5.5 |
| Tab-indented paragraph | New <p> element | Not visual indentation |
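The \'hh rows in the table are single-byte escapes. A minimal sketch of decoding them follows, assuming the RTF uses the Windows-1252 code page (which is what the \'91 to \'97 values in the table imply). The function name `decodeRTFHexEscapes` is hypothetical, and the sketch does not handle an escaped backslash immediately before the apostrophe.

```swift
import Foundation

// Replaces RTF \'hh escapes with the character that byte represents in
// Windows-1252, e.g. \'93 becomes U+201C (left double quotation mark).
func decodeRTFHexEscapes(_ rtf: String) -> String {
    var result = ""
    var rest = Substring(rtf)
    while let range = rest.range(of: "\\'") {
        // Copy everything before the escape unchanged.
        result.append(contentsOf: rest[rest.startIndex..<range.lowerBound])
        let hex = rest[range.upperBound...].prefix(2)
        if hex.count == 2,
           let byte = UInt8(hex, radix: 16),
           let decoded = String(data: Data([byte]), encoding: .windowsCP1252) {
            result.append(decoded)
            rest = rest[range.upperBound...].dropFirst(2)
        } else {
            // Not a decodable escape: keep the marker and continue after it.
            result.append(contentsOf: rest[range])
            rest = rest[range.upperBound...]
        }
    }
    result.append(contentsOf: rest)
    return result
}
```

Running this before list detection means a bullet paragraph can be recognised by a leading U+2022 rather than by the raw \'95 escape.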
5.3 — Citation cross-references in body text
This is one of the hardest parts. In the RTF, citations appear as parenthetical text like (Halevi, Moed, Bar-Ilan 2015). In JATS, they must become:
<xref ref-type="bibr" rid="ref-Halevi2015">Halevi, Moed, &amp; Bar-Ilan 2015</xref>
The rid attribute must match the id of the corresponding <ref> element in the <ref-list> in <back>.
Matching algorithm:
- Build a lookup table from Citations.plist: for each citation UUID, extract the display form that Author generates (author surnames + year) using the citation format from Version.plist (nameAndDateInBrackets).
- Scan the RTF body text for parenthetical citation patterns matching (AuthorNames Year).
- For each match, find the corresponding citation UUID by matching author surnames and year against the lookup table.
- Replace the parenthetical text with an <xref> element whose rid points to the reference ID derived from the UUID.
Edge cases to handle:
- Multiple citations in one parenthetical: (Smith 2020; Jones 2021) → two <xref> elements separated by ;
- Citations with “et al.”: Author may abbreviate long author lists
- Citations used in running text without parentheses: “Halevi et al. (2015) found that…”: the <xref> wraps only the year portion
- Author’s RTF stores the citation in-text as rendered text, not as a structured reference, so your parser must do fuzzy matching
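The lookup-and-replace steps can be sketched for the simplest case, a parenthetical holding one or more full citations. This is an illustration under stated assumptions: `citationKey` and `linkCitations` are hypothetical names, the lookup table is assumed to be prebuilt from Citations.plist, and "et al." forms and running-text citations are not handled.

```swift
import Foundation

// Normalises a rendered citation such as "Halevi, Moed, Bar-Ilan 2015"
// so that punctuation, spacing, and case differences do not block a match.
func citationKey(_ text: String) -> String {
    let loosened = text.lowercased()
        .replacingOccurrences(of: "&", with: " ")
        .replacingOccurrences(of: ",", with: " ")
    return loosened.split(separator: " ").joined(separator: " ")
}

// Converts the inside of one parenthetical, e.g. "Smith 2020; Jones 2021",
// into <xref> elements. Pieces with no match stay as plain text, so the
// export never drops content.
func linkCitations(_ inner: String, lookup: [String: String]) -> String {
    let pieces = inner.components(separatedBy: ";").map {
        $0.trimmingCharacters(in: .whitespaces)
    }
    let linked = pieces.map { piece -> String in
        let escaped = piece
            .replacingOccurrences(of: "&", with: "&amp;")
            .replacingOccurrences(of: "<", with: "&lt;")
            .replacingOccurrences(of: ">", with: "&gt;")
        guard let refID = lookup[citationKey(piece)] else { return escaped }
        return "<xref ref-type=\"bibr\" rid=\"\(refID)\">\(escaped)</xref>"
    }
    return "(" + linked.joined(separator: "; ") + ")"
}
```

Keying the table on a normalised string is one simple form of the fuzzy matching described above; a production matcher would likely need to score partial surname matches as well.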
Generating reference IDs: For each citation UUID like C3736192-02AE-4635-AD53-8DC896A6F500, generate a short, stable JATS ID. Recommended: ref- + first author surname + year, e.g. ref-Halevi2015. Handle collisions (same author, same year) with letter suffixes: ref-Smith2020a, ref-Smith2020b.
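A minimal sketch of this ID scheme follows. The `CitationStub` type and `makeReferenceIDs` function are hypothetical, and the sketch assumes each citation already exposes its first author's surname and year. It strips everything except ASCII letters and digits from the surname so the result is a valid XML ID.

```swift
import Foundation

// Hypothetical minimal view of a citation: UUID, first author's surname, year.
struct CitationStub {
    let uuid: String
    let surname: String
    let year: Int
}

// Returns a map from citation UUID to a reference ID. IDs that would
// collide get letter suffixes (a, b, c, ...) in input order.
func makeReferenceIDs(_ citations: [CitationStub]) -> [String: String] {
    func base(_ c: CitationStub) -> String {
        // Fold accents, then keep only ASCII letters and digits.
        let folded = c.surname.folding(options: .diacriticInsensitive, locale: nil)
        let safe = folded.filter { $0.isASCII && ($0.isLetter || $0.isNumber) }
        return "ref-" + safe + String(c.year)
    }
    var groups: [String: [CitationStub]] = [:]
    for c in citations { groups[base(c), default: []].append(c) }

    var ids: [String: String] = [:]
    let letters = Array("abcdefghijklmnopqrstuvwxyz")
    for (key, members) in groups {
        for (index, member) in members.enumerated() {
            // Sketch limit: more than 26 collisions would repeat suffixes.
            let suffix = members.count > 1 ? String(letters[index % letters.count]) : ""
            ids[member.uuid] = key + suffix
        }
    }
    return ids
}
```

Because the IDs depend on the set of citations, adding a second Smith 2020 later changes ref-Smith2020 to ref-Smith2020a; the in-text rid values must be generated from the same map in the same export run.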
5.4 — Lists
RTF bullet lists (identified by \li300\fi-300 paragraph formatting with \'95 bullet characters) must become JATS <list> elements:
<list list-type="bullet">
<list-item>
<p>First item text...</p>
</list-item>
<list-item>
<p>Second item text...</p>
</list-item>
</list>
Important: In JATS, <list-item> must contain <p> — you cannot put bare text directly inside <list-item>. This is a common validation error.
For ordered/numbered lists, use list-type="order".
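Grouping consecutive bullet paragraphs into one <list> can be sketched as below. This assumes the \'95 escapes have already been decoded to U+2022 and that the paragraph text is already XML-escaped; `renderParagraphs` is a hypothetical name, and a real implementation would probably also check the \li300\fi-300 paragraph formatting.

```swift
import Foundation

// Turns a run of paragraphs into JATS blocks, wrapping each consecutive
// run of bullet paragraphs in a single <list>. Every item gets a <p>
// child because bare text inside <list-item> is not allowed.
func renderParagraphs(_ paragraphs: [String]) -> String {
    var out: [String] = []
    var inList = false
    for text in paragraphs {
        if text.hasPrefix("\u{2022}") {
            if !inList {
                out.append("<list list-type=\"bullet\">")
                inList = true
            }
            // Drop the bullet and the tab or spaces that follow it.
            let item = text.dropFirst().trimmingCharacters(in: .whitespaces)
            out.append("<list-item><p>\(item)</p></list-item>")
        } else {
            if inList {
                out.append("</list>")
                inList = false
            }
            out.append("<p>\(text)</p>")
        }
    }
    if inList { out.append("</list>") }
    return out.joined(separator: "\n")
}
```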
5.5 — Key sentences (\cf4 text)
JATS has no native element for “key sentence” or “thesis statement.” Options:
- Ignore the distinction: merge \cf4 text into regular <p> content. Simplest, but loses valuable semantic information.
- Use <named-content>: JATS provides <named-content content-type="key-sentence"> for publisher-defined inline semantics. This is DTD-valid and preserves the information:
<p><named-content content-type="key-sentence">The rise of large
language models has made the format question unexpectedly
urgent.</named-content> When an LLM ingests a PDF, it must
reverse-engineer structure from visual layout...</p>
Recommendation: Use <named-content>. It’s valid JATS, it’s self-documenting, and publishers can choose to use or ignore it.
5.6 — Author biography section
The sample document includes an “About the Authors” section. In JATS, author biographies go in the <back> matter, not the <body>:
<!-- In <back>, not <body> -->
<bio>
<p><bold>Frode Hegland</bold> is an academic focused on the future
of text for academic use...</p>
</bio>
Or, if there are multiple author bios, each <contrib> in <front> can contain its own <bio> element.
6. The <back> Block: References
6.1 — Reference list structure
<back>
<ref-list>
<title>References</title>
<ref id="ref-Halevi2015">
<element-citation publication-type="journal">
<!-- structured citation content -->
</element-citation>
</ref>
<ref id="ref-Knauff2014">
<element-citation publication-type="journal">
<!-- structured citation content -->
</element-citation>
</ref>
<!-- ... one <ref> per citation ... -->
</ref-list>
</back>
6.2 — Citation type mapping
The Citations.plist bibTeXType field maps to JATS publication-type:
| BibTeX type in .liquid | JATS publication-type |
|---|---|
| article | journal |
| book | book |
| inproceedings | confproc |
| misc | webpage (if URL present) or other |
| incollection | chapter |
| phdthesis | thesis |
| (empty string) | Infer from available fields |
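The table can be expressed as a single mapping function. The sketch below is an assumption-laden illustration: `CitationFields` is a hypothetical cut-down view of a Citations.plist entry, the `publisher` field name is invented for the example, and the fallback precedence for an empty type (journal, then publisher, then URL) is one plausible reading of "infer from available fields", not a rule stated in the source.

```swift
import Foundation

// Minimal view of the fields used for the mapping; names are hypothetical
// except bibTeXType, journal, and webAddress, which appear in this document.
struct CitationFields {
    var bibTeXType = ""
    var journal = ""
    var publisher = ""
    var webAddress = ""
}

// Maps a BibTeX entry type to a JATS publication-type value.
func publicationType(for c: CitationFields) -> String {
    switch c.bibTeXType.lowercased() {
    case "article": return "journal"
    case "book": return "book"
    case "inproceedings": return "confproc"
    case "incollection": return "chapter"
    case "phdthesis": return "thesis"
    case "misc": return c.webAddress.isEmpty ? "other" : "webpage"
    default:
        // Empty or unknown type: infer from whichever fields are filled in.
        // This precedence is an assumption made for the sketch.
        if !c.journal.isEmpty { return "journal" }
        if !c.publisher.isEmpty { return "book" }
        return c.webAddress.isEmpty ? "other" : "webpage"
    }
}
```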
6.3 — Complete citation transformation
Here is the full mapping from a single entry in Citations.plist to JATS XML, using the Halevi et al. citation as the example.
Source data (from Citations.plist):
identifier: C3736192-02AE-4635-AD53-8DC896A6F500
bibTeXType: article
title: "Accessing, Reading and Interacting with Scientific Literature..."
citationAuthors: [
{firstName: "Gali", lastName: "Halevi"},
{firstName: "Henk", middleName: "F.", lastName: "Moed"},
{firstName: "Judit", lastName: "Bar-Ilan"}
]
yearComponent: 2015
journal: "Publishing Research Quarterly"
volume: "31"
pageRange: "102--121"
doi: "10.1007/s12109-015-9404-9"
Target JATS XML:
<ref id="ref-Halevi2015">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Halevi</surname>
<given-names>Gali</given-names>
</name>
<name>
<surname>Moed</surname>
<given-names>Henk F.</given-names>
</name>
<name>
<surname>Bar-Ilan</surname>
<given-names>Judit</given-names>
</name>
</person-group>
<article-title>Accessing, Reading and Interacting with Scientific
Literature as a Factor of Academic Role</article-title>
<source>Publishing Research Quarterly</source>
<year iso-8601-date="2015">2015</year>
<volume>31</volume>
<fpage>102</fpage>
<lpage>121</lpage>
<pub-id pub-id-type="doi">10.1007/s12109-015-9404-9</pub-id>
</element-citation>
</ref>
6.4 — Field-by-field transformation rules
Authors → <person-group>:
// Pseudocode: `citation` and `emit` are assumed to exist in the exporter.
for author in citation.citationAuthors {
    let givenNames = [author.firstName, author.middleName]
        .filter { !$0.isEmpty }
        .joined(separator: " ")
    var surname = author.lastName
    // Handle prefix (e.g. "ten" in "ten Brinke"):
    // the prefix goes before the last name inside <surname>.
    if !author.prefixName.isEmpty {
        surname = author.prefixName + " " + author.lastName
    }
    // Values must be XML-escaped before they are emitted.
    emit("<name>")
    emit("  <surname>\(surname)</surname>")
    if !givenNames.isEmpty {
        emit("  <given-names>\(givenNames)</given-names>")
    }
    emit("</name>")
}
For institutional/anonymous authors (where isAnonymous is true or author names are clearly institutional like “Overleaf”, “arXiv”, “Taylor & Francis”):
<person-group person-group-type="author">
<collab>Taylor &amp; Francis</collab>
</person-group>
Page range → <fpage> and <lpage>: The .liquid stores page ranges as strings like "102--121" or "e115069". Parse these:
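// Pseudocode: `pageRange` and `emit` are assumed to exist in the exporter.
// Split on the BibTeX-style double hyphen; any other non-empty value is
// treated as an electronic article number.
let parts = pageRange.components(separatedBy: "--")
if parts.count == 2 {
    emit("<fpage>\(parts[0])</fpage>")
    emit("<lpage>\(parts[1])</lpage>")
} else if !pageRange.isEmpty {
    // Electronic article number (e.g. "e115069")
    emit("<elocation-id>\(pageRange)</elocation-id>")
}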
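Title cleanup: The .liquid sometimes stores titles with BibTeX escaping: {Rehabilitation Act}, {EU}, {People's Republic of China}. Strip the curly braces and unescape \& to a plain ampersand, then XML-escape the result (& becomes &amp;) when writing it into the document: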
let cleanTitle = rawTitle
.replacingOccurrences(of: "{", with: "")
.replacingOccurrences(of: "}", with: "")
.replacingOccurrences(of: "\\&", with: "&")
URL → <ext-link>:
<ext-link ext-link-type="uri"
xlink:href="https://arxiv.org/abs/2410.03022">
https://arxiv.org/abs/2410.03022
</ext-link>
Only emit if webAddress is non-empty and doi is empty (avoid redundancy — DOI is preferred).
Note field: The note field in Citations.plist often contains legal citations or supplementary info. Map to <comment>:
<comment>29 U.S.C. §794d. Amended by the Workforce Investment Act of 1998</comment>
6.5 — Complete element-citation templates by type
Journal article (publication-type="journal"):
<element-citation publication-type="journal">
<person-group person-group-type="author">...</person-group>
<article-title>TITLE</article-title>
<source>JOURNAL NAME</source>
<year iso-8601-date="YYYY">YYYY</year>
<volume>VOL</volume>
<issue>ISSUE</issue> <!-- if available -->
<fpage>FIRST</fpage>
<lpage>LAST</lpage>
<pub-id pub-id-type="doi">DOI</pub-id>
</element-citation>
Book (publication-type="book"):
<element-citation publication-type="book">
<person-group person-group-type="author">...</person-group>
<source>BOOK TITLE</source> <!-- note: <source>, not <article-title> -->
<year iso-8601-date="YYYY">YYYY</year>
<publisher-name>PUBLISHER</publisher-name>
<publisher-loc>LOCATION</publisher-loc>
<isbn>ISBN</isbn>
</element-citation>
Conference paper (publication-type="confproc"):
<element-citation publication-type="confproc">
<person-group person-group-type="author">...</person-group>
<article-title>PAPER TITLE</article-title>
<conf-name>CONFERENCE/PROCEEDINGS NAME</conf-name>
<conf-loc>LOCATION</conf-loc>
<year iso-8601-date="YYYY">YYYY</year>
<fpage>FIRST</fpage>
<lpage>LAST</lpage>
<series>SERIES NAME</series>
</element-citation>
Webpage / misc (publication-type="webpage"):
<element-citation publication-type="webpage">
<person-group person-group-type="author">...</person-group>
<article-title>PAGE TITLE</article-title>
<year iso-8601-date="YYYY">YYYY</year>
<ext-link ext-link-type="uri" xlink:href="URL">URL</ext-link>
<date-in-citation content-type="access-date"
iso-8601-date="2026-05-27">Accessed May 2026</date-in-citation>
<comment>NOTES</comment>
</element-citation>
6.6 — Element ordering within <element-citation>
This is critical. The JATS DTD enforces a specific order for child elements of <element-citation>. Emitting elements out of order will cause DTD validation to fail. The required order is approximately as follows; confirm the exact content model in the Tag Library, since this list is a guide rather than the DTD itself:
- <person-group> (authors/editors)
- <collab> (if institutional author, without person-group)
- <article-title> (for articles/chapters) OR <source> (for books)
- <source> (journal name / book series for articles)
- <edition>
- <publisher-loc>, <publisher-name>
- <year>, <month>, <day>
- <date-in-citation>
- <volume>, <issue>
- <fpage>, <lpage> OR <elocation-id>
- <pub-id> (DOI, PMID, etc.)
- <ext-link> (URL)
- <comment>
In practice: build each <element-citation> by emitting elements in this order, skipping any that have no data. The DTD is lenient about missing elements but strict about order.
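One way to apply this is to keep the order in a single array and filter it against whatever fields a citation has. The sketch below assumes each child element has already been serialised to an XML string keyed by its element name. `citationElementOrder` and `orderedChildren` are hypothetical names, and the order is copied from the approximate list in this section, so it should be checked against the Tag Library.

```swift
import Foundation

// Emission order taken from the approximate list in this section.
let citationElementOrder = [
    "person-group", "collab", "article-title", "source", "edition",
    "publisher-loc", "publisher-name", "year", "month", "day",
    "date-in-citation", "volume", "issue", "fpage", "lpage",
    "elocation-id", "pub-id", "ext-link", "comment"
]

// Returns the serialised children in the fixed order, skipping any
// element that has no data. Keys not in the order list are ignored.
func orderedChildren(_ fields: [String: String]) -> [String] {
    return citationElementOrder.compactMap { name -> String? in
        guard let xml = fields[name], !xml.isEmpty else { return nil }
        return xml
    }
}
```

Because the order lives in one place, correcting it after a validation failure is a one-line change rather than an edit to every citation template.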
7. Origami Extensions in JATS
JATS has a mechanism for custom metadata that can carry Origami-specific information without breaking validation.
7.1 — Custom metadata in <front>
<article-meta>
<!-- ... standard metadata ... -->
<custom-meta-group>
<custom-meta>
<meta-name>origami-version</meta-name>
<meta-value>1.0</meta-value>
</custom-meta>
<custom-meta>
<meta-name>origami-addressing</meta-name>
<meta-value>true</meta-value>
</custom-meta>
<custom-meta>
<meta-name>visual-meta-url</meta-name>
<meta-value>https://visual-meta.info/origami-text</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
7.2 — Section IDs as Purple Numbers
JATS <sec> elements already support id attributes. Use the same ot-NNN scheme as the EPUB export:
<sec id="ot-sec-3">
<title>The Three Pillars</title>
<p id="ot-015">Our approach addresses the legacy limitations...</p>
</sec>
The id attribute is valid on <p>, <sec>, <list-item>, <fig>, <table-wrap>, and most other JATS elements. This means the entire Origami addressing scheme is natively expressible in JATS without any extensions.
7.3 — Glossary terms
JATS has a native <glossary> element in <back>:
<back>
<glossary>
<title>Defined Concepts</title>
<def-list>
<def-item>
<term>Origami Text</term>
<def><p>A disciplined approach to scholarly document formatting
based on simplified EPUB with enhanced metadata.</p></def>
</def-item>
</def-list>
</glossary>
<ref-list>...</ref-list>
</back>
8. Data Not in .liquid That JATS Requires
The following information is expected by publishers but does not exist in the current .liquid format. The JATS export must either collect it from the user at export time or provide sensible defaults.
| JATS element | Required by | Recommendation |
|---|---|---|
| Co-authors + affiliations | All publishers | Export dialog: author entry form |
| Abstract | All publishers | Export dialog: text field (or auto-generate from key sentences) |
| Keywords | Most publishers | Export dialog: comma-separated field |
| DOI | Publisher assigns after acceptance | Leave empty or use placeholder |
| Permissions / license | All publishers | Default to CC-BY 4.0 with override |
| Funding statement | Most publishers | Export dialog: optional text field |
| Article type | All publishers | Default to research-article with dropdown |
| Conflict of interest | Many publishers | Export dialog: optional text field |
| Acknowledgements | Optional | Check if present in document body |
| Corresponding author email | Most publishers | Export dialog: email field |
Suggested UI: “Prepare for Submission” dialog that appears before JATS export, with fields for all the above. Pre-populate what can be inferred from the .liquid (first author from Author.plist, institution, key sentences as draft abstract).
9. Validation
9.1 — DTD validation
The exported XML must validate against the JATS Article Authoring 1.4 DTD. Use an XML validator (such as xmllint with the DTD, or oXygen XML Editor) during development.
# Command-line validation with xmllint
xmllint --dtdvalid JATS-articleauthoring1-4.dtd --noout output.xml
Download the DTD and its dependent module files from: https://public.nlm.nih.gov/projects/jats/articleauthoring/1.4/
Common validation errors to watch for:
- Element ordering within <element-citation> (see Section 6.6)
- Missing required child elements (e.g. <surname> inside <name>)
- <p> missing inside <list-item>
- Text content directly inside <sec> without being wrapped in a block element
- <xref> with a rid that doesn’t match any id in the document
- Missing xlink namespace on <ext-link> elements
- Ampersands not escaped as &amp;
- Curly/smart quotes as raw Unicode (this is fine in UTF-8 XML, but verify)
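The rid/id mismatch can be caught before running the DTD validator with a quick pre-flight check. The sketch below is a rough regular-expression scan rather than a real XML parse, and the function names `attributeValues` and `danglingReferences` are hypothetical. It assumes attributes are written with double quotes, as in the examples in this document.

```swift
import Foundation

// Collects the values of one attribute, e.g. id="..." or rid="...".
func attributeValues(_ attribute: String, in xml: String) -> Set<String> {
    // The leading \s keeps "id" from matching inside "rid" or "pub-id-type".
    let pattern = "\\s\(attribute)=\"([^\"]*)\""
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let range = NSRange(xml.startIndex..., in: xml)
    var values = Set<String>()
    for match in regex.matches(in: xml, range: range) {
        if let valueRange = Range(match.range(at: 1), in: xml) {
            values.insert(String(xml[valueRange]))
        }
    }
    return values
}

// Returns every rid target that has no matching id in the document.
func danglingReferences(in xml: String) -> Set<String> {
    // A rid may list several space-separated targets, so split each value.
    let targets = attributeValues("rid", in: xml).flatMap {
        $0.split(separator: " ").map(String.init)
    }
    return Set(targets).subtracting(attributeValues("id", in: xml))
}
```

An empty result does not prove the document is valid; it only rules out this one class of error before the slower DTD run.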
9.2 — Schematron / business rules
Beyond DTD validity, publishers enforce additional rules. For ACM specifically (via Atypon Literatum), the XML must also pass content checks for: presence of article ID, contributor names, journal/publisher data, and an article title. These are documented in Atypon’s Content Tagging Guide (not publicly available, but the oXygen framework at https://github.com/le-tex/oXygenJATSframework_Literatum includes Schematron checks that reveal many of these rules).
9.3 — Test workflow
- Export from Author to JATS XML
- Validate against DTD with xmllint
- Open in oXygen XML Editor (if available) for visual inspection
- Run the NCBI Preview Stylesheets to generate an HTML preview: https://github.com/NCBITools/JATSPreviewStylesheets
- Verify that the HTML preview renders all sections, citations, and cross-references correctly
10. Complete Example: The Origami Text Article
Here is a skeleton of the complete JATS output for the Origami Text article, showing the structure with abbreviated content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article
PUBLIC "-//NLM//DTD JATS (Z39.96) Article Authoring DTD v1.4 20241031//EN"
"JATS-articleauthoring1-4.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
article-type="research-article"
xml:lang="en">
<front>
<article-meta>
<title-group>
<article-title>Origami Text</article-title>
<subtitle>Minimal EPUB, Rich Metadata</subtitle>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Hegland</surname>
<given-names>Frode Alexander</given-names></name>
<email>frode@augmentedtext.info</email>
<xref ref-type="aff" rid="aff-1"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Cerf</surname>
<given-names>Vinton G.</given-names></name>
<xref ref-type="aff" rid="aff-2"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Serageldin</surname>
<given-names>Ismail</given-names></name>
<xref ref-type="aff" rid="aff-3"/>
</contrib>
</contrib-group>
<aff id="aff-1">The Augmented Text Company; University of Southampton</aff>
<aff id="aff-2">Google</aff>
<aff id="aff-3">Library of Alexandria</aff>
<abstract>
<p>This paper proposes Origami Text, a disciplined approach to
scholarly document formatting based on simplified EPUB 3 with
enhanced metadata layers. We argue that neither PDF nor HTML
adequately serves the needs of modern scholarly communication...</p>
</abstract>
<kwd-group>
<kwd>EPUB</kwd>
<kwd>scholarly publishing</kwd>
<kwd>metadata</kwd>
<kwd>accessibility</kwd>
<kwd>large language models</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>origami-version</meta-name>
<meta-value>1.0</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<p id="ot-001"><named-content content-type="key-sentence">We have
long celebrated the pen but we have paid less attention to what it
writes on.</named-content> The pen is indeed mighty; what the pen
writes upon determines what can be written...</p>
<p id="ot-002">In Socrates' framing, <named-content
content-type="key-sentence">the soul is the true substrate of
knowledge,</named-content> not the 'dead' page of papyrus...</p>
<sec id="ot-sec-1">
<title>Why Current Scholarly Formats Fall Short</title>
<p id="ot-004">To solve the urgent, complex problems facing our
world, <named-content content-type="key-sentence">we require
richly interactive knowledge environments</named-content>...</p>
<list list-type="bullet">
<list-item id="ot-005">
<p><bold>PDF</bold> provides stability, portability, and
robustness, but it fails on interactivity...</p>
</list-item>
<list-item id="ot-006">
<p><bold>HTML</bold> delivers exceptional interactivity,
but it fails on portability...</p>
</list-item>
</list>
</sec>
<sec id="ot-sec-4">
<title>Why Format Fidelity Matters for AI</title>
<p id="ot-020"><named-content content-type="key-sentence">All of
this metadata, along with the main text, is not only cleanly
accessible to EPUB reader software but also to AI for clear
parsing.</named-content>...</p>
<p id="ot-021"><named-content content-type="key-sentence">The rise
of large language models has made the format question unexpectedly
urgent.</named-content> When an LLM ingests a PDF, it must
reverse-engineer structure from visual layout...</p>
</sec>
<sec id="ot-sec-6">
<title>Beyond the Conventions of Print</title>
<p id="ot-030">Much academic publishing today still passes through
LaTeX which is a powerful system...
(<xref ref-type="bibr" rid="ref-Knauff2014">Knauff &amp;
Nejasmic 2014</xref>)...</p>
</sec>
<!-- ... remaining sections ... -->
</body>
<back>
<glossary>
<title>Defined Concepts</title>
<def-list>
<def-item>
<term>Origami Text</term>
<def><p>A disciplined approach to scholarly document formatting
based on simplified EPUB with enhanced metadata.</p></def>
</def-item>
</def-list>
</glossary>
<ref-list>
<title>References</title>
<ref id="ref-Halevi2015">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Halevi</surname>
<given-names>Gali</given-names></name>
<name><surname>Moed</surname>
<given-names>Henk F.</given-names></name>
<name><surname>Bar-Ilan</surname>
<given-names>Judit</given-names></name>
</person-group>
<article-title>Accessing, Reading and Interacting with
Scientific Literature as a Factor of Academic
Role</article-title>
<source>Publishing Research Quarterly</source>
<year iso-8601-date="2015">2015</year>
<volume>31</volume>
<fpage>102</fpage>
<lpage>121</lpage>
<pub-id pub-id-type="doi">10.1007/s12109-015-9404-9</pub-id>
</element-citation>
</ref>
<ref id="ref-Knauff2014">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Knauff</surname>
<given-names>Markus</given-names></name>
<name><surname>Nejasmic</surname>
<given-names>Jelica</given-names></name>
</person-group>
<article-title>An Efficiency Comparison of Document Preparation
Systems Used in Academic Research and
Development</article-title>
<source>PLOS ONE</source>
<year iso-8601-date="2014">2014</year>
<volume>9</volume>
<elocation-id>e115069</elocation-id>
<pub-id pub-id-type="doi">10.1371/journal.pone.0115069</pub-id>
</element-citation>
</ref>
<ref id="ref-Kumar2024">
<element-citation publication-type="webpage">
<person-group person-group-type="author">
<name><surname>Kumar</surname>
<given-names>Anukriti</given-names></name>
<name><surname>Wang</surname>
<given-names>Lucy Lu</given-names></name>
</person-group>
<article-title>Uncovering the New Accessibility Crisis in
Scholarly PDFs</article-title>
<year iso-8601-date="2024">2024</year>
<pub-id pub-id-type="doi">10.48550/arXiv.2410.03022</pub-id>
<ext-link ext-link-type="uri"
xlink:href="https://arxiv.org/abs/2410.03022">
https://arxiv.org/abs/2410.03022</ext-link>
</element-citation>
</ref>
<ref id="ref-tenBrinke2025a">
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name><surname>ten Brinke</surname>
<given-names>Wouter</given-names></name>
<name><surname>Griepsma</surname>
<given-names>Bart</given-names></name>
<name><surname>Ignatovič</surname></name>
</person-group>
<article-title>On the Structuring of LaTeX
Projects</article-title>
<conf-name>Proceedings of the 24th Belgium-Netherlands Software
Evolution Workshop</conf-name>
<conf-loc>Enschede, The Netherlands</conf-loc>
<year iso-8601-date="2025">2025</year>
<fpage>1</fpage>
<lpage>7</lpage>
<series>BENEVOL '25</series>
</element-citation>
</ref>
<ref id="ref-TaylorFrancis2024">
<element-citation publication-type="webpage">
<person-group person-group-type="author">
<collab>Taylor &amp; Francis</collab>
</person-group>
<article-title>Taylor &amp; Francis Joins DAISY Consortium's
Inclusive Publishing Partner Program</article-title>
<year iso-8601-date="2024">2024</year>
<ext-link ext-link-type="uri"
xlink:href="https://newsroom.taylorandfrancisgroup.com/taylor-and-francis-joins-daisy-consortium-inclusive-publishing-partner-program/">
Taylor &amp; Francis Newsroom</ext-link>
<date-in-citation content-type="access-date"
iso-8601-date="2026-05">Accessed May
2026</date-in-citation>
</element-citation>
</ref>
<!-- ... remaining references ... -->
</ref-list>
</back>
</article>
11. Implementation Phases
Phase 1: Skeleton generator (1 week)
Build the JATS document scaffolding: <article>, <front>, <body>, <back> with hardcoded metadata. Get DTD validation passing on a minimal document.
Phase 2: Front matter from .liquid (1–2 weeks)
Parse the RTF title block to populate <title-group> and Author.plist to populate the first <contrib>. Build the export dialog for co-authors, abstract, and keywords. Achieve DTD validation on <front>.
Phase 3: Body conversion (2–3 weeks)
RTF → JATS body parser: sections from heading detection, paragraphs, lists, inline formatting (<bold>, <italic>), hyperlinks, key sentence markup. This is the largest single task.
Phase 4: Citation transformation (2–3 weeks)
Citations.plist → <ref-list> converter for all citation types. In-text citation matching and <xref> insertion. Reference ID generation with collision handling. This is the most error-prone task.
Phase 5: Validation and testing (1–2 weeks)
DTD validation, Schematron checks, NCBI Preview Stylesheet rendering, edge case testing (empty fields, special characters, Unicode, long documents).
Phase 6: Origami extensions (1 week)
<custom-meta-group> for Origami version, <named-content> for key sentences, id attributes on all block elements, <glossary> from glossary.json.
Total: 8–12 weeks for one developer, assuming familiarity with XML and the Author codebase.
12. Key Resources
| Resource | URL |
|---|---|
| JATS 1.4 Article Authoring Tag Library | https://jats.nlm.nih.gov/articleauthoring/tag-library/1.4/index.html |
| JATS 1.4 DTD (download) | https://jats.nlm.nih.gov/articleauthoring/1.4/JATS-articleauthoring1-4.dtd |
| All JATS 1.4 schemas | https://public.nlm.nih.gov/projects/jats/articleauthoring/1.4/ |
| Full JATS 1.4 standard (PDF) | https://groups.niso.org/higherlogic/ws/public/download/31415/ANSI-NISO-z39.96-2024.pdf |
| NCBI Preview Stylesheets | https://github.com/NCBITools/JATSPreviewStylesheets |
| Atypon/Literatum JATS framework (oXygen) | https://github.com/le-tex/oXygenJATSframework_Literatum |
| ACM TAPS documentation | https://authors.acm.org/proceedings/production-information/taps-production-workflow |
| JATS discussion mailing list | https://jats.nlm.nih.gov/jats-list.html |
| Sample JATS articles (NLM) | https://jats.nlm.nih.gov/articleauthoring/tag-library/1.4/chapter/samples.html |