The structure of Origami Text follows that of EPUB, with the simplifications and enhancements described below. An EPUB is literally a ZIP archive with its extension renamed from .zip to .epub. Its contents are as follows:
- origami-document.epub
  - mimetype (uncompressed file)
  - package.opf (file)
  - visual-meta.json (file)
  - META-INF/ (directory, containing container.xml)
  - content/ (directory, containing paper.html, optionally style.css, and an images/ subdirectory)
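The layout above can be produced with Python's standard zipfile module. This is a minimal sketch, not a reference exporter: the function name build_origami_epub and its parameters are hypothetical, and the container.xml shown assumes package.opf sits at the archive root, as listed. It follows the EPUB container rule that mimetype is the first entry in the archive, is stored uncompressed, and contains the text application/epub+zip.

```python
import io
import zipfile

# META-INF/container.xml tells reading systems where the package document lives.
# Here it points at package.opf at the archive root, matching the layout above.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_origami_epub(out, package_opf, visual_meta_json, paper_html,
                       style_css=None, images=None):
    """Write an Origami Text package (hypothetical helper, a sketch only).

    `out` is a path or a writable binary file object; `images` maps file
    names such as "fig1.png" to their bytes.
    """
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("package.opf", package_opf)
        zf.writestr("visual-meta.json", visual_meta_json)
        zf.writestr("content/paper.html", paper_html)
        if style_css is not None:
            zf.writestr("content/style.css", style_css)
        for name, data in (images or {}).items():
            zf.writestr(f"content/images/{name}", data)


# Usage: build into memory and list the entries in archive order.
buf = io.BytesIO()
build_origami_epub(buf, "<package/>", "{}", "<html></html>", style_css="p {}")
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())
```

Deflating everything except mimetype keeps the archive small while still letting a reading system identify the file type from the first bytes of the archive.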
Notes for Implementing Origami Text Creation
Simplifications
Single HTML content document. Where EPUB allows — and often encourages — splitting content across multiple XHTML files (one per chapter, section, or even page), Origami Text requires exactly one HTML file. No chapter splitting. No separate navigation document beyond what package.opf provides. The entire document lives in paper.html.
Single optional CSS file. At most one style.css. No inline styles (no style attributes and no <style> elements) in the HTML. Presentation is cleanly separated and entirely optional: the document must be meaningful and readable without the stylesheet, or when viewed with the user’s own stylesheet.
Semantic markup only. The HTML carries document structure — what is a heading, what is a citation, what is a figure — but makes no visual claims. Standard HTML5 elements (<article>, <section>, <h1>–<h6>, <figure>, <blockquote>, <cite>) encode meaning, not appearance.
Images as inline figures only. Images appear in document flow inside <figure> elements. No floats, no absolute positioning, no background images, no decorative graphics.
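No scripting. No executable JavaScript in the content document; the only <script> element permitted is the non-executable JSON metadata block described under Enhancements. The document is inert text with structure.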
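These simplification rules lend themselves to an automated check before export. The sketch below uses Python's built-in html.parser to flag style attributes, <style> elements, executable scripts, inline event-handler attributes such as onclick, and <img> elements outside a <figure>. The class and function names are hypothetical, and this is a quick structural lint, not a full validator; it treats any <script> whose type is not application/json as executable.

```python
from html.parser import HTMLParser


class OrigamiLint(HTMLParser):
    """Collects violations of the simplification rules (hypothetical checker)."""

    def __init__(self):
        super().__init__()
        self.problems = []
        self._figure_depth = 0

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if name == "style":
                self.problems.append(f"style attribute on <{tag}>")
            elif name.startswith("on"):
                # Event-handler attributes such as onclick are JavaScript too.
                self.problems.append(f"{name} attribute on <{tag}>")
        if tag == "style":
            self.problems.append("<style> element")
        elif tag == "script":
            script_type = (dict(attrs).get("type") or "").lower()
            if script_type != "application/json":
                self.problems.append("executable <script>")
        elif tag == "figure":
            self._figure_depth += 1
        elif tag == "img" and self._figure_depth == 0:
            self.problems.append("<img> outside <figure>")

    def handle_endtag(self, tag):
        if tag == "figure" and self._figure_depth > 0:
            self._figure_depth -= 1


def lint_paper_html(html_text):
    """Return a list of rule violations found in paper.html (empty if none)."""
    checker = OrigamiLint()
    checker.feed(html_text)
    checker.close()
    return checker.problems


print(lint_paper_html('<p style="color:red">Hi</p><img src="images/a.png">'))
```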
Enhancements
Visual-Meta in dual locations. A visual-meta.json file sits at the package root for programmatic access. The same data is also embedded as a <script type="application/json"> block at the end of paper.html, so that if someone extracts just the HTML from the package, the metadata travels with it. This carries the full intellectual map of the document: metadata, defined concepts, glossary entries, and spatial layout coordinates (x, y, z positions and an ID for each node) for reconstitution in XR environments.
The standalone visual-meta.json needs to be listed in the manifest section of package.opf but not in the spine, for Apple Books compatibility. Inside paper.html, the same data embedded as <script type="application/json"> is also safe because Apple Books doesn’t execute or render non-JavaScript script blocks.
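One way to confirm the manifest-but-not-spine rule is to inspect package.opf with Python's xml.etree.ElementTree. The sample package document below is abbreviated (metadata and other required parts are omitted), the application/json media type for visual-meta.json is an assumption, and check_visual_meta_listing is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Abbreviated package document: metadata and other required parts are omitted.
SAMPLE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <manifest>
    <item id="paper" href="content/paper.html" media-type="application/xhtml+xml"/>
    <item id="css" href="content/style.css" media-type="text/css"/>
    <item id="visual-meta" href="visual-meta.json" media-type="application/json"/>
  </manifest>
  <spine>
    <itemref idref="paper"/>
  </spine>
</package>
"""


def check_visual_meta_listing(opf_text):
    """Return (ok, message): visual-meta.json must be in the manifest, not the spine."""
    root = ET.fromstring(opf_text.encode("utf-8"))
    items = root.findall("opf:manifest/opf:item", OPF_NS)
    # Hrefs are relative to package.opf, which sits at the archive root.
    by_href = {item.get("href"): item for item in items}
    entry = by_href.get("visual-meta.json")
    if entry is None:
        return False, "visual-meta.json is missing from the manifest"
    spine_ids = {ref.get("idref")
                 for ref in root.findall("opf:spine/opf:itemref", OPF_NS)}
    if entry.get("id") in spine_ids:
        return False, "visual-meta.json must not appear in the spine"
    return True, "ok"


print(check_visual_meta_listing(SAMPLE_OPF))
```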
The reason to have both is that they serve different consumers. A tool that wants the metadata programmatically grabs the JSON file directly without parsing HTML. A person who extracts just paper.html from the archive still has the metadata travel with the document. They’re redundant by design.
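A consumer that has only paper.html can recover the metadata without any EPUB tooling. This sketch, again using Python's html.parser, collects the contents of every <script type="application/json"> block and decodes the last one, since the block is placed at the end of the document; the helper names are hypothetical. Comparing the result with the parsed visual-meta.json is a simple way to check that the two copies have not drifted apart.

```python
import json
from html.parser import HTMLParser


class _JsonBlockFinder(HTMLParser):
    """Gathers the raw text of every <script type="application/json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._chunks = None  # a list while inside a JSON block, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and (dict(attrs).get("type") or "").lower() == "application/json":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.blocks.append("".join(self._chunks))
            self._chunks = None


def embedded_visual_meta(html_text):
    """Decode the last JSON block in paper.html, or return None if there is none."""
    finder = _JsonBlockFinder()
    finder.feed(html_text)
    finder.close()
    return json.loads(finder.blocks[-1]) if finder.blocks else None


page = ('<article><p id="1">Text.</p></article>'
        '<script type="application/json">{"title": "Demo"}</script>')
package_copy = json.loads('{"title": "Demo"}')  # stands in for visual-meta.json
print(embedded_visual_meta(page) == package_copy)
```

When writing the block, any "</" inside the JSON text should be escaped as "<\/" so that a string such as "</script>" cannot end the element early; JSON parsers read "\/" back as "/".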
So the root level stays at three files (mimetype, package.opf, visual-meta.json) and two directories (META-INF/ and content/). That said, to be maximally cautious, you could test an EPUB with an unknown JSON file in the manifest against Apple Books and EPUBCheck to confirm. I’d be surprised if either complained, but it’s a five-minute test.
Round-trippable citations. Every entry in the References section of the HTML carries machine-readable citation data (BibTeX and/or CSL-JSON) in data-* attributes, in addition to the human-readable formatted text. A reader can extract structured citation data from any reference, not just a formatted string. The same citation data appears in visual-meta.json.
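The round trip can be sketched in Python with the standard html and json modules. The attribute name data-csl-json, the helper names, and the reference id 7C are hypothetical, since the text does not fix them. The JSON is HTML-escaped into the attribute on export, and html.parser unescapes attribute values on import, so the parsed citation compares equal to the original. The sample entry is Engelbart's 1962 report expressed as CSL-JSON.

```python
import html
import json
from html.parser import HTMLParser


def reference_item(ref_id, formatted_text, csl):
    """Render one References entry carrying CSL-JSON in a data-* attribute."""
    payload = html.escape(json.dumps(csl, ensure_ascii=False), quote=True)
    return (f'<li id="{html.escape(ref_id)}" data-csl-json="{payload}">'
            f'{html.escape(formatted_text)}</li>')


class _CitationReader(HTMLParser):
    """Maps each reference id to its decoded CSL-JSON object."""

    def __init__(self):
        super().__init__()
        self.citations = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "li" and "data-csl-json" in attr_map:
            # html.parser has already unescaped &quot; and similar references.
            self.citations[attr_map.get("id")] = json.loads(attr_map["data-csl-json"])


def read_citations(html_text):
    reader = _CitationReader()
    reader.feed(html_text)
    reader.close()
    return reader.citations


csl = {
    "type": "report",
    "title": "Augmenting Human Intellect: A Conceptual Framework",
    "author": [{"family": "Engelbart", "given": "Douglas C."}],
    "issued": {"date-parts": [[1962]]},
}
li = reference_item("7C", "Engelbart, D. C. (1962). Augmenting Human Intellect: "
                          "A Conceptual Framework.", csl)
print(read_citations(li)["7C"] == csl)
```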
Spatial layout metadata. The Visual-Meta includes spatial coordinates for glossary nodes, citation nodes, and document sections, enabling an XR-capable reader (such as Author’s Reader for visionOS) to reconstitute the full spatial knowledge environment. This metadata is deliberately minimal — position, ID, and nothing more (no aspect, orientation, visual appearance etc.) — so that different renderers can interpret it differently according to user preferences.
Hierarchical addressing. Every addressable element in the document (every heading, paragraph, list item, and figure) receives an HTML anchor (id attribute) at export time, following a hierarchical scheme inspired by Doug Engelbart’s purple numbers. For example, the second element under section 3, whether a paragraph or a subsection, carries the address 3B. Because Origami Text is an export format, the document structure is frozen at the moment of export, so these addresses are stable and permanent. The anchors are not displayed in the text itself, but they are present in the HTML so that a reading application can reveal them if the reader wishes. Academic papers already number their headings (1, 1.1, 2, 3.2, and so on), which gives readers a natural way to refer to locations in the document; the hierarchical anchors extend the same principle down to the paragraph level. When a reader selects text and copies it as a citation, the clipboard carries the anchor address along with the quoted text and the document’s bibliographic metadata, so that the resulting citation can link directly to the relevant passage, not just to the document as a whole.
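The address assignment can be sketched as a recursive walk over the document tree. This sketch assumes the alternating number-letter pattern of Engelbart's statement numbers, written in upper case as in the 3B example: digits at the top level, letters at the next, digits again below that. The second element under section 3 becomes 3B and the first element under 3B becomes 3B1; the alternation is what keeps an address such as 3B12 unambiguous. The Node type, the function names, and the spreadsheet-style continuation past Z (AA, AB, and so on) are assumptions for illustration, not part of the format as described.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A structural element of paper.html (hypothetical model)."""
    tag: str                                   # e.g. "section", "p", "li", "figure"
    children: list = field(default_factory=list)
    address: str = ""


def to_letters(n):
    """1 -> A, 26 -> Z, 27 -> AA (spreadsheet-style continuation, an assumption)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def assign_addresses(nodes, prefix="", depth=0):
    """Give each node an address, alternating digits and letters by depth."""
    for position, node in enumerate(nodes, start=1):
        step = str(position) if depth % 2 == 0 else to_letters(position)
        node.address = prefix + step
        assign_addresses(node.children, node.address, depth + 1)


# Section 3 holds a paragraph (3A) and a subsection (3B) with one paragraph (3B1).
doc = [
    Node("section"),
    Node("section"),
    Node("section", [Node("p"), Node("section", [Node("p")])]),
]
assign_addresses(doc)
subsection = doc[2].children[1]
print(subsection.address, subsection.children[0].address)  # 3B 3B1
```

An exporter would then write each address into the element's id attribute, for example <p id="3B1">.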
Graceful degradation across four levels. The same file works at every level of reader capability:
- An XR-capable reader reconstitutes the full spatial experience from the embedded metadata.
- A standard EPUB reader renders it as a styled document.
- A web browser opens the HTML directly as a readable page.
- A plain text editor or AI system sees clean semantic HTML with embedded JSON metadata.
Spatial Views
The document’s HTML already contains every element that matters — headings, paragraphs, glossary terms, citation entries, figures — and each of these already carries an id attribute through the hierarchical addressing scheme described above. The spatial layout system does not duplicate or redescribe any of this content. A spatial view is simply a list of IDs and positions: it takes elements that already exist in the document and gives each one an x, y, and z coordinate. That is all.
This means a layout entry is extremely small. A single node in a spatial view looks like this:
{ "ref": "3B", "x": 1.2, "y": 0.5, "z": 0.0 }
The ref value points to an id that already exists in the HTML. The coordinates are in metres, using a right-handed, Y-up system matching visionOS conventions. There is no rendering information — no colour, no size, no font. How a node looks when opened or closed is the reader application’s decision, not the document’s.
A document may contain multiple spatial views, each arranging the same elements differently. An author might provide a default arrangement, a timeline view, and a citation-cluster view, all in the same file. Each layout is a named array of these ID-and-position entries. A layout does not need to include every element in the document — it includes only the ones the author chose to place in that particular view. The same element can appear in multiple layouts at different positions.
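Because every entry only points at an existing id, a reader or exporter can check a set of views for dangling references before using them. In this sketch the views live under a hypothetical spatialViews key in the Visual-Meta JSON, each named view being an array of entries shaped like the single-node example above; the key name and helper names are assumptions, since the text does not specify them.

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects every id attribute value in paper.html."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def dangling_refs(html_text, visual_meta):
    """Return {view name: [refs with no matching id]} for every spatial view."""
    collector = _IdCollector()
    collector.feed(html_text)
    collector.close()
    problems = {}
    for view_name, entries in visual_meta.get("spatialViews", {}).items():
        missing = [e["ref"] for e in entries if e["ref"] not in collector.ids]
        if missing:
            problems[view_name] = missing
    return problems


html_text = '<h2 id="3">Method</h2><p id="3A">First.</p><p id="3B">Second.</p>'
visual_meta = {
    "spatialViews": {
        # Coordinates in metres, right-handed and Y-up; the same ref may recur.
        "default": [{"ref": "3B", "x": 1.2, "y": 0.5, "z": 0.0}],
        "timeline": [
            {"ref": "3B", "x": -0.4, "y": 1.0, "z": -1.5},
            {"ref": "9Z", "x": 0.0, "y": 1.0, "z": -1.5},
        ],
    }
}
print(dangling_refs(html_text, visual_meta))
```

Here the timeline view reuses 3B at a different position, which is allowed, while 9Z has no matching id in the HTML and is reported.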
Because the nodes are defined once in the HTML and merely referenced by the layouts, the spatial data adds very little weight to the file. A layout of fifty nodes is roughly fifty lines of JSON. The intellectual content — the definitions, the citation metadata, the prose — lives in the HTML where it belongs. The spatial views are just a way of saying where things go.
What Origami Text Is Not
Origami Text is an export and distribution format. In our current proof-of-concept word processor, the working format is .liquid (the Author document format). Origami Text is what a .liquid document becomes when it leaves the author’s workspace and enters the world: a flat sheet that folds into a spatial object and unfolds back again without losing anything.
Just as AI can reconstruct space from ‘paper’, folding it like origami, here is an AI-generated tribute to Doug Engelbart and his Viking heritage, made as we work to implement high-resolution addressing and spatial knowledge views: