Because of its roots in PostScript, a printer language, a PDF closely resembles a vector graphic and contains little or no machine-readable information about the document’s logical structure. That structure tells the computer how text should reflow during editing and states the role of each element (e.g. heading, table). It is present while the document is being edited in a word processor, but is lost when “printing” to PDF.
The first step is the hybrid PDF. A PDF file is a container that can hold any other type of data. By embedding the document’s source in the PDF, the file can later be reopened in the source application, edited, and used to regenerate a new PDF. This feature was implemented in OpenOffice (and its fork, LibreOffice) several years ago, but has been poorly advertised.
However, such basic hybrid PDF files have several limitations. They are effectively two independent files in one: opening the file for editing on another system does not guarantee that the layout will be preserved, particularly if fonts are missing or other technical issues occur. Furthermore, the document cannot be edited with a different application, so the best tools for each task cannot be combined.
Each application has its own internal model of how a document is represented. However, many features are common across a wide range of mostly-text documents from different sources, such as headings, lists, and tables. By defining a universal model, these core features can be represented in a standardized form, so that a wide variety of applications can edit them predictably.
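To make the idea of a universal model more concrete, the following is a minimal sketch in Python. It is purely illustrative and is not the model proposed here: the class names (`Heading`, `Paragraph`, `ItemList`, `Table`, `Document`) and the two consumer functions are hypothetical, chosen only to show how independent tools could work on one shared representation of headings, lists, and tables.

```python
from dataclasses import dataclass, field
from typing import List, Union


# Hypothetical block types covering the common core features.
@dataclass
class Heading:
    level: int
    text: str


@dataclass
class Paragraph:
    text: str


@dataclass
class ItemList:
    items: List[str]
    ordered: bool = False


@dataclass
class Table:
    header: List[str]
    rows: List[List[str]]


Block = Union[Heading, Paragraph, ItemList, Table]


@dataclass
class Document:
    blocks: List[Block] = field(default_factory=list)


def outline(doc: Document) -> List[str]:
    """One consumer: list the headings, indented by heading level."""
    return [
        "  " * (block.level - 1) + block.text
        for block in doc.blocks
        if isinstance(block, Heading)
    ]


def to_plain_text(doc: Document) -> str:
    """Another consumer: render the same model as plain text."""
    lines: List[str] = []
    for block in doc.blocks:
        if isinstance(block, Heading):
            lines.append("#" * block.level + " " + block.text)
        elif isinstance(block, Paragraph):
            lines.append(block.text)
        elif isinstance(block, ItemList):
            for number, item in enumerate(block.items, start=1):
                marker = f"{number}." if block.ordered else "-"
                lines.append(f"{marker} {item}")
        elif isinstance(block, Table):
            lines.append(" | ".join(block.header))
            lines.extend(" | ".join(row) for row in block.rows)
    return "\n".join(lines)


doc = Document([
    Heading(1, "Report"),
    Paragraph("A short introduction."),
    Heading(2, "Results"),
    ItemList(["first point", "second point"], ordered=True),
])
print(outline(doc))
print(to_plain_text(doc))
```

Because both functions depend only on the shared block types, neither needs to know which application produced the document. A real model would also have to carry styling, layout, and a link between each element and its place on the page, all of which this sketch leaves out.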