

By Guillaume Endignoux

According to popular belief, PDF is an obscure binary format that is almost impossible to modify. This is partly true, in the sense that the format is so complex that it would take years as a full-time job to fully understand it. However, it is possible to parse, analyze and validate PDFs, and these were the objectives of the Caradoc project.

After my work on this project (still in progress), I am starting a series of posts to demystify the Portable Document Format and share some relevant feedback. In this first post, I give an overview of the PDF syntax. Other posts in the series:

- Incremental updates: who said that PDF was read-only?
- Object streams and xref streams: let’s compress!

From a historical perspective, PDF stems from PostScript, a vector graphics format born in the 1980s. The first PDF reader was released in the early 1990s, and PDF was later formalized as an ISO standard (number 32000-1) in 2008. This means that a PDF specification was published, and you can indeed find a copy on Adobe’s website.

After downloading this document (itself delivered as a PDF file!), you can enjoy a nice 756-page read, or just lazily follow my summary blog posts. The specification is sometimes quite verbose, but the section titles are mostly self-explanatory, so we will start this journey at section 7 (Syntax).

Syntax of objects

Contrary to what people expect, PDF is above all a textual file format, whose syntax is mostly inherited from PostScript. This means that a human could in theory write or read it manually, but I don’t know many people who do it with real files.

In practice, the textual nature of PDF means that bytes are grouped into several character classes:

- white-space, to separate tokens: 0x00 (null), 0x09 (horizontal tab), 0x0A (line feed), 0x0C (form feed), 0x0D (carriage return), 0x20 (space).
- delimiters, to construct tokens: ()<>[]{}/%.
- regular characters: all the other bytes.

A PDF document is structured into objects. Think of these objects as JSON (RFC 7159), but with a different syntax. They are composed of the following data types.

- Booleans, represented as true and false.
- Real numbers, such as 1.23 or -0.456 (exponential notation such as 1.23e4 is not allowed, one must use leading/trailing zeros instead).
- Strings, delimited by parentheses such as (foo) or (Hello world\n) (the backslash \ is used to escape characters). They can also be in hexadecimal, between angle brackets like <48656C6C6F20776F726C64> (equivalent to (Hello world)).
- Names, introduced by a slash / such as /bar or /Hello#20world (the number sign # is used to escape characters).
- Arrays, represented as space-separated objects surrounded by square brackets, such as [/bar true 45.6].
- Dictionaries, which contain key-value pairs surrounded by double angle brackets, where the keys are names, such as <</Foo 123 /Bar (Hello)>>.
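To make the character classes and the two escaping mechanisms described in this post more concrete, here is a minimal Python sketch. The names `char_class`, `decode_name` and `decode_hex_string` are hypothetical helpers written for this illustration (they are not taken from Caradoc or from any PDF library), and they assume well-formed input rather than covering every corner of the specification.

```python
# Character classes of PDF bytes, as listed in this post.
WHITESPACE = frozenset(b"\x00\t\n\x0c\r ")  # 0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20
DELIMITERS = frozenset(b"()<>[]{}/%")


def char_class(byte: int) -> str:
    """Return the character class of a single byte value (0-255)."""
    if byte in WHITESPACE:
        return "white-space"
    if byte in DELIMITERS:
        return "delimiter"
    return "regular"


def decode_name(token: bytes) -> bytes:
    """Decode a name token such as b'/Hello#20world' into its raw bytes.

    Assumption: the token is well-formed, i.e. it starts with '/' and
    every '#' is followed by exactly two hexadecimal digits.
    """
    if not token.startswith(b"/"):
        raise ValueError("a name must start with a slash")
    out = bytearray()
    i = 1
    while i < len(token):
        if token[i] == ord("#"):
            # '#' escapes one byte, written as two hexadecimal digits.
            out.append(int(token[i + 1:i + 3].decode("ascii"), 16))
            i += 3
        else:
            out.append(token[i])
            i += 1
    return bytes(out)


def decode_hex_string(token: bytes) -> bytes:
    """Decode a hexadecimal string token such as b'<48656C6C6F>'.

    Assumption: an even number of hexadecimal digits between the brackets.
    """
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("a hexadecimal string must be enclosed in <...>")
    return bytes.fromhex(token[1:-1].decode("ascii"))


if __name__ == "__main__":
    print([char_class(b) for b in b"/a 1"])
    print(decode_name(b"/Hello#20world"))
    print(decode_hex_string(b"<48656C6C6F20776F726C64>"))
```

A real parser would go further (literal strings with backslash escapes, nested parentheses, comments introduced by %), but this is enough to see how a tokenizer can decide where one token ends and the next begins.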
