The PDF file format is, at its simplest, nothing more than a text file. You could create a PDF file with nothing but a text editor such as Notepad, just as you could write a VB.Net application with nothing but a text editor. Does anybody do it? I don't think so, but it is possible. The better way is to create the file with a tool, and there are many tools available. The program I wrote is for learning only; you are welcome to improve it and use it free of charge. Great, you say, we can create the file in Notepad because it is only a text file. So how do we do it? Before we start looking at the code we need a bird's-eye view of the file structure. Every PDF file has four basic parts that are required.
1. The File Header - simply identifies the version of the PDF specification to which the file conforms.
2. The File Body - consists of a sequence of indirect objects representing the contents of the document.
3. The Cross-Reference Table - contains information that permits random access to indirect objects within the file, so that the entire file need not be read to locate any particular object.
4. The File Trailer - enables an application reading the file to quickly find the cross-reference table and certain special objects. Applications should read a PDF from its end first. Bottoms up.
So there you have it: the four major parts of a PDF file. The concept is simple to grasp. There is more to the file than this, but we need to build on the basics first. The file header is the simple part; it can be hard-coded to whatever version you choose to create. I will go into a little more detail later.
The file body is where the work gets done. It is made up of objects, blocks of content somewhat like the Subs and Functions in VB.Net. The rule is that every object must have a unique number, but the objects do not have to appear in order. I did not like this, because when I was trying to figure out the structure of the file using the example code mentioned above I kept jumping around from one object to the next. I thought I was back in the days of spaghetti code. To make matters worse, PDF readers need to know where every object is located, based on its byte offset in the file. So for the learning process I have put some order into the file structure, and this makes it much easier to understand. Once you know what you are doing, feel free to make a mess of your code; just don't expect anyone else to understand it.
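To make the byte-offset bookkeeping concrete, here is a minimal sketch. It is written in Python rather than VB.Net and is not taken from the program; the helper name `append_object` and the two sample objects are assumptions made up for illustration.

```python
def append_object(buffer: bytearray, offsets: dict, number: int, body: str) -> None:
    """Append one indirect object and remember the byte offset where it starts."""
    offsets[number] = len(buffer)  # the offset is counted in bytes, not characters
    buffer += f"{number} 0 obj\n{body}\nendobj\n".encode("latin-1")


buffer = bytearray(b"%PDF-1.4\n")
offsets = {}

# Object numbers must be unique, but they may appear in any order in the body.
append_object(buffer, offsets, 3, "<< /Type /Pages /Kids [] /Count 0 >>")
append_object(buffer, offsets, 1, "<< /Type /Catalog /Pages 3 0 R >>")

for number in sorted(offsets):
    print(number, offsets[number])
```

Recording the offset just before each object is appended is what later lets the cross-reference table be written without a second pass over the file.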
The cross-reference table is the only thing that has to be in order. Thank God, some order finally! The table holds the offset pointers to the objects in the file body. Every file must have at least one table. When a file is updated, another table is added that points to the new offsets. This tutorial does not go into updating a PDF file, only creating a new one.
The file trailer is the first thing Adobe Reader reads. It tells the reader where the cross-reference table starts. It also gives the number of entries in the cross-reference table (the Size entry, which is one more than the highest object number) and the object number of the information object. The information object is not required, but it is recommended as a way of recording who created the file and what is in it. The file trailer also points to the root object, which must be included for the file to work.
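Reading from the end can be sketched as follows. This is a Python illustration under simple assumptions, not how Adobe Reader is implemented: the function name `find_startxref` is made up, the 1024-byte window is an arbitrary choice, and the sample bytes are a fragment rather than a complete PDF.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset of the cross-reference table named in the trailer."""
    tail = data[-1024:]  # the trailer sits at the very end of the file
    marker = tail.rfind(b"startxref")
    if marker == -1:
        raise ValueError("no startxref keyword found near the end of the file")
    # The offset is the first whitespace-separated token after the keyword.
    return int(tail[marker + len(b"startxref"):].split()[0])


# A fragment for illustration only: the objects are left out.
body = b"%PDF-1.4\n% (objects omitted in this fragment)\n"
xref_at = len(body)
sample = (body
          + b"xref\n0 1\n0000000000 65535 f \n"
          + b"trailer\n<< /Size 1 >>\n"
          + b"startxref\n" + str(xref_at).encode("ascii") + b"\n%%EOF\n")

print(find_startxref(sample) == xref_at)  # True
```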
Now that we have some understanding of the file structure, I will try to explain the code I have created, along with some more detail. I decided to build the file with the StringBuilder class because of its speed and its ability to append to a string without rewriting the complete string in memory. This approach is better for learning because you can keep track of the file as it is built. I also decided to keep all the PDF objects in collections so I could keep track of which objects are needed and write out only those objects. Next I will give more detail on the program flow.
The program is set up to build the file on the fly. The properties for the information object are set by the caller and are read when the writepdf method is invoked. This is straightforward and nothing special. The way I chose to build the objects was to have the public method calls create the PDF objects and store them in a collection. Once the end user is done describing the file by calling the methods and setting the properties, they only have to call the writepdf method. This in turn builds the file in a structured manner that is easy to debug. It loads only the resources needed for the file by looping through the collection of PDF objects.
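The collect-first, write-later design described above can be sketched like this. It is a Python stand-in, not the VB.Net class: `TinyPdfWriter`, `add_text_page`, `fonts_needed`, and `write_order` are hypothetical names, and the sketch only reports the order of the parts instead of writing any bytes.

```python
class TinyPdfWriter:
    """Hypothetical writer; the class and method names are illustrative only."""

    def __init__(self):
        self.title = None   # information-object properties, both optional
        self.author = None
        self._pages = []    # collection filled by the public methods

    def add_text_page(self, text, font="Helvetica"):
        # Public call: record the request, write nothing yet.
        self._pages.append({"text": text, "font": font})

    def fonts_needed(self):
        # Walk the collection so only fonts that a page uses get written out.
        return sorted({page["font"] for page in self._pages})

    def write_order(self):
        # The fixed flow in which a write method would emit the parts of the file.
        order = ["header"]
        if self.title or self.author:
            order.append("info")
        order += [f"font:{name}" for name in self.fonts_needed()]
        order += ["root", "outlines", "page tree", "resources"]
        for index in range(1, len(self._pages) + 1):
            order += [f"page {index}", f"content {index}"]
        order += ["xref", "trailer"]
        return order


writer = TinyPdfWriter()
writer.add_text_page("Hello")
writer.add_text_page("World")
print(writer.write_order())
```

Because nothing is written until the end, the writer knows every object it needs before the first byte goes out, which is what keeps the output in a predictable order.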
The region labeled “File Flow Function” shows the order in which the file is created. The file header is simple; the only thing that needs explaining is the comment line, which should contain at least four binary characters (bytes with values of 128 or higher). These tell programs that handle the file that it contains binary data, not just text. From my understanding, this matters mainly when files are transferred, for example when they are served from web sites.
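As a small illustration of the header and its binary comment line, here is a Python sketch. The function name `pdf_header` is an assumption, and the four byte values are just one commonly seen choice; any four bytes of 128 or higher would serve.

```python
def pdf_header(version: str = "1.4") -> bytes:
    """Build the header line plus the binary-marker comment line."""
    first_line = f"%PDF-{version}\n".encode("ascii")
    # A comment (%) followed by four bytes with values of 128 or more.
    binary_comment = b"%" + bytes([0xE2, 0xE3, 0xCF, 0xD3]) + b"\n"
    return first_line + binary_comment


header = pdf_header()
print(len(header))  # 15
```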
The information object is just information about the file, such as who created it and what it is about. I implemented this with properties, which the end user can set or leave unset. The next area in the file is the resources. What? You say that is not following the File Flow structure set up in the program. Well, yes and no! This area of the file is where I store all the resources the file needs (standard fonts, images, and TrueType fonts). In PDF terms the images are XObjects and the fonts are font objects. This area is different from the Resources object further down the program flow, but the two are related because the Resources object points to these objects.
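The optional properties of the information object can be sketched as below. This is Python, not the program's VB.Net properties; `pdf_string` and `info_dictionary` are hypothetical helpers, and only three of the standard keys are shown.

```python
def pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def info_dictionary(title=None, author=None, creator=None) -> str:
    """Include only the properties the caller actually set."""
    entries = []
    for key, value in (("Title", title), ("Author", author), ("Creator", creator)):
        if value is not None:
            entries.append(f"/{key} {pdf_string(value)}")
    return "<< " + " ".join(entries) + " >>"


print(info_dictionary(title="Hello World", author="Jane Doe (example)"))
```

Escaping matters because an unbalanced parenthesis in a title would otherwise end the string early.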
The root object, also known as the document catalog, is next. This is the starting point the file trailer gives Adobe Reader when it first opens the file. The root object points the reader to the top of the page tree and sets flags for how the pages are displayed in the reader.
The outline object sits at the same level as the page tree and describes the document's outline (the bookmarks a reader shows for navigation). I have not implemented this yet, but I still create the object with no children.
The page tree object is inheritable, meaning that the page trees or pages under it can inherit its properties. This concept should be easy to understand if you have been programming in .NET for any amount of time. This object tells Adobe Reader how many nodes (Kids) are under its tree. You can think of this as a family tree starting with Adam and Eve, where everyone else is a child of Adam and Eve or a child of their children. The page tree object must have a pointer to every page tree or page directly under it.
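A page tree node with its Kids, Count, and an inheritable property might be built like this. The sketch is Python and the function `page_tree` is hypothetical; it assumes every kid is a page, so Count is simply the number of kids. With nested page trees, Count would have to total all the pages beneath the node.

```python
def page_tree(kid_numbers, media_box=None) -> str:
    """Build a /Pages node whose /Kids are indirect references to its children."""
    kids = " ".join(f"{number} 0 R" for number in kid_numbers)
    entries = ["/Type /Pages", f"/Kids [{kids}]", f"/Count {len(kid_numbers)}"]
    if media_box is not None:
        # A /MediaBox set here is inherited by every page that does not set its own.
        box = " ".join(str(value) for value in media_box)
        entries.append(f"/MediaBox [{box}]")
    return "<< " + " ".join(entries) + " >>"


print(page_tree([4, 6], media_box=(0, 0, 612, 792)))
# << /Type /Pages /Kids [4 0 R 6 0 R] /Count 2 /MediaBox [0 0 612 792] >>
```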
The Resources object is the pointer to the fonts and images that were added in the resources area of the file. I know this is kind of tricky to get, but bear with me. Think of it as the VB ByRef keyword: it is not the real data, just a pointer to the data.
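The pointer idea shows up clearly when the Resources dictionary is written out: it holds names and object references, never the font or image data. The Python helper below is a sketch; `resources_dictionary` and the names F1 and Im1 are assumptions chosen for illustration.

```python
def resources_dictionary(fonts: dict, images: dict) -> str:
    """Map the short names used in content streams to indirect object references."""
    parts = []
    if fonts:
        refs = " ".join(f"/{name} {number} 0 R" for name, number in fonts.items())
        parts.append(f"/Font << {refs} >>")
    if images:
        refs = " ".join(f"/{name} {number} 0 R" for name, number in images.items())
        parts.append(f"/XObject << {refs} >>")
    return "<< " + " ".join(parts) + " >>"


print(resources_dictionary({"F1": 5}, {"Im1": 7}))
# << /Font << /F1 5 0 R >> /XObject << /Im1 7 0 R >> >>
```

Here `5 0 R` means "object number 5, generation 0", so the font itself lives elsewhere in the file body.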
The page object can inherit its definition from the page tree or describe its own. That is, it can say whether the page is letter size or postcard size, and whether it is displayed vertically (portrait) or horizontally (landscape). It has to point to both its parent and its content object.
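A page object with its two required pointers could be sketched as follows. This is Python with a hypothetical `page_object` helper; swapping width and height is only one way to get a landscape page, chosen here for simplicity.

```python
LETTER = (612, 792)  # 8.5 x 11 inches at 72 points per inch


def page_object(parent, contents, size=None, landscape=False) -> str:
    """Build a /Page dictionary that points to its parent and its content object."""
    entries = ["/Type /Page", f"/Parent {parent} 0 R"]
    if size is not None:
        # Leaving /MediaBox out lets the page inherit the size from the page tree.
        width, height = (size[1], size[0]) if landscape else size
        entries.append(f"/MediaBox [0 0 {width} {height}]")
    entries.append(f"/Contents {contents} 0 R")
    return "<< " + " ".join(entries) + " >>"


print(page_object(2, 5, LETTER, landscape=True))
# << /Type /Page /Parent 2 0 R /MediaBox [0 0 792 612] /Contents 5 0 R >>
```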
The content object is where the work gets done. It uses a stream of data to tell Adobe Reader how to paint the page for viewing or printing. This is like reading a file stream in VB.Net: Adobe Reader parses the stream looking for operators that tell it how to display the text and images. These operators are documented in the PDF Reference. View a few files created by the program to see how they work, then look them up in the PDF Reference. These streams can be very large for a complex page with a lot of lines and text, so it is common to record the size of the stream in its own object, called the stream length object. To my understanding, this object tells Adobe Reader how many bytes of data the stream holds, so it knows how much to read.
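Here is a minimal Python sketch of a content stream that paints one line of text. The helper names are made up, the font name /F1 is assumed to be defined in the page's resources, and the length is written directly into the stream dictionary rather than in a separate stream length object as the program does.

```python
def text_stream(text: str, x: int = 72, y: int = 720, size: int = 24) -> bytes:
    """Content stream that paints one line of text with the font named /F1."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # BT/ET bracket a text block, Tf picks font and size, Td moves, Tj shows text.
    return f"BT /F1 {size} Tf {x} {y} Td ({escaped}) Tj ET".encode("latin-1")


def stream_object(data: bytes) -> bytes:
    """Wrap stream data in a dictionary whose /Length is the data's byte count."""
    head = f"<< /Length {len(data)} >>\nstream\n".encode("ascii")
    return head + data + b"\nendstream"


data = text_stream("Hello World")
print(stream_object(data).decode("latin-1"))
```

The coordinates are in points measured from the bottom-left corner of the page, so 72 and 720 place the text one inch in from the left and one inch down from the top of a letter-size page.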
The cross-reference table is a very simple concept. It has a defined structure that is well documented in the code and in the PDF Reference. Basically, it holds the byte offset of each PDF object in the file. Each object must have a line in the cross-reference table that points to the start of the object. So if you have 10 objects you must have 10 entries pointing to the objects' byte offsets, in addition to the entry for object 0, which is always free. Each line ends in either n or f, meaning in use or free.
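The fixed layout of the table is easiest to see in code. The sketch below is Python with a hypothetical `xref_table` function; it assumes the objects are numbered 1 to n with no gaps and all have generation 0, which is the case for a freshly created file.

```python
def xref_table(offsets: dict) -> bytes:
    """Format a cross-reference table for objects numbered 1..n with no gaps."""
    count = len(offsets) + 1  # +1 for the reserved object 0
    lines = [f"xref\n0 {count}\n", "0000000000 65535 f \n"]
    for number in range(1, count):
        # 10-digit offset, 5-digit generation, n = in use; 20 bytes per entry.
        lines.append(f"{offsets[number]:010d} 00000 n \n")
    return "".join(lines).encode("ascii")


print(xref_table({1: 15, 2: 64}).decode("ascii"))
```

Because every entry is exactly 20 bytes long, a reader can jump straight to the entry for any object number without scanning the table.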
The file trailer is just a pointer to the root object and the cross-reference table. It also gives the object count for the PDF file (the Size entry).
Now that you have a good understanding of the structure of a PDF file, the code will start to make sense when you step through it. The best way I learn anything is hands-on, so with that said, let's get started. Start the program and click the Hello World button. This will create the Hello World PDF file on your desktop. Open it first with Adobe Reader to make sure it works. Then close it and open it with Notepad by right-clicking the file and selecting Open With, then Notepad. This is the simplest file you can create using the built-in fonts. If you read the text file from top to bottom you will see the objects in the order of the program flow. I have commented the file to aid the learning process. You will need to become familiar with the file structure in order to program in PDF. Once you understand the structure, you will be able to look at the file to find bugs when you develop your own.
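For comparison with the file the program produces, here is a self-contained Python sketch that assembles a one-page Hello World PDF from the four parts. It is not the program's output and leaves out the information, outline, and stream length objects; the function name `hello_world_pdf` and the object numbering are assumptions, and the text is assumed to contain no parentheses or backslashes.

```python
def hello_world_pdf(text: str = "Hello World") -> bytes:
    """Assemble header, body, cross-reference table, and trailer for one page."""
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET"
    bodies = [
        "<< /Type /Catalog /Pages 2 0 R >>",                          # 1: root
        "<< /Type /Pages /Kids [4 0 R] /Count 1 >>",                  # 2: page tree
        "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",     # 3: built-in font
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        "/Resources << /Font << /F1 3 0 R >> >> /Contents 5 0 R >>",  # 4: page
        f"<< /Length {len(stream)} >>\nstream\n{stream}\nendstream",  # 5: content
    ]
    out = bytearray(b"%PDF-1.4\n")  # file header
    offsets = []
    for number, body in enumerate(bodies, start=1):  # file body
        offsets.append(len(out))
        out += f"{number} 0 obj\n{body}\nendobj\n".encode("latin-1")
    xref_at = len(out)  # cross-reference table starts here
    out += f"xref\n0 {len(bodies) + 1}\n0000000000 65535 f \n".encode("ascii")
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {len(bodies) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_at}\n%%EOF\n").encode("ascii")  # file trailer
    return bytes(out)


pdf = hello_world_pdf()
print(len(pdf), "bytes")
```

Writing the returned bytes to a file opened in binary mode should give a file that can be opened in a reader and in Notepad, where the pointers from one object to the next can be followed by hand.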
Take time to run all the examples and step through the files until you fully understand the process; I am sure you will learn a lot. Look at each file in Notepad to see how the program created it, and notice the pointers from one object to the next. I hope this program helps you figure out PDF files and saves you time. If it does, then I did what I set out to do by writing this tutorial.
I have tried to document the code, but if you find something you don't understand, just send me an email at CharlesCope @ msn.com and I will try to explain it.
- In Visual Studio, create a new Windows Forms application
- Create a new class (I named it clsPdfWriter)
- Paste the text you copied from the above link (view PDF class in text) into the new class you just created
- Add a new button to your form and name it bntHelloWorld
- Add the following code inside bntHelloWorld's click event (double-click the button):