Jim's Blog: Ramblings about novels, comics, programming, and other geek topics

14 Nov 2008

How to extract pages from a PDF document

One of the great things about being a programmer is that when you need software that you don't have, you can usually write a small utility application to do what you need without having to purchase the software.

If you own Adobe Acrobat ($299 USD) or Foxit Editor ($99.00 USD), then you can just right-click and extract pages from an existing PDF to create a new PDF document. However, if you don't want to shell out the money for either one, you can always write your own code to perform the same task.

I used iTextSharp in a previous application to export documents to PDF, so I was already familiar with the project. In that application, I was able to create PDF documents for paper sizes ranging from 8 1/2 x 11 and legal to large-format plotter page sizes such as D and E. The library worked great and produced near-perfect PDF documents of our displays.

Today, I needed to extract a couple of pages from a PDF file to create a new PDF document, and I realized that I didn't have a PDF editor... So I went out and downloaded the iTextSharp library and wrote a small program to do the work for me.

I'll just brush over the basic program setup.

  • I used a C# console application as my starting point
  • There are four command-line arguments (input file, output file, starting page, and ending page)
  • Validate that the input file exists
  • Validate that the input file is a PDF document
  • Validate that the starting and ending page numbers are valid
  • Add two using directives for iTextSharp.text and iTextSharp.text.pdf
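The validation steps in the list above might look something like the following. This is only a sketch: the original program's checks aren't shown, so the `ArgumentChecks` class, its method names, and the error messages are all made up for illustration. It assumes "is a PDF document" means the file starts with the `%PDF-` marker, and it only checks that the page numbers are positive, leaving the upper bound to the extraction method.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original program.
internal static class ArgumentChecks
{
    // Returns null when the arguments look usable, otherwise an error message.
    internal static string Validate(string[] args, out int start, out int end)
    {
        start = 0;
        end = 0;

        if (args.Length != 4)
            return "usage: <input.pdf> <output.pdf> <start page> <end page>";

        if (!File.Exists(args[0]))
            return "input file not found: " + args[0];

        if (!LooksLikePdf(args[0]))
            return "input file is not a PDF document: " + args[0];

        if (!int.TryParse(args[2], out start) || start < 1)
            return "starting page must be a whole number of 1 or more";

        if (!int.TryParse(args[3], out end) || end < 1)
            return "ending page must be a whole number of 1 or more";

        return null;
    }

    // Assumption: a PDF file normally begins with the ASCII marker "%PDF-".
    internal static bool LooksLikePdf(string path)
    {
        byte[] header = new byte[5];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(header, 0, header.Length);
            return read == header.Length
                && Encoding.ASCII.GetString(header) == "%PDF-";
        }
    }
}
```

A `Main` method could call `Validate`, print the message and exit if it isn't null, and otherwise pass the parsed values on to the extraction method.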

Okay, here's the primary method of the application. You can see the four input parameters and the code comments should provide enough information to walk you through the steps.

private static void ExtractPages(string inputFile, string outputFile,
    int start, int end)
{
    // get input document
    PdfReader inputPdf = new PdfReader(inputFile);

    // retrieve the total number of pages
    int pageCount = inputPdf.NumberOfPages;

    // an ending page that is out of range means "through the last page"
    if (end < start || end > pageCount)
    {
        end = pageCount;
    }

    // create the output document, sized to match the first extracted page
    Document inputDoc =
        new Document(inputPdf.GetPageSizeWithRotation(start));

    // create the filestream
    using (FileStream fs = new FileStream(outputFile, FileMode.Create))
    {
        // create the output writer 
        PdfWriter outputWriter = PdfWriter.GetInstance(inputDoc, fs);
        inputDoc.Open();
        PdfContentByte cb1 = outputWriter.DirectContent;

        // copy pages from input to output document
        for (int i = start; i <= end; i++)
        {
            inputDoc.SetPageSize(inputPdf.GetPageSizeWithRotation(i));
            inputDoc.NewPage();

            PdfImportedPage page =
                outputWriter.GetImportedPage(inputPdf, i);
            int rotation = inputPdf.GetPageRotation(i);

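            // this code assumes the imported page arrives unrotated, so the
            // page's rotation is applied by hand; each angle needs its own matrix
            if (rotation == 90)
            {
                cb1.AddTemplate(page, 0, -1f, 1f, 0, 0,
                    inputPdf.GetPageSizeWithRotation(i).Height);
            }
            else if (rotation == 180)
            {
                cb1.AddTemplate(page, -1f, 0, 0, -1f,
                    inputPdf.GetPageSizeWithRotation(i).Width,
                    inputPdf.GetPageSizeWithRotation(i).Height);
            }
            else if (rotation == 270)
            {
                cb1.AddTemplate(page, 0, 1f, -1f, 0,
                    inputPdf.GetPageSizeWithRotation(i).Width, 0);
            }
            else
            {
                cb1.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
            }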
        }

        inputDoc.Close();

        // release the input file
        inputPdf.Close();
    }
}
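The six numbers passed to AddTemplate in the code above are an affine matrix (a, b, c, d, e, f) that maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f). Here is a rough sketch, independent of iTextSharp, of the matrix each rotation value would call for. It assumes the imported page content arrives unrotated, as the code above does, and that rotation is clockwise, as in the PDF format. The `PageRotation` class and its method names are invented for this example; library versions may differ in how they handle rotation, so it's worth testing with rotated sample files.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp or the original program.
internal static class PageRotation
{
    // Returns { a, b, c, d, e, f } for a page whose size AFTER rotation is
    // width x height. A point (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
    internal static float[] MatrixFor(int rotation, float width, float height)
    {
        switch (((rotation % 360) + 360) % 360)
        {
            case 90:  return new float[] { 0, -1, 1, 0, 0, height };
            case 180: return new float[] { -1, 0, 0, -1, width, height };
            case 270: return new float[] { 0, 1, -1, 0, width, 0 };
            default:  return new float[] { 1, 0, 0, 1, 0, 0 };
        }
    }

    // Applies the matrix to one point; handy for checking page corners by hand.
    internal static float[] Apply(float[] m, float x, float y)
    {
        return new float[]
        {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }
}
```

With a matrix `m` from `MatrixFor`, the call would be `cb1.AddTemplate(page, m[0], m[1], m[2], m[3], m[4], m[5])`. Note that 90 and 270 degrees need different matrices: sharing one between them leaves one of the two cases upside-down.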

 

Resources

James Welch

James Welch is a software engineer in Vermont working for a large information technology company and specializing in .NET. Additionally, he holds a Master’s Degree in Software Engineering and a Bachelor of Science Degree in Computer Science. Jim also enjoys local craft beer, comic books, and science-fiction and fantasy novels, games, and movies.

Comments (18) Trackbacks (0)
  1. Hi, it's good to see this rarely available snippet on the net. But can you tell me where we can get the PdfReader class from …

  2. The PdfReader class is part of the iTextSharp library.

  3. There are links to everything at the end of the blog entry above.

  4. Hi, would this library be able to extract a PDF to images such as JPG or PNG? If so, can you show me, please? Thanks.

  5. I don’t know. There’s documentation associated with the library that you can read to figure it out.

  6. Great article, but my PDF doesn't copy exactly; some symbols and images don't copy.

    Why? Decode libraries?

    thanks

  7. Hi ,
    First of all congratulation for writing such good article.

    I have issue with this example.
    When I delete (skip) a page and generate the document, the result is not fillable, while the source document is fillable.

    Please help me on this.

    Thanks

  8. Chirag,

    When you’re deleting a page, are you saving the document before trying to extract, or are you doing it at the same time? You may just be having an off-by-one error, because if you delete page 2, then page 3 becomes page 2, etc. It’s easier to just loop through the document twice, once to delete pages and a second time to extract pages, or just skip pages and extract only what you need.

  9. Jim,

    Same issue as chirag is facing.

    Here are my steps.

    1. I am generating temp.pdf first, using the same code that you suggested in this post.

    2. I am then trying to fill this PDF in the next step.

    But the problem is that temp.pdf is no longer fillable, though the source PDF is fillable.

    I am not changing anything in the source document; I am just adding the source document into temp.pdf, so a problem like page 3 becoming page 2 is also not an issue in my case.

    Please help me out.

    Hitesh

  10. Hitesh,

    I’m not sure what could be the problem. It could be something in the PDF file or just a minor coding issue.

  11. Thanx! awesome example

  12. We have tested this code and it works nicely for most PDF files. Though we have had a few people with odd page rotation end up with an upside-down PDF. If the internal rotation (even though the PDF views right-side-up) is 90 or 270 degrees, the output PDF has the correct page removed but all pages in the output PDF are upside-down. Any ideas or suggestions? Thanks.

    • I don’t have any ideas on your problem. I only used that for a one-time project and I haven’t worked with that library in a while. You could try asking on StackOverflow; there are quite a few users there who use the library.

  13. Hi,

    I have an application requirement where I want to extract pages from a PDF and create a new PDF out of those extracted pages. Any idea how to achieve it? For example, say I want to extract pages 2, 6, and 8 and then create a new PDF that has these 3 pages in the same order; basically, I don’t want to change the order.

    Thanks,
    Mehul

    • The above code shows you exactly how to do that. You just need to change the looping process to use a list of page numbers rather than a start and end value.

  14. Great Article!!!! Thanks for such good info.

  15. Great article for beginners. I would like to know, is it possible to extract SWF files from a PDF?

    Kind Regards,
    Raghu.M
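Following up on the reply to comment 13 about using a list of page numbers instead of a start and end value: one possible way to get that list is to parse a comma-separated argument such as "2,6,8". This is just a sketch; the `PageSelection` class and its `Parse` method are hypothetical names, and it assumes the page count has already been read from the input document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original program.
internal static class PageSelection
{
    // Parses a comma-separated list such as "2,6,8", keeping the given order.
    // Entries that are not numbers or fall outside 1..pageCount are rejected.
    internal static List<int> Parse(string spec, int pageCount)
    {
        List<int> pages = new List<int>();
        foreach (string part in spec.Split(','))
        {
            int page;
            if (!int.TryParse(part.Trim(), out page) || page < 1 || page > pageCount)
                throw new ArgumentException("invalid page number: " + part.Trim());
            pages.Add(page);
        }
        return pages;
    }
}
```

In the ExtractPages method, the `for (int i = start; i <= end; i++)` loop would then become `foreach (int i in pages)`, with the loop body left as it is.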

