Indexing PDF: not so fast

Tuesday, January 6, 2015

ASP.NET Orchard PDF

In the last post, I showed how to index PDF using PdfSharp. Unfortunately, the library hasn’t been updated in years, nobody seems to have forked it, and it can’t read many recent files, which makes it unsuitable for the task. Back to square one.

Another commonly used library is iTextSharp. As I said in the last post, it has some serious licensing issues. Fortunately, it was once under the LGPL, which is a fine license, and that LGPL version is still available. That’s hardly an ideal situation, as this is still a library stuck in the past. In particular, recent versions have a very handy API that extracts the text from a PDF in one operation, and that API doesn’t exist in the old LGPL version. If you have the budget to buy a full license, it may be worth it (although the prices are not public, so who knows?).
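
For reference, here is roughly what that one-call extraction looks like in the newer iTextSharp 5.x line (the one with the licensing issues). This is only a sketch: it assumes the iTextSharp 5.x assembly is referenced, and the `PdfText` class and its `Extract` method are names I made up for illustration, not part of the library. Extraction is done page by page, so a short loop is still needed.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class PdfText {
    // Reads a PDF from a stream and returns the text of all its pages.
    public static string Extract(Stream pdfStream) {
        var reader = new PdfReader(pdfStream);
        try {
            var text = new StringBuilder();
            // Page numbers are 1-based in iTextSharp.
            for (var page = 1; page <= reader.NumberOfPages; page++) {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return text.ToString();
        }
        finally {
            reader.Close();
        }
    }
}
```

None of this is available in the old LGPL build, which is why the code further down walks the page content streams by hand.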

So what are our open-source options? Well, I was able to get reasonable results with the iTextSharp LGPL version.

Here is the code for the indexing method of the handler, modified to use an iTextSharp that has been specially compiled from the LGPL source:

OnIndexing<DocumentPart>((context, part) => {
    var mediaPart = part.As<MediaPart>();
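    // Compare the extension case-insensitively so that ".PDF" files are indexed too.
    if (mediaPart == null || !string.Equals(Path.GetExtension(mediaPart.FileName),
        ".pdf", StringComparison.OrdinalIgnoreCase)) return;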
    var document = _storageProvider.GetFile(
        Path.Combine(mediaPart.FolderPath, mediaPart.FileName));
    using (var documentStream = document.OpenRead()) {
        var pdfReader = new PdfReader(documentStream);
        var text = new StringBuilder();
        try {
            for (var page = 1; page <= pdfReader.NumberOfPages; page++) {
                var pdfPage = pdfReader.GetPageN(page);
                var content = pdfPage.Get(PdfName.CONTENTS);

                ScanPdfContents(content, pdfReader, text);
            }
        }
        catch (PdfException ex) {
            Logger.Error(ex,
                string.Format("Unable to index {0}/{1}", mediaPart.FolderPath, mediaPart.FileName));
        }
        finally {
            pdfReader.Close();
        }
        context.DocumentIndex
                   .Add("body", text.ToString()).Analyze();
    }
});

And the ScanPdfContents method:

private static void ScanPdfContents(PdfObject content, PdfReader pdfReader, StringBuilder text) {
    var ir = content as PRIndirectReference;

    if (ir != null) {
        var value = pdfReader.GetPdfObject(ir.Number);

        // The reference may resolve to nothing or to a non-stream object; skip both.
        if (value == null || !value.IsStream()) {
            return;
        }
        var stream = (PRStream) value;

        var streamBytes = PdfReader.GetStreamBytes(stream);

        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

        try {
            while (tokenizer.NextToken()) {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING) {
                    var str = tokenizer.StringValue;
                    text.Append(str);
                }
            }
        }
        finally {
            tokenizer.Close();
        }
    }

    var array = content as PdfArray;

    if (array != null) {
        for (var i = 0; i < array.Size; i++) {
            ScanPdfContents(array[i], pdfReader, text);
        }
    }
}

I hope this helps. If anyone wants to share their experience with indexing PDF, please let me know.

Note: The code above took some hints from this post:
http://stackoverflow.com/questions/10143098/how-to-extract-text-with-itextsharp-4-1-6

- Code updated to include PdfArray -

6 Comments

You have excluded IFilter as an option; however, I have used it before and it works fine. You only have to write one piece of code and then it is possible to index just about everything, like pdf, word, dwg (Autocad), mp3 etc. Also, products like SharePoint still use this technology for indexing.

Harold - Tuesday, January 6, 2015 6:55:17 PM

Well, I excluded IFilter for a reason: it requires an install on the server.

bleroy - Tuesday, January 6, 2015 8:09:55 PM

ah, sorry, I stopped reading at "antique, COM-based" :-)

Harold - Tuesday, January 6, 2015 8:25:56 PM

Don't bother with IFilter, unless you have a good reason to. And if you do find a good one, avoid the Adobe one like fire.
It will only make you pull your hair out. It is slow, not thread-safe, etc.

Piotr Szmyd - Wednesday, January 7, 2015 3:44:07 AM

It's a shame Lucene.Net doesn't support indexing rich content out of the box - both ElasticSearch and Solr can index PDF, office docs, RTF with relative ease.

It would be great if we could improve the indexing API to support attachments, so that future indexing modules can use the underlying index's built-in support, and add some extension points to the Lucene module, so that those who wish to continue using Lucene can develop support for those document types as and when they are needed.

Matt Melling - Wednesday, January 7, 2015 8:58:41 AM

@Matt: I'm not entirely sure I'm following, but that sounds like a suggestion you might want to post on the forums?

bleroy - Wednesday, January 7, 2015 11:16:22 PM
