Indexing PDF: once again with a big red nose

Wednesday, January 7, 2015

ASP.NET Orchard PDF

A commenter pointed me to an oddly-named library that I didn’t know about: PdfClown. It is built by the same author for both Java and .NET, and the .NET version actually looks pretty nice, with not too many Java-isms beyond the namespaces. The license is a nice LGPL 3, the author, Stefano Chizzolini, seems to be available for advice and consulting, and there are quite a lot of blog posts, as well as quality documentation and samples. Sounds like a dream, doesn’t it?

Here is the code for my Orchard indexing handler, modified to use PdfClown:

using System;
using System.IO;
using System.Linq;
using Orchard.ContentManagement;
using Orchard.ContentManagement.Handlers;
using Orchard.FileSystems.Media;
using Orchard.Logging;
using Orchard.MediaLibrary.Models;
using org.pdfclown.tools;
using PdfFile = org.pdfclown.files.File;
using PdfStream = org.pdfclown.bytes.Stream;

namespace Decent.DocumentIndexing.Handlers {
    public class PdfIndexingHandler : ContentHandler {
        private readonly IStorageProvider _storageProvider;

        public PdfIndexingHandler(IStorageProvider storageProvider) {
            _storageProvider = storageProvider;
            Logger = NullLogger.Instance;


            OnIndexing<DocumentPart>((context, part) => {
                var mediaPart = part.As<MediaPart>();
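                // Compare the extension case-insensitively so ".PDF" files are indexed too.
                if (mediaPart == null
                    || !string.Equals(Path.GetExtension(mediaPart.FileName),
                        ".pdf", StringComparison.OrdinalIgnoreCase)) return;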
                var document = _storageProvider
                    .GetFile(Path.Combine(mediaPart.FolderPath, mediaPart.FileName));
                using (var documentStream = document.OpenRead()) {
                    try {
                        var pdfStream = new PdfStream(documentStream);
                        var pdfFile = new PdfFile(pdfStream);
                        var pdfDocument = pdfFile.Document;
                        var textExtractor = new TextExtractor();
                        var strings = pdfDocument.Pages
                            .SelectMany(page => textExtractor
                                .Extract(page).Values
                                .SelectMany(stringCollection => stringCollection
                                    .Select(textString => textString.Text)));
                        var text = string.Join(" ", strings);
                        context.DocumentIndex.Add("body", text).Analyze();
                    }
                    catch (Exception ex) {
                        Logger.Error(ex, string.Format(
                            "Unable to index {0}/{1}",
                            mediaPart.FolderPath, mediaPart.FileName));
                    }
                }
            });
        }
    }
}

What I really like is that there is no need to walk the PDF tree myself, as the library provides that logic under a ready-to-use API. The Java version is even easier, but that’s not too bad really. So what’s the catch?

The catch is that PdfClown can’t handle all files: it doesn’t handle encrypted files, and I’ve seen it fail on the odd PDF for no particular reason that I could determine. It’s almost perfect, but even a small percentage of failure may be unacceptable.

PDF is a horribly complicated format, and building a library for it is a lot of work. There doesn’t seem to be a perfect solution that works all the time. So far, the one that came closest was iTextSharp, but because of its change of license to AGPL, it has become a riskier option. Then, we had PdfSharp and PdfClown, which were both very nice, but failed on some documents. There is a possibility that PdfSharp will have a new and improved version, but who knows when that will be available?

In the end, the best solution may be to include multiple libraries, and to fall back to a different one in case of failure.
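
Here is a minimal sketch of what that fallback could look like. It is not tied to any particular library: each extractor is just a delegate that takes a stream and returns text, and the names FallbackTextExtraction, ExtractWithFallback, openDocument and onFailure are all made up for the illustration. It also assumes that failure shows up as an exception, which is what I have seen so far but is not guaranteed for every library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of Orchard or of any PDF library.
public static class FallbackTextExtraction {
    // Tries each extractor in order and returns the text produced by the
    // first one that does not throw. Returns null if all of them fail.
    public static string ExtractWithFallback(
        Func<Stream> openDocument,
        IEnumerable<Func<Stream, string>> extractors,
        Action<Exception> onFailure = null) {

        foreach (var extract in extractors) {
            try {
                // Open a fresh stream for each attempt: a failed extractor
                // may have left the previous one in an unknown position.
                using (var stream = openDocument()) {
                    return extract(stream);
                }
            }
            catch (Exception ex) {
                // Report the failure (for logging), then try the next one.
                if (onFailure != null) onFailure(ex);
            }
        }
        return null;
    }
}
```

In the handler above, the first delegate would wrap the PdfClown extraction, the second one the PdfSharp code from the previous post, and onFailure would call Logger.Error. Only when the helper returns null would the document be left out of the index.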

Another consideration is performance: if you have a large number of heavy PDFs, you may run into concurrency issues with Orchard’s background process. One way out of this can be to isolate the indexing into a completely separate service, but that’s a story for another post, on Piotr’s blog, or so I’ve heard…

6 Comments

Wouldn't it be a viable option, at least for now, to use both PdfClown and PdfSharp, and to retry the files that fail with the first one using the other? That is, if you can easily detect failure (like with an exception being thrown). Not nice, but for this very specific use-case it may be good enough.

Zoltán Lehóczky - Wednesday, January 7, 2015 5:46:09 PM

:) Zoltan... That's exactly what I was suggesting when I wrote "In the end, the best solution may be to include multiple libraries, and to fall back to a different one in case of failure."

bleroy - Wednesday, January 7, 2015 5:48:51 PM

Have you considered Tika on DotNet (https://kevm.github.io/tikaondotnet/)? It works like a charm.

var extractor = new TextExtractor();
string content = extractor.Extract(@"c:\doc\pdfs\my.pdf");

Christopher Boumenot - Wednesday, January 7, 2015 7:16:16 PM

Thanks for the pointer, but this runs on IKVM.

bleroy - Wednesday, January 7, 2015 7:52:47 PM

Have you tested with commercial libraries?
You can try EvoPdf, it has a 30-day trial for testing. If it succeeds, we know there is a paying alternative.

Bruno - Wednesday, January 7, 2015 8:38:30 PM

@Bruno: I have not, because I want a solution that works everywhere, like Orchard does: free or commercial uses should both be permitted. If you want to try and write about that, however, I'll gladly point to it.

bleroy - Wednesday, January 7, 2015 11:13:25 PM
