Indexing PDF: once again with a big red nose
A commenter pointed me to an oddly-named library that I didn’t know about: PdfClown. This is a library that is built by the same author for both Java and .NET, and the .NET version actually looks pretty nice, with not too many Java-isms beyond the namespaces. The license is a nice LGPL 3, the author, Stefano Chizzolini, seems to be available for advice and consulting, and there are quite a few blog posts, along with quality documentation and samples. Sounds like a dream, doesn’t it?
Here is the code for my Orchard indexing handler, modified to use PdfClown:
using System;
using System.IO;
using System.Linq;
using Orchard.ContentManagement;
using Orchard.ContentManagement.Handlers;
using Orchard.FileSystems.Media;
using Orchard.Logging;
using Orchard.MediaLibrary.Models;
using org.pdfclown.tools;
using PdfFile = org.pdfclown.files.File;
using PdfStream = org.pdfclown.bytes.Stream;

namespace Decent.DocumentIndexing.Handlers {
    public class PdfIndexingHandler : ContentHandler {
        private readonly IStorageProvider _storageProvider;

        public PdfIndexingHandler(IStorageProvider storageProvider) {
            _storageProvider = storageProvider;
            Logger = NullLogger.Instance;

            OnIndexing<DocumentPart>((context, part) => {
                var mediaPart = part.As<MediaPart>();
                // Only handle media items that are PDF files (".PDF" counts too).
                if (mediaPart == null
                    || !string.Equals(Path.GetExtension(mediaPart.FileName),
                        ".pdf", StringComparison.OrdinalIgnoreCase)) return;
                var document = _storageProvider
                    .GetFile(Path.Combine(mediaPart.FolderPath, mediaPart.FileName));
                using (var documentStream = document.OpenRead()) {
                    try {
                        var pdfStream = new PdfStream(documentStream);
                        var pdfFile = new PdfFile(pdfStream);
                        var pdfDocument = pdfFile.Document;
                        var textExtractor = new TextExtractor();
                        // Flatten the text strings of all pages into one sequence.
                        var strings = pdfDocument.Pages
                            .SelectMany(page => textExtractor
                                .Extract(page).Values
                                .SelectMany(stringCollection => stringCollection
                                    .Select(textString => textString.Text)));
                        var text = string.Join(" ", strings);
                        context.DocumentIndex.Add("body", text).Analyze();
                    }
                    catch (Exception ex) {
                        // A PDF that can't be read is logged, then skipped.
                        Logger.Error(ex, string.Format(
                            "Unable to index {0}/{1}",
                            mediaPart.FolderPath, mediaPart.FileName));
                    }
                }
            });
        }
    }
}
What I really like is that there is no need to walk the PDF tree myself, as the library provides that logic under a ready-to-use API. The Java version is even easier to use, but the .NET one is not too bad really. So what’s the catch?
The catch is that PdfClown can’t handle all files: it doesn’t handle encrypted files, and I’ve seen it fail on the odd PDF for no particular reason that I could determine. It’s almost perfect, but even a small percentage of failure may be unacceptable.
PDF is a horribly complicated format, and building a library for it is a lot of work. There doesn’t seem to be a perfect solution that works all the time. So far, the one that came closest was iTextSharp, but because of its change of license, it has become a more dangerous option. Then, we had PdfSharp and PdfClown, which were both very nice, but failed on some documents. There is a possibility that PdfSharp will have a new and improved version, but who knows when that will be available?
In the end, the best solution may be to include multiple libraries, and to fall back to a different one in case of failure.
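Here is a minimal sketch of what such a fallback could look like. FallbackTextExtractor is a hypothetical helper of my own making, not something that ships with Orchard or with any of the PDF libraries. It assumes that each library's extraction code can be wrapped in a delegate that takes a stream and returns the text, and that the document is small enough to buffer in memory, so that every attempt gets a fresh copy to read from.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: not part of Orchard or of any PDF library.
public static class FallbackTextExtractor {
    // Tries each extractor in order and returns the text produced by the
    // first one that doesn't throw. Returns null if all of them fail.
    // Every failure is added to 'errors' so the caller can log it.
    public static string Extract(
        Stream pdf,
        IEnumerable<Func<Stream, string>> extractors,
        ICollection<Exception> errors) {

        // Buffer the document once: the source stream may not be seekable,
        // and a library may close the stream it was handed.
        byte[] bytes;
        using (var buffer = new MemoryStream()) {
            pdf.CopyTo(buffer);
            bytes = buffer.ToArray();
        }

        foreach (var extract in extractors) {
            try {
                // Each attempt reads from its own read-only copy.
                using (var copy = new MemoryStream(bytes, false)) {
                    return extract(copy);
                }
            }
            catch (Exception ex) {
                // Remember the failure and move on to the next library.
                errors.Add(ex);
            }
        }
        return null;
    }
}
```

In the handler, the body of the try block above would become the first delegate in the list, a second library's extraction code would be the next one, and the errors would only get logged once Extract returns null.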
Another consideration is performance: if you have a large number of heavy PDFs, you may run into concurrency issues with Orchard’s background process. One way out of this is to isolate the indexing into a completely separate service, but that’s a story for another post, on Piotr’s blog, or so I’ve heard…