Indexing PDF in Orchard (and elsewhere in .NET)
Indexing custom contents in Orchard is really easy: write a new handler derived from ContentHandler, then write an event handler for OnIndexing:
public class PdfIndexingHandler : ContentHandler {
    public PdfIndexingHandler(IStorageProvider storageProvider) {
        OnIndexing<DocumentPart>((context, part) => {
            context.DocumentIndex
                .Add("body", thePdfText).Analyze();
        });
    }
}
Orchard will then hand the text over to Lucene, which will index it. Orchard already handles PDF documents stored in its media gallery, so we should be good to go if we can somehow extract the text from a PDF file. Unfortunately, that’s a rather big if, and the main difficulty.
There are a few libraries available in .NET to handle PDFs. They are usually built mainly to create new PDF files, but most can also read them. Text in PDF files is a scattered set of fragments of text in a complex tree structure (not a big concern for indexing), when it’s not a dirty set of scanned images that would need to be OCR’ed in order to be read. I’ll ignore the OCR case for this post. Libraries give you access to the document’s tree, but usually don’t hand you a text property directly, so we’ll have to build this ourselves.
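To make that concrete, here's roughly what the decompressed content stream of a page showing "Hello World" can look like (a hand-written sketch for illustration, not the output of any specific tool):

BT                       % begin a text object
/F1 12 Tf                % select font F1 at 12 points
72 712 Td                % position the text cursor
(Hel) Tj                 % show a text fragment
[(lo ) -250 (World)] TJ  % show fragments with kerning adjustments
ET                       % end the text object

Even a single word can be split across several Tj/TJ operators, which is why extracting text means walking the whole tree and concatenating string operands.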
Here’s a list of some of the libraries available, with the challenges they present:
- IFilter is the venerable, antique, COM-based way of indexing documents on Windows. It requires an install on the server. I need something that is xcopy-deployable.
- iTextSharp is the .NET version of a quite commonly used Java library, but it has an exotic GPL-like license designed to push you to buy a commercial license for an undisclosed amount of money.
- SquarePdf.Net is another adaptation of a Java library, but it uses IKVM to emulate a Java virtual machine, where it runs the original Java library. This is clearly insane.
- Aspose is pure .NET, but is quite expensive.
- PdfSharp is under MIT, is pure .NET, but hasn’t been updated in a very long while. It also suffers from some nasty bugs.
As you can see, there’s no great solution. I picked PdfSharp because it’s real open source, and real .NET, despite the bugs and lack of updates. One bug in particular was an infinite loop triggered by some documents generated from Word. Fortunately, I was able to find a fix for that on the PdfSharp forums, and recompile the latest source code with it.
The following code (adapted from this forum post) walks the tree and adds the strings it finds to a StringBuilder:
private static void ExtractText(CObject cObject, StringBuilder builder) {
    if (cObject is COperator) {
        // Only the text-showing operators (Tj and TJ) carry strings we want.
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name != OpCodeName.Tj.ToString()
            && cOperator.OpCode.Name != OpCodeName.TJ.ToString()) return;
        foreach (var cOperand in cOperator.Operands) {
            ExtractText(cOperand, builder);
        }
    }
    else if (cObject is CSequence) {
        // A sequence is just a list of objects: recurse into each of them.
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence) {
            ExtractText(element, builder);
        }
    }
    else if (cObject is CString) {
        // A string leaf: an actual fragment of the page's text.
        var cString = cObject as CString;
        builder.Append(cString.Value);
    }
}
The rest of the work is just getting to the document’s stream from the part, handing it over to PdfSharp, and scanning each page:
OnIndexing<DocumentPart>((context, part) => {
    var mediaPart = part.As<MediaPart>();
    // Only handle PDF files from the media gallery (the extension check is
    // case-insensitive so ".PDF" works too).
    if (mediaPart == null || !String.Equals(
        Path.GetExtension(mediaPart.FileName), ".pdf",
        StringComparison.OrdinalIgnoreCase)) return;
    var document = _storageProvider.GetFile(
        Path.Combine(mediaPart.FolderPath, mediaPart.FileName));
    using (var documentStream = document.OpenRead()) {
        var pdfDocument = PdfReader.Open(documentStream, PdfDocumentOpenMode.ReadOnly);
        var text = new StringBuilder();
        foreach (var page in pdfDocument.Pages.OfType<PdfPage>()) {
            var pageContent = ContentReader.ReadContent(page);
            ExtractText(pageContent, text);
            text.AppendLine();
        }
        context.DocumentIndex
            .Add("body", text.ToString()).Analyze();
    }
});
Special thanks to Piotr Szmyd for sharing some of his research on this with me.
UPDATE: I found out, by trying more PDF files, that lots of recent files won’t get read by PdfSharp. I’ll post tomorrow with a new update, and link from here.