Indexing PDF: not so fast
In the last post, I showed how to index PDFs using PdfSharp. Unfortunately, the library hasn’t been updated in years, nobody seems to have forked it, and it can’t read many recent files. That makes it unsuitable for the task. Back to square one.
Another commonly used library is iTextSharp. As I said in the last post, it has some serious licensing issues. Fortunately, once upon a time, it was under LGPL, which is a fine license, and that LGPL version is still available. That’s hardly an ideal situation, as this is still a library stuck in the past. In particular, recent versions have a very handy API that extracts the text from a PDF in one operation, and that API doesn’t exist in the old LGPL version. If you have the budget to buy a full license, it may be worth it (although the prices are not public, so who knows?).
So what are our open-source options? Well, I was able to get reasonable results with the iTextSharp LGPL version.
Here is the code for the indexing method of the handler, modified to use an iTextSharp that has been specially compiled from the LGPL source:
OnIndexing<DocumentPart>((context, part) => {
    var mediaPart = part.As<MediaPart>();
    // Only handle PDF media; compare the extension case-insensitively so "FILE.PDF" is indexed too.
    if (mediaPart == null ||
        !string.Equals(Path.GetExtension(mediaPart.FileName), ".pdf", StringComparison.OrdinalIgnoreCase)) return;
    var document = _storageProvider.GetFile(
        Path.Combine(mediaPart.FolderPath, mediaPart.FileName));
    using (var documentStream = document.OpenRead()) {
        var pdfReader = new PdfReader(documentStream);
        var text = new StringBuilder();
        try {
            // PDF page numbers are 1-based.
            for (var page = 1; page <= pdfReader.NumberOfPages; page++) {
                var pdfPage = pdfReader.GetPageN(page);
                var content = pdfPage.Get(PdfName.CONTENTS);
                ScanPdfContents(content, pdfReader, text);
            }
        }
        catch (PdfException ex) {
            Logger.Error(ex,
                string.Format("Unable to index {0}/{1}", mediaPart.FolderPath, mediaPart.FileName));
        }
        finally {
            pdfReader.Close();
        }
        // Index whatever text was collected, even if extraction stopped early.
        context.DocumentIndex
            .Add("body", text.ToString()).Analyze();
    }
});
And the ScanPdfContents method:
private static void ScanPdfContents(PdfObject content, PdfReader pdfReader, StringBuilder text) {
    // Case 1: the contents entry is an indirect reference to a single stream.
    var ir = content as PRIndirectReference;
    if (ir != null) {
        var value = pdfReader.GetPdfObject(ir.Number);
        if (value == null || !value.IsStream()) {
            return;
        }
        var stream = (PRStream) value;
        var streamBytes = PdfReader.GetStreamBytes(stream);
        var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));
        try {
            // Keep only the string tokens; operators, names and numbers are skipped.
            while (tokenizer.NextToken()) {
                if (tokenizer.TokenType == PRTokeniser.TK_STRING) {
                    var str = tokenizer.StringValue;
                    text.Append(str);
                }
            }
        }
        finally {
            tokenizer.Close();
        }
    }
    // Case 2: the contents entry is an array of streams; scan each element in turn.
    var array = content as PdfArray;
    if (array != null) {
        for (var i = 0; i < array.Size; i++) {
            ScanPdfContents(array[i], pdfReader, text);
        }
    }
}
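To make it clearer why keeping only the string tokens gives you the text, here is a toy, self-contained sketch that does not use iTextSharp at all. It assumes an already-decoded content stream passed in as a plain string, and it only understands literal strings in parentheses (with nesting and a few backslash escapes). The class and method names are made up for this illustration, and this is not how PRTokeniser works internally: hex strings, octal escapes, comments, inline images and font encodings are all ignored.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class PdfStringSketch
{
    // Collects the contents of literal strings "( ... )" found in a decoded
    // content stream, skipping operators, names and numbers.
    public static string ExtractLiteralStrings(string contentStream)
    {
        var text = new StringBuilder();
        var depth = 0; // parenthesis nesting level; 0 means we are outside any string
        for (var i = 0; i < contentStream.Length; i++)
        {
            var c = contentStream[i];
            if (depth == 0)
            {
                if (c == '(') depth = 1; // a literal string starts here
                continue;
            }
            if (c == '\\' && i + 1 < contentStream.Length)
            {
                // Simplified escape handling: a few common escapes are translated,
                // anything else (including octal codes) is copied as-is.
                var escaped = contentStream[++i];
                switch (escaped)
                {
                    case 'n': text.Append('\n'); break;
                    case 'r': text.Append('\r'); break;
                    case 't': text.Append('\t'); break;
                    default: text.Append(escaped); break;
                }
            }
            else if (c == '(')
            {
                depth++; // balanced parentheses may appear inside a string
                text.Append(c);
            }
            else if (c == ')')
            {
                depth--;
                if (depth > 0) text.Append(c); // the outermost ')' ends the string
            }
            else
            {
                text.Append(c);
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Tj shows one string; TJ shows an array of strings with kerning offsets.
        var stream = "BT /F1 12 Tf 72 712 Td (Hello, ) Tj [(Wor) -20 (ld)] TJ ET";
        Console.WriteLine(ExtractLiteralStrings(stream)); // prints "Hello, World"
    }
}
```

Note how the TJ array splits "World" into two fragments for kerning. That is one reason the method above appends the strings without inserting separators: adding a space after every token would break words apart, at the cost of sometimes gluing neighboring words together.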
I hope this helps. If anyone wants to share their experience with indexing PDFs, please let me know.
Note: The code above took some hints from this post:
http://stackoverflow.com/questions/10143098/how-to-extract-text-with-itextsharp-4-1-6
- Code updated to include PdfArray -