Appropriate Method for Storing PDF Repository in Azure

I am interested in leveraging Azure storage to host a large repository of PDFs.  Each PDF contains multiple reports in it, and upon request (via web app) we extract page ranges from these PDFs and serve them up.  My attempts to use block blob storage for this failed miserably because extracting page ranges from PDFs requires a great deal of seeking to different byte offsets and reading several small amounts of data.  The overhead incurred in all the range requests across the REST interface for block blob storage makes this impossible to do at any reasonable speed.

I thought perhaps page blobs would be better suited for this, but I haven't seen anybody use page blobs for anything other than VHDs or specialized data structures (circular logs, etc.).  When I attempted to copy a PDF to my storage account using the Set-AzureStorageBlobContent PowerShell command, I received an error that the file size is invalid for a page blob (because of the 512-byte boundary).  This led me to feel like I'm trying to use this service incorrectly.
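To make the 512-byte constraint concrete: a page blob's size has to be a multiple of 512 bytes, so an arbitrary PDF would need to be padded before upload. The snippet below is only a minimal sketch of that arithmetic. The names `padded_size` and `pad_to_page_boundary` are hypothetical helpers, not part of any Azure SDK, and the sketch assumes the original length is recorded somewhere (for example in blob metadata) so the padding can be trimmed off again before the bytes are handed to a PDF library.

```python
PAGE_SIZE = 512  # page blob sizes must be a multiple of 512 bytes


def padded_size(length: int, page_size: int = PAGE_SIZE) -> int:
    """Round length up to the next multiple of page_size."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Ceiling division without floating point.
    return -(-length // page_size) * page_size


def pad_to_page_boundary(data: bytes, page_size: int = PAGE_SIZE) -> bytes:
    """Append zero bytes so len(result) is a multiple of page_size.

    Assumption: the caller stores len(data) separately, because the
    trailing zeros are not part of the PDF and must be stripped on read.
    """
    return data + b"\x00" * (padded_size(len(data), page_size) - len(data))
```

Padding only addresses the upload error; it does not by itself say whether ranged reads against a page blob would be any faster than against a block blob.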

TL;DR - If I need fast random access to thousands of large files in Azure Storage so that I can extract page ranges from PDFs, what would be the best way to go about that?

September 4th, 2015 11:15am

Hi,

We need more time to research this; we'll keep you updated with our findings.
We regret the inconvenience caused and appreciate your patience.

Regards,
Malar.


September 5th, 2015 6:37am



My attempts to use block blob storage for this failed miserably because extracting page ranges from PDFs requires a great deal of seeking to different byte offsets and reading several small amounts of data.  The overhead incurred in all the range requests across the REST interface for block blob storage makes this impossible to do at any reasonable speed.  

Hi BobMcLare,

Since I don't know your full requirements, I am not sure whether this suggestion is appropriate for your case:

Instead of seeking through the content of those PDFs directly in the files, wouldn't it be more efficient to index the content of all the PDFs using Azure Search (or Elasticsearch) and search the resulting index instead? Take a look at this example: http://wp.sjkp.dk/azure-search-pdf-indexing/

Hope this helps!

September 6th, 2015 2:51pm

Thanks so much for your suggestion Carlos.  I have actually read that article and think it's great.  I am seriously considering using the Azure Search functionality for indexing my PDFs.  However, this does not address my root problem, which is efficiently extracting page ranges from a PDF on Azure storage.  The search service may be able to do a great job of telling me where to find the pages, but extracting them from the PDF is where my challenge lies.
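
To show where the cost comes from, and one possible way to reduce it, here is a minimal sketch of a seekable reader that turns many small reads into a few large ranged fetches by caching fixed-size chunks. Everything in it is an assumption for illustration: `ChunkedReader` is a hypothetical class, and `fetch_range(start, length)` stands in for whatever code actually issues the ranged GET against the blob. No real storage SDK call is shown, and the cache is unbounded, so a real version would need an eviction policy.

```python
from typing import Callable, Dict


class ChunkedReader:
    """File-like reader that caches fixed-size chunks of a remote object."""

    def __init__(self, fetch_range: Callable[[int, int], bytes],
                 size: int, chunk_size: int = 4 * 1024 * 1024) -> None:
        self._fetch = fetch_range      # hypothetical: performs one ranged GET
        self._size = size              # total object length in bytes
        self._chunk_size = chunk_size
        self._chunks: Dict[int, bytes] = {}
        self._pos = 0
        self.requests = 0              # number of remote fetches issued

    def seek(self, offset: int, whence: int = 0) -> int:
        if whence == 0:
            pos = offset
        elif whence == 1:
            pos = self._pos + offset
        elif whence == 2:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence")
        self._pos = max(0, min(pos, self._size))
        return self._pos

    def tell(self) -> int:
        return self._pos

    def _chunk(self, index: int) -> bytes:
        # Fetch a chunk only the first time any byte in it is needed.
        if index not in self._chunks:
            start = index * self._chunk_size
            length = min(self._chunk_size, self._size - start)
            self._chunks[index] = self._fetch(start, length)
            self.requests += 1
        return self._chunks[index]

    def read(self, n: int = -1) -> bytes:
        if n < 0 or self._pos + n > self._size:
            n = self._size - self._pos
        out = bytearray()
        while n > 0:
            index, offset = divmod(self._pos, self._chunk_size)
            piece = self._chunk(index)[offset:offset + n]
            if not piece:              # short fetch: stop instead of looping
                break
            out += piece
            self._pos += len(piece)
            n -= len(piece)
        return bytes(out)
```

With a reader like this, a PDF library that accepts a file-like object would trigger one remote request per touched chunk rather than one per small read. Whether that is fast enough depends on how scattered the reads are across the file.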
September 7th, 2015 10:35am

I am not quite sure what you are going to use your PDFs for, but have you looked into Azure Files?

It is a more vanilla solution, though probably not as robust as Blob storage.

September 8th, 2015 6:03am

Thanks Alex,

I have looked at the Azure Files service, and indeed it looks like it would suit my needs well, but based on the latest article I read on the subject, it appears to still be in preview as of 8/4/2015.  This is for a client-facing, billable product with SLAs, so I want to make sure whatever service I am using is proven, tested, and guaranteed to stick around.

September 8th, 2015 9:43am

This topic is archived. No further replies will be accepted.
