I'm trying to export our attached documents from the doc_contents field of T_CUST_DOC, and I'm having some trouble. To test, I exported one using BCP, but Adobe gave me the error "Acrobat could not open '762bcp.pdf' because it is either not a supported file type or because the file has been damaged."
I read in another message that these attached documents are stored as OLE objects with a header. I noticed the size difference between my output file and one manually extracted file, and wrote a quick Python script to export the binary file and strip off the first 1437KB. It created a file of the same size as the original, but Adobe still won't open it, and a file compare shows they differ.
So any advice here on how to export these images? They're taking up too much space and we've decided to let them live outside of Tess.
Thanks,
John
Thanks, Steven. That makes sense, but I can't seem to find the pattern for the Adobe headers.
I wonder if I could use COM to strip out the header info? Could you tell me what COM call Tess uses to extract the files? I should have access to all the COM calls (I'm using ActivePython, which includes lots of win32 libraries).
From: Steven Bras [mailto:bounce-stevenbras7585@tessituranetwork.com] Sent: Thursday, November 18, 2010 2:32 PM To: John Stephenson Subject: Re: [Tessitura Technical Forum] need help removing pdf files from T_CUST_DOC
The OLE header may vary in length, but is always within the first 300 characters of the file. Further, it varies in length and position by file type. Simply removing data from the start of the file to make sizes match won't do the trick, I'm afraid.
This article gives a good example of a function used to extract the root binary file and eliminate the OLE header, for various graphic file types. It might be easy to locate the corresponding data in your PDF files and modify the function to work for those as well:
http://blogs.msdn.com/b/pranab/archive/2008/07/15/removing-ole-header-from-images-stored-in-ms-access-db-as-ole-object.aspx
Hope this helps!
From: John Stephenson <bounce-johnstephenson2236@tessituranetwork.com> Sent: 11/18/2010 11:47:05 AM
I finally got this to work. The beginning of the header was %PDF- (in hex, 25 50 44 46 2D) and the ending was %EOF (25 45 4F 46 0D), so I trimmed off everything not between the header and the ending.
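For anyone who finds this thread later, here is a minimal Python sketch of that trimming step. It is not the exact script I ran. It assumes the doc_contents value has already been read into a bytes object (the database read is not shown), and the function name extract_pdf is made up for illustration. It looks for the last %EOF in the data, which also matches the tail of the usual %%EOF trailer, and keeps everything from %PDF- through that marker.

```python
PDF_START = b"%PDF-"  # hex 25 50 44 46 2D
PDF_END = b"%EOF"     # hex 25 45 4F 46; also matches the tail of "%%EOF"


def extract_pdf(blob: bytes) -> bytes:
    """Return the bytes from the first %PDF- through the last %EOF.

    Hypothetical helper: everything before the start marker (the OLE
    header) and everything after the end marker is dropped.
    """
    start = blob.find(PDF_START)
    end = blob.rfind(PDF_END)  # last occurrence, in case the PDF has several
    if start == -1 or end == -1 or end < start:
        raise ValueError("PDF start/end markers not found")
    return blob[start:end + len(PDF_END)]
```

To use it, pass in the raw column value and write the result in binary mode, for example open("762.pdf", "wb").write(extract_pdf(blob)), where the file name is just a placeholder. Rows that hold something other than a PDF will raise ValueError, so those can be logged and handled separately.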
We're testing the exported PDFs now, but so far they look good.
Thanks for steering me in the right direction.