Is full-text search of PDFs in SQL Server 2008 restricted to embedded text only?

StackOverflow https://stackoverflow.com/questions/22228360

Question

What I'm curious about is this: let's say I have 100 PDFs, and all of them have the words "happy apple". Let's say that only 20 of these have embedded text that contains "happy apple".

When I do a search for "happy apple", will I receive all 100 docs or only 20? I'm unable to find a clear answer to this question.


Solution

Flat out impossible to answer without any further information on your search tool and the actual PDFs.

"Happy apple" will be found if the text is 1. not compressed, 2. not encrypted, 3. not weirdly constructed, 4. not re-encoded, or 5. re-encoded but the translation table to Unicode is present and correct.

ad 1: Usually data streams in a PDF are compressed, using one or more algorithms from the standard set (usually LZW or Flate).
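To illustrate point 1, here is a minimal sketch using Python's standard zlib module, which implements the same deflate algorithm that Flate refers to. The content stream below is a made-up example, not taken from any real PDF. It shows that a plain byte search fails on the compressed data and succeeds once the stream is inflated.

```python
import zlib

# Hypothetical page content stream (assumption for illustration only).
content = b"BT /F1 12 Tf 72 720 Td (happy apple) Tj ET"
phrase = b"happy apple"

# Flate-compress the stream, as a PDF writer typically would.
compressed = zlib.compress(content)

# A grep-like byte search on the compressed data does not see the phrase.
found_raw = phrase in compressed

# After decompression the phrase is visible again.
found_decoded = phrase in zlib.decompress(compressed)

print(found_raw, found_decoded)
```

A search tool therefore has to inflate every stream before it can even start looking for text.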

ad 2: PDFs may be encrypted with a password, preventing casual inspection. Levels of security range from moderately difficult to theoretically uncrackable with current technology.

ad 3: Single characters may appear on your page in any order. The software used to create it may, at its whim, split up a text string into separate parts or even draw each individual character at any position, and omit all spaces. Only strict sorting on the absolute x and y coordinates of each text fragment may reveal the original text.
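The coordinate sorting described in point 3 can be sketched roughly as follows. The `Fragment` type, the `rebuild_text` function, the tolerances and the sample coordinates are all assumptions made up for this illustration; a real extractor would take positions and widths from the PDF's text matrices and font metrics.

```python
from typing import List, NamedTuple

class Fragment(NamedTuple):
    x: float      # left edge of the fragment
    y: float      # baseline position
    width: float  # advance width of the fragment
    text: str

def rebuild_text(fragments: List[Fragment],
                 line_tol: float = 2.0,
                 gap_tol: float = 1.5) -> str:
    # Group fragments into lines; PDF y grows upward, so larger y comes first.
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    out = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left to right within a line
        pieces = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            # A horizontal gap wider than gap_tol is treated as an omitted space.
            if cur.x - (prev.x + prev.width) > gap_tol:
                pieces.append(" ")
            pieces.append(cur.text)
        out.append("".join(pieces))
    return "\n".join(out)

# Fragments in drawing order, which is not reading order.
scrambled = [
    Fragment(136.0, 700.0, 24.0, "pple"),
    Fragment(100.0, 700.0, 12.0, "ha"),
    Fragment(130.0, 700.0, 6.0, "a"),
    Fragment(112.0, 700.0, 15.0, "ppy"),
]
print(rebuild_text(scrambled))  # happy apple
```

The stream itself contains "pple", "ha", "a", "ppy" in that order, so a phrase search succeeds only after this kind of reconstruction.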

ad 4: If a font gets subsetted, a PDF composer may decide to store 'h' as 0, 'a' as 1 and 'p' as 2 (and so on). The correct glyphs are still associated with these codes, but "the" text now may appear as "0 1 2 2 3 4 1 2 2 5 6" in the text stream. Also, even if it does not subset the font, a PDF composer is free to move characters around anyway.
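As a small sketch of the re-encoding in point 4, the hypothetical `subset_encode` function below assigns codes in order of first appearance, which is one way a PDF composer might number the glyphs of a subsetted font. Real composers are free to pick any numbering.

```python
from typing import Dict, List, Tuple

def subset_encode(text: str) -> Tuple[List[int], Dict[str, int]]:
    # Give each distinct character the next free code, in order of appearance.
    codes: Dict[str, int] = {}
    encoded: List[int] = []
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes)
        encoded.append(codes[ch])
    return encoded, codes

encoded, codes = subset_encode("happy apple")
print(encoded)  # [0, 1, 2, 2, 3, 4, 1, 2, 2, 5, 6]

# The bytes actually stored in the text stream no longer contain the phrase.
stream = bytes(encoded)
print(b"happy apple" in stream)  # False
```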

ad 5: To revert this re-encoding, software may include a ToUnicode table. This is to associate character codes back to the original Unicode values again; one table per re-encoded font. If the table is missing, there usually is no straightforward way to create it.
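The role of the table in point 5 can be sketched like this. A plain dictionary stands in for the ToUnicode table here, which is a simplification: in a real PDF it is a CMap stream attached to the font. The `decode` function and the sample codes are assumptions for illustration.

```python
from typing import Dict, List, Optional

def decode(char_codes: List[int],
           to_unicode: Optional[Dict[int, str]]) -> Optional[str]:
    # Without a table there is no straightforward way back to readable text.
    if to_unicode is None:
        return None
    # Codes missing from the table become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in char_codes)

# Re-encoded text as it might sit in the stream, with one table for its font.
char_codes = [0, 1, 2, 2, 3, 4, 1, 2, 2, 5, 6]
table = {0: "h", 1: "a", 2: "p", 3: "y", 4: " ", 5: "l", 6: "e"}

print(decode(char_codes, table))  # happy apple
print(decode(char_codes, None))   # None
```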

There is even an ad 6 I did not think of: text may be outlined or appear in bitmaps only.

Only the very simplest PDFs can be searched with a general tool such as command-line grep. For anything else, you need a good PDF decoding tool -- and the better it is, the more points of this list you can tick off. Except, then, for a missing ToUnicode table (#5) and #6.


(Later edit) Oh wait. You obfuscated your actual question enough to entirely throw me off the target, which (I think!) was "does sql-server-2008 search for entire phrases or for individual words?"

Good thing, then, the above still holds. If you cannot search inside your PDFs anyway, the actual question is moot.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow