That's indeed funny. The sample PDF provided by the OP indeed visibly contains upper case characters, some of them in upper case only lines, some in mixed case lines, which by Adobe Reader are extracted as lower case characters.
You wonder
What could be causing this issue?
As an example how that happens let's look at Pelle Più bella
In the page content that phrase actually looks like the visual representation in capital letters:
/T1_0 1 Tf
-0.025 Tc 12 0 0 12 379.5354 554.8809 Tm
(PELLE PI\331 BELLA)Tj
Looking at the used font T1_0 (a DIN-Bold subset) we see that it claims to use WinAnsiEncoding which would also indicate an interpretation of those character codes in the page stream as capital letters
But the font also has a ToUnicode mapping, and this mapping maps
<41> <0061> — 'A' → a
<42> <0062> — 'B' → b
<43> <0043> — 'C' → C
<44> <0044> — 'D' → D
<45> <0065> — 'E' → e
<49> <0069> — 'I' → i
<4C> <006C> — 'L' → l
<4D> <004D> — 'M' → M
<4E> <006E> — 'N' → n
<50> <0050> — 'P' → P
<52> <0072> — 'R' → r
<53> <0053> — 'S' → S
<54> <0074> — 'T' → t
<D9> <00F9> — 'Ù' → ù
(I only extracted the mappings from character codes which in WinAnsiEncoding represent capital letters.)
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Sorry, I'm not really into InDesign. But that software being from Adobe I would be surprised if that was a bug in InDesign or its export to PDF. Could it instead be that there are some information in the InDesign file which tag PELLE PIÙ BELLA as Pelle Più bella which InDesign then in the PDF export translates into this ToUnicode mapping?
Does it have to do with non-unicode fonts and if so is there an alternative that does not require the owner to select different fonts?
In case of your sample document there are three fonts, all of them with an Encoding entry WinAnsiEncoding, all of them being an embedded subset, but only two have such funny ToUnicode mappings, DIN-Medium and DIN-Bold, while Helvetica has no ToUnicode mapping. So it somehow is font related. How exactly I cannot say.
A workaround in case of your sample document would be to remove the ToUnicode mapping from the font dictionaries.
For example using Java and the iText library you can do that like this:
PdfReader reader = new PdfReader(INPUT);
for (int i = 1; i <= reader.getXrefSize(); i++)
{
PdfObject obj = reader.getPdfObject(i);
if (obj != null && obj.isDictionary())
{
PdfDictionary dic = (PdfDictionary) obj;
if (PdfName.FONT.equals(dic.getAsName(PdfName.TYPE)))
{
dic.remove(PdfName.TOUNICODE);
}
}
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(OUTPUT));
stamper.close();
reader.close();
After this manipulation Adobe Reader text extraction results in
PELLE PIÙ BELLA
This obviously only works in situations like the one in your sample document.
If in your other documents there is a mixture of fonts some of which require their respective ToUnicode map for text extraction while others are like the trouble fonts above, you might want to add some extra conditions to the Java code to only remove the map in the buggy font definitions.