Question

While extracting content from a PDF using the MuPDF library, I am getting only the font name, not its font face (style).

Do I have to guess (e.g. look for "bold" in the font name, though that is not the right way), or is there another way to detect whether a specific font is bold, italic, or plain?

Solution

I have used iTextSharp to extract the font family, font color, etc.

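using System.Text;
using iTextSharp.text.pdf;

// path, text_input_File, input_pdf and clear() are members of the enclosing class (not shown).
public void Extract_inputpdf() {

  text_input_File = string.Empty;

  StringBuilder sb_inputpdf = new StringBuilder();
  PdfReader reader_inputPdf = new PdfReader(path); // read PDF
  // Page numbers are 1-based, so start at 1 rather than 0.
  for (int i = 1; i <= reader_inputPdf.NumberOfPages; i++) {

    TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
    text_input_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);

    sb_inputpdf.Append(text_input_File);
  }
  input_pdf = sb_inputpdf.ToString();
  reader_inputPdf.Close();
  clear();
}

public class TextWithFont_inputPdf: iTextSharp.text.pdf.parser.ITextExtractionStrategy {
  // Collects the text of the page; the original snippet used this field without declaring it.
  private readonly StringBuilder result = new StringBuilder();

  public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {

    string curFont = renderInfo.GetFont().PostscriptFontName;

    // Split the PostScript name if you want the family and style parts separately,
    // e.g. "Arial-BoldMT" gives "Arial" and "BoldMT".
    string[] fontnames = curFont.Split('-', ',');

    result.Append(renderInfo.GetText());
  }

  public string GetResultantText() {

    return result.ToString();
  }

  // Remaining interface members; nothing to do for plain text extraction.
  public void BeginTextBlock() { }
  public void EndTextBlock() { }
  public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo) { }
}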

OTHER TIPS

The PDF spec contains font descriptor entries (for example Flags, ItalicAngle and FontWeight) which allow a file to specify the style of a font. Unfortunately, in the real world you will often find that these are absent.
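
To make that concrete, here is a minimal sketch of how the font descriptor values could be interpreted once you have read them with whatever library you use. It assumes you already have the /Flags and /ItalicAngle values and, if present, /FontWeight. The class and method names are made up for illustration. Treating ForceBold as a sign of a bold font is a heuristic on my part, since the flag is really a rendering hint.

```csharp
using System;

public static class FontDescriptorStyle
{
    // Bit positions in the spec are 1-based: Italic is bit 7, ForceBold is bit 19.
    private const int ItalicFlag = 1 << 6;
    private const int ForceBoldFlag = 1 << 18;

    // Italic if the Italic flag is set or the glyphs have a non-zero slant angle.
    public static bool LooksItalic(int flags, double italicAngle)
    {
        return (flags & ItalicFlag) != 0 || italicAngle != 0.0;
    }

    // Prefer the optional FontWeight entry (700 and above counts as bold here);
    // fall back to the ForceBold flag only when FontWeight is missing.
    public static bool LooksBold(int flags, int? fontWeight)
    {
        if (fontWeight.HasValue)
        {
            return fontWeight.Value >= 700;
        }
        return (flags & ForceBoldFlag) != 0;
    }
}
```

If the descriptor is missing altogether, neither method can tell you anything, which is when you fall back to the font name.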

If the font is referenced rather than embedded, this generally means you are stuck with the PostScript name for the font. It requires some heuristics, but normally the name provides sufficient clues as to the style. It sounds like this is pretty much where you are.
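
A rough sketch of the kind of name heuristic I mean, not a definitive rule set. It assumes that subset fonts carry the usual six-uppercase-letter prefix followed by "+", and that the style usually sits after the last "-" or "," in the name. The class name, method name and keyword list are my own choices, so extend them for the fonts you actually see.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontNameStyle
{
    // Subset fonts are tagged like "ABCDEF+Arial-BoldMT"; drop the tag first.
    private static readonly Regex SubsetPrefix = new Regex("^[A-Z]{6}\\+");

    // Returns "Plain", "Bold", "Italic" or "BoldItalic" based on the name only.
    public static string Guess(string postscriptName)
    {
        string name = SubsetPrefix.Replace(postscriptName ?? string.Empty, string.Empty);

        // Look only at the part after the last '-' or ',' when there is one,
        // so that family names are less likely to trigger a false match.
        int separator = name.LastIndexOfAny(new[] { '-', ',' });
        string style = separator >= 0 ? name.Substring(separator + 1) : name;

        bool bold = Has(style, "Bold");
        bool italic = Has(style, "Italic") || Has(style, "Oblique");

        if (bold && italic) return "BoldItalic";
        if (bold) return "Bold";
        if (italic) return "Italic";
        return "Plain";
    }

    private static bool Has(string text, string keyword)
    {
        return text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Names that abbreviate the style (for instance "Bd" or "It") will come back as "Plain" with this sketch, so treat the result as a guess.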

If the font is embedded, you can parse it and try to find style information in the embedded font program. If it is subsetted, then in theory this information might have been removed, but in general I don't think it will be. However, parsing TrueType/OpenType fonts is tedious, and you may not feel that it is worth it.
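
For TrueType/OpenType data the parsing can stay fairly small if all you want is the style bits. The sketch below assumes you have already pulled the embedded font stream out of the PDF and decompressed it into a byte array, and that it is a single sfnt-wrapped font (not a font collection or a bare CFF or Type 1 program, which need different handling). It reads the macStyle field of the 'head' table, where bit 0 means bold and bit 1 means italic. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class SfntStyle
{
    // Returns the macStyle field of the 'head' table, or null if it cannot be found.
    public static int? ReadMacStyle(byte[] font)
    {
        if (font == null || font.Length < 12) return null;

        // Offset table: 4-byte version, then a 2-byte table count (big-endian).
        int numTables = ReadUInt16(font, 4);
        for (int i = 0; i < numTables; i++)
        {
            // Each table record is 16 bytes: tag, checksum, offset, length.
            int record = 12 + 16 * i;
            if (record + 16 > font.Length) return null;

            string tag = Encoding.ASCII.GetString(font, record, 4);
            if (tag != "head") continue;

            long offset = ReadUInt32(font, record + 8);
            // macStyle is a 2-byte field at byte 44 of the 'head' table.
            if (offset + 46 > font.Length) return null;
            return ReadUInt16(font, (int)offset + 44);
        }
        return null;
    }

    public static bool IsBold(int macStyle) { return (macStyle & 1) != 0; }

    public static bool IsItalic(int macStyle) { return (macStyle & 2) != 0; }

    private static int ReadUInt16(byte[] data, int pos)
    {
        return (data[pos] << 8) | data[pos + 1];
    }

    private static long ReadUInt32(byte[] data, int pos)
    {
        return ((long)data[pos] << 24) | ((long)data[pos + 1] << 16)
             | ((long)data[pos + 2] << 8) | data[pos + 3];
    }
}
```

I chose 'head' rather than the 'OS/2' table because subsetted fonts embedded in PDFs often drop 'OS/2', while 'head' has to be there for the font to be usable.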

I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

Licensed under: CC-BY-SA with attribution