Detecting Unicode text ligatures in Clojure/Java

https://stackoverflow.com/questions/3466565

28-09-2019

Question

Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari, त्र is a ligature that consists of the code points त + ् + र.

When seen in simple text editors like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Firefox, it is shown as a proper ligature.
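To make the storage claim concrete, here is a minimal Java sketch that lists the code points of the conjunct. The class name CodePointDump and the helper describe are made up for illustration, and String.codePoints() assumes Java 8 or later.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Returns the code points of a string as space-separated "U+XXXX" labels.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // TA + VIRAMA + RA, written with escapes so the source file encoding does not matter.
        String tra = "\u0924\u094D\u0930";
        System.out.println(tra.codePointCount(0, tra.length()) + " code points: " + describe(tra));
    }
}
```

The string holds three code points (U+0924 U+094D U+0930) no matter how a renderer chooses to draw them.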

So my question is: how can I detect such ligatures programmatically while reading the file in my code? Since Firefox does it, there must be a way to do it programmatically. Are there any Unicode properties that contain this information, or do I need a map of all such ligatures?

The SVG/CSS property text-rendering, when set to optimizeLegibility, does the same thing (it combines code points into a proper ligature).

PS: I am using Java.

EDIT

The purpose of my code is to count the characters in Unicode text, treating each ligature as a single character. So I need a way to collapse multiple code points into a single ligature.

Solution 4

While Aaron's answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs of java.awt.font.GlyphVector and playing a lot on the Clojure REPL, I was able to write a function which does what I want.

The idea is to find the width of each glyph in the GlyphVector and to merge every zero-width glyph into the most recent glyph with a non-zero width. The solution is in Clojure, but it should be translatable to Java if required.

(ns net.abhinavsarkar.unicode
  (:import [java.awt.font TextAttribute GlyphVector]
           [java.awt Font]
           [javax.swing JTextArea]))

(let [^java.util.Map text-attrs {
        TextAttribute/FAMILY "Arial Unicode MS"
        TextAttribute/SIZE 25
        TextAttribute/LIGATURES TextAttribute/LIGATURES_ON}
      font (Font/getFont text-attrs)
      ta (doto (JTextArea.) (.setFont font))
      frc (.getFontRenderContext (.getFontMetrics ta font))]
  (defn unicode-partition
    "Takes a Unicode string and returns a vector of strings, partitioning
    the input so that all code points of a single ligature end up in the
    same partition of the output vector."
    [^String text]
    (let [glyph-vector 
            (.layoutGlyphVector
              font, frc, (.toCharArray text),
              0, (.length text), Font/LAYOUT_LEFT_TO_RIGHT)
          glyph-num (.getNumGlyphs glyph-vector)
          glyph-positions
            (map first (partition 2
                          (.getGlyphPositions glyph-vector 0 glyph-num nil)))
          glyph-widths
            (map -
              (concat (next glyph-positions)
                      [(.. glyph-vector getLogicalBounds width)])
              glyph-positions)
          glyph-indices 
            (seq (.getGlyphCharIndices glyph-vector 0 glyph-num nil))
          glyph-index-width-map (zipmap glyph-indices glyph-widths)
          corrected-glyph-widths
            (vec (reduce
                    (fn [acc [k v]] (do (aset acc k v) acc))
                    (make-array Float (count glyph-index-width-map))
                    glyph-index-width-map))]
      (loop [idx 0 pidx 0 char-seq text acc []]
        (if (nil? char-seq)
          acc
          (if-not (zero? (nth corrected-glyph-widths idx))
            (recur (inc idx) (inc pidx) (next char-seq)
              (conj acc (str (first char-seq))))
            (recur (inc idx) pidx (next char-seq)
              (assoc acc (dec pidx)
                (str (nth acc (dec pidx)) (first char-seq))))))))))

Also posted on Gist.
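Since the answer mentions that the idea carries over to Java, here is a rough sketch of the same zero-width grouping. It is not a line-by-line port of the Clojure function. The class and method names (GlyphWidthPartition, groupByAdvance, charAdvances) are made up for illustration, the FontRenderContext is constructed directly instead of being taken from a JTextArea, and characters that receive no glyph of their own are assumed to have zero advance. Results depend on the font and on the text layout engine of the JDK in use.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import java.util.ArrayList;
import java.util.List;

class GlyphWidthPartition {

    // Font-independent step: a char with zero advance joins the previous group,
    // any other char starts a new group. A zero-advance char at the very start
    // has no previous group, so it starts one too.
    static List<String> groupByAdvance(String text, float[] advances) {
        List<StringBuilder> groups = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (advances[i] == 0f && !groups.isEmpty()) {
                groups.get(groups.size() - 1).append(text.charAt(i));
            } else {
                groups.add(new StringBuilder().append(text.charAt(i)));
            }
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder group : groups) {
            result.add(group.toString());
        }
        return result;
    }

    // Font-dependent step: horizontal advance attributed to each input char.
    static float[] charAdvances(Font font, String text) {
        FontRenderContext frc = new FontRenderContext(null, true, true);
        char[] chars = text.toCharArray();
        GlyphVector gv = font.layoutGlyphVector(
                frc, chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
        int n = gv.getNumGlyphs();
        // n + 1 entries: the extra one is the position after the last glyph.
        float[] pos = gv.getGlyphPositions(0, n + 1, null); // x, y pairs
        int[] charIdx = gv.getGlyphCharIndices(0, n, null);
        float[] advances = new float[chars.length]; // chars without a glyph stay at 0
        for (int g = 0; g < n; g++) {
            int c = charIdx[g];
            if (c >= 0 && c < advances.length) {
                advances[c] += pos[2 * (g + 1)] - pos[2 * g];
            }
        }
        return advances;
    }

    static List<String> partition(Font font, String text) {
        return groupByAdvance(text, charAdvances(font, text));
    }
}
```

A call might look like GlyphWidthPartition.partition(new Font("Arial Unicode MS", Font.PLAIN, 25), text), assuming that font is installed. Keeping the grouping step separate from the font lookup means it can be checked with hand-written advances, without any font.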

OTHER TIPS

The Computer Typesetting Wikipedia page says:

The Computer Modern Roman typeface provided with TeX includes the five common ligatures ff, fi, fl, ffi, and ffl. When TeX finds these combinations in a text it substitutes the appropriate ligature, unless overridden by the typesetter.

This indicates that it's the editor that does substitution. Moreover,

Unicode maintains that ligaturing is a presentation issue rather than a character definition issue, and that, for example, "if a modern font is asked to display 'h' followed by 'r', and the font has an 'hr' ligature in it, it can display the ligature."

As far as I can see (I got interested in this topic and have just read a few articles), the instructions for ligature substitution are embedded in the font. I dug a little further and found these for you: GSUB - The Glyph Substitution Table and the Ligature Substitution Subtable, both from the OpenType file format specification.

Next, you need to find a library that lets you peek inside OpenType font files, i.e. a file parser for quick access. Reading the following two discussions may give you some direction on how to do these substitutions:

Chromium bug http://code.google.com/p/chromium/issues/detail?id=22240
Firefox bug https://bugs.launchpad.net/firefox/+bug/37828
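As a first step toward peeking inside a font file, the sketch below only checks whether a font's table directory lists a GSUB table. It assumes a single-font OpenType/TrueType file (not a .ttc collection) that has already been read into memory. The class and method names are made up for illustration, and it does not parse the ligature subtables themselves.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class OpenTypeTables {
    // Returns true if the table directory lists a table with the given
    // 4-character tag, e.g. "GSUB".
    static boolean hasTable(byte[] fontData, String tag) {
        byte[] wanted = tag.getBytes(StandardCharsets.US_ASCII);
        if (wanted.length != 4) {
            throw new IllegalArgumentException("table tags are exactly 4 ASCII characters");
        }
        if (fontData.length < 12) {
            return false; // too short to hold the 12-byte header
        }
        // ByteBuffer is big-endian by default, which matches the OpenType format.
        ByteBuffer buf = ByteBuffer.wrap(fontData);
        int numTables = buf.getShort(4) & 0xFFFF; // uint16 right after the 4-byte sfntVersion
        for (int i = 0; i < numTables; i++) {
            int record = 12 + 16 * i; // each record: tag, checksum, offset, length (4 bytes each)
            if (record + 16 > fontData.length) {
                return false; // truncated directory
            }
            boolean match = true;
            for (int j = 0; j < 4; j++) {
                if (fontData[record + j] != wanted[j]) {
                    match = false;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

With a real file this could be called as OpenTypeTables.hasTable(Files.readAllBytes(Paths.get("some-font.ttf")), "GSUB"), where the path is a placeholder. A font that has a GSUB table may define ligature substitutions; finding out which ones requires walking the lookup tables described in the specification.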

What you are talking about are not ligatures (at least not in Unicode parlance) but grapheme clusters. There is a standard annex that is concerned with discovering text boundaries, including grapheme cluster boundaries:

http://www.unicode.org/reports/tr29/tr29-15.html#Grapheme_Cluster_Boundaries

Also see the description of tailored grapheme clusters in regular expressions:

http://unicode.org/reports/tr18/#Tailored_Graphemes_Clusters

And the definition of collation graphemes:

http://www.unicode.org/reports/tr10/#Collation_Graphemes

I think that these are starting points. The harder part will probably be to find a Java implementation of the Unicode collation algorithm that works for Devanagari locales. If you find one, you can analyze strings without resorting to OpenType features. This would be a bit cleaner since OpenType is concerned with purely presentational details and not with character or grapheme cluster semantics, but the collation algorithm and the tailored grapheme cluster boundary finding algorithm look as if they can be implemented independently of fonts.
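As a font-independent starting point, the JDK's java.text.BreakIterator can split a string at character (grapheme) boundaries. The sketch below uses it instead of a collation implementation, so it is a swapped-in technique rather than the approach described above. The class name GraphemeSplit is made up. Whether a Devanagari conjunct such as त्र comes back as one cluster may depend on the JDK version and the Unicode rules it implements, so that case is worth checking on the target runtime.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class GraphemeSplit {
    // Splits text at the character boundaries reported by BreakIterator.
    static List<String> clusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints however many clusters this JDK reports for TA + VIRAMA + RA.
        System.out.println(clusters("\u0924\u094D\u0930").size());
    }
}
```

For the counting use case in the question, clusters(text).size() would be the character count under this definition.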

You may be able to get this information from the GlyphVector class.

For a given String a Font instance can create a GlyphVector that can provide information about the rendering of the text.

The layoutGlyphVector() method on the Font can provide this.

The FLAG_COMPLEX_GLYPHS flag, returned by getLayoutFlags() on the GlyphVector, tells you whether the glyphs do not map one-to-one to the input characters.

The following code shows an example of this:

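import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import javax.swing.JTextField;

// Placeholder font (an assumption); use one that covers the script being tested.
Font font = new Font(Font.SANS_SERIF, Font.PLAIN, 25);
JTextField textField = new JTextField();
String textToTest = "abcdefg";
FontRenderContext fontRenderContext = textField.getFontMetrics(font).getFontRenderContext();

// Lay out the whole string: start index 0, limit textToTest.length().
GlyphVector glyphVector = font.layoutGlyphVector(fontRenderContext, textToTest.toCharArray(), 0, textToTest.length(), Font.LAYOUT_LEFT_TO_RIGHT);
int layoutFlags = glyphVector.getLayoutFlags();
boolean hasComplexGlyphs = (layoutFlags & GlyphVector.FLAG_COMPLEX_GLYPHS) != 0;
int numberOfGlyphs = glyphVector.getNumGlyphs();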

numberOfGlyphs should be the number of glyphs used to display the input text.

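The example above obtains the FontRenderContext from a Java GUI component. A FontRenderContext can also be constructed directly, for example with new FontRenderContext(null, true, true).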

I think that what you are really looking for is Unicode Normalization.

For Java you should check http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html

By choosing the proper normalization form you can obtain what you are looking for.
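A quick way to see what normalization does and does not collapse is to count code points after NFC. The sketch below uses java.text.Normalizer from the JDK; the class name NormalizationCheck is made up. One caveat: NFC can only compose sequences that have a precomposed code point, and the Devanagari conjunct from the question has none, so it stays at three code points.

```java
import java.text.Normalizer;

class NormalizationCheck {
    // Number of code points left after NFC normalization.
    static int codePointsAfterNfc(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // e + COMBINING ACUTE ACCENT composes to a single code point.
        System.out.println(codePointsAfterNfc("e\u0301"));
        // TA + VIRAMA + RA has no precomposed form and is left unchanged.
        System.out.println(codePointsAfterNfc("\u0924\u094D\u0930"));
    }
}
```

So normalization helps with accented Latin letters and similar cases, but for conjuncts like त्र it would need to be combined with one of the other approaches.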

Licensed under: CC-BY-SA with attribution
