Does anyone have a good Proper Case algorithm

https://stackoverflow.com/questions/32149

09-06-2019
|

Question

Does anyone have a trusted Proper Case or PCase algorithm (similar to a UCase or Upper)? I'm looking for something that takes a value such as "GEORGE BURDELL" or "george burdell" and turns it into "George Burdell".

I have a simple one that handles the simple cases. The ideal would be to have something that can handle things such as "O'REILLY" and turn it into "O'Reilly", but I know that is tougher.

I am mainly focused on the English language if that simplifies things.

UPDATE: I'm using C# as the language, but I can convert from almost anything (assuming like functionality exists).

I agree that the McDonald's scneario is a tough one. I meant to mention that along with my O'Reilly example, but did not in the original post.

Solution

Unless I've misunderstood your question I don't think you need to roll your own, the TextInfo class can do it for you.

using System.Globalization;

CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")

Will return "George Burdell. And you can use your own culture if there's some special rules involved.

Update: Michael (in a comment to this answer) pointed out that this will not work if the input is all caps since the method will assume that it is an acronym. The naive workaround for this is to .ToLower() the text before submitting it to ToTitleCase.

OTHER TIPS

@Zack: I'll post it as a separate reply.

Here's an example based on kronoz's post.

void Main()
{
    List<string> names = new List<string>() {
        "bill o'reilly", 
        "johannes diderik van der waals", 
        "mr. moseley-williams", 
        "Joe VanWyck", 
        "mcdonald's", 
        "william the third", 
        "hrh prince charles", 
        "h.r.m. queen elizabeth the third",
        "william gates, iii", 
        "pope leo xii",
        "a.k. jennings"
    };

    names.Select(name => name.ToProperCase()).Dump();
}

// Define other methods and classes here

// http://stackoverflow.com/questions/32149/does-anyone-have-a-good-proper-case-algorithm
public static class ProperCaseHelper {
    public static string ToProperCase(this string input) {
        if (IsAllUpperOrAllLower(input))
        {
            // fix the ALL UPPERCASE or all lowercase names
            return string.Join(" ", input.Split(' ').Select(word => wordToProperCase(word)));
        }
        else
        {
            // leave the CamelCase or Propercase names alone
            return input;
        }
    }

    public static bool IsAllUpperOrAllLower(this string input) {
        return (input.ToLower().Equals(input) || input.ToUpper().Equals(input) );
    }

    private static string wordToProperCase(string word) {
        if (string.IsNullOrEmpty(word)) return word;

        // Standard case
        string ret = capitaliseFirstLetter(word);

        // Special cases:
        ret = properSuffix(ret, "'");   // D'Artagnon, D'Silva
        ret = properSuffix(ret, ".");   // ???
        ret = properSuffix(ret, "-");       // Oscar-Meyer-Weiner
        ret = properSuffix(ret, "Mc");      // Scots
        ret = properSuffix(ret, "Mac");     // Scots

        // Special words:
        ret = specialWords(ret, "van");     // Dick van Dyke
        ret = specialWords(ret, "von");     // Baron von Bruin-Valt
        ret = specialWords(ret, "de");      
        ret = specialWords(ret, "di");      
        ret = specialWords(ret, "da");      // Leonardo da Vinci, Eduardo da Silva
        ret = specialWords(ret, "of");      // The Grand Old Duke of York
        ret = specialWords(ret, "the");     // William the Conqueror
        ret = specialWords(ret, "HRH");     // His/Her Royal Highness
        ret = specialWords(ret, "HRM");     // His/Her Royal Majesty
        ret = specialWords(ret, "H.R.H.");  // His/Her Royal Highness
        ret = specialWords(ret, "H.R.M.");  // His/Her Royal Majesty

        ret = dealWithRomanNumerals(ret);   // William Gates, III

        return ret;
    }

    private static string properSuffix(string word, string prefix) {
        if(string.IsNullOrEmpty(word)) return word;

        string lowerWord = word.ToLower();
        string lowerPrefix = prefix.ToLower();

        if (!lowerWord.Contains(lowerPrefix)) return word;

        int index = lowerWord.IndexOf(lowerPrefix);

        // If the search string is at the end of the word ignore.
        if (index + prefix.Length == word.Length) return word;

        return word.Substring(0, index) + prefix +
            capitaliseFirstLetter(word.Substring(index + prefix.Length));
    }

    private static string specialWords(string word, string specialWord)
    {
        if(word.Equals(specialWord, StringComparison.InvariantCultureIgnoreCase))
        {
            return specialWord;
        }
        else
        {
            return word;
        }
    }

    private static string dealWithRomanNumerals(string word)
    {
        List<string> ones = new List<string>() { "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX" };
        List<string> tens = new List<string>() { "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC", "C" };
        // assume nobody uses hundreds

        foreach (string number in ones)
        {
            if (word.Equals(number, StringComparison.InvariantCultureIgnoreCase))
            {
                return number;
            }
        }

        foreach (string ten in tens)
        {
            foreach (string one in ones)
            {
                if (word.Equals(ten + one, StringComparison.InvariantCultureIgnoreCase))
                {
                    return ten + one;
                }
            }
        }

        return word;
    }

    private static string capitaliseFirstLetter(string word) {
        return char.ToUpper(word[0]) + word.Substring(1).ToLower();
    }

}

There's also this neat Perl script for title-casing text.

http://daringfireball.net/2008/08/title_case_update

#!/usr/bin/perl

#     This filter changes all words to Title Caps, and attempts to be clever
# about *un*capitalizing small words like a/an/the in the input.
#
# The list of "small words" which are not capped comes from
# the New York Times Manual of Style, plus 'vs' and 'v'. 
#
# 10 May 2008
# Original version by John Gruber:
# http://daringfireball.net/2008/05/title_case
#
# 28 July 2008
# Re-written and much improved by Aristotle Pagaltzis:
# http://plasmasturm.org/code/titlecase/
#
#   Full change log at __END__.
#
# License: http://www.opensource.org/licenses/mit-license.php
#


use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );


my @small_words = qw( (?<!q&)a an and as at(?!&t) but by en for if in of on or the to v[.]? via vs[.]? );
my $small_re = join '|', @small_words;

my $apos = qr/ (?: ['’] [[:lower:]]* )? /x;

while ( <> ) {
  s{\A\s+}{}, s{\s+\z}{};

  $_ = lc $_ if not /[[:lower:]]/;

  s{
      \b (_*) (?:
          ( (?<=[ ][/\\]) [[:alpha:]]+ [-_[:alpha:]/\\]+ |   # file path or
            [-_[:alpha:]]+ [@.:] [-_[:alpha:]@.:/]+ $apos )  # URL, domain, or email
          |
          ( (?i: $small_re ) $apos )                         # or small word (case-insensitive)
          |
          ( [[:alpha:]] [[:lower:]'’()\[\]{}]* $apos )       # or word w/o internal caps
          |
          ( [[:alpha:]] [[:alpha:]'’()\[\]{}]* $apos )       # or some other word
      ) (_*) \b
  }{
      $1 . (
        defined $2 ? $2         # preserve URL, domain, or email
      : defined $3 ? "\L$3"     # lowercase small word
      : defined $4 ? "\u\L$4"   # capitalize word w/o internal caps
      : $5                      # preserve other kinds of word
      ) . $6
  }xeg;


  # Exceptions for small words: capitalize at start and end of title
  s{
      (  \A [[:punct:]]*         # start of title...
      |  [:.;?!][ ]+             # or of subsentence...
      |  [ ]['"“‘(\[][ ]*     )  # or of inserted subphrase...
      ( $small_re ) \b           # ... followed by small word
  }{$1\u\L$2}xig;

  s{
      \b ( $small_re )      # small word...
      (?= [[:punct:]]* \Z   # ... at the end of the title...
      |   ['"’”)\]] [ ] )   # ... or of an inserted subphrase?
  }{\u\L$1}xig;

  # Exceptions for small words in hyphenated compound words
  ## e.g. "in-flight" -> In-Flight
  s{
      \b
      (?<! -)                 # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (in-flight)
      ( $small_re )
      (?= -[[:alpha:]]+)      # lookahead for "-someword"
  }{\u\L$1}xig;

  ## # e.g. "Stand-in" -> "Stand-In" (Stand is already capped at this point)
  s{
      \b
      (?<!…)                  # Negative lookbehind for a hyphen; we don't want to match man-in-the-middle but do want (stand-in)
      ( [[:alpha:]]+- )       # $1 = first word and hyphen, should already be properly capped
      ( $small_re )           # ... followed by small word
      (?! - )                 # Negative lookahead for another '-'
  }{$1\u$2}xig;

  print "$_";
}

__END__

But it sounds like by proper case you mean.. for people's names only.

I wrote this today to implement in an app I'm working on. I think this code is pretty self explanatory with comments. It's not 100% accurate in all cases but it will handle most of your western names easily.

Examples:

mary-jane => Mary-Jane

o'brien => O'Brien

Joël VON WINTEREGG => Joël von Winteregg

jose de la acosta => Jose de la Acosta

The code is extensible in that you may add any string value to the arrays at the top to suit your needs. Please study it and add any special feature that may be required.

function name_title_case($str)
{
  // name parts that should be lowercase in most cases
  $ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','der','den','vel','von');
  // name parts that should be lower even if at the beginning of a name
  $always_lower   = array('van', 'der');

  // Create an array from the parts of the string passed in
  $parts = explode(" ", mb_strtolower($str));

  foreach ($parts as $part)
  {
    (in_array($part, $ok_to_be_lower)) ? $rules[$part] = 'nocaps' : $rules[$part] = 'caps';
  }

  // Determine the first part in the string
  reset($rules);
  $first_part = key($rules);

  // Loop through and cap-or-dont-cap
  foreach ($rules as $part => $rule)
  {
    if ($rule == 'caps')
    {
      // ucfirst() words and also takes into account apostrophes and hyphens like this:
      // O'brien -> O'Brien || mary-kaye -> Mary-Kaye
      $part = str_replace('- ','-',ucwords(str_replace('-','- ', $part)));
      $c13n[] = str_replace('\' ', '\'', ucwords(str_replace('\'', '\' ', $part)));
    }
    else if ($part == $first_part && !in_array($part, $always_lower))
    {
      // If the first part of the string is ok_to_be_lower, cap it anyway
      $c13n[] = ucfirst($part);
    }
    else
    {
      $c13n[] = $part;
    }
  }

  $titleized = implode(' ', $c13n);

  return trim($titleized);
}

I did a quick C# port of https://github.com/tamtamchik/namecase, which is based on Lingua::EN::NameCase.

public static class CIQNameCase
{
    static Dictionary<string, string> _exceptions = new Dictionary<string, string>
        {
            {@"\bMacEdo"     ,"Macedo"},
            {@"\bMacEvicius" ,"Macevicius"},
            {@"\bMacHado"    ,"Machado"},
            {@"\bMacHar"     ,"Machar"},
            {@"\bMacHin"     ,"Machin"},
            {@"\bMacHlin"    ,"Machlin"},
            {@"\bMacIas"     ,"Macias"},
            {@"\bMacIulis"   ,"Maciulis"},
            {@"\bMacKie"     ,"Mackie"},
            {@"\bMacKle"     ,"Mackle"},
            {@"\bMacKlin"    ,"Macklin"},
            {@"\bMacKmin"    ,"Mackmin"},
            {@"\bMacQuarie"  ,"Macquarie"}
        };

    static Dictionary<string, string> _replacements = new Dictionary<string, string>
        {
            {@"\bAl(?=\s+\w)"         , @"al"},        // al Arabic or forename Al.
            {@"\b(Bin|Binti|Binte)\b" , @"bin"},       // bin, binti, binte Arabic
            {@"\bAp\b"                , @"ap"},        // ap Welsh.
            {@"\bBen(?=\s+\w)"        , @"ben"},       // ben Hebrew or forename Ben.
            {@"\bDell([ae])\b"        , @"dell$1"},    // della and delle Italian.
            {@"\bD([aeiou])\b"        , @"d$1"},       // da, de, di Italian; du French; do Brasil
            {@"\bD([ao]s)\b"          , @"d$1"},       // das, dos Brasileiros
            {@"\bDe([lrn])\b"         , @"de$1"},      // del Italian; der/den Dutch/Flemish.
            {@"\bEl\b"                , @"el"},        // el Greek or El Spanish.
            {@"\bLa\b"                , @"la"},        // la French or La Spanish.
            {@"\bL([eo])\b"           , @"l$1"},       // lo Italian; le French.
            {@"\bVan(?=\s+\w)"        , @"van"},       // van German or forename Van.
            {@"\bVon\b"               , @"von"}        // von Dutch/Flemish
        };

    static string[] _conjunctions = { "Y", "E", "I" };

    static string _romanRegex = @"\b((?:[Xx]{1,3}|[Xx][Ll]|[Ll][Xx]{0,3})?(?:[Ii]{1,3}|[Ii][VvXx]|[Vv][Ii]{0,3})?)\b";

    /// <summary>
    /// Case a name field into its appropriate case format 
    /// e.g. Smith, de la Cruz, Mary-Jane,  O'Brien, McTaggart
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    public static string NameCase(string nameString)
    {
        // Capitalize
        nameString = Capitalize(nameString);
        nameString = UpdateIrish(nameString);

        // Fixes for "son (daughter) of" etc
        foreach (var replacement in _replacements.Keys)
        {
            if (Regex.IsMatch(nameString, replacement))
            {
                Regex rgx = new Regex(replacement);
                nameString = rgx.Replace(nameString, _replacements[replacement]);
            }                    
        }

        nameString = UpdateRoman(nameString);
        nameString = FixConjunction(nameString);

        return nameString;
    }

    /// <summary>
    /// Capitalize first letters.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string Capitalize(string nameString)
    {
        nameString = nameString.ToLower();
        nameString = Regex.Replace(nameString, @"\b\w", x => x.ToString().ToUpper());
        nameString = Regex.Replace(nameString, @"'\w\b", x => x.ToString().ToLower()); // Lowercase 's
        return nameString;
    }

    /// <summary>
    /// Update for Irish names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateIrish(string nameString)
    {
        if(Regex.IsMatch(nameString, @".*?\bMac[A-Za-z^aciozj]{2,}\b") || Regex.IsMatch(nameString, @".*?\bMc"))
        {
            nameString = UpdateMac(nameString);
        }            
        return nameString;
    }

    /// <summary>
    /// Updates irish Mac & Mc.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateMac(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, @"\b(Ma?c)([A-Za-z]+)");
        if(matches.Count == 1 && matches[0].Groups.Count == 3)
        {
            string replacement = matches[0].Groups[1].Value;
            replacement += matches[0].Groups[2].Value.Substring(0, 1).ToUpper();
            replacement += matches[0].Groups[2].Value.Substring(1);
            nameString = nameString.Replace(matches[0].Groups[0].Value, replacement);

            // Now fix "Mac" exceptions
            foreach (var exception in _exceptions.Keys)
            {
                nameString = Regex.Replace(nameString, exception, _exceptions[exception]);
            }
        }
        return nameString;
    }

    /// <summary>
    /// Fix roman numeral names.
    /// </summary>
    /// <param name="nameString"></param>
    /// <returns></returns>
    private static string UpdateRoman(string nameString)
    {
        MatchCollection matches = Regex.Matches(nameString, _romanRegex);
        if (matches.Count > 1)
        {
            foreach(Match match in matches)
            {
                if(!string.IsNullOrEmpty(match.Value))
                {
                    nameString = Regex.Replace(nameString, match.Value, x => x.ToString().ToUpper());
                }
            }
        }
        return nameString;
    }

    /// <summary>
    /// Fix Spanish conjunctions.
    /// </summary>
    /// <param name=""></param>
    /// <returns></returns>
    private static string FixConjunction(string nameString)
    {            
        foreach (var conjunction in _conjunctions)
        {
            nameString = Regex.Replace(nameString, @"\b" + conjunction + @"\b", x => x.ToString().ToLower());
        }
        return nameString;
    }
}

Usage

string name_cased = CIQNameCase.NameCase("McCarthy");

This is my test method, everything seems to pass OK:

[TestMethod]
public void Test_NameCase_1()
{
    string[] names = {
        "Keith", "Yuri's", "Leigh-Williams", "McCarthy",
        // Mac exceptions
        "Machin", "Machlin", "Machar",
        "Mackle", "Macklin", "Mackie",
        "Macquarie", "Machado", "Macevicius",
        "Maciulis", "Macias", "MacMurdo",
        // General
        "O'Callaghan", "St. John", "von Streit",
        "van Dyke", "Van", "ap Llwyd Dafydd",
        "al Fahd", "Al",
        "el Grecco",
        "ben Gurion", "Ben",
        "da Vinci",
        "di Caprio", "du Pont", "de Legate",
        "del Crond", "der Sind", "van der Post", "van den Thillart",
        "von Trapp", "la Poisson", "le Figaro",
        "Mack Knife", "Dougal MacDonald",
        "Ruiz y Picasso", "Dato e Iradier", "Mas i Gavarró",
        // Roman numerals
        "Henry VIII", "Louis III", "Louis XIV",
        "Charles II", "Fred XLIX", "Yusof bin Ishak",
    };

    foreach(string name in names)
    {
        string name_upper = name.ToUpper();
        string name_cased = CIQNameCase.NameCase(name_upper);
        Console.WriteLine(string.Format("name: {0} -> {1}  -> {2}", name, name_upper, name_cased));
        Assert.IsTrue(name == name_cased);
    }

}

What programming language do you use? Many languages allow callback functions for regular expression matches. These can be used to propercase the match easily. The regular expression that would be used is quite simple, you just have to match all word characters, like so:

/\w+/

Alternatively, you can already extract the first character to be an extra match:

/(\w)(\w*)/

Now you can access the first character and successive characters in the match separately. The callback function can then simply return a concatenation of the hits. In pseudo Python (I don't actually know Python):

def make_proper(match):
    return match[1].to_upper + match[2]

Incidentally, this would also handle the case of “O'Reilly” because “O” and “Reilly” would be matched separately and both propercased. There are however other special cases that are not handled well by the algorithm, e.g. “McDonald's” or generally any apostrophed word. The algorithm would produce “Mcdonald'S” for the latter. A special handling for apostrophe could be implemented but that would interfere with the first case. Finding a thereotical perfect solution isn't possible. In practice, it might help considering the length of the part after the apostrophe.

Here's a perhaps naive C# implementation:-

public class ProperCaseHelper {
  public string ToProperCase(string input) {
    string ret = string.Empty;

    var words = input.Split(' ');

    for (int i = 0; i < words.Length; ++i) {
      ret += wordToProperCase(words[i]);
      if (i < words.Length - 1) ret += " ";
    }

    return ret;
  }

  private string wordToProperCase(string word) {
    if (string.IsNullOrEmpty(word)) return word;

    // Standard case
    string ret = capitaliseFirstLetter(word);

    // Special cases:
    ret = properSuffix(ret, "'");
    ret = properSuffix(ret, ".");
    ret = properSuffix(ret, "Mc");
    ret = properSuffix(ret, "Mac");

    return ret;
  }

  private string properSuffix(string word, string prefix) {
    if(string.IsNullOrEmpty(word)) return word;

    string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
    if (!lowerWord.Contains(lowerPrefix)) return word;

    int index = lowerWord.IndexOf(lowerPrefix);

    // If the search string is at the end of the word ignore.
    if (index + prefix.Length == word.Length) return word;

    return word.Substring(0, index) + prefix +
      capitaliseFirstLetter(word.Substring(index + prefix.Length));
  }

  private string capitaliseFirstLetter(string word) {
    return char.ToUpper(word[0]) + word.Substring(1).ToLower();
  }
}

a simple way to capitalise the first letter of each word (seperated by a space)

$words = explode(” “, $string);
for ($i=0; $i<count($words); $i++) {
$s = strtolower($words[$i]);
$s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
$result .= “$s “;
}
$string = trim($result);

in terms of catching the "O'REILLY" example you gave splitting the string on both spaces and ' would not work as it would capitalise any letter that appeared after a apostraphe i.e. the s in Fred's

so i would probably try something like

$words = explode(” “, $string);
for ($i=0; $i<count($words); $i++) {

$s = strtolower($words[$i]);

if (substr($s, 0, 2) === "o'"){
$s = substr_replace($s, strtoupper(substr($s, 0, 3)), 0, 3);
}else{
$s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
}
$result .= “$s “;
}
$string = trim($result);

This should catch O'Reilly, O'Clock, O'Donnell etc hope it helps

Please note this code is untested.

Kronoz, thank you. I found in your function that the line:

`if (!lowerWord.Contains(lowerPrefix)) return word`;

must say

if (!lowerWord.StartsWith(lowerPrefix)) return word;

so "información" is not changed to "InforMacIón"

best,

Enrique

I use this as the textchanged event handler of text boxes. Support entry of "McDonald"

Public Shared Function DoProperCaseConvert(ByVal str As String, Optional ByVal allowCapital As Boolean = True) As String
    Dim strCon As String = ""
    Dim wordbreak As String = " ,.1234567890;/\-()#$%^&*€!~+=@"
    Dim nextShouldBeCapital As Boolean = True

    'Improve to recognize all caps input
    'If str.Equals(str.ToUpper) Then
    '    str = str.ToLower
    'End If

    For Each s As Char In str.ToCharArray

        If allowCapital Then
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s)
        Else
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToLower)
        End If

        If wordbreak.Contains(s.ToString) Then
            nextShouldBeCapital = True
        Else
            nextShouldBeCapital = False
        End If
    Next

    Return strCon
End Function

A lot of good answers here. Mine is pretty simple and only takes into account the names we have in our organization. You can expand it as you wish. This is not a perfect solution and will change vancouver to VanCouver, which is wrong. So tweak it if you use it.

Here was my solution in C#. This hard-codes the names into the program but with a little work you could keep a text file outside of the program and read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.

public static String toProperName(String name)
{
    if (name != null)
    {
        if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc")  // Changes mcdonald to "McDonald"
            return "Mc" + Regex.Replace(name.ToLower().Substring(2), @"\b[a-z]", m => m.Value.ToUpper());

        if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van")  // Changes vanwinkle to "VanWinkle"
            return "Van" + Regex.Replace(name.ToLower().Substring(3), @"\b[a-z]", m => m.Value.ToUpper());

        return Regex.Replace(name.ToLower(), @"\b[a-z]", m => m.Value.ToUpper());  // Changes to title case but also fixes 
                                                                                   // appostrophes like O'HARE or o'hare to O'Hare
    }

    return "";
}

I know this thread has been open for awhile, but as I was doing research for this problem I came across this nifty site, which allows you to paste in names to be capitalized quite quickly: https://dialect.ca/code/name-case/. I wanted to include it here for reference for others doing similar research/projects.

They release the algorithm they have written in php at this link: https://dialect.ca/code/name-case/name_case.phps

A preliminary test and reading of their code suggests they have been quite thorough.

You do not mention which language you would like the solution in so here is some pseudo code.

Loop through each character
    If the previous character was an alphabet letter
        Make the character lower case
    Otherwise
        Make the character upper case
End loop

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow