How does capitalize work?

https://stackoverflow.com/questions/23395334

12-07-2023
|

Question

I'm trying to understand how String#capitalize! works internally. I can create a hash. Given string foo = "the", foo[0] is "t", look up the lower_case "t", and match it with upper case "T" value. In fact, Ruby source shows:

static VALUE
rb_str_capitalize_bang(VALUE str)
{
    rb_encoding *enc;
    char *s, *send;
    int modify = 0;
    unsigned int c;
    int n;

    str_modify_keep_cr(str);
    enc = STR_ENC_GET(str);
    rb_str_check_dummy_enc(enc);
    if (RSTRING_LEN(str) == 0 || !RSTRING_PTR(str)) return Qnil;
    s = RSTRING_PTR(str); send = RSTRING_END(str);

    c = rb_enc_codepoint_len(s, send, &n, enc);
    if (rb_enc_islower(c, enc)) {
        rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
        modify = 1;
    }
    s += n;
    while (s < send) {
        c = rb_enc_codepoint_len(s, send, &n, enc);
        if (rb_enc_isupper(c, enc)) {
            rb_enc_mbcput(rb_enc_tolower(c, enc), s, enc);
            modify = 1;
        }
        s += n;
    }

    if (modify) return str;
    return Qnil;
}

The relevant function is toupper. How does it know toupper("t") equals "T"?

Solution

You're wondering how it knows what the uppercase version of the character is? Like most real-world implementations of this kind of function, it uses a lookup table.

OTHER TIPS

toupper is an ansi C function. This means that the exact implementation actually depends on the provider of your library, which most of the times is your compiler.

Chances are that it follows the ASCII table, because there is no lookup as faster as a sum of integers - one of the steps in the lookup should involve a sum, to calculate the new address.

So, on gcc, we have this implementation

char
ctype<char>::do_toupper(char __c) const
{
  int __x = __c;
  return (this->is(ctype_base::lower, __c) ? (__x - 'a' + 'A') : __x);
}

This basically checks if it lower. If it is, returns the lower. Otherwise, it does subtracts 97 and then sum 65, which is the same thing than subtract 32. Remember that characters and numbers are the same for a computer, just binary data. And then, note how characters are used instead of numbers for a better readability (well, at least for C folks).

Without looking at any source code, I would guess that one way would be to convert a character to its' respective ASCII value, subtract 32 from it and convert the ASCII value back to a char.

enter image description here

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow