Question

I have been looking at LLVM lately, and I find it to be quite an interesting architecture. However, looking through the tutorial and the reference material, I can't see any examples of how I might implement a string data type.

There is a lot of documentation about integers, reals, and other number types, and even arrays, functions and structures, but AFAIK nothing about strings. Would I have to add a new data type to the backend? Is there a way to use built-in data types? Any insight would be appreciated.

Was it helpful?

Solution

What is a string? An array of characters.

What is a character? An integer.

So while I'm no LLVM expert by any means, I would guess that if, eg, you wanted to represent some 8-bit character set, you'd use an array of i8 (8-bit integers), or a pointer to i8. And indeed, if we have a simple hello world C program:

#include <stdio.h>

int main() {
        puts("Hello, world!");
        return 0;
}

And we compile it using llvm-gcc and dump the generated LLVM assembly:

$ llvm-gcc -S -emit-llvm hello.c
$ cat hello.s
; ModuleID = 'hello.c'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
target triple = "x86_64-linux-gnu"
@.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

define i32 @main() {
entry:
        %retval = alloca i32            ; <i32*> [#uses=2]
        %tmp = alloca i32               ; <i32*> [#uses=2]
        %"alloca point" = bitcast i32 0 to i32          ; <i32> [#uses=0]
        %tmp1 = getelementptr [14 x i8]* @.str, i32 0, i64 0            ; <i8*> [#uses=1]
        %tmp2 = call i32 @puts( i8* %tmp1 ) nounwind            ; <i32> [#uses=0]
        store i32 0, i32* %tmp, align 4
        %tmp3 = load i32* %tmp, align 4         ; <i32> [#uses=1]
        store i32 %tmp3, i32* %retval, align 4
        br label %return

return:         ; preds = %entry
        %retval4 = load i32* %retval            ; <i32> [#uses=1]
        ret i32 %retval4
}

declare i32 @puts(i8*)

Notice the reference to the puts function declared at the end of the file. In C, puts is

int puts(const char *s)

In LLVM, it is

i32 @puts(i8*)

The correspondence should be clear.

As an aside, the generated LLVM is very verbose here because I compiled without optimizations. If you turn those on, the unnecessary instructions disappear:

$ llvm-gcc -O2 -S -emit-llvm hello.c
$ cat hello.s 
; ModuleID = 'hello.c'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
target triple = "x86_64-linux-gnu"
@.str = internal constant [14 x i8] c"Hello, world!\00"         ; <[14 x i8]*> [#uses=1]

define i32 @main() nounwind  {
entry:
        %tmp2 = tail call i32 @puts( i8* getelementptr ([14 x i8]* @.str, i32 0, i64 0) ) nounwind              ; <i32> [#uses=0]
        ret i32 0
}

declare i32 @puts(i8*)

OTHER TIPS

[To follow up on other answers which explain what strings are, here is some implementation help]

Using the C interface, the calls you'll want are something like:

LLVMValueRef llvmGenLocalStringVar(const char* data, int len)
{
  LLVMValueRef glob = LLVMAddGlobal(mod, LLVMArrayType(LLVMInt8Type(), len), "string");

  // set as internal linkage and constant
  LLVMSetLinkage(glob, LLVMInternalLinkage);
  LLVMSetGlobalConstant(glob, TRUE);

  // Initialize with string:
  LLVMSetInitializer(glob, LLVMConstString(data, len, TRUE));

  return glob;
}

Think about how a string is represented in common languages:

  • C: a pointer to a character. You don't have to do anything special.
  • C++: string is a complex object with a constructor, destructor, and copy constructor. On the inside, it usually holds essentially a C string.
  • Java/C#/...: a string is a complex object holding an array of characters.

LLVM's name is very self explanatory. It really is "low level". You have to implement strings how ever you want them to be. It would be silly for LLVM to force anyone into a specific implementation.

Using the C API, instead of using LLVMConstString, you could use LLVMBuildGlobalString. Here is my implementation of

int main() {
    printf("Hello World, %s!\n", "there");
    return;
}

using C API:

LLVMTypeRef main_type = LLVMFunctionType(LLVMVoidType(), NULL, 0, false);
LLVMValueRef main = LLVMAddFunction(mod, "main", main_type);

LLVMTypeRef param_types[] = { LLVMPointerType(LLVMInt8Type(), 0) };
LLVMTypeRef llvm_printf_type = LLVMFunctionType(LLVMInt32Type(), param_types, 0, true);
LLVMValueRef llvm_printf = LLVMAddFunction(mod, "printf", llvm_printf_type);

LLVMBasicBlockRef entry = LLVMAppendBasicBlock(main, "entry");
LLVMPositionBuilderAtEnd(builder, entry);

LLVMValueRef format = LLVMBuildGlobalStringPtr(builder, "Hello World, %s!\n", "format");
LLVMValueRef value = LLVMBuildGlobalStringPtr(builder, "there", "value");

LLVMValueRef args[] = { format, value };
LLVMBuildCall(builder, llvm_printf, args, 2, "printf");

LLVMBuildRetVoid(builder);

I created strings like so:

LLVMValueRef format = LLVMBuildGlobalStringPtr(builder, "Hello World, %s!\n", "format");
LLVMValueRef value = LLVMBuildGlobalStringPtr(builder, "there", "value");

The generated IR is:

; ModuleID = 'printf.bc'
source_filename = "my_module"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

@format = private unnamed_addr constant [18 x i8] c"Hello World, %s!\0A\00"
@value = private unnamed_addr constant [6 x i8] c"there\00"

define void @main() {
entry:
  %printf = call i32 (...) @printf(i8* getelementptr inbounds ([18 x i8], [18 x i8]* @format, i32 0, i32 0), i8* getelementptr inbounds ([6 x i8], [6 x i8]* @value, i32 0, i32 0))
  ret void
}

declare i32 @printf(...)

For those using the C++ API of LLVM, you can rely on IRBuilder's CreateGlobalStringPtr :

Builder.CreateGlobalStringPtr(StringRef("Hello, world!"));

This will be represented as i8* in the final LLVM IR.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top