Question

Right now I am working with embedded systems and figuring out ways to implement strings on a microprocessor with no operating system. So far what I am doing is just using the idea of having NULL terminated character pointers and treating them as strings where the NULL signifies the end. I know that this is fairly common, but can you always count on this to be the case?

The reason I ask is that I was thinking about maybe using a real-time operating system at some point, and I'd like to reuse as much of my current code as possible. So for the various choices that are out there, can I pretty much expect the strings to work the same?

Let me be more specific about my case, though. I am implementing a system that takes and processes commands over a serial port. Can I keep my command-processing code the same and expect the string objects created on the RTOS (which contain the commands) to all be NULL terminated? Or would it be different based on the OS?

Update

After being advised to take a look at this question, I have determined that it does not exactly answer what I am asking. That question asks whether a string's length should always be passed, which is entirely different from what I am asking, and although some of the answers had useful information in them, they are not exactly what I am looking for. The answers there mostly give reasons why you would or would not terminate a string with a null character. What I am asking is whether I can more or less expect the built-in strings of different platforms to be null-terminated, without having to go out and try every single platform, if that makes sense.


Solution

The things that are called "C strings" will be null-terminated on any platform. That's how the standard C library functions determine the end of a string.

Within the C language, there's nothing stopping you from having an array of characters that doesn't end in a null. However, you will then have to use some other method, such as carrying the length separately, to avoid running off the end of the string.
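For instance, here is a minimal sketch (the struct and function names are illustrative, not from any standard library) of carrying the length explicitly instead of relying on a terminator:

#include <stddef.h>
#include <stdio.h>

/* A character buffer paired with an explicit length; no NUL terminator required. */
struct counted_str {
    const char *data;
    size_t      len;
};

static void print_counted(struct counted_str s)
{
    /* We must rely on the stored length; strlen() would run off the end. */
    for (size_t i = 0; i < s.len; i++)
        putchar(s.data[i]);
    putchar('\n');
}

int main(void)
{
    char raw[6] = { 'h', 'e', 'l', 'l', 'o', '!' };  /* note: no '\0' anywhere */
    struct counted_str s = { raw, sizeof raw };
    print_counted(s);
    return 0;
}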

OTHER TIPS

Determination of the terminating character is up to the compiler for literals and the implementation of the standard library for strings in general. It isn't determined by the operating system.

The convention of NUL termination goes back to pre-standard C, and in 30+ years, I can't say I've run into an environment that does anything else. This behavior was codified in C89 and continues to be part of the C language standard (link is to a draft of C99):

  • Section 6.4.5 sets the stage for NUL-terminated strings by requiring that a NUL be appended to string literals.
  • Section 7.1.1 brings that to the functions in the standard library by defining a string as "a contiguous sequence of characters terminated by and including the first null character."
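As a quick illustration of both points (a sketch, not text from the standard itself): the appended NUL shows up in the size of a string literal, and the library functions stop at it.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 6.4.5: the literal "foobar" occupies 7 bytes: six characters plus the appended NUL. */
    printf("sizeof \"foobar\"  = %zu\n", sizeof "foobar");   /* prints 7 */

    /* 7.1.1: strlen() counts characters up to, but not including, the first null character. */
    printf("strlen(\"foobar\") = %zu\n", strlen("foobar"));  /* prints 6 */
    return 0;
}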

There's no reason why someone couldn't write functions that handle strings terminated by some other character, but there's also no reason to buck the established standard in most cases unless your goal is giving programmers fits. :-)

I am working with embedded systems ... with no operating system...I am...using the idea of having NULL terminated character pointers and treating them as strings where the NULL signifies the end. I know that this is fairly common, but can you always count on this to be the case?

There is no string data type in the C language, but there are string literals.

If you put a string literal in your program, it will usually be NUL terminated (but see the special case discussed in the comments below). That is to say, if you put "foobar" in a place where a const char * value is expected, the compiler will emit foobar⊘ to the const/code segment/section of your program, and the value of the expression will be a pointer to the address where it stored the f character. (Note: I am using ⊘ to signify the NUL byte.)

The only other sense in which the C language has strings is that it has some standard library routines that operate on NUL-terminated character sequences. Those library routines will not exist in a bare-metal environment unless you port them yourself.

They're just code, no different from the code that you yourself write. If you don't break them when you port them, then they will do what they always do (e.g., stop on a NUL).
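For example, on a bare-metal target with no C library you might carry a couple of these routines across yourself. A minimal sketch (hypothetical names, not a drop-in replacement for a real libc):

#include <stddef.h>

/* Count characters up to, but not including, the terminating NUL. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Copy src, including its terminating NUL, into dst; dst must be large enough. */
char *my_strcpy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;                      /* the copy stops only when the NUL has been copied */
    return dst;
}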

As others have mentioned, null terminating of strings is a convention of the C Standard Library. You can handle strings any way you wish if you're not going to use the standard library.

This is true of any operating system with a C compiler, and you can also write C programs that are not run under a true operating system, as you mention in your question. An example would be the controller for an ink-jet printer I designed once. In embedded systems, the memory overhead of an operating system may not be necessary.

In memory-tight situations, I would look at the characteristics of my compiler vis-a-vis the instruction set of the processor, for example. In an application where strings are processed a lot, it might be desirable to use descriptors such as string length. I'm thinking of a case where the CPU is particularly efficient at working with short offsets and/or relative offsets with address registers.
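As a rough illustration of the descriptor idea (the layout here is purely hypothetical, not modelled on any particular compiler or OS):

#include <string.h>

/* A simple string descriptor: the length travels with the data,
 * so routines never have to scan for a terminator. */
struct str_desc {
    unsigned short len;   /* number of valid bytes in buf */
    char          *buf;   /* not necessarily NUL-terminated */
};

/* Compare two descriptors without touching anything past their stated lengths. */
static int desc_equal(const struct str_desc *a, const struct str_desc *b)
{
    return a->len == b->len && memcmp(a->buf, b->buf, a->len) == 0;
}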

So which is more important in your application: code size and efficiency, or compatibility with an OS or Library? Another consideration might be maintainability. The further you stray from convention, the harder it will be for someone else to maintain.

Others have addressed the issue that in C, strings are largely what you make of them. But there seems to be some confusion in your question w.r.t. the terminator itself, and from one perspective, this could be what someone in your position is worried about.

C strings are null-terminated. That is, they are terminated by the null character, NUL. They are not terminated by the null pointer NULL, which is a completely different kind of value with a completely different purpose.

NUL is guaranteed to have the integer value zero. Within the string, it will also have the size of the underlying character type, which will usually be 1.

NULL is not guaranteed to have an integer type at all. NULL is intended for use in a pointer context, and is generally expected to have a pointer type, which shouldn't convert to a character or integer if your compiler is any good. While the definition of NULL involves the glyph 0, it is not guaranteed to actually have that value[1], and unless your compiler implements the constant as a one-character #define (many don't, because NULL really shouldn't be meaningful in a non-pointer context), the expanded code is therefore not guaranteed to actually involve a zero value (even though it confusingly does involve a zero glyph).

If NULL is typed, it will also be unlikely to have a size of 1 (or another character size). This may conceivably cause additional problems, although actual character constants don't have character size either for the most part.

Now most people will see this and think, "null pointer as anything other than all-zero-bits? what nonsense" - but assumptions like that are only safe on common platforms like x86. Since you've explicitly mentioned an interest in targeting other platforms, you need to take this issue into account and keep your code free of assumptions about the nature of the relationship between pointers and integers.

Therefore, while C strings are null-terminated, they aren't terminated by NULL, but by NUL (usually written '\0'). Code which explicitly uses NULL as a string terminator will work on platforms with a straightforward address structure, and will even compile with many compilers, but it's absolutely not correct C.
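To make the distinction concrete, here is a small sketch: the scan below stops on the character constant '\0' (NUL); writing s[n] != NULL instead would compare a character against a null pointer constant, which many compilers reject or warn about, and which is not correct, portable C even where it happens to build.

#include <stddef.h>

/* Correct: terminate the scan on the NUL character, written '\0'. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')   /* NOT: s[n] != NULL -- NULL is a null *pointer* constant */
        n++;
    return n;
}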


[1] the actual null pointer value is inserted by the compiler when it reads a 0 token in a context where it would be converted to a pointer type. This is not a conversion from the integer value 0, and is not guaranteed to hold if anything other than the token 0 itself is used, such as a dynamic value from a variable; the conversion is also not reversible, and a null pointer doesn't have to yield the value 0 when converted to an integer.

I have been using strings in C: a sequence of characters ending in a null terminator is what is called a string.

You won't have any issues whether you use them on bare metal or under any operating system such as Windows, Linux, or an RTOS (FreeRTOS, OSE).

In the embedded world, null termination actually helps when tokenizing a character buffer into strings.

I've been using strings in C like that in many safety-critical systems.

You might be wondering: what actually is a string in C?

There are C-style strings, which are arrays of characters, and there are also string literals, such as "this". In reality, both of these string types are merely collections of characters sitting next to each other in memory.

Whenever you write a string enclosed in double quotes, C automatically creates an array of characters containing that string, terminated by the \0 character.

For example, you can declare and define an array of characters, and initialize it with a string constant:

char string[] = "Hello cruel world!";

Straightforward answer: you don't really need to worry about using null-terminated character arrays; this works independently of any platform.
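Building on the declaration above, a small sketch of what the compiler stores and how the array is normally walked (the 19-byte size assumes the literal shown earlier):

#include <stdio.h>

int main(void)
{
    char string[] = "Hello cruel world!";

    /* The array holds the 18 visible characters plus the terminating '\0'. */
    printf("array size: %zu bytes\n", sizeof string);   /* prints 19 */

    /* Walking the string the usual way: stop at the terminator. */
    for (const char *p = string; *p != '\0'; p++)
        putchar(*p);
    putchar('\n');

    return 0;
}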

As others have said, null termination is pretty much universal for standard C. But (as others have also pointed out) not 100%. For (another) example, the VMS operating system typically used what it called "string descriptors" (http://h41379.www4.hpe.com/commercial/c/docs/5492p012.html), accessed in C via #include <descrip.h>.

Application-level stuff can use null termination or not, however the developer sees fit. But low-level VMS stuff absolutely requires descriptors, which don't use null termination at all (see above link for details). This is largely so that all languages (C, assembly, etc) which directly use VMS internals can have a common interface with them.

So if you're anticipating any kind of similar situation, you might want to be somewhat more careful than "universal null termination" suggests is necessary. For my own application-level stuff it's safe to assume null termination, but I wouldn't assume the same level of safety in your position. Your code might well have to interface with assembly and/or other-language code at some future point, and that code may not always conform to the C convention of null-terminated strings.

In my experience of embedded, safety-critical and real-time systems, it is not uncommon to use both the C and Pascal string conventions at once, i.e. to supply the string's length in the first byte (which limits the length to 255) and to end the string with at least one 0x00 (NUL), which reduces the usable size to 254.

One reason for this is to know how much data to expect after the first byte has been received. Another is that, in such systems, dynamic buffer sizes are avoided where possible: allocating a fixed 256-byte buffer is faster and safer (no need to check whether malloc failed). Yet another is that the other systems you are communicating with may not be written in ANSI C.
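A sketch of that hybrid convention, assuming a fixed 256-byte frame whose first byte is the payload length (the names and framing details here are illustrative, not from any particular protocol):

#include <stdint.h>
#include <string.h>

#define FRAME_SIZE  256u
#define MAX_PAYLOAD 254u   /* 1 length byte + up to 254 data bytes + at least one NUL */

/* Pack a C string into a fixed frame: [len][data...][NUL][zero padding...].
 * Returns 0 on success, -1 if the string is too long for the frame. */
int pack_frame(uint8_t frame[FRAME_SIZE], const char *s)
{
    size_t len = strlen(s);
    if (len > MAX_PAYLOAD)
        return -1;

    memset(frame, 0, FRAME_SIZE);   /* zero padding doubles as the NUL terminator */
    frame[0] = (uint8_t)len;        /* Pascal-style length prefix */
    memcpy(&frame[1], s, len);      /* payload; frame[1 + len] stays 0 (NUL) */
    return 0;
}

/* The receiver can use either convention: the length byte says how many data
 * bytes to expect, and the payload also works as a plain C string. */
const char *frame_payload(const uint8_t frame[FRAME_SIZE])
{
    return (const char *)&frame[1];
}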

In any embedded work it is important to establish and maintain an Interface Control Document (ICD) that defines all of your communication structures, including string formats, endianness, integer sizes, etc., as soon as possible (ideally before starting). It should be your, and the whole team's, holy book when writing the system: if someone wishes to introduce a new structure or format, it must be documented there first and everybody who might be impacted informed, possibly with an option to veto the change.

Licensed under: CC-BY-SA with attribution