Why using forward_list with char is much more optimize than using it with long long?

Question 1

forward_list is typically implemented as a simple singly-linked list. The list nodes have a single pointer that designates the following node in the list, and a data field that holds the actual user data. The size of a single node of a forward_list<T> would then be sizeof(void*) + sizeof(T) assuming (1) that all data pointers have the same size, and (2) T does not have a more strict alignment than void*.

list is implemented as a double-linked list. As such, a single node would have a data field and pointers to both the next and the previous list nodes. The size of a list<T> node is thus 2 * sizeof(void*) + sizeof(T) under the same assumptions as above.

It is common for memory allocators to allocate memory in blocks whose size is a multiple of the largest fundamental alignment. There's a tradeoff between providing the granularity of allocations and the amount of tracking overhead. Allocating blocks in multiples of the fundamental alignment is a good balance point and has the added bonus of ensuring that allocations are always properly aligned for any type that is not over-aligned.

Let's make the assumption that your program is running on a 32-bit implementation - sizeof(void*) == 4 - and that the system allocator uses a block size of 8 bytes - sizeof(double). These assumptions are true for typical 32-bit Windows C++ as implemented by MS Visual C++. With those assumptions, the size of various objects are:

forward_list<char> node: sizeof(void*) + sizeof(char) == 5, which the memory allocator will round up to an 8 byte allocation.
forward_list<long long> node: sizeof(void*) + sizeof(long long) == 12, which the memory allocator will round up to an 16 byte allocation.
list<char> node: 2 * sizeof(void*) + sizeof(char) == 9, which the memory allocator will round up to an 16 byte allocation.
list<long long> node: 2 * sizeof(void*) + sizeof(long long) == 16, which the memory allocator won't round since 16 is already a multiple of 8.

So of the types in the OP, forward_list<char> allocates 8 bytes per element and all of list<char>, list<long long>, and forward_list<long long> allocate 16 bytes per element. This is one likely explanation for the memory usage observations.

Question 2

First thing to note is that this is hardly "much better". The difference is entirely marginal.

Beyond that, I think it's fairly obvious that copying 8 bytes is going to take slightly longer than copying one; plus, there is no read from register/memory required for your constant 'a', whose value can be hard-coded into your executable.