Extract function names and their comments from C code with python (to understand the Linux kernel)

https://stackoverflow.com/questions/11014490

14-06-2021

Question

Background Information

I've just started learning about drivers and the Linux kernel. I want to understand how a user-space write() and read() work, so I started using ftrace, hoping to see the path the calls take through the kernel. But the trace from even a single small program like the following is enormous.

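#define _GNU_SOURCE   /* needed for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int w;
    char buffer[] = "test string mit 512 byte";
    int fd = open("/dev/sdd", O_DIRECT | O_RDWR | O_SYNC);
    w = write(fd, buffer, sizeof(buffer));
    close(fd);
    return 0;
}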

I also don't know which functions I could filter, because I don't know the Linux Kernel and I don't want to throw something important away.

So I've started to work through a function_graph trace. Here is a snippet.

 [...]
 12)   0.468 us    |            .down_write();
 12)   0.548 us    |            .do_brk();
 12)   0.472 us    |            .up_write();
 12)   0.762 us    |            .kfree();
 12)   0.472 us    |            .fput();
 [...]

I saw .down_write() and .up_write() and thought this was exactly what I was looking for, so I looked them up. The source code of down_write():

 /*
 * lock for writing
 */
 void __sched down_write(struct rw_semaphore *sem)
 {
       might_sleep();
       rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

       LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }

But it turned out that these just acquire and release locks. Then I started writing a small reference for myself so I don't have to look this stuff up every time, because it feels like there are over 9000 of these functions. Then I had an idea: why not parse these functions and their comments, and append the comments to the functions in the trace file? Like this:

 [...]
 12)   0.468 us    |            .down_write(); lock for writing
 12)   0.548 us    |            .do_brk(); 
 12)   0.472 us    |            .up_write(); release a write lock
 12)   0.762 us    |            .kfree();
 12)   0.472 us    |            .fput();
 [...]
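The annotation step itself could be sketched roughly as follows in Python. This is only a minimal illustration: the `SUMMARIES` table, the `annotate` helper, and the regular expression are assumptions made for this example, and the trace lines are the ones quoted above.

```python
import re

# Hypothetical lookup table: function name -> one-line summary.
# The two summaries are the ones quoted in the question.
SUMMARIES = {
    "down_write": "lock for writing",
    "up_write": "release a write lock",
}

# Matches the function name in a function_graph line such as
# " 12)   0.468 us    |            .down_write();"
# The optional leading dot is kept out of the captured name.
CALL_RE = re.compile(r"\|\s*\.?(?P<name>\w+)\(\)")


def annotate(lines, summaries):
    """Return the trace lines with a summary appended where one is known."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        match = CALL_RE.search(line)
        summary = summaries.get(match.group("name")) if match else None
        out.append(f"{line} {summary}" if summary else line)
    return out


trace = [
    " 12)   0.468 us    |            .down_write();",
    " 12)   0.548 us    |            .do_brk();",
    " 12)   0.472 us    |            .up_write();",
]
for annotated in annotate(trace, SUMMARIES):
    print(annotated)
```

Lines whose function has no entry in the table (here `.do_brk()`) are passed through unchanged.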

The main Problem

So I've started to think about how I can achieve this. I would like to do it with Python, because I feel most comfortable with it.

1. Problem
To match the C functions and comments, I have to define and implement a recursive matching grammar :(

2. Problem
Some functions are just wrappers and have no comments. For example, do_brk() wraps __do_brk(), and the comment sits only above __do_brk().
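As a rough illustration of both problems, here is a heuristic sketch in Python. Because C block comments do not nest, a regular expression can pair a comment with the function header that directly follows it without a full recursive grammar, although it will miss or mis-match plenty of real kernel code (macros, comments inside strings, function pointers in argument lists). All names here (`FUNC_WITH_COMMENT`, `extract_comments`, `lookup`) are made up for this example, and the `__` prefix fallback is only a guess based on the do_brk()/__do_brk() case.

```python
import re

# A block comment followed directly by something that looks like a
# function definition: "<type and qualifiers> name(args) {".
# This is a heuristic, not a C parser.
FUNC_WITH_COMMENT = re.compile(
    r"/\*(?P<comment>(?:(?!\*/).)*)\*/\s*"  # one block comment, never crossing "*/"
    r"(?P<header>[A-Za-z_][\w\s\*]*?)"      # return type and qualifiers
    r"\b(?P<name>[A-Za-z_]\w*)\s*"          # function name
    r"\((?P<args>[^()]*)\)\s*\{",           # argument list, then opening brace
    re.DOTALL,
)

# Words that can precede "(...) {" without being a function name.
C_KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def clean(comment):
    """Strip the leading ' * ' decoration and join the lines."""
    lines = (re.sub(r"^\s*\*+\s?", "", ln).strip() for ln in comment.splitlines())
    return " ".join(ln for ln in lines if ln)


def extract_comments(source):
    """Map function name -> cleaned comment found directly above it."""
    comments = {}
    for m in FUNC_WITH_COMMENT.finditer(source):
        name = m.group("name")
        if name not in C_KEYWORDS:
            comments[name] = clean(m.group("comment"))
    return comments


def lookup(name, comments):
    """Fall back to the double-underscore variant for bare wrappers (a guess)."""
    return comments.get(name) or comments.get("__" + name)


source = """
/*
 * lock for writing
 */
void __sched down_write(struct rw_semaphore *sem)
{
       might_sleep();
}
"""
print(extract_comments(source))
```

A real solution would more likely build on an existing C parser or on the kernel's own documentation tooling; the regular expression above is only meant to show the shape of the problem.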

So I thought that maybe there are other sources for the comments. Maybe docs? It's also possible that somebody has already implemented this kind of "doc generation" with Python.

Or is my way of trying to understand a system read()/write() misguided? Can you give me tips on how I should dig deeper?

Thank you very much for reading,
Fabian

Solution

Parsing comments is quite hard in practice, and parsing kernel code is not especially easy.

First, you should understand precisely what a system call is in the Linux kernel, and how applications use them. The Linux Assembly HowTo has good explanations.

Then, you should understand the organization of the Linux kernel. I strongly suggest reading some good books on this.

Exploring the kernel source code with automatic tools is a large amount of work (months, not days). You might consider the Coccinelle tool (for so-called "semantic patches"). You could also consider customizing the GCC compiler with plugins or, better yet, with MELT extensions.

(MELT is a high-level domain-specific language to extend GCC; I am its main designer & implementor.)

If working with GCC, you'll get all the power of GCC internal representations and processing in the middle-end (but at this stage comments are lost).

What you are trying to do is probably much more ambitious than what you initially thought. See also Alexandre Lissy's work, e.g. model-checking the Linux kernel, and the papers he will present at Linux Symposium 2012 (July 2012).

OTHER TIPS

Yes, your approach is right: when learning the kernel, always start with the system calls. Kernel code is code executed with higher privileges than normal code. Intel CPUs have a set of privileged instructions (about 18, though I'm not sure of the exact number).

http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection/

http://en.wikipedia.org/wiki/Privilege_level

When user code transitions from a lower privilege level to a higher one to execute these special instructions, it passes through a standard system call mechanism.

By running a simple "strace ls" you can see many system calls being executed, each of which necessarily makes a transition into the kernel to perform some task.

Writing a simple script like the following (kernel version dependent; for your specific kernel, see /sys/kernel/debug/tracing/README):

echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/current_tracer

echo 1 > /sys/kernel/debug/tracing/tracing_on
ls /tmp
cat /sys/kernel/debug/tracing/trace 
echo 0 > /sys/kernel/debug/tracing/tracing_on

We will get the following output (after deleting all the non-ls ftrace output):

http://pastebin.com/vEk2NrDQ

Now the ftrace output above shows the actual kernel functions executed when "ls" is run from userspace. Not every function needs to be understood or is important, nor is this about learning the kernel APIs per se. More important are the many concepts: how resources are shared among different CPUs and different processes, the different types of synchronization primitives, etc.
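The "delete all the non-ls output" step can be done by hand, or with a small filter such as the Python sketch below. It assumes the usual line layout of the "function" tracer, `<comm>-<pid>  [<cpu>] <flags> <timestamp>: <function> <-<caller>`; check the header printed at the top of the trace file on your kernel, since the layout may differ. The names `LINE_RE` and `calls_by_command` are made up for this example.

```python
import re

# Assumed layout of one "function" tracer line:
#   <comm>-<pid>  [<cpu>] <flags> <timestamp>: <function> <-<caller>
# Header lines starting with "#" do not match and are skipped.
LINE_RE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\].*?:\s+"
    r"(?P<func>\S+)\s+<-(?P<caller>\S+)"
)


def calls_by_command(lines, comm):
    """Yield (function, caller) pairs for trace lines emitted by `comm`."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("comm") == comm:
            yield m.group("func"), m.group("caller")
```

For example, passing an open handle on /sys/kernel/debug/tracing/trace (readable as root) together with "ls" would yield only the calls made on behalf of the ls process.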

Enjoy every small step, one at a time.

Licensed under: CC-BY-SA with attribution