문제

Is there a script available for post processing some objdump --disassemble output to annotate with cycle counts? Especially for the ARM family. Most of the time this would only be a pattern match with a table lookup for the count. I guess annotations like +5M for five memory cycles might be needed. Perl, python, bash, C, etc are fine. I think this can be done generically, but I am interested in the ARM, which has an orthogonal instruction set. Here is a thread on the 68HC11 doing the same thing. The script would need an CPU model option to select the appropriate cycle counts; I think these counts already exist in the gcc machine description.

I don't think there is an objdump switch for this, but RTFM would be great.

Edit: To clarify, assumptions such as best case memory sub-system as will be the case when the code executes from cache are fine. The goal is not a 100% accurate cycle count as per some running machine. It is possible to get a reasonable estimate, otherwise compiler design would be impossible.

As DWelch points out, a simple running total is not possible with deep pipelined architecture, like more recent Cortex chips. The objdump post processing would have to look at surrounding opcodes. A gcc plug-in is more likely to be able to accomplish this and as that is new (4.5+), I don't think such a thing exists. A script for the ARM926 is certainly possible and fairly simple.

The memory latency doesn't matter. The memory controller is like another CPU. It is doing it's business while the CPU is doing arithmetic, etc. A good/well tuned algorithm will parallel the memory accesses with the computations. By counting loads/store and cycles you can determine how much parallelism is accomplished, when you actively profile with a timer. The pipeline is significant due to interlocks between registers, but a cycle count for basic blocks can reliably be calculated and used even on modern ARM processors; this is too complex for a simple script.

도움이 되었습니까?

해결책

There is an online tool which estimates cycle counts on Cortex-A8. However, this CPU is quite old, and programs optimized for it might be suboptimal on newer CPUs.

AFAIK ARM also provides Cortex-A9 and Cortex-A5 cycle-accurate emulators in their RVDS software, but it is quite expensive.

다른 팁

Cycle counts are not something that can be assessed by looking at the instruction alone on a modern high end ARM. There is a lot of runtime state that affects the real world retirement rate of an instruction. Does the data it needs exist in the cache? Does the instruction have any dependencies on previous instruction results? If so, what latencies does the forwarding unit remove? How full is the load/store buffer? What kind of memory mapping is it touching? How full are the processor pipelines that this instruction needs? Are there synchronizing instructions in the stream? Has speculation brought forward some data it depends on? What is the state of the register renamer? Have conditional instructions been filling the pipeline or was the decoder smart enough to skip them completely? What are the ratios between the core clock and the bus and memory clocks? What's the size of the branch prediction table?

Without a full processor simulation all you can get are guesses. Whether those numbers are meaningful to you depends on what you are trying to accomplish with them.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top