Question

I'm observing a strange behavior and would like to know if it is Intel Xeon Phi related or not.

I have a small example code: basically the matrix multiplication everyone knows (three nested for loops). I offload the computation to an Intel MIC with an OpenMP 4.0 target pragma and map the three matrices with map(to:A,B) map(tofrom:C).
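For context, a minimal sketch of such an offloaded kernel might look like the following (the function name, row-major layout, and loop structure are illustrative assumptions, not the asker's actual code; without an offloading compiler the pragmas are ignored and the loops run on the host):

```c
#include <stddef.h>

/* Hypothetical sketch: A, B, C are row-major n*n matrices.
 * The target pragma maps A and B to the device and copies C both ways. */
void matmul_offload(size_t n, const float *A, const float *B, float *C)
{
    #pragma omp target map(to: A[0:n*n], B[0:n*n]) map(tofrom: C[0:n*n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = C[i*n + j];          /* accumulate into C */
            for (size_t k = 0; k < n; ++k)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```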

Now, what I am observing is that for small matrices, e.g. 1024x1024, the memory transfer takes extremely long. Compared to the native version (same code, same parallelisation strategy, just no offloading), the offload version takes about 320 ms longer. I did a warm-up run of the code to remove the initialization overhead.

Compared to an Nvidia Tesla K20, where the same amount of memory is copied without any noticeable delay, these 320 ms are very bad.

Are there some environment settings that may improve the memory transfer speed?

An additional question: I enabled offload reporting via the OFFLOAD_REPORT environment variable. What is the difference between the two timing results shown in the report:

[Offload] [HOST]  [Tag 5] [CPU Time]        26.995279(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data]   3221225480 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time]        16.859548(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data]   1073741824 (bytes)

What accounts for the roughly 10 seconds missing from the MIC Time (memory transfer?)

And a third question: is it possible to use pinned memory with Intel MICs? If yes, how?


Solution

It is possibly the memory allocation on the MIC that is taking the time. Try to separate the three sources of overhead to better understand where the time goes:

// Device initialization (starts the offload runtime)
#pragma offload_transfer target(mic)
...
// Memory allocation and first data transfer.
// Expect overhead proportional to the amount of memory allocated;
// doing at least one transfer will speed up subsequent transfers.
#pragma offload_transfer target(mic) in(p[0:SIZE] : alloc_if(1) free_if(0))
...
// This subsequent transfer should be faster,
// approaching 6 GiB/s for large sizes.
#pragma offload_transfer target(mic) in(p[0:SIZE] : alloc_if(0) free_if(0))

OTHER TIPS

Since you said "I did a warm-up run of the code to remove initialization overhead", I assume you started the offload runtime by offloading a dummy section. I remember there is a setting to start it on first offload (the default) or at program initialization time (OFFLOAD_INIT=on_start). Anyhow, there is also a fast path in the DMA engine. The fast path is taken when the buffers to be transferred are aligned to the page size. For an offload application, you can simply set an environment variable whose value is a threshold of the form integer[B|K|M|G|T], where M means megabytes (e.g., MIC_USE_2MB_BUFFERS=2M). This threshold defines how large a buffer must be before huge pages are used for it. So you get two things: huge pages and faster transfers! This feature is still meaningful even with transparent huge pages (THP) being introduced on the coprocessor.

After simply trying OFFLOAD_INIT=on_start and MIC_USE_2MB_BUFFERS=0, you may want to align the buffers on the host side accordingly (to the maximum of vector width and page size ;-). Remember, without additional offload clauses (in LEO at least; I am not sure about OpenMP 4.0), the alignment of the host buffer is simply inherited by an offload section. Aligning to 2 MB should cover everything (but you can make your allocation much smarter to avoid wasting resources on small buffers). With that you should have enough keywords to find more background if you need it.
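As a sketch, a host-side buffer can be aligned to 2 MiB with a standard POSIX call (the helper name is made up; posix_memalign is the real POSIX function, and the buffer is released with free()):

```c
#include <stdlib.h>

#define TWO_MB (2u * 1024 * 1024)

/* Allocate nbytes aligned to a 2 MiB boundary, so that (combined with
 * MIC_USE_2MB_BUFFERS) the offload runtime can use huge pages and the
 * DMA fast path. Returns NULL on failure; release with free(). */
void *alloc_2mb_aligned(size_t nbytes)
{
    void *p = NULL;
    if (posix_memalign(&p, TWO_MB, nbytes) != 0)
        return NULL;
    return p;
}
```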

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow