c++ - How to structure data for optimal speed in a CUDA app
I am trying to write a simple particle system that uses CUDA to update the particle positions. Right now I define a particle as an object with a position made of three float values and a velocity also made of three float values. When updating the particles, I add a constant value to the Y component of the velocity to simulate gravity, then add the velocity to the current position to get the new position. In terms of memory management, is it better to keep two separate arrays of floats to store the data, or to structure it in an object-oriented way? Something like this:
struct Vector { float x, y, z; };
struct Particle { Vector position; Vector velocity; };
It seems that the data size is the same with either method (4 bytes per float, 3 floats per vector, 2 vectors per particle, 24 bytes in total). The OO approach looks like it would allow more efficient data transfer between the CPU and the GPU, because I could use one memory copy statement instead of two (and more in the long run, since other bits of information about the particles will become relevant, such as age, lifetime, weight/mass, temperature, etc.). There is also the readability of the code and the ease of working with it, which inclines me toward the OO approach. But the examples I have seen do not use structured data, so I wonder if there is a reason.
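The size arithmetic above can be checked at compile time. This is a minimal host-side C++ sketch, not CUDA code: `ParticlesSoA`, `makeSoA`, `aosBytes`, and `soaBytes` are hypothetical names introduced for illustration, and the 24-byte figure assumes 4-byte floats and no padding, which holds on common platforms. The separate-arrays layout is shown with one array per component.

```cpp
#include <cstddef>
#include <vector>

struct Vector { float x, y, z; };
struct Particle { Vector position; Vector velocity; };

// 3 floats per Vector, 2 Vectors per Particle (assumes no padding is added).
static_assert(sizeof(Vector) == 3 * sizeof(float), "Vector has padding");
static_assert(sizeof(Particle) == 2 * sizeof(Vector), "Particle has padding");

// Hypothetical separate-arrays layout: one array per component.
struct ParticlesSoA {
    std::vector<float> px, py, pz;  // positions
    std::vector<float> vx, vy, vz;  // velocities
};

// Build a zero-initialized separate-arrays container for n particles.
ParticlesSoA makeSoA(std::size_t n) {
    ParticlesSoA p;
    p.px.resize(n); p.py.resize(n); p.pz.resize(n);
    p.vx.resize(n); p.vy.resize(n); p.vz.resize(n);
    return p;
}

// Bytes to transfer for n particles stored as one contiguous array of Particle.
std::size_t aosBytes(std::size_t n) { return n * sizeof(Particle); }

// Bytes to transfer for the separate arrays (one copy per array, six here).
std::size_t soaBytes(const ParticlesSoA& p) {
    return (p.px.size() + p.py.size() + p.pz.size() +
            p.vx.size() + p.vy.size() + p.vz.size()) * sizeof(float);
}
```

Under these assumptions both layouts move the same number of bytes; only the number of copy calls differs.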
So the question is which is better: individual arrays of data or structured objects?
In data-parallel programming it is common to talk about "Array of Structs" (AOS) versus "Struct of Arrays" (SOA). Your structured-object example is AOS, and the separate-arrays approach is SOA.
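To make the two layouts concrete, here is a small host-side C++ sketch of the update step from the question written both ways. It is illustrative only: `ParticlesSoA`, `updateAoS`, `updateSoA`, and the `gravityStep` parameter are hypothetical names, and scaling the velocity by `dt` is an assumption borrowed from the kernel snippets in this answer.

```cpp
#include <cstddef>
#include <vector>

// Array of Structs (AOS): one struct per particle.
struct Vector { float x, y, z; };
struct Particle { Vector position; Vector velocity; };

// Struct of Arrays (SOA): one array per component, all of equal length.
struct ParticlesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

// Add a constant to velocity.y (gravity), then advance position by velocity * dt.
void updateAoS(std::vector<Particle>& ps, float gravityStep, float dt) {
    for (Particle& p : ps) {
        p.velocity.y += gravityStep;
        p.position.x += p.velocity.x * dt;
        p.position.y += p.velocity.y * dt;
        p.position.z += p.velocity.z * dt;
    }
}

// Same arithmetic, but each component lives in its own contiguous array.
void updateSoA(ParticlesSoA& ps, float gravityStep, float dt) {
    for (std::size_t i = 0; i < ps.px.size(); ++i) {
        ps.vy[i] += gravityStep;
        ps.px[i] += ps.vx[i] * dt;
        ps.py[i] += ps.vy[i] * dt;
        ps.pz[i] += ps.vz[i] * dt;
    }
}
```

Both functions compute the same result; the difference is only where neighbouring particles' `x` values sit in memory.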
In GPU programming, the reason SOA is generally preferred is that it optimizes accesses to global memory.
The main point is that memory transactions have a minimum size of 32 bytes, and you want to maximize the efficiency of each transaction. For a detailed explanation, see the Advanced CUDA C presentation from last year's GTC. With AOS:
position[base + tid].x = position[base + tid].x + velocity[base + tid].x * dt;
// ^ write to every third address
//                       ^ read from every third address
//                                                ^ read from every third address
With SOA:
position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
// ^ write to consecutive addresses
//                       ^ read from consecutive addresses
//                                                ^ read from consecutive addresses
In the second case, reading from consecutive addresses means that you get 100% efficiency, versus 33% in the first case, where only one float in every three fetched is used. Note that on older GPUs (compute capability 1.0 and 1.1) the situation is much worse (13% efficiency).
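The 33% and 100% figures can be reproduced with a simplified model, sketched here in host-side C++. `transactionEfficiency` is a hypothetical helper, and the model assumes that memory moves in whole aligned 32-byte segments, that each touched segment is transferred once, and that 32 threads access memory together. It does not attempt to model the 13% case on compute capability 1.0 and 1.1.

```cpp
#include <cstddef>
#include <set>

// Fraction of transferred bytes that are actually used when `threads` threads
// each access `accessBytes` bytes and consecutive threads are `strideBytes`
// apart. Simplified model: every touched aligned segment of `segmentBytes`
// is transferred exactly once.
double transactionEfficiency(std::size_t threads, std::size_t accessBytes,
                             std::size_t strideBytes,
                             std::size_t segmentBytes = 32) {
    std::set<std::size_t> segments;  // indices of segments touched by any thread
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t first = (t * strideBytes) / segmentBytes;
        std::size_t last = (t * strideBytes + accessBytes - 1) / segmentBytes;
        for (std::size_t s = first; s <= last; ++s) {
            segments.insert(s);
        }
    }
    double useful = static_cast<double>(threads * accessBytes);
    double moved = static_cast<double>(segments.size() * segmentBytes);
    return useful / moved;
}
```

For example, `transactionEfficiency(32, 4, 12)` models 32 threads each reading the 4-byte `.x` of a 12-byte struct, and `transactionEfficiency(32, 4, 4)` models the same threads reading consecutive floats.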
There is one other possibility: if you had two or four floats in the struct, then you could read the AOS with 100% efficiency:
float4 lpos;
float4 lvel;
lpos = position[base + tid];
lvel = velocity[base + tid];
lpos.x += lvel.x * dt;
// ...
position[base + tid] = lpos;
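The same load, modify, store pattern can be tried on the host. In this rough C++ sketch, `Float4` is a hypothetical stand-in for CUDA's built-in `float4` (four floats, 16-byte aligned) and `integrate` is an illustrative name. The `w` member is treated as unused padding, which is an assumption about how a three-component vector would be stored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for CUDA's float4: four floats with 16-byte alignment.
struct alignas(16) Float4 { float x, y, z, w; };
static_assert(sizeof(Float4) == 16, "Float4 must be exactly 16 bytes");
static_assert(alignof(Float4) == 16, "Float4 must be 16-byte aligned");

// Load each element whole, update the local copy, then store it back whole,
// mirroring the lpos/lvel pattern above. w is carried along unchanged.
void integrate(std::vector<Float4>& position,
               const std::vector<Float4>& velocity, float dt) {
    assert(velocity.size() >= position.size());
    for (std::size_t i = 0; i < position.size(); ++i) {
        Float4 lpos = position[i];        // one 16-byte load
        const Float4 lvel = velocity[i];  // one 16-byte load
        lpos.x += lvel.x * dt;
        lpos.y += lvel.y * dt;
        lpos.z += lvel.z * dt;
        position[i] = lpos;               // one 16-byte store
    }
}
```

Padding the struct to 16 bytes trades 4 unused bytes per vector for accesses in which every transferred byte belongs to the element being read.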
Again, see the Advanced CUDA C presentation for details.