CUDA | Reddit by /u/macedoniancunt - 3d ago

I was wondering if anyone knows which library (free or paid) is the fastest for sorting 32-bit integers. Currently I'm using DeviceRadixSort from CUB, which is pretty fast, but since it's still the slowest part of my algorithm I was wondering if anyone knows a faster library.

When I try to search for the fastest parallel sorting algorithm, I find a lot of papers and studies but no actual implementations. Perhaps the algorithms described in those papers are only faster in theory and not in practice?

The dataset I'm sorting is about 2 GB, so sorting in shared memory is sadly not possible.
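
For reference, a minimal sketch of the CUB call pattern described above (the wrapper function and buffer names are placeholders): cub::DeviceRadixSort::SortKeys follows CUB's two-phase convention, where the first call only computes the required temporary storage size.

    #include <cub/cub.cuh>

    // Sort num_items 32-bit keys already resident on the device.
    void sortKeys32(unsigned int *d_keys_in, unsigned int *d_keys_out, int num_items)
    {
        void *d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;
        // First call: query the temporary storage size (no sorting happens).
        cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys_in, d_keys_out, num_items);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        // Second call: perform the sort.
        cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys_in, d_keys_out, num_items);
        cudaFree(d_temp_storage);
    }

Radix sort at this scale is largely memory-bandwidth bound, which is typically why alternative implementations struggle to beat it in practice.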

submitted by /u/macedoniancunt

I have a loop that schedules GPU work to process one frame of data at a time. Some of the D2H copies use events to signal that the work is done. Since the events have a lifetime beyond the scheduling, each loop iteration pushes the event pointer into a std::deque. If the loop runs for a long time, this deque keeps growing without limit. I'd like to pop elements once their events are done. I thought of using a callback, but we cannot call CUDA APIs from callbacks (so no cuEventDestroy). Another way would be to have a separate host thread waiting on the events and using a mutex to pop elements from the deque. Do you know of other, maybe simpler, ways of accomplishing this?
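
One simpler pattern (a minimal sketch, using the runtime API rather than the driver-API cuEventDestroy mentioned above) is to poll the oldest event non-blockingly once per scheduling iteration. Events recorded on the same stream complete in order, so once cudaEventQuery reports cudaErrorNotReady for the front of the deque, everything behind it is still pending too:

    #include <cuda_runtime.h>
    #include <deque>

    std::deque<cudaEvent_t> pending; // oldest event at the front

    // Call once per loop iteration; never blocks.
    void reapCompletedEvents()
    {
        while (!pending.empty() &&
               cudaEventQuery(pending.front()) == cudaSuccess) {
            cudaEventDestroy(pending.front());
            pending.pop_front();
        }
    }

Because everything stays on the scheduling thread, no mutex or extra host thread is needed.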

submitted by /u/lknvsdlkvnsdovnsfi

I'm trying to figure out how to code a simulation of a glass sphere with optic channels engraved in it. How can I trace rays through the model? It has a similar construction to this experiment: https://www.youtube.com/watch?v=JJB3q0K_TlY
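
The core operation for the glass is refracting each ray at every glass/air interface with Snell's law; a minimal sketch (GLM-style vectors, with dir and n assumed normalized and n facing the incident side):

    #include <glm/glm.hpp>

    // eta = n_incident / n_transmitted (about 1/1.5 entering glass from air).
    // Returns false on total internal reflection.
    bool refract(const glm::vec3 &dir, const glm::vec3 &n, float eta, glm::vec3 &refracted)
    {
        float cosI = -glm::dot(dir, n);
        float sinT2 = eta * eta * (1.0f - cosI * cosI);
        if (sinT2 > 1.0f) return false; // total internal reflection
        refracted = eta * dir + (eta * cosI - sqrtf(1.0f - sinT2)) * n;
        return true;
    }

Tracing through the model would then be: refract on entering the sphere, bounce along the engraved channels (the total-internal-reflection case is what makes them act as light guides), and refract again on exit.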

submitted by /u/Volkerborg668
CUDA | Reddit by /u/camatta_ - 2w ago

I've spent a lot of time writing code that uses Thrust device_vectors inside global and device functions, only to discover (once there were no more errors) that you can only use them inside host functions. Am I understanding this wrong, or is it really like this? And if it is, what use does this even have?
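
It really is like this: thrust::device_vector is a host-side handle that owns memory on the device, so construction, resizing, and iterator access are host operations. The intended use is for host code to manage the storage while kernels receive only the raw pointer. A minimal sketch:

    #include <thrust/device_vector.h>

    __global__ void scale(float *data, int n, float k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= k;
    }

    void example()
    {
        thrust::device_vector<float> v(1024, 1.0f);      // host code owns the allocation
        float *raw = thrust::raw_pointer_cast(v.data()); // kernels see only a raw pointer
        scale<<<(1024 + 255) / 256, 256>>>(raw, 1024, 2.0f);
        cudaDeviceSynchronize();
    }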

submitted by /u/Camatta_

Hey, so I am making a ray tracer that renders images to a window using OpenGL, and I am having some trouble.

When I render the image it displays weirdly. Does anyone have any idea what is causing that? Here is my rendering function, where I am writing the calculated pixel values to the shared CUDA/OpenGL resource using surf2Dwrite().

__global__ void render(cudaSurfaceObject_t image, int x, int y, int samples, Camera **camera, SceneObject **scene, curandState *rand_state, float *deviceRes) {
    //coordinates of each thread
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    //if threads try to write outside image buffer
    if ((i >= x) || (j >= y)) return;
    //current pixel
    int pixelIndex = j * x + i;
    curandState local_rand_state = rand_state[pixelIndex];
    glm::vec3 colour;
    for (int s = 0; s < samples; s++) {
        float u = float(i + curand_uniform(&local_rand_state)) / float(x);
        float v = float(j + curand_uniform(&local_rand_state)) / float(y);
        //Cast a ray into the center of the pixel
        Ray ray = (*camera)->getRay(u, v);
        colour += calculateColour(ray, scene, &local_rand_state);
    }
    //average colour of n samples
    colour[0] = colour[0] / samples;
    colour[1] = colour[1] / samples;
    colour[2] = colour[2] / samples;
    //gamma correction
    colour[0] = sqrt(colour[0]);
    colour[1] = sqrt(colour[1]);
    colour[2] = sqrt(colour[2]);
    //write to device resource
    deviceRes[pixelIndex + 0] = colour[0];
    deviceRes[pixelIndex + 1] = colour[1];
    deviceRes[pixelIndex + 2] = colour[2];
    deviceRes[pixelIndex + 3] = 1.0f;
    //put colour in float4 to pass to surf2Dwrite
    float4 cfloat = make_float4(0, 0, 0, 0);
    cfloat.x = colour[0];
    cfloat.y = colour[1];
    cfloat.z = colour[2];
    cfloat.w = 1.0f;
    surf2Dwrite(cfloat, image, (int)i * sizeof(cfloat), j, cudaBoundaryModeClamp);
}

If I run the ray tracer as a standalone application (without windows and OpenGL, just rendering to a PPM file) then it works fine, so I suspect the issue lies in the code after the gamma correction. I can't seem to work out the issue though.
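
One thing worth checking in that region (a guess based only on the snippet above): if deviceRes holds four floats per pixel, the writes need a stride of four, whereas pixelIndex alone indexes a one-value-per-pixel buffer and makes neighbouring pixels overwrite each other:

    // Assuming deviceRes is laid out as 4 floats (RGBA) per pixel:
    int base = pixelIndex * 4;
    deviceRes[base + 0] = colour[0];
    deviceRes[base + 1] = colour[1];
    deviceRes[base + 2] = colour[2];
    deviceRes[base + 3] = 1.0f;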

Could someone lend a hand please?

Thanks!

submitted by /u/PM_Me_Compliments

I am planning to use Keras/TensorFlow to run ML models on my laptop. I have an SSD (128 GB) and an HDD (1 TB). I was wondering whether installing CUDA on the HDD would somehow decrease model performance.

Could anyone point me in the right direction?

submitted by /u/hemangb

Hello

I am currently reading Programming Massively Parallel Processors by David B. Kirk. It states on page 51, in chapter 3.5, "Kernel Functions and Threading":

Because all of these threads execute the same code, CUDA programming is an instance of the well-known single-program, multiple-data (SPMD) parallel programming style [Atallah 1998], a popular programming style for massively parallel computing systems.

Does this mean there is no way at all to have, e.g., 50% of the threads on your GPU execute one function (an addition, for instance) and the other 50% another function (a subtraction)? If I am not mistaken, this is a very big difference compared with FPGAs, where you can do a whole bunch of different things in parallel.

Or is this totally possible but considered bad practice?
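
It is possible: SPMD means all threads run the same program, not that they execute the same instruction, so threads are free to branch on their own index. A minimal sketch (hypothetical kernel):

    __global__ void addOrSub(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i < n / 2)
            out[i] = a[i] + b[i]; // first half of the threads add
        else
            out[i] = a[i] - b[i]; // second half subtract
    }

The caveat, and the reason it is sometimes called bad practice, is that divergence within a single warp of 32 threads is serialized by the hardware; splitting the work at warp or block granularity, or launching two different kernels on separate streams, avoids that cost.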

Thanks

submitted by /u/technical_questions2

Hey guys,

I am working on a real-time ray tracing project using CUDA, OpenGL and GLFW.

My goal is to ray trace a scene and see the results in a window in real time. Any changes to the image (change of light intensity, etc.) should then update in real time.

My program is intended to work like this:

1) CUDA ray traces a scene and writes the pixels to a framebuffer

2) That framebuffer is then sent to OpenGL, which displays it as a textured quad in a GLFW window

3) If I make a change to the image (e.g. press 3 to increase light intensity), CUDA should re-raytrace the image and update the display in the GLFW window

After extensive research I have decided to use CUDA/OpenGL interop, which I believe shares a resource between CUDA and OpenGL. If that is the case, do I essentially declare a resource that is shared between both, write to it in CUDA, and read/display it in OpenGL?

I am following a guide from CUDA by Example, however I am really struggling to apply it to my project. Here is what I have so far, which I believe is not even close to working (I will try my best to only include what is relevant, but let me know if anything is missing).

Window.cu - the application's starting point. It should create a GLFW window, call the CUDA ray tracer, retrieve the pixels from its output, then display the image. It also handles keyboard input.

int main() {
    //cuda set up
    cudaDeviceProp prop;
    int dev;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1; //minimum compute version
    prop.minor = 0;
    checkCudaErrors(cudaChooseDevice(&dev, &prop)); //choose device
    checkCudaErrors(cudaGLSetGLDevice(dev)); //set device

    //GLFW set up goes here
    ...

    //generate buffers to share between cuda and GL
    glGenBuffers(1, &bufferObj);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, bufferObj);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, 3 * 800 * 400, NULL, GL_DYNAMIC_DRAW); //sizing is weird def change

    //register buffer object with opengl
    checkCudaErrors(cudaGraphicsGLRegisterBuffer(&resource, bufferObj, cudaGraphicsMapFlagsNone)); //tell cuda we want to share the opengl buffer

    //map the shared resource
    uchar3 *devPtr;
    size_t size;
    checkCudaErrors(cudaGraphicsMapResources(1, &resource, NULL));
    checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, resource));
    cudaFunction(devPtr);
    checkCudaErrors(cudaGraphicsUnmapResources(1, &resource, NULL));

This is mostly taken from the book I am following.

Raytracer.cu - a C++ function that sets up the ray tracer. It calls a CUDA kernel to calculate the colour of each pixel and stores the colours in a framebuffer (it also writes to a PPM, but I want to remove that in favour of displaying the image in a window).

int cudaFunc(uchar3 *devPtr) {
    //image set up (size, scene etc)
    ...

    //framebuffer
    size_t frameBufferSize = 3 * imageSize * sizeof(float); //3 for each colour channel (RGB)
    glm::vec3 *frameBuffer;
    checkCudaErrors(cudaMallocManaged((void**)&frameBuffer, frameBufferSize));

    //render the image
    render<<<blocks, threads>>>(frameBuffer, xPixels, yPixels, samples, d_camera, d_scene, d_rand_state);
}

__global__ void render(glm::vec3 *frameBuffer, int x, int y, int samples, Camera **camera, SceneObject **scene, curandState *rand_state) {
    //coordinates of each thread
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    //if threads try to write outside image buffer
    if ((i >= x) || (j >= y)) return;
    //current pixel
    int pixelIndex = j * x + i;
    curandState local_rand_state = rand_state[pixelIndex];
    glm::vec3 colour = glm::vec3(0, 0, 0);
    for (int s = 0; s < samples; s++) {
        float u = float(i + curand_uniform(&local_rand_state)) / float(x);
        float v = float(j + curand_uniform(&local_rand_state)) / float(y);
        //Cast a ray into the center of the pixel
        Ray ray = (*camera)->getRay(u, v);
        colour += calculateColour(ray, scene, &local_rand_state);
    }
    rand_state[pixelIndex] = local_rand_state;
    colour /= float(samples);
    colour[0] = sqrt(colour[0]);
    colour[1] = sqrt(colour[1]);
    colour[2] = sqrt(colour[2]);
    frameBuffer[pixelIndex] = colour;
}

So the cudaFunction, which I call from the window file, now contains a glm::vec3 *frameBuffer with all the pixel information that I need. My question is: how do I get that back to OpenGL?

Currently I am passing a uchar3 pointer into the cudaFunction while my framebuffer is a glm::vec3, so I think there is a problem there; however, I would just like to know how to get it back to OpenGL, and whether there is anything else I am doing wrong.
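
One way to close the loop, sketched under the assumptions already in the snippets above (a GL_PIXEL_UNPACK_BUFFER registered via cudaGraphicsGLRegisterBuffer and mapped to a uchar3 pointer sized 3 * 800 * 400), is a small conversion kernel that writes 8-bit pixels straight into the mapped pointer; after unmapping, the still-bound pixel-unpack buffer can be displayed with glDrawPixels:

    // Convert the float framebuffer to one uchar3 per pixel in the mapped interop buffer.
    __global__ void toPixels(const glm::vec3 *frameBuffer, uchar3 *devPtr, int x, int y)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        int j = threadIdx.y + blockIdx.y * blockDim.y;
        if (i >= x || j >= y) return;
        int p = j * x + i;
        glm::vec3 c = glm::clamp(frameBuffer[p], 0.0f, 1.0f);
        devPtr[p] = make_uchar3(c.r * 255.0f, c.g * 255.0f, c.b * 255.0f);
    }

    // Host side, each frame: map, render, convert, unmap, draw.
    // cudaGraphicsMapResources(1, &resource, NULL);
    // ... launch render, then toPixels ...
    // cudaGraphicsUnmapResources(1, &resource, NULL);
    // glDrawPixels(800, 400, GL_RGB, GL_UNSIGNED_BYTE, 0); // 0 = offset into the bound PBO

The uchar3/glm::vec3 mismatch is real: either the render kernel writes uchar3 directly, or a conversion pass like this bridges the two.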

Thanks for your help.

submitted by /u/PM_Me_Compliments

Where can I find the latest version of the CUDA driver source code? I want to compile and install drivers for my MacBook Pro ME294 (Late 2013).

submitted by /u/JamesLinus