Acceleware participated in the 2019 Propel Energy Tech Forum this week as an alumnus presenter. I originally presented at the 2017 Propel forum, and was proud to showcase the significant progress in financing, commercialization, and intellectual property development that we have made over the past two years, including:

  • Securing over $17 million in funding and financing;
  • Completing our 1/20th scale test and launching our commercial scale pilot; and
  • Having two patents granted, twelve filed, and another ten in progress.

The conference brings together energy related start-ups and innovators to give them exposure to venture capital funds as well as strategic investors, E&P operators, and the investment arms of the operators. It’s always interesting to see what new concepts and technologies are being developed and funded, while also having an opportunity to ensure all those investors and potential customers are aware of the progress we have made at Acceleware along with the significant benefits RF XL can deliver. To see the recap of our progress since February 2017, please take a look at the Propel 2019 presentation.

 

 


Normalization is an important method for data pre-processing, especially for machine learning. Why do we need normalization? Normalization changes the distribution of the feature data set and the distances among observation points, and it may accelerate the convergence of deep learning algorithms [1]. As another example, our AxHeat product for RF heating of reservoirs applies normalization to the water saturation, which improves algorithm stability.

In this blog post, we will discuss three fundamental types of normalization. Methods specific to deep learning, i.e. local response normalization, batch normalization, and layer normalization, will be discussed in a future blog post [2].

Zero-mean normalization

Equation:

x = (x - mean) / std

where mean and std are the mean and standard deviation of the input data.

In this case, zero-mean normalization is the same as standardization: the processed data has zero mean and unit variance. For clustering algorithms that use distance to measure similarity, e.g. K-means, zero-mean normalization is a good choice [3].

Min-max normalization

Equation:

x = (x-min)/(max - min)

Min-max normalization is a linear normalization technique. It does not change the shape of the data distribution. However, if the min and max are not stable across input data sets, it may cause instabilities in the results. Min-max normalization is the most common method in image processing, as most pixel values are in the range [0, 255].

Non-linear normalizations

The typical non-linear normalizations include logarithmic, exponential, and inverse trigonometric functions. The choice of non-linear function depends on the distribution of your inputs and the expected distribution of the outputs. The logarithm, for example, has better discernibility for values in the range (0, 1], while the arctangent can take any real number as input and maps it to the range (-π/2, π/2).

 

Let's have a look at the data distributions after applying zero-mean, min-max, log, and arctan normalizations to a standard Cauchy distribution input:

We generate 200 sample points randomly (shown as blue points), which fall in the range [-40.0, 40.0]. After normalization, the data are shrunk into [-10.0, 15.0]. The resulting distributions differ, and there is no absolutely good or bad choice; it depends on your criteria. For instance, if your goal is to minimize the distances among the points, min-max is a good choice. If you expect evenly distributed data with obvious differences, log may be a good idea.
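If you would like to reproduce a comparison like this, the four transforms are only a few lines of code. The sketch below (standalone C++, not the code used to generate the figure) uses a sign-preserving log variant, sign(x)·log(1+|x|), to handle negative inputs, and rescales arctan by 2/π so the output lands in (-1, 1); both of those choices are assumptions on our part.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    // Generate 200 samples from a standard Cauchy distribution.
    std::mt19937 gen(42);
    std::cauchy_distribution<double> cauchy(0.0, 1.0);
    std::vector<double> x(200);
    for (double& v : x) v = cauchy(gen);

    const double minv = *std::min_element(x.begin(), x.end());
    const double maxv = *std::max_element(x.begin(), x.end());
    const double mean = std::accumulate(x.begin(), x.end(), 0.0) / x.size();
    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    const double stddev = std::sqrt(var / x.size());
    const double pi = std::acos(-1.0);

    std::vector<double> zeroMean(x.size()), minMax(x.size()), logNorm(x.size()), arctanNorm(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        zeroMean[i]   = (x[i] - mean) / stddev;                             // standardization
        minMax[i]     = (x[i] - minv) / (maxv - minv);                      // linear rescale to [0, 1]
        logNorm[i]    = std::copysign(std::log1p(std::fabs(x[i])), x[i]);   // sign-preserving log
        arctanNorm[i] = std::atan(x[i]) * 2.0 / pi;                         // maps to (-1, 1)
    }
    return 0;
}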

Lastly, Scikit-learn provides handy tools to compare and visualize normalized results against your inputs [4]. It is a good idea to run your candidate normalization algorithm on a sample data set before applying it to your real data set.

References

[1] https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normali...
[2] http://yeephycho.github.io/2016/08/03/Normalizations-in-neural-networks
[3] http://ai.stanford.edu/~acoates/papers/coatesng_nntot2012.pdf
[4] http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scal...


 

Finite-difference modelling is a common method of solving wave equations and is the foundation for several of Acceleware’s products. One problem with finite-difference methods is that material interfaces that do not align with the rectangular grid are not easily represented. At the very least, naïve implementations may represent a structure different from the one intended. The typical example is a sloping interface between two different materials. If the grid is sampled based on nearest neighbor interpolation, then the model will have a staircased slope, as shown in Figure 1.

Figure 1 - Staircase representation of a flat sloping interface

 

In our electromagnetic products, a combination of sub-gridding and material averaging is used to accurately represent man-made structures, but in the seismic industry, extremely sparse grids are used to reduce the memory required for modelling. Therefore, staircasing errors are a common problem.

One solution I was able to apply originates from Rune Mittet’s work titled “On the Internal Interfaces in Finite Difference Schemes” [1] and some internal conversations with our Chief Science Officer, Michal Okoniewski. The key is to keep in mind the continuous representation of a discrete set of samples. In the case of a sharp material interface, simply jumping from one value to the next results in oscillations known as the Gibbs phenomenon. Instead, we need to represent the jump accurately within the available bandwidth (in the wavenumber domain).

In realistic cases, there is more than one sharp interface, interfaces intersect one another in complicated ways, and so on. Therefore, we assume only that the material parameters are known to much greater accuracy than the computational grid. In that case, a rectangular grid much finer than the computational grid can be generated. It is then filtered in the wavenumber domain such that it can be safely decimated to the computational grid. Doing this results in substantially smaller staircasing errors, as shown in Figures 2 and 3.

Figure 2 - Synthetic modelling of a 2D model composed of sloping reflectors. When the model is filtered to the appropriate bandwidth, the staircasing errors are significantly reduced. Multiple shots similar to this result were used to generate the images in Figure 3.

 

Figure 3 - RTM of the modelled result superimposed on the density model. The upper image uses a staircased model, whereas the bottom one is filtered to the appropriate bandwidth. Note that the middle reflector in particular is represented as a series of flat reflectors at different depths instead of a smooth slope. The circled sections are shown enlarged in Figure 4.

 

Figure 4 - Zoomed in sections from Figure 3, demonstrating the difference between a staircased model and one which is filtered to the appropriate bandwidth.

 

Unfortunately, filtering using discrete Fourier transforms uses an impractical amount of memory, because computational grids already fill a large percentage of available memory and sampling 10 times finer in 3D requires 1000 times as much memory. Instead of filtering in the wavenumber domain, I applied a filter on a finely sampled grid in the spatial domain. Since the desired output is the coarsely sampled computational grid, the finely sampled grid does not need to be kept in memory. Instead, a sliding window of samples is evaluated, filtered, and then decimated. This approach is easily multithreaded, and only requires redundant evaluations where the finely sampled grid is partitioned over threads.
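To make the idea concrete, here is an illustrative 1D sketch of the filter-and-decimate step (not Acceleware's production code): the model is evaluated on a grid "ratio" times finer than the computational grid, convolved with a Hann-windowed sinc cut off at the coarse-grid Nyquist wavenumber, and only every ratio-th sample is kept. Only a sliding window of fine samples is ever touched, mirroring the memory-saving approach described above.

#include <cmath>
#include <cstddef>
#include <vector>

std::vector<double> FilterAndDecimate(const std::vector<double>& fineModel, int ratio)
{
    const int halfWidth = 4 * ratio;              // filter half-width in fine samples
    const double pi = std::acos(-1.0);

    std::vector<double> coarse;
    for (std::size_t c = 0; c * ratio < fineModel.size(); ++c)
    {
        const long center = static_cast<long>(c) * ratio;
        double sum = 0.0, weight = 0.0;
        for (long k = -halfWidth; k <= halfWidth; ++k)
        {
            const long idx = center + k;
            if (idx < 0 || idx >= static_cast<long>(fineModel.size()))
                continue;
            // Hann-windowed sinc, cut off at the coarse-grid Nyquist wavenumber.
            const double t = static_cast<double>(k) / ratio;
            const double sinc = (k == 0) ? 1.0 : std::sin(pi * t) / (pi * t);
            const double hann = 0.5 * (1.0 + std::cos(pi * k / double(halfWidth)));
            const double w = sinc * hann;
            sum += w * fineModel[idx];
            weight += w;
        }
        coarse.push_back(sum / weight);           // band-limited, decimated sample
    }
    return coarse;
}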

In summary, staircasing errors can be significantly reduced by sampling model parameters that have been filtered to the appropriate bandwidth. In order to represent arbitrary models, a finer, but staircased, model is generated, filtered, and then decimated. These 3 steps are combined in order to make the process computationally feasible.


 

 

When a program first reads data from or writes data to disk, the Linux operating system will also attempt to store the data in the page cache, which is allocated from unused portions of system memory. The primary purpose of the page cache is to speed up subsequent accesses to any data that happens to reside in it, instead of repeatedly reading from or writing to disk, which is substantially slower. However, if too much memory is being used by the page cache, this can lead to long delays when a program later needs to allocate large amounts of contiguous memory. In this scenario the operating system needs to clear the appropriate regions of the page cache to make room for the new allocations, which may also involve writing dirty pages to disk. (Dirty pages are pages in the page cache that have been modified since they were read from or written to disk; their data must be flushed to disk before their memory can be freed for reuse.) If your machine has multiple NUMA nodes and a particular node does not have sufficient free memory for an allocation because of high page cache usage, the machine may instead allocate memory on a different NUMA node that does have sufficient free memory! This will lead to substantially decreased memory read/write performance for your misplaced data. (See my previous blog post for more on investigating this issue.)

 

In the scenario where you suspect that the page cache is consuming an excessive amount of memory and wish to confirm it, Linux provides some handy tools.

 

The free command

A simple tool on Linux to look at memory usage including page cache is the free command.  The free command output might look something like this, on a machine with 64 gigabytes of memory: 

The number in the cached column on the far right indicates how much memory is used by the operating system for the page cache and slabs (learn more about slabs at http://www.secretmango.com/jimb/Whitepapers/slabs/slab.html).  However, the number does not break down the usage between page cache and slabs, nor does the free tool break down page cache usage between NUMA nodes.  

 

cat /proc/meminfo

For an even more detailed breakdown on memory usage, you can also try cat /proc/meminfo. An example output is shown below (output varies depending on your specific OS):

A lot of information is displayed here.  The MemTotal, MemFree, Buffers, and Cached fields are fairly self-explanatory and largely mirror the output of the free command.  In particular, the Cached field shows total memory used by the page cache.  Additionally, the Active(file) and Inactive(file) fields further categorize page cache usage: active page cache memory has been used recently and is (usually) not reclaimed until absolutely needed, whereas inactive page cache memory has not been used recently and therefore can be reclaimed without a large performance impact in general.
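If you want to track these fields from a monitoring script or test harness rather than reading them by hand, they can be parsed directly from /proc/meminfo. Here is a minimal C++ sketch (the selection of fields is our own choice):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    // Report the page-cache related fields from /proc/meminfo (values in kB).
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line))
    {
        if (line.rfind("Cached:", 0) == 0 ||
            line.rfind("Active(file):", 0) == 0 ||
            line.rfind("Inactive(file):", 0) == 0 ||
            line.rfind("Dirty:", 0) == 0)
        {
            std::cout << line << "\n";
        }
    }
    return 0;
}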

While using /proc/meminfo provides a very detailed view of memory usage, it does not break usage down per NUMA node.  If you wish to view information per node, then you may want to try…

 

numastat

A better tool for showing detailed memory usage per NUMA node, including page cache usage, is numastat. To show the detailed memory information, pass the -m argument to numastat upon execution:

 

In general, the fields displayed with numastat -m are the same as those in /proc/meminfo, except that the numbers are now displayed per node (curiously, the Cached field is not shown). With this, you can investigate whether a particular NUMA node has a higher page cache allocation than the others.

 

Dropping the page cache

If you find yourself in a situation where too much memory is being consumed by the page cache, a last-resort option is to flush the page cache entirely, which frees up memory for other applications to use. Doing this requires some careful thought, however, as clearing the cache can reduce system performance for other applications! If you have decided that the benefits outweigh the potential consequences, then you can use echo 3 > /proc/sys/vm/drop_caches to clear the entire page cache. Note that this requires root privileges to execute.

 

 


Single Instruction Multiple Data or SIMD execution is a common feature on modern CPUs. SIMD execution seeks to improve instruction throughput of repeated arithmetic operations by processing multiple operations at the same time with specialized vector hardware. Vector hardware is capable of processing many lanes of data at once, cutting down on loop iterations substantially. When paired with work distribution between threads, the throughput gained is even more dramatic.

OpenMP has supported SIMD compilation since version 4.0, published in 2013. The extension allows loops to be compiled for vectorized execution with the simd construct.

A loop can be vectorized by preceding the loop with a #pragma omp simd directive, just as you would designate a multi-threaded loop with #pragma omp parallel for. In fact, you could even do both, creating a loop that splits its iteration space among the threads of a team and has each of those threads process the loop with vectorized instructions. As with parallel loops, designating a loop for vectorization does impose extra restrictions upon the kinds of operations that can occur inside that loop. The most important restriction is that control-flow branching is limited inside SIMD loops; only simple if/else constructs are allowed.

#pragma omp simd
for (int i = 0; i < vec.size(); ++i)
{
    if (vec[i] > 0)
        vec[i] += x;
    else
        vec[i] += y;
}
OpenMP handles a branch like this by masking the vector operation, essentially doing the operation for all lanes of the SIMD unit, but only writing the results for some of them.
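For completeness, the combined form mentioned above looks like this (array names are illustrative); the iteration space is split across the threads of the team, and each thread processes its share with vector instructions:

#pragma omp parallel for simd
for (int i = 0; i < n; ++i)
{
    a[i] = b[i] * c[i];
}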

For loops that are memory-bound, the speed at which concurrently executed loads/stores can complete will depend on the coherence and alignment of the operations. In an optimal situation, each vectorized load/store should not cross any cache line boundaries. Scattered or misaligned memory accesses will result in worse performance. OpenMP does allow you to certify variables' alignment using the aligned clause, which will enable the compiler to use faster aligned load/store instructions within the loop.
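Here is a sketch of the clause syntax, assuming the buffer really was allocated with 64-byte alignment (for aligned_alloc, the size should also be a multiple of the alignment):

#include <stdlib.h>

void scale(int n)
{
    float *a = (float *) aligned_alloc(64, n * sizeof(float));   // 64-byte aligned storage

    // The aligned clause tells the compiler it may use aligned loads/stores.
    #pragma omp simd aligned(a : 64)
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;

    free(a);
}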

Dependencies within the iteration space can also cause problems in SIMD loops, like the below:

#pragma omp simd
for (int i = 0; i < vec.size() - 7; ++i)
{
    vec[i] += vec[i+7];
}

If the vector length is greater than eight, this loop will give unexpected results. OpenMP does allow control over this behaviour through the safelen clause, which sets a maximum safe vector length for the loop.
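Syntactically, the clause goes directly on the directive; safelen(7) promises the compiler that no two iterations more than 7 apart will ever be executed concurrently (treat the exact value here as an illustration keyed to the dependence distance in the loop above):

#pragma omp simd safelen(7)
for (int i = 0; i < vec.size() - 7; ++i)
{
    vec[i] += vec[i+7];
}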

Function calls inside a vectorized loop should also be considered. Normally, calls to functions won’t get vectorized unless the called function is inlined. OpenMP offers a way to fix this, with the declare simd construct. The compiler builds both a scalar and a vector version of declare simd functions, such that a vectorized loop can call the correct version. To be valid, vectorized functions must meet the same requirements as vectorized loops.
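A minimal sketch of the construct (function and variable names are illustrative): the declare simd directive asks the compiler to emit a vector variant of the function so that the loop below can remain vectorized.

#pragma omp declare simd
float add_offset(float v, float offset)
{
    return v + offset;
}

void apply_offset(float *out, const float *in, int n, float offset)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        out[i] = add_offset(in[i], offset);   // calls the vector variant
}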

While most modern compilers support automatic vectorization of loops, detecting dependencies is still a challenge. Using OpenMP gives the control back to the programmer to vectorize loops in these situations. As with parallel constructs, skillful use of SIMD constructs can prove a useful addition to an optimizer’s toolkit.


On NVIDIA CUDA hardware, up to 4 warps (depending on the hardware version) can issue instructions on a streaming multiprocessor (SM) at any given time. Because instructions within a thread execute in order, whenever a warp reaches an unsatisfied dependency (such as waiting for a memory load) the warp is considered stalled. To prevent the SM from idling when a warp is stalled, the scheduler will check all active warps assigned to it and switch to a warp that is ready to execute.

Multiprocessor occupancy is the ratio of active warps assigned to the maximum number of active warps supported on an SM. To hide latency from data dependencies and improve performance, one of the first steps in optimization is to increase occupancy on the SM. Occupancy for a kernel can be limited by block size, shared memory usage, and register usage, as well as by the hardware limits of the device. The overall occupancy is determined by the lowest of these limits.

To calculate the theoretical occupancy of a kernel, we can either do the calculation for each portion by hand, or use one of the occupancy calculators available. NVIDIA provides a calculator in the form of a spreadsheet, available in the tools folder of the CUDA Toolkit. Acceleware provides a handy one on our website, accessible from http://training.acceleware.com/calculator. The calculators allow users to input the hardware they are targeting and the resource usage of their kernel. The calculator displays the active warp limits for each of the categories, as well as charts showing how occupancy may change when certain parameters are adjusted.

The online calculator provided by Acceleware uses the CUDA Toolkit’s standalone occupancy calculation functions from cuda_occupancy.h directly, rather than reimplementing the logic to mimic the behaviour of the worksheet. This means that while the output of our calculator does not exactly match the theoretical values from the worksheet, it more closely reflects the actual behaviour of the CUDA Toolkit. For example, at the time of writing on CUDA 9.2, for compute 6.0 devices the documentation and worksheet suggest the occupancy calculation is based on a warp allocation granularity of 2, but our calculator reflects the actual scheduling granularity of 4. As another example, warp occupancy does not drop to 0 when the shared memory usage of the kernel exceeds the lower shared memory configuration sizes, because the toolkit ignores the limit rather than failing to launch as suggested by the worksheet.

Occupancy Example

To determine your occupancy, you will need to set the following parameters:

  1. The compute capability (CC) of the device.  This is determined by the GPU.  For example, a P100 is a CC 6.0 device. You can look up the compute capability of your device at https://developer.nvidia.com/cuda-gpus.
  2. Threads per block.  This is defined by you during the kernel launch as the second argument in the launch syntax. If using a 2 or 3-dimensional block size, multiply the dimensions together to gain the total number of threads per block.
  3. Registers per thread.  This is generated at compile time for each kernel.  You can determine this number using the NVIDIA profiler or by compiling with verbose output (--ptxas-options=-v).
  4. Shared memory per block. This parameter is defined by you and is either statically or dynamically assigned.

Assume you have a P100 (CC 6.0) with a kernel that uses:

  • 128 threads per block
  • 12 registers per thread
  • 8192 bytes (8 KB) of shared memory

The occupancy calculator suggests that you have an occupancy of 50%.  If the occupancy in this case is not at a sufficient level to hide latency, you could increase the threads per block to 256, 512, or 1024 to achieve 100% occupancy as shown on the curves.
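The same estimate can also be obtained programmatically from the CUDA runtime. Here is a minimal sketch using the example numbers above (the kernel is a stand-in, and the 8 KB is treated as dynamic shared memory for the purpose of the query):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void ExampleKernel(float *data) { }   // stand-in for your kernel

int main()
{
    const int blockSize = 128;          // threads per block
    const size_t sharedBytes = 8192;    // shared memory per block

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, ExampleKernel,
                                                  blockSize, sharedBytes);

    // Register usage comes from the compiled kernel, so results for your real
    // kernel will differ from this stand-in.
    const int activeWarps = blocksPerSm * blockSize / prop.warpSize;
    const int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}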

References

For more information on occupancy, watch our video on optimization techniques at https://www.youtube.com/watch?v=_f31hvfBv4s.

You can also refer to the programmers guide at  https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#multiprocessor-level and check out the occupancy experiment at https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

 


Dan’s previous post “Grid Synchronization with Cooperative Groups” demonstrated the use of the new synchronization model introduced in CUDA 9 by normalizing an input array of integers using only a single CUDA kernel, instead of two separate kernels.  Essentially, the algorithm requires two steps – finding the element with the maximum absolute value, then dividing all elements by this maximum.   With Cooperative Groups, both steps can be combined within a single kernel, with a grid-level synchronization between steps.  Prior to CUDA 9, these two steps needed to be executed in separate kernels.
The single CUDA kernel ended up faster than a conventional pre-CUDA 9 two-pass approach.  Additional speedup was achieved by explicitly caching input elements read during the first step in shared memory and then re-using them during the second step, although this limited the number of elements that can be handled on a single device.
Cooperative Groups aren’t only limited to a single device though – CUDA 9 also introduced multi-device synchronization, which I experimented with to try to understand performance implications.

Download Source Code

Download Multi Device Normalization Results

Implementation

For this experiment, I extended the normalization problem to two GPU devices and wrote seven implementations to compare in terms of performance. The input array of elements can be split across the two GPUs; however, the primary problem now becomes how to coordinate the overall maximum.
The first implementation used a “naïve” two-pass approach: it allocated an integer in global memory on the first GPU to hold the temporary maximum, and had both GPUs update it with the maximum atomically. The standard CUDA atomics are only atomic with respect to the GPU they execute on, therefore the scoped atomic function atomicMax_system() was required to avoid race conditions. The maximum was then passed to the kernel that performed the division. The goal of this implementation was to establish the performance of system-level atomics.
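A sketch of the kernel pattern just described (names are illustrative; the actual code is in the download above):

// Every thread folds its element into a single maximum that lives in global
// memory on the first GPU.  atomicMax_system() makes the update atomic across
// all devices in the system (requires compute capability 6.0 or higher).
__global__ void FindMaxSystemAtomic(const int *in, int n, int *systemMax)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax_system(systemMax, abs(in[i]));
}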

Implementations 2 and 3 were also two-pass, but allocated an integer in global memory for each device. Each GPU atomically updated its own max using the standard atomicMax() function.  Implementation 2 then copied the two maxes back to the host where the overall maximum was found before passing it on to the division kernel.  Implementation 3 modified the division kernel so that the two maxes were compared by a single thread in each thread-block and the overall maximum stored in shared memory, prior to division occurring.  Avoiding atomicMax_system() was the goal for both of these implementations.  Implementation 3 had the additional goal of removing the need for device to host transfers of the maximum.

Implementation 4 allocated an integer in managed memory for each device and relied on the page fault mechanisms to handle transferring the maximums back to the host where the max was found and then passed on to the division kernel.  The objective of this implementation was to benchmark managed memory page fault performance.  To improve performance, advice was provided to the managed memory system regarding where the managed memory should reside, its “mostly read” state, as well as which devices would access it via cudaMemAdvise().
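The advice hints themselves are ordinary runtime calls. Here is a sketch of what the hints for one device might look like (the exact advice used in the downloadable source may differ):

int *AllocateManagedMax(int device)
{
    int *devMax = nullptr;
    cudaMallocManaged(&devMax, sizeof(int));

    // Keep the value resident on the given device, mark it read-mostly, and
    // declare that the CPU will also access it.
    cudaMemAdvise(devMax, sizeof(int), cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(devMax, sizeof(int), cudaMemAdviseSetReadMostly, device);
    cudaMemAdvise(devMax, sizeof(int), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    return devMax;
}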

Implementation 5 allocated per-device global memory, a per-device warp counter, and per-device host pinned memory to store the device maximum. atomicMax() was used to compute the maximum per device. Each warp that updated the maximum for the device would also atomically increment a counter, and the last warp to execute would copy the maximum from global memory into the host pinned memory. The host pinned memory values were then compared on the CPU and the result passed to the division kernel. The goal here was to examine the performance trade-off between decreasing the amount of data explicitly transferred (4 bytes vs a full page) and the cost of the second atomic operation.

Implementation 6 performed 3 passes. The first kernel computed per-device maximums into per-device global memory. The second kernel launched only a single thread per device to write the maximum of the two maximums to device memory. The third kernel executed the division. This approach was implemented to evaluate whether leaving the maximums in device memory, without returning them to the host, resulted in better performance.
The last implementation used a multi-device launch and explicit caching in shared memory. Each GPU tracked maximums in its own global memory. After a multi-grid synchronization across both devices, the first thread on each of the two devices computed the inverse based on the max of the two device maxes and stored it in its own global memory. After a grid synchronization (across blocks on that device only), each thread performed normalization.
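A skeleton of the multi-device launch described above (the kernel body, argument lists, and names are assumptions; the real code is in the download at the top of this post):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void MultiDeviceNormalize(int *data, int n, int *deviceMax)
{
    cg::multi_grid_group multi = cg::this_multi_grid();
    cg::grid_group grid = cg::this_grid();

    // ... per-device maximum reduction into deviceMax ...
    multi.sync();   // all blocks on all participating devices reach this point
    // ... one thread per device combines the two device maximums ...
    grid.sync();    // synchronize across blocks on this device only
    // ... normalization ...
}

// One cudaLaunchParams entry per device; the grid, block, and shared memory
// sizes must be identical on every device, and each entry needs its own
// non-default stream.
void LaunchOnBothDevices(dim3 grid, dim3 block, void **argsDev0, void **argsDev1,
                         cudaStream_t stream0, cudaStream_t stream1)
{
    cudaLaunchParams params[2];
    for (int dev = 0; dev < 2; ++dev)
    {
        params[dev].func      = (void *)MultiDeviceNormalize;
        params[dev].gridDim   = grid;
        params[dev].blockDim  = block;
        params[dev].args      = (dev == 0) ? argsDev0 : argsDev1;
        params[dev].sharedMem = 0;
        params[dev].stream    = (dev == 0) ? stream0 : stream1;
    }
    cudaLaunchCooperativeKernelMultiDevice(params, 2, 0);
}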

Results

Each implementation was timed over five iterations via the NVIDIA Visual Profiler. The timings measure only the execution time from the start of the first associated kernel to the end of the last associated kernel, and do not include the time required to allocate device memory, copy input data onto the GPU, copy output back to the host, deallocate device memory, or validate results. For runs that involved multiple devices, the minimum start time across both devices was used as the start time, and the maximum finish time was used as the finish time. The hardware used to collect this data was a CentOS 7.4.1708 machine running two NVIDIA Tesla P100s on CUDA driver 387.26. The number of elements was fixed at 917504, which is the maximum that a single GPU can handle for the cached single-pass version from Dan's original blog.

 

Implementation                               Run 1      Run 2      Run 3      Run 4      Run 5      Min        Avg        Max
Baseline 1 (Single GPU Two Pass)             51.135     50.688     50.336     50.335     50.527     50.335     50.6042    51.135
Baseline 2 (Single GPU One Pass Cached)      25.44      23.808     23.808     24.095     24.159     23.808     24.262     25.44
Implementation 1 (Two Pass Atomic System)    14780.94   14728.58   14740.34   14716.08   14719.84   14716.08   14737.16   14780.94
Implementation 2 (Two Pass Host Max)         169.925    156.799    154.327    106.705    120.042    106.705    141.5596   169.925
Implementation 3 (Two Pass Threadblock Max)  187.967    228.412    188.862    189.013    184.785    184.785    195.8078   228.412
Implementation 4 (Two Pass Managed Max)      364.668    284.729    268.286    248.958    246.542    246.542    282.6366   364.668
Implementation 5 (Two Pass Double Atomic)    148.646    129.312    135.996    126.774    124.081    124.081    132.9618   148.646
Implementation 6 (Three Pass)                158.312    180.294    150.403    132.255    126.399    126.399    149.5326   180.294
Implementation 7 (One Pass Cached)           112.511    65.454     154.399    66.886     103.074    65.454     100.4648   154.399

Time (usec) to Normalize 917504 elements on NVIDIA Tesla P100s

 

As expected, Implementation 1 (Two Pass, Atomic System) performed extremely slowly and had to be omitted from the graph to avoid scaling issues.  Of the four remaining two-pass implementations, managed memory performed the worst, even with cudaMemAdvise() hints.  Profiling showed only a small number of CPU and GPU page faults, however the other three implementations that were fault free performed noticeably faster.  Computing the maximum of the two maxes on the host (Implementation 2) ended up faster than having a third kernel to compute the max on GPU (Implementation 6).

The fastest multiple-GPU implementation was the multi-device kernel launch with explicit caching (Implementation 7), which wouldn't have been possible prior to CUDA 9. The results for Implementation 7 also varied quite noticeably: some iterations performed twice as fast as others. According to the profiler, the key difference between these runs was the occasional delay between when the kernel was launched on the first device versus the second device. For example, for the first iteration (112 usec) there was roughly a 50 usec delay before the kernel was launched on the second GPU, while for the second iteration (65 usec), both GPUs launched the kernel nearly simultaneously.

 

Lessons Learned

The new Cooperative Groups feature in CUDA 9 adds more programming flexibility by permitting synchronization across groups of thread blocks; however, moving to synchronization across multiple devices can incur non-trivial performance costs from peer-to-peer memory access and multi-device synchronization.

  • Avoid system level atomics if possible
  • Multi-device kernel performance may vary due to kernels not launching simultaneously
  • The grid size / block size / shared memory size needs to be identical for all devices during a multi-device kernel launch.
    • Even though grid size is the same on each GPU device, each GPU operates on a grid of that size.  (E.g. launching a grid of 100 blocks on 2 devices results in 200 total blocks, and not 50 blocks on each device).
  • Managed memory is easy to use because the system takes care of data migration between devices; however, explicit transfers still perform faster (especially when the data being transferred is much smaller than the page size).

Cooperative Groups was introduced in CUDA 9 to provide a flexible model for synchronization and communication between groups of threads executing CUDA kernels.  Mark Harris and Kyrylo Perelygin have an excellent introduction to Cooperative Groups here.

CUDA has always supported barrier synchronization of all the threads in a thread block through the __syncthreads() intrinsic. Cooperative Groups adds the ability to synchronize across thread blocks, by synchronizing all the threads in a grid. Grid synchronization is supported on Compute Capability 6.0 (Pascal) and higher GPUs. If you are running in Windows, your GPU must additionally be running in Tesla Compute Cluster (TCC) driver mode.

To demonstrate the utility of grid synchronization, I use it to normalize an input array of integers to the range [-1, 1] by scaling each element of the input by the maximum absolute value of the input array.  The following source code listing contains three different normalization implementations which we compare below.

Normalize.cu

The code must be compiled to generate relocatable device code and target GPUs of Compute Capability 6.0 or higher, e.g.:

nvcc -rdc=true -arch=sm_60 -o normalize normalize.cu

Two-Pass Normalization

Without grid synchronization, thread blocks are only synchronized upon completion of a kernel, necessitating a two kernel approach for normalization: one kernel to find the maximum absolute value, followed by a second kernel to multiply by the inverse of the maximum absolute value.

The FindMaximum kernel requires 1 thread per input element, and demonstrates the use of Cooperative Group’s partitioning mechanism to do a warp-level shuffle reduction to find the maximum absolute value within a warp, followed by having a single thread of each warp perform an atomicMax() operation to find the global maximum.
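The pattern looks roughly like this (an illustrative sketch; the full kernel is in Normalize.cu above):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void FindMaximumSketch(const int *input, int n, int *maxAbs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int value = (i < n) ? abs(input[i]) : 0;

    // Partition the block into warp-sized tiles and reduce within each tile.
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(cg::this_thread_block());
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        value = max(value, warp.shfl_down(value, offset));

    // One atomic per warp combines the per-warp results into the global maximum.
    if (warp.thread_rank() == 0)
        atomicMax(maxAbs, value);
}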

One-Pass Normalization Via Grid Synchronization

Using Cooperative Groups, we can merge the two steps in the normalization process into a single kernel by finding the maximum absolute value, synchronizing all the threads in the grid to avoid a potential race condition when accessing the maximum absolute value, and then scaling by the inverse of the maximum absolute value.

Note that to enable grid synchronization, kernels must be launched via the host cudaLaunchCooperativeKernel launch API instead of the more familiar <<<...>>> execution configuration syntax. Additionally, grids must be sized appropriately to guarantee that all the blocks in the grid are resident on the GPU’s streaming multiprocessors (SMs). An upper bound for the grid size can be determined by programmatically querying the number of blocks that can reside on a single SM via the cudaOccupancyMaxActiveBlocksPerMultiprocessor API call, and then multiplying by the number of SMs on the GPU. Given the constraints on grid size, using the same 1 thread per input/output element strategy as the two-pass normalization approach would also limit the maximum size of the input array. To maintain flexibility, instead of requiring 1 thread per input/output element, the GridSynchronizationNormalization kernel uses grid stride loops.
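Putting those pieces together, the sizing and launch logic looks roughly like the following sketch (the kernel's argument list here is an assumption; see Normalize.cu for the actual code):

__global__ void GridSynchronizationNormalize(int *data, int n, int *maxAbs);  // signature assumed

void LaunchNormalize(int *d_data, int n, int *d_maxAbs, int blockSize)
{
    // Upper bound on a grid whose blocks can all be resident at once.
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm,
                                                  GridSynchronizationNormalize,
                                                  blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int gridSize = blocksPerSm * prop.multiProcessorCount;

    // Cooperative kernels must be launched through the API, not <<<...>>>.
    void *args[] = { &d_data, &n, &d_maxAbs };
    cudaLaunchCooperativeKernel((void *)GridSynchronizationNormalize,
                                dim3(gridSize), dim3(blockSize), args, 0, 0);
}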

Performance

The table below shows the relative performance of the two approaches on a Tesla P100 GPU.

Implementation                          Execution Time (uS)
Two-pass                                46.8 (27.4 + 19.4)
GridSynchronizationNormalize            34.4

GridSynchronizationNormalize is faster than the two-pass approach.  However, it still does more work than necessary.  The advantage of using grid synchronization is that it allows stateful approaches where data persists in registers and/or shared memory after the synchronization, which isn’t possible in the two-stage implementation.

One-Pass Normalization Via Stateful Grid Synchronization

The GridSynchronizationStatefulNormalization kernel uses shared memory to reduce global memory traffic.  The first part of the kernel finds the maximum absolute value while also explicitly caching input elements in shared memory.  After the grid synchronization operation, instead of rereading the input values for the scaling operation as in the other kernels, they are read directly from shared memory.  The table below shows the performance improvements from the stateful approach:

Implementation                          Execution Time (uS)
Two-pass                                46.8 (27.4 + 19.4)
GridSynchronizationNormalize            34.4
GridSynchronizationStatefulNormalize    25.4

 

The improved performance of the stateful approach comes at a price: the input elements must fit within the GPU's aggregate shared memory capacity. On a Tesla P100, the input array can contain no more than 56 SMs * 64 KB/SM / 4 bytes/element = 917504 elements. On a Tesla V100, the largest array we can support increases to 80 SMs * 96 KB/SM / 4 bytes/element = 1966080 elements. These size limitations make the GridSynchronizationStatefulNormalize kernel less scalable than the alternatives. The stateful approach could be employed as part of an optimized path for smaller data sets, with an automatic fallback to the stateless approach for larger data sets.

Potential Mistakes When Developing Grid Synchronization Code

One of the strengths of the CUDA C programming model compared to other heterogeneous models like OpenCL, Vulkan, or even the CUDA Driver API is that host and device code are combined into a single source file. This advantage is especially evident when using CUDA's <<<…>>> kernel launch operator, which facilitates compile-time error checking to ensure that kernels are invoked with the expected number of arguments and that each kernel argument is of the expected type. No compile-time error checking is performed when using the cudaLaunchCooperativeKernel API call to launch a kernel that uses grid synchronization. Beware that passing an incorrect number of arguments, or arguments of an incorrect type, through the void** args parameter results in undefined runtime behavior, not a compile-time error! A quick experiment omitting the final argument to GridSynchronizationNormalize resulted in a segmentation fault.

cudaLaunchCooperativeKernel does validate its gridDim, blockDim, and sharedMem arguments to ensure that all blocks of the grid are resident on the GPU's SMs, and in the event of an error it returns the error code cudaErrorCooperativeLaunchTooLarge = 82 with the corresponding error string "too many blocks in cooperative launch".


We received an interesting question on one of our Webinar presentations (GPU Architecture and the CUDA Memory Model).

"What happens if enough shared memory is not available? Does the block stall until it becomes available or it uses global memory meanwhile?"

 

There are a couple of things happening with shared memory and its size. You, as the programmer, declare the size of your shared memory allocation. You can do this statically (at compile time) or dynamically (at run time). If you statically declare a shared memory allocation that exceeds the limit per thread block (48 KB), the NVIDIA compiler will generate an error and your program will not compile. If you dynamically attempt to allocate more memory than is available, your GPU kernel will not launch and an error code will be returned to your program, although you have to check for error codes programmatically. You can also have a combination of multiple static allocations and one dynamic allocation; if the sum of the allocations exceeds the per-thread-block limit, the GPU kernel will not launch.
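For reference, the two declaration styles look like this (an illustrative sketch, not code from the webinar):

__global__ void SharedMemoryExample(float *out)
{
    __shared__ float staticTile[1024];        // static: 4 KB, fixed at compile time
    extern __shared__ float dynamicBuffer[];  // dynamic: size chosen at launch time

    staticTile[threadIdx.x] = 0.0f;
    dynamicBuffer[threadIdx.x] = 1.0f;
    out[threadIdx.x] = staticTile[threadIdx.x] + dynamicBuffer[threadIdx.x];
}

// Host side: the third launch parameter sets the dynamic allocation in bytes.
// SharedMemoryExample<<<grid, 256, 256 * sizeof(float)>>>(d_out);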

Finally, shared memory is allocated exclusively to a thread block to facilitate zero-overhead context switching. As a result, the total amount of shared memory on the SM limits the number of thread blocks that can be scheduled to run concurrently on the SM, which impacts occupancy, and potentially performance. For more details on occupancy and performance, see our optimization webinar.

As a follow-up to my earlier blog, my engineering team commented that NVIDIA's Volta architecture (compute capability 7.0) increases the amount of shared memory you can allocate per thread block. Using static declarations, the maximum allocation is still 48 KB; however, you can access the full 96 KB on Volta using dynamic allocation. This can be done with a single dynamic allocation of up to 96 KB or a combination of static and dynamic allocations (e.g., 48 KB static and 48 KB dynamic). Additionally, you must explicitly increase the allocation limit using the cudaFuncSetAttribute call prior to your kernel launch, as shown below:
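A minimal sketch of the call (the kernel name and sizes are illustrative):

__global__ void VoltaKernel(float *data)
{
    extern __shared__ float dynamicBuffer[];   // dynamically sized shared memory
    dynamicBuffer[threadIdx.x] = data[threadIdx.x];
}

void Launch(float *d_data)
{
    const int dynamicBytes = 96 * 1024;        // 96 KB, Volta (CC 7.0) only

    // Opt in to the larger dynamic shared memory carve-out before launching.
    cudaFuncSetAttribute(VoltaKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         dynamicBytes);

    VoltaKernel<<<1, 256, dynamicBytes>>>(d_data);
}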


The SEG-Y file format has become a widely used industry standard for working with and sharing seismic data. Since the format's first publication in 1975, a lot has changed in the way we record, store, and process seismic data. Many proprietary extensions or modifications have been designed to work around the limitations of the file format, but such alternatives often make shared data sets difficult or impossible to interpret without additional development.

In 2002, SEG-Y revision 1 was released adding support for IEEE floating point data, ASCII text, and a system for extending the format in text headers. The latest update, revision 2, released in 2017 takes more significant steps towards modernizing the format and eliminating the problem of custom work-arounds.

File Format Overview

The following diagram shows an overview of the updated SEG-Y file format.

SEG-Y files start with an optional 128 byte SEG-Y tape label. The first piece of mandatory data is the 3200 byte text header that typically contains company and survey information in a human-readable format. The text header is followed by the 400 byte binary header which defines the size and type of data contained in the file.

Next, the SEG-Y file may contain a number of optional extended text headers. These headers were introduced in revision 1, and expanded on in revision 2. Extended text headers allow users to define how the file should be interpreted using official extensions. Developers can also design and publish their own extensions to the format without altering the contents of any parts of the file defined by the standard.

The bulk of the SEG-Y file is taken up by 1 or more traces. Each trace must include a 240 byte trace header, followed by an optional number of extended trace headers. Extended trace headers were introduced in revision 2 to provide space for new official and user-defined header parameters.

Finally, revision 2 allows extra user-defined data to be placed at the end of the SEG-Y file. This data must be preceded by headers and divided into 3200 byte blocks, but its presence does prevent older code from accurately reading the file.

New Features

Revision 2 introduces many new features to the format. The following is an overview of some of the most significant changes, which make the SEG-Y format more precise and easier to work with.

File Version and Backwards Compatibility

A SEG-Y file's version is identified in 2 places: line 39 of the text header, and bytes 3501-3502 of the binary header. For a SEG-Y revision 2 file, line 39 of the text header should read “SEG-Y REV2.0”. Bytes 3501 and 3502 of the binary header are the major and minor revision numbers. For a SEG-Y revision 2 file they would be 0x02 and 0x00 respectively.

The latest version of the SEG-Y file format is mostly backwards compatible with prior versions. A program written using the new standard will be able to read old SEG-Y files, and a program written using the old standard will be able to read new SEG-Y files that do not use any of the new extended data blocks. Extended text headers were introduced in revision 1, so they are not compatible with programs built to read revision 0. Similarly, extended trace headers were introduced in revision 2, so they are not compatible with older programs.

In general, any parameters in a SEG-Y file with a value of 0 are considered to be not set. New features can therefore be ignored by setting their values to 0. Additionally, the update to the standard did not move or remove any parameters: new parameters were added in unassigned or reserved space so old SEG-Y files should have all new fields set to 0.

Trace Header Extensions

A common challenge in reading SEG-Y files is in interpreting the values of trace headers. Due to limitations in the original format, many users implement custom solutions that use fields for data outside of the specification. SEG-Y revision 2 solves this problem by adding extended trace headers that provide space for additional official and custom header parameters.

Extended trace headers are a series of 240 byte blocks. The maximum number of additional trace headers in a file is specified in binary file header bytes 3507-3510. The actual number of extended trace headers may vary from one trace to the next, unless the file has fixed length traces. While the standard allows for variable length traces, fixed length traces are common because they allow fast seeking. As of revision 1, a SEG-Y file can guarantee it has fixed length traces by setting bytes 3503-3504 in the binary header to 1. For variable length traces, the actual number of extended headers is indicated in bytes 157-158 of the first extended header.

The final 8 bytes of all trace headers are now used to indicate the header's name. The standard trace header should be named “SEG00000”. If extended trace headers are used, the first one must be the new standard header named “SEG00001”. Further headers are fully customizable with the final 8 bytes reserved for a unique name. All header names of the format “SEG#####” are reserved for future revisions.

Modern Data

Revision 2 provides a number of changes that make SEG-Y able to handle much larger data sets. Files can contain more traces and traces can contain more samples. Files can now contain up to 2^64 - 1 traces and 2^32 - 1 traces per ensemble. This is accomplished by including several 64-bit unsigned integer fields in the extended trace header that override smaller data fields in the standard header. Bytes 1-8 of the extended header override the trace sequence number in a line (bytes 1-4 of the standard header) and bytes 9-16 override the trace sequence number in a file (bytes 5-8). Similarly, bytes 3261-3264 of the binary header override the number of traces per ensemble (bytes 3213-3214).

Traces now also support more samples and new data formats. Bytes 3269-3272 of the binary file header override the number of samples per trace (bytes 3221-3222), allowing for up to 2^32 - 1 samples per trace. The trace data format is specified by setting bytes 3225-3226 of the binary file header to a particular code. Most notably, 4-byte IEEE floating point numbers were added in revision 1 and 8-byte IEEE floating point numbers were added in revision 2 (codes 5 and 6 respectively).

This revision also provides a number of changes to make SEG-Y faster and easier to use on modern computers. SEG-Y files can now be written in little-endian format for faster I/O performance. Bytes 3297-3300 can be used to detect the correct byte order. A value of 16909060 (0x01020304) would indicate all bytes are in the correct order, while 67305985 (0x04030201) would indicate the bytes must be reversed.
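A small sketch of that check (the function name is ours, and it assumes no optional SEG-Y tape label precedes the headers):

#include <cstdint>
#include <fstream>

enum class ByteOrder { Native, Swapped, Unknown };

// Read the byte-order marker from bytes 3297-3300 (1-based) of the binary file header.
ByteOrder DetectByteOrder(const char *path)
{
    std::ifstream file(path, std::ios::binary);
    file.seekg(3296);                    // byte 3297 in the standard's 1-based numbering
    uint32_t marker = 0;
    file.read(reinterpret_cast<char *>(&marker), sizeof(marker));

    if (marker == 0x01020304u) return ByteOrder::Native;    // 16909060: no swap needed
    if (marker == 0x04030201u) return ByteOrder::Swapped;   // 67305985: reverse bytes
    return ByteOrder::Unknown;           // 0 or anything else: field not set (pre-rev 2)
}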

The new revision also provides a better way to specify many standard parameters with improved precision. The sampling interval can be specified as a 64-bit IEEE floating point number in bytes 3273-3280 of the binary file header (this overrides bytes 3217-3218). Source and receiver coordinates, depths, and elevations can also be specified as 64-bit floating point values in the first extended trace header.

These parameters eliminate the need for the various scaling factors that are present in the standard trace header.

Finally, the standard trace header in bytes 29-30 is an identification code for the type of data contained in the trace. Many new types of data have been added, including velocity, electro-magnetic, and rotational sensor data.

Trace Header Mapping

A common challenge with using the SEG-Y format has been interpreting trace header data. Due to limitations in the original format, various custom solutions have arisen that can make sharing files difficult. While the new standard extended trace headers should alleviate most of these problems, trace header mapping should eliminate the problem completely.

The extended text header is a series of 3200 byte records. The number of records is provided in bytes 3505-3506 of the binary file header. These headers provide extra space for any data the user wishes to include in the file. Extended text headers are organized into sets of data; each set is referred to as a “stanza”. A stanza can include data in any format the user wishes; however, a number of standard stanzas have been published with the SEG-Y specification. Stanzas must start on a 3200 byte boundary and begin with a stanza header. The header includes the name of the stanza as well as the company that published it. For example, a trace header mapping would begin with “((SEG: Trace Header Mapping ver 1.0))”.

Revision 2 provides a specification for a text header stanza that overrides standard trace header locations. Custom extensions to the SEG-Y file format can now be codified within the file itself for easier processing. User defined parameters can be mapped, and standard parameters can be remapped to new locations or data formats.

Trace header mappings are specified in the extended text header field.

The following is an excerpt of the example trace header mapping provided in the SEG-Y revision 2 standard[1]:

((SEG: Trace Header Mapping ver 1.0))
Header name = SEG00000
Field name 1 = tracl
Byte position 1 = 1
Value format code 1 = 2
Field description 1 = Trace sequence number within line
Field name 2 = tracr
Byte position 2 = 5
Value format code 2 = 2
Field description 2 = Trace sequence number within SEG-Y file
Field name 3 = fldr
Byte position 3 = 9
Value format code 3 = 2
Field description 3 = Original field record number

As seen in the example, a field is mapped by specifying its name, position, format, and description. Using standard names for fields allows developers to document their proprietary modifications to the SEG-Y format in a way that can still be read by anyone that the data is shared with.

If extra data needs to be saved with each trace, new fields can also be mapped to custom headers. The “Header name” line indicates which trace header block is currently being mapped. The example above maps its fields to the standard trace header “SEG00000”.

Conclusion

SEG-Y revision 2 represents a significant step forward in modernizing the file format. By using the new extended trace headers for more precision, and trace header mappings to document any modifications or extra headers, the SEG-Y format will be much easier to use for processing and sharing seismic data.

References

[1] Hagelund, Rune, and Stewart A. Levin, editors. “SEG-Y_r2.0: SEG-Y Revision 2.0 Data Exchange Format.” Mar. 2017, seg.org/Portals/0/SEG/News%20and%20Resources/Technical%20Standards/seg_y_rev2_0-mar2017.pdf.
