I. Intriguing Embedded Module from SolidRun Armed with Gyrfalcon AI Accelerator
SolidRun (www.solid-run.com) has announced the availability of an i.MX 8M Mini System on Module (SOM) with some serious AI acceleration horsepower, thanks to the Gyrfalcon Lightspeeur® 2803S accelerator chip. This module is a powerful platform that contains all the ingredients needed to quickly prototype an AI-enabled edge device. The module measures 47mm x 30mm and is jam-packed with features such as:
Various grades of NXP host processor (based on Arm Cortex-A53)
4GB LPDDR4 Memory
Bluetooth and Wi-Fi Connectivity (u-blox module)
Robust multimedia capabilities (support for 20 audio channels, MIPI-DSI, 1080p encode/decode, and camera and display interfaces)
AI inference performed by Gyrfalcon's Lightspeeur® 2803S (24 TOPS/W, 9 x 9 mm) accelerator chip
3.3W power dissipation
Admittedly, the list above does not do justice to all the capabilities of this product, but the intention here is not to duplicate the data sheet. What is significant, in my view, can be summarized in just four numbers:
24 TOPS/W (quite sufficient for most edge vision inference workloads)
3.3 Watts (high for battery-operated devices but suitable for a whole host of industrial, automotive, medical, and robotics applications)
47mm x 30mm (could be much smaller if optimized for a specific application)
$56 (given time, volume, and good negotiation the price can be much lower)
I am certain we will see a flurry of other small and low-cost embedded systems with sizable AI horsepower that can serve a myriad of applications and use cases. So why is this significant?
Imagine a scenario in which every instrument, device, machine, or vehicle costing more than a few hundred dollars can be easily enhanced (by a similar module) and enabled to use historical data to gain predictive, inferential, and recognition abilities above and beyond its baseline features. Do you see value in this? I certainly do. Not in every case imaginable, but in most. Don't get me wrong: I am not ignoring or redefining IoT here. This goes above and beyond IoT. Successful deployment of IoT requires an infrastructure build-out, but there are hundreds of legacy use cases (that can benefit from AI) that will do just fine without being connected to millions of other nodes. I believe most of the buzz around AI has centered on IoT, autonomous vehicles, surveillance cameras, and robots, but numerous other applications can also benefit from having artificial cognitive capabilities.
II. Mixed Precision Training with 8-bit Floating Point
A group of researchers at Intel Labs has demonstrated that an 8-bit floating point representation (FP-8) can be as effective as FP-16 and FP-32 during training.
A bit of history first. The choice of numerical representation for weights, activations, errors, and gradients in Deep Neural Networks (DNNs) can have a dramatic impact on the die size and power dissipation of training and inference chips. It should come as no surprise that dozens of research teams are working feverishly to find the most efficient numerical representation without sacrificing accuracy. This journey has been relatively easy for inference chips: even 8-bit integer representation has produced remarkable results there. Unfortunately, the same can't be claimed for training. Presently the most common numerical format for training is 16-bit floating point (FP-16), and there is ample evidence that FP-16 can come very close to FP-32 in validation accuracy. A number of teams have also attempted to use integer representation for training, but the results have been mixed at best. Some have been successful in improving the outcomes, but at the expense of additional hardware (for stochastic rounding).
Researchers at Intel Labs have moved away from INT-8 and have instead proposed a new scalable solution using FP-8 compute primitives that no longer require additional hardware for stochastic rounding. The result is a significant reduction in the cost and complexity of MAC units. They have shown state-of-the-art accuracy with FP-8 representation of weights, activations, errors, and weight gradients across a broad set of popular data sets. In some cases, their accuracy has been better than FP-32. A truly remarkable accomplishment.
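To get a feel for how coarse an 8-bit float is, here is a minimal sketch (not the Intel Labs implementation) that rounds FP-32 values onto a toy FP-8 grid, assuming a 1-5-2 sign/exponent/mantissa layout; subnormals and inf/NaN handling are omitted for brevity.

```python
import numpy as np

def quantize_fp8(x, exp_bits=5, man_bits=2):
    """Round values to the nearest number representable in a toy FP-8
    format (1 sign bit, exp_bits exponent bits, man_bits mantissa bits).
    Subnormals and inf/NaN handling are omitted for brevity."""
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)
    bias = 2 ** (exp_bits - 1) - 1
    # exponent of each value, clamped to the representable range
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    e = np.clip(e, -bias + 1, bias)
    # keep man_bits fractional bits of the significand
    scale = 2.0 ** (e - man_bits)
    q = np.round(mag / scale) * scale
    return sign * np.where(mag > 0, q, 0.0)

# With only 2 mantissa bits, 1.3 lands on 1.25 and 0.1 on 0.09375
print(quantize_fp8(np.array([0.1, 0.5, 1.3, 7.9])))
```

The coarse rounding this produces is exactly why naive 8-bit training fails, and why techniques such as loss scaling and stochastic rounding (or, in the Intel Labs work, FP-8 primitives designed to avoid the extra rounding hardware) matter.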
III. EfficientNet: Compound Model Scaling in CNNs
Kudos to Quoc Le and his team at Google AI for coming up with the concept of "Compound Model Scaling". Their findings have led to dramatic improvements in the size and efficiency of Convolutional Neural Network (CNN) implementations.
The standard practice for choosing a CNN architecture is to start with a baseline model and apply various scaling strategies to improve its accuracy and efficiency while staying within a given resource budget. Scaling in CNNs can be done in three ways:
Width Scaling: Increase the number of neurons in each layer
Depth Scaling: Add layers at various stages of the network (more convolutional, pooling, or fully connected layers)
Resolution Scaling: Increase the input resolution
Traditionally, the process of scaling up a network has been tedious, to say the least, requiring numerous guesses and arbitrary attempts.
The Google researchers have shown that it is critical to balance the scaling of all dimensions of the network (width, depth, and resolution), and have proposed a method to find the optimal scaling balance. Furthermore, they have shown that an optimal balance can be achieved by scaling each dimension by a constant ratio. Their formal process (called compound scaling) uses a simple grid search to come up with three fixed scaling coefficients (one for each dimension) that optimize the accuracy of the model given available resources. In practice, the optimization starts from a bare-minimum baseline model and finds the three scaling coefficients given the available computational resources. Finally, the baseline network is scaled across all three dimensions using these coefficients.
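The process above can be sketched in a few lines. The base ratios alpha (depth), beta (width), and gamma (resolution) below are the ones reported for the EfficientNet-B0 grid search, found under the constraint alpha * beta^2 * gamma^2 ≈ 2 so that FLOPs roughly double per step of the compound coefficient phi; the baseline depth/width/resolution numbers are purely illustrative.

```python
# Base ratios from the EfficientNet paper's grid search
# (constraint: alpha * beta**2 * gamma**2 ≈ 2).
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(base_depth, base_width, base_resolution, phi):
    """Scale depth, width, and resolution together by alpha**phi,
    beta**phi, gamma**phi; FLOPs grow roughly as 2**phi."""
    return (round(base_depth * alpha ** phi),
            round(base_width * beta ** phi),
            round(base_resolution * gamma ** phi))

# Illustrative baseline: 18 layers, 64 channels, 224-px input
for phi in range(4):
    print(phi, compound_scale(18, 64, 224, phi))
```

The key point is that a single knob (phi) now scales all three dimensions in a balanced way, replacing the per-dimension guesswork of traditional scaling.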
The team has demonstrated truly remarkable results. In one specific use case they achieved state-of-the-art 84.4% top-1 and 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster at inference than the best-of-breed CNNs.
IV. A Few Market Data Points
According to IDC, worldwide shipments of AI-optimized processors for edge systems will reach 340M units in 2019 and will increase to 1.5B units in 2023.
According to Woodside Capital, total VC investment in semiconductor companies reached nearly $1B, 67% of which was AI-related.
According to Strategy Analytics, the number of automotive image sensors will grow from 110M in 2019 to 330M in 2026.
Hi chip-design masters, I am a layout designer, and most of my seniors say that the main reason we need to match devices is to get an even distribution of heat. I am still not convinced; to me, the main reason is to compensate for variation introduced during the fab process. What can you say about this, masters? Can you enumerate the other factors according to their impact? Thanks
I've gone through the H&P books and the design of the MIPS/RISC-V processors, and I really enjoyed reading them. I'm looking for resources on micro-architecture design for other kinds of devices, e.g. networking devices, storage devices, graphics offload, etc. The point of the exercise is to expose myself to a variety of different device types in order to gain a better understanding of how different devices are put together to handle different kinds of workloads; e.g. I enjoyed reading about Google's TPU hardware design.
I've looked at ISCA and ASPLOS, but they seem very CPU oriented. Do you have any recommendations on other kinds of devices? I don't mind if they are older resources.
Not sure how I'll go asking for design help on this topic, but I won't know unless I ask. For transparency, this is class work, so I'm happy for any help that doesn't just dish out an answer straight up. I'm just a bit stuck on progressing.
I'm working on a task to design a 4-bit sigma-delta (S-D) converter. I've been reading quite a lot of info trying to wrap my head around how to achieve the multiple-bit output. I currently have a single-stage modulator implemented both at a high level with op-amps and also using MOSFETs representative of the IC level.
The S-D output is a single bit with 1-bit DAC feedback, which seems to be functioning correctly, with a duty cycle that varies with the input signal voltage. What I am struggling to determine is how to expand this to a multiple-bit ADC.
From what I have determined, a multi-stage design only improves the noise suppression/shaping, and whilst useful, that seems not to be the direction I need to go at this point.
I have been reading about decimation after the 1-bit output. This is likely to be the method to achieve the multiple-bit output, but everything I've read is about the math related to filtering.
What sort of keywords do I need to be looking for to translate this to a circuit schematic? Comb filtering seemed interesting at one point, but I didn't see how that related to a 4-bit output. Would appreciate any help you folks can offer.
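The decimation idea in the question can be sketched numerically rather than as a circuit. This is a rough illustration (not a verified converter design) of a first-order "accumulate-and-dump" (boxcar) decimator: summing N consecutive one-bit samples yields a word of log2(N+1) bits, so N = 15 gives a 4-bit (0..15) code directly from the modulator's duty cycle. Real designs use cascaded higher-order CIC (sinc^k) stages plus an FIR cleanup filter, but the principle of turning duty cycle into a multi-bit code is the same; the keyword assumed here is "CIC / accumulate-and-dump decimator".

```python
def accumulate_and_dump(bitstream, n=15):
    """Decimate a 0/1 bitstream by n: each output word is the sum of
    n consecutive input bits (range 0..n), i.e. the local duty cycle.
    In hardware this is a counter that is read and reset every n clocks."""
    out = []
    for i in range(0, len(bitstream) - n + 1, n):
        out.append(sum(bitstream[i:i + n]))
    return out

# A 1-bit stream with ~2/3 duty cycle decimates to 4-bit words near 10/15
stream = [1, 1, 0] * 20  # 60 modulator output samples
print(accumulate_and_dump(stream))  # -> [10, 10, 10, 10]
```

In schematic terms this is just a counter clocked by the modulator output plus a register that latches and clears it every n cycles, which may be an easier starting point than the general comb-filter math.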
I have taken a digital electronics course for the past few semesters in my junior year of high school, and our "final" is to design a circuit of our choice. However, I am stumped for ideas. What are some small-to-medium-sized projects you have done? I am familiar with the 74LS series of chips. Thank you for any input!