RSIP Vision, managed by Ron Soferman, is an established leader in computer vision and image processing R&D. This section collects articles by RSIP Vision on computer vision and image processing projects and work.
Detecting physical and virtual intrusions is a key process in ensuring information and property security. Physical intrusion detection refers to any attempt by an unauthorized person to break into a building, warehouse, or other perimeter where access is granted only to limited personnel. Depending on the characteristics of the structure, an attempted intrusion can take many forms, from brute-force entry to disguising oneself as authorized staff. Authorized personnel can be identified by a human observer at the gate or by a patrol; alternatively, digital access authorization can use magnetic cards or other IR- or biometric-based authentication means. Here we concentrate on vision-based intrusion detection, in which intrusions are detected by analyzing video streams from security cameras overlooking a populated region.
The variability in forms of unauthorized entry makes a ready-made, off-the-shelf software solution impractical; tailor-made algorithms must be constructed based on the security needs and the characteristics of the authorized personnel and the perimeter. Other intrusion detection scenarios also exist, in which no person (or, in rare cases, only one person) is allowed to enter a perimeter at all, such as climbing on the exterior of a building or monument, or crossing the border between neighboring countries.
A vision-based intrusion detection system must first possess the ability to identify and track the trajectories of dynamic objects, most often humans. This by itself remains a challenging task in computer vision, especially in heavily crowded environments, where occlusion and path crossing hamper reliable trajectory tracking. The second requirement, therefore, is the system's ability to infer an object's likely future from its history. Rather than simply identifying any 'border' crossing by a trajectory, a prediction system can substantially reduce the response time to an intrusion and allow a timely reaction. Such trajectory predictions are already being implemented in automatic driving assistance systems, where the response time of smart cars to a pedestrian's sudden road crossing has crucial implications.
Due to the complexity of both human tracking and path prediction, a deterministic system is bound to have high rates of false positives and false negatives, and is therefore impractical. Statistical machine learning solutions, on the other hand, are far more likely to reduce both false positive and false negative trajectory predictions, by learning the characteristics of these rare intrusion events from trajectories and structural characteristics.
The task of predicting a trajectory's intent can be formulated in a stochastic-analysis framework: at any time, a probability map should be computed of the trajectory reaching any one of the entrances of a building, based on its very short history (sometimes limited by the camera's field of view). More specifically, we ask: what is the probability that a given path ends up at one of the entrances and remains there longer than a 'normal' time span, raising the system's 'suspicion'? By assigning time-varying probabilities to human trajectories, an alert can be triggered to call for further examination by a human observer or another timely response.
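As a toy sketch of this idea (not the full stochastic framework: a constant-velocity extrapolation and a Gaussian score stand in for the probability map, and all names and coordinates below are illustrative assumptions):

```python
import numpy as np

def entrance_probabilities(track, entrances, horizon=20, sigma=3.0):
    """Score each entrance by how likely the track's extrapolated
    endpoint lands near it (Gaussian kernel), then normalize so the
    scores form a probability distribution over entrances."""
    track = np.asarray(track, dtype=float)
    # Constant-velocity extrapolation from the short observed history.
    velocity = (track[-1] - track[0]) / (len(track) - 1)
    predicted = track[-1] + horizon * velocity
    dists = np.linalg.norm(np.asarray(entrances, dtype=float) - predicted, axis=1)
    scores = np.exp(-dists**2 / (2 * sigma**2))
    return scores / scores.sum()

# A short track heading right, toward entrance A; entrance B is off to the side.
track = [(0, 0), (1, 0), (2, 0), (3, 0)]
entrances = [(30, 0), (0, 30)]
probs = entrance_probabilities(track, entrances)
```

A real system would replace the constant-velocity model with a learned motion model and accumulate these probabilities over time before raising an alert.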
Intrusion detection with deep learning
The stochastic nature and scarcity of intrusions make it difficult to extract from existing datasets (e.g. retrospective analysis of video streams) a pattern relating a person's trajectory, tracked over time, to an actual intrusion attempt. The reason most probably stems from the fact that trajectories by themselves (i.e. spatio-temporal information) are insufficient to reliably raise an alert. Rather, other features of the tracked human and of past events provide helpful hints for the system to deduce a possible intrusion intention. This highly complex deduction can be provided in the framework of deep learning neural networks, where object features and trajectories are integrated to construct an automatic alert system.
Intrusion detection – training
Training of such a network can be done using retrospective analysis of intrusion events, in which trajectories are analyzed from start (source) to end (sink) points, thus accounting for variability in perimeter structure and camera position. Other informative features extracted along a trajectory could be: the number of trajectories, their coherence, interactions with other trajectories, type of clothing, equipment held in the hands, time of day, and other site-specific features related to the vulnerability of the perimeter itself.
Such a sophisticated formulation and construction of vision-based intrusion detection with deep learning can save both resources and time. At RSIP Vision, our engineers utilize high-end algorithmics to both train and deploy deep learning networks, based on features engineered by our experts using innovative mathematical and computational methodologies. RSIP Vision is a global leader in computer vision and deep learning. To learn more about RSIP Vision’s domains of expertise, please visit our selected project page. Contact us now to discuss your artificial intelligence project.
Object tracking in video sequences is a classical challenge in computer vision, which finds applications in nearly all domains of industry: from assembly-line automation and security to traffic control, automatic driving assistance systems and agriculture. Present state-of-the-art algorithms perform relatively well in controlled environments, where illumination and camera angle remain relatively stable throughout acquisition and object occlusion is minimal. Complications pile up rapidly when multiple objects appear in the scene, or when objects undergo non-rigid transformations from one frame to the next.
Classically, object-tracking algorithms start with a known object to track (contained within a bounding box); the algorithm computes features in that bounding box which are associated with the object, then finds the most probable bounding box in the next frame, i.e. the one whose features are as close as possible to those appearing in the previous frame. The search advances by looking for the bounding box in frame t+1 obtained by applying an affine transformation to the one in the previous frame t. Such an approach is exemplified by the algorithm of Lucas, Kanade and Tomasi.
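The core of a Lucas-Kanade step, restricted here to pure translation rather than a full affine transform, can be sketched in a few lines: solve the least-squares system built from spatial gradients and the temporal difference. This is a minimal illustration on a synthetic blob, not the full pyramidal feature tracker:

```python
import numpy as np

def lk_translation(prev, curr):
    """One Lucas-Kanade step: least-squares estimate of the
    translation (u, v) that best explains curr as a shifted prev,
    from the optical-flow equation Ix*u + Iy*v = -It."""
    Ix = np.gradient(prev, axis=1)   # spatial gradient, x (columns)
    Iy = np.gradient(prev, axis=0)   # spatial gradient, y (rows)
    It = curr - prev                 # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic test: a smooth Gaussian blob shifted one pixel to the right.
x, y = np.meshgrid(np.arange(32), np.arange(32))
blob = lambda cx, cy: np.exp(-((x - cx)**2 + (y - cy)**2) / 20.0)
u, v = lk_translation(blob(15, 15), blob(16, 15))
```

In practice the estimate is iterated and embedded in an image pyramid to handle larger displacements.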
Several drawbacks are encountered with the above classical approach. Firstly, the bounding box of a non-convex object might contain a large portion of the background; if the background is complex, features extracted from it might confuse matching in the next frame. Secondly, feature-matching algorithms can easily lose track of the main object when it is occluded (e.g., a hand passing over a face). Finally, feature computation and matching for objects undergoing small continuous motion form a bottleneck for real-time tracking speed, which can be as slow as 0.8-10 fps.
To resolve some of the difficulties of classic tracking algorithms, deep convolutional networks have been proposed as a possible solution. These offline-trained networks provide a means of overcoming occlusion, based on the knowledge of continuity learned during the training phase. In addition, such a network can handle a variety of object types (convex and non-convex), which improves overall usability compared to classical solutions.
The output of such a trained network is the coordinates of a bounding box containing the object to track in each frame of the video sequence. A notable network architecture for tracking tasks contains an input layer with two branches corresponding to two subsequent frames: one annotated and centered on the object to track, and the second the frame within which the bounding box is to be localized. Frame information is passed through convolutional (filter) layers, where feature values are concatenated and further passed through three fully connected layers; finally, the information reaches four output nodes representing the corners of the bounding box containing the object in the next frame.
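The data flow of such a two-branch architecture can be sketched at the shape level. The sketch below uses random, untrained weights, and a crude random-filter correlation stands in for the convolutional branches; it illustrates the described wiring only, not a usable tracker:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_features(crop, n_filters=8, k=3):
    """Tiny stand-in for a convolutional branch: valid-mode
    correlation with random k x k filters, then global average
    pooling to one value per filter."""
    h, w = crop.shape
    filters = rng.standard_normal((n_filters, k, k))
    feats = np.empty(n_filters)
    for i, f in enumerate(filters):
        resp = sum(
            f[a, b] * crop[a:h - k + 1 + a, b:w - k + 1 + b]
            for a in range(k) for b in range(k)
        )
        feats[i] = resp.mean()
    return feats

def track_step(prev_crop, curr_frame_crop):
    """Concatenate both branches' features and pass them through
    three (random) fully connected layers to four box outputs."""
    x = np.concatenate([conv_features(prev_crop), conv_features(curr_frame_crop)])
    for out_dim in (32, 32, 4):
        W = rng.standard_normal((out_dim, x.size)) * 0.1
        x = np.maximum(W @ x, 0) if out_dim != 4 else W @ x
    return x  # four coordinates of the predicted bounding box

box = track_step(rng.standard_normal((16, 16)), rng.standard_normal((16, 16)))
```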
Due to the offline training of the network, object tracking can be performed at a surprising 100 fps on a modern GPU, and at about 20 fps on a CPU. The construction and training of a CNN for tracking requires careful planning and execution. Our experts at RSIP Vision have been dealing for many years with state-of-the-art algorithms for object tracking under the most severe conditions in natural videos, such as low-dose X-ray, cloudy road conditions and security cameras. We at RSIP Vision are committed to providing the highest quality tailor-made algorithms for our clients’ needs, with unparalleled accuracy and reproducibility. To learn more about RSIP Vision’s activities in a large variety of domains, please visit our project page. To learn how RSIP Vision can contribute to advancing your project, please contact us and consult our experts.
Object identification and tracking in a sequence of frames (video) consist of sampling the scene (e.g. by raster scan or uniform scatter) to extract features and compute their descriptors for target-object identification. This raster-scanning procedure can be resource intensive, especially if every (or almost every) pixel in the image needs to be examined; hence it poses a bottleneck for real-time applications. Although in sparse-image applications only points of interest (such as edges) are examined, full-image operations (computing gradients or extracting corners) are still utilized, and these require information from all pixels. In heavily crowded scenes, such as urban landscapes, pedestrians, cars and traffic signals need to be identified, tracked and integrated by an automated driving assistance system (ADAS) in order to issue a timely response. However, for some applications, the sampling space can be reduced if the temporal link between the expected successive locations of objects is retrieved.
The samples in green illustrate one possibility for where the algorithm searches for information in each frame (say, from a dash cam)
The goal, therefore, is to dramatically reduce the number of image samples by predicting which pixels to sample, and to compute features and descriptors only on those pixels for further analysis, identification and tracking. The mathematical tool dealing with inference of the conditional (temporal) link between successive distributions of events (points related to a target object at any time) is the theory of temporal point processes. A temporal point process is a stochastic realization of points scattered in space at time t_i, having an intensity (loosely, the expected number of points per unit measure) conditioned on the intensity (or distribution) at the previous time step t_{i-1}. The distribution of points is unknown a priori and needs to be estimated from the data.
Rather than scanning the whole image, the machine learning algorithm learns to search for information in designated areas
Several types of temporal point processes are useful for identifying object points (events). For example, a self-excitatory process (the Hawkes process, also used to model the locations of secondary aftershocks following a large earthquake) increases the sampling of points near a successful event (an identified object point) in subsequent time steps, thus providing more “relevant” samples in the vicinity of the object to be extracted. Conversely, a self-inhibitory process diminishes the number of points in the vicinity of “failed” or irrelevant events (non-object sample points), thus shrinking the sample space where it is not needed.
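To make the self-excitatory mechanism concrete, here is a minimal sketch of a one-dimensional Hawkes process with an exponential kernel, simulated with Ogata's thinning algorithm (the function names are illustrative, and this is a generic textbook construction, not a production sampler):

```python
import numpy as np

def intensity(t, events, mu, alpha, beta):
    """Conditional intensity lambda(t): base rate mu plus a jump of
    alpha per past event, decaying exponentially at rate beta."""
    if not events:
        return mu
    return mu + alpha * np.exp(-beta * (t - np.array(events))).sum()

def simulate_hawkes(mu, alpha, beta, T, seed=0):
    """Ogata thinning: between events the intensity only decays, so
    the current intensity is a valid upper bound; propose candidate
    times at that bound and accept with ratio lambda(candidate)/bound."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < T:
        lam_bar = intensity(t, events, mu, alpha, beta)
        t += rng.exponential(1.0 / lam_bar)
        if t < T and rng.uniform() <= intensity(t, events, mu, alpha, beta) / lam_bar:
            events.append(t)
    return events

# alpha/beta < 1 keeps the process stable (subcritical).
times = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.5, T=200.0)
```

Each accepted event raises the intensity around itself, so samples cluster near recent events, which is exactly the "more samples near a successful identification" behavior described above.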
We therefore arrive at the crucial point: how to retrieve such dynamic sampling based on temporal point processes. This non-trivial relationship must be learned and adapted dynamically from the data. We find the solution in machine learning methodologies, the goal of which is to continuously update the conditional intensity of the point process based on its history. The parameters of the conditional intensity of our point process (or of a mixed inhibitory-excitatory process) can be obtained by minimizing the negative log-likelihood, thus obtaining the maximum-likelihood estimators.
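In the simplest instance of this estimation, a homogeneous Poisson process with n events observed on [0, T], the negative log-likelihood is λT − n·log λ, and minimizing it recovers the closed-form estimator λ̂ = n/T. A sketch (grid search stands in for a proper optimizer, and the counts are made up for illustration):

```python
import numpy as np

def poisson_nll(rate, n_events, T):
    """Negative log-likelihood of a homogeneous Poisson process with
    n_events observed on [0, T]: rate*T - n_events*log(rate)."""
    return rate * T - n_events * np.log(rate)

n_events, T = 37, 100.0
grid = np.linspace(0.01, 2.0, 2000)
best = grid[np.argmin(poisson_nll(grid, n_events, T))]
# The closed-form MLE is n_events / T = 0.37.
```

For the mixed excitatory-inhibitory processes discussed above, the same principle applies but the likelihood involves the full conditional intensity, so the minimization is done numerically.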
In the learning phase, we need to feed our network (e.g. a recurrent neural network) with the identified object points in each frame. Special attention must be given to applications in which the viewpoint is static (e.g. security cameras) versus dynamic (e.g. dash cameras in cars). These cases impose the restriction that the sampling process should converge either to a homogeneous Poisson point process (constant intensity) when there is no temporal correlation, or to the mixed excitatory-inhibitory (non-homogeneous) distribution when object identification has some certainty. The use of cutting-edge methodologies in computer vision and machine learning has been everyday practice at RSIP Vision for years. Our engineers develop and implement state-of-the-art methodologies to provide our clients with the most advanced and stable solutions for their projects. To learn more about RSIP Vision’s activities in a wide range of industrial domains, please visit our project page.
Defect detection during production is a necessary step to ensure product quality. Although manual human inspection is still employed, automated visual inspection has practically replaced manual labor in almost all major production lines and is ubiquitous in the manufacturing of mechanical parts, auto parts, printed circuit boards (PCBs) and electronic parts, in medicine, as well as in agricultural yield inspection. Automated defect inspection by means of vision and sensory hardware is advantageous over manual inspection due to its accuracy, speed, relative ease of implementation and reduced costs.
Automated defect detection works by comparing a gold-standard product template with products in the manufacturing process and detecting unreasonable deviations from it.
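The golden-template comparison can be sketched as follows, assuming the product image is already registered (aligned) to the template; `find_defects` is an illustrative name, and a real system must also handle alignment, illumination differences and tolerance bands:

```python
import numpy as np

def find_defects(product, template, threshold=0.2):
    """Flag pixels whose absolute deviation from the aligned golden
    template exceeds the threshold; return a boolean defect mask."""
    diff = np.abs(product.astype(float) - template.astype(float))
    return diff > threshold

template = np.zeros((8, 8))
template[2:6, 2:6] = 1.0          # the golden pattern
product = template.copy()
product[4, 7] = 1.0               # a spurious bright pixel: a defect
mask = find_defects(product, template)
```

The connected components of the mask can then be measured and compared against the margin of acceptable defects.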
For the production of well-defined patterned products such as PCBs, pattern-matching algorithms can be used to estimate deviations. However, defects in other products, such as fruits and flowers, might be less obvious to both define and detect. When large variability in both defects and product shape is present, statistical methods, such as those offered by deep learning algorithms, are best suited for the job.
In analogy to a human inspector, whose reasoning is made on a per-product basis, machine learning algorithms are taught to distinguish defects within a certain acceptable range, based on the characteristics of features and their descriptors extracted from the product under inspection.
The actual deep learning architecture (number of layers and node connectivity) might differ according to the complexity of the problem. However, the U-Net architecture is a plausible and promising possibility. U-Nets are fully convolutional neural networks (CNNs), in which images undergo a sequence of down-sampling steps with simultaneous computation of features at each scale, followed by a sequence of up-sampling steps to recover the final classified (segmented or annotated) output image.
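The down-sample / up-sample structure can be illustrated at the shape level. This is a shape-only sketch: average pooling, nearest-neighbour up-sampling and an averaged skip connection stand in for the learned convolutional blocks of a real U-Net:

```python
import numpy as np

def down(x):
    """2x2 average pooling: one down-sampling step of the encoder."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    """Nearest-neighbour 2x up-sampling: one decoder step."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

img = np.random.default_rng(0).random((64, 64))
e1 = down(img)               # 32x32 encoder feature map
e2 = down(e1)                # 16x16 bottleneck
d1 = up(e2)                  # decoded back to 32x32
skip = np.stack([d1, e1])    # skip connection: pair decoder with encoder map
out = up(skip.mean(axis=0))  # final 64x64 output at input resolution
```

The skip connections are what let the decoder recover fine spatial detail lost during down-sampling, which matters for localizing small defects.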
Due to the architecture of convolutional networks and the oftentimes unpredictable orientation and geometry of products on the conveyor belt, the features characterizing each inspected product must be scale and rotation invariant. Several image features exist for such cases, such as Harris corners, SIFT, and so on. In addition, features related to the texture of products, as in the case of fabric or ceramic defect inspection, are called for. To this end, central image moments (from which the Hu moments are built) can be used. For any image f(x, y), the central moment of order p+q is defined as:

μ_pq = Σ_x Σ_y (x − x̄)^p (y − ȳ)^q f(x, y)

where p and q are non-negative integers and (x̄, ȳ) is the image centroid. These moments characterize each image, are invariant to translation, and are computationally cheap. By using centralized moments, additional features in overlapping regions of the image at various scales can be extracted and fed into the classifier or grading algorithm to improve the accuracy of the automated defect inspection process developed by RSIP Vision.
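The central-moment definition and its translation invariance can be checked numerically with a short sketch (a bright rectangle and its shifted copy yield the same μ_20):

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq = sum over pixels of
    (x - xbar)^p * (y - ybar)^q * f(x, y)."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xbar, ybar = (x * img).sum() / m00, (y * img).sum() / m00
    return ((x - xbar)**p * (y - ybar)**q * img).sum()

img = np.zeros((20, 20))
img[5:9, 5:12] = 1.0   # a bright rectangle
shifted = np.roll(np.roll(img, 3, axis=0), 4, axis=1)
a = central_moment(img, 2, 0)
b = central_moment(shifted, 2, 0)   # identical: translation invariant
```

Scale and rotation invariance additionally require normalizing these moments and combining them into the Hu invariants.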
Deep learning architectures employed for automated inspection are expected to reach almost all domains of production. Soft classifiers for defects, such as those offered by machine learning, are particularly adequate for cases in which large variability is present in the sensory information used for inspection, grading and sorting. However, the choice of the right features has a major impact on the success and accuracy of all classifiers, and features should be tailored to each product separately, according to the margin of acceptable defects.
At RSIP Vision we have been developing cutting-edge image processing, computer vision and machine learning algorithms for several decades. We construct tailor-made solutions for our clients around the world, utilizing the expertise of dedicated engineers committed to delivering the highest standards of quality. To learn more about RSIP Vision’s activity, please visit our latest projects page.