Follow HackerEarth | Machine Learning Blog on Feedspot

Continue with Google
Continue with Facebook


In the previous blog, Introduction to Object detection, we learned the basics of object detection. We also got an overview of the YOLO (You Look Only Once algorithm). In this blog, we will extend our learning and will dive deeper into the YOLO algorithm. We will learn topics such as intersection over area metrics, non maximal suppression, multiple object detection, anchor boxes, etc. Finally, we will build an object detection detection system for a self-driving car using the YOLO algorithm. We will be using the Berkeley driving dataset to train our model.

Data Preprocessing

Before, we get into building the various components of the object detection model, we will perform some preprocessing steps. The preprocessing steps involve resizing the images (according to the input shape accepted by the model) and converting the box coordinates into the appropriate form. Since we will be building a object detection for a self-driving car, we will be detecting and localizing eight different classes. These classes are ‘bike’, ‘bus’, ‘car’, ‘motor’, ‘person’, ‘rider’, ‘train’, and ‘truck’. Therefore, our target variable will be defined as:


\hat{y} ={
{p_c}& {b_x} & {b_y} & {b_h} & {b_w} & {c_1} & {c_2} & … & {c_8}

pc : Probability/confidence of an object being present in the bounding box

bx, by : coordinates of the center of the bounding box

bw : width of the bounding box w.r.t the image width

bh : height of the bounding box w.r.t the image height

ci = Probability of the ith class

But since the box coordinates provided in the dataset are in the following format: xmin, ymin, xmax, ymax (see Fig 1.), we need to convert them according to the target variable defined above. This can be implemented as follows:

W : width of the original image
H : height of the original image

b_x = \frac{(x_{min} + x_{max})}{2 * W}\ , \ b_y = \frac{(y_{min} + y_{max})}{2 * H} \\
b_w = \frac{(x_{max} – x_{min})}{2 * W}\ , \ b_y = \frac{(y_{max} + y_{min})}{2 * W}

Fig 1. Bounding Box Coordinates in the target variable

def process_data(images, boxes=None):
    Process the data
    images = [PIL.Image.fromarray(i) for i in images]
    orig_size = np.array([images[0].width, images[0].height])
    orig_size = np.expand_dims(orig_size, axis=0)
    #Image preprocessing 
    processed_images = [i.resize((416, 416), PIL.Image.BICUBIC) for i in images]
    processed_images = [np.array(image, dtype=np.float) for image in processed_images]
    processed_images = [image/255. for image in processed_images]
    if boxes is not None:
        # Box preprocessing
        # Original boxes stored as as 1D list of class, x_min, y_min, x_max, y_max
        boxes = [box.reshape((-1, 5)) for box in boxes]
        # Get extents as y_min, x_min, y_max, x_max, class for comparison with 
        # model output
        box_extents = [box[:, [2,1,4,3,0]] for box in boxes]
        # Get box parameters as x_center, y_center, box_width, box_height, class.
        boxes_xy = [0.5* (box[:, 3:5] + box[:, 1:3]) for box in boxes]
        boxes_wh = [box[:, 3:5] - box[:, 1:3] for box in boxes]
        boxes_xy = [box_xy / orig_size for box_xy in boxes_xy]
        boxes_wh = [box_wh / orig_size for box_wh in boxes_wh]
        boxes = [np.concatenate((boxes_xy[i], boxes_wh[i], box[:, 0:1]), axis=-1) for i, box in enumerate(boxes)]
        # find the max number of boxes 
        max_boxes = 0
        for boxz in boxes:
            if boxz.shape[0] > max_boxes:
                max_boxes = boxz.shape[0]
        # add zero pad for training 
        for i, boxz in enumerate(boxes):
            if boxz.shape[0] <  max_boxes:
                zero_padding = np.zeros((max_boxes - boxz.shape[0], 5), dtype=np.float32)
                boxes[i] = np.vstack((boxz, zero_padding))
        return np.array(processed_images), np.array(boxes)
        return np.array(processed_images)
Intersection Over Union

Intersection over Union (IoU) is an evaluation metric that is used to measure the accuracy of an object detection algorithm. Generally, IoU is a measure of the overlap between two bounding boxes. To calculate this metric, we need:

  • The ground truth bounding boxes (i.e. the hand labeled bounding boxes)
  • The predicted bounding boxes from the model

Intersection over Union is the ratio of the area of intersection over the union area occupied by the ground truth bounding box and the predicted bounding box. Fig. 9 shows the IoU calculation for different bounding box scenarios.

Intersection over Union is the ratio of the area of intersection over the union area occupied by the ground truth bounding box and the predicted bounding box. Fig. 2 shows the IoU calculation for different bounding box scenarios.

Fig 2. Intersection over Union computation for different bounding boxes.

Now, that we have a better understanding of the metric, let’s code it.

def IoU(box1, box2):
    Returns the Intersection over Union (IoU) between box1 and box2
    box1: coordinates: (x1, y1, x2, y2)
    box2: coordinates: (x1, y1, x2, y2)
    # Calculate the intersection area of the two boxes. 
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    area_of_intersection = (xi2 - xi1) * (yi2 - yi1)
    # Calculate the union area of the two boxes
    # A U B = A + B - A ∩ B
    A = (box1[2] - box1[0]) * (box1[3] - box1[1])
    B = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = A + B - area_of_intersection
    intersection_over_union = area_of_intersection/ union_area
    return intersection_over_union

Defining the Model

Instead of building the model from scratch, we will be using a pre-trained network and applying transfer learning to create our final model. You only look once (YOLO) is a state-of-the-art, real-time object detection system, which has a mAP on VOC 2007 of 78.6% and a mAP of 48.1% on the COCO test-dev. YOLO applies a single neural network to the full image. This network divides the image into regions and predicts the bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

One of the advantages of YOLO is that it looks at the whole image during the test time, so its predictions are informed by global context in the image. Unlike R-CNN, which requires thousands of networks for a single image, YOLO makes predictions with a single network. This makes this algorithm extremely fast, over 1000x faster than R-CNN and 100x faster than Fast R-CNN.

Loss Function

If the target variable $# y $#  is defined as

y ={
{p_c}& {b_x} & {b_y} & {b_h} & {b_w} & {c_1} & {c_2} & {…} & {c_8}
\end{bmatrix}}^T \\
& {y_1}& {y_2} & {y_3} & {y_4} & {y_5} & {y_6} & {y_7} & {…} & {y_{13}}

the loss function for object localization is defined as

\mathcal{L(\hat{y}, y)} =
(\hat{y_1} – y_1)^2 + (\hat{y_2} – y_2)^2 + … + (\hat{y_{13}} – y_{13})^2 &&, y_1=1 \\
(\hat{y_1} – y_1)^2 &&, y_1=0

The loss function in case of the YOLO algorithm is calculated using the following steps:

  • Find the bounding boxes with the highest IoU with the true bounding boxes
  • Calculate the confidence loss (the probability of object being present inside the bounding box)
  • Calculate the classification loss (the probability of class present inside the bounding box)
  • Calculate the coordinate loss for the matching detected boxes.
  • Total loss is the sum of the confidence loss, classification loss, and coordinate loss.

Using the steps defined above, let’s calculate the loss function for the YOLO algorithm.

In general, the target variable is defined as

y ={
{p_i(c)}& {x_i} & {y_i} & {h_i} & {w_i} & {C_i}


pi(c) : Probability/confidence of an object being present in the bounding box.
xi, yi : coordinates of the center of the bounding box.
wi : width of the bounding box w.r.t the image width.
hi : height of the bounding box w.r.t the image height.
Ci = Probability of the ith class.

then the corresponding loss function is calculated as


The above equation represents the yolo loss function. The equation may seem daunting at first, but on having a closer look we can see it is the sum of the coordinate loss, the classification loss, and the confidence loss in that order. We use sum of squared errors because it is easy to optimize. However, it weights the localization error equally with classification error which may not be ideal. To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this.

Note that the loss function only penalizes classification error if an object is present in that grid cell. It also penalizes the bounding box coordinate error if that predictor is responsible for the ground truth box (i.e which has the highest IOU of any predictor in that grid cell).

Model Architecture

The YOLO model has the following architecture (see Fig 3). The network has 24 convolutional layers followed by two fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. The convolutional layers are pretrained on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

Fig 3. The YOLO Archtecture (Image taken from the official YOLO paper)

We will be using pre trained YOLOv2 model, which has been trained on the COCO image dataset with classes similar to the Berkeley Driving Dataset. So, we will use the YOLOv2 pretrained network as a feature extractor. We will load the pretrained weights of the YOLOv2 model and will freeze all the weights except for the last layer during training of the model. We will remove the last convolutional layer of the YOLOv2 model and replace it with a new convolutional layer indicating the number of classes (8 classes as defined earlier) to be predicted. This is implemented in the following code.

def create_model(anchors, class_names, load_pretrained=True, freeze_body = True):
    load_pretrained: whether or not to load the pretrained model or initialize all weights

    freeze_body: whether or not to freeze all weights except for the last layer
    model_body : YOLOv2 with new output layer
    model : YOLOv2 with custom loss Lambda layer  
    detector_mask_shape = (13, 13, 5, 1)
    matching_boxes_shape = (13, 13, 5, 5)
    # Create model input layers 
    image_input = Input(shape=(416,416,3))
    boxes_input = Input(shape=(None, 5))
    detector_mask_input = Input(shape=detector_mask_shape)
    matching_boxes_input = Input(shape=matching_boxes_shape)
    # Create model body
    yolo_model = yolo_body(image_input, len(anchors), len(class_names))
    topless_yolo = Model(yolo_model.input, yolo_model.layers[-2].output)
    if load_pretrained == True:
        # Save topless yolo
        topless_yolo_path = os.path.join('model_data', 'yolo_topless.h5')
        if not os.path.exists(topless_yolo_path):
            print('Creating Topless weights file')
            yolo_path = os.path.join('model_data', 'yolo.h5')
            model_body = load_model(yolo_path)
            model_body = Model(model_body.inputs, model_body.layers[-2].output)
    if freeze_body:
        for layer in topless_yolo.layers:
            layer.trainable = False
    final_layer = Conv2D(len(anchors)*(5 + len(class_names)), (1, 1), activation='linear')(topless_yolo.output)
    model_body = Model(image_input, final_layer)
    # Place model loss on CPU to reduce GPU memory usage.    
    with tf.device('/cpu:0'):
        model_loss = Lambda(yolo_loss, output_shape=(1,), name='yolo_loss', arguments={
            'anchors': anchors, 
            'num_classes': len(class_names)})([model_body.output, boxes_input, detector_mask_input, matching_boxes_input])
    model = Model([model_body.input, boxes_input, detector_mask_input, matching_boxes_input], model_loss)
    return model_body, model

Due to limited computational power, we used only the first 1000 images present in the training dataset to train the model. Finally, we trained the model for 20 epochs and saved the model weights with the lowest loss.

Tackling Multiple Detection Threshold Filtering

The YOLO object detection algorithm will predict multiple overlapping bounding boxes for a given image. As not all bounding boxes contain the object to be classified (e.g. pedestrian, bike, car or truck) or detected, we need to filter out those bounding boxes that don’t contain the target object. To implement this, we monitor the value of pc, i.e., the probability or confidence of an object (i.e. the four classes) being present in the bounding box. If the value of pc is less than the threshold value, then we filter out that bounding box from the predicted bounding boxes. This threshold may vary from model to model and serve as a hyper-parameter for the model.

If predicted target variable is defined as:

\hat{y} ={
{p_c}& {b_x} & {b_y} & {b_h} & {b_w} & {c_1} & {c_2} & … & {c_8}

then discard all bounding boxes where the value of pc < threshold value. The following code implements this approach.

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.6):
    Filters YOLO boxes by thresholding on object and class confidence
    box_confidence: Probability of the box containing the object
    boxes: The box parameters : (x, y, h, w) 
           x, y -> Center of the box 
           h, w -> Height and width of the box w.r.t the image size
    box_class_probs: Probability of all the classes for each box
    threshold: Threshold value for box confidence
    scores: containing the class probability score for the selected boxes
    boxes: contains box coordinates for the selected boxes
    classes: contains the index of the class detected by the selected boxes
    # Compute the box scores: 
    box_scores = box_confidence * box_class_probs
    # Find the box classes index with the maximum box score
    box_classes = K.argmax(box_scores)
    # Find the box classes with maximum box score
    box_class_scores = K.max(box_scores, axis=-1)
    # Creating a mask for selecting the boxes that have box score greater than threshold
    thresh_mask = box_class_scores >= threshold
    # Selecting the scores, boxes and classes with box score greater than 
    # threshold by filtering the box score with the help of thresh_mask.
    scores = tf.boolean_mask(tensor=box_class_scores, mask=thresh_mask)
    classes = tf.boolean_mask(tensor=box_classes, mask=thresh_mask)
    boxes = tf.boolean_mask(tensor=boxes, mask=thresh_mask)
    return scores, classes, boxes

Non-max Suppression

Even after filtering by thresholding over the classes score, we may still end up with a lot of overlapping bounding boxes. This is because the YOLO algorithm may detect an object multiple times, which is one of its drawbacks. A second filter called non-maximal suppression (NMS) is used to remove duplicate detections of an object. Non-max suppression uses ‘Intersection over Union’ (IoU) to fix multiple detections.

Non-maximal suppression is implemented as follows:

  • Find the box confidence (pc) (Probability of the box containing the object) for each detection.
  • Pick the bounding box with the maximum box confidence. Output this box as prediction.
  • Discard any remaining bounding boxes which have an IoU greater than 0.5 with the bounding box selected as output in the previous step i.e. any bounding box with high overlap is discarded.

In case there are multiple classes/ objects, i.e., if there are four objects/classes, then non-max suppression will run four times, once for every output class.

Anchor boxes

One of the drawbacks of YOLO algorithm is that each grid can only detect one object. What if we want to detect multiple distinct objects in each grid. For example, if two objects or classes are overlapping and share the same grid as shown in the image (see Fig 4.),

Fig 4. Two Overlapping bounding boxes with two overlapping classes.

We make use of anchor boxes to tackle the issue. Let’s assume the predicted variable is defined as

\hat{y} ={
{p_c}& {b_x} & {b_y} & {b_h} & {b_w} & {c_1} & {c_2} & {…} & {c_8}

then, we can use two anchor boxes in the following manner to detect two objects in the image simultaneously.

Fig 5. Target variable with two bounding boxes

Earlier, the target variable was defined such that each object in the training image is assigned to grid cell that contains that object’s midpoint. Now, with two anchor boxes, each object in the training images is assigned to a grid cell that contains the object’s midpoint and anchor box for the grid cell with the highest IOU. So, with the help of two anchor boxes, we can detect at most two objects simultaneously in an image. Fig 6. shows the shape of the final output layer with and without the use of anchor boxes.

Fig 6. Shape of the output layer with two anchor boxes

Although, we can detect multiple images using Anchor boxes, but they still have limitations. For example, if there are two anchor boxes defined in the target variable and the image has three overlapping objects, then the algorithm fails to detect all three objects. Secondly, if two anchor boxes are associated with two objects but have the same midpoint in the box coordinates, then the algorithm fails to differentiate between the objects. Now, that we know the basics of anchor boxes, let’s code it.

In the following code we will use 10 anchor boxes. As a result, the algorithm can detect at maximum of 10 objects in a given image.

def non_max_suppression(scores, classes, boxes, max_boxes=10, iou_threshold = 0.5):
    Non-maximal suppression is used to fix the multiple detections of the same object.
    - Find the box_confidence (Probability of the box containing the object) for each detection.
    - Find the bounding box with the highest box_confidence
    - Suppress all the bounding boxes which have an IoU greater than 0.5 with the bounding box with the maximum box confidence.
    scores    -> containing the class probability score for the selected boxes.
    boxes     -> contains box coordinates for the boxes selected after threshold masking.
    classes   -> contains the index of the classes detected by the selected boxes.
    max_boxes -> maximum number of predicted boxes to be returned after NMS filtering.
    scores  -> predicted score for each box.
    classes -> predicted class for each box.
    boxes   -> predicted box..
Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Humans can easily detect and identify objects present in an image. The human visual system is fast and accurate and can perform complex tasks like identifying multiple objects and detect obstacles with little conscious thought. With the availability of large amounts of data, faster GPUs, and better algorithms, we can now easily train computers to detect and classify multiple objects within an image with high accuracy. In this blog, we will explore terms such as object detection, object localization, loss function for object detection and localization, and finally explore an object detection algorithm known as “You only look once” (YOLO).

Object Localization

An image classification or image recognition model simply detect the probability of an object in an image. In contrast to this, object localization refers to identifying the location of an object in the image. An object localization algorithm will output the coordinates of the location of an object with respect to the image. In computer vision, the most popular way to localize an object in an image is to represent its location with the help of bounding boxes. Fig. 1 shows an example of a bounding box.

Fig 1. Bounding box representation used for object localization

A bounding box can be initialized using the following parameters:

  • bx, by : coordinates of the center of the bounding box
  • bw : width of the bounding box w.r.t the image width
  • bh : height of the bounding box w.r.t the image height
Defining the target variable

The target variable for a multi-class image classification problem is defined as:

\(\hat{y} = {c_i}\)where,
$#\smash{c_i}$# = Probability of the $#i_{th}$# class.
For example, if there are four classes, the target variable is defined as

y =
{c_1} & \\
{c_2} & \\
{c_3} & \\
We can extend this approach to define the target variable for object localization. The target variable is defined as
y =
{p_c} & \\
{b_x} & \\
{b_y} & \\
{b_h} & \\
{b_w} & \\
{c_1} & \\
{c_2} & \\
{c_3} & \\
$#\smash{p_c}$# = Probability/confidence of an object (i.e the four classes) being present in the bounding box.
$#\smash{b_x, b_y, b_h, b_w}$# = Bounding box coordinates.
$#\smash{c_i}$# = Probability of the $#\smash{i_{th}}$# class the object belongs to.

For example, the four classes be ‘truck’, ‘car’, ‘bike’, ‘pedestrian’ and their probabilities are represented as  $#c_1, c_2, c_3, c_4$#.  So,

p_c =
1,\ \ c_i: \{c_1, c_2, c_3, c_4\} && \\
0,\ \ otherwise

Loss Function

Let the values of the target variable $#y$# are represented as $#y_1$#, $#y_2$#, $#…,\ y_9$#.

y ={
{p_c}& {b_x} & {b_y} & {b_h} & {b_w} & {c_1} & {c_2} & {c_3} & {c_4}
\end{bmatrix}}^T \\
& {y_1}& {y_2} & {y_3} & {y_4} & {y_5} & {y_6} & {y_7} & {y_8} & {y_9}

The loss function for object localization will be defined as

\mathcal{L(\hat{y}, y)} =
(\hat{y_1} – y_1)^2 + (\hat{y_8} – y_8)^2 + … + (\hat{y_9} – y_9)^2 &&, y_1=1 \\
(\hat{y_1} – y_1)^2 &&, y_1=0

In practice, we can use a log function considering the softmax output in case of the predicted classes ($#c_1, c_2, c_3, c_4$#). While for the bounding box coordinates, we can use something like a squared error and for $#p_c$# (confidence of object) we can use logistic regression loss.

Since we have defined both the target variable and the loss function, we can now use neural networks to both classify and localize objects.

Object Detection

An approach to building an object detection is to first build a classifier that can classify closely cropped images of an object. Fig 2. shows an example of such a model, where a model is trained on a dataset of closely cropped images of a car and the model predicts the probability of an image being a car.

Fig 2. Image classification of cars

Now, we can use this model to detect cars using a sliding window mechanism. In a sliding window mechanism, we use a sliding window (similar to the one used in convolutional networks) and crop a part of the image in each slide. The size of the crop is the same as the size of the sliding window. Each cropped image is then passed to a ConvNet model (similar to the one shown in Fig 2.), which in turn predicts the probability of the cropped image is a car.

Fig 3. Sliding windows mechanism

After running the sliding window through the whole image, we resize the sliding window and run it again over the image again. We repeat this process multiple times. Since we crop through a number of images and pass it through the ConvNet, this approach is both computationally expensive and time-consuming, making the whole process really slow. Convolutional implementation of the sliding window helps resolve this problem.

Convolutional implementation of sliding windows

Before we discuss the implementation of the sliding window using convents, let’s analyze how we can convert the fully connected layers of the network into convolutional layers. Fig. 4 shows a simple convolutional network with two fully connected layers each of shape (400, ).

Fig 4. Sliding windows mechanism

A fully connected layer can be converted to a convolutional layer with the help of a 1D convolutional layer. The width and height of this layer are equal to one and the number of filters are equal to the shape of the fully connected layer. An example of this is shown in Fig 5.

Fig 5. Converting a fully connected layer into a convolutional layer

We can apply this concept of conversion of a fully connected layer into a convolutional layer to the model by replacing the fully connected layer with a 1-D convolutional layer. The number of the filters of the 1D convolutional layer is equal to the shape of the fully connected layer. This representation is shown in Fig 6. Also, the output softmax layer is also a convolutional layer of shape (1, 1, 4), where 4 is the number of classes to predict.

Fig 6. Convolutional representation of fully connected layers.

Now, let’s extend the above approach to implement a convolutional version of sliding window. First, let’s consider the ConvNet that we have trained to be in the following representation (no fully connected layers).

Let’s assume the size of the input image to be 16 × 16 × 3. If we’re to use a sliding window approach, then we would have passed this image to the above ConvNet four times, where each time the sliding window crops a part of the input image of size 14 × 14 × 3 and pass it through the ConvNet. But instead of this, we feed the full image (with shape 16 × 16 × 3) directly into the trained ConvNet (see Fig. 7). This results in an output matrix of shape 2 × 2 × 4. Each cell in the output matrix represents the result of a possible crop and the classified value of the cropped image. For example, the left cell of the output (the green one) in Fig. 7 represents the result of the first sliding window. The other cells represent the results of the remaining sliding window operations.

Fig 7. Convolutional implementation of the sliding window

Note that the stride of the sliding window is decided by the number of filters used in the Max Pool layer. In the example above, the Max Pool layer has two filters, and as a result, the sliding window moves with a stride of two resulting in four possible outputs. The main advantage of using this technique is that the sliding window runs and computes all values simultaneously. Consequently, this technique is really fast. Although a weakness of this technique is that the position of the bounding boxes is not very accurate.

The YOLO (You Only Look Once) Algorithm

A better algorithm that tackles the issue of predicting accurate bounding boxes while using the convolutional sliding window technique is the YOLO algorithm. YOLO stands for you only look once and was developed in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. It’s popular because it achieves high accuracy while running in real time. This algorithm is called so because it requires only one forward propagation pass through the network to make the predictions.

The algorithm divides the image into grids and runs the image classification and localization algorithm (discussed under object localization) on each of the grid cells. For example, we have an input image of size 256 × 256. We place a 3 × 3 grid on the image (see Fig. 8).

Fig. 8 Grid (3 x 3) representation of the image

Next, we apply the image classification and localization algorithm on each grid cell. For each grid cell, the target variable is defined as

y_{i, j} ={
{p_c}& {b_x} & {b_y} & {b_h} & {b_w} & {c_1} & {c_2} & {c_3} & {c_4}

Do everything once with the convolution sliding window. Since the shape of the target variable for each grid cell is 1 × 9 and there are 9 (3 × 3) grid cells, the final output of the model will be:

The advantages of the YOLO algorithm is that it is very fast and predicts much more accurate bounding boxes. Also, in practice to get more accurate predictions, we use a much finer grid, say 19 × 19, in which case the target output is of the shape 19 × 19 × 9.


With this, we come to the end of the introduction to object detection. We now have a better understanding of how we can localize objects while classifying them in an image. We also learned to combine the concept of classification and localization with the convolutional implementation of the sliding window to build an object detection system. In the next blog, we will go deeper into the YOLO algorithm, loss function used, and implement some ideas that make the YOLO algorithm better. Also, we will learn to implement the YOLO algorithm in real time.

Have anything to say? Feel free to comment below for any questions, suggestions, and discussions related to this article. Till then, keep hacking with HackerEarth.

The post Introduction to Object Detection appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Bonjour! Welcome to another part of the series on data visualization techniques. In the previous two articles, we discussed different data visualization techniques that can be applied to visualize and gather insights from categorical and continuous variables. You can check out the first two articles here:

In this article, we’ll go through the implementation and use of a bunch of data visualization techniques such as heat maps, surface plots, correlation plots, etc. We will also look at different techniques that can be used to visualize unstructured data such as images, text, etc.

### Importing the required libraries   
 import pandas as pd   
 import numpy as np  
 import seaborn as sns   
 import matplotlib.pyplot as plt   
 import plotly.plotly as py  
 import plotly.graph_objs as go  
 %matplotlib inline


A heat map(or heatmap) is a two-dimensional graphical representation of the data which uses colour to represent data points on the graph. It is useful in understanding underlying relationships between data values that would be much harder to understand if presented numerically in a table/ matrix.

### We can create a heatmap by simply using the seaborn library.   
 sample_data = np.random.rand(8, 12)  
 ax = sns.heatmap(sample_data)

Fig 1. Heatmap using the seaborn library

Let’s understand this using an example. We’ll be using the metadata from Deep Learning 3 challenge. Link to the dataset. Deep Learning 3 challenged the participants to predict the attributes of animals by looking at their images.

### Training metadata contains the name of the image and the corresponding attributes associated with the animal in the image.  
 train = pd.read_csv('meta-data/train.csv')  

We will be analyzing how often an attribute occurs in relationship with the other attributes. To analyze this relationship, we will compute the co-occurrence matrix.

### Extracting the attributes  
 cols = list(train.columns)  
 attributes = np.array(train[cols])  
 print('There are {} attributes associated with {} images.'.format(attributes.shape[1],attributes.shape[0]))

Out: There are 85 attributes associated with 12,600 images.

# Compute the co-occurrence matrix  
 cooccurrence_matrix = np.dot(attributes.transpose(), attributes)  
 print('\n Co-occurrence matrix: \n', cooccurrence_matrix)  
 Out: Co-occurrence matrix:   
  [[5091 728 797 ... 3797 728 2024]  
  [ 728 1614  0 ... 669 1614 1003]  
  [ 797  0 1188 ... 1188  0 359]  
  [3797 669 1188 ... 8305 743 3629]  
  [ 728 1614  0 ... 743 1933 1322]  
  [2024 1003 359 ... 3629 1322 6227]]

# Normalizing the co-occurrence matrix, by converting the values into a matrix  
 # Compute the co-occurrence matrix in percentage  
 cooccurrence_matrix_diagonal = np.diagonal(cooccurrence_matrix)  
 with np.errstate(divide = 'ignore', invalid='ignore'):  
   cooccurrence_matrix_percentage = np.nan_to_num(np.true_divide(cooccurrence_matrix, cooccurrence_matrix_diagonal))  
 print('\n Co-occurrence matrix percentage: \n', cooccurrence_matrix_percentage)

We can see that the values in the co-occurrence matrix represent the occurrence of each attribute with the other attributes. Although the matrix contains all the information, it is visually hard to interpret and infer from the matrix. To counter this problem, we will use heat maps, which can help relate the co-occurrences graphically.

fig = plt.figure(figsize=(10, 10))  
 # Draw the heatmap with the mask and correct aspect ratio   
 ax = sns.heatmap(cooccurrence_matrix_percentage, cmap='viridis', center=0, square=True, linewidths=0.15, cbar_kws={"shrink": 0.5, "label": "Co-occurrence frequency"}, )  
 ax.set_title('Heatmap of the attributes')  

Fig 2. Heatmap of the co-occurrence matrix indicating the frequency of occurrence of one attribute with other

Since the frequency of the co-occurrence is represented by a colour pallet, we can now easily interpret which attributes appear together the most. Thus, we can infer that these attributes are common to most of the animals.


Choropleths are a type of map that provides an easy way to show how some quantity varies across a geographical area or show the level of variability within a region. A heat map is similar but doesn’t include geographical boundaries. Choropleth maps are also appropriate for indicating differences in the distribution of the data over an area, like ownership or use of land or type of forest cover, density information, etc. We will be using the geopandas library to implement the choropleth graph.

We will be using choropleth graph to visualize the GDP across the globe. Link to the dataset.

# Importing the required libraries  
 import geopandas as gpd   
 from shapely.geometry import Point  
 from matplotlib import cm  
 # GDP mapped to the corresponding country and their acronyms  
 df =pd.read_csv('GDP.csv')  

0 Afghanistan 21.71 AFG
1 Albania 13.40 ALB
2 Algeria 227.80 DZA
3 American Samoa 0.75 ASM
4 Andorra 4.80 AND

### Importing the geometry locations of each country on the world map  
 geo = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))[['iso_a3', 'geometry']]  
 geo.columns = ['CODE', 'Geometry']  

# Mapping the country codes to the geometry locations  
 df = pd.merge(df, geo, left_on='CODE', right_on='CODE', how='inner')  
 #converting the dataframe to geo-dataframe  
 geometry = df['Geometry']  
 df.drop(['Geometry'], axis=1, inplace=True)  
 crs = {'init':'epsg:4326'}  
 geo_gdp = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)  
 ## Plotting the choropleth  
 cpleth = geo_gdp.plot(column='GDP (BILLIONS)', cmap=cm.Spectral_r, legend=True, figsize=(8,8))  
 cpleth.set_title('Choropleth Graph - GDP of different countries')

Fig 3. Choropleth graph indicating the GDP according to geographical locations

Surface plot

Surface plots are used for the three-dimensional representation of the data. Rather than showing individual data points, surface plots show a functional relationship between a dependent variable (Z) and two independent variables (X and Y).

It is useful in analyzing relationships between the dependent and the independent variables and thus helps in establishing desirable responses and operating conditions.

from mpl_toolkits.mplot3d import Axes3D  
 from matplotlib.ticker import LinearLocator, FormatStrFormatter  
 # Creating a figure  
 # projection = '3d' enables the third dimension during plot  
 fig = plt.figure(figsize=(10,8))  
 ax = fig.gca(projection='3d')  
 # Initialize data   
 X = np.arange(-5,5,0.25)  
 Y = np.arange(-5,5,0.25)  
 # Creating a meshgrid  
 X, Y = np.meshgrid(X, Y)  
 R = np.sqrt(np.abs(X**2 - Y**2))  
 Z = np.exp(R)  
 # plot the surface   
 surf = ax.plot_surface(X, Y, Z, cmap=cm.GnBu, antialiased=False)  
 # Customize the z axis.  
 ax.set_title('Surface Plot')  
 # Add a color bar which maps values to colors.  
 fig.colorbar(surf, shrink=0.5, aspect=5)  

One of the main applications of surface plots in machine learning or data science is the analysis of the loss function. From a surface plot, we can analyze how the hyperparameters affect the loss function and thus help prevent overfitting of the model.

Fig 4. Surface plot visualizing the dependent variable w.r.t the independent variables in 3-dimensions

Visualizing high-dimensional datasets

Dimensionality refers to the number of attributes present in the dataset. For example, consumer-retail datasets can have a vast amount of variables (e.g. sales, promos, products, open, etc.). As a result, visually exploring the dataset to find potential correlations between variables becomes extremely challenging.

Therefore, we use a technique called dimensionality reduction to visualize higher dimensional datasets. Here, we will focus on two such techniques :

  • Principal Component Analysis (PCA)
  • T-distributed Stochastic Neighbor Embedding (t-SNE)
Principal Component Analysis (PCA)

Before we jump into understanding PCA, let’s review some terms:

  • Variance: Variance is simply the measure of the spread or extent of the data. Mathematically, it is the average squared deviation from the mean position.
  • Covariance: Covariance is the measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. It is the measure of how two random variables vary together. It is similar to variance, but where variance tells you the extent of one variable, covariance tells you the extent to which the two variables vary together. Mathematically, it is defined as:

A positive covariance means X and Y are positively related, i.e., if X increases, Y increases, while negative covariance means the opposite relation. However, zero variance means X and Y are not related.

Fig 5. Different types of covariance

PCA is the orthogonal projection of data onto a lower-dimension linear space that maximizes variance (green line) of the projected data and minimizes the mean squared distance between the data point and the projects (blue line). The variance describes the direction of maximum information while the mean squared distance describes the information lost during projection of the data onto the lower dimension.

Thus, given a set of data points in a d-dimensional space, PCA projects these points onto a lower dimensional space while preserving as much information as possible.

Fig 6. Illustration of principal component analysis

In the figure, the component along the direction of maximum variance is defined as the first principal axis. Similarly, the component along the direction of second maximum variance is defined as the second principal component, and so on. These principal components are referred to the new dimensions carrying the maximum information.

# We will use the breast cancer dataset as an example  
 # The dataset is a binary classification dataset  
 # Importing the dataset  
 from sklearn.datasets import load_breast_cancer  
 data = load_breast_cancer()  
 X = pd.DataFrame(data=data.data, columns=data.feature_names) # Features   
 y = data.target # Target variable   
 # Importing PCA function  
 from sklearn.decomposition import PCA  
 pca = PCA(n_components=2) # n_components = number of principal components to generate  
 # Generating pca components from the data  
 pca_result = pca.fit_transform(X)  
 print("Explained variance ratio : \n",pca.explained_variance_ratio_)

Out: Explained variance ratio :   
  [0.98204467 0.01617649]

We can see that 98% (approx) variance of the data is along the first principal component, while the second component only expresses 1.6% (approx) of the data.

# Creating a figure   
 fig = plt.figure(1, figsize=(10, 10))  
 # Enabling 3-dimensional projection   
 ax = fig.gca(projection='3d')  
 for i, name in enumerate(data.target_names):  
   ax.text3D(np.std(pca_result[:, 0][y==i])-i*500 ,np.std(pca_result[:, 1][y==i]),0,s=name, horizontalalignment='center', bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))  
 # Plotting the PCA components    
 ax.scatter(pca_result[:,0], pca_result[:, 1], c=y, cmap = plt.cm.Spectral,s=20, label=data.target_names)  

Fig 7. Visualizing the distribution of cancer across the data

Thus, with the help of PCA, we can get a visual perception of how the labels are distributed across given data (see Figure).

T-distributed Stochastic Neighbour Embedding (t-SNE)

T-distributed Stochastic Neighbour Embeddings (t-SNE) is a non-linear dimensionality reduction technique that is well suited for visualization of high-dimensional data. It was developed by Laurens van der Maten and Geoffrey Hinton. In contrast to PCA, which is a mathematical technique, t-SNE adopts a probabilistic approach.

PCA can be used for capturing the global structure of the high-dimensional data but fails to describe the local structure within the data. Whereas, “t-SNE” is capable of capturing the local structure of the high-dimensional data very well while also revealing global structure such as the presence of clusters at several scales. t-SNE converts the similarity between data points to joint probabilities and tries to maximize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embeddings and high-dimension data. In doing so, it preserves the original structure of the data.

# We will be using the scikit learn library to implement t-SNE  
 # Importing the t-SNE library   
 from sklearn.manifold import TSNE  
 # We will be using the iris dataset for this example  
 from sklearn.datasets import load_iris

# Loading the iris dataset   
 data = load_iris()  
 # Extracting the features   
 X = data.data  
 # Extracting the labels   
 y = data.target  
 # There are four features in the iris dataset with three different labels.  
 print('Features in iris data:\n', data.feature_names)  
 print('Labels in iris data:\n', data.target_names)

Out: Features in iris data:  
  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']  
 Labels in iris data:  
  ['setosa' 'versicolor' 'virginica']

# Loading the TSNE model   
 # n_components = number of resultant components   
 # n_iter = Maximum number of iterations for the optimization.  
 tsne_model = TSNE(n_components=3, n_iter=2500, random_state=47)  
 # Generating new components   
 new_values = tsne_model.fit_transform(X)  
 labels = data.target_names  
 # Plotting the new dimensions/ components  
 fig = plt.figure(figsize=(5, 5))  
 ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)  
 for label, name in enumerate(labels):  
   ax.text3D(new_values[y==label, 0].mean(),  
        new_values[y==label, 1].mean() + 1.5,  
        new_values[y==label, 2].mean(), name,  
        bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))  
 ax.scatter(new_values[:,0], new_values[:,1], new_values[:,2], c=y)  
 ax.set_title('High-Dimension data visualization using t-SNE', loc='right')  

Fig 8. Visualizing the feature space of the iris dataset using t-SNE

Thus, by reducing the dimensions using t-SNE, we can visualize the distribution of the labels over the feature space. We can see that in the figure the labels are clustered in their own little group. So, if we’re to use a clustering algorithm to generate clusters using the new features/components, we can accurately assign new points to a label.  


Let’s quickly summarize the topics we covered. We started with the generation of heatmaps using random numbers and extended its application to a real-world example. Next, we implemented choropleth graphs to visualize the data points with respect to geographical locations. We moved on to implement surface plots to get an idea of how we can visualize the data in a three-dimensional surface. Finally, we used two- dimensional reduction techniques, PCA and t-SNE, to visualize high-dimensional datasets.

I encourage you to implement the examples described in this article to get a hands-on experience. Hope you enjoyed the article. Do let me know if you have any feedback, suggestions, or thoughts on this article in the comments below!

The post Data Visualization for Beginners-Part 3 appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Deep Learning is on the rise, extending its application in every field, ranging from computer vision to natural language processing, healthcare, speech recognition, generating art, addition of sound to silent movies, machine translation, advertising, self-driving cars, etc. In this blog, we will extend the power of deep learning to the domain of music production. We will talk about how we can use deep learning to generate new musical beats.

The current technological advancements have transformed the way we produce music, listen, and work with music. With the advent of deep learning, it has now become possible to generate music without the need for working with instruments artists may not have had access to or the skills to use previously. This offers artists more creative freedom and ability to explore different domains of music.

Recurrent Neural Networks

Since music is a sequence of notes and chords, it doesn’t have a fixed dimensionality. Traditional deep neural network techniques cannot be applied to generate music as they assume the inputs and targets/outputs to have fixed dimensionality and outputs to be independent of each other. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.

Recurrent neural networks (RNNs) are a class of artificial neural networks that make use of sequential information present in the data.

Fig. 1 A basic RNN unit.

A recurrent neural network has looped, or recurrent, connections which allow the network to hold information across inputs. These connections can be thought of as memory cells. In other words, RNNs can make use of information learned in the previous time step. As seen in Fig. 1, the output of the previous hidden/activation layer is fed into the next hidden layer. Such an architecture is efficient in learning sequence-based data.

In this blog, we will be using the Long Short-Term Memory (LSTM) architecture. LSTM is a type of recurrent neural network (proposed by Hochreiter and Schmidhuber, 1997) that can remember a piece of information and keep it saved for many timesteps.


Our dataset includes piano tunes stored in the MIDI format. MIDI (Musical Instrument Digital Interface) is a protocol which allows electronic instruments and other digital musical tools to communicate with each other. Since a MIDI file only represents player information, i.e., a series of messages like ‘note on’, ‘note off, it is more compact, easy to modify, and can be adapted to any instrument.

Before we move forward, let us understand some music related terminologies:

  • Note: A note is either a single sound or its representation in notation. Each note consist of pitch, octave, and an offset.
  • Pitch: Pitch refers to the frequency of the sound.
  • Octave: An octave is the interval between one musical pitch and another with half or double its frequency.
  • Offset: Refers to the location of the note.
  • Chord: Playing multiple notes at the same time constitutes a chord.
Data Preprocessing

We will use the music21 toolkit (a toolkit for computer-aided musicology, MIT) to extract data from these MIDI files.

  1. Notes Extraction

    def get_notes():  
       notes = []  
       for file in songs:  
         # converting .mid file to stream object  
         midi = converter.parse(file)  
         notes_to_parse = []  
           # Given a single stream, partition into a part for each unique instrument  
           parts = instrument.partitionByInstrument(midi)  
         if parts: # if parts has instrument parts   
           notes_to_parse = parts.parts[0].recurse()  
           notes_to_parse = midi.flat.notes  
         for element in notes_to_parse:   
           if isinstance(element, note.Note):  
             # if element is a note, extract pitch   
           elif(isinstance(element, chord.Chord)):  
             # if element is a chord, append the normal form of the   
             # chord (a list of integers) to the list of notes.   
             notes.append('.'.join(str(n) for n in element.normalOrder))  
       with open('data/notes', 'wb') as filepath:  
         pickle.dump(notes, filepath)  
       return notes

    The function get_notes returns a list of notes and chords present in the .mid file. We use the converter.parse function to convert the midi file in a stream object, which in turn is used to extract notes and chords present in the file. The list returned by the function get_notes() looks as follows:

       ['F2', '4.5.7', '9.0', 'C3', '5.7.9', '7.0', 'E4', '4.5.8', '4.8', '4.8', '4', 'G#3',  
       'D4', 'G#3', 'C4', '4', 'B3', 'A2', 'E3', 'A3', '0.4', 'D4', '7.11', 'E3', '0.4.7', 'B4', 'C3', 'G3', 'C4', '4.7', '11.2', 'C3', 'C4', '11.2.4', 'G4', 'F2', 'C3', '0.5', '9.0', '4.7', 'F2', '', '4.8', 'F4', '4', '4.8', '2.4', 'G#3',  
      '8.0', 'E2', 'E3', 'B3', 'A2', '4.9', '0.4', '7.11', 'A2', '9.0.4', ...........]

    We can see that the list consists of pitches and chords (represented as a list of integers separated by a dot). We assume each new chord to be a new pitch on the list. As letters are used to generate words in a sentence, similarly the music vocabulary used to generate music is defined by the unique pitches in the notes list.

  2. Generating Input and Output Sequences

    A neural network accepts only real values as input and since the pitches in the notes list are in string format, we need to map each pitch in the notes list to an integer. We can do so as follows:

    # Extract the unique pitches in the list of notes.   
     pitchnames = sorted(set(item for item in notes))  
     # create a dictionary to map pitches to integers  
     note_to_int = dict((note, number) for number, note in enumerate(pitchnames))

    Next, we will create an array of input and output sequences to train our model. Each input sequence will consist of 100 notes, while the output array stores the 101st note for the corresponding input sequence. So, the objective of the model will be to predict the 101st note of the input sequence of notes.

    # create input sequences and the corresponding outputs  
     for i in range(0, len(notes) - sequence_length, 1):  
       sequence_in = notes[i: i + sequence_length]  
       sequence_out = notes[i + sequence_length]  
       network_input.append([note_to_int[char] for char in sequence_in])  

    Next, we reshape and normalize the input vector sequence before feeding it to the model. Finally, we one-hot encode our output vector.

    n_patterns = len(network_input)  
     # reshape the input into a format compatible with LSTM layers   
     network_input = np.reshape(network_input, (n_patterns, sequence_length, 1))  
     # normalize input  
     network_input = network_input / float(n_vocab)  
     # One hot encode the output vector  
     network_output = np_utils.to_categorical(network_output)

Model Architecture

We will use keras to build our model architecture. We use a character level-based architecture to train the model. So each input note in the music file is used to predict the next note in the file, i.e., each LSTM cell takes the previous layer activation (a⟨t−1⟩) and the previous layers actual output (y⟨t−1⟩) as input at the current time step tt. This is depicted in the following figure (Fig 2.).

Fig 2. One to Many LSTM architecture

Our model architecture is defined as:

model = Sequential()  
 model.add(LSTM(128, input_shape=network_in.shape[1:], return_sequences=True))  
 model.add(LSTM(128, return_sequences=True))  
 model.compile(loss='categorical_crossentropy', optimizer='adam')

Our music model consists of two LSTM layers with each layer consisting of 128 hidden layers. We use ‘categorical cross entropy‘ as the loss function and ‘adam‘ as the optimizer. Fig. 3 shows the model summary.

Fig 3. Model summary

Model Training

To train the model, we call the model.fit function with the input and output sequences as the input to the function. We also create a model checkpoint which saves the best model weights.

from keras.callbacks import ModelCheckpoint  
 def train(model, network_input, network_output, epochs):   
   Train the neural network  
   filepath = 'weights.best.music3.hdf5'  
   checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True)  
   model.fit(network_input, network_output, epochs=epochs, batch_size=32, callbacks=[checkpoint])  
 def train_network():  
   epochs = 200  
   notes = get_notes()  
   print('Notes processed')  
   n_vocab = len(set(notes))  
   print('Vocab generated')  
   network_in, network_out = prepare_sequences(notes, n_vocab)  
   print('Input and Output processed')  
   model = create_network(network_in, n_vocab)  
   print('Model created')  
   return model  
   print('Training in progress')  
   train(model, network_in, network_out, epochs)  
   print('Training completed')

The train_network method gets the notes, creates the input and output sequences, creates a model, and trains the model for 200 epochs.

Music Sample Generation

Now that we have trained our model, we can use it to generate some new notes. To generate new notes, we need a starting note. So, we randomly pick an integer and pick a random sequence from the input sequence as a starting point.

def generate_notes(model, network_input, pitchnames, n_vocab):  
   """ Generate notes from the neural network based on a sequence of notes """  
   # Pick a random integer  
   start = np.random.randint(0, len(network_input)-1)  
   int_to_note = dict((number, note) for number, note in enumerate(pitchnames))  
   # pick a random sequence from the input as a starting point for the prediction  
   pattern = network_input[start]  
   prediction_output = []  
   print('Generating notes........')  
   # generate 500 notes  
   for note_index in range(500):  
     prediction_input = np.reshape(pattern, (1, len(pattern), 1))  
     prediction_input = prediction_input / float(n_vocab)  
     prediction = model.predict(prediction_input, verbose=0)  
     # Predicted output is the argmax(P(h|D))  
     index = np.argmax(prediction)  
     # Mapping the predicted interger back to the corresponding note  
     result = int_to_note[index]  
     # Storing the predicted output  
     # Next input to the model  
     pattern = pattern[1:len(pattern)]  
   print('Notes Generated...')  
   return prediction_output

Next, we use the trained model to predict the next 500 notes. At each time step, the output of the previous layer (ŷ⟨t−1⟩) is provided as input (x⟨t⟩) to the LSTM layer at the current time step t. This is depicted in the following figure (see Fig. 4).

Fig 4. Sampling from a trained network.

Since the predicted output is an array of probabilities, we choose the output at the index with the maximum probability. Finally, we map this index to the actual note and add this to the list of predicted output. Since the predicted output is a list of strings of notes and chords, we cannot play it. Hence, we encode the predicted output into the MIDI format using the create_midi method.

### Converts the predicted output to midi format  

To create some new jazz music, you can simply call the generate() method, which calls all the related methods and saves the predicted output as a MIDI file.

#### Generate a new jazz music   
   Initiating music generation process.......  
   Loading Model weights.....  
   Model Loaded  
   Generating notes........  
   Notes Generated...  
   Saving Output file as midi....

To play the generated MIDI in the Jupyter Notebook you can import the play_midi method from the play.py file or use an external MIDI player or convert the MIDI file to the mp3. Let’s listen to our generated jazz piano music.

### Play the Jazz music  

Generated Track 1


Congratulations! You can now generate your own jazz music. You can find the full code in this Github repository. I encourage you to play with the parameters of the model and train the model with input sequences of different sequence lengths. Try to implement the code for some other instrument (such as guitar). Furthermore, such a character-based model can also be applied to a text corpus to generate sample texts, such as a poem.

Also, you can showcase your own personal composer and any similar idea in the World Music Hackathon by HackerEarth.

Have anything to say? Feel free to comment below for any questions, suggestions, and discussions related to this article. Till then, happy coding.

The post Composing Jazz Music with Deep Learning appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Welcome to Part II of the series on data visualization. In the last blog post, we explored different ways to visualize continuous variables and infer information. If you haven’t visited that article, you can find it here. In this blog, we will expand our exploration to categorical variables and investigate ways in which we can visualize and gain insights from them, in isolation and in combination with variables (both categorical and continuous).

Before we dive into the different graphs and plots, let’s define a categorical variable. In statistics, a categorical variable is one which has two or more categories, but there is no intrinsic ordering to them, for example, gender, color, cities, age group, etc. If there is some kind of ordering between the categories, the variables are classified as ordinal variables, for example, if you categorize car prices by cheap, moderate and expensive. Although these are categories, there is a clear ordering between the categories.

# Importing the necessary libraries.  
import numpy as np  
import pandas as pd  
import seaborn as sns  
import matplotlib.pyplot as plt  
%matplotlib inline

We will be using the Adult data set, which is an extraction of the 1994 census dataset. The prediction task is to determine whether a person makes more than 50K a year. Here is the link to the dataset. In this blog, we will be using the dataset only for data analysis.

# Since the dataset doesn't contain the column header, we need to specify it manually.   
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'annual-income']  

# Importing dataset   
data = pd.read_csv('adult dataset/adult.data', names=cols)

# The first five columns of the dataset.   

Bar graph

A bar chart or graph is a graph with rectangular bars or bins that are used to plot categorical values. Each bar in the graph represents a categorical variable and the height of the bar is proportional to the value represented by it.

Bar graphs are used:

  • To make comparisons between variables
  • To visualize any trend in the data, i.e., they show the dependence of one variable on another
  • Estimate values of a variable

# Let's start by visualizing the distribution of gender in the dataset.  
fig, ax = plt.subplots()  
x = data.gender.unique()  
# Counting 'Males' and 'Females' in the dataset  
y = data.gender.value_counts()  
# Plotting the bar graph  
ax.bar(x, y)  

Fig 1. Bar plot showing the distribution of gender in the dataset

From the figure, we can infer that there are more number of males than females in the dataset. Next, we will use the bar graph to visualize the distribution of annual income based on both gender and hours per week (i.e. the number of hours they work per week). 

# For this plot, we will be using the seaborn library as it provides more flexibility with dataframes.   
sns.barplot(data.gender, data['hours-per-week'], hue=data['annual-income'])  

So from the figure above, we can infer that males and females with annual income less than 50K tend to work more per week.


This is a seaborn-specific function which is used to plot the count or frequency distribution of each unique observation in the categorical variable. It is similar to a histogram over a categorical rather than quantitative variable.

So, let’s plot the number of males and females in the dataset using the countplot function.

# Using Countplot to count number of males and females in the dataset.  

Fig 3. Distribution of gender using countplot.

Earlier, we plotted the same thing using a bar graph, and it required some external calculations on our part to do so. But we can do the same thing using the countplot function in just a single line of code. Next, we will see how we can use countplot for deeper insights.

# ‘hue’ is used to visualize the effect of an additional variable to the current distribution.  
sns.countplot(data.gender, hue=data['annual-income'])  

Fig 4. Distribution of gender based on annual income using countplot.

From the figure above, we can count that number of males and females whose annual income is <=50 and > 50K. We can see that the approximate number of

  • Males with annual income <=50K : 15,000
  • Males with annual income > 50K: 7000
  • Females with annual income <=50K: 9000
  • Females with annual income > 50K: 1000

So, we can infer that out of 32,500 (approx) people, only 8000 people have income greater than 50K, out of which only 1000 of them are females.

Box plot

Box plots are widely used in data visualization. Box plots, also known as box and whisker plots are used to visualize variations and compare different categories in a given set of data. It doesn’t display the distribution in detail but is useful in detecting whether a distribution is skewed and detect outliers in the data. In a box and whisker plot:

  • the box spans the interquartile range
  • a vertical line inside the box represents the median
  • two lines outside the box, the whiskers, extending to the highest and the lowest observations represent the possible outliers in the data

Fig 5. Box and whisker plot.

Let’s use a box and whisker plot to find a correlation between ‘hours-per-week’ and ‘relationship’ based on their annual income.

# Creating a box plot  
fig, ax = plt.subplots(figsize=(15, 8))  
sns.boxplot(x='relationship', y='hours-per-week', hue='annual-income', data=data, ax=ax)  
ax.set_title('Annual Income of people based on relationship and hours-per-week')  

Fig 6. Using box plot to visualize how people in different relationships earn based on the number of hours they work per week.

We can interpret some interesting results from the box plot. People with the same relationship status and an annual income more than 50K often work for more hours per week. Similarly, we can also infer that people who have a child and earn less than 50K tend to have more flexible working hours.
Apart from this, we can also detect outliers in the data. For example, people with relationship status ‘Not in family’ (see Fig 6.) and an income less than 50K have a large number of outliers at both the high and low ends. This also seems to be logically correct as a person who earns less than 50K annually may work more or less depending on the type of job and employment status.

Strip plot

Strip plot is a data analysis technique used to plot the sorted values of a variable along one axis. It is used to represent the distribution of a continuous variable with respect to the different levels of a categorical variable. For example, a strip plot can be used to show the distribution of the variable ‘gender’, i.e., males and females, with respect to the number of hours they work each week. A strip plot is also a good complement to a box plot or a violin plot in cases where you want to showcase all the observations along with some representation of the underlying distribution.

# Using Strip plot to visualize the data.  
fig, ax= plt.subplots(figsize=(10, 8))  
sns.stripplot(data['annual-income'], data['hours-per-week'], jitter=True, ax=ax)  
ax.set_title('Strip plot')  

Fig 7. Strip plot showing the distribution of the earnings based on the number of hours they work per week.

In the figure, by looking at the distribution of the data points, we can deduce that most of the people with an annual income greater than 50K work between 40 and 60 hours per week. While those with income less than 50K work can work between 0 and 60 hours per week.

Violin plot

Sometimes the mean and median may not be enough to understand the distribution of the variable in the dataset. The data may be clustered around the maximum or minimum with nothing in the middle. Box plots are a great way to summarize the statistical information related to the distribution of the data (through the interquartile range, mean, median), but they cannot be used to visualize the variations in the distributions.

A violin plot is a combination of a box plot and kernel density function (KDE, described in Part I of this blog series) which can be used to visualize the probability distribution of the data. Violin plots can be interpreted as follows:

  • The outer layer shows the probability distribution of the data points and indicates 95% confidence interval. The thicker the layer, the higher the probability of the data points, and vice-versa.
  • The second layer shows a box plot indicating the interquartile range.
  • The third layer, or the dot, indicates the median of the data.

    Fig 8. Representation of a violin plot.

Let’s now build a violin plot. To start with, we will analyze the distribution of annual income of the people w.r.t. the number of hours they work per week.

fig, ax = plt.subplots(figsize=(10, 8))  
sns.violinplot(x='annual-income', y='hours-per-week', data=data, ax=ax)  
ax.set_title('Violin plot')  

Fig 9. Violin plot showing the distribution of the annual income based on the number of hours they work per week.

In Fig 9, the median number working hours per week is same (40 approximately) for both people earning less than 50K and greater than 50K. Although people earning less than 50K can have a varied range of the hours they spend working per week, most of the people who earn more than 50K work in the range of 40 – 80 hours per week.

Next, we can visualize the same distribution, but this grouping them according to their gender.

# Violin plot  
fig, ax = plt.subplots(figsize=(10, 8))  
sns.violinplot(x='annual-income', y='hours-per-week', hue='gender', data=data, ax=ax)  
ax.set_title('Violin plot grouped according to gender')  

Fig 10. Distribution of annual income based on the number of hours worked per week and gender.

Adding the variable ‘gender’, gives us insights into how much each gender spends working per week based upon their annual income. From the figure, we can infer that males with annual income less than 50K tends to spend more hours working per week than females. But for people earning greater than 50K, both males and females spend an equal amount of hours per week working.

Violin plots, although more informative, are less frequently used in data visualization. It may be because they are hard to grasp and understand at first glance. But their ability to represent the variations in the data are making them popular among machine learning and data enthusiasts.


PairGrid is used to plot the pairwise relationship of all the variables in a dataset. This may seem to be similar to the pairplot we discussed in part I of this series. The difference is that instead of plotting all the plots automatically, as in the case of pairplot, Pair Grid creates a class instance, allowing us to map specific functions to the different sections of the grid.

Let’s start by defining the class.

# Creating an instance of the pair grid plot.  
g = sns.PairGrid(data=data, hue='annual-income')

The variable ‘g’ here is a class instance. If we were to display ‘g’, then we will get a grid of empty plots. There are four grid sections to fill in a Pair Grid: upper triangle, lower triangle, the diagonal, and off-diagonal. To fill all the sections with the same plot, we can simply call ‘g.map’ with the type of plot and plot parameters.

# Creating a scatter plots for all pairs of variables.  
g = sns.PairGrid(data=data, hue='capital-gain')  

Fig 11. Scatter plot between each variable pair in the dataset.

The ‘g.map_lower’ method only fills the lower triangle of the grid while the ‘g.map_upper’ method only fills the upper triangle of the grid. Similarly, ‘g.map_diag’ and ‘g.map_offdiag’ fills the diagonal and off-diagonal of the grid, respectively.

#Here we plot scatter plot, histogram and violin plot using Pair grid.  
g = sns.PairGrid(data=data, vars = ['age', 'education-num', 'hours-per-week'])  
# with the help of the vars parameter we can select the variables between which we want the plot to be constructed.  

g.map_lower(plt.scatter, color='red')  
g.map_diag(plt.hist, bins=15)  

Fig 12. Pair Grid showing different plot between the different pair of variables.

Thus with the help of Pair Grid, we can visualize the relationship between the three variables (‘hours-per-week’, ‘education-num’ and ‘age’) using three different plots all in the same figure. Pair grid comes in handy when visualizing multiple plots in the same figure.


Let’s summarize what we learned. So, we started with visualizing the distribution of categorical variables in isolation. Then, we moved on to visualize the relationship between a categorical and a continuous variable. Finally, we explored visualizing relationships when more than two variables are involved. Next week, we will explore how we can visualize unstructured data. Finally, I encourage you to download the given census data (used in this blog) or any other dataset of your choice and play with all the variations of the plots learned in this blog. Till then, Adiós!

The post Data visualization for beginners – Part 2 appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

This is a series of blogs dedicated to different data visualization techniques used in various domains of machine learning. Data Visualization is a critical step for building a powerful and efficient machine learning model. It helps us to better understand the data, generate better insights for feature engineering, and, finally, make better decisions during modeling and training of the model.

For this blog, we will use the seaborn and matplotlib libraries to generate the visualizations. Matplotlib is a MATLAB-like plotting framework in python, while seaborn is a python visualization library based on matplotlib. It provides a high-level interface for producing statistical graphics. In this blog, we will explore different statistical graphical techniques that can help us in effectively interpreting and understanding the data. Although all the plots using the seaborn library can be built using the matplotlib library, we usually prefer the seaborn library because of its ability to handle DataFrames.

We will start by importing the two libraries. Here is the guide to installing the matplotlib library and seaborn library. (Note that I’ll be using matplotlib and seaborn libraries interchangeably depending on the plot.)

### Importing necessary library  
import random  
import numpy as np  
import pandas as pd  
import seaborn as sns  
import matplotlib.pyplot as plt  
%matplotlib inline

Simple Plot

Let’s begin by plotting a simple line plot which is used to plot a mathematical. A line plot is used to plot the relationship or dependence of one variable on another. Say, we have two variables ‘x’ and ‘y’ with the following values:

x = np.array([ 0, 0.53, 1.05, 1.58, 2.11, 2.63, 3.16, 3.68, 4.21,  
        4.74, 5.26, 5.79, 6.32, 6.84])  
y = np.array([ 0, 0.51, 0.87, 1. , 0.86, 0.49, -0.02, -0.51, -0.88,  
        -1. , -0.85, -0.47, 0.04, 0.53])

To plot the relationship between the two variables, we can simply call the plot function.

### Creating a figure to plot the graph.  
fig, ax = plt.subplots()  
ax.plot(x, y)  
ax.set_xlabel('X data')  
ax.set_ylabel('Y data')  
ax.set_title('Relationship between variables X and Y')  
plt.show() # display the graph  
### if %matplotlib inline has been invoked already, then plt.show() is automatically invoked and the plot is displayed in the same window.

Fig. 1. Line Plot between X and Y

Here, we can see that the variables ‘x’ and ‘y’ have a sinusoidal relationship. Generally, .plot() function is used to find any mathematical relationship between the variables.


A histogram is one of the most frequently used data visualization techniques in machine learning. It represents the distribution of a continuous variable over a given interval or period of time. Histograms plot the data by dividing it into intervals called ‘bins’. It is used to inspect the underlying frequency distribution (eg. Normal distribution), outliers, skewness, etc.

Let’s assume some data ‘x’ and analyze its distribution and other related features.

### Let 'x' be the data with 1000 random points.   
x = np.random.randn(1000)

Let’s plot a histogram to analyze the distribution of ‘x’.

plt.title('Distribution of the variable x')  

Fig 2. Histogram showing the distribution of the variable ‘x’.

The above plot shows a normal distribution, i.e., the variable ‘x’ is normally distributed. We can also infer that the distribution is somewhat negatively skewed. We usually control the ‘bins’ parameters to produce a distribution with smooth boundaries. For example, if we set the number of ‘bins’ too low, say bins=5, then most of the values get accumulated in the same interval, and as a result they produce a distribution which is hard to predict.

plt.hist(x, bins=5)  
plt.title('Distribution of the variable x')  

Fig 3. Histogram with low number of bins.

Similarly, if we increase the number of ‘bins’ to a high value, say bins=1000, each value will act as a separate bin, and as a result the distribution seems to be too random.

plt.hist(x, bins=1000)  
plt.title('Distribution of the variable x')  

Fig. 4. Histogram with a large number of bins.

Kernel Density Function

Before we dive into understanding KDE, let’s understand what parametric and non-parametric data are.

Parametric Data: When the data is assumed to have been drawn from a particular distribution and some parametric test can be applied to it

Non-Parametric Data: When we have no knowledge about the population and the underlying distribution

Kernel Density Function is the non-parametric way of representing the probability distribution function of a random variable. It is used when the parametric distribution of the data doesn’t make much sense, and you want to avoid making assumptions about the data.

The kernel density estimator is the estimated pdf of a random variable. It is defined as

Similar to histograms, KDE plots the density of observations on one axis with height along the other axis.

### We will use the seaborn library to plot KDE.  
### Let's assume random data stored in variable 'x'.  
fig, ax = plt.subplots()  
### Generating random data  
x = np.random.rand(200)   
sns.kdeplot(x, shade=True, ax=ax)  

Fig 5. KDE plot for the random variable ‘x’.

Distplot combines the function of the histogram and the KDE plot into one figure.

### Generating a random sample  
x = np.random.random_sample(1000)  
### Plotting the distplot  
sns.distplot(x, bins=20)

Fig 6. Displot for the random variable ‘x’.

So, the distplot function plots the histogram and the KDE for the sample data in the same figure. You can tune the parameters of the displot to only display the histogram or kde or both. Distplot comes in handy when you want to visualize how close your assumption about the distribution of the data is to the actual distribution.

Scatter Plot

Scatter plots are used to determine the relationship between two variables. They show how much one variable is affected by another. It is the most commonly used data visualization technique and helps in drawing useful insights when comparing two variables. The relationship between two variables is called correlation. If the data points fit a line or curve with a positive slope, then the two variables are said to show positive correlation. If the line or curve has a negative slope, then the variables are said to have a negative correlation.

A perfect positive correlation has a value of 1 and a perfect negative correlation has a value of -1. The closer the value is to 1 or -1, the stronger the relationship between the variables. The closer the value is to 0, the weaker the correlation.

For our example, let’s define three variables ‘x’, ‘y’, and ‘z’, where ‘x’ and ‘z’ are randomly generated data and ‘y’ is defined as
We will use a scatter plot to find the relationship between the variables ‘x’ and ‘y’.

### Let's define the variables we want to find the relationship between.  
x = np.random.rand(500)  
z = np.random.rand(500)  
### Defining the variable 'y'  
y = x * (z + x)  
fig, ax = plt.subplots()  
ax.set_title('Scatter plot between X and Y')  
plt.scatter(x, y, marker='.')  

Fig 7. Scatter plot between X and Y.

From the figure above we can see that the data points are very close to each other and also if we fit a curve, along with the points, it will have a positive slope. Therefore, we can infer that there is a strong positive correlation between the values of the variable ‘x’ and variable ‘y’.

Also, we can see that the curve that best fits the graph is quadratic in nature and this can be confirmed by looking at the definition of the variable ‘y’.

Joint Plot

Jointplot is seaborn library specific and can be used to quickly visualize and analyze the relationship between two variables and describe their individual distributions on the same plot.

Let’s start with using joint plot for producing the scatter plot.

### Defining the data.   
mean, covar = [0, 1], [[1, 0,], [0, 50]]  
### Drawing random samples from a multivariate normal distribution.  
### Two random variables are created, each containing 500 values, with the given mean and covariance.  
data = np.random.multivariate_normal(mean, covar, 500)  
### Storing the variables in a dataframe.  
df = pd.DataFrame(data=data, columns=['X', 'Y'])

### Joint plot between X and Y  
sns.jointplot(df.X, df.Y, kind='scatter')  

Fig 8. Joint plot (scatter plot) between X and Y.

Next, we can use the joint point to find the best line or curve that fits the plot.

sns.jointplot(df.X, df.Y, kind='reg')  

Fig 9. Using joint plot to plot the regression line that best fits the data points.

Apart from this, jointplot can also be used to plot ‘kde’, ‘hex plot’, and ‘residual plot’.


We can use scatter plot to plot the relationship between two variables. But what if the dataset has more than two variables (which is quite often the case), it can be a tedious task to visualize the relationship between each variable with the other variables.

The seaborn pairplot function does the same thing for us and in just one line of code. It is used to plot multiple pairwise bivariate (two variable) distribution in a dataset. It creates a matrix and plots the relationship for each pair of columns. It also draws a univariate distribution for each variable on the diagonal axes.

### Loading a dataset from the sklearn toy datasets  
from sklearn.datasets import load_linnerud  
### Loading the data  
linnerud_data = load_linnerud()  
### Extracting the column data  
data = linnerud_data.data

Sklearn stores data in the form of a numpy array and not data frames, thereby storing the data in a dataframe.

### Creating a dataframe  
data = pd.DataFrame(data=data, columns=diabetes_data.feature_names)  
### Plotting a pairplot  

Fig 10. Pair plot showing the relationships between the columns of the dataset.

So, in the graph above, we can see the relationships between each of the variables with the other and thus infer which variables are most correlated.


Visualizations play an important role in data analysis and exploration. In this blog, we got introduced to different kinds of plots used for data analysis of continuous variables. Next week, we will explore the various data visualization techniques that can be applied to categorical variables or variables with discrete values. Next, I encourage you to download the iris dataset or any other dataset of your choice and apply and explore the techniques learned in this blog.

Have anything to say? Feel free to comment below for any questions, suggestions, and discussions related to this article. Till then, Sayōnara.

The post Data visualization for beginners – Part 1 appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

The meteoric rise of artificial intelligence in the last decade has spurred a huge demand for AI and ML skills in today’s job market. ML-based technology is now used in almost every industry vertical from finance to healthcare. In this blog, we have compiled a list of best frameworks and libraries that you can use to build machine learning models.

1) TensorFlow
Developed by Google, TensorFlow is an open-source software library built for deep learning or artificial neural networks. With TensorFlow, you can create neural networks and computation models using flowgraphs. It is one of the most well-maintained and popular open-source libraries available for deep learning. The TensorFlow framework is available in C++ and Python. Other similar deep learning frameworks that are based on Python include Theano, Torch, Lasagne, Blocks, MXNet, PyTorch, and Caffe. You can use TensorBoard for easy visualization and see the computation pipeline. Its flexible architecture allows you to deploy easily on different kinds of devices
On the negative side, TensorFlow does not have symbolic loops and does not support distributed learning. Further, it does not support Windows.

Theano is a Python library designed for deep learning. Using the tool, you can define and evaluate mathematical expressions including multi-dimensional arrays. Optimized for GPU, the tool comes with features including integration with NumPy, dynamic C code generation, and symbolic differentiation. However, to get a high level of abstraction, the tool will have to be used with other libraries such as Keras, Lasagne, and Blocks. The tool supports platforms such as Linux, Mac OS X, and Windows.

3) Torch
The Torch is an easy to use open-source computing framework for ML algorithms. The tool offers an efficient GPU support, N-dimensional array, numeric optimization routines, linear algebra routines, and routines for indexing, slicing, and transposing. Based on a scripting language called Lua, the tool comes with an ample number of pre-trained models. This flexible and efficient ML research tool supports major platforms such as Linux, Android, Mac OS X, iOS, and Windows.

4) Caffe
Caffe is a popular deep learning tool designed for building apps. Created by  Yangqing Jia for a project during his Ph.D. at UC Berkeley, the tool has a good Matlab/C++/ Python interface. The tool allows you to quickly apply neural networks to the problem using text, without writing code. Caffe partially supports multi-GPU training. The tool supports operating systems such as Ubuntu, Mac OS X, and Windows.

5) Microsoft CNTK
Microsoft cognitive toolkit is one of the fastest deep learning frameworks with C#/C++/Python interface support. The open-source framework comes with powerful C++ API and is faster and more accurate than TensorFlow. The tool also supports distributed learning with built-in data readers. It supports algorithms such as Feed Forward, CNN, RNN, LSTM, and Sequence-to-Sequence. The tool supports Windows and Linux.

6) Keras
Written in Python, Keras is an open-source library designed to make the creation of new DL models easy. This high-level neural network API can be run on top of deep learning frameworks like TensorFlow, Microsoft CNTK, etc.  Known for its user-friendliness and modularity, the tool is ideal for fast prototyping. The tool is optimized for both CPU and GPU.

7) SciKit-Learn
SciKit-Learn is an open-source Python library designed for machine learning. The tool based on libraries such as NumPy, SciPy, and matplotlib can be used for data mining and data analysis. SciKit-Learn is equipped with a variety of ML models including linear and logistic regressors, SVM classifiers, and random forests. The tool can be used for multiple ML tasks such as classification, regression, and clustering. The tool supports operating systems like Windows and Linux. On the downside, it is not very efficient with GPU.

Written in C#, Accord.NET is an ML framework designed for building production-grade computer vision, computer audition, signal processing and statistics applications. It is a well-documented ML framework that makes audio and image processing easy. The tool can be used for numerical optimization, artificial neural networks, and visualization. It supports Windows.

9)Spark MLIib
Apache Spark’s MLIib is an ML library that can be used in Java, Scala, Python, and R. Designed for processing large-scale data, this powerful library comes with many algorithms and utilities such as classification, regression, and clustering. The tool interoperates with NumPy in Python and R libraries. It can be easily plugged into Hadoop workflows.

10) Azure ML Studio
Azure ML studio is a modern cloud platform for data scientists. It can be used to develop ML models in the cloud. With a wide range of modeling options and algorithms, Azure is ideal for building larger ML models. The service provides 10GB of storage space per account. It can be used with R and Python programs.

11) Amazon Machine Learning
Amazon Machine Learning (AML) is an ML service that provides tools and wizards for creating ML models. With visual aids and easy-to-use analytics, AML aims to make ML more accessible to developers. AML can be connected to data stored in Amazon S3, Redshift, or RDS.

Machine learning frameworks come with pre-built components that are easy to understand and code. A good ML framework thus reduces the complexity of defining ML models. With these open-source ML frameworks, you build your ML models easily and quickly.

Know an ML framework that should be on this list? Share them in comments below.

The post 11 open source frameworks for AI and machine learning models appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

“Data is abundant and cheap but knowledge is scarce and expensive.”

In the last few years, there has been a data revolution that has transformed the way we source, capture, and interact with data. From fortune 500 firms to start-ups, healthcare to fintech, machine learning and data science have become an integral part of everyday operations of most companies. Of all the sectors, the social good sector has not seen the push the other sectors have. It is not that the machine learning and data science techniques don’t work for this sector, but the lack of financial support and staff has stopped them from creating their special brand of magic here.

At HackerEarth, we intend to tackle this issue by sponsoring machine learning and data science challenges for social good.

Machine learning at HackerEarth

Even though machine learning (ML) is a new wing at HackerEarth, this is the fastest growing unit in the company. Also, over the past year, we have grown to a community of 200K+ machine learning and data science enthusiasts. We have conducted 50+ challenges across sectors with an average of 6500+ people participating in each.

The initiative

The “Machine Learning Challenges for Social Good” initiative is focused toward bringing interesting real-world data problems faced by nonprofits and governmental and non-governmental organizations to the machine learning and data science community’s notice. This is a win-win for both communities because the nonprofits and governmental and non-governmental organizations get their challenges addressed, and the machine learning and data science community gets to hone their skills while being agents of change.

Our role

HackerEarth will contribute by  

  • Working with the organizations to identify  and prepare the data set most suitable for the initiative
  • Hosting the challenge which includes the following (but not limited to)
    • Creating a problem statement with the given dataset
    • Finding support data sets if required
    • Framing the best evaluation metric to choose the winners
  • Promoting this to our developer community and inviting programmers to contribute to the cause
  • Sharing the winning approaches and ML models built, which the organizations can use
  • Sponsoring prizes 

Are you a nonprofit or a governmental/non-governmental organization with a business/social problem for which primary or secondary data is available? If yes, please mail us at social@hackerearth.com. [Please use subject line “Reg: Machine Learning for Social Good | {Your Organization Name}]

The post Leverage machine learning to amplify your social impact appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

What is Artificial Intelligence (AI)?

Are you thinking of Chappie, Terminator, and Lucy? Sentient, self-aware robots are closer to becoming a reality than you think. Developing computer systems that equal or exceed human intelligence is the crux of artificial intelligence. Artificial Intelligence (AI) is the study of computer science focusing on developing software or machines that exhibit human intelligence. A simple enough definition, right?

Obviously, there is a lot more to it. AI is a broad topic ranging from simple calculators to self-steering technology to something that might radically change the future.

Goals and Applications of AI

The primary goals of AI include deduction and reasoning, knowledge representation, planning, natural language processing (NLP), learning, perception, and the ability to manipulate and move objects. Long-term goals of AI research include achieving Creativity, Social Intelligence, and General (human level) Intelligence.

AI has heavily influenced different sectors that we may not recognize. Ray Kurzweil says “Many thousands of AI applications are deeply embedded in the infrastructure of every industry.” John McCarthy, one of the founders of AI, once said that “as soon as it works, no one calls it AI anymore.”

Broadly, AI is classified into the following:

Source: Bluenotes

Types of AI

While there are various forms of AI as it’s a broad concept, we can divide it into the following three categories based on AI’s capabilities:

Weak AI, which is also referred to as Narrow AI, focuses on one task. There is no self-awareness or genuine intelligence in case of a weak AI.

iOS Siri is a good example of a weak AI combining several weak AI techniques to function. It can do a lot of things for the user, and you’ll see how “narrow” it exactly is when you try having conversations with the virtual assistant.

Strong AI, which is also referred to as True AI, is a computer that is as smart as the human brain. This sort of AI will be able to perform all tasks that a human could do. There is a lot of research going on in this field, but we still have much to do. You should be imagining Matrix or I, Robot here.

Artificial Superintelligence is going to blow your mind if Strong AI impressed you.  Nick Bostrom, leading AI thinker, defines it as “an intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills.”

Artificial Superintelligence is the reason why many prominent scientists and technologists, including Stephen Hawking and Elon Musk, have raised concerns about the possibility of human extinction.

How can you get started?

The first thing you need to do is learn a programming language. Though there are a lot of languages that you can start with, Python is what many prefer to start with because its libraries are better suited to Machine Learning.

Here are some good resources for Python:

Introduction to Bots

A BOT is the most basic example of a weak AI that can do automated tasks on your behalf. Chatbots were one of the first automated programs to be called “bots.” You need AI and ML for your chatbots. Web crawlers used by Search Engines like Google are a perfect example of a sophisticated and advanced BOT.

You should learn the following before you start programming bots to make your life easier.

    • xpath – This will help you to inspect and target HTML and build your bot from what you see there.
    • regex – This will help you to process the data you feed your bot by cleaning up or targeting (or both) the parts that matter to your logic.
  • REST – This is really important as you will eventually work with APIs. You can use requests to do this.

How can you build your first bot?

You can start learning how to create bots in Python through the following tutorial in the simplest way.

You can also start by using APIs and tools that offer the ability to build end-user applications. This helps you by actually building something without worrying too much about the theory at first. Some of the APIs that you can use for this are:

Here’s a listing of a few BOT problems for you to practice and try out before you attempt the ultimate challenge.

What now?

Once you have a thorough understanding of your preferred programming language and enough practice with the basics, you should start to learn more about Machine Learning. In Python, start learning Scikit-learn, NLTK, SciPy, PyBrain, and Numpy libraries which will be useful while writing Machine Learning algorithms.You need to know Advanced Math and as well.

Here is a list of resources for you to learn and practice:

Here are a few more valuable links:

You should also take part in various AI and BOT Programming Contests at different places on the Internet:

Before you start learning and contributing to the field of AI, read how AI is rapidly changing the world.

The post Artificial Intelligence 101: How to get started appeared first on HackerEarth Blog.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Hackathons have become the go-to method for developers and companies to innovate in tech and build great products in a short span of time. With tens of thousands of developers participating in hackathons across the globe every month, it’s a great way to collaborate and work on solving real-life issues using tech.

“Along with being stimulating and productive, hackathons are fun” – says Team Vikings, who won the first prize (a brand new Harley-Davidson bike!) in the recently concluded GO-HACK hackathon. The team built Rashalytics – a comprehensive platform for analysing and minimising rash driving. And now, they have big plans of taking this hack live for the public.

Read on to know more about their amazing idea and how they built the platform.

What is Rashalytics?

Rashalytics is a system that promises to mitigate the problem of rash driving by intelligently incentivising or penalising the driver based on his driving style. It has been designed to reduce the number of accidents that have increased with the hyperlocal on-demand delivery requiring breakneck speeds of the various products.

The system is able to extract the data of rich metrics like sharp acceleration, hard braking, sharp turns, etc. from the driver’s android phone, which is used to train the machine learning models.


  • Nodejs – To create the API server and the mock sensor data generator
  • Kafka – To build the data pipeline
  • Apache Spark – To process the real-time data stream and generate metrics to measure the driving quality
  • ReactJs – To create the dashboard web app
  • Google roads & maps API : To get the traffic and ETA data


The system primarily consists of 4 parts:.

  1. The Android app: Simulated by the team, this app aggregates locally and sends the chunks of sensor data to the API server via an HTTP endpoint.
  2. API Server: This matches the data with the schema and if valid, it puts the data in Kafka queue.
  3. Engine: Made with the Apache Spark, this helps sensor data to aggregate and form metrics such as sharp acceleration, hard braking, sharp turns, etc. These metrics, in turn, are used to generate a dynamic driving quality score for the driver. This score forms the basis of a lot of analytics and functionalities that this system provides.
  4. Dashboard: The dashboard provides a beautiful and intuitive interface to take proactive decisions as well as run analytics using the provided APIs. It has been written using ReactJS.

Here’s the flow diagram showing how the whole system works:

This system allowed the team to create:

  • A dynamic profile and the dashboard of the rider describing his driving style, which affects his rating.
  • An actionable “real-time” rash driving reporting system which allows the authorities and the hub incharges to react before it’s too late.
  • A dashboard usable by both the fleet managers and traffic police control board to visualise the data such as incident distribution by time, which tells at what time of the day a driver is more likely to drive in an unsafe manner.
  • A modular system in which the new data sources, metrics, and models can be added so that the third-party vendors can be easily on-boarded onto the platform.


Here are some of the challenges that the team faced while building this application:

  1. Setting up the entire system architecture with different components by developing them in isolation and then combining them together to work seamlessly
  2. Deciding the thresholds for different metrics after which the driving will be considered rash
  3. Creating a linear predictor for the driving quality score vs time with only one data point
  4. Creating a synthetic feature as generating the score itself is challenging enough

What’s Next?

Project creators Shivendra Soni, Rishabh Bhardwaj and Ankit Silaich have great plans in store for their project. Here are some of their ideas:

  1. Create and SDK for easy data collection and integration with different apps and make it possible for third-party vendors to utilise this data
  2. Improve the driving score model to include even more parameters and make it more real-world oriented
  3. Create a social profile which lets the users share their driving score
  4. Enable enterprise grade plug-n-play integration support

The post How these hackathon winners apply Machine Learning to minimize rash driving appeared first on HackerEarth Blog.

Read Full Article

Read for later

Articles marked as Favorite are saved for later viewing.
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Separate tags by commas
To access this feature, please upgrade your account.
Start your free month
Free Preview