Creating an Image Dataset by Web Scraping and Labelling It

The first step of any data science project is acquiring the data. Often we can find websites where datasets are available in a structured form, ready to download and use. But even when a dataset exists, it may not contain enough data, and ML models need a good amount of data to be trained well. For classification problems we also need labels along with the data. For many specific problem statements, however, a ready-made dataset simply isn't available. Suppose we want to build a face mask classifier and, after several web searches, we still don't find a suitable dataset. In such situations we need to make our own dataset.

In computer vision, far less is said about acquiring images than about working with them, so I'll go through this crucial step of making a custom dataset and labelling it.

In this article, I'll discuss how to create an image dataset and label it using Python. To create the dataset, we acquire images by web scraping (or, more precisely, image scraping) and then label them with the LabelImg tool to generate annotations.

Web Scraping

Web scraping means extracting data from websites; the extracted data, often a large amount of it, is then stored on a local system. A scraper accesses the world wide web over HTTP/HTTPS, either directly or through a web browser.

The best-known Python library for this kind of image scraping is BeautifulSoup (the bs4 package), which parses HTML and XML documents. The requests library makes the necessary HTTP requests to the webpage. Both packages are pip installable (and may already be installed).

from bs4 import BeautifulSoup
import requests as rq
import os
r2 = rq.get("https://www.pexels.com/search/koala")
soup = BeautifulSoup(r2.text, "html.parser")
links = []

If we click on any picture on the webpage and inspect it in the browser's developer tools, we'll see that the image URLs all start with 'https://images.pexels.com/photos'; up to /photos the format is the same, followed by a unique number for each image. We use that common prefix in a CSS attribute selector (the ^= means "starts with", similar in spirit to a regular expression) so that all matching images can be acquired.

# select all <img> tags whose src starts with the Pexels photo prefix
images = soup.select('img[src^="https://images.pexels.com/photos"]')
for img in images:
    links.append(img['src'])

After this step we can print the 'links' list, if we wish, to see the image links that have been scraped.
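For example, a quick check of what was collected:

print(len(links))   # number of image URLs found
print(links[:3])    # first few scraped URLs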

Making a directory to save the images in:

os.mkdir('jayita_photos')   # folder where the downloaded images will be saved

Now we download the images, limiting ourselves to 10 of them just to show how it works; the entire page could be downloaded the same way. This is done with the usual file-handling technique.

for index, img_link in enumerate(links[:10]):   # limit to the first 10 images
    img_data = rq.get(img_link).content         # download the raw image bytes
    with open("jayita_photos/" + str(index + 1) + ".jpg", "wb") as f:
        f.write(img_data)                       # the with-block closes the file for us

After the program runs successfully, go to the specified folder and you will see that the images have been stored there.
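If you prefer to verify this from Python rather than the file explorer, a quick check could be:

print(os.listdir("jayita_photos"))   # should list 1.jpg ... 10.jpg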

Labelling

Now that we have our images, we need to label them for classification. For this we'll be using LabelImg, a GUI-based annotation tool. It works with Python 3 and above, is pip installable, and provides two types of annotations: Pascal VOC (the format used by ImageNet) and YOLO.

Running the labelImg command from a terminal (after installing the package with pip install labelImg) opens the labelling software.

On the left side of the window are the available options, and on the right side the image file information is shown. For a single image, select 'Open'; for a directory of images, select 'Open Dir', which loads all the images. Press 'a' to go to the previous image and 'd' to go to the next one.

Draw a rectangular box around the object to create the annotation; pressing 'w' activates the rectangle tool directly.

After drawing the box, a window pops up asking for the class name of that particular object.

The class labels we provide (koala in this case) are then shown on the right side of the window.

After drawing the bounding box and entering the precise class name, it is important to save, choosing the format (Pascal VOC or YOLO) in which the annotations are generated. Pascal VOC annotations are stored as an XML file, while YOLO annotations are stored as a plain-text .txt file.

Format of the Pascal VOC annotations
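A Pascal VOC annotation is an XML file with one <object> entry per bounding box. An illustrative example for a single koala box (the file name and coordinate values here are made up for illustration) looks like this:

<annotation>
    <folder>jayita_photos</folder>
    <filename>1.jpg</filename>
    <size>
        <width>500</width>
        <height>333</height>
        <depth>3</depth>
    </size>
    <object>
        <name>koala</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>62</ymin>
            <xmax>421</xmax>
            <ymax>310</ymax>
        </bndbox>
    </object>
</annotation>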

Format of the YOLO annotations

Each line of the generated .txt file describes one bounding box: the first value (0 here) is the object's class id, and the remaining four values are the normalized bounding-box coordinates (x-centre, y-centre, width, height).
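An illustrative YOLO annotation line (the coordinate values are made up for illustration) would look like:

0 0.512000 0.434000 0.620000 0.780000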

Along with the YOLO annotations, a class file (classes.txt) containing the class names is generated; here it would contain the single entry koala.

Conclusion

Creating your own image dataset with these steps can be helpful when a dataset is not readily available, or when too little data exists and you need to increase its size. I've only shown it for a single class, but this can be applied to multiple classes as well, provided the images of all the classes are placed in the same folder.
