Solving Captchas to scrape pirate websites

@ MDG

Key achievements

  • Built software to solve the most frequent reCAPTCHA challenges with an accuracy between 87% and 96%
  • Led a small team to upskill in an entirely new field and meet short deadlines

Client

Massive Data Guardian (MDG) was an early-stage startup based in Luxembourg aiming to fight the illegal broadcasting of films by pirate websites.

The MVP needed to scrape pirate websites in order to provide a live view of all pirate links for a defined list of movies.

Problem

Pirate websites were using Captchas to stop bots from scraping them and identifying links to pirated films.

At that time, in 2018, the most common type of Captcha was reCAPTCHA v2 (see the example picture).

The founder of the startup entrusted me with the responsibility of building a team to develop a deep learning module able to solve reCAPTCHAs.

I found two other people to help me with this task.

We agreed to work on the project for six months and see where it would take us, as we would be working part-time, alongside our full-time jobs.

Process & Solution

1.

Upskilling in Deep Learning

No member of the team had prior knowledge of Deep Learning. We divided the topic between us to speed up the learning process, using, among other resources, Stanford's online course on Convolutional Neural Networks (CNNs) for visual recognition.

2.

Choice and design of technical solution

We chose to use PyTorch with Python. Although it was not as mature as TensorFlow at the time, it was reputed to be quicker to learn: a good option given the little time we had.

For better precision, we decided to build one model per challenge (one that says whether a picture contains a car, another whether it shows a road, and so on).
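The one-model-per-challenge design can be sketched as a registry mapping each challenge keyword to its own binary classifier. This is an illustrative sketch, not the original code: the model file names, the `load_model` callable and the 0.5 threshold are all hypothetical.

```python
# Hypothetical registry: one binary classifier per reCAPTCHA challenge type.
# File names and the load_model/predict interface are illustrative only.
CHALLENGE_MODELS = {
    "car": "car_vs_not_car.pt",
    "road": "road_vs_not_road.pt",
    "bus": "bus_vs_not_bus.pt",
    "storefront": "storefront_vs_not.pt",
}

def solve_challenge(keyword, image_tiles, load_model, threshold=0.5):
    """Return the indices of the grid tiles the per-challenge model accepts."""
    if keyword not in CHALLENGE_MODELS:
        raise ValueError(f"No model trained for challenge '{keyword}'")
    model = load_model(CHALLENGE_MODELS[keyword])  # returns tile -> probability
    return [i for i, tile in enumerate(image_tiles)
            if model(tile) >= threshold]
```

A nice side effect of this design is that an unsupported challenge fails loudly instead of guessing, so the scraper can simply request a fresh Captcha.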

To get the most results for the least effort, we started by solving the most frequent challenges first (car, road, bus, storefront, ...).

I also defined the software architecture and an interface allowing our module to communicate with the other pieces of software the startup was using.

3.

Developing a Python framework

At the time, I was working as a data engineer, building industrial Machine Learning production code.

So I was the one who built a Python framework making the training and use of models simple, with only a few parameters to set. It saved even more time as none of the other team members had used Python before.
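The spirit of that framework, reduced to its interface, might look like the sketch below: a small config object holding the few parameters a user has to set, and a generic training driver. This is an assumption-laden sketch, not the actual framework; `TrainerConfig`, its defaults, and the `run_epoch` callable (which would wrap the real PyTorch loop) are all hypothetical.

```python
# Hypothetical sketch of the framework's interface: training driven by a
# handful of parameters. Names and defaults are illustrative.
class TrainerConfig:
    def __init__(self, dataset_dir, architecture="resnet18",
                 epochs=10, lr=1e-3, batch_size=32):
        self.dataset_dir = dataset_dir
        self.architecture = architecture
        self.epochs = epochs
        self.lr = lr
        self.batch_size = batch_size

def train(config, run_epoch):
    """Call run_epoch (the actual PyTorch loop) once per epoch and
    return the history of losses it reports."""
    history = []
    for epoch in range(config.epochs):
        loss = run_epoch(config, epoch)
        history.append(loss)
    return history
```

Hiding the training loop behind such a wrapper is what lets newcomers train a model by only choosing a dataset directory and an architecture name.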

4.

Building a Dataset

Of course, no ready-made dataset was available.

We scraped the data from the reCAPTCHA API. To be more precise, we built one dataset per problem.

We then started manually labeling the images, which was unpleasant and time-consuming.

Building the dataset and refining the models is an iterative process:

  1. Scraping data from the reCAPTCHA API
  2. Manual sorting
  3. Training the initial model (V0.1)
  4. Sorting using the new V0.1 model
  5. Manual verification
  6. Training an updated model (V0.2) with the latest data

We decided to label a minimal dataset, build a model with it, use that model to sort new images, check the sorting manually (much faster), feed the new data back to improve the model, and repeat these cycles until new data no longer improved the model significantly.
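The cycle described above can be sketched as a small bootstrapping loop. Everything here is a stand-in: `train`, `auto_sort`, and `verify` represent the real steps (PyTorch training, model-assisted sorting, quick manual review), and the `min_gain` stopping criterion is an assumed formalization of "new data no longer improves the model significantly".

```python
# Hypothetical sketch of the label-bootstrapping cycle.
# train(labels) -> (model, accuracy); auto_sort and verify are stand-ins
# for model-assisted sorting and the quick manual check.
def bootstrap_labels(unlabeled, seed_labels, train, auto_sort, verify,
                     min_gain=0.01, max_rounds=10):
    """Alternate training and model-assisted labeling until accuracy
    stops improving by at least min_gain."""
    labels, last_acc = dict(seed_labels), 0.0
    model = None
    for _ in range(max_rounds):
        model, acc = train(labels)              # train on current labels
        if acc - last_acc < min_gain:           # stop once gains flatten out
            break
        last_acc = acc
        proposed = auto_sort(model, unlabeled)  # model sorts new images
        labels.update(verify(proposed))         # fast manual verification
    return model, labels
```

The key property is that each round shifts human effort from labeling to the much faster task of verifying the model's guesses.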

We ended up with a dataset of over 60k labeled images.

5.

Identifying best models & parameters

We finally trained and tested different models and parameters as the dataset grew, to find the best model for each problem.

We had limited computational resources and could not use models consuming too much RAM, which added to the challenge.
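A rough way to screen candidate architectures against such a RAM budget, before any training, is a back-of-the-envelope footprint estimate: a float32 model needs about 4 bytes per parameter, multiplied during training by gradients and optimizer state. This heuristic is mine, not the team's actual method, and the overhead factor is an assumption.

```python
# Rough heuristic (an assumption, not the project's actual method):
# float32 weights take 4 bytes/parameter; training roughly triples that
# with gradients and optimizer state.
def model_ram_mb(n_params, bytes_per_param=4, training_overhead=3):
    """Approximate training-time RAM footprint in MB."""
    return n_params * bytes_per_param * training_overhead / 1024**2

# For scale: ResNet-18 has about 11.7M parameters, GoogLeNet about 6.6M.
```

Such an estimate explains why parameter-light architectures like GoogLeNet were attractive candidates on constrained hardware.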

6.

Managing the Team

Motivating a team to work on a side project in their free time for six months is no easy task.

To keep everyone engaged, the experience needed to make each person grow in a meaningful way.

I selected people I could trust to be part of the team and had conversations with each of them to understand their deeper motivations. I then distributed the tasks to match everyone's goals as closely as possible.

I planned the whole project and kept tracking progress to build a sense of momentum and keep the team organised.

Results

At the end of the six months, we had built software that could sort the most frequent reCAPTCHA image categories live, as MDG's software encountered them.

We reached an accuracy between 87% and 96% for the most frequent image categories.

We also identified several ways to improve our models even further:

  • Use confidence scores and add more hard-to-categorize pictures to the dataset
  • Use several models at the same time and vote for the most probable answer
  • Buy better hardware to allow the use of more powerful models
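The voting idea from the list above amounts to a soft-voting ensemble: average the confidence scores of several models and accept a tile when the mean crosses a threshold. A minimal sketch, with illustrative stand-in models:

```python
# Soft-voting ensemble sketch: each model maps a tile to a probability;
# the ensemble accepts when the average probability crosses the threshold.
def ensemble_vote(models, tile, threshold=0.5):
    """Average per-model probabilities and compare to the threshold."""
    scores = [model(tile) for model in models]
    return sum(scores) / len(scores) >= threshold
```

Averaging tends to cancel out the individual models' uncorrelated mistakes, which is why this was a natural next step for squeezing out a few more points of accuracy.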

Tools & methods:

Python, PyTorch, Deep Learning, Convolutional Neural Networks (CNN), ResNet, GoogLeNet

Interested in cooperation, or would like to discuss anything?
