An Embedder for Finding Falsifiers at Polling Stations

Zpoken
6 min read · May 13, 2021

Project purpose

This article describes part of a project to monitor the 2020 parliamentary elections in Georgia. The main goal of the project was to find falsifiers, i.e. people voting several times (at different polling stations), in an online regime. On election day, volunteers captured footage of all voters at most of the polling stations. To reach the goal we needed to train a lightweight embedder that could identify recurring persons across different backgrounds, captured with different devices.

Dataset Acquisition

There are a lot of public person ReID datasets. Some of them can be used for any purpose, some for academic research only. You can find links to some of the available datasets here: https://github.com/NEU-Gou/awesome-reid-dataset

The most popular datasets, which many academic papers focus on, are Market-1501 and Duke-MTMC.

There are a lot of public pretrained person-reid models available. You can find some of them here: https://kaiyangzhou.github.io/deep-person-reid/MODEL_ZOO.html

The problem with the available models is that they are trained on one or a few datasets in which the conditions under which the images were taken are rather similar: the camera is located at a fixed position and angle, and the background is often similar. To train an embedder that will work well in a real-world scenario with completely different conditions, we need a dataset that covers as much variability as possible. One way to create such a dataset is to unite several of the available datasets into a single one.

Training losses

To get a better idea of how to unite the datasets, let us quickly recap two of the existing training losses for ReID.

The first approach is triplet loss and its modifications. The idea of triplet loss is simple: we train embeddings to be as close as possible for different images of the same person and as far as possible for images of different persons, as shown in the figure.

Simplified scheme of the triplet loss
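The idea can be sketched in a few lines. This is an illustrative NumPy version, not the implementation from any particular framework (PyTorch, for instance, ships a ready-made `TripletMarginLoss`):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss over batches of embeddings.

    anchor/positive hold embeddings of the *same* person,
    anchor/negative hold embeddings of *different* persons.
    """
    # Euclidean distance to an image of the same person.
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    # Euclidean distance to an image of a different person.
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    # Penalize unless the negative is farther away by at least `margin`.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

If the negative is already farther than the positive by more than the margin, the loss for that triplet is zero and the pair exerts no pull on the embedder.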

The second approach is to add one linear layer on top of the CNN that generates the embedding. The dimensionality of the linear layer's output should equal the number of unique persons in the dataset. At the training stage we use a softmax activation and the cross-entropy loss function to simultaneously train the embedding-extracting CNN and the linear classification layer. At the test stage we discard that linear layer and use only the embedding extractor.
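A minimal sketch of the classification head, assuming embeddings are already extracted by the CNN (the identity count 751 is just Market-1501's number of training identities, used here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_ids, emb_dim = 751, 512  # illustrative: 751 train identities, 512-d embeddings

# Linear classification layer used only during training.
W = rng.normal(scale=0.01, size=(emb_dim, num_ids))
b = np.zeros(num_ids)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def id_loss(embeddings, labels):
    """Cross-entropy over identity logits. At test time W and b are
    discarded and only the embeddings themselves are compared."""
    probs = softmax(embeddings @ W + b)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
```

With zero (uninformative) embeddings the logits are all equal, the softmax is uniform, and the loss sits at log(751), its chance level; training pushes it below that.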

Both losses are available in the excellent reid framework: https://github.com/KaiyangZhou/deep-person-reid

Ways to unite the dataset

There are two ways to unite multiple datasets into a single one.

Let’s say we have two ReID datasets: the first with n1 unique persons and m1 total images, and the second with, correspondingly, n2 unique persons and m2 total images.

The first possibility is to train the model on both datasets without mixing them. For example, if we use triplet loss, we sample pairs so that both elements come from the same dataset (first or second). If we train with cross-entropy loss, we need to create 2 separate linear classification layers with n1 and n2 output nodes. During training we make the forward pass through the first linear layer for images from the first dataset and through the second layer for images from the second dataset, as shown in the picture.

Cross-entropy loss for embedding training for separated dataset
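The per-dataset routing can be sketched as follows; the identity counts and the `forward_head` helper are illustrative assumptions, not part of any framework's API:

```python
import numpy as np

rng = np.random.default_rng(1)
emb_dim = 128
n1, n2 = 751, 702  # illustrative identity counts for datasets 1 and 2

# One shared embedder feeds two separate classification heads,
# one per dataset, so identity labels are never mixed.
heads = {0: rng.normal(scale=0.01, size=(emb_dim, n1)),
         1: rng.normal(scale=0.01, size=(emb_dim, n2))}

def forward_head(embeddings, dataset_id):
    """Route a batch through the head of the dataset it came from."""
    return embeddings @ heads[dataset_id]
```

Gradients from both heads flow back into the same embedding CNN, so the embedder itself is shared even though the label spaces stay separate.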

The advantage of this approach is that it works even when the datasets we are uniting have intersecting sets of persons.

If the datasets do not intersect, it is much better to unite them into a single dataset with n1+n2 unique persons and m1+m2 total elements. Then, during training, we can use pairs taken from different datasets (for triplet loss) or a single linear layer with n1+n2 output nodes (for cross-entropy loss). The second approach yields a much larger number of training pairs than the first, and thus better model generalization.
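A minimal sketch of the union, assuming each dataset is a list of `(image, person_id)` pairs with ids numbered 0..n−1 (the function name and data layout are illustrative):

```python
def unite_datasets(ds1, ds2):
    """Merge two non-intersecting ReID datasets of (image, person_id)
    pairs. Ids in ds2 are shifted by n1 (this assumes ds1's ids are
    0..n1-1), so the union has n1 + n2 unique identities."""
    n1 = len({pid for _, pid in ds1})
    return ds1 + [(img, pid + n1) for img, pid in ds2]
```

After the shift, a single classification head with n1+n2 outputs (or triplet sampling across the whole pool) can be used directly.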

In practice, different person ReID datasets have some persons in common. In fact, in many cases the same person can appear under different identities even within a single dataset. This can harm training and the quality of the resulting embedder, so it is worth filtering duplicates out of the united dataset first.

Filtering the dataset

We found the following iterative filtering procedure useful. First, take any pretrained person ReID embedder and compute embeddings for all elements of the dataset. Then compute the distances between all pairs of embeddings belonging to different identities and sort the pairs by distance. Manually inspect, say, the first 500 pairs in the sorted list and mark the pairs that show the same person. Remove all images of the marked persons from the dataset and repeat the process. At some point, train the embedder on the filtered dataset and continue the process with this newly trained embedder. After several iterations, a large fraction of the repeated persons in the dataset will be removed.
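The core of one filtering iteration, i.e. surfacing the closest cross-identity pairs for manual review, can be sketched like this (the function name is ours; a real run over a large dataset would use vectorized or approximate nearest-neighbor search instead of the O(n²) loop):

```python
import numpy as np
from itertools import combinations

def closest_cross_id_pairs(embeddings, person_ids, k=500):
    """Return the k pairs of images with *different* labels whose
    embeddings are closest. These are candidate duplicate identities
    to inspect manually and, if confirmed, remove from the dataset."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        if person_ids[i] != person_ids[j]:
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            pairs.append((d, person_ids[i], person_ids[j]))
    pairs.sort(key=lambda p: p[0])  # smallest distance first
    return pairs[:k]
```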

Metrics & Experiments

To test the quality of the publicly available embedders and of the ones we trained ourselves, we created a separate test set. In the real-world scenario we had a very large number of unique persons and only a few repeating persons that we needed to detect.

To simulate that, we created a test set of 4000 unique persons with about 10000 images of them. We built a list of all possible pairs of images and removed all pairs of the same identity, which left a list of ~20 mln image pairs, each consisting of images of different persons. We then added 74 pairs of images of the same persons (two different photos for each of the 74 additional persons). Finally, we computed the distance between the embeddings of each pair and sorted the list by distance.

Thus we obtained a sorted list of ~20 mln image pairs, 74 of which show the same person. If the embedder does nothing useful, these 74 pairs will be distributed uniformly over the list. If the embedder is ideal, they will occupy the first 74 positions.

To check embedder quality we measure the following metrics on the sorted list: the minimal, average and maximal positions of the 74 same-person pairs, and how many of the 74 pairs fall within the first 10, 20, 50, 100, 200, 500, 1000, 2000, 5000 and 10000 pairs out of ~20 mln.
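These rank-based metrics can be computed in a few lines; the function below is a sketch of the evaluation described above (names and the dict output format are ours):

```python
import numpy as np

def same_person_metrics(distances, is_same, cutoffs=(10, 20, 50, 100)):
    """Sort all pairs by embedding distance and report where the
    same-person pairs land: min/mean/max rank (1-based) and how
    many fall inside each cutoff."""
    order = np.argsort(distances)                      # closest pairs first
    ranks = np.nonzero(np.asarray(is_same)[order])[0] + 1
    return {
        "min": int(ranks.min()),
        "mean": float(ranks.mean()),
        "max": int(ranks.max()),
        "within": {c: int((ranks <= c).sum()) for c in cutoffs},
    }
```

For an ideal embedder on the test set above, `min` would be 1, `max` would be 74, and every cutoff from 100 up would contain all 74 pairs.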

Results of experiments

Some tips and conclusions

  • After uniting datasets, there can be an imbalance in the number of images per person. It is useful to cap the number of images for each person at some small number, for example 3.
  • It is helpful to augment the input images at the training stage. We used gamma, brightness and affine geometric transformations, as well as dropout.
  • Uniting re-identification datasets can boost overall accuracy and improve performance in real-world scenarios with varying conditions.
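The per-person cap from the first tip can be sketched as follows, again assuming the `(image, person_id)` list layout used earlier (the helper name is ours):

```python
from collections import defaultdict

def cap_images_per_person(dataset, max_per_person=3):
    """Keep at most `max_per_person` images for each identity to
    reduce per-person imbalance after uniting datasets."""
    kept, counts = [], defaultdict(int)
    for img, pid in dataset:
        if counts[pid] < max_per_person:
            kept.append((img, pid))
            counts[pid] += 1
    return kept
```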


Zpoken

We are a full-stack Web3 development organization.