Let’s say you want to create your own ASR dataset. You have gathered audio recordings and transcripts of variable length, but STT models need relatively short clips for training. A recording only a few minutes long is already too long and has to be sliced. One possible solution is audio-to-text alignment.

This tutorial explains how to align long audio recordings with their texts without using any complex or heavyweight forced-alignment tools. The only prerequisite is an STT model capable of generating decent text (with timestamps) that we can align with our existing labels (transcriptions).

Text alignment

First, we need to run inference on our audio files and align the STT outputs…
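As a rough illustration of the idea, here is a minimal sketch that matches timestamped STT words against the reference transcript and then cuts the reference text into short segments. It is not necessarily the method used in this tutorial: it assumes the STT model returns word-level timestamps as `(word, start_sec, end_sec)` tuples, it swaps in Python's standard `difflib.SequenceMatcher` for the matching step, and the function names `align_words` and `cut_segments` as well as the `max_len` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(word):
    # Assumption: lowercasing and stripping basic punctuation is enough
    # to make STT words and reference words comparable.
    return word.lower().strip(".,!?;:\"'")


def align_words(stt_words, ref_words):
    """Match STT words to reference words.

    stt_words: list of (word, start_sec, end_sec) from the STT model.
    ref_words: list of words from the existing transcription.
    Returns (ref_index, start_sec, end_sec) for every exactly matched word.
    """
    ref = [normalize(w) for w in ref_words]
    hyp = [normalize(w) for w, _, _ in stt_words]
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    anchors = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = stt_words[block.b + k]
            anchors.append((block.a + k, start, end))
    return anchors


def cut_segments(anchors, ref_words, max_len=10.0):
    """Greedily cut the reference text at matched words.

    max_len is a soft limit in seconds: a segment is closed as soon as
    adding the next matched word would make it longer than max_len.
    Returns a list of (start_sec, end_sec, text).
    """
    segments = []
    if not anchors:
        return segments
    first = 0
    for j in range(1, len(anchors)):
        seg_start = anchors[first][1]
        if anchors[j][2] - seg_start > max_len:
            # Keep unmatched reference words that sit before anchor j.
            text = " ".join(ref_words[anchors[first][0]:anchors[j][0]])
            segments.append((seg_start, anchors[j][1], text))
            first = j
    text = " ".join(ref_words[anchors[first][0]:])
    segments.append((anchors[first][1], anchors[-1][2], text))
    return segments
```

Words that the STT model got wrong simply produce no anchor, so the cut points always fall on words where both texts agree, while the segment text itself is taken from the reference transcription.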

Project purpose

This article describes part of a project for monitoring the 2020 parliamentary elections in Georgia. The main goal of the project was to find forgers, that is, people voting several times (at different polling stations), in online mode. At most polling stations, a volunteer captured every voter on election day. To reach this goal, we needed to train a lightweight embedder that could identify the same person against different backgrounds and captured with different devices.

Dataset Acquisition

There are many public person re-identification (ReID) datasets. Some of them can be used for any purpose…


