Subtitles-rs - Use SRT subtitle files to study foreign languages

Overview

Rust subtitle utilities

Are you looking for substudy? Try here. (substudy has been merged into the subtitles-rs project.)

This repository contains a number of related tools and libraries for manipulating subtitles. See the README.md files in the individual subdirectories for more details.

  • substudy: Learn foreign languages using audio and subtitles extracted from video files.
  • vobsub: A Rust library for parsing subtitles in sub/idx format.
  • vobsub2png: A command-line tool for converting sub/idx subtitles to PNGs with JSON metadata.
  • opus_tools: Utilities for parsing subtitle data from the OPUS project, for use as input to various language models.
  • common_failures: Useful Fail implementations and error-handling tools.
  • cli_test_dir: A simple integration testing harness for CLI tools.

The following subtitle-related projects can be found in other repositories:

  • aligner: This GPLed library by kaegi uses dynamic programming to re-align out-of-sync subtitles using another subtitle file with known-good timing.
  • subparse: This library by kaegi parses many common subtitle formats.

License

This code is distributed under the CC0 1.0 Universal public domain grant (plus fallback license), with the exception of some data in the fixtures directory, which contains a few individual frames of subtitle data used in tests. Note that none of the individual crates include that data.

Contributions

Your feedback and contributions are welcome! Please feel free to submit issues and pull requests using GitHub.

Comments
  • Can't compile substudy

    I'm trying to update substudy and get it running again after a long time not using it. It used to work! On this same laptop, even, but before I'd upgraded to El Capitan.

    I'm on a Mac (El Capitan 10.11.5). I ran multirust update stable and brew install cmake ffmpeg to make sure those dependencies were all up to date. No issues there.

    But whether I run cargo install substudy or cargo build, I get the same error message:

    /Users/arthaey/.multirust/toolchains/stable/cargo/registry/src/github.com-1ecc6299db9ec823/substudy-0.4.0/src/video.rs:13:20: 13:25 warning: unused import, #[warn(unused_imports)] on by default
    /Users/arthaey/.multirust/toolchains/stable/cargo/registry/src/github.com-1ecc6299db9ec823/substudy-0.4.0/src/video.rs:13 use err::{err_str, Error, Result};
                                                                                                                                                   ^~~~~
    error: linking with `cc` failed: exit code: 1
    note: "cc" "-m64" [snip verbose options]
    note: Undefined symbols for architecture x86_64:
    [snip stacktrace]
    ld: symbol(s) not found for architecture x86_64
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    
    error: aborting due to previous error
    error: failed to compile `substudy v0.4.0`, intermediate artifacts can be found at `/var/folders/j0/8d669w252655xfzfrfhnsgjc0000gn/T/cargo-install.qUrb7Re5e91W`
    

    Running with --verbose didn't give any more useful output.

    opened by Arthaey 20
  • Problem: tests fail on Windows

    Also, if a non-exe file matching the binary name is present in the build directory, it gets picked up, but an error then reports that it is not a valid Win32 binary.

    Solution: adjust tests to accommodate for Windows' ecosystem and make the binary name automatically assume .exe extension when compiled on Windows.
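
One way to make the binary name pick up the extension automatically is the standard library constant `std::env::consts::EXE_SUFFIX`, which is `.exe` on Windows and empty on other platforms. The following is only a minimal sketch: `binary_path` is a hypothetical helper name, not a function taken from this repository.

```rust
use std::env::consts::EXE_SUFFIX;
use std::path::{Path, PathBuf};

/// Build the path to a compiled binary inside `build_dir`, appending the
/// platform's executable suffix (".exe" on Windows, "" elsewhere).
/// NOTE: `binary_path` is a hypothetical helper used for illustration only.
pub fn binary_path(build_dir: &Path, name: &str) -> PathBuf {
    build_dir.join(format!("{}{}", name, EXE_SUFFIX))
}
```

Because the suffix is appended to the requested name, a stray non-exe file with the same base name would no longer be selected on Windows.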

    opened by yrashk 9
  • I cannot get substudy to function

    When I run substudy combine episode_01_01.es.srt episode_01_01.en.srt > episode_01_01.bilingual.srt

    nothing happens. If I drop the word substudy from the above, I get "Usage: combine file1 OP file2" as an error.

    If I run substudy export csv episode_01_01.mkv episode_01_01.es.srt episode_01_01.en.srt then I get "MissingFieldError("streams")"

    If I run that command without the ">" after mkv, then I get:

    Invalid arguments.
    Usage: substudy clean <subs>
           substudy combine <foreign-subs> <native-subs>
           substudy export csv <video> <foreign-subs> [<native-subs>]
           substudy export review <video> <foreign-subs> [<native-subs>]
           substudy export tracks <video> <foreign-subs>
           substudy list tracks <video>
           substudy --help
           substudy --version

    Another person tried to help at https://forum.language-learners.org/viewtopic.php?f=19&t=6360&start=10. I welcome your thoughts on where I am going wrong. I installed Linux Mint 18 just to use your program, so it is possible I am missing some foundational download I need before using substudy.

    opened by Spinozza 9
  • ExpectedError("Number", "-99")

    I'm trying to export my mkv to csv using the given command, substudy export csv episode_07_01.mkv episode_07_01.fr.srt episode_07_01.en.srt, but I can't seem to get it working. I get ExpectedError("Number", "-99").

    I'm having no problems using the combine command for the bilingual subtitles. I just can't seem to get the export working for the life of me.

    opened by swedeeee 6
  • Split audio according to subtitle timing exactly for songs

    The default algorithm for splitting up audio works well enough for TV shows, where splitting up a sentence isn't really the end of the world.

    For my weaker languages, I had the idea to use substudy with songs instead of TV. But it's going to be really weird and annoying to split up lyrics that way, especially when they're already "bite-sized".

    So here I am, asking for yet another option to be added to substudy. ;)

    opened by Arthaey 5
  • Reduce padding on subtitle times

    Great tool, thank you for your efforts.

    There is one feature I especially wish for, and its absence is almost a deal-breaker: the ability to adjust the padding of subtitle times through an argument when generating Anki cards with substudy.

    Correct me if I'm wrong, but it seems that substudy automatically pads the times, adding extra time at both the start and the end.

    This might be a sensible default, as it's presumably better to have a little too much of what's said, than too little, but I have some pretty nicely timed subtitles, and it would be great to be able to prevent this padding from taking place, even if it's not the default.

    I've tried looking through the source, but to be honest I'm not a great programmer. I'm integrating substudy into some scripts, but for the last few years I've mostly worked with Python and POSIX sh rather than C-like languages, and I'm totally unfamiliar with Rust. Much of the code isn't commented, so my eyes glazed over when I tried to figure things out in the source myself.

    Even if this isn't what's happening, perhaps a command line argument to reduce or increase padding would be good.

    I'd try to implement it myself as I'm sure it's rather simple, but for the reasons I stated earlier, I'm pretty much at the mercy of your (or some other kind soul's) benevolence.
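
The requested option could look roughly like the sketch below. It is not substudy's actual code: the `Span` type, the `pad` function and the `padding` parameter are all hypothetical names, and times are assumed to be seconds stored as `f64`.

```rust
/// A time span in seconds. Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub begin: f64,
    pub end: f64,
}

/// Grow `span` by `padding` seconds on each side (assumed non-negative),
/// never letting the start fall before 0.0. A padding of 0.0 returns the
/// subtitle timing unchanged.
pub fn pad(span: Span, padding: f64) -> Span {
    Span {
        begin: (span.begin - padding).max(0.0),
        end: span.end + padding,
    }
}
```

With a value like this exposed as a command-line argument, well-timed subtitles could be exported with a padding of zero while the current behavior stays the default.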

    opened by NinKenDo64 4
  • Option to include an extra line of dialog before or after

    The automatic splitting seems pretty decent, but sometimes it chops a sentence awkwardly. It would be nice if I could ask substudy to include one line of dialog before or after the "target" line for context.

    Ideally, as separate "columns" in the csv output, so I can style them differently in Anki. :)
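
As a rough sketch of the idea, each exported row could carry the previous and next line alongside the target line, so they can become separate csv columns. The `Row` type and `rows_with_context` function are hypothetical names and do not come from substudy.

```rust
/// One exported row: the target line plus optional neighbouring lines.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub struct Row<'a> {
    pub prev: Option<&'a str>,
    pub current: &'a str,
    pub next: Option<&'a str>,
}

/// Pair every subtitle line with the line before and after it.
/// The first row has no `prev` and the last row has no `next`.
pub fn rows_with_context<'a>(lines: &[&'a str]) -> Vec<Row<'a>> {
    lines
        .iter()
        .enumerate()
        .map(|(i, &current)| Row {
            prev: i.checked_sub(1).map(|j| lines[j]),
            current,
            next: lines.get(i + 1).copied(),
        })
        .collect()
}
```

Keeping the context in separate fields, rather than concatenating it into the target text, is what would allow different styling per column in Anki.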

    opened by Arthaey 4
  • Cannot truncate time period Period

    An odd error.

    Cannot truncate time period Period { begin: 610.839, end: 611.874 } at 610.839

    The first part of the file is:

    1
    00:00:06,120 --> 00:00:08,475
    Parco nazionale Harwood PENNSYLVANIA

    2
    00:00:08,519 --> 00:00:09,998
    Forse dovremmo tornare indietro.

    I've compared this to other files that were processed successfully, and it doesn't look any different. So, is it possible the problem is with the mpv file output by ogmrip?

    opened by rdearman 4
  • No Tag data when MKV created by ogmrip

    Using the list tracks command produces an error when querying an mkv file created by ogmrip. I had to use ogmrip because of copy protection.

    substudy list tracks CM6-0E-UT3.2_DES.mkv
    MissingFieldError("tags")

    Using the command mkvinfo, I can see the track information and extract the appropriate subtitle track.

    mkvinfo CM6-0E-UT3.2_DES.mkv
    | + A track
    | + Track number: 4 (track ID for mkvmerge & mkvextract: 3)
    | + Track UID: 914786270282254080
    | + Track type: subtitles
    | + Lacing flag: 0
    | + Codec ID: S_TEXT/UTF8
    | + A track
    | + Track number: 5 (track ID for mkvmerge & mkvextract: 4)
    | + Track UID: 9054752488569948265
    | + Track type: subtitles
    | + Default flag: 0
    | + Lacing flag: 0
    | + Codec ID: S_TEXT/UTF8
    | + Language: ita

    Any subsequent reference to tracks, however, fails on the tags metadata.

    substudy export tracks CM6-0E-UT3.2_DES.mkv it.srt
    MissingFieldError("tags")

    opened by rdearman 3
  • MissingFieldError("codec_name")

    It seems that if you use a format other than MKV (such as mp4), you get the error MissingFieldError("codec_name"). However, a simple conversion with ffmpeg works. I wasn't sure if there is another way to get around this. I may try my hand at a PR on this, but I have never done Rust ;) Great program btw, much easier than sub2srs.

    ffmpeg -i ~/Downloads/file.mp4 -vcodec copy -acodec copy ~/Downloads/out_file.mkv

    opened by mattkanwisher 2
  • graphic interface

    It would be nice to have a graphical interface sometime in the future. I know there is no clear solution to this problem now, so maybe just keep this issue open until a lightweight, cross-platform solution exists.

    opened by jaroslaw-weber 2
  • Substudy: Adjust image size as command line option

    The tool is great, but the images are a little small for me. I guess defaulting to small images is meant to save disk space, but it would be great to be able to select the image size. I'm not so familiar with Rust, so I couldn't figure out from the code how image size is handled currently, but perhaps allowing the user to scale the input resolution by a factor of the original would be the simplest way.

    opened by sykul 0
  • Add cleaning of bracketed text

    Adds cleaning of bracketed text such as [Music] to the substudy clean command. This text is found on YouTube video subtitles such as the start of this video. Let me know if you want me to remove the test case I added to substudy/src/clean.rs or if there is a better place to put it. Thanks for making this project!
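
For readers unfamiliar with this kind of cleanup, here is a minimal standalone sketch of stripping bracketed text. It is not the code from this pull request or from substudy/src/clean.rs: `strip_bracketed` is a hypothetical function that uses only the standard library.

```rust
/// Remove every `[...]` span from `text`, then collapse the whitespace
/// left behind. Hypothetical helper, for illustration only.
/// Caveats: an unclosed `[` drops the rest of the text, and collapsing
/// whitespace also joins multi-line subtitles into a single line.
pub fn strip_bracketed(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut depth = 0usize;
    for ch in text.chars() {
        match ch {
            '[' => depth += 1,
            ']' if depth > 0 => depth -= 1,
            // Keep characters only while outside all brackets.
            _ if depth == 0 => out.push(ch),
            _ => {}
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

A subtitle that consists only of bracketed text comes back as an empty string, which a caller could then drop entirely.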

    opened by sirfredrick 0
  • In combined mode, map words to their translation

    First of all thanks for the amazing tool. It's super nice to be able to combine both subtitles.

    Ideally, I would like to be able to select words from a subtitle and act on them, as https://animelon.com/video/594e558522477e0e27f1f035 does.

    I think you have a similar project with https://github.com/emk/subtitles-rs/issues/22. A shorter-term goal could be to map each word in the srt file to its translation (in combined mode) via different colors or underlining where that makes sense.

    enhancement 
    opened by teto 3
  • Vobsub: allow parsing demuxed data streams

    In a framework such as GStreamer, the vobsub decoder will be inserted in a pipeline after the elements that take care of demuxing the MPEG-2 stream. In such a situation, the PES packets are already parsed and only the subtitle data is available.

    This PR introduces the data mode in SubtitlesFromChunks which uses the same interface as the packet mode. The file fixtures/example_data.sub is a demuxed version of fixtures/example.sub.

    Notes:

    • I didn't try fuzz testing yet.
    • This PR includes the changes from #41. I'll rebase if it is accepted.
    opened by fengalin 0
  • Vobsub: allow parsing input stream by chunks

    This is a first step toward https://github.com/sdroege/gst-plugin-rs/issues/21

    The PR adds a new struct SubtitlesFromChunks which keeps track of the parsing context and allows passing chunks.

    Unless I missed something, the API and behaviour for Subtitles are unchanged. I couldn't find a way to avoid using Rc<RefCell<>> for SubtitlesContext without breaking the existing API.

    Changes were tested using examples in "fixtures" (including for the parse_corpus ignored set).

    opened by fengalin 7
Releases(substudy_v0.4.5)
  • substudy_v0.4.5(Dec 17, 2017)

    This release includes lots of handy minor fixes:

    • We now have official binaries for Linux, Mac and Windows. Please let me know whether these work for you.
    • The uchardet and cld2 dependencies have been replaced with pure Rust dependencies. This makes it easier to support many platforms and to build from source.
    • We now have a progress bar for media exports!
    • Argument parsing has been totally overhauled, so help messages should be better.
    • Error formatting has been standardized and improved, so it's easier to figure out why something went wrong.
    • We now support *.srt files generated by Aeneas, which is excellent for syncing audiobooks with text. These broke before because Aeneas occasionally generates 0-second subtitles, which we rejected as invalid.
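
To illustrate the Aeneas fix described above, the sketch below shows a validation rule that accepts 0-second subtitles while still rejecting reversed or negative times. It is an assumption-laden example, not substudy's implementation: the `Period` type here, its fields and its error type are hypothetical.

```rust
/// A subtitle time period in seconds. Hypothetical type, for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Period {
    begin: f64,
    end: f64,
}

impl Period {
    /// Accept `begin == end` (a 0-second subtitle). Reject negative times,
    /// an end before the beginning, and NaN values.
    pub fn new(begin: f64, end: f64) -> Result<Period, String> {
        if !(begin >= 0.0 && end >= begin) {
            return Err(format!("invalid time period: {} to {}", begin, end));
        }
        Ok(Period { begin, end })
    }

    /// Length of the period in seconds; 0.0 for a 0-second subtitle.
    pub fn duration(&self) -> f64 {
        self.end - self.begin
    }
}
```

The only difference from a stricter check is the use of `end >= begin` instead of `end > begin`.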
Owner
Eric Kidd