stringsext is a Unicode enhancement of the GNU strings tool with additional functionality: stringsext recognizes Cyrillic, Arabic, CJKV characters and other scripts in all supported multi-byte encodings, while GNU strings fails to find any of these scripts in UTF-16 and many other encodings.

stringsext prints all graphic character sequences in FILE or stdin that are at least MIN bytes long.
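The minimum-length rule can be illustrated with a small ASCII-only scanner. This is a minimal sketch, not stringsext's actual implementation: the function name `ascii_strings` is hypothetical, and treating the space character as part of a string is an assumption made here for readability.

```rust
/// Collects runs of printable ASCII bytes that are at least `min_len` bytes long.
/// Returns (byte offset, string) pairs. Hypothetical helper, for illustration only.
fn ascii_strings(data: &[u8], min_len: usize) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut start = 0;
    // Iterate one position past the end so that the final run is flushed too.
    for i in 0..=data.len() {
        // Assumption: a space counts as part of a string, like a graphic character.
        let printable = i < data.len() && (data[i].is_ascii_graphic() || data[i] == b' ');
        if !printable {
            if i - start >= min_len {
                // The run contains only ASCII bytes, so the conversion is lossless.
                found.push((start, String::from_utf8_lossy(&data[start..i]).into_owned()));
            }
            start = i + 1;
        }
    }
    found
}

fn main() {
    let data = b"\x7fELF\x02\x01\x00GCC: (Ubuntu)\x00\x00ab\x00main\x00";
    for (offset, s) in ascii_strings(data, 4) {
        println!("{:x}\t{}", offset, s);
    }
}
```

With a minimum length of 4, runs such as "ab" in the sample buffer are dropped because they are shorter than the threshold.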

Unlike GNU strings, stringsext can be configured to search for valid characters not only in ASCII but also in many other input encodings, e.g. UTF-8, UTF-16BE, UTF-16LE, BIG5-2003, EUC-JP and KOI8-R. The option --list-encodings shows a list of valid encoding names based on the WHATWG Encoding Standard. When more than one encoding is specified, the scans run simultaneously in separate threads.
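The idea of scanning one input under several encodings in parallel can be sketched with scoped threads from the Rust standard library. This is only an illustration under stated assumptions, not how stringsext is implemented: the helpers `runs`, `scan_ascii`, `scan_utf16le` and `scan_all` are hypothetical, only two encodings are covered, and string length is counted in characters rather than bytes.

```rust
use std::thread;

/// Groups consecutive non-control characters into strings of at least
/// `min_len` characters. `None` marks an undecodable position.
fn runs(chars: impl Iterator<Item = Option<char>>, min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    // A trailing `None` flushes the last run.
    for c in chars.chain(std::iter::once(None)) {
        match c {
            Some(ch) if !ch.is_control() => cur.push(ch),
            _ => {
                if cur.chars().count() >= min_len {
                    out.push(std::mem::take(&mut cur));
                } else {
                    cur.clear();
                }
            }
        }
    }
    out
}

/// Interprets every byte as ASCII; bytes above 0x7F end the current run.
fn scan_ascii(data: &[u8], min_len: usize) -> Vec<String> {
    runs(data.iter().map(|&b| if b.is_ascii() { Some(b as char) } else { None }), min_len)
}

/// Interprets the input as little-endian UTF-16 code units.
fn scan_utf16le(data: &[u8], min_len: usize) -> Vec<String> {
    let units = data.chunks_exact(2).map(|p| u16::from_le_bytes([p[0], p[1]]));
    runs(char::decode_utf16(units).map(|r| r.ok()), min_len)
}

/// Runs both scanners over the same buffer, one thread per encoding.
fn scan_all(data: &[u8], min_len: usize) -> (Vec<String>, Vec<String>) {
    thread::scope(|s| {
        let ascii = s.spawn(|| scan_ascii(data, min_len));
        let utf16 = s.spawn(|| scan_utf16le(data, min_len));
        (ascii.join().unwrap(), utf16.join().unwrap())
    })
}

fn main() {
    // ASCII text followed by the same buffer holding UTF-16LE encoded Cyrillic.
    let mut data = b"hello\0\0\0".to_vec();
    for unit in "привет".encode_utf16() {
        data.extend_from_slice(&unit.to_le_bytes());
    }
    data.extend_from_slice(&[0, 0]);
    let (ascii, utf16) = scan_all(&data, 4);
    println!("ASCII: {:?}", ascii);
    println!("UTF-16LE: {:?}", utf16);
}
```

Scoped threads let both scanners borrow the same input buffer without copying it, which is why this sketch uses them instead of detached threads.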

When searching for UTF-16 encoded strings, 96% of all possible two-byte sequences, interpreted as UTF-16 code units, map directly to Unicode code points. As a result, the probability of encountering valid Unicode characters in a random byte stream, interpreted as UTF-16, is also 96%. To reduce this large number of false positives, stringsext provides a parametrizable Unicode block filter. See the --encodings and --same-unicode-block options in the manual page for more details.
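One plausible way to arrive at a figure of this size is to count the 16-bit values that form a Unicode scalar value on their own, which is every value outside the surrogate range 0xD800 to 0xDFFF. The sketch below does that count; the function name `standalone_code_units` is hypothetical, and equating "relate directly to Unicode codepoints" with "is not a surrogate" is an assumption about how the number was derived.

```rust
/// Counts the 16-bit values that are a Unicode scalar value by themselves,
/// i.e. everything except the surrogate range 0xD800..=0xDFFF.
fn standalone_code_units() -> u32 {
    (0..=u16::MAX)
        .filter(|&u| char::from_u32(u as u32).is_some())
        .count() as u32
}

fn main() {
    let valid = standalone_code_units();
    let total = 1u32 << 16; // all possible two-byte sequences
    let percent = 100.0 * valid as f64 / total as f64;
    println!("{} of {} code units ({}%)", valid, total, percent);
}
```

Under this reading, 63488 of 65536 values qualify (65536 minus the 2048 surrogates), a ratio of 96.875%, which is close to the 96% quoted above.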

stringsext is mainly useful for extracting Unicode content out of non-text files.

When invoked as stringsext -e ascii, stringsext can be used as a replacement for GNU strings.


stringsext -tx -e utf-8 -e utf-16le -e utf-16be \
           -n 10 -a None -u African  /dev/disk/by-uuid/567a8410

 3de2fff0+	(b UTF-16LE)	ݒݓݔݕݖݗݙݪ
 3de30000+	(b UTF-16LE)	ݫݱݶݷݸݹݺ
<3de36528 	(a UTF-8)	فيأنمامعكلأورديافىهولملكاولهبسالإنهيأيقدهلثمبهلوليبلايبكشيام
>3de36528+	(a UTF-8)	أمنتبيلنحبهممشوش
<3de3a708 	(a UTF-8)	علىإلىهذاآخرعددالىهذهصورغيركانولابينعرضذلكهنايومقالعليانالكن
>3de3a708+	(a UTF-8)	حتىقبلوحةاخرفقطعبدركنإذاكمااحدإلافيهبعضكيفبح
 3de3a780+	(a UTF-8)	ثومنوهوأناجدالهاسلمعندليسعبرصلىمنذبهاأنهمثلكنتالاحيثمصرشرححو
 3de3a7f8+	(a UTF-8)	لوفياذالكلمرةانتالفأبوخاصأنتانهاليعضووقدابنخيربنتلكمشاءوهياب
 3de3a870+	(a UTF-8)	وقصصومارقمأحدنحنعدمرأياحةكتبدونيجبمنهتحتجهةسنةيتمكرةغزةنفسبي
 3de3a8e8+	(a UTF-8)	تللهلناتلكقلبلماعنهأولشيءنورأمافيكبكلذاترتببأنهمسانكبيعفقدحس
 3de3a960+	(a UTF-8)	نلهمشعرأهلشهرقطرطلب
 3df4cca8 	(c UTF-16BE)	փօև։֋֍֏֑֛֚֓֕֗֙֜֝֞׹
<3df4cd20 	(c UTF-16BE)	־ֿ׀ׁׂ׃ׅׄ׆ׇ׈׉׊׋


Building and installing

  1. Install Rust, e.g.

    curl -sSf | sh

  2. Download, compile and install:

    cargo install stringsext
    sudo cp ~/.cargo/bin/stringsext /usr/local/bin

This project follows Semantic Versioning.



  • Jens Getreu


  • Apache 2 license or MIT license

  • Byte offsets not accurate

    Hello and thank you. The byte offsets produced by the -t flag don't appear to be entirely accurate: there are duplicates. Reading the manual, it appears that the markers signify either a range (<), (>) or that the line continues a line that went past the length limit (+). The offsets therefore seem to be an approximation rather than an exact location, as they are with the standard strings command. Is this because the worker threads are not aware of one another and cannot piece together an exact picture?

    opened by STashakkori 2
  • No ELF? Intended behavior?

    When I run stringsext on a binary, I get an empty line where the ELF designator would be. Example:

    GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 test.c main .symtab .strtab .shstrtab .text .data .bss .comment .note.GNU-stack .rela.eh_frame

    Whereas the typical strings function gives: ELF GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 test.c main .symtab .strtab .shstrtab .text .data .bss .comment .note.GNU-stack .rela.eh_frame

    Is this by design? Frankly, I would rather see that string in the output. Thank you.

    opened by STashakkori 2
  • a few typos

    s/Courir/Courier/ s/are then are/are then/ s/A valid strings is/A valid string is/ s/In practise/In practice,/ s/as invalid character/as an invalid character/ s/as sequence/as a sequence/ s/therefor/therefore/ s/Therefor/Therefore/

    opened by oylenshpeegul 2
  • Implements support for start and end offsets.

    This relates to #3 in my Need for Speed. It would be great to be able to specify start and end offsets for reading the file. This way one could cheaply parallelize the entire scan by allocating different chunks of a large file to different stringsext instances.



    opened by KelSolaar 1
  • Implement support for Regex filtering of strings.

    Follow-up to the email thread: I was looking at using stringsext to scrape paths in binary files. However, piping the output to grep, for example, is very slow for large files (e.g. 25 GB). Native regex filtering of the found strings might help here, instead of piping a torrent of data via stdout.



    opened by KelSolaar 1
  • Split out into a library crate?

    Hi, this looks great! I was wondering if you would be open to splitting out the functionality into a library crate for use via Then this repo would end up being a command line interface for the library.

    opened by rlabrecque 1
