Ultra-fast, spookily accurate text summarizer that works on any language

Overview

pithy 0.1.0 - an absurdly fast, strangely accurate summariser

Quick example:
pithy -f your_file_here.txt --sentences 4

--help:

Prints this help message.

-f:

The file pithy will read from. Required.

--sentences:

The number of sentences for pithy to return. Defaults to 3.

--density:

Experimental setting. Defaults to 3. Setting it lower
makes for more general summaries with more common words;
setting it higher prioritises important highlights that
might not be central to the text.

--by_section:

If set, pithy splits the text into sections, and each section is
summarized separately. Defaults to false.

--chunk_size:

The number of sentences to read at a time. Defaults to 500 
if unspecified.
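
To make the chunking idea concrete, here is a minimal Python sketch of grouping sentences into batches of chunk_size. It is an illustration only, not pithy's actual code; the function name chunk_sentences is made up for this example.

```python
from typing import Iterator, List


def chunk_sentences(sentences: List[str], chunk_size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive groups of at most chunk_size sentences.

    Hypothetical helper: it only mirrors the documented default of 500.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(sentences), chunk_size):
        yield sentences[start:start + chunk_size]


# 1200 dummy sentences end up in two full chunks and one partial chunk.
sentences = [f"Sentence {i}" for i in range(1200)]
chunk_lengths = [len(chunk) for chunk in chunk_sentences(sentences)]
```

With the default size, a text of 1200 sentences would be read in three passes, the last one shorter than the others.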

--force_all:

If set, pithy reads the text all at once. Can be quite 
slow once you go past the 7k mark. Defaults to false.

--force_chunk:

If set, regardless of how large the text is, pithy splits it
into chunks. Should be used in combination with chunk_size 
and by_section.

--ngrams:

If set, pithy uses ngrams rather than words. It's usually crap,
but you might use it as a last resort for non-spaced languages
that you can't pre-tokenise. Defaults to false.

--min_length:

The minimum sentence length before filtering. Defaults to 30.

--max_length:

The maximum sentence length before filtering. Defaults to 1500.

--separator:

The separator used to split the text into sentences. 
Defaults to '. '. You can type newline to separate by newlines.
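
As a rough picture of how --separator, --min_length and --max_length might interact, the Python sketch below splits a text and then drops sentences outside the length bounds. It is not pithy's implementation. The function names are hypothetical, and measuring length in characters with inclusive bounds is an assumption, since the README does not state the unit.

```python
from typing import List


def split_sentences(text: str, separator: str = ". ") -> List[str]:
    """Split text on the separator; the word "newline" stands for a line break."""
    if separator == "newline":
        separator = "\n"
    # Drop empty fragments and surrounding whitespace.
    return [part.strip() for part in text.split(separator) if part.strip()]


def filter_by_length(sentences: List[str], min_length: int = 30,
                     max_length: int = 1500) -> List[str]:
    """Keep sentences within the bounds (assumed: characters, inclusive)."""
    return [s for s in sentences if min_length <= len(s) <= max_length]


text = "Too short. This sentence is comfortably longer than thirty characters. Ok"
parts = split_sentences(text)
kept = filter_by_length(parts)
```

Here only the middle sentence survives, because the other two fall below the default minimum of 30.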

--clean_whitespace:

If set, removes sentences with excessive whitespace. Useful for 
pdfs and copy-pastes from websites.

--clean_nonalphabetic:

If set, removes sentences with too many non-alphabetic characters.

--clean_caps:

If set, removes sentences with too many capital letters. Useful 
if the text contains a lot of references or indices.
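
The three clean_* options read like ratio checks on each sentence. The Python sketch below shows one way such filters could work. The thresholds and function names are invented for illustration and are not taken from pithy.

```python
from typing import Callable


def ratio(sentence: str, predicate: Callable[[str], bool]) -> float:
    """Fraction of characters in the sentence for which predicate is true."""
    if not sentence:
        return 0.0
    return sum(1 for ch in sentence if predicate(ch)) / len(sentence)


# Hypothetical thresholds, chosen only for this illustration.
MAX_WHITESPACE = 0.3
MAX_NONALPHABETIC = 0.4
MAX_CAPS = 0.3


def keep_sentence(sentence: str) -> bool:
    """Return False if any of the three heuristics rejects the sentence."""
    if ratio(sentence, str.isspace) > MAX_WHITESPACE:
        return False  # excessive whitespace, e.g. PDF layout residue
    if ratio(sentence, lambda ch: not ch.isalpha() and not ch.isspace()) > MAX_NONALPHABETIC:
        return False  # too many digits and symbols
    if ratio(sentence, str.isupper) > MAX_CAPS:
        return False  # too many capitals, e.g. reference lists
    return True


prose_ok = keep_sentence("The model was trained on a large corpus of news articles")
reference_ok = keep_sentence("SMITH J, DOE A, 2019, PP 12-14")
spaced_ok = keep_sentence("a      b      c      ")
```

Ordinary prose passes all three checks, while the reference-style line and the whitespace-heavy line are rejected.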

--length_penalty:

The length penalty. Defaults to 1.5. Decrease it to favour longer
sentences, increase it to favour shorter ones.

--no_context:

If set, the context surrounding sentences isn't provided. 
Defaults to false.

--relevance:

If set, the sentences are sorted by their relevance rather 
than their order in the original text. Defaults to false.
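
The difference between the two orderings can be sketched in a few lines of Python. The scores and sentences below are made up, and order_summary is a hypothetical name; this only illustrates the documented behaviour and is not pithy's code.

```python
from typing import List, Tuple

# (position in the original text, relevance score, sentence); all values invented.
selected: List[Tuple[int, float, str]] = [
    (0, 0.41, "First sentence."),
    (7, 0.93, "Eighth sentence."),
    (3, 0.67, "Fourth sentence."),
]


def order_summary(items: List[Tuple[int, float, str]],
                  by_relevance: bool = False) -> List[str]:
    """Sort by score (highest first) if by_relevance, else by original position."""
    if by_relevance:
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
    else:
        ranked = sorted(items, key=lambda item: item[0])
    return [sentence for _, _, sentence in ranked]


in_text_order = order_summary(selected)
in_relevance_order = order_summary(selected, by_relevance=True)
```

The default keeps the summary readable as a condensed version of the text, while sorting by relevance puts the highest-scoring sentence first.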

--nobar:

If set, the progress bar is not printed. Defaults to false because
progress bars are cool.

Owner
Catherine Koshka