🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Overview

Sonic

Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried in a matter of microseconds.

Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database.

Strong attention has been paid to performance and code cleanliness when designing Sonic. It aims to be crash-free and super-fast, and to put minimal strain on server resources (our measurements have shown that Sonic, when under load, responds to search queries in the μs range, uses ~30MB of RAM and has a low CPU footprint; see our benchmarks).

Tested at Rust version: rustc 1.47.0 (18bf6b4f0 2020-10-07)

🇫🇷 Crafted in Nantes, France.

📰 The Sonic project was initially announced in a post on my personal journal.

« Sonic » is the mascot of the Sonic project. I drew it to look like a psychedelic hipster hedgehog.

Who uses it?

Crisp Scrumpy

👋 Do you use Sonic and want to be listed here? Contact me.

Demo

Sonic is integrated in all Crisp search products on the Crisp platform. It is used to index half a billion objects on a $5/mth 1-vCPU SSD cloud server (as of 2019). Crisp users use it to search in their messages, conversations, contacts, helpdesk articles and more.

You can test Sonic live on Crisp Helpdesk to get an idea of the speed and relevance of Sonic search results. You can also test search suggestions from there: type at least 2 characters of a word to get a full-word suggestion (press the tab key to expand the suggestion). Both search and suggestions are powered by Sonic.

Demo on Crisp Helpdesk search

Sonic fuzzy search in helpdesk articles at its best. Look up any word or group of terms and get results instantly.

Features

  • Search terms are stored in collections, organized in buckets; you may use a single bucket, or a bucket per user on your platform if you need to search in separate indexes.
  • Search results return object identifiers, that can be resolved from an external database if you need to enrich the search results. This makes Sonic a simple word index, that points to identifier results. Sonic doesn't store any direct textual data in its index, but it still holds a word graph for auto-completion and typo corrections.
  • Search query typos are corrected: if there are not enough exact-match results for a given word in a search query, Sonic tries to correct the word and searches against alternate words. You're allowed to make mistakes when searching.
  • Insert and remove items in the index; index-altering operations are light and can be committed to the server while it is running. A background tasker handles the job of consolidating the index so that the entries you have pushed or popped are quickly made available for search.
  • Auto-complete any word in real-time via the suggest operation. This helps build a snappy word suggestion feature in your end-user search interface.
  • Full Unicode compatibility for the 80+ most spoken languages in the world. Sonic removes useless stop words from any text (eg. 'the' in English), after guessing the text language. This ensures any searched or ingested text is clean before it hits the index; see languages.
  • Simple protocol (Sonic Channel) that lets you search your index, manage data ingestion (push in the index, pop from the index, flush a collection, flush a bucket, etc.) and perform administrative actions. Sonic Channel was designed to be lightweight on resources and simple to integrate with; read protocol specification.
  • Easy-to-use libraries, that let you connect to Sonic from your apps; see libraries.
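Because Sonic returns object identifiers rather than documents, the application is expected to resolve them itself. The sketch below illustrates that step under stated assumptions: the `DOCUMENTS` dictionary stands in for a real external database, and `resolve_results` is a hypothetical helper, not part of any Sonic library.

```python
from typing import Dict, List

# Hypothetical external database: Sonic only stores identifiers, so the
# documents themselves live elsewhere (a dict stands in for a real database).
DOCUMENTS: Dict[str, str] = {
    "id_1": "Some sample text number one",
    "id_2": "Some sample text number two",
}


def resolve_results(object_ids: List[str], database: Dict[str, str]) -> List[str]:
    """Map identifiers returned by a search to the stored documents.

    Identifiers missing from the database (for example, deleted since they
    were indexed) are skipped rather than raising an error.
    """
    return [database[oid] for oid in object_ids if oid in database]


# Identifiers as they might come back from a search (assumed values).
print(resolve_results(["id_2", "id_1", "id_9"], DOCUMENTS))
```

Skipping unknown identifiers is one possible design choice; an application could equally decide to pop stale identifiers from the index when it finds them.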

How to use it?

Installation

Sonic is built in Rust. To install it, either download a version from the Sonic releases page, use cargo install or pull the source code from master.

👉 Install from source:

If you pulled the source code from Git, you can build it using cargo:

cargo build --release

You can find the built binaries in the ./target/release directory.

Install build-essential, clang, libclang-dev, libc6-dev, g++ and llvm-dev to be able to compile the required RocksDB dependency.

👉 Install from Cargo:

You can install Sonic directly with cargo install:

cargo install sonic-server

Ensure that your $PATH is properly configured to source the Crates binaries, and then run Sonic using the sonic command.

Install build-essential, clang, libclang-dev, libc6-dev, g++ and llvm-dev to be able to compile the required RocksDB dependency.

👉 Install from Docker Hub:

You might find it convenient to run Sonic via Docker. You can find the pre-built Sonic image on Docker Hub as valeriansaliou/sonic.

First, pull the valeriansaliou/sonic image:

docker pull valeriansaliou/sonic:v1.3.0

Then, seed it with a configuration file and run it (replace /path/to/your/sonic/config.cfg with the path to your configuration file):

docker run -p 1491:1491 -v /path/to/your/sonic/config.cfg:/etc/sonic.cfg -v /path/to/your/sonic/store/:/var/lib/sonic/store/ valeriansaliou/sonic:v1.3.0

In the configuration file, ensure that:

  • channel.inet is set to 0.0.0.0:1491 (this lets Sonic be reached from outside the container)
  • store.kv.path is set to /var/lib/sonic/store/kv/ (this lets the external KV store directory be reached by Sonic)
  • store.fst.path is set to /var/lib/sonic/store/fst/ (this lets the external FST store directory be reached by Sonic)

Sonic will be reachable from tcp://localhost:1491.
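A quick way to confirm that the published port is reachable is to open a TCP connection to it. The helper below is a minimal sketch: `is_reachable` is a hypothetical name, and it only checks that something accepts connections on the port, not that it speaks Sonic Channel.

```python
import socket


def is_reachable(host: str = "localhost", port: int = 1491, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and connects, honouring the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (run after starting the container):
#   print("Sonic reachable:", is_reachable())
```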

👉 Install from another source (non-official):

Other installation sources are available:

Note that those sources are non-official, meaning that they are not owned nor maintained by the Sonic project owners. The latest Sonic version available on those sources might be outdated, in comparison to the latest version available through the Sonic project.

Configuration

Use the sample config.cfg configuration file and adjust it to your own environment.

If you are looking to fine-tune your configuration, you may read our detailed configuration documentation.

Run Sonic

Sonic can be run as such:

./sonic -c /path/to/config.cfg

Perform searches and manage objects

Both searches and object management (i.e. data ingestion) are handled via the Sonic Channel protocol only. As we want to keep things simple with Sonic (similarly to how Redis does it), Sonic does not offer an HTTP endpoint or similar; connecting via Sonic Channel is the way to go when you need to interact with the Sonic search database.

Sonic distributes official libraries that let you integrate Sonic into your apps easily. Click on a library below to see library integration documentation and code.

If you are looking for details on the raw Sonic Channel TCP-based protocol, you can read our detailed protocol documentation. It can prove handy if you are looking to code your own Sonic Channel library.
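To give a feel for what a Sonic Channel library sends over the wire, the sketch below formats two text commands of the shape `PUSH <collection> <bucket> <object> "<text>"` and `QUERY <collection> <bucket> "<terms>"`. The function names are hypothetical, and the backslash escaping of quotes and newlines is an assumption that should be checked against the protocol documentation before relying on it.

```python
def quote_text(text: str) -> str:
    """Wrap text in double quotes for use in a command line.

    Assumption: quotes and newlines are backslash-escaped; verify this
    against the protocol specification.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def build_push(collection: str, bucket: str, object_id: str, text: str) -> str:
    """Format an ingestion command for one object."""
    return f"PUSH {collection} {bucket} {object_id} {quote_text(text)}"


def build_query(collection: str, bucket: str, terms: str) -> str:
    """Format a search command for the given terms."""
    return f"QUERY {collection} {bucket} {quote_text(terms)}"


print(build_push("messages", "default", "id_1", "Some sample text number one"))
print(build_query("messages", "default", "sample"))
```

Each command would be sent on its own line over a channel opened in the matching mode; session setup and response parsing are left out of this sketch.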

📦 Sonic Channel Libraries

1️⃣ Official Libraries

Sonic distributes official Sonic integration libraries for your programming language (official means that those libraries have been reviewed and validated by a core maintainer):

2️⃣ Community Libraries

You can find below a list of Sonic integrations provided by the community (many thanks to them!):

ℹ️ Cannot find the library for your programming language? Build your own and be referenced here! (contact me)

Which text languages are supported?

Sonic supports a wide range of languages in its lexing system. If a language is not in this list, you will still be able to push text in this language to the search index, but stop-words will not be removed, which could lead to lower-quality search results.

The languages supported by the lexing system are:

  • 🇿🇦 Afrikaans
  • 🇸🇦 Arabic
  • 🇦🇿 Azerbaijani
  • 🇧🇩 Bengali
  • 🇧🇬 Bulgarian
  • 🇲🇲 Burmese
  • 🏳 Catalan
  • 🇨🇳 Chinese (Simplified)
  • 🇹🇼 Chinese (Traditional)
  • 🇭🇷 Croatian
  • 🇨🇿 Czech
  • 🇩🇰 Danish
  • 🇳🇱 Dutch
  • 🇬🇧 English
  • 🏳 Esperanto
  • 🇪🇪 Estonian
  • 🇫🇮 Finnish
  • 🇫🇷 French
  • 🇩🇪 German
  • 🇬🇷 Greek
  • 🇳🇬 Hausa
  • 🇮🇱 Hebrew
  • 🇮🇳 Hindi
  • 🇭🇺 Hungarian
  • 🇮🇩 Indonesian
  • 🇮🇹 Italian
  • 🇯🇵 Japanese
  • 🇮🇳 Kannada
  • 🇰🇭 Khmer
  • 🇰🇷 Korean
  • 🏳 Kurdish
  • 🏳 Latin
  • 🇱🇻 Latvian
  • 🇱🇹 Lithuanian
  • 🇮🇳 Marathi
  • 🇳🇵 Nepali
  • 🇮🇷 Persian
  • 🇵🇱 Polish
  • 🇵🇹 Portuguese
  • 🇮🇳 Punjabi
  • 🇷🇺 Russian
  • 🇸🇰 Slovak
  • 🇸🇮 Slovene
  • 🇸🇴 Somali
  • 🇪🇸 Spanish
  • 🇸🇪 Swedish
  • 🇵🇭 Tagalog
  • 🇮🇳 Tamil
  • 🇹🇭 Thai
  • 🇹🇷 Turkish
  • 🇺🇦 Ukrainian
  • 🇵🇰 Urdu
  • 🇻🇳 Vietnamese
  • 🇮🇱 Yiddish
  • 🇳🇬 Yoruba
  • 🇿🇦 Zulu

How fast & lightweight is it?

Sonic was built for Crisp from the start. As Crisp was growing and indexing more and more search data into a full-text search SQL database, we decided it was time to switch to a proper search backend system. When reviewing Elasticsearch (ELS) and others, we found those were full-featured heavyweight systems that did not scale well with Crisp's freemium-based cost structure.

In the end, we decided to build our own search backend, designed to be simple and lightweight on resources.

You can run function-level benchmarks with the command: cargo bench --features benchmark

👩‍🔬 Benchmark #1

➡️ Scenario

We extracted all messages from the Crisp team that are used for Crisp's own customer support.

We want to import all those messages into a clean Sonic instance, and then perform searches on the index we built. We will measure the time that Sonic spent executing each operation (ie. each PUSH and QUERY command over Sonic Channel), and group results per 1,000 operations (this outputs a mean time per 1,000 operations).
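The measurement approach described above (time each operation, then report one mean per batch of 1,000) can be sketched as follows. This is not the benchmark script itself: `mean_time_per_batch` is a hypothetical helper, and the operations are passed in as plain callables so that any client call can be plugged in.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List


def mean_time_per_batch(
    operations: Iterable[Callable[[], None]], batch_size: int = 1000
) -> List[float]:
    """Run each operation and return the mean duration (in seconds) per batch."""
    means: List[float] = []
    batch: List[float] = []
    for operation in operations:
        started = time.perf_counter()
        operation()
        batch.append(time.perf_counter() - started)
        if len(batch) == batch_size:
            means.append(mean(batch))
            batch = []
    if batch:
        # Report the final, partially filled batch as well.
        means.append(mean(batch))
    return means


# Example with no-op operations: 2,500 operations yield 3 batch means.
print(len(mean_time_per_batch([lambda: None] * 2500)))
```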

➡️ Context

Our benchmark was run on the following computer:

  • Device: MacBook Pro (Retina, 15-inch, Mid 2014)
  • OS: macOS 10.14.3
  • Disk: 512GB SSD (formatted under the APFS file system)
  • CPU: 2.5 GHz Intel Core i7
  • RAM: 16 GB 1600 MHz DDR3

Sonic is compiled as follows:

  • Sonic version: 1.0.1
  • Rustc version: rustc 1.35.0-nightly (719b0d984 2019-03-13)
  • Compiler flags: release profile (-O3 with lto)

Our dataset is as such:

  • Number of objects: ~1,000,000 messages
  • Total size: ~100MB of raw message text (this does not account for identifiers and other metas)

➡️ Scripts

The scripts we used to perform the benchmark are:

  1. PUSH script: sonic-benchmark_batch-push.js
  2. QUERY script: sonic-benchmark_batch-query.js

⏬ Results

Our findings:

  • We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);
  • Once imported, the search index weighs 20MB (KV) + 1.4MB (FST) on disk;
  • CPU usage during import averaged 75% of a single CPU core;
  • RAM usage for the Sonic process peaked at 28MB during our benchmark;
  • We used a single Sonic Channel TCP connection, which limits the import to a single thread (we could have load-balanced this across as many Sonic Channel connections as there are CPUs);
  • We get an import RPS approaching 4,000 operations per second (per thread);
  • We get a search query RPS approaching 1,000 operations per second (per thread);
  • On the hyper-threaded 4-core CPU used, we could have parallelized operations across 8 virtual cores, thus theoretically increasing the import RPS to 32,000 operations / second, while the search query RPS would increase to 8,000 operations / second (we may be SSD-bound at some point though).
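The theoretical figures in the last point are a straight multiplication of the per-thread rates by the number of virtual cores. The short calculation below reproduces them; it assumes perfectly linear scaling, which the findings themselves flag as optimistic (disk throughput may become the limit).

```python
IMPORT_RPS_PER_THREAD = 4_000  # measured import rate per thread (approximate)
QUERY_RPS_PER_THREAD = 1_000   # measured query rate per thread (approximate)
VIRTUAL_CORES = 8              # 4 physical cores, hyper-threaded


def theoretical_rps(per_thread_rps: int, threads: int) -> int:
    """Upper bound on throughput, assuming linear scaling across threads."""
    return per_thread_rps * threads


print(theoretical_rps(IMPORT_RPS_PER_THREAD, VIRTUAL_CORES))  # 32000
print(theoretical_rps(QUERY_RPS_PER_THREAD, VIRTUAL_CORES))   # 8000
```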

Compared results per operation (on a single object):

We took a sample of 8 results from our batched operations, which produced a total of 1,000 results (1,000,000 items, with 1,000 items batched per measurement report).

This is not very scientific, but it should give you a clear idea of Sonic's performance.

Time spent per operation:

Operation   Average   Best    Worst
PUSH        275μs     190μs   363μs
QUERY       880μs     852μs   1ms
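As a sanity check, a mean time per operation can be converted into a single-connection throughput by taking its inverse. The sketch below does this for the average timings in the table; it ignores network and client overhead, so the results are only expected to be of the same order as the per-thread RPS figures reported in the findings.

```python
def ops_per_second(mean_microseconds: float) -> int:
    """Convert a mean time per operation (in μs) to operations per second."""
    return round(1_000_000 / mean_microseconds)


print(ops_per_second(275))  # PUSH average  -> 3636
print(ops_per_second(880))  # QUERY average -> 1136
```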

Batch PUSH results as seen from our terminal (from initial index of: 0 objects):

Batch PUSH benchmark

Batch QUERY results as seen from our terminal (on index of: 1,000,000 objects):

Batch QUERY benchmark

Limitations

  • Indexed data limits: Sonic is designed for large search indexes split over thousands of search buckets per collection. An IID (ie. Internal-ID) is stored in the index as a 32-bit number, which theoretically allows up to ~4.2 billion objects (ie. OIDs) to be indexed per bucket. We've observed storage savings of 30% to 40%, which justifies the trade-off on large databases (versus Sonic using 64-bit IIDs). Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding-window fashion (the sliding window width can be configured).
  • Search query limits: Sonic's Natural Language Processing system (NLP) does not work at the sentence level, for storage compactness reasons (we keep the FST graph shallow so as to reduce time and space complexity). It works at the word level, and is thus able to search per-word and can predict a word based on user input, though it is unable to predict the next word in a sentence.
  • Real-time limits: the FST needs to be rebuilt every time a word is pushed to or popped from the bucket graph. As this is quite heavy, Sonic batches rebuild cycles. If you have just pushed a new word to the index and you are not seeing it in the SUGGEST command yet, wait for the next rebuild cycle to kick in, or force it with TRIGGER consolidate in a control channel.
  • Interoperability limits: The Sonic Channel protocol is the only way to read and write search entries to the Sonic search index. Sonic does not expose any HTTP API. Sonic Channel has been designed with performance and minimal network footprint in mind. If you need to access Sonic from an unsupported programming language, you can either open an issue or look at the reference node-sonic-channel implementation and build it in your target programming language.
  • Hardware limits: Sonic performs the search on the file-system directly; ie. it does not fit the index in RAM. A search query results in a lot of random accesses on the disk, which means that it will be quite slow on old-school HDDs and super-fast on newer SSDs. Do store the Sonic database on SSD-backed file systems only.
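Two of the limits above are easy to illustrate: the identifier space of a 32-bit IID, and the sliding window that keeps only the N most recently pushed objects per word. The sketch below is an illustration only, not Sonic's implementation; `WordWindow` is a hypothetical class built on a bounded deque, and the window width of 3 is an arbitrary value chosen for the example.

```python
from collections import deque
from typing import Deque, List

# A 32-bit identifier can take 2**32 distinct values (~4.2 billion).
MAX_IIDS = 2 ** 32
print(MAX_IIDS)  # 4294967296


class WordWindow:
    """Keep only the N most recently pushed object IDs for one word."""

    def __init__(self, retain: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._objects: Deque[int] = deque(maxlen=retain)

    def push(self, iid: int) -> None:
        self._objects.append(iid)

    def results(self) -> List[int]:
        # Most recently pushed first.
        return list(reversed(self._objects))


window = WordWindow(retain=3)
for iid in range(1, 6):
    window.push(iid)
print(window.results())  # [5, 4, 3]
```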

🔥 Report A Vulnerability

If you find a vulnerability in Sonic, you are more than welcome to report it directly to @valeriansaliou by sending an encrypted email to [email protected]. Do not report vulnerabilities in public GitHub issues, as they may be exploited by malicious people to target production servers running an unpatched Sonic instance.

⚠️ You must encrypt your email using @valeriansaliou's GPG public key: 🔑 valeriansaliou.gpg.pub.asc.

Comments
  • Implement Sonic Channel libraries for popular programming languages

    Sonic cannot be integrated to an existing application code without a Sonic Channel library made for the used programming language.

    Currently, a NodeJS library is available node-sonic-channel, but that's all!

    We'd need to support more programming languages with clean implementations of the Sonic Channel protocol as specified in: PROTOCOL.

    Programming languages:

    • [x] Rust
    • [x] JavaScript
    • [x] Go
    • [x] Java
    • [x] Python
    • [x] Ruby
    • [x] PHP
    • [ ] (any other language?)
    help wanted 
    opened by valeriansaliou 27
  • Slow Push (~1000 docs/minute) - How could I increase push speed?

    I've looked at issues #230 & #132, but they have not increased speed by much. I am using the docker server, with the python library to evaluate information retrieval of a DBpedia dataset, and have a collection of 12 million documents to push, at the current speed this is not possible and I expected Sonic to perform faster.

    What I've tried:

    • smaller size of input documents
    • add language tag
    • retain_word_objects = 500
    • max_files = 10000
    • give docker more processing power in settings (seems to be mostly unused)

    Any help is very much appreciated!!

    opened by philiure 13
  • Is it possible to remove stopwords from query terms before the actual search?

    Hi @valeriansaliou, I made some tests in Portuguese (Brazil) and in English (US) and found that if you ingest a text like: "this is the report from our last weeks meeting" A query for "last weeks meeting" would return empty (because it includes a stopword) but a query for "report weeks meeting" would return the object. If we remove the stop words before querying, "last weeks meeting" would also return the object.

    What would be the performance implications for this change? Other option, would be to add a removeStopwords + lang function into the Sonic Channel NPM module and make the query cleanup process in there. Any thoughts about this?
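The client-side cleanup suggested above could look like the sketch below. It is an illustration only: `strip_stopwords` is a hypothetical helper, and the stop-word set is a tiny hand-picked sample chosen to match the example sentence, not Sonic's actual English list.

```python
from typing import FrozenSet

# Hypothetical sample; a real client would need the full list per language.
STOPWORDS_ENG: FrozenSet[str] = frozenset(
    {"this", "is", "the", "from", "our", "last"}
)


def strip_stopwords(query: str, stopwords: FrozenSet[str] = STOPWORDS_ENG) -> str:
    """Remove stop words from a query before sending it to the search backend."""
    kept = [word for word in query.split() if word.lower() not in stopwords]
    return " ".join(kept)


print(strip_stopwords("last weeks meeting"))  # weeks meeting
```

Doing this in the client requires knowing the query language up front, which is the reason a `lang` argument is mentioned in the proposal.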

    Thanks!

    bug 
    opened by andersonsantos 10
  • Not finding queries expected to be found

    I'm basically indexing documents like these:

    [
      {"title": "The Audio Programming Book", "isbn": "0-12-394595-X", "id": "91c1729fad074ed3a141798013ce6842"},
      {"title": "Starting Out with C++",  "isbn": "0-...", "id": "144a38277bb847258f9ba9b8002c7ee2"},
      {"title": "Information Visualization",  "isbn": "0-...", "id": "f5d8518942534a0e91e845489b628600"},
    ]
    

    Queries such as "the audio" does not return the first document's id. The query "audio", however, does. Likewise "C++" does not find the second document's id. Searching for "Information" yields nothing. Neither does "Information visualization". Solely "visualization", does yield the expected outcome. Searching for any of the "isbn" values yields the expected result.

    My configuration file:

    # Sonic
    # Fast, lightweight and schema-less search backend
    # Configuration file
    # Example: https://github.com/valeriansaliou/sonic/blob/master/config.cfg
    
    
    [server]
    
    log_level = "debug"
    
    
    [channel]
    
    inet = "0.0.0.0:1491"
    tcp_timeout = 300
    
    #auth_password = ""
    
    [channel.search]
    
    query_limit_default = 10
    query_limit_maximum = 100
    query_alternates_try = 4
    
    suggest_limit_default = 5
    suggest_limit_maximum = 20
    
    
    [store]
    
    [store.kv]
    
    path = "./data/store/kv/"
    
    retain_word_objects = 1000
    
    [store.kv.pool]
    
    inactive_after = 1800
    
    [store.kv.database]
    
    compress = true
    parallelism = 2
    max_files = 100
    max_compactions = 1
    max_flushes = 1
    
    [store.fst]
    
    path = "./data/store/fst/"
    
    [store.fst.pool]
    
    inactive_after = 300
    
    [store.fst.graph]
    
    consolidate_after = 180
    

    Version: valeriansaliou/sonic:v1.1.9 (docker)

    The gist of the ingest code:

    ingestChannel.flusho('books', 'default', book.id);
    ingestChannel.push('books', 'default', book.id, book.title);
    ingestChannel.push('books', 'default', book.id, book.isbn);
    

    What could be the cause?

    Also, are there any side effects of keeping traditionally numerical keys such as the ISBN in the same index as their titles?

    opened by AlexGustafsson 9
  • Install (ubuntu) fail (cargo install sonic-server)

    ubuntu@ip-10-0-5-62:~$ cargo install sonic-server

    Compiling librocksdb-sys v5.18.3
    error: failed to compile sonic-server v1.2.0, intermediate artifacts can be found at /tmp/cargo-installA6m7Zx

    Caused by:
      failed to run custom build command for librocksdb-sys v5.18.3
      process didn't exit successfully: /tmp/cargo-installA6m7Zx/release/build/librocksdb-sys-c9acc21d627f997d/build-script-build (exit code: 101)
    --- stdout
    cargo:rerun-if-changed=build.rs
    cargo:rerun-if-changed=rocksdb/
    cargo:rerun-if-changed=snappy/
    cargo:rerun-if-changed=lz4/
    cargo:rerun-if-changed=zstd/
    cargo:rerun-if-changed=zlib/
    cargo:rerun-if-changed=bzip2/

    --- stderr
    thread 'main' panicked at 'Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-.so', 'libclang.so.'], set the LIBCLANG_PATH environment variable to a path where one of these files can be found (invalid: [])"', src/libcore/result.rs:997:5
    note: Run with RUST_BACKTRACE=1 environment variable to display a backtrace.

    opened by retf 9
  • sonic fails to retrieve results on very simple queries

    I found this thing after configuring and running sonic for the first time,

    Using telnet, I manually push the following data:

    PUSH messages default id_1 "Some sample text number one"
    PUSH messages default id_2 "Some sample text number two"
    PUSH messages default id_3 "Some sample text number three"
    PUSH messages default id_4 "Some sample text number four"
    

    Then I consolidate the index, just to make sure: TRIGGER consolidate

    On all these commands (using their respective channel), I get an OK reply from the sonic process.

    So now, when I run some sample queries, I get results on a few words but not on others:

    • QUERY messages default "Some" gives back no results, which is wrong ❌
    • QUERY messages default "sample" gives back id_4 id_3 id_2 id_1, which is correct ✅
    • QUERY messages default "text" gives back no results, which is wrong ❌
    • QUERY messages default "number" gives back no results, which is wrong ❌
    • QUERY messages default "one" gives back no results, which is wrong ❌
    • QUERY messages default "two" gives back no results, which is wrong ❌
    • QUERY messages default "three" gives back no results, which is wrong ❌
    • QUERY messages default "four" gives back no results, which is wrong ❌

    What is going on? Is this a bug or am I doing something wrong?

    Thanks.

    opened by almosnow 8
  • Docker image errors

    Hi,

    Trying to run v1.3.0 in docker and getting an error:

    Command:

    docker run -p 1491:1491 -v /usr/local/sonic/config.cfg:/etc/sonic.cfg -v /usr/local/sonic/store/:/var/lib/sonic/store/ --rm valeriansaliou/sonic:v1.3.0
    
    sonic: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by sonic)
    

    Checking env in the image:

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    HOSTNAME=3c32b11eb74b
    HOME=/root
    

    Seems like maybe lib64 should be added to PATH as suggested here: https://stackoverflow.com/questions/20357033/usr-lib-x86-64-linux-gnu-libstdc-so-6-version-cxxabi-1-3-8-not-found

    bug 
    opened by v0l 7
  • Windows compatibility

    Sonic currently does not run on Windows. @Git0Shuai has made a fix that seemingly allows Sonic to run on Windows, see: https://github.com/Git0Shuai/sonic/commit/a268f7e6c13536eb61c34ab2d917e5ddbd7289ac

    enhancement 
    opened by valeriansaliou 7
  • Terminal crashes when pushing to sonic

    I am pushing a dataset of 12M documents to Sonic, but the terminal crashes due to memory issues at 2% of the push. I am running a Rust Server in release mode, and wonder why the terminal keeps crashing. Any advice on solving these memory issues is welcome. The size of the dictionary object containing the documents is 0.67 GB.

    Activity Monitor indicates 70+GB of memory use for terminal upon crashing.

    opened by philiure 6
  • OSError: Sonic response are wrong. Please write issue to github.

    Hello. I'm not sure if this is an issue with the client library or the Sonic server, or if my use case is not suitable for Sonic.

    I'm trying to push a long string to Sonic, and I receive an "OSError: Sonic response are wrong. Please write issue to github." error.

    
    from sonic import IngestChannel, SearchChannel, ControlChannel
    ingestcl = IngestChannel("localhost:1491", "SecretPassword")
    print(ingestcl.ping())
    longstring = """ PRÈFET

    DE LA
    CHARENTE-
    MARITIME

    Liberté
    Egalité
    Fraternité

    Arrêté n°2024/DS)DEN/SDTES /o1d
    en date du 14 oetobee 2021

    portant nomination de la déléguée départementale à la vie associative de la
    Charente-Maritime

    Le préfet de la Charente-Maritime,
    Chevalier de la Légion d'honneur
    Officier de l'Ordre national du Mérite

    Vu le décret n° 2020-1542 du 9 décembre 2020 relatif aux compétences des autorités
    académiques dans le domaine des politiques de la jeunesse, de l'éducation populaire, de la
    vie associative, de l'engagement civique et des sports et à l'organisation des services chargés
    de leur mise en œuvre :

    Vu la circulaire du Premier ministre N° 425/SG du 28 juillet 1995 instituant la création d'un
    délégué départemental à la vie associative ;

    Vu la lettre du haut-commissaire à la jeunesse en date du 8 février 2010 relative à la
    désignation des délégués départementaux à la vie associative :

    Vu la circulaire du Premier ministre N° 5811/SG du 29 septembre 2015 relative aux nouvelles
    relations entre les pouvoirs publics et les associations ;

    Sur proposition de la directrice académique des services de l'Éducation nationale de la
    Charente-Maritime :

    ARRÊTE :

    Article 1 : Madame Patricia BRESSANGE est nommée déléguée départementale à la vie
    associative (DDVA) de la Charente-Maritime à compter de la date de publication du présent
    arrêté.

    Article 2 : Les missions de la déléguée départementale à la vie associative de la Charente-
    Maritime sont notamment les suivantes :
    + Identifier les centres de ressources à la vie associative privés et publics membres ou
    non de fédérations, Unions ou réseaux associatifs :

    PREFECTURE DE LA CHARENTE-MARITIME -17-2021-10-14-00007 - Arrêté n° 2021/DSDES/SDJES/01 en date du 14 octobre 2021 portant 57
    nomination de la déléguée départementale à la vie associative de la Charente-Maritime
    
    """
    ingestcl.push("raaa", "pages" , "3ec9cb35-b897-4197-9320-d654cb36b55b",longstring)
    
    
    

    My use case is to run quick searches in long strings that come from a document OCR'd with Tesseract.

    I left the config file as suggested in the examples.

    Could you help me with this?

    opened by ricou84 6
  • Having trouble building Sonic - librocksdb-sys v6.7.4

    I read the other issue which was closed, which was using an older version of librocksdb-sys

    #140

    My current versions are:

    • g++-8
    • llvm-6.0-dev
    • libclang-6.0-dev
    • clang-6.0

    the build error I am getting is:

    Compiling librocksdb-sys v6.7.4
    error: failed to compile `sonic-server v1.3.0`, intermediate artifacts can be found at `/tmp/cargo-installkg4g3f`
    
    Caused by:
      failed to run custom build command for `librocksdb-sys v6.7.4`
    
    Caused by:
      process didn't exit successfully: `/tmp/cargo-installkg4g3f/release/build/librocksdb-sys-962ad4425b995fcb/build-script-build` (exit code: 1)
    
    error occurred: Command "c++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-I" "rocksdb/include/" "-I" "rocksdb/" "-I" "rocksdb/third-party/gtest-1.8.1/fused-src/" "-I" "snappy/" "-I" "lz4/lib/" "-I" "zstd/lib/" "-I" "zstd/lib/dictBuilder/" "-I" "zlib/" "-I" "bzip2/" "-I" "." "-Wall" "-Wextra" "-std=c++11" "-Wno-unused-parameter" "-msse2" "-msse4.1" "-msse4.2" "-mpclmul" "-DSNAPPY=1" "-DLZ4=1" "-DZSTD=1" "-DZLIB=1" "-DBZIP2=1" "-DNDEBUG=1" "-DHAVE_PCLMUL=1" "-DHAVE_SSE42=1" "-DOS_LINUX=1" "-DROCKSDB_PLATFORM_POSIX=1" "-DROCKSDB_LIB_IO_POSIX=1" "-o" "/tmp/cargo-installkg4g3f/release/build/librocksdb-sys-74cbf95678211d1b/out/rocksdb/db/arena_wrapped_db_iter.o" "-c" "rocksdb/db/arena_wrapped_db_iter.cc" with args "c++" did not execute successfully (status code exit code: 1).
    

    is there some other dependency that I could be missing?

    question 
    opened by ArcticFaded 6
  • can not push chinese

    "zho" is language codes chinese in iso 639-3

    PUSH messages default 1 "你好" LANG(zho)
    ERR invalid_meta_value(LANG[zho])
    
    PUSH messages default 2 "Hello" LANG(eng)
    OK
    

    lexing system doesn't work when ignoring argument "LANG"

    how to use it correctly

    opened by HollisMeynell 1
  • Add configurable stopwords feature

    resolve #300

    This pr adds a feature that allows users to override the predefined stopwords of Sonic.

    The configuration file can include this to configure Sonic's stopwords only to foo and bar.

    [channel.search.stopwords]
    eng=["foo", "bar"]
    

    I also believe the pr could help #254 and #266 too because you can do this to disable stopwords completely.

    [channel.search.stopwords]
    eng=[]
    
    opened by yukiomoto 4
  • buffer size overflow error while ingesting large size document.

    this is the error which i can see on sonic server side.

    (ERROR) - closing channel thread because of buffer overflow
    thread 'sonic-channel-client' panicked at 'buffer overflow (24168/20002 bytes)', /home/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/sonic-server-1.3.5/src/channel/handle.rs:149:29
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    (DEBUG) - running a tasker tick...
    

    can someone guide me how large document/ text i can put in one go and what are the config vars which i can tweak to make this working . TIA
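One client-side workaround is to split a long text into several smaller pushes for the same object. The sketch below splits on whitespace so that each chunk stays under a byte budget. The function name and the 16,000-byte default are assumptions: the log above shows a limit of 20,002 bytes for this server, and the budget must leave headroom for the rest of the command line and for any escaping.

```python
from typing import List


def split_for_push(text: str, max_bytes: int = 16_000) -> List[str]:
    """Split text on whitespace into chunks of at most max_bytes (UTF-8)."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        if word_size > max_bytes:
            raise ValueError("a single word exceeds max_bytes")
        # One extra byte for the joining space when the chunk is not empty.
        extra = word_size + (1 if current else 0)
        if current_size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Each chunk would then be pushed separately under the same object identifier.
sample = "délégué départemental " * 2000
print(len(split_for_push(sample, max_bytes=1000)))
```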

    opened by tahseen2k8 4
  • Search "terms"? Searching for multiple words?

    In your protocol document you mention placing "search terms" in a protocol request. Those terms seem to be words separated by spaces. Can you clarify the meaning of "terms"? (plural?)

    For example, suppose I have documents that mention "rubber ducks", "rubber chickens" and "rubber tires", as well as documents that mention just "chickens" and others that mention "synthetic rubber". If I search for "rubber chickens" what should I expect to see? All those documents? Just the exact matches?

    Thank you for clarifying.

    opened by OllieJones 1
  • User defined stopwords

    Hi,

    It is really great that Sonic has built-in stopword lists but this time I want to apply my own stopword list.

    I think it would be nice if you could override the build-in stopword lists through the config file or something so Sonic could fit more users' needs. I looked at CONFIGURATION.md but couldn't find a way to do that.

    Do you think Sonic could have this feature ?

    opened by yukiomoto 1
Releases(v1.4.0)
  • v1.4.0(Oct 20, 2022)

    • Fixed typo in README abstract [@remram44, #295].
    • Fixed typos in code and documentation [@kianmeng, #294].
    • Replaced Docker source image from Debian Slim to lighter Google distroless image [@0x0x1, #282].
    • Added an index enumeration LIST command to Sonic Channel [@trkohler, #293].
  • v1.3.5(Jul 10, 2022)

  • v1.3.4(Jul 10, 2022)

  • v1.3.3(Jul 7, 2022)

    • Dependencies have been bumped to latest versions (namely: hashbrown, whatlang, regex) [@valeriansaliou].
    • Moved the release pipeline to GitHub Actions [@valeriansaliou].
    • The language detection system is now about 2x faster (due to the upgrade of whatlang past v0.14.0) [@valeriansaliou].
    • Added Armenian stopwords [@valeriansaliou].
    • Added Georgian stopwords [@valeriansaliou].
    • Added Gujarati stopwords [@valeriansaliou].
    • Added Tagalog stopwords [@valeriansaliou].
  • v1.3.2(Nov 9, 2021)

    • Fixed Norwegian stopwords [@valeriansaliou, #239].
    • Code has been formatted according to clippy recommendations. This does not change the way Sonic behaves [@pleshevskiy, #233].
    • Added support for Chinese word segmentation in tokenizer (note that as this adds quite some size overhead to the final binary size, the feature tokenizer-chinese can be disabled when building Sonic) [@vincascm, #209].
  • v1.3.1(Nov 3, 2021)

    • Apple Silicon is now supported [@valeriansaliou].
    • Added Norwegian stopwords [@mikalv, #236].
    • Added Catalan stopwords [@coopanio, #227].
    • Dependencies have been bumped to latest versions (namely: rocksdb, fst-levenshtein, fst-regex, hashbrown, whatlang, byteorder, rand) [@valeriansaliou].
    • A few rarely-used languages have been removed, following whatlang v0.12.0 release, see the notes here [@valeriansaliou, 940d3c3].
  • v1.3.0(Jun 27, 2020)

    • Added support for Slovak, which is now auto-detected from terms [@valeriansaliou, 19412ce].
    • Added Slovak stopwords [@valeriansaliou, 19412ce].
    • Dependencies have been bumped to latest versions (namely: whatlang) [@valeriansaliou, 19412ce].
  • v1.2.4(Jun 25, 2020)

    • Fixed multiple deadlocks, which were not noticed in practice by running Sonic at scale, but that are still theoretically possible [@BurtonQin, #213, #211].
    • Added support for Latin, which is now auto-detected from terms [@valeriansaliou, e6c5621].
    • Added Latin stopwords [@valeriansaliou, e6c5621].
    • Dependencies have been bumped to latest versions (namely: rocksdb, radix, hashbrown, whatlang) [@valeriansaliou].
    • Added a release script, with cross-compilation capabilities (currently for the x86_64 architecture, dynamically linked against GNU libraries) [@valeriansaliou, 961bab9].
  • v1.2.3(Oct 14, 2019)

    • RocksDB compression algorithm has been changed from LZ4 to Zstandard, for a slightly better compression ratio, and much better read/write performance; this will be used for new SST files only [@valeriansaliou, cd4cdfb].
    • Dependencies have been bumped to latest versions (namely: rocksdb) [@valeriansaliou, cd4cdfb].
  • v1.2.2(Jul 12, 2019)

    • Fixed a regression on optional configuration values not working anymore, due to an issue in the environment variable reading system introduced in v1.2.1 [@valeriansaliou, #155].
    • Optimized some aspects of FST consolidation and pending operations management [@valeriansaliou, #156].
  • v1.2.1(Jul 8, 2019)

    • FST graph consolidation is now able to ignore new words when the graph is over configured limits, which are set with the new store.fst.graph.max_size and store.fst.graph.max_words configuration variables [@valeriansaliou, 53db9c1].
    • An integration testing infrastructure has been added to the Sonic automated test suite [@vilunov, #154].
    • Configuration values can now be sourced from environment variables, using the ${env.VARIABLE} syntax in config.cfg [@perzanko, #148].
    • Dependencies have been bumped to latest versions (namely: rand, radix and hashbrown) [@valeriansaliou, c1b1f54].
  • v1.2.0(May 3, 2019)

    • Fixed a rare deadlock occurring when 3 concurrent operations get executed on different threads for the same collection, in the following chronological order: PUSH then FLUSHB then PUSH [@valeriansaliou, d96546bd9d8b79332df1106766377e4a4acebd50].
    • Reworked the KV store manager to perform periodic memory flushes to disk, thus reducing startup time [@valeriansaliou, 6713488af3543bca33be6e772936f9668430ba86].
    • Stop accepting Sonic Channel commands when shutting down Sonic [@valeriansaliou, #131].
    • Introduced a server statistics INFO command to Sonic Channel [@valeriansaliou, #70].
    • Added the ability to disable the lexer for a command with the command modifier LANG(none) [@valeriansaliou, #108].
    • Added a backup and restore system for both KV and FST stores, which can be triggered over Sonic Channel with TRIGGER backup and TRIGGER restore [@valeriansaliou, #5].
    • Added the ability to disable KV store WAL (Write-Ahead Log) with the write_ahead_log option, which helps limit write wear on heavily loaded SSD-backed servers [@valeriansaliou, #130].
  • v1.1.9(Mar 29, 2019)

    • RocksDB has been bumped to v5.18.3, which fixes a dead-lock occurring in RocksDB at scale when a compaction task is run under heavy disk writes (i.e. disk flushes). This dead-lock was causing Sonic to stop responding to any command issued for the frozen collection. This dead-lock was due to a bug in RocksDB internals (not originating from Sonic itself) [@baptistejamin, 19c4a104a6d6aaed1dd9beb2e51d2639627825cd].
    • Reworked the FLUSHB command internals, which now use the atomic delete_range() operation provided by RocksDB v5.18 [@valeriansaliou, 660f8b714d968400fb9f88a245752dca02249bf7].
    • Added the LANG(<locale>) command modifier for QUERY and PUSH, that lets a Sonic Channel client force a text locale (instead of letting the lexer system guess the text language) [@valeriansaliou, #75].
    • The FST word lookup system, used by the SUGGEST command, now supports all scripts via a restricted Unicode range forward scan [@valeriansaliou, #64].
  • v1.1.8(Mar 27, 2019)

    • A store acquire lock has been added to prevent 2 concurrent threads from opening the same collection at the same time [@valeriansaliou, 2628077ebe7e24155975962471e7653745a0add7].
  • v1.1.7(Mar 27, 2019)

    • A superfluous mutex was removed from KV and FST store managers, in an attempt to solve a rare dead-lock occurring on high-traffic Sonic setups in the KV store [@valeriansaliou, 60566d2f087fd6725dba4a60c3c5a3fef7e8399b].
  • v1.1.6(Mar 27, 2019)

    • Reverted changes made in v1.1.5 regarding the open files rlimit, as this can be set from outside Sonic [@valeriansaliou, f6400c61a9a956130ae0bdaa9a164f4955cd2a18].
    • Added Chinese Traditional stopwords [@dsewnr, #87].
    • Improved the way database locking is handled when calling a pool janitor; this prevents potential dead-locks under high load [@valeriansaliou, fa783728fd27a116b8dcf9a7180740d204b69aa4].
  • v1.1.5(Mar 27, 2019)

  • v1.1.4(Mar 27, 2019)

    • Automatically adjust rlimit for the process to the hard limit allowed by the system (allows opening more FSTs in parallel) [@valeriansaliou].
    • Added Kannada stopwords [@dileepbapat].
    • The Docker image is now much lighter [@codeflows].
  • v1.1.3(Mar 25, 2019)

    • Rework Sonic Channel buffer management using a VecDeque (Sonic should now work better in harsh network environments) [@valeriansaliou, 1c2b9c8fcd28b033a7cb80d678c388ce78ab989d].
    • Limit the size of words that can hit against the FST graph, as the FST gets slower for long words [@valeriansaliou, #81].
  • v1.1.2(Mar 24, 2019)

    • FST graph consolidation locking strategy has been improved even further, based on issues with the previous rework that we noticed at scale in production (consolidation locking is now done at a lower priority relative to actual queries and pushes to the index) [@valeriansaliou, #68].
  • v1.1.1(Mar 24, 2019)

    • FST graph consolidation locking strategy has been reworked so as to allow queries to be executed lock-free when the FST consolidate task takes a lot of time (previously, queries were being deferred due to an ongoing FST consolidate task) [@valeriansaliou, #68].
    • Removed the special license clause introduced in v1.0.2; Sonic is now fully MPL 2.0 [@valeriansaliou].
  • v1.1.0(Mar 21, 2019)

    • Breaking: Change how buckets are stored in a KV-based collection (nest them in the same RocksDB database; this is much more efficient on setups with a large number of buckets - v1.1.0 is incompatible with the v1.0.0 KV database format) [@valeriansaliou].
    • Bump jemallocator to version 0.3 [@valeriansaliou].
  • v1.0.2(Mar 20, 2019)

  • v1.0.1(Mar 19, 2019)

    • Added automated benchmarks (can be run via cargo bench --features benchmark) [@valeriansaliou].
    • Reduced the time to query the search index by 50% via optimizations (in multiple methods, e.g. the lexer) [@valeriansaliou].
  • v1.0.0(Mar 18, 2019)
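
The v1.1.9 and v1.2.0 notes above introduce the LANG(<locale>) modifier for QUERY and PUSH, and LANG(none) to disable the lexer. As a rough illustration, the sketch below builds such command lines as plain strings. The helper names are hypothetical. The argument order (collection, bucket, object, quoted text, then modifiers) and the quoting rules are assumptions based on the Sonic Channel protocol document and should be checked against it. The collection, bucket and object values are made up.

```python
from typing import Optional


def quote_text(text: str) -> str:
    """Wrap text in double quotes for a command line.

    Assumption: backslashes and double quotes are backslash-escaped, and
    newlines are flattened to spaces so the command stays on one line.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped.replace("\n", " ") + '"'


def push_command(collection: str, bucket: str, obj: str, text: str,
                 lang: Optional[str] = None) -> str:
    """Build a PUSH line, optionally forcing a locale with LANG(<locale>)."""
    parts = ["PUSH", collection, bucket, obj, quote_text(text)]
    if lang is not None:
        parts.append(f"LANG({lang})")
    return " ".join(parts)


def query_command(collection: str, bucket: str, terms: str,
                  lang: Optional[str] = None) -> str:
    """Build a QUERY line, optionally forcing a locale with LANG(<locale>)."""
    parts = ["QUERY", collection, bucket, quote_text(terms)]
    if lang is not None:
        parts.append(f"LANG({lang})")
    return " ".join(parts)


if __name__ == "__main__":
    # LANG(none) disables the lexer for this command, per the v1.2.0 notes.
    print(push_command("messages", "user:1", "conv:42", "rubber chickens", lang="none"))
    print(query_command("messages", "user:1", "rubber chickens"))
```

Keeping the builders separate from the network code makes it easy to inspect the exact line before it is written to the socket.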

Owner
Valerian Saliou
Co-Founder & CTO at Crisp.
Strongly typed Elasticsearch DSL written in Rust

Strongly typed Elasticsearch DSL written in Rust This is an unofficial library and doesn't yet support all the DSL, it's still work in progress. Featu

null 173 Jan 6, 2023
🔍 TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites.

tinysearch TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites. TinySearch is written in Rust, and then com

null 2.2k Dec 31, 2022
Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

shogun_search Learning the principle of search engine. This is the first time I've written Rust. A search engine written in Rust. Current Features: Bu

Yuxiang Liu 5 Mar 9, 2022
A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

simsearch A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here). Documentation Usage Add the f

Andy Lok 116 Dec 10, 2022
A lightweight full-text search library that provides full control over the scoring calculations

probly-search · A full-text search library, optimized for insertion speed, that provides full control over the scoring calculations. This start initia

Quantleaf 20 Nov 26, 2022
Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

MeiliSearch Website | Roadmap | Blog | LinkedIn | Twitter | Documentation | FAQ ⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine M

MeiliSearch 31.6k Dec 31, 2022
🔎 Impossibly fast web search, made for static sites.

Stork Impossibly fast web search, made for static sites. Stork is two things. First, it's an indexer: it indexes your loosely-structured content and c

James Little 2.5k Dec 27, 2022
⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

lnx 679 Jan 1, 2023
⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

lnx 0 Apr 25, 2022
weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases.

weggli Introduction weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify int

Google Project Zero 2k Jan 5, 2023
🔎 Search millions of files at lightning-fast speeds to find what you are looking for

Search millions of files at lightning-fast speeds to find what you are looking for

Shiv 22 Sep 21, 2022
High-performance log search engine.

NOTE: This project is under development, please do not depend on it yet as things may break. MinSQL MinSQL is a log search engine designed with simpli

High Performance, Kubernetes Native Object Storage 359 Nov 27, 2022
Perlin: An Efficient and Ergonomic Document Search-Engine

Table of Contents 1. Perlin Perlin Perlin is a free and open-source document search engine library build on top of perlin-core. Since the first releas

CurrySoftware GmbH 70 Dec 9, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

tantivy 7.4k Dec 28, 2022
A full-text search and indexing server written in Rust.

Bayard Bayard is a full-text search and indexing server written in Rust built on top of Tantivy that implements Raft Consensus Algorithm and gRPC. Ach

Bayard Search 1.8k Dec 26, 2022
AI-powered search engine for Rust

txtai: AI-powered search engine for Rust txtai executes machine-learning workflows to transform data and build AI-powered text indices to perform simi

NeuML 69 Jan 2, 2023
A full-text search engine in rust

Toshi A Full-Text Search Engine in Rust Please note that this is far from production ready, also Toshi is still under active development, I'm just slo

Toshi Search 3.8k Jan 7, 2023
Rapidly Search and Hunt through Windows Event Logs

Rapidly Search and Hunt through Windows Event Logs Chainsaw provides a powerful ‘first-response’ capability to quickly identify threats within Windows

F-Secure Countercept 1.8k Dec 31, 2022
Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs

localbang Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs What are "bangs"?? Bangs are a way to define where to search inside

Jakob Kruse 7 Nov 23, 2022