🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Overview

Sonic


Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried against in a microsecond's time.

Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database.

Strong attention to performance and code cleanliness was paid when designing Sonic. It aims to be crash-free, super-fast, and to put minimum strain on server resources (our measurements have shown that Sonic, when under load, responds to search queries in the µs range, eats ~30MB of RAM and has a low CPU footprint; see our benchmarks).

Tested at Rust version: rustc 1.47.0 (18bf6b4f0 2020-10-07)

🇫🇷 Crafted in Nantes, France.

📰 The Sonic project was initially announced in a post on my personal journal.

Sonic

« Sonic » is the mascot of the Sonic project. I drew it to look like a psychedelic hipster hedgehog.

Who uses it?

  • Crisp
  • Scrumpy

👋 You use Sonic and you want to be listed here? Contact me.

Demo

Sonic is integrated in all Crisp search products on the Crisp platform. It is used to index half a billion objects on a $5/mth 1-vCPU SSD cloud server (as of 2019). Crisp users use it to search in their messages, conversations, contacts, helpdesk articles and more.

You can test Sonic live on the Crisp Helpdesk and get an idea of the speed and relevance of Sonic search results. You can also test search suggestions from there: start typing at least 2 characters of a word and a full word will be suggested (press the Tab key to expand the suggestion). Both search and suggestions are powered by Sonic.

Demo on Crisp Helpdesk search

Sonic fuzzy search in helpdesk articles at its best. Look up any word or group of terms and get results instantly.

Features

  • Search terms are stored in collections, organized in buckets; you may use a single bucket, or a bucket per user on your platform if you need to search in separate indexes.
  • Search results return object identifiers that can be resolved from an external database if you need to enrich the search results. This makes Sonic a simple word index that points to identifier results. Sonic doesn't store any direct textual data in its index, but it still holds a word graph for auto-completion and typo correction.
  • Search query typos are corrected: when there are not enough exact-match results for a given word in a search query, Sonic tries to correct the word and searches against alternate words. You're allowed to make mistakes when searching.
  • Insert and remove items in the index; index-altering operations are light and can be committed to the server while it is running. A background tasker handles the job of consolidating the index so that the entries you have pushed or popped are quickly made available for search.
  • Auto-complete any word in real-time via the suggest operation. This helps build a snappy word suggestion feature in your end-user search interface.
  • Full Unicode compatibility with the 80+ most spoken languages in the world. Sonic removes useless stop words from any text (e.g. 'the' in English) after guessing the text language. This ensures that any searched or ingested text is clean before it hits the index; see languages.
  • Simple protocol (Sonic Channel) that lets you search your index, manage data ingestion (push in the index, pop from the index, flush a collection, flush a bucket, etc.) and perform administrative actions. Sonic Channel was designed to be lightweight on resources and simple to integrate with; read the protocol specification. An example session is sketched right after this list.
  • Easy-to-use libraries that let you connect to Sonic from your apps; see libraries.
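
To give a concrete feel for these operations, here is a sketch of a raw Sonic Channel session (collection, bucket and object names are placeholders, the exact response lines are illustrative, and an auth password would be appended to START if one is configured; the protocol specification is authoritative). On the ingest side, a text is pushed for an object:

    START ingest
    PUSH messages user:1 conversation:1 "Hey Valerian, how is the Sonic project going?"
    OK

On the search side, queries and suggestions come back as asynchronous events carrying object identifiers or completed words:

    START search
    QUERY messages user:1 "sonic project"
    PENDING Bt2m2gYa
    EVENT QUERY Bt2m2gYa conversation:1
    SUGGEST messages user:1 "proj"
    PENDING z98uDE0f
    EVENT SUGGEST z98uDE0f project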

How to use it?

Installation

Sonic is built in Rust. To install it, either download a version from the Sonic releases page, use cargo install or pull the source code from master.

👉 Install from source:

If you pulled the source code from Git, you can build it using cargo:

cargo build --release

You can find the built binaries in the ./target/release directory.

Install build-essential, clang, libclang-dev, libc6-dev, g++ and llvm-dev to be able to compile the required RocksDB dependency.

👉 Install from Cargo:

You can install Sonic directly with cargo install:

cargo install sonic-server

Ensure that your $PATH is properly configured to include the Cargo-installed binaries, then run Sonic using the sonic command.
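
For instance, cargo typically installs binaries under ~/.cargo/bin, so something along these lines should work (adjust paths to your own system):

    export PATH="$HOME/.cargo/bin:$PATH"
    sonic -c /path/to/config.cfg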

Install build-essential, clang, libclang-dev, libc6-dev, g++ and llvm-dev to be able to compile the required RocksDB dependency.

👉 Install from Docker Hub:

You might find it convenient to run Sonic via Docker. You can find the pre-built Sonic image on Docker Hub as valeriansaliou/sonic.

First, pull the valeriansaliou/sonic image:

docker pull valeriansaliou/sonic:v1.3.0

Then, provide it with a configuration file and run it (replace /path/to/your/sonic/config.cfg with the path to your configuration file):

docker run -p 1491:1491 -v /path/to/your/sonic/config.cfg:/etc/sonic.cfg -v /path/to/your/sonic/store/:/var/lib/sonic/store/ valeriansaliou/sonic:v1.3.0

In the configuration file, ensure that:

  • channel.inet is set to 0.0.0.0:1491 (this lets Sonic be reached from outside the container)
  • store.kv.path is set to /var/lib/sonic/store/kv/ (this lets the external KV store directory be reached by Sonic)
  • store.fst.path is set to /var/lib/sonic/store/fst/ (this lets the external FST store directory be reached by Sonic)

Sonic will be reachable from tcp://localhost:1491.
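
For reference, the relevant part of such a config.cfg could look like the minimal excerpt below (a sketch only, using the values listed above; all other options are omitted here and should be taken from the sample configuration file):

    [channel]

    inet = "0.0.0.0:1491"

    [store]

    [store.kv]

    path = "/var/lib/sonic/store/kv/"

    [store.fst]

    path = "/var/lib/sonic/store/fst/"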

👉 Install from another source (non-official):

Other installation sources are available.

Note that those sources are non-official, meaning that they are not owned nor maintained by the Sonic project owners. The latest Sonic version available on those sources might be outdated, in comparison to the latest version available through the Sonic project.

Configuration

Use the sample config.cfg configuration file and adjust it to your own environment.

If you are looking to fine-tune your configuration, you may read our detailed configuration documentation.

Run Sonic

Sonic can be run as such:

./sonic -c /path/to/config.cfg

Perform searches and manage objects

Both searches and object management (i.e. data ingestion) are handled via the Sonic Channel protocol only. As we want to keep things simple with Sonic (similarly to how Redis does it), Sonic does not offer an HTTP endpoint or similar; connecting via Sonic Channel is the way to go when you need to interact with the Sonic search database.

Sonic distributes official libraries that let you integrate Sonic into your apps easily. Click on a library below to see its integration documentation and code.

If you are looking for details on the raw Sonic Channel TCP-based protocol, you can read our detailed protocol documentation. It can prove handy if you are looking to code your own Sonic Channel library.

📦 Sonic Channel Libraries

1️⃣ Official Libraries

Sonic distributes official Sonic integration libraries for your programming language (official means that those libraries have been reviewed and validated by a core maintainer).
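
As an illustration, a minimal search through the official Node.js library (node-sonic-channel) might look like the sketch below; connection options, collection and bucket names are placeholders, and the exact API should be checked against the library's own documentation:

    var SonicChannelSearch = require("sonic-channel").Search;

    var sonicChannelSearch = new SonicChannelSearch({
      host: "::1",            // Sonic server host (placeholder)
      port: 1491,             // Sonic Channel port
      auth: "SecretPassword"  // auth_password from config.cfg, if any
    }).connect({
      connected: function() {
        // The search channel is ready: look up object identifiers for the terms
        sonicChannelSearch.query("messages", "default", "valerian saliou")
          .then(function(results) {
            // `results` holds object identifiers, to be resolved against your own database
            console.log(results);
          })
          .catch(function(error) {
            console.error(error);
          });
      },

      error: function(error) {
        console.error("Sonic Channel connection failed", error);
      }
    });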

2️⃣ Community Libraries

A list of Sonic integrations provided by the community is also available (many thanks to them!).

ℹ️ Cannot find the library for your programming language? Build your own and be referenced here! (contact me)

Which text languages are supported?

Sonic supports a wide range of languages in its lexing system. If a language is not in this list, you will still be able to push text in that language to the search index, but stop words will not be removed from it, which could lead to lower-quality search results.

The languages supported by the lexing system are:

  • Afrikaans
  • Arabic
  • Azerbaijani
  • Bengali
  • Bulgarian
  • Burmese
  • Catalan
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Esperanto
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Hausa
  • Hebrew
  • Hindi
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Kannada
  • Khmer
  • Korean
  • Kurdish
  • Latin
  • Latvian
  • Lithuanian
  • Marathi
  • Nepali
  • Persian
  • Polish
  • Portuguese
  • Punjabi
  • Russian
  • Slovak
  • Slovene
  • Somali
  • Spanish
  • Swedish
  • Tagalog
  • Tamil
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Vietnamese
  • Yiddish
  • Yoruba
  • Zulu

How fast & lightweight is it?

Sonic was built for Crisp from the start. As Crisp was growing and indexing more and more search data into a full-text search SQL database, we decided it was time to switch to a proper search backend system. When reviewing Elasticsearch (ELS) and others, we found those were full-featured heavyweight systems that did not scale well with Crisp's freemium-based cost structure.

In the end, we decided to build our own search backend, designed to be simple and lightweight on resources.

You can run function-level benchmarks with the command: cargo bench --features benchmark

👩‍🔬 Benchmark #1

➡️ Scenario

We extracted all messages from the Crisp team account, used for Crisp's own customer support.

We want to import all those messages into a clean Sonic instance, and then perform searches on the index we built. We will measure the time that Sonic spends executing each operation (i.e. each PUSH and QUERY command over Sonic Channel), and group results per 1,000 operations (this outputs a mean time per 1,000 operations).

➡️ Context

Our benchmark was run on the following computer:

  • Device: MacBook Pro (Retina, 15-inch, Mid 2014)
  • OS: macOS 10.14.3
  • Disk: 512GB SSD (formatted under the APFS file system)
  • CPU: 2.5 GHz Intel Core i7
  • RAM: 16 GB 1600 MHz DDR3

Sonic was compiled as follows:

  • Sonic version: 1.0.1
  • Rustc version: rustc 1.35.0-nightly (719b0d984 2019-03-13)
  • Compiler flags: release profile (-O3 with LTO)

Our dataset is as follows:

  • Number of objects: ~1,000,000 messages
  • Total size: ~100MB of raw message text (this does not account for identifiers and other metas)

➡️ Scripts

The scripts we used to perform the benchmark are:

  1. PUSH script: sonic-benchmark_batch-push.js
  2. QUERY script: sonic-benchmark_batch-query.js

⏬ Results

Our findings:

  • We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);
  • Once imported, the search index weighs 20MB (KV) + 1.4MB (FST) on disk;
  • CPU usage during import averaged 75% of a single CPU core;
  • RAM usage for the Sonic process peaked at 28MB during our benchmark;
  • We used a single Sonic Channel TCP connection, which limits the import to a single thread (we could have load-balanced this across as many Sonic Channel connections as there are CPUs);
  • We get an import RPS approaching 4,000 operations per second (per thread);
  • We get a search query RPS approaching 1,000 operations per second (per thread);
  • On the hyper-threaded 4-core CPU used, we could have parallelized operations over 8 virtual cores, thus theoretically increasing the import RPS to 32,000 operations per second, while the search query RPS would increase to 8,000 operations per second (we may become SSD-bound at some point, though);

Compared results per operation (on a single object):

We took a sample of 8 results from our batched operations, which produced a total of 1,000 results (1,000,000 items, with 1,000 items batched per measurement report).

This is not very scientific, but it should give you a clear idea of Sonic's performance.

Time spent per operation:

Operation   Average   Best      Worst
PUSH        275µs     190µs     363µs
QUERY       880µs     852µs     1ms

Batch PUSH results as seen from our terminal (from initial index of: 0 objects):

Batch PUSH benchmark

Batch QUERY results as seen from our terminal (on index of: 1,000,000 objects):

Batch QUERY benchmark

Limitations

  • Indexed data limits: Sonic is designed for large search indexes split over thousands of search buckets per collection. An IID (i.e. Internal-ID) is stored in the index as a 32-bit number, which theoretically allows up to ~4.2 billion objects (i.e. OIDs) to be indexed per bucket. Compared to Sonic using 64-bit IIDs, we've observed storage savings of 30% to 40%, which justifies the trade-off on large databases. Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding-window fashion (the sliding window width can be configured).
  • Search query limits: Sonic's Natural Language Processing (NLP) system does not work at the sentence level, for storage compactness reasons (we keep the FST graph shallow to reduce time and space complexity). It works at the word level, and is thus able to search per word and to predict a word based on user input, though it is unable to predict the next word in a sentence.
  • Real-time limits: the FST needs to be rebuilt every time a word is pushed or popped from the bucket graph. As this is quite heavy, Sonic batches rebuild cycles. If you have just pushed a new word to the index and it does not show up in the SUGGEST command yet, wait for the next rebuild cycle to kick in, or force it with TRIGGER consolidate in a control channel (a minimal example follows this list).
  • Interoperability limits: the Sonic Channel protocol is the only way to read and write search entries to the Sonic search index. Sonic does not expose any HTTP API. Sonic Channel has been designed with performance and minimal network footprint in mind. If you need to access Sonic from an unsupported programming language, you can either open an issue or look at the reference node-sonic-channel implementation and port it to your target programming language.
  • Hardware limits: Sonic performs searches on the file system directly; i.e. it does not fit the index in RAM. A search query results in a lot of random accesses on the disk, which means that it will be quite slow on old-school HDDs and super-fast on newer SSDs. Do store the Sonic database on SSD-backed file systems only.
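
As mentioned in the real-time limits above, a consolidation cycle can be forced over a control-mode Sonic Channel connection. A minimal sketch of such a session (response lines are illustrative; append your auth password to START if one is configured):

    START control
    TRIGGER consolidate
    OK
    QUIT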

🔥 Report A Vulnerability

If you find a vulnerability in Sonic, you are more than welcome to report it directly to @valeriansaliou by sending an encrypted email to [email protected]. Do not report vulnerabilities in public GitHub issues, as they may be exploited by malicious people to target production servers running an unpatched Sonic instance.

⚠️ You must encrypt your email using @valeriansaliou's GPG public key: 🔑 valeriansaliou.gpg.pub.asc.

Issues
  • Implement Sonic Channel libraries for popular programming languages


    Sonic cannot be integrated to an existing application code without a Sonic Channel library made for the used programming language.

    Currently, a NodeJS library is available (node-sonic-channel), but that's all!

    We'd need to support more programming languages with clean implementations of the Sonic Channel protocol as specified in: PROTOCOL.

    Programming languages:

    • [x] Rust
    • [x] JavaScript
    • [x] Go
    • [x] Java
    • [x] Python
    • [x] Ruby
    • [x] PHP
    • [ ] (any other language?)
    help wanted 
    opened by valeriansaliou 27
  • Is it possible to remove stopwords from query terms before the actual search?


    Hi @valeriansaliou, I made some tests in Portuguese (Brazil) and in English (US) and found that if you ingest a text like "this is the report from our last weeks meeting", a query for "last weeks meeting" would return empty (because it includes a stopword), but a query for "report weeks meeting" would return the object. If we remove the stop words before querying, "last weeks meeting" would also return the object.

    What would be the performance implications of this change? Another option would be to add a removeStopwords + lang function into the Sonic Channel NPM module and do the query cleanup there. Any thoughts about this?

    Thanks!

    bug 
    opened by andersonsantos 10
  • Not finding queries expected to be found


    I'm basically indexing documents like these:

    [
      {"title": "The Audio Programming Book", "isbn": "0-12-394595-X", "id": "91c1729fad074ed3a141798013ce6842"},
      {"title": "Starting Out with C++",  "isbn": "0-...", "id": "144a38277bb847258f9ba9b8002c7ee2"},
      {"title": "Information Visualization",  "isbn": "0-...", "id": "f5d8518942534a0e91e845489b628600"},
    ]
    

    Queries such as "the audio" do not return the first document's id. The query "audio", however, does. Likewise, "C++" does not find the second document's id. Searching for "Information" yields nothing. Neither does "Information visualization". Only "visualization" yields the expected outcome. Searching for any of the "isbn" values yields the expected result.

    My configuration file:

    # Sonic
    # Fast, lightweight and schema-less search backend
    # Configuration file
    # Example: https://github.com/valeriansaliou/sonic/blob/master/config.cfg
    
    
    [server]
    
    log_level = "debug"
    
    
    [channel]
    
    inet = "0.0.0.0:1491"
    tcp_timeout = 300
    
    #auth_password = ""
    
    [channel.search]
    
    query_limit_default = 10
    query_limit_maximum = 100
    query_alternates_try = 4
    
    suggest_limit_default = 5
    suggest_limit_maximum = 20
    
    
    [store]
    
    [store.kv]
    
    path = "./data/store/kv/"
    
    retain_word_objects = 1000
    
    [store.kv.pool]
    
    inactive_after = 1800
    
    [store.kv.database]
    
    compress = true
    parallelism = 2
    max_files = 100
    max_compactions = 1
    max_flushes = 1
    
    [store.fst]
    
    path = "./data/store/fst/"
    
    [store.fst.pool]
    
    inactive_after = 300
    
    [store.fst.graph]
    
    consolidate_after = 180
    

    Version: valeriansaliou/sonic:v1.1.9 (docker)

    The gist of the ingest code:

    ingestChannel.flusho('books', 'default', book.id);
    ingestChannel.push('books', 'default', book.id, book.title);
    ingestChannel.push('books', 'default', book.id, book.isbn);
    

    What could be the cause?

    Also, are there any side effects of keeping traditionally numerical keys such as the ISBN in the same index as their titles?

    opened by AlexGustafsson 9
  • Install (ubuntu) fail (cargo install sonic-server)


    [email protected]:~$ cargo install sonic-server

    Compiling librocksdb-sys v5.18.3 error: failed to compile sonic-server v1.2.0, intermediate artifacts can be found at /tmp/cargo-installA6m7Zx

    Caused by: failed to run custom build command for librocksdb-sys v5.18.3 process didn't exit successfully: /tmp/cargo-installA6m7Zx/release/build/librocksdb-sys-c9acc21d627f997d/build-script-build (exit code: 101) --- stdout cargo:rerun-if-changed=build.rs cargo:rerun-if-changed=rocksdb/ cargo:rerun-if-changed=snappy/ cargo:rerun-if-changed=lz4/ cargo:rerun-if-changed=zstd/ cargo:rerun-if-changed=zlib/ cargo:rerun-if-changed=bzip2/

    --- stderr thread 'main' panicked at 'Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-.so', 'libclang.so.'], set the LIBCLANG_PATH environment variable to a path where one of these files can be found (invalid: [])"', src/libcore/result.rs:997:5 note: Run with RUST_BACKTRACE=1 environment variable to display a backtrace.

    opened by retf 9
  • sonic fails to retrieve results on very simple queries


    I found this thing after configuring and running sonic for the first time,

    Using telnet, I manually push the following data:

    PUSH messages default id_1 "Some sample text number one"
    PUSH messages default id_2 "Some sample text number two"
    PUSH messages default id_3 "Some sample text number three"
    PUSH messages default id_4 "Some sample text number four"
    

    Then I consolidate the index, just to make sure: TRIGGER consolidate

    On all these commands (using their respective channel), I get an OK reply from the sonic process.

    So now, when I run some sample queries, I get results on a few words but not on others:

    QUERY messages default "Some", gives back no results, which is wrong ❌
    QUERY messages default "sample" gives back id_4 id_3 id_2 id_1, which is correct ✅
    QUERY messages default "text", gives back no results, which is wrong ❌
    QUERY messages default "number", gives back no results, which is wrong ❌
    QUERY messages default "one", gives back no results, which is wrong ❌
    QUERY messages default "two", gives back no results, which is wrong ❌
    QUERY messages default "three", gives back no results, which is wrong ❌
    QUERY messages default "four", gives back no results, which is wrong ❌

    What is going on? Is this a bug or am I doing something wrong?

    Thanks.

    opened by almosnow 8
  • MUSL build for Linux x64


    We should now be able to perform a MUSL build for Linux x64.

    Check this is now possible, and update PACKAGING.md accordingly.

    tooling 
    opened by valeriansaliou 8
  • Windows compatibility


    Sonic currently does not run on Windows. @Git0Shuai has made a fix that seemingly allows Sonic to run on Windows, see: https://github.com/Git0Shuai/sonic/commit/a268f7e6c13536eb61c34ab2d917e5ddbd7289ac

    enhancement 
    opened by valeriansaliou 7
  • Run as a service on UNIX machine


    Is it possible to run Sonic as system service? In that case, would I be able to append the logs to a ReadWriteFolder instead of the console?

    opened by funkfreakout 7
  • Docker image errors


    Hi,

    Trying to run v1.3.0 in docker and getting an error:

    Command:

    docker run -p 1491:1491 -v /usr/local/sonic/config.cfg:/etc/sonic.cfg -v /usr/local/sonic/store/:/var/lib/sonic/store/ --rm valeriansaliou/sonic:v1.3.0
    
    sonic: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by sonic)
    

    Checking env in the image:

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    HOSTNAME=3c32b11eb74b
    HOME=/root
    

    Seems like maybe lib64 should be added to PATH as suggested here: https://stackoverflow.com/questions/20357033/usr-lib-x86-64-linux-gnu-libstdc-so-6-version-cxxabi-1-3-8-not-found

    bug 
    opened by v0l 7
  • Invalid document ordering (last inserted should be returned first)


    Similarly to #135, I'm considering using this to replace an ElasticSearch cluster. We have more than a couple of fields:

    • Message text
    • User
    • Tags
    • Message date

    However, it should be reasonably straightforward to implement these using multiple searches and buckets, with frontend filtering. It may or may not be any faster than ES; that's part of what I'd like to find out.

    The biggest problem is date, which is a range query. Is there any guaranteed ordering to query returns? If they're in any order other than insertion time, then we'd need to retrieve every match and filter them in the frontend, which is unlikely to be net-positive.

    bug 
    opened by Baughn 6
  • run error-no kv store pool items need to be flushed at the moment



    When I try to run Sonic, it displays:

    (DEBUG) - scanning for kv store pool items to janitor
    (INFO) - done scanning for kv store pool items to janitor, expired 0 items, now has 0 items
    (DEBUG) - scanning for fst store pool items to janitor
    (INFO) - done scanning for fst store pool items to janitor, expired 0 items, now has 0 items
    (DEBUG) - scanning for kv store pool items to flush to disk
    (INFO) - no kv store pool items need to be flushed at the moment
    (DEBUG) - scanning for fst store pool items to consolidate
    (INFO) - no fst store pool items to consolidate in register
    (INFO) - ran tasker tick (took 0s + 0ms)

    opened by cpluscc 0
  • Object does not support containing Chinese characters


    (DEBUG) - got channel message: PUSH ChatUser 7dcb2a5c-89d6-4260-bc3e-bce73dbb0c6b {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"} "{"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}"
    (DEBUG) - will dispatch search command: PUSH
    (DEBUG) - parsed text parts (still needs post-processing): "{"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}"
    (DEBUG) - parsed text parts (post-processed): {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}
    (DEBUG) - dispatching ingest push in collection: ChatUser, bucket: 7dcb2a5c-89d6-4260-bc3e-bce73dbb0c6b and object: {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}
    (DEBUG) - ingest push has text: {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}
    (DEBUG) - will push for text: {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"} with hinted locale:
    (DEBUG) - detecting locale from lexer text: {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}
    (DEBUG) - will detect locale for lexer safe text: {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}
    (DEBUG) - lexer text is equal or longer than 60 characters, using the fast method
    (DEBUG) - guessing locale from stopwords for script: Latin and text: {"Id":"6489a4aa-d238-e468-9081-2afdf2a32156","NickName":"中文名"}
    (DEBUG) - [fast lexer] trying to detect locale from fallback ngram instead
    (DEBUG) - wrote response with values: ERR (query_error)

    opened by springe2004 3
  • Will there be a problem if the retain_word_objects is set too large?


    I have about a billion objects and there are a few words that will appear in a few hundred million objects. Which is the correct way to do it?

    1. set retain_word_objects to very large.
    2. use TF-IDF to filter words.
    opened by zzl221000 2
  • panicked at 'closing channel'


    Hi,

    I'm using Sonic for a bigger project and it's working great, except that it crashes at irregular intervals (every 1-2 days). The communication is done via the elixir sonix client and Sonic runs via the official docker container (latest version).

    config is the default one and the stored data is about 25MB (~24MB ram usage, CPU usage very low)

    The log is flooded with these lines:

      | 2021-07-02T11:45:39.424+02:00 | thread 'thread 'thread 'thread 'thread 'thread 'thread 'thread 'sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:sonic-channel-clientsonic-channel-client' panicked at 'closing channelsonic-channel-clientsonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:', ' panicked at 'src/channel/handle.rs188:21
  | 2021-07-02T11:45:39.424+02:00 | :188188:21
  | 2021-07-02T11:45:39.424+02:00 | :21
  | 2021-07-02T11:45:39.424+02:00 | closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T11:45:39.424+02:00 | sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T11:45:39.424+02:00 | ' panicked at 'sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T11:45:39.424+02:00 | sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T11:45:39.424+02:00 | closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T12:00:11.153+02:00 | (WARN) - took a lot of time: 125ms to process channel message
  | 2021-07-02T12:14:19.737+02:00 | thread 'thread 'sonic-channel-clientsonic-channel-client' panicked at '' panicked at 'closing channelclosing channel', ', src/channel/handle.rssrc/channel/handle.rs::188188::2121
  | 2021-07-02T12:14:19.737+02:00 | thread 'thread 'sonic-channel-clientsonic-channel-client' panicked at '' panicked at 'closing channelclosing channel', ', src/channel/handle.rssrc/channel/handle.rs::188188::2121
  | 2021-07-02T12:14:19.737+02:00 | thread 'sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T12:14:19.737+02:00 | thread 'sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T12:14:19.737+02:00 | thread 'sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T12:14:19.737+02:00 | thread 'sonic-channel-client' panicked at 'closing channel', src/channel/handle.rs:188:21
  | 2021-07-02T13:53:21.945+02:00 | (WARN) - took a lot of time: 54ms to process channel message
  | 2021-07-02T13:53:21.967+02:00 | (WARN) - took a lot of time: 76ms to process channel message
  | 2021-07-02T13:53:21.997+02:00 | (WARN) - took a lot of time: 105ms to process channel message
    

    Do you have any clue what is causing these log messages and the crashes? After restarting the container everything is working fine again.

    Thanks a lot!! Philipp

    opened by PhilWaldmann 1
  • Protocol parameter to disable stopwords from being cleaned from given terms


    In some use cases, stopwords are actually desired (eg. if ingesting and searching movie titles). A protocol parameter should be added for that purpose, as to disable the stopwords cleanup system.

    enhancement 
    opened by valeriansaliou 1
  • Indexes Being Dropped


    Problem

    I'm having issues where a predictable amount (around 10%) of indexed searches are being dropped. I'm not sure if it's an issue with my dataset, but it doesn't seem so.

    This is occurring in development using Sonic 1.3.0 (installing on Homebrew, running with this config) and production on Sonic 1.3.0 (pulling from Docker, running on Kubernetes) environments, both connecting via Node.js on [email protected].

    FYI: I'm trying to add "interests" to the search index.

    Test Script

    Screenshot 2021-06-27 at 7 59 14 am

    This script indexes an interest and then immediately drops it. If the removed variable is equal to zero, then it means that the search term wasn't properly indexed.

    After running on 1,000 interests, the same "The Last of Us Part II" interest (the 504th interest in the array) is consistently not properly being indexed:

    Screenshot 2021-06-27 at 8 04 31 am

    After upping to 10,000 interests, the same 19 interests are being dropped from the search index:

    Screenshot 2021-06-27 at 8 06 09 am

    Conclusion

    I'm not seeing anything in my logs, changing the collection id is not having an effect either. For "The Last of Us Part II", here are the config params I'm using:

    collection: interest
    bucket: 2355282563624337408
    object: 2451202401969897472
    

    After indexing 40,000~ interests, I'm seeing exactly 4040 being consistently dropped :~(

    I'd really appreciate some help with this issue, thanks!

    opened by darnfish 3
  • Updates rocksdb and hashbrown deps


    Title, just doing some dependency maintenance while I'm looking at the project

    opened by ZeWaka 0
  • N-Gram / Trigram support + number of results


    I am looking for a lightweight full text engine which offers similar functionality as PostgreSQL: https://www.postgresql.org/docs/9.6/pgtrgm.html

    The example query that I use: select id, name, count(*) OVER() AS count from table where name ILIKE '<searchPhrase>%' OR name ILIKE '%<searchPhrase>%' ORDER BY name ILIKE '<searchPhrase>%' desc, name

    This searches for results matching the provided substring/searchPhrase, prioritizing results which start with the provided searchPhrase; it also tells how many total matches are available.

    Would it be possible to somewhat mimic that using Sonic?

    opened by tomekit 0
  • custom dictionary


    Hi everyone, how do I configure a custom dictionary, just like userdict in jieba?

    opened by carmel 1
  • fuzzy search


    Does Sonic support fuzzy search or something based on an edit distance such as the Levenshtein distance? If not, what is the solution for fuzzy search with Sonic?

    opened by sademakn 2
Releases (v1.3.0)
  • v1.3.0(Jun 27, 2020)

  • v1.2.4(Jun 25, 2020)

    • Fixed multiple deadlocks, which were not noticed in practice when running Sonic at scale, but which are still theoretically possible [@BurtonQin, #213, #211].
    • Added support for Latin, which is now auto-detected from terms [@valeriansaliou, e6c5621].
    • Added Latin stopwords [@valeriansaliou, e6c5621].
    • Dependencies have been bumped to latest versions (namely: rocksdb, radix, hashbrown, whatlang) [@valeriansaliou].
    • Added a release script, with cross-compilation capabilities (currently for the x86_64 architecture, dynamically linked against GNU libraries) [@valeriansaliou, 961bab9].
    Source code(tar.gz)
    Source code(zip)
    v1.2.4-x86_64.tar.gz(3.26 MB)
  • v1.2.3(Oct 14, 2019)

    • RocksDB compression algorithm has been changed from LZ4 to Zstandard, for a slightly better compression ratio, and much better read/write performance; this will be used for new SST files only [@valeriansaliou, cd4cdfb].
    • Dependencies have been bumped to latest versions (namely: rocksdb) [@valeriansaliou, cd4cdfb].
    Source code(tar.gz)
    Source code(zip)
  • v1.2.2(Jul 11, 2019)

    • Fixed a regression on optional configuration values not working anymore, due to an issue in the environment variable reading system introduced in v1.2.1 [@valeriansaliou, #155].
    • Optimized some aspects of FST consolidation and pending operations management [@valeriansaliou, #156].
    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Jul 8, 2019)

    • FST graph consolidation is now able to ignore new words when the graph is over configured limits, which are set with the new store.fst.graph.max_size and store.fst.graph.max_words configuration variables [@valeriansaliou, 53db9c1].
    • An integration testing infrastructure has been added to the Sonic automated test suite [@vilunov, #154].
    • Configuration values can now be sourced from environment variables, using the ${env.VARIABLE} syntax in config.cfg [@perzanko, #148].
    • Dependencies have been bumped to latest versions (namely: rand, radix and hashbrown) [@valeriansaliou, c1b1f54].
    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(May 3, 2019)

    • Fixed a rare deadlock occurring when 3 concurrent operations get executed on different threads for the same collection, in the following timely order: PUSH then FLUSHB then PUSH [@valeriansaliou, d96546bd9d8b79332df1106766377e4a4acebd50].
    • Reworked the KV store manager to perform periodic memory flushes to disk, thus reducing startup time [@valeriansaliou, 6713488af3543bca33be6e772936f9668430ba86].
    • Stop accepting Sonic Channel commands when shutting down Sonic [@valeriansaliou, #131].
    • Introduced a server statistics INFO command to Sonic Channel [@valeriansaliou, #70].
    • Added the ability to disable the lexer for a command with the command modifier LANG(none) [@valeriansaliou, #108].
    • Added a backup and restore system for both KV and FST stores, which can be triggered over Sonic Channel with TRIGGER backup and TRIGGER restore [@valeriansaliou, #5].
    • Added the ability to disable KV store WAL (Write-Ahead Log) with the write_ahead_log option, which helps limit write wear on heavily loaded SSD-backed servers [@valeriansaliou, #130].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.9(Mar 29, 2019)

    • RocksDB has been bumped to v5.18.3, which fixes a dead-lock occurring in RocksDB at scale when a compaction task is run under heavy disk writes (i.e. disk flushes). This dead-lock was causing Sonic to stop responding to any command issued for the frozen collection. This dead-lock was due to a bug in RocksDB internals (not originating from Sonic itself) [@baptistejamin, 19c4a104a6d6aaed1dd9beb2e51d2639627825cd].
    • Reworked the FLUSHB command internals, which now use the atomic delete_range() operation provided by RocksDB v5.18 [@valeriansaliou, 660f8b714d968400fb9f88a245752dca02249bf7].
    • Added the LANG(<locale>) command modifier for QUERY and PUSH, that lets a Sonic Channel client force a text locale (instead of letting the lexer system guess the text language) [@valeriansaliou, #75].
    • The FST word lookup system, used by the SUGGEST command, now supports all scripts via a restricted Unicode range forward scan [@valeriansaliou, #64].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.8(Mar 27, 2019)

    • A store acquire lock has been added to prevent 2 concurrent threads from opening the same collection at the same time [@valeriansaliou, 2628077ebe7e24155975962471e7653745a0add7].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.7(Mar 27, 2019)

    • A superfluous mutex was removed from KV and FST store managers, in an attempt to solve a rare dead-lock occurring on high-traffic Sonic setups in the KV store [@valeriansaliou, 60566d2f087fd6725dba4a60c3c5a3fef7e8399b].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.6(Mar 27, 2019)

    • Reverted changes made in v1.1.5 regarding the open files rlimit, as this can be set from outside Sonic [@valeriansaliou, f6400c61a9a956130ae0bdaa9a164f4955cd2a18].
    • Added Chinese Traditional stopwords [@dsewnr, #87].
    • Improved the way database locking is handled when calling a pool janitor; this prevents potential dead-locks under high load [@valeriansaliou, fa783728fd27a116b8dcf9a7180740d204b69aa4].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.5(Mar 27, 2019)

  • v1.1.4(Mar 27, 2019)

    • Automatically adjust rlimit for the process to the hard limit allowed by the system (allows opening more FSTs in parallel) [@valeriansaliou].
    • Added Kannada stopwords [@dileepbapat].
    • The Docker image is now much lighter [@codeflows].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.3(Mar 25, 2019)

    • Rework Sonic Channel buffer management using a VecDeque (Sonic should now work better in harsh network environments) [@valeriansaliou, 1c2b9c8fcd28b033a7cb80d678c388ce78ab989d].
    • Limit the size of words that can hit against the FST graph, as the FST gets slower for long words [@valeriansaliou, #81].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.2(Mar 24, 2019)

    • FST graph consolidation locking strategy has been improved even further, based on issues with the previous rework we have noticed at scale in production (now, consolidation locking is done at a lower-priority relative to actual queries and pushes to the index) [@valeriansaliou, #68].
    Source code(tar.gz)
    Source code(zip)
  • v1.1.1(Mar 24, 2019)

    • FST graph consolidation locking strategy has been reworked as to allow queries to be executed lock-free when the FST consolidate task takes a lot of time (previously, queries were being deferred due to an ongoing FST consolidate task) [@valeriansaliou, #68].
    • Removed special license clause introduced in v1.0.2, Sonic is full MPL 2.0 now. [@valeriansaliou]
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Mar 21, 2019)

    • Breaking: Change how buckets are stored in a KV-based collection (nest them in the same RocksDB database; this is much more efficient on setups with a large number of buckets - v1.1.0 is incompatible with the v1.0.0 KV database format) [@valeriansaliou].
    • Bump jemallocator to version 0.3 [@valeriansaliou].
    Source code(tar.gz)
    Source code(zip)
  • v1.0.2(Mar 20, 2019)

  • v1.0.1(Mar 19, 2019)

    • Added automated benchmarks (can be run via cargo bench --features benchmark) [@valeriansaliou].
    • Reduced the time to query the search index by 50% via optimizations (in multiple methods, eg. the lexer) [@valeriansaliou].
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Mar 18, 2019)

Owner
Valerian Saliou
Co-Founder & CTO at Crisp.