Serverless search for AWS.

Tyler van Hensbergen

Last update: Jan 3, 2023

Related tags

Utilities rust aws aws-lambda serverless tantivy

Overview

Pathery 🔥 Serverless Search 🔥

Pathery is a serverless search service built on AWS using Rust, CDK and Tantivy.

🔔 WARNING: This is currently a work in progress and not ready for production usage.

Features

🔥 Fast full-text search - Pathery is built on Rust to limit its AWS Lambda cold start overhead.
🥰 Simple REST API - Pathery exposes a simple REST API to make search as easy as possible.
👍 Easy to install - Pathery ships as a CDK Component making it easy to get started.
💵 Usage based - Pathery has no long-running servers, so you only pay for what you use.
🔼 Built for AWS - Pathery leans on AWS managed services to limit its maintenance burden and maximize its scalability.

Getting Started

Check out the getting started guide to deploy Pathery into your AWS account using CDK.

You might also like...

Rust client for AWS Infinidash service.

AWS Infinidash - Fully featured Rust client Fully featured AWS Infinidash client for Rust applications. You can use the AWS Infinidash client to make

15 Feb 12, 2022

Rusoto is an AWS SDK for Rust

Rusoto is an AWS SDK for Rust You may be looking for: An overview of Rusoto AWS services supported by Rusoto API documentation Getting help with Rusot

2.6k Jan 3, 2023

Easy switch between AWS Profiles and Regions

AWSP - CLI To Manage your AWS Profiles! AWSP provides an interactive terminal to interact with your AWS Profiles. The aim of this project is to make i

14 Dec 25, 2022

Simple fake AWS Cognito User Pool API server for development.

Fakey Cognito 🏡 Homepage Simple fake AWS Cognito API server for development. ✅ Implemented features AdminXxx on User Pools API. Get Started # run wit

4 Aug 30, 2022

Postgres proxy which allows tools that don't natively support IAM auth to connect to AWS RDS instances.

rds-iamauth-proxy rds-proxy lets you make use of IAM-based authentication to AWS RDS instances from tools that don't natively support that method of a

10 Nov 7, 2022

A tool to run web applications on AWS Lambda without changing code.

AWS Lambda Adapter A tool to run web applications on AWS Lambda without changing code. How does it work? AWS Lambda Adapter supports AWS Lambda functi

321 Jan 2, 2023

📦 🚀 a smooth-talking smuggler of Rust HTTP functions into AWS lambda

lando 🚧 maintenance mode ahead 🚧 As of this announcement AWS now officially supports Rust through this project. As mentioned below this project's goal

68 Dec 7, 2021

cargo-lambda a Cargo subcommand to help you work with AWS Lambda

cargo-lambda cargo-lambda is a Cargo subcommand to help you work with AWS Lambda. This subcommand compiles AWS Lambda functions natively and produces

184 Jan 5, 2023

cargo-lambda is a Cargo subcommand to help you work with AWS Lambda.

cargo-lambda cargo-lambda is a Cargo subcommand to help you work with AWS Lambda. The new subcommand creates a basic Rust package from a well defined

184 Jan 5, 2023

Comments

Consider using DynamoDB or S3 for original document storage

When Tantivy indexes documents, it can optionally store the original text as well. This stored text is used to generate snippets, which are highlights of the matching text. Tantivy keeps these documents on the filesystem in a compressed format. This works fairly well on local disk, but on networked storage such as EFS the whole store for a segment needs to be pulled in order to find a given document.

.store files tend to be considerably larger than the rest of the segment:

-rw-rw-r--  1 1001 1001  48K Nov 28 22:19 d908e24f73f04e83b85e679acf1d361b.7388140.del
-rw-rw-r--  1 1001 1001   99 Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fast
-rw-rw-r--  1 1001 1001 2.7M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.fieldnorm
-rw-rw-r--  1 1001 1001  30M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.idx
-rw-rw-r--  1 1001 1001  17M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.pos
-rw-rw-r--  1 1001 1001  91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r--  1 1001 1001 5.8M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.term

Looking at all the .store files in the test index, it is clear that pulling .store files to look up specific documents by id would be very inefficient once network latency is taken into account:

-rw-rw-r--  1 1001 1001 149M Nov 28 22:14 03578055b76b45bd961cf3931a0282d9.store
-rw-rw-r--  1 1001 1001 176M Nov 28 23:29 103552f789714d07a2dff9f7143e001c.store
-rw-rw-r--  1 1001 1001 162M Nov 29 00:06 1bfb3f7ef08e40b4bd166919c0786769.store
-rw-rw-r--  1 1001 1001 9.9M Nov 29 00:38 87c0dd2f0577477ba233ee6a1c57c948.store
-rw-rw-r--  1 1001 1001  66M Nov 29 00:16 92f9621c4f3e41a6938ad65d2c37e969.store
-rw-rw-r--  1 1001 1001  51M Nov 29 00:32 b208026b097445c4a8eef6ea7dc6754e.store
-rw-rw-r--  1 1001 1001  91M Nov 28 21:44 d908e24f73f04e83b85e679acf1d361b.store
-rw-rw-r--  1 1001 1001  58M Nov 29 00:24 dd957117dbf3437ca3cdb552b38cc8c4.store
-rw-rw-r--  1 1001 1001  38M Nov 29 00:37 ea6445abef914faeb767686e1c054987.store
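
To put rough numbers on the two listings, here is a small sketch that adds up the sizes shown above. The values are the rounded figures printed by `ls`, so the results are approximate; the helper names are made up for this example. With these inputs the .store file accounts for roughly 62% of its segment, and the nine .store files together come to about 800 MB.

```rust
/// Sum a list of file sizes given in megabytes.
fn total_mb(sizes: &[f64]) -> f64 {
    sizes.iter().sum()
}

/// Fraction of a segment's bytes held by its .store file.
fn store_share(store_mb: f64, other_mb: &[f64]) -> f64 {
    store_mb / (store_mb + total_mb(other_mb))
}

fn main() {
    // Segment from the first listing: every file except .store, in MB.
    // The 48K and 99-byte files are approximated as 0.047 and 0.0001.
    let other = [0.047, 0.0001, 2.7, 30.0, 17.0, 5.8];
    let share = store_share(91.0, &other);
    println!("store share of segment: {:.0}%", 100.0 * share);

    // All .store files in the test index, in MB.
    let stores = [149.0, 176.0, 162.0, 9.9, 66.0, 51.0, 91.0, 58.0, 38.0];
    println!("total .store size: {:.1} MB", total_mb(&stores));
}
```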

Instead of using Tantivy's built-in storage capability, we could use S3 or DynamoDB to store the original documents so that they can be retrieved efficiently by id. A beneficial side effect is that this should also be cheaper, since both DynamoDB and S3 have a lower monthly storage cost than EFS.
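
A minimal sketch of the shape this could take, assuming a hypothetical `DocumentStore` trait keyed by document id. The in-memory implementation only stands in for S3 or DynamoDB so the example runs without AWS access; none of these names come from Pathery or Tantivy.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over an external document store
/// (S3 or DynamoDB in the proposal). Not part of Pathery or Tantivy.
trait DocumentStore {
    fn put(&mut self, id: &str, body: &str);
    fn get(&self, id: &str) -> Option<String>;
}

/// In-memory stand-in for the external store.
#[derive(Default)]
struct InMemoryStore {
    docs: HashMap<String, String>,
}

impl DocumentStore for InMemoryStore {
    fn put(&mut self, id: &str, body: &str) {
        self.docs.insert(id.to_string(), body.to_string());
    }

    fn get(&self, id: &str) -> Option<String> {
        self.docs.get(id).cloned()
    }
}

/// Resolve the ids returned by a search into the original documents.
/// Ids with no stored document are skipped.
fn hydrate_hits(store: &dyn DocumentStore, hit_ids: &[&str]) -> Vec<(String, String)> {
    hit_ids
        .iter()
        .filter_map(|id| store.get(id).map(|body| (id.to_string(), body)))
        .collect()
}

fn main() {
    let mut store = InMemoryStore::default();
    store.put("doc-1", "serverless search on AWS");
    store.put("doc-2", "full-text search with Tantivy");

    // Pretend the index returned these ids for a query.
    let hits = hydrate_hits(&store, &["doc-2", "missing"]);
    for (id, body) in &hits {
        println!("{id}: {body}");
    }
}
```

In this arrangement the index would only need to keep the document id, and each hit costs one keyed lookup instead of a read of the segment's .store file.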

enhancement

opened by tvanhens 1

Query fails sporadically when indexing due to deleted files from merged segments

It appears that segment merging is the culprit here. A merge deletes the segments it has merged, but readers may still hold out-of-date meta.json files that refer to these deleted segments.
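
The failure mode can be reproduced with a toy model. The sketch below is not Tantivy code: `Directory`, `Reader`, and `search_with_reload` are hypothetical names, and the reload-and-retry step is only one possible mitigation, not necessarily the fix Pathery adopted.

```rust
use std::collections::HashMap;

/// Toy index directory: `meta` lists the live segment ids (like meta.json),
/// `files` holds the segment files that actually exist.
struct Directory {
    meta: Vec<String>,
    files: HashMap<String, Vec<String>>,
}

/// A reader keeps the segment list it saw when it was opened.
struct Reader {
    segments: Vec<String>,
}

impl Reader {
    fn open(dir: &Directory) -> Self {
        Reader { segments: dir.meta.clone() }
    }

    /// Returns Err with the segment id if a listed segment no longer exists.
    fn search(&self, dir: &Directory, term: &str) -> Result<Vec<String>, String> {
        let mut hits = Vec::new();
        for seg in &self.segments {
            let docs = dir.files.get(seg).ok_or_else(|| seg.clone())?;
            hits.extend(docs.iter().filter(|d| d.contains(term)).cloned());
        }
        Ok(hits)
    }
}

/// Assumed mitigation: on a missing segment, reload the segment list once and retry.
fn search_with_reload(
    reader: &mut Reader,
    dir: &Directory,
    term: &str,
) -> Result<Vec<String>, String> {
    match reader.search(dir, term) {
        Err(_) => {
            *reader = Reader::open(dir);
            reader.search(dir, term)
        }
        ok => ok,
    }
}

fn main() {
    let mut dir = Directory { meta: vec!["a".into(), "b".into()], files: HashMap::new() };
    dir.files.insert("a".into(), vec!["rust search".into()]);
    dir.files.insert("b".into(), vec!["aws lambda".into()]);
    let mut reader = Reader::open(&dir);

    // A merge replaces segments a and b with c and deletes the old files.
    dir.files.clear();
    dir.files.insert("c".into(), vec!["rust search".into(), "aws lambda".into()]);
    dir.meta = vec!["c".into()];

    // The stale reader fails; reloading the segment list recovers.
    println!("stale: {:?}", reader.search(&dir, "rust"));
    println!("reloaded: {:?}", search_with_reload(&mut reader, &dir, "rust"));
}
```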

bug

opened by tvanhens 1

feature: batching and batch index endpoint

Adds a batch indexing endpoint POST /index/{index_id}/batch to upload batches of documents.

Batches are serialized to S3 and a message pointing to the S3 object is enqueued to SQS. This allows for large batches without running into SQS payload limitations.
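
The flow described above can be sketched as follows. `Bucket` and `Queue` are hypothetical in-memory stand-ins for S3 and SQS, and both the key layout and the newline-delimited serialization are assumptions made for illustration rather than Pathery's actual format.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical in-memory stand-in for an S3 bucket.
#[derive(Default)]
struct Bucket {
    objects: HashMap<String, String>,
}

/// Hypothetical in-memory stand-in for an SQS queue.
#[derive(Default)]
struct Queue {
    messages: VecDeque<String>,
}

/// Write the whole batch to the bucket and enqueue only its key,
/// so the queue message stays small however large the batch is.
fn enqueue_batch(
    bucket: &mut Bucket,
    queue: &mut Queue,
    index_id: &str,
    batch_id: u64,
    docs: &[&str],
) -> String {
    // Assumed key layout and one-document-per-line serialization.
    let key = format!("{index_id}/batches/{batch_id}");
    bucket.objects.insert(key.clone(), docs.join("\n"));
    queue.messages.push_back(key.clone());
    key
}

/// Worker side: take the next pointer off the queue and load the batch it names.
fn receive_batch(bucket: &Bucket, queue: &mut Queue) -> Option<Vec<String>> {
    let key = queue.messages.pop_front()?;
    let body = bucket.objects.get(&key)?;
    Some(body.lines().map(str::to_string).collect())
}

fn main() {
    let mut bucket = Bucket::default();
    let mut queue = Queue::default();

    let docs = [r#"{"title":"a"}"#, r#"{"title":"b"}"#];
    let key = enqueue_batch(&mut bucket, &mut queue, "my-index", 1, &docs);
    println!("queued pointer: {key}");

    let batch = receive_batch(&bucket, &mut queue);
    println!("worker loaded: {batch:?}");
}
```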

Closes #2

opened by tvanhens 0

Allow all tantivy schema fields to be used by pathery indexes

Pathery Schemas include functions for transforming JSON data as well as self-serialization. This means the custom code for both functions can be removed and it should improve feature coverage.

[x] dates

[x] integers

[ ] floats

enhancement

opened by tvanhens 0