A Google-like web search engine that returns the websites most relevant to the user's query, using crawled and indexed textual data and PageRank.

Max

Last update: Aug 10, 2022

Mini Google

Course project for the Architecture of Computer Systems course.
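The project description names PageRank as the ranking signal. As a rough illustration only, the idea can be sketched with power iteration over a small link graph. The function name, the damping factor of 0.85, the iteration count, and the sample graph are all assumptions made for this sketch, not details taken from the project's code.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Hypothetical four-page graph used only for demonstration.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, rank in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {rank:.3f}")
```

In this toy graph the page with the most incoming links ranks highest and the page with no incoming links ranks lowest; a real deployment would compute ranks over the crawled link graph and combine them with text relevance.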

Overview:

Architecture:

We are working on multiple components of the search engine at the same time:

Website backend
Elasticsearch database backend
Two crawlers (one in Python, and one in Rust)
Language detection backend in Rust and Python.

Each component is intended to run as a separate Docker container, so that we can scale each one independently and spread them across different computers/servers.

Progress can be tracked over here.

Usage:

Launch each container independently with instructions in respective directories, or launch all of them together:

# Rust takes some time to download and compile all of the dependencies since it
# produces static binaries, so be aware! We are figuring out a way to fix this.
docker-compose build

docker-compose up

Prerequisites:

Credits:

License:

MIT License

Comments

Cache Cargo dependencies in a separate layer

Hi @maxymkuz and @LastGenius-edu,

Please take a look at the next Dockerfile snippet.

FROM ekidd/rust-musl-builder:latest AS builder

RUN mkdir -p /home/rust/my-project
WORKDIR /home/rust/my-project
VOLUME ["/home/rust/myproject"]

# cache dependency artifacts. 'Cargo build' builds 
# both dependencies and project files, 
# splitting the project into dependencies with single 
# empty main and current project sources.
COPY --chown=rust:rust Cargo.toml Cargo.toml
COPY --chown=rust:rust Cargo.lock Cargo.lock
RUN mkdir -p src && echo 'fn main() {}' > src/main.rs
RUN cargo build --release

# build the project using prebuilt deps
RUN rm /home/rust/my-project/src/main.rs
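COPY --chown=rust:rust ./src ./src
# refresh the timestamp so Cargo rebuilds main.rs rather than reusing the stub binary
RUN touch src/main.rs && cargo build --release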

This trick solves the dependency-caching issue during cargo build. It splits the build into two steps: the first builds only the dependencies against an empty main and commits them as a separate image layer, which Docker then reuses on later project builds as long as Cargo.toml and Cargo.lock are unchanged.

This is just a prototype that needs to be implemented for each Rust project. There is no need to use a separate project; you can implement it for /home/rust/src.

opened by OldFrostDragon 1

Finish working on pre-presentation stuff

We've completely migrated to Elasticsearch, improved text parsing, and set up a proper database backend. And search now works!

[x] Set up Elasticsearch and transfer existing PostgreSQL functionality to it

[x] Implement database backend to handle requests from crawlers and web backend

[x] Set up web backend to properly communicate with the database backend

[x] Connect the Python crawler to the database.

[x] Improve Docker infrastructure (still a lot to work on)
opened by last-genius 0

Implement a more advanced asynchronous multi-processor crawler

After some more work on the crawler, it now uses a thread pool in which each thread performs asynchronous I/O on the webpages it is given.
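The combination described here (a thread pool whose threads each run asynchronous I/O) can be sketched roughly as follows. This is an illustration in Python rather than the project's Rust code; `fetch` is a hypothetical stand-in that sleeps instead of making a network request, and the names `crawl_batch`, `worker`, `crawl`, `threads`, and `max_concurrent` are assumptions made for this sketch.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(url):
    """Hypothetical stand-in for a real HTTP request; it only yields to the event loop."""
    await asyncio.sleep(0.01)
    return url, f"<html>contents of {url}</html>"


async def crawl_batch(urls, max_concurrent):
    """Fetch one batch of URLs concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


def worker(urls, max_concurrent=4):
    """Runs inside a pool thread: each thread drives its own event loop."""
    return asyncio.run(crawl_batch(urls, max_concurrent))


def crawl(urls, threads=2, max_concurrent=4):
    """Split the URLs across threads; each thread does async I/O on its share."""
    batches = [urls[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(worker, batches, [max_concurrent] * threads)
    return dict(pair for batch in results for pair in batch)


if __name__ == "__main__":
    pages = crawl([f"https://example.com/{i}" for i in range(10)], threads=3)
    print(len(pages), "pages fetched")
```

The two knobs, `threads` and `max_concurrent`, correspond to the thread count and the number of simultaneous async requests per thread, which are the values one would tune when benchmarking such a crawler.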

Updating the list from the previous Pull Request #1, tasks done for this week are:

Completed:

[x] A minimal Docker image for the crawler.

[x] A CLI app that takes a text input file and then crawls through these webpages in parallel, collecting more and more links.

[x] The app saves the collected structured data in plain form into an output file.

[x] Implement asynchronous I/O for each worker in the thread pool

[x] Improve link collection, stop considering anchors on the same page as a different page

[x] Add more documentation and split the file into several modules
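The link-collection item in the list above (not treating anchors on the same page as separate pages) comes down to dropping the URL fragment before checking whether a link was already seen. A minimal sketch using Python's standard `urllib.parse`; the helper names `normalize_link` and `collect_new_links` and the example URLs are hypothetical, not taken from the crawler.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(base_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url


def collect_new_links(base_url, hrefs, seen):
    """Return links not visited before, treating same-page anchors as the page itself."""
    new_links = []
    for href in hrefs:
        url = normalize_link(base_url, href)
        if url not in seen:
            seen.add(url)
            new_links.append(url)
    return new_links


if __name__ == "__main__":
    seen = {"https://example.com/docs"}
    hrefs = ["#install", "/docs#usage", "/about", "about#team"]
    print(collect_new_links("https://example.com/docs", hrefs, seen))
```

Here only `/about` is reported as new: the two links that point back into the current page are recognised as already seen, and the second link to `/about` differs from the first only by its fragment.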

Things left to do, roughly in descending order of importance:

[ ] Add tests for the crawler and thread pool implementations

[ ] Test the module dynamically and figure out the best number of threads and simultaneous async requests

[ ] Fix up progress display in a Docker image

[ ] Add more detailed progress info on what each thread is doing
opened by last-genius 0

Finished working on a basic parallel crawler for Structured Data in Rust

I've implemented a simple crawler that uses a thread pool and message passing to crawl webpages in parallel. While it uses most of the CPU's resources and does everything we agreed upon for this week, there is still a lot of room for improvement; more on this below.
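The thread-pool-plus-message-passing design can be pictured with a small sketch. It is written in Python with `queue.Queue` playing the role of the channels and is not the Rust implementation; `FAKE_WEB` is a hypothetical in-memory link graph so the example runs without network access, and `worker` and `crawl` are names assumed for illustration.

```python
import queue
import threading

# Hypothetical in-memory "web" (page -> outgoing links) used instead of real requests.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def worker(tasks, results):
    """Receive URLs over `tasks` and send (url, links) messages back over `results`."""
    while True:
        url = tasks.get()
        if url is None:  # sentinel: no more work
            break
        results.put((url, FAKE_WEB.get(url, [])))


def crawl(seeds, num_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    seen = set(seeds)
    for url in seen:
        tasks.put(url)
    pending = len(seen)
    crawled = {}

    # The coordinator alone owns `seen`, so workers never share mutable state.
    while pending:
        url, links = results.get()
        pending -= 1
        crawled[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                tasks.put(link)
                pending += 1

    # Tell every worker to stop, then wait for all of them to exit.
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return crawled


if __name__ == "__main__":
    print(sorted(crawl(["a"])))
```

Keeping the visited set on the coordinating thread and exchanging only messages avoids locks around shared state, at the cost of routing every discovered link through a single thread.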

Completed:

[x] A minimal Docker image for the crawler.

[x] A CLI app that takes a text input file and then crawls through these webpages in parallel, collecting more and more links.

[x] The app saves the collected structured data in plain form into an output file.

Things left to do, roughly in descending order of importance:

[ ] Look into asynchronous implementations for each worker in the thread pool

[ ] Improve link collection, stop considering anchors on the same page as a different page

[ ] Add more documentation and split the file into several modules

[ ] Add tests for the crawler and thread pool implementations

[ ] Fix up progress display in a Docker image
opened by last-genius 0
