Mini Google

Course project for the Architecture of Computer Systems course.

Overview:

A Google-like web search engine that returns the most relevant websites for a given user query, using crawled and indexed textual data and PageRank.

Architecture:

We are working on multiple components at the same time: the crawler, the database backend, and the web backend.

Each component is intended to run as a separate Docker container, so that we can freely mix them in different numbers and across different computers and servers.
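
As a rough illustration, a compose file for this layout could look like the sketch below. The service names, build contexts, and image version are assumptions for illustration only; the actual docker-compose.yml in this repository may differ.

# Hypothetical docker-compose.yml sketch, not the repository's actual file.
version: "3"
services:
  crawler:
    build: ./crawler
    depends_on:
      - db-backend
  db-backend:
    build: ./db-backend
    depends_on:
      - elasticsearch
  web-backend:
    build: ./web-backend
    ports:
      - "8080:8080"
  elasticsearch:
    image: elasticsearch:7.10.1
    environment:
      - discovery.type=single-node

With such a layout, running several copies of a component is a single flag, e.g. docker-compose up --scale crawler=4.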

Progress can be tracked over here.

Usage:

Launch each container independently using the instructions in its respective directory, or launch all of them together:

# Rust takes some time to download and compile all of the dependencies since it
# produces static binaries, so be aware! We are figuring out a way to fix this.
docker-compose build

docker-compose up

Prerequisites:

Credits:

License:

MIT License


Comments
  • Cache Cargo dependencies in a separate layer

    Hi @maxymkuz and @LastGenius-edu,

    Please take a look at the following Dockerfile snippet.

    FROM ekidd/rust-musl-builder:latest AS builder
    
    RUN mkdir -p /home/rust/my-project
    WORKDIR /home/rust/my-project
    
    # Cache dependency artifacts: 'cargo build' compiles both the
    # dependencies and the project sources, so we split the project
    # into its dependencies (built against a single empty main.rs)
    # and the current project sources.
    COPY --chown=rust:rust Cargo.toml Cargo.toml
    COPY --chown=rust:rust Cargo.lock Cargo.lock
    RUN mkdir src && echo 'fn main() {}' > src/main.rs
    RUN cargo build --release
    
    # Build the project using the prebuilt dependencies.
    RUN rm src/main.rs
    COPY --chown=rust:rust ./src ./src
    RUN cargo build --release
    

    This trick solves the issue with dependency caching during cargo build. It splits the build process into two steps: the first builds the dependencies against an empty main and commits them as a separate image layer, which Docker will then reuse during later project builds.

    This is just a prototype that needs to be adapted for each Rust project. There is no need to use a separate project directory; you can implement it directly under /home/rust/src.

    opened by OldFrostDragon 1
  • Finish working on pre-presentation stuff

    We've completely migrated to Elasticsearch, improved text parsing, and set up a proper database backend. And search now works! (A minimal indexing sketch follows the checklist below.)

    • [x] Set up Elasticsearch and transfer existing PostgreSQL functionality to it
    • [x] Implement database backend to handle requests from crawlers and web backend
    • [x] Set up web backend to properly communicate with the database backend
    • [x] Connect the Python crawler to the database.
    • [x] Improve Docker infrastructure (still a lot to work on)
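
    For context, here is a minimal sketch of how a crawled page can be handed to Elasticsearch over its REST API. The index name, field names, and host are illustrative assumptions, not the project's actual schema; it assumes reqwest with the "blocking" and "json" features plus serde_json.

    // Hypothetical sketch: index a crawled page into Elasticsearch
    // via its REST API. Index name, fields, and host are assumptions.
    use serde_json::json;
    
    fn index_page(url: &str, title: &str, body: &str) -> Result<(), reqwest::Error> {
        let doc = json!({ "url": url, "title": title, "body": body });
        reqwest::blocking::Client::new()
            .post("http://localhost:9200/pages/_doc")
            .json(&doc)
            .send()?
            .error_for_status()?;
        Ok(())
    }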
    opened by last-genius 0
  • Implement a more advanced asynchronous multi-processor crawler

    Worked a little more on the crawler; it now uses a thread pool where each thread performs asynchronous I/O on the webpages assigned to it (a sketch of this design follows the task lists below).

    Updating the list from the previous Pull Request #1, the tasks completed this week are:

    Completed:

    • [x] A minimal Docker image for the crawler.
    • [x] A CLI app that takes a text input file and then crawls through these webpages in parallel, collecting more and more links.
    • [x] The app saves the collected structured data in plain form into an output file.
    • [x] Move to asynchronous implementations for each worker in the thread pool
    • [x] Improve link collection; stop treating anchors within the same page as distinct pages
    • [x] Add more documentation and split the file into several modules

    Things left to do, roughly in descending order of importance:

    • [ ] Add tests for the crawler and thread pool implementations
    • [ ] Test the module dynamically and figure out the best number of threads and simultaneous async requests
    • [ ] Fix up progress display in a Docker image
    • [ ] Add more detailed progress info on what each thread is doing
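
    A rough sketch of this design, assuming tokio, reqwest, and futures as the async stack (the project's actual crates and structure may differ): each OS thread in the pool runs its own single-threaded runtime and fetches its batch of URLs concurrently.

    // Illustrative sketch only, not the project's code. Link extraction
    // and error handling are omitted; n_threads is assumed to be >= 1.
    use std::thread;
    
    fn crawl(urls: Vec<String>, n_threads: usize) -> Vec<String> {
        if urls.is_empty() {
            return Vec::new();
        }
        // Split the URL list into one batch per thread.
        let per_thread = (urls.len() + n_threads - 1) / n_threads;
        let handles: Vec<_> = urls
            .chunks(per_thread)
            .map(|batch| {
                let batch = batch.to_vec();
                thread::spawn(move || {
                    // One single-threaded runtime per OS thread.
                    let rt = tokio::runtime::Builder::new_current_thread()
                        .enable_all()
                        .build()
                        .unwrap();
                    rt.block_on(async {
                        let client = reqwest::Client::new();
                        // Fetch the whole batch concurrently within this thread.
                        let fetches = batch.into_iter().map(|url| {
                            let client = client.clone();
                            async move { client.get(&url).send().await?.text().await }
                        });
                        futures::future::join_all(fetches)
                            .await
                            .into_iter()
                            .filter_map(Result::ok)
                            .collect::<Vec<String>>()
                    })
                })
            })
            .collect();
        // Gather the fetched page bodies from all threads.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    }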
    opened by last-genius 0
  • Finished working on a basic parallel crawler for Structured Data in Rust

    I've implemented a simple crawler that uses a thread pool and message passing to crawl webpages in parallel (a minimal sketch follows the task lists below). While it uses most of the CPU's resources and does everything we agreed upon for this week, there is still a lot of room for improvement; more on this below.

    Completed:

    • [x] A minimal Docker image for the crawler.
    • [x] A CLI app that takes a text input file and then crawls through these webpages in parallel, collecting more and more links.
    • [x] The app saves the collected structured data in plain form into an output file.

    Things left to do, roughly in descending order of importance:

    • [ ] Look into asynchronous implementations for each worker in the thread pool
    • [ ] Improve link collection; stop treating anchors within the same page as distinct pages
    • [ ] Add more documentation and split the file into several modules
    • [ ] Add tests for the crawler and thread pool implementations
    • [ ] Fix up progress display in a Docker image
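
    A minimal sketch of the thread-pool-plus-message-passing scheme described above, assuming blocking reqwest for the HTTP fetches (names and structure are illustrative, not the project's actual code):

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;
    
    fn crawl(urls: Vec<String>, n_workers: usize) -> Vec<(String, String)> {
        let (job_tx, job_rx) = mpsc::channel::<String>();
        let (res_tx, res_rx) = mpsc::channel::<(String, String)>();
        // mpsc receivers are single-consumer, so workers share one behind a Mutex.
        let job_rx = Arc::new(Mutex::new(job_rx));
    
        let workers: Vec<_> = (0..n_workers)
            .map(|_| {
                let job_rx = Arc::clone(&job_rx);
                let res_tx = res_tx.clone();
                thread::spawn(move || loop {
                    // The lock is released before the fetch, so downloads run in parallel.
                    let url = match job_rx.lock().unwrap().recv() {
                        Ok(url) => url,
                        Err(_) => break, // channel closed: no more jobs
                    };
                    if let Ok(body) = reqwest::blocking::get(&url).and_then(|r| r.text()) {
                        let _ = res_tx.send((url, body));
                    }
                })
            })
            .collect();
    
        for url in urls {
            job_tx.send(url).unwrap();
        }
        drop(job_tx); // closing the job channel lets workers exit after draining it
        drop(res_tx); // keep only the workers' clones of the result sender
    
        let pages: Vec<_> = res_rx.iter().collect(); // ends once every worker is done
        for worker in workers {
            worker.join().unwrap();
        }
        pages
    }

    A real crawler would additionally extract links from each fetched body and feed the new URLs back into the job channel, which is where the "collecting more and more links" part comes in.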
    opened by last-genius 0