🕵️Scrape multiple media providers on a cron job and dispatch webhooks when changes are detected.

Jiu

Jiu is a multi-threaded media scraper capable of juggling thousands of endpoints from different providers with unique restrictions/requirements.

Providers

Provider is the umbrella term that encapsulates all endpoints for a given domain.

For example, https://weverse.io/bts/artist and https://weverse.io/dreamcatcher/artist are 2 endpoints under the Weverse provider.

Supported providers

Priority

Priority determines how frequently an endpoint is queued for another check, on a scale from 1 to 10. Priority 1 endpoints are checked once every 2 hours, and priority 10 endpoints are checked once a week.

All new endpoints start with a priority of 5 and move up or down based on how often changes are detected. An endpoint that rarely yields changes is moved down one level, and every detected change moves the endpoint up one level.
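The priority rules above can be sketched roughly as follows. This is an illustration only, not Jiu's actual code: the `Priority` type and its method names are hypothetical, "up" is assumed to mean "toward priority 1" (checked more often), and the intervals for levels 2 through 9 are linearly interpolated because the text only states the two extremes.

```rust
use std::time::Duration;

/// Priority level on the 1..=10 scale (hypothetical type, not Jiu's own).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Priority(u8);

impl Priority {
    const MOST_FREQUENT: u8 = 1; // checked once every 2 hours
    const LEAST_FREQUENT: u8 = 10; // checked once a week
    const DEFAULT: Priority = Priority(5); // starting level for new endpoints

    /// A change was detected: move one level toward priority 1 (assumed direction).
    fn after_change(self) -> Priority {
        Priority(self.0.saturating_sub(1).max(Self::MOST_FREQUENT))
    }

    /// No recent changes: move one level toward priority 10 (assumed direction).
    fn after_no_change(self) -> Priority {
        Priority((self.0 + 1).min(Self::LEAST_FREQUENT))
    }

    /// Time between checks. Only the 2 hour and 1 week endpoints come from
    /// the text; the levels in between are interpolated as an assumption.
    fn check_interval(self) -> Duration {
        let min_secs: u64 = 2 * 60 * 60;
        let max_secs: u64 = 7 * 24 * 60 * 60;
        let steps = (Self::LEAST_FREQUENT - Self::MOST_FREQUENT) as u64;
        let level = (self.0 - Self::MOST_FREQUENT) as u64;
        Duration::from_secs(min_secs + (max_secs - min_secs) * level / steps)
    }
}

fn main() {
    let mut priority = Priority::DEFAULT;
    println!("{:?} -> every {:?}", priority, priority.check_interval());
    priority = priority.after_change();
    println!("{:?} -> every {:?}", priority, priority.check_interval());
    priority = priority.after_no_change().after_no_change();
    println!("{:?} -> every {:?}", priority, priority.check_interval());
}
```

Clamping at both ends keeps an endpoint that changes constantly at priority 1 and an idle one at priority 10, rather than letting the level run off the scale.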

Rate Limits

Each provider has a rate limit shared across the same domain to prevent bans. The limit can be customized per provider to allow higher or stricter rates and burst sizes, depending on what the API allows.
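One common way to model a per-provider limit with a burst size is a token bucket. The sketch below is an assumption about the general technique, not a description of how Jiu implements it; `TokenBucket`, its fields, and the example numbers (burst of 2, one request per second) are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Token bucket: `capacity` is the burst size, `refill_per_sec` the sustained rate.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(burst: u32, refill_per_sec: f64, now: Instant) -> Self {
        TokenBucket {
            capacity: burst as f64,
            tokens: burst as f64, // start full so an initial burst is allowed
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Returns true if a request may be sent at `now`, consuming one token.
    fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    // One bucket per provider, shared by every endpoint on that domain.
    let mut limits: HashMap<&str, TokenBucket> = HashMap::new();
    limits.insert("weverse", TokenBucket::new(2, 1.0, start));

    let bucket = limits.get_mut("weverse").unwrap();
    for offset_ms in [0u64, 0, 0, 1000] {
        let now = start + Duration::from_millis(offset_ms);
        println!("t+{}ms allowed={}", offset_ms, bucket.try_acquire(now));
    }
}
```

Passing `now` in explicitly keeps the bucket deterministic, which makes the burst and refill behavior easy to check without sleeping.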

Authorization

Anonymous requests are always preferred when possible.

There is a customizable login flow for providers that require authorization which allows logging into APIs after an authorization error, and persists additional data (such as a JWT token) to be shared across each provider during the lifetime of the process.
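The "log in after an authorization error, then reuse the persisted data" flow might look roughly like the sketch below. Everything here is hypothetical (the `Provider` trait, `Session`, `fetch_with_login`, and the `FakeProvider` stand-in); it only illustrates the control flow described above, with no real HTTP involved.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq)]
enum FetchError {
    Unauthorized,
    Other(String),
}

/// Hypothetical provider interface; not Jiu's actual trait.
trait Provider {
    fn fetch(&self, url: &str, token: Option<&str>) -> Result<String, FetchError>;
    fn login(&self) -> Result<String, FetchError>;
}

/// Data kept for a provider for the lifetime of the process (e.g. a JWT).
#[derive(Default)]
struct Session {
    token: Option<String>,
}

/// Try with whatever the session holds (anonymous at first); on an
/// authorization error, log in once, persist the token, and retry once.
fn fetch_with_login<P: Provider>(
    provider: &P,
    session: &mut Session,
    url: &str,
) -> Result<String, FetchError> {
    match provider.fetch(url, session.token.as_deref()) {
        Err(FetchError::Unauthorized) => {
            session.token = Some(provider.login()?);
            provider.fetch(url, session.token.as_deref())
        }
        other => other,
    }
}

/// Stand-in provider that rejects anonymous requests and counts logins.
struct FakeProvider {
    logins: Cell<u32>,
}

impl Provider for FakeProvider {
    fn fetch(&self, url: &str, token: Option<&str>) -> Result<String, FetchError> {
        if url.is_empty() {
            return Err(FetchError::Other("empty url".to_string()));
        }
        match token {
            Some("example-token") => Ok(format!("contents of {}", url)),
            _ => Err(FetchError::Unauthorized),
        }
    }

    fn login(&self) -> Result<String, FetchError> {
        self.logins.set(self.logins.get() + 1);
        Ok("example-token".to_string())
    }
}

fn main() {
    let provider = FakeProvider { logins: Cell::new(0) };
    let mut session = Session::default();
    let first = fetch_with_login(&provider, &mut session, "https://example.com/feed");
    let second = fetch_with_login(&provider, &mut session, "https://example.com/feed");
    println!("{:?} {:?} logins={}", first, second, provider.logins.get());
}
```

Because the token lives in the session rather than in the request, the second call reuses it and no further login is attempted.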

The login flow is reverse-engineered for providers that don't have a public API.

Juggling multiple accounts per provider is currently not supported, and probably won't be as long as your accounts aren't getting banned. If they are, you're sending too many requests and need to tighten your rate limits.

Jiu will try its best to identify itself in its requests' User-Agent header, but will submit a fake UA for providers that gate posts behind a user agent check (currently none).

Proxies

Proxies are not supported or needed.

Webhooks

Jiu can send webhooks to multiple destinations when an update for a provider is detected. An example payload:

{
  "provider": {
    "type": "weverse.artist_feed",
    "id": "14",
    "page": "https://weverse.io/dreamcatcher/artist"
  },
  "media": [
    {
      "type": "image",
      "media_url": "https://cdn-contents-web.weverse.io/user/xlx2048/jpg/8a0561f034564758b77551745d7d62c6349.jpg",
      "page_url": "https://weverse.io/dreamcatcher/artist/1666051913313967?photoId=216521736",
      "post_date": "2021-07-23T01:30:21Z",
      "reference_url": "https://weverse.io/dreamcatcher/artist/1666051913313967?photoId=216521736",
      "unique_identifier": "216521736",
      "provider_metadata": {
        "author_id": 61,
        "author_name": "지유",
        "height": 1920,
        "width": 2443,
        "thumbnail_url": "https://cdn-contents-web.weverse.io/user/mx750/jpg/8a0561f034564758b77551745d7d62c6349.jpg"
      }
    }
  ]
}

Every provider has its own provider_metadata field that may contain extra information about the image or the post it was found under. The field may also be missing. Documentation for it is a work in progress.

The unique_identifier field is unique per provider and not globally.
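Since unique_identifier is only unique within a provider, a webhook consumer that wants to deduplicate media would need a composite key. A minimal consumer-side sketch, assuming the provider's `type` string from the payload is the right scope for uniqueness (that is an assumption, and `SeenMedia` and `example.other_feed` are hypothetical):

```rust
use std::collections::HashSet;

/// Tracks media a webhook consumer has already processed.
#[derive(Default)]
struct SeenMedia {
    keys: HashSet<(String, String)>,
}

impl SeenMedia {
    /// Records the (provider type, unique_identifier) pair.
    /// Returns true if the pair had not been seen before.
    fn insert(&mut self, provider_type: &str, unique_identifier: &str) -> bool {
        self.keys
            .insert((provider_type.to_string(), unique_identifier.to_string()))
    }
}

fn main() {
    let mut seen = SeenMedia::default();
    // First delivery of an item is new.
    println!("{}", seen.insert("weverse.artist_feed", "216521736"));
    // The same item delivered again is a duplicate.
    println!("{}", seen.insert("weverse.artist_feed", "216521736"));
    // The same identifier under a different provider is a different item.
    println!("{}", seen.insert("example.other_feed", "216521736"));
}
```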

If a Discord webhook URL is detected, the payload is changed to allow Discord to display the images in the channel.

There is currently no retry mechanism for webhooks that fail to deliver successfully.

Jiu is NOT:

  • For bombarding sites like Twitter with requests to detect changes within seconds.
  • Capable of executing JavaScript with a headless browser.
  • Able to send requests to any social media site without explicit support.

Jiu IS:

  • For slowly monitoring changes in different feeds over the course of multiple hours without abusing the provider.
  • Capable of adjusting the frequency of scrapes based on how frequently the source is updated.
  • Able to send webhooks to different sites like Discord for automatic updates.
  • The lead singer of Dreamcatcher.

Usage

Copy .env.example to .env and fill out the relevant fields.

WIP

If you would like to use this project, please change the USER_AGENT environment variable to identify your crawler accurately.

Built for simp.pics
