An enhanced search engine just for Lemmy/Fediverse

Overview

Please Read

If anyone wants to help contribute to this, please feel free to reach out to me. You can find me on Lemmy, mainly at https://lemmy.world/u/marsara9, or on Discord. If you can't contribute but still want to help, feel free to raise feature requests for things you'd like to see.

Lemmy-Search

The fediverse creates some unique problems when it comes to searching. Chiefly, existing search engines can't deal with the concept that multiple servers may ultimately be hosting the same content. These same search engines also aren't aware that you only have an account on one, or maybe a select few, of these instances.

Lemmy-Search (yes, I need a better name) will uniquely search any Lemmy instance, attempt to index the entire fediverse, and then present a familiar search interface that will allow users to:

  • Choose a preferred instance, so that all links opened from the search results automatically open on that instance, where the user should already be logged in.
  • Refine searches in a way that fits the fediverse. The big search engines let you filter by a particular website, but this doesn't make sense for the fediverse. Instead you can refine your searches by:
    • Instance -- This will limit your search to just communities that were created on that particular instance.
    • Community -- You can also filter search results by just the particular community.
    • Author -- You can also just search for posts that were made by a particular user.

search results page

How it Works

For any given post that is found, all non-alphanumeric characters are removed, and a distinct list of words (anything separated by spaces) is taken from both the post title and body. When the user performs a search, a similar process is applied to the query, and all of those distinct words are then queried from the database. Posts with the highest number of matches are returned first, and then those are sorted by the total score of the post. The assumption is that the more matches there are from your query, the more relevant the post is to you, and that posts with a higher score are more trustworthy.

Note that a post that just repeats the same word over and over still counts as only a single match, the same as a post that mentions the word once.
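The matching described above can be sketched in Rust roughly as follows. This is a minimal in-memory illustration, not the project's actual code: the real engine queries a database, and the `Post` struct, the function names, the lowercasing step, and the use of score purely as a tie-breaker are all assumptions made for this example.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory post; the real project stores posts in a database.
struct Post {
    title: String,
    body: String,
    score: i64,
}

/// Remove non-alphanumeric characters and return the distinct words.
/// Lowercasing is an assumption; the README does not say matching ignores case.
fn distinct_words(text: &str) -> HashSet<String> {
    text.split_whitespace()
        .map(|word| {
            word.chars()
                .filter(|c| c.is_alphanumeric())
                .collect::<String>()
                .to_lowercase()
        })
        .filter(|word| !word.is_empty())
        .collect()
}

/// Rank posts by the number of distinct query words they contain,
/// then by post score (assumed here to act as a tie-breaker).
fn search<'a>(posts: &'a [Post], query: &str) -> Vec<&'a Post> {
    let terms = distinct_words(query);
    let mut hits: Vec<(usize, &'a Post)> = posts
        .iter()
        .filter_map(|post| {
            let words = distinct_words(&format!("{} {}", post.title, post.body));
            let matches = terms.intersection(&words).count();
            (matches > 0).then_some((matches, post))
        })
        .collect();
    // More matches first; equal match counts fall back to the higher score.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then(b.1.score.cmp(&a.1.score)));
    hits.into_iter().map(|(_, post)| post).collect()
}
```

Because each post is reduced to a set of words, a post that repeats a query word many times still contributes only one match for that word.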

Road map

For the first release I expect to have the following features:

  • Indexing will be limited to a single 'seed instance'. Now assuming that instance is federated, you should still be able to search across all of the posts that your seed instance is aware of.
  • Federated instances of that 'seed instance' will only be indexed so that opening links will work on that target instance.
  • Users can type in any search string and it will match on the contents of any Post.
    • Short words are automatically removed from the search query to help reduce false positives.
  • Preferred Instance selection. This will be limited to instances that the search engine has found as it indexes the fediverse.
  • Filtering by Instance, Community and/or Author.
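As a rough illustration of the short-word removal mentioned in the list above, the query could be filtered before matching. The cutoff `MIN_WORD_LEN` and the function name are hypothetical; the README does not state what length counts as "short".

```rust
/// Hypothetical cutoff: query words shorter than this are dropped.
const MIN_WORD_LEN: usize = 3;

/// Keep only the query words that are at least MIN_WORD_LEN characters long.
fn drop_short_words(query: &str) -> Vec<&str> {
    query
        .split_whitespace()
        .filter(|word| word.chars().count() >= MIN_WORD_LEN)
        .collect()
}
```

With this cutoff, a query such as "is it a rust crate" would be reduced to the words "rust" and "crate".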

Some ideas I'd eventually like to support (in no particular order):

  • Incorporate other fediverse type servers, including Mastodon, Kbin, etc...
  • Include comment data in the index as well.
  • Refine searches by comment authors instead of just post authors.
  • Explore other options of indexing and/or sharing data with other search engine instances. Essentially have the individual search engines participate in their own mini-fediverse. This way I can lighten the load on the actual Lemmy instances during a crawl.
  • Language selection. For now queries don't account for language at all and will just match on what you type.

Hosting your own instance

I've included a sample docker-compose.yml file that you can reference to get things started. There are no environment variables or anything else that you need to pass to the docker container, but there is a config.yml file that allows you to fine-tune the settings of the search engine and its associated crawler.

Step by Step guide

To set up your own instance or begin development, start by pulling down a copy of the docker-compose.yml file. You'll then want to edit any usernames and/or passwords, but the default values should work for development right out of the box.

One exception: you'll want to modify which tag to pull down. If you just want to stand up your own instance, refer to the Docker Tag Reference table below to see which tag you should use. However, if you want to do actual development, you'll want to uncomment the section that builds from the dockerfile, which looks something like this:

  build:
    context: ../
    dockerfile: dev.dockerfile

Next, pull down a copy of the config.yml file. If you edited any values in the docker-compose.yml file you'll want to then make the same changes here. Also make sure you place this in the volume that you've mapped to the lemmy-search service.

Finally, you'll want to pull a copy of the nginx.conf. The default configuration assumes that you have SSL certificates and are planning to host publicly as an HTTPS server. Feel free to modify this as needed. No special headers need to be passed to Nginx, but it is assumed to run at the root of the domain, i.e. not in a subpath. (I haven't actually tested running this on a subpath; it may just work.)

Assuming you have everything configured correctly, you should now just be able to call docker compose up -d and the server should start up.

Do note that crawling of your seed instance only runs at a regular interval, so you may need to wait 24 hours for the initial crawl to finish. Alternatively, you can edit mod.rs to change that interval to whatever you want, but you should keep a fairly long time between runs. If a new crawler starts while an existing one is still running, they will both start writing the same entries to the database. For development purposes there's a config property, development_mode, that enables a few quality-of-life features, including an endpoint /crawl that you can send a simple GET request to in order to start an instance of the crawler.

PLEASE try and use your own private Lemmy instance for development. This instance MUST be running on port 443 though, so it'll have to be on a separate machine or different sub-domain.

Docker Tag Reference

| Name | Details |
| --- | --- |
| vX.Y.Z | This tag will always correspond to a particular release. It won't receive any updates apart from fixes for any critical bugs that may be discovered. |
| latest | This tag will always match the master branch. It should be the most stable apart from actual releases. Note that this tag will be updated when a release goes out. |
| dev | This tag will always align with the develop branch. I cannot guarantee that everything will work on this tag as feature development is ongoing. |
| test | This is my local testing tag. It can be updated multiple times per day and may not align to any particular code in the repository. I recommend that no one use this; if you do, you do so at your own risk. |
Comments
  • Actix web server is hanging and/or crashing after some time.

    Still investigating but after some time the server just appears to hang for no reason. Internally everything keeps working but no requests are being processed.

    bug help wanted critical 
    opened by marsara9 1
  • The crawler frequently crashes while indexing

    Every time the crawler starts, it makes it through about 3000-4000 posts before encountering an error.

    We need to find a way to handle these gracefully, but at the same time they can't be skipped, unless we also want to find a way to restart the indexing process periodically. As it stands, the number of new posts being created per day exceeds the number of posts being crawled.

    bug help wanted 
    opened by marsara9 1
  • Fixing bug in crawler where it wouldn't get the right word_id for xref updates.

    mcmxci on Discord found a bug: after I updated the insertion logic to process everything in bulk, searches were only returning one search result per term, at best. This should resolve that issue.

    opened by marsara9 0
  • Communities and Authors that belong to other instances aren't formatted correctly for cross-linking.

    When clicking on community name or author name in the search results, for a community or author that belongs to an instance that isn't the user's preferred instance, the link is currently incorrect as no data about what instance that community belongs to is passed to the client.

    The SearchCommunity and SearchAuthor structs just need to be updated to include the actual instance that they belong to, and then the results page needs to link to those correctly.

    Then the UI needs to be updated to compare the owning instance to the user's preferred instance.

    • if they match the link should be formatted such as:
      • https://<preferred-instance>/c/<community-name>
    • if they don't match the link should be formatted such as:
      • https://<preferred-instance>/c/!<community-name>@<owning-instance> (the same is true for authors but use /u/ instead of /c/).
    bug help wanted good first issue critical 
    opened by marsara9 0
  • `page` query parameter for search results does nothing

    Currently the page parameter, when passed to the backend, isn't actually being used. To help speed up render times, this parameter should be incorporated into the search query somehow to limit the number of items actually returned. The number of results found should still reflect the total number of actual results, regardless of how many were returned to the client.

    bug 
    opened by marsara9 0
  • If preferred_instance= is missing, default to preferred instance stored in cookie

    Is your feature request related to a problem? Please describe.

    I'd like to be able to link users to a specific query and have it use their preferred instance for the search. E.g.

    https://search-lemmy.com/results?query=lemmyverse

    Describe the solution you'd like

    Per our discord conversation, the preferred instance is stored as a cookie.

    If the cookie is missing and the above URL is used, potentially drop them on the homepage with the query already filled(?)

    Describe alternatives you've considered

    • Prompting the user every time - Seems excessive
    • Defaulting to Lemmy.world - This is bad practice, even if everyone else is currently doing it

    Additional context

    bug enhancement good first issue 
    opened by rcmaehl 1
  • Multiple-line text is cut-off vertically

    To Reproduce: Not sure.

    Expected behavior: The letters in the first result shouldn't be missing a part of them.

    Screenshots: Cropped screenshot of search results for 'denmark'

    Desktop:

    • OS: Windows 11
    • Browser: Firefox 114.0.2
    • Version: Live instance

    Additional context https://search-lemmy.com/results?query=denmark&preferred_instance=lemmyis.fun&page=1

    bug good first issue 
    opened by krestenlaust 1
  • Search result input is hard to use on a mobile device.

    When on the search results screen on a mobile device the input bar can be rather small and hard to use.

    Currently on mobile, the search bar, search button and the preferred-instance dropdown are all on a single line. With the reduced width available on mobile, this doesn't leave much room.

    bug 
    opened by marsara9 0
  • Filters must currently cannot be at the beginning of a search string.

    opened by marsara9 0
  • Make this opt-in

    Every few weeks, someone else has the glorious idea of indexing the entire fediverse, and every few weeks, we have to debate the issue over again.

    Lack of searchability is a feature for many people (and in fact intended as such in Mastodon and its derivatives). People are migrating to the fediverse to escape corporate ecosystems where their data is harvested all the time, and for many people, lack of global search is also an anti-harassment feature, to prevent the Twitter-esque harassment wherein people will search for marginalised communities to torpedo their activism or just plain survival by means of trolling, doxxing, etc. Several people have tried to spin up search engines for the fediverse, and that has almost always ended with such instances blocked widely across the network (and many users don't care to differentiate between search engine and scraping when one can easily be used for the other).

    Of course, if a server wants to be searchable, that's their decision to make (and it makes sense for Lemmy and Kbin, being Redditlikes). But it's not a decision anyone can make for the entirety of the network. On that grounds, any such mechanism absolutely has to operate on an opt-in basis.

    help wanted 
    opened by mxamber 2
  • Need to fix tag workflow

    Currently, bumping the version in the Cargo.toml is a manual process. I'd like the Cargo.toml file to be bumped automatically whenever a tag is pushed to the master branch.

    help wanted good first issue 
    opened by marsara9 0
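For the cross-linking issue listed above (communities and authors that belong to other instances), the two link formats it describes could be produced by something like the sketch below. The `Kind` enum and `cross_link` function are hypothetical names, not part of the project, and the formats are copied from the issue text as written, including the leading `!` for remote names.

```rust
/// Whether the link points at a community (/c/) or an author (/u/).
#[derive(Clone, Copy)]
enum Kind {
    Community,
    Author,
}

/// Build a link on the user's preferred instance, following the formats
/// given in the issue: a plain name when the owning instance matches,
/// otherwise `!<name>@<owning-instance>`.
fn cross_link(kind: Kind, name: &str, owning_instance: &str, preferred_instance: &str) -> String {
    let prefix = match kind {
        Kind::Community => "c",
        Kind::Author => "u",
    };
    if owning_instance == preferred_instance {
        format!("https://{preferred_instance}/{prefix}/{name}")
    } else {
        format!("https://{preferred_instance}/{prefix}/!{name}@{owning_instance}")
    }
}
```

A plain string comparison is used for the instance check here; whether host names should be normalized first (for example, compared case-insensitively) is left open.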