An enhanced search engine just for Lemmy/Fediverse

Last update: Jul 4, 2023

Overview

Please Read

If anyone wants to help contribute to this, please feel free to reach out to me. You can find me on Lemmy, mainly at https://lemmy.world/u/marsara9, or on Discord. If you can't contribute but still want to help, feel free to raise feature requests for things you'd like to see.

Lemmy-Search

The fediverse creates some unique problems when it comes to searching. The main one is that existing search engines can't deal with the concept that multiple servers may ultimately be hosting the same content. These same search engines also aren't aware that you only have an account on one, or maybe a select few, of these instances.

Lemmy-Search (yes, I need a better name) will search any Lemmy instance, attempt to index the entire fediverse, and then present a familiar search interface with the following features:

- Users can choose a preferred instance, so that all links opened from the search results automatically open on that user's instance, where they should already be logged in.
- The big search engines let you filter by a particular website, but this doesn't make sense for the fediverse. Instead you can refine your searches by:
  - Instance -- This will limit your search to just communities that were created on that particular instance.
  - Community -- You can also filter search results by just the particular community.
  - Author -- You can also just search for posts that were made by a particular user.

How it Works

For any given post that is found, all non-alphanumeric characters are removed and a distinct list of words (anything separated by spaces) is taken from both the post title and body. When the user performs a search, a similar process is applied to the query, and all of those distinct words are then looked up in the database. Posts with the highest number of matches are returned first, and posts with the same number of matches are then sorted by their total score. The assumption is that the more words from your query a post matches, the more relevant it is to you, and that posts with a higher score are more trustworthy.

Note that a post that just contains the same word repeated over and over will still only count as a single match, exactly like a post that mentions the word once.
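The matching described above can be sketched roughly as follows. This is an in-memory illustration and not the actual crawler or database code: the `Post` struct, the function names, and the lowercasing step are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// A post as this sketch sees it; the field names are illustrative only.
pub struct Post {
    pub title: String,
    pub body: String,
    pub score: i64,
}

/// Removes every non-alphanumeric character, splits on whitespace and
/// returns the distinct words. Lowercasing is an assumption of this sketch.
pub fn distinct_words(text: &str) -> HashSet<String> {
    let cleaned: String = text
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();
    cleaned
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect()
}

/// Returns the posts that match at least one query word, ordered by the
/// number of distinct matching words (descending), then by score (descending).
pub fn rank<'a>(posts: &'a [Post], query: &str) -> Vec<&'a Post> {
    let query_words = distinct_words(query);
    let mut hits: Vec<(usize, &Post)> = posts
        .iter()
        .map(|post| {
            // Title and body words are merged into one distinct set, so a
            // repeated word is only ever counted once.
            let mut words = distinct_words(&post.title);
            words.extend(distinct_words(&post.body));
            (words.intersection(&query_words).count(), post)
        })
        .filter(|(matches, _)| *matches > 0)
        .collect();
    hits.sort_by(|a, b| b.0.cmp(&a.0).then(b.1.score.cmp(&a.1.score)));
    hits.into_iter().map(|(_, post)| post).collect()
}
```

Because each post is reduced to a set of words before counting, a post that repeats "rust" four times still contributes a single match for that word.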

Road map

For the first release I expect to have the following features:

- Indexing will be limited to a single 'seed instance'. Assuming that instance is federated, you should still be able to search across all of the posts that your seed instance is aware of.
- Federated instances of that 'seed instance' will only be indexed so that opening links will work on that target instance.
- Users can type in any search string and it will match on the contents of any post.
  - Short words are automatically removed from the search query to help reduce false positives.
- Preferred instance selection. This will be limited to instances that the search engine has found as it indexes the fediverse.
- Filtering by instance, community and/or author.
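As a rough illustration of the short-word removal mentioned in the list above, here is a minimal sketch. The cutoff of three characters and the names `MIN_WORD_LEN` and `drop_short_words` are assumptions; this README does not state what counts as "short".

```rust
/// Hypothetical cutoff: words with fewer characters than this are dropped.
pub const MIN_WORD_LEN: usize = 3;

/// Splits the query on whitespace and keeps only the words that are at
/// least `MIN_WORD_LEN` characters long.
pub fn drop_short_words(query: &str) -> Vec<&str> {
    query
        .split_whitespace()
        .filter(|word| word.chars().count() >= MIN_WORD_LEN)
        .collect()
}
```

With this cutoff, a query such as "how to fix a lemmy bug" is reduced to the words "how", "fix", "lemmy" and "bug".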

Eventually some ideas I'd like to support (in no particular order):

- Incorporate other fediverse type servers, including Mastodon, Kbin, etc.
- Include comment data in the index as well.
- Refine searches by comment authors instead of just post authors.
- Explore other options of indexing and/or sharing data with other search engine instances. Essentially have the individual search engines participate in their own mini-fediverse. This way I can lighten the load on the actual Lemmy instances during a crawl.
- Language selection. For now queries don't account for language at all and will just match on what you type.

Hosting your own instance

I've included a sample docker-compose.yml file that you can reference to get things started. There are no environment variables or anything else that you need to pass to the docker container, but there is a config.yml file that allows you to fine-tune the settings of the search engine and its associated crawler.

Step by Step guide

To set up your own instance or begin development, start by pulling down a copy of the docker-compose.yml file. You'll then want to edit any usernames and/or passwords, but the default values should work for development right out of the box.

One exception, though: you'll want to modify which tag to pull down. If you just want to stand up your own instance, refer to the table below to see which tag you should use. However, if you want to do actual development, you'll want to uncomment the section that builds from the dockerfile, which looks something like this:

  build:
    context: ../
    dockerfile: dev.dockerfile

Next, pull down a copy of the config.yml file. If you edited any values in the docker-compose.yml file you'll want to then make the same changes here. Also make sure you place this in the volume that you've mapped to the lemmy-search service.

Finally, you'll want to pull a copy of the nginx.conf. The default configuration assumes that you have SSL certificates and are planning to host publicly as an HTTPS server. Feel free to modify this as needed. No special headers need to be passed to Nginx, but it is assumed to run at the root of the domain, i.e. not in a subpath. (I haven't actually tested running this on a subpath; it may just work.)

Assuming you have everything configured correctly, you should now just be able to call docker compose up -d and the server should start up.

Do note that crawling of your seed instance is a process that only runs at a regular interval, so you may need to wait 24 hours for the initial crawl to finish. Alternatively, you can edit mod.rs to change that interval to whatever you want, but you should keep a fairly long time between runs: if a new crawler starts while an existing one is still running, they will both start writing the same entries to the database. For development purposes there's a config property, development_mode, that enables a few quality-of-life features, including an endpoint, /crawl, that you can send a simple GET request to in order to start an instance of the crawler.
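To illustrate the overlap problem, here is one way a crawl could be skipped while another is still running. This is only a sketch of the idea and not something the project does today; `CrawlGuard` and `run_exclusive` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical guard that lets at most one crawl run at a time.
pub struct CrawlGuard {
    running: AtomicBool,
}

impl CrawlGuard {
    pub const fn new() -> Self {
        Self { running: AtomicBool::new(false) }
    }

    /// Runs `crawl` unless another crawl is already in progress.
    /// Returns `true` if the crawl ran and `false` if it was skipped.
    /// Note: if `crawl` panics the flag stays set; a real implementation
    /// would want to reset it in a `Drop` guard.
    pub fn run_exclusive<F: FnOnce()>(&self, crawl: F) -> bool {
        // Atomically flip the flag from false to true; failure means
        // somebody else is already crawling.
        if self
            .running
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            return false;
        }
        crawl();
        self.running.store(false, Ordering::Release);
        true
    }
}
```

A scheduled run and a manual /crawl request could both go through the same guard, so whichever arrives second is simply skipped instead of writing duplicate entries.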

PLEASE try and use your own private Lemmy instance for development. This instance MUST be running on port 443 though, so it'll have to be on a separate machine or different sub-domain.

Docker Tag Reference

Name	Details
vX.Y.Z	This tag will always correspond to a particular release. It won't receive any updates apart from any critical bugs that may be discovered.
latest	This tag will always match the master branch. It should be the most stable apart from actual releases. Note that this tag will be updated when a release goes out.
dev	This tag will always align with the develop branch. I cannot guarantee that everything will work on this tag as feature development is on-going.
test	This is my local testing tag. It can be updated multiple times per day and may not align to any particular code in the repository. I recommend that no-one uses this, or if they do, do so at your own risk.

Comments

Actix web server is hanging and/or crashing after some time.

Still investigating but after some time the server just appears to hang for no reason. Internally everything keeps working but no requests are being processed.
bug help wanted critical

opened by marsara9 1
The crawler frequently crashes while indexing

Every time the crawler starts, it makes it through about 3000-4000 posts before encountering an error.

We need a way to handle these errors gracefully, but the affected posts can't simply be skipped, unless we also find a way to restart the indexing process periodically. As it stands, the number of new posts being created per day exceeds the number of posts being crawled.
bug help wanted

opened by marsara9 1
Fixing bug in crawler where it wouldn't get the right word_id for xref updates.

mcmxci on Discord found a bug: after I updated the insertion logic to process everything in bulk, searches were only returning one search result per term, at best. This should resolve that issue.

opened by marsara9 0
Communities and Authors that belong to other instances aren't formatted correctly for cross-linking.
When clicking on community name or author name in the search results, for a community or author that belongs to an instance that isn't the user's preferred instance, the link is currently incorrect as no data about what instance that community belongs to is passed to the client.

The SearchCommunity and SearchAuthor structs just need to be updated to include the actual instance that they belong to, and then the results page needs to link to those correctly.

Then the UI needs to be updated to compare the owning instance to the user's preferred instance.

If they match, the link should be formatted as:

https://<preferred-instance>/c/<community-name>

If they don't match, the link should be formatted as:

https://<preferred-instance>/c/!<community-name>@<owning-instance>

The same is true for authors, but use /u/ instead of /c/.
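The two link shapes described in this issue could be produced by something like the sketch below. The function name `cross_link` and its parameters are hypothetical, the URL formats are copied exactly as written in this issue, and the case-insensitive host comparison is an assumption.

```rust
/// Builds a link on the user's preferred instance.
/// `kind` is "c" for communities and "u" for authors.
pub fn cross_link(preferred: &str, owning: &str, kind: &str, name: &str) -> String {
    if preferred.eq_ignore_ascii_case(owning) {
        // The community/author lives on the preferred instance itself.
        format!("https://{preferred}/{kind}/{name}")
    } else {
        // It lives elsewhere, so the owning instance is appended.
        format!("https://{preferred}/{kind}/!{name}@{owning}")
    }
}
```

For example, a community `rust` owned by `lemmy.ml` and viewed from `lemmy.world` would be linked as `https://lemmy.world/c/!rust@lemmy.ml`.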

bug help wanted good first issue critical
opened by marsara9 0
`page` query parameter for search results does nothing

Currently the page parameter is passed to the backend but isn't actually being used. To help speed up render times, this parameter should be incorporated into the search query somehow to limit the number of items actually returned. The number of results found should still report the total number of actual results, regardless of how many were returned to the client.
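A minimal sketch of the behaviour requested here, assuming 1-based page numbers and a hypothetical page size of 50. It slices an in-memory list purely for illustration; in the real backend the limit would more likely be applied in the database query itself.

```rust
/// One page of results plus the total number of matches.
pub struct Page<T> {
    pub items: Vec<T>,
    pub total: usize,
}

/// Hypothetical page size; the issue does not specify one.
pub const PAGE_SIZE: usize = 50;

/// Returns the requested page (1-based; page 0 is treated as page 1).
/// `total` always reports the full result count, not the page length.
pub fn paginate<T: Clone>(results: &[T], page: usize) -> Page<T> {
    let start = page
        .saturating_sub(1)
        .saturating_mul(PAGE_SIZE)
        .min(results.len());
    let end = start.saturating_add(PAGE_SIZE).min(results.len());
    Page {
        items: results[start..end].to_vec(),
        total: results.len(),
    }
}
```

Requesting a page past the end yields an empty `items` list while `total` still reflects every match.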
bug

opened by marsara9 0
If preferred_instance= is missing, default to preferred instance stored in cookie

Is your feature request related to a problem? Please describe.

I'd like to be able to link users to a specific query and have it use their preferred instance for the search. E.g.

https://search-lemmy.com/results?query=lemmyverse

Describe the solution you'd like

Per our discord conversation, the preferred instance is stored as a cookie.

If the cookie is missing and the above URL is used, potentially drop them on the homepage with the query already filled(?)
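The fallback order proposed above could look roughly like this. The `Destination` enum and the `resolve` function are hypothetical and only model the decision; reading the actual query string and cookie is left out.

```rust
/// Where a request for /results should end up.
#[derive(Debug, PartialEq)]
pub enum Destination {
    /// Show results, opening links on the given instance.
    Results { query: String, preferred_instance: String },
    /// No instance known: show the homepage with the query pre-filled.
    HomeWithQuery { query: String },
}

/// Treats missing, empty and whitespace-only values the same way.
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|s| !s.is_empty())
}

/// Prefers the `preferred_instance` URL parameter, then the cookie,
/// and otherwise falls back to the homepage.
pub fn resolve(query: &str, url_param: Option<&str>, cookie: Option<&str>) -> Destination {
    match non_empty(url_param).or_else(|| non_empty(cookie)) {
        Some(instance) => Destination::Results {
            query: query.to_string(),
            preferred_instance: instance.to_string(),
        },
        None => Destination::HomeWithQuery { query: query.to_string() },
    }
}
```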

Describe alternatives you've considered

- Prompting the user every time - Seems excessive
- Defaulting to Lemmy.world - This is bad practice, even if everyone else is currently doing it

Additional context
bug enhancement good first issue

opened by rcmaehl 1
Multiple-line text is cut-off vertically
To Reproduce

Not sure.

Expected behavior

The letters in the first result shouldn't be missing a part of them.

Screenshots

Desktop:

OS: Windows 11

Browser: Firefox 114.0.2

Version: Live instance

Additional context

https://search-lemmy.com/results?query=denmark&preferred_instance=lemmyis.fun&page=1
bug good first issue
opened by krestenlaust 1
Search result input is hard to use on a mobile device.

When on the search results screen on a mobile device the input bar can be rather small and hard to use.

Currently on mobile, the search bar, search button and the preferred-instance dropdown are all on a single line. With the reduced width available on mobile, this doesn't leave much room.
bug

opened by marsara9 0
Filters currently cannot be at the beginning of a search string.

Searching for something and applying a filter such as community:[email protected] doesn't work if it is the first thing in the search string.

Example: test community:[email protected] works but community:[email protected] test does not.
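A position-independent way of pulling filters out of the search string might look like the sketch below. Only the `community:` prefix appears in this issue; the `instance:` and `author:` prefixes, the `ParsedQuery` struct and `parse_query` are assumptions made for illustration.

```rust
/// The search string split into plain terms and optional filters.
#[derive(Debug, Default, PartialEq)]
pub struct ParsedQuery {
    pub terms: Vec<String>,
    pub instance: Option<String>,
    pub community: Option<String>,
    pub author: Option<String>,
}

/// Classifies every whitespace-separated token on its own, so a filter is
/// recognised wherever it appears in the search string.
pub fn parse_query(input: &str) -> ParsedQuery {
    let mut parsed = ParsedQuery::default();
    for token in input.split_whitespace() {
        match token.split_once(':') {
            Some(("instance", value)) if !value.is_empty() => {
                parsed.instance = Some(value.to_string())
            }
            Some(("community", value)) if !value.is_empty() => {
                parsed.community = Some(value.to_string())
            }
            Some(("author", value)) if !value.is_empty() => {
                parsed.author = Some(value.to_string())
            }
            // Anything else, including unknown prefixes, is a search term.
            _ => parsed.terms.push(token.to_string()),
        }
    }
    parsed
}
```

Since each token is handled independently, a filter followed by a term and the same term followed by the filter parse to the same result.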
bug

opened by marsara9 0
Make this opt-in

Every few weeks, someone else has the glorious idea of indexing the entire fediverse, and every few weeks, we have to debate the issue over again.

Lack of searchability is a feature for many people (and in fact intended as such in Mastodon and its derivatives). People are migrating to the fediverse to escape corporate ecosystems where their data is harvested all the time, and for many people, lack of global search is also an anti-harassment feature, to prevent the Twitter-esque harassment wherein people will search for marginalised communities to torpedo their activism or just plain survival by means of trolling, doxxing, etc. Several people have tried to spin up search engines for the fediverse, and that has almost always ended with such instances blocked widely across the network (and many users don't care to differentiate between search engine and scraping when one can easily be used for the other).

Of course, if a server wants to be searchable, that's their decision to make (and it makes sense for Lemmy and Kbin, being Redditlikes). But it's not a decision anyone can make for the entirety of the network. On those grounds, any such mechanism absolutely has to operate on an opt-in basis.
help wanted

opened by mxamber 2
Need to fix tag workflow

Currently, bumping the version in the Cargo.toml is a manual process. I'd like the Cargo.toml file to be bumped automatically whenever a tag is pushed to the master branch.
help wanted good first issue

opened by marsara9 0