sonic fails to retrieve results on very simple queries #252

Closed
almosnow opened this issue May 19, 2021 · 8 comments

@almosnow

almosnow commented May 19, 2021

I ran into this after configuring and running sonic for the first time.

Using telnet, I manually push the following data:

PUSH messages default id_1 "Some sample text number one"
PUSH messages default id_2 "Some sample text number two"
PUSH messages default id_3 "Some sample text number three"
PUSH messages default id_4 "Some sample text number four"

Then I consolidate the index, just to make sure:
TRIGGER consolidate

On all these commands (using their respective channel), I get an OK reply from the sonic process.

So now, when I run some sample queries, I get results on a few words but not on others:

QUERY messages default "Some" gives back no results, which is wrong ❌
QUERY messages default "sample" gives back id_4 id_3 id_2 id_1, which is correct ✅
QUERY messages default "text" gives back no results, which is wrong ❌
QUERY messages default "number" gives back no results, which is wrong ❌
QUERY messages default "one" gives back no results, which is wrong ❌
QUERY messages default "two" gives back no results, which is wrong ❌
QUERY messages default "three" gives back no results, which is wrong ❌
QUERY messages default "four" gives back no results, which is wrong ❌

What is going on? Is this a bug or am I doing something wrong?

Thanks.

@pleshevskiy
Contributor

"Some", "text", "number", "one", "two", "three" and "four" are all stop words in English, so they are never indexed.
You can see the whole list here: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs
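
To make the effect concrete, here is a minimal sketch of stop-word filtering at index time. It is not Sonic's actual code: the stop-word set below contains only the words named in this thread (not the full eng.rs list), the whitespace tokenizer is a simplification, and the function names are made up for illustration.

```python
# Hypothetical subset of stop words: only the ones named in this thread.
STOP_WORDS = {"some", "text", "number", "one", "two", "three", "four"}


def index_terms(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def build_index(docs):
    """Map each surviving term to the list of object ids that contain it."""
    index = {}
    for obj_id, text in docs.items():
        for term in index_terms(text):
            index.setdefault(term, []).append(obj_id)
    return index


def query(index, word):
    """Return the ids indexed under `word`, or an empty list if none."""
    return index.get(word.lower(), [])


docs = {
    "id_1": "Some sample text number one",
    "id_2": "Some sample text number two",
    "id_3": "Some sample text number three",
    "id_4": "Some sample text number four",
}
index = build_index(docs)

print(query(index, "sample"))  # ['id_1', 'id_2', 'id_3', 'id_4']
print(query(index, "text"))    # []
```

Only "sample" survives the filter, so it is the only word that can ever match. The result order in this sketch is insertion order and is not meant to mirror the order Sonic returns.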

@almosnow
Author

Whoa, thanks @pleshevskiy, I missed that.

From what I see, that's a very broad list of stopwords. I found this out while debugging why a query with 'computer' was not retrieving a specific record. Now I see that 'computer' is a stop word as well.

Is there a way to make sonic use a specific list of stopwords? I could change it and recompile, of course, but maybe there's already a flag or something.

@valeriansaliou
Owner

valeriansaliou commented May 29, 2021

Yes, that's possible! The stopwords are listed here: https://github.com/valeriansaliou/sonic/tree/master/src/stopwords

Weirdly, computer is listed as one for English: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs#L210

@almosnow
Author

almosnow commented May 29, 2021

Hi @valeriansaliou! Thank you for sonic, it is a great tool.

Could you tell us a bit more about why you chose those stopwords?

I plan to use the one from MySQL's MyISAM engine, which has worked fine for me in the past. See bottom of this page: https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html

@valeriansaliou
Owner

@almosnow I extracted them from the stopword lists of existing stopwords libraries, so it's a bit weird that this one is considered a stopword. I'm sure there are more non-stopwords in there, so I'm accepting PRs from anyone who'd want to filter them out :)

@almosnow
Author

I'm willing to help with that, but what criteria should we use to discriminate stop words? For instance, I wouldn't consider 'number' a stop word, but I can see many scenarios where it could be one.

An alternative would be to ship several different lists and let the user choose between them, similar to how LANG is chosen.
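
A rough sketch of what I mean by selectable lists. Everything here is hypothetical: the list names "broad" and "minimal", their contents, and the `searchable_terms` function are invented for illustration and do not correspond to anything Sonic ships today.

```python
# Hypothetical named stop-word lists; names and contents are illustrative only.
STOPWORD_LISTS = {
    "broad": frozenset({"the", "a", "some", "number", "computer"}),
    "minimal": frozenset({"the", "a"}),
}


def searchable_terms(text, list_name="broad"):
    """Return the words of `text` that survive the chosen stop-word list."""
    stop_words = STOPWORD_LISTS[list_name]
    return [word for word in text.lower().split() if word not in stop_words]


print(searchable_terms("The computer number", "broad"))    # []
print(searchable_terms("The computer number", "minimal"))  # ['computer', 'number']
```

With the broad list nothing in that text is searchable, while the minimal list keeps 'computer' and 'number', which is the behaviour I was expecting in the first place.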

For reference, I found this: https://github.com/igorbrigadir/stopwords

@joyously

I'm working on this for a different search, and I think it works best if there is a base list like Google uses, and then a function to add more words (or translate them). That way each environment can eliminate the words that are common for them.
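
Roughly, the shape I have in mind looks like the sketch below. The `StopWords` class, its method names, and the tiny base list are all made up for this example; it is not taken from Sonic or from any existing library.

```python
class StopWords:
    """A small base list that each environment can extend with its own words."""

    def __init__(self, base):
        # Store everything lowercased so matching is case-insensitive.
        self._words = {word.lower() for word in base}

    def add(self, *words):
        """Add environment-specific stop words on top of the base list."""
        self._words.update(word.lower() for word in words)

    def filter(self, text):
        """Return the words of `text` that are not stop words."""
        return [w for w in text.lower().split() if w not in self._words]


# Hypothetical base list, kept deliberately short.
stop_words = StopWords(["the", "a", "of"])
print(stop_words.filter("The number of messages"))  # ['number', 'messages']

# A site where nearly every record mentions "messages" can exclude it locally.
stop_words.add("messages")
print(stop_words.filter("The number of messages"))  # ['number']
```

The base list stays small and uncontroversial, and words like 'number' or 'computer' only become stop words where someone explicitly adds them.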

@almosnow
Author

Thank you, forgot to close it. Best to all!
