sonic fails to retrieve results on very simple queries #252
Comments
Some, text, number, one, two, three, four are stop words in English.
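To illustrate why such queries come back empty, here is a minimal sketch of stop-word filtering applied at both index time and query time. It is not sonic's actual implementation: the `STOPWORDS` set is a tiny hypothetical subset, the function names are made up for this example, and the document texts are invented because the pushed data is not shown in the report.

```python
# Hypothetical subset for illustration only; sonic's real English list is
# much longer (see src/stopwords/eng.rs in the repository).
STOPWORDS = {"some", "text", "number", "one", "two", "three", "four"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def build_index(documents):
    """Map each surviving term to the set of object ids that contain it."""
    index = {}
    for object_id, text in documents.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(object_id)
    return index


def query(index, terms):
    """Return ids matching every surviving term; empty if none survive."""
    tokens = tokenize(terms)
    if not tokens:
        # The whole query consisted of stop words, so nothing can match.
        return set()
    results = None
    for token in tokens:
        ids = set(index.get(token, set()))
        results = ids if results is None else results & ids
    return results


# Invented documents; the original texts are not part of the report.
documents = {
    "id_1": "Some sample text number one",
    "id_2": "Some sample text number two",
    "id_3": "Some sample text number three",
    "id_4": "Some sample text number four",
}
index = build_index(documents)
```

With this sketch, `query(index, "sample")` returns all four ids, while `query(index, "text")` returns an empty set because the term is dropped before the lookup ever happens.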
Whoa, thanks @pleshevskiy, I missed that. From what I see, that's a very broad list of stopwords. I found this out while debugging why a query with 'computer' was not retrieving a specific record. Now I see that 'computer' is a stop word as well. Is there a way to make sonic use a specific list of stopwords? I could change it and recompile, of course, but maybe there's already a flag or something.
Yes! That's possible. The stopwords are listed here: https://github.com/valeriansaliou/sonic/tree/master/src/stopwords Weirdly, 'computer' is listed as one for English: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs#L210
Hi @valeriansaliou! Thank you for sonic, it is a great tool. Could you tell us a bit more about why you chose those stopwords? I plan to use the list from MySQL's MyISAM engine, which has worked fine for me in the past. See the bottom of this page: https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html
@almosnow I extracted them from the lists in existing stopwords libraries, so it's a bit weird that this one is considered a stopword. I'm sure there are more non-stopwords in there, so I am accepting PRs from anyone who'd want to filter them out :)
I'm willing to help with that, but what criteria should we use to decide what counts as a stop word? For instance, I wouldn't consider 'number' a stop word, but I can see many scenarios where it could be one. An alternative would be to encode different lists and let the user choose between them, in a similar way to LANG. For reference, I found this: https://github.com/igorbrigadir/stopwords
I'm working on this for a different search, and I think it works best if there is a base list like Google uses, and then a function to add more words (or translate them). That way each environment can eliminate the words that are common for them.
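The base-list-plus-overrides idea could be sketched roughly as follows. This is only an illustration of the approach, not an existing sonic feature: `BASE_STOPWORDS`, `make_stopwords` and `filter_terms` are hypothetical names, and the base list here is a deliberately tiny placeholder.

```python
# Placeholder base list; a real one would be longer and curated.
BASE_STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def make_stopwords(base=BASE_STOPWORDS, add=(), remove=()):
    """Build a per-environment list from a shared base plus local overrides."""
    words = set(base)
    words.update(w.lower() for w in add)
    words.difference_update(w.lower() for w in remove)
    return frozenset(words)


def filter_terms(text, stopwords):
    """Split text on whitespace and drop anything in the given stop-word list."""
    return [w for w in text.lower().split() if w not in stopwords]


# A generic deployment keeps 'computer' searchable...
generic = make_stopwords()
# ...while a hypothetical computer shop, where the word appears in nearly
# every record, chooses to treat it as noise.
computer_shop = make_stopwords(add=["computer"])
```

Here `filter_terms("the computer is on", generic)` keeps `computer`, while the same call with `computer_shop` drops it, so each environment decides what is too common to index.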
Thank you, forgot to close it. Best to all!
I ran into this after configuring and running sonic for the first time.
Using telnet, I manually push the following data:
Then I consolidate the index, just to make sure:
TRIGGER consolidate
For all of these commands (each sent on its respective channel), I get an OK reply from the sonic process.
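For anyone who would rather script this than type into telnet, the command lines can be built with a couple of small helpers. This is a sketch only: the helper names are hypothetical, the quoting rule (double quotes with backslash escapes, newlines flattened to spaces) is an assumption about the channel protocol, and the example text is invented since the pushed data is not shown here. It only formats the lines and does not open a connection.

```python
def quote(text):
    """Wrap text in double quotes for a single-line channel command.

    Assumption: inner backslashes and double quotes are backslash-escaped,
    and newlines must not appear because commands are line-based.
    """
    flat = text.replace("\r", " ").replace("\n", " ")
    return '"' + flat.replace("\\", "\\\\").replace('"', '\\"') + '"'


def push_command(collection, bucket, object_id, text):
    """Format a PUSH line for the ingest channel."""
    return f"PUSH {collection} {bucket} {object_id} {quote(text)}"


def query_command(collection, bucket, terms):
    """Format a QUERY line for the search channel."""
    return f"QUERY {collection} {bucket} {quote(terms)}"


# Invented text for illustration; not the data from the original report.
example_push = push_command("messages", "default", "id_1", "an example message")
example_query = query_command("messages", "default", "sample")
```

Each resulting line would then be sent, followed by a newline, on a channel that was started in the matching mode.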
So now, when I run some sample queries, I get results on a few words but not on others:
QUERY messages default "Some", gives back no results, which is wrong ❌
QUERY messages default "sample" gives back id_4 id_3 id_2 id_1, which is correct ✅
QUERY messages default "text", gives back no results, which is wrong ❌
QUERY messages default "number", gives back no results, which is wrong ❌
QUERY messages default "one", gives back no results, which is wrong ❌
QUERY messages default "two", gives back no results, which is wrong ❌
QUERY messages default "three", gives back no results, which is wrong ❌
QUERY messages default "four", gives back no results, which is wrong ❌
What is going on? Is this a bug or am I doing something wrong?
Thanks.