Is it possible to remove stopwords from query terms before the actual search? #74

Closed
andersonsantos opened this issue Mar 24, 2019 · 10 comments

andersonsantos commented Mar 24, 2019

Hi @valeriansaliou,
I ran some tests in Portuguese (Brazil) and in English (US) and found that if you ingest a text like:
"this is the report from our last weeks meeting"
a query for "last weeks meeting" returns nothing (because it includes a stopword), but a query for "report weeks meeting" returns the object.
If we removed the stopwords before querying, "last weeks meeting" would also return the object.

What would be the performance implications of this change?
Another option would be to add a removeStopwords + lang function to the Sonic Channel NPM module and do the query cleanup there. Any thoughts on this?

Thanks!
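
For illustration, the client-side cleanup proposed above could look roughly like the sketch below. It is only a sketch under stated assumptions: `remove_stopwords` and `ENGLISH_STOPWORDS` are hypothetical names, and the stopword sample is just the set of words the ingest log in this thread reports as stop-words, not Sonic's actual English list.

```python
# Hypothetical client-side cleanup, not part of Sonic or Sonic Channel.
# Sample list only: the words the ingest log in this thread flags as stop-words.
ENGLISH_STOPWORDS = frozenset({"this", "is", "the", "from", "our", "last"})


def remove_stopwords(query, stopwords):
    """Drop stopwords from a query while keeping the remaining word order."""
    kept = [word for word in query.split() if word.lower() not in stopwords]
    return " ".join(kept)


if __name__ == "__main__":
    print(remove_stopwords("last weeks meeting", ENGLISH_STOPWORDS))
```

A cleanup like this would have to be maintained per language in every client library.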

@valeriansaliou
Owner

Hi! Can you run Sonic in "debug" mode, issue both queries separately, and report the debug logs here?

On the stopwords removal in the libraries, that is possible but we'd be better off keeping it in Sonic itself for performance and uniformity reasons (otherwise there's added maintenance overhead on the libraries).

@andersonsantos
Author

Debug for "report weeks meeting":

(DEBUG) - got mode response: search
(DEBUG) - got channel message: QUERY messages default "report weeks meeting"
(DEBUG) - will dispatch search command: QUERY
(DEBUG) - parsed text parts (still needs post-processing): "report weeks meeting"
(DEBUG) - parsed text parts (post-processed): report weeks meeting
(DEBUG) - dispatching search query #6ekNFFON on collection: messages and bucket: default
(DEBUG) - will search for #6ekNFFON with text: report weeks meeting, limit: 10, offset: 0
(DEBUG) - detecting locale from lexer text: report weeks meeting
(DEBUG) - will detect locale for lexer safe text: report weeks meeting
(DEBUG) - lexer text is shorter than 60 characters, using the slow method
(INFO) - [slow lexer] locale detected from text: report weeks meeting (Nederlands from Latin at 0.028969013501629507/1; 0s + 0ms)
(DEBUG) - [slow lexer] trying to detect locale from stopwords instead
(DEBUG) - guessing locale from stopwords for script: Latin and text: report weeks meeting
(INFO) - kv store not in pool for collection: messages <4f744a98>, opening it
(DEBUG) - opening key-value database for collection: <4f744a98>
(DEBUG) - configuring key-value database
(DEBUG) - opened and cached kv store in pool for collection: messages (pool key: <4f744a98>)
(INFO) - fst store not in pool for collection: messages <4f744a98> / bucket: default <6aa915e1>, opening it
(DEBUG) - opening finite-state transducer graph for collection: <4f744a98> and bucket: <6aa915e1>
(DEBUG) - opened and cached fst store in pool for collection: messages (pool key: <4f744a98>/<6aa915e1>)
(DEBUG) - lexer yielded word: report
(DEBUG) - store get term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98]
(DEBUG) - got term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98] with encoded value: [7, 0, 0, 0]
(DEBUG) - got term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98] with decoded value: [7]
(DEBUG) - not enough iids were found (1/10000), completing for term: report
(DEBUG) - looking-up word in fst via 'begins': report with regex: report([\x{0000}-\x{024F}]*)
(DEBUG) - looking up for word: report in 'begins' fst stream
(DEBUG) - looking-up word in fst via 'typos': report with typo factor: 1
(DEBUG) - looking up for word: report in 'typos' fst stream
(DEBUG) - done completing results for term: report, now 1 results
(DEBUG) - got search executor iids: {7} for term: report
(DEBUG) - got search executor iid intersection: {7} for term: report
(DEBUG) - lexer yielded word: weeks
(DEBUG) - store get term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206]
(DEBUG) - got term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206] with encoded value: [7, 0, 0, 0]
(DEBUG) - got term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206] with decoded value: [7]
(DEBUG) - not enough iids were found (1/10000), completing for term: weeks
(DEBUG) - looking-up word in fst via 'begins': weeks with regex: weeks([\x{0000}-\x{024F}]*)
(DEBUG) - looking up for word: weeks in 'begins' fst stream
(DEBUG) - looking-up word in fst via 'typos': weeks with typo factor: 1
(DEBUG) - looking up for word: weeks in 'typos' fst stream
(DEBUG) - done completing results for term: weeks, now 1 results
(DEBUG) - got search executor iids: {7} for term: weeks
(DEBUG) - got search executor iid intersection: {7} for term: weeks
(DEBUG) - lexer yielded word: meeting
(DEBUG) - store get term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48]
(DEBUG) - got term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48] with encoded value: [7, 0, 0, 0]
(DEBUG) - got term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48] with decoded value: [7]
(DEBUG) - not enough iids were found (1/10000), completing for term: meeting
(DEBUG) - looking-up word in fst via 'begins': meeting with regex: meeting([\x{0000}-\x{024F}]*)
(DEBUG) - looking up for word: meeting in 'begins' fst stream
(DEBUG) - looking-up word in fst via 'typos': meeting with typo factor: 1
(DEBUG) - looking up for word: meeting in 'typos' fst stream
(DEBUG) - done completing results for term: meeting, now 1 results
(DEBUG) - got search executor iids: {7} for term: meeting
(DEBUG) - got search executor iid intersection: {7} for term: meeting
(DEBUG) - store get iid-to-oid: '3:6aa915e1:7' [3, 225, 21, 169, 106, 7, 0, 0, 0]
(INFO) - got search executor final oids: ["english:test"]
(DEBUG) - wrote response with values: PENDING (6ekNFFON)
(DEBUG) - wrote response with values: EVENT (QUERY 6ekNFFON english:test)
(INFO) - took 39ms/39418us/39418165ns to process channel message
(DEBUG) - running a tasker tick...
(DEBUG) - scanning for kv store pool items to janitor
(DEBUG) - found non-expired kv store pool item: <4f744a98>; elapsed time: 18s
(INFO) - done scanning for kv store pool items to janitor, expired 0 items, now has 1 items
(DEBUG) - scanning for fst store pool items to janitor
(DEBUG) - found non-expired fst store pool item: <4f744a98>/<6aa915e1>; elapsed time: 18s
(INFO) - done scanning for fst store pool items to janitor, expired 0 items, now has 1 items
(DEBUG) - scanning for fst store pool items to consolidate
(INFO) - done scanning for fst store pool items to consolidate (move: 0, push: 0, pop: 0)
(INFO) - ran tasker tick (took 0s + 0ms)

@andersonsantos
Author

Debug for "last weeks meeting":

(DEBUG) - got mode response: search
(DEBUG) - got channel message: QUERY messages default "last weeks meeting"
(DEBUG) - will dispatch search command: QUERY
(DEBUG) - parsed text parts (still needs post-processing): "last weeks meeting"
(DEBUG) - parsed text parts (post-processed): last weeks meeting
(DEBUG) - dispatching search query #TEL5Pqu8 on collection: messages and bucket: default
(DEBUG) - will search for #TEL5Pqu8 with text: last weeks meeting, limit: 10, offset: 0
(DEBUG) - detecting locale from lexer text: last weeks meeting
(INFO) - kv store not in pool for collection: messages <4f744a98>, opening it
(DEBUG) - opening key-value database for collection: <4f744a98>
(DEBUG) - configuring key-value database
(DEBUG) - opened and cached kv store in pool for collection: messages (pool key: <4f744a98>)
(INFO) - fst store not in pool for collection: messages <4f744a98> / bucket: default <6aa915e1>, opening it
(DEBUG) - opening finite-state transducer graph for collection: <4f744a98> and bucket: <6aa915e1>
(DEBUG) - opened and cached fst store in pool for collection: messages (pool key: <4f744a98>/<6aa915e1>)
(DEBUG) - lexer yielded word: last
(DEBUG) - store get term-to-iids: '1:6aa915e1:59bba3f0' [1, 225, 21, 169, 106, 240, 163, 187, 89]
(DEBUG) - no term-to-iids found: '1:6aa915e1:59bba3f0' [1, 225, 21, 169, 106, 240, 163, 187, 89]
(DEBUG) - not enough iids were found (0/10000), completing for term: last
(DEBUG) - looking-up word in fst via 'begins': last with regex: last([\x{0000}-\x{024F}]*)
(DEBUG) - looking up for word: last in 'begins' fst stream
(DEBUG) - looking-up word in fst via 'typos': last with typo factor: 1
(DEBUG) - looking up for word: last in 'typos' fst stream
(DEBUG) - did not get any completed word for term: last
(DEBUG) - got search executor iids: {} for term: last
(DEBUG) - got search executor iid intersection: {} for term: last
(INFO) - stop search executor as no iid was found in common for term: last
(INFO) - got search executor final oids: []
(DEBUG) - wrote response with values: PENDING (TEL5Pqu8)
(DEBUG) - wrote response with values: EVENT (QUERY TEL5Pqu8 )
(INFO) - took 9ms/9341us/9341747ns to process channel message

@andersonsantos
Author

> On the stopwords removal in the libraries, that is possible but we'd be better off keeping it in Sonic itself for performance and uniformity reasons (otherwise there's added maintenance overhead on the libraries).

I agree 100% with this <3

@valeriansaliou
Owner

valeriansaliou commented Mar 24, 2019

Thanks for the debug logs, I have all I need 👍

I'll handle this later today or next week. My guess is that the lexer treats "last" as a stopword when you ingest the text, so it never makes it into the index. If the search query is not long enough, the detected locale may be wrong, so the stopword "last" is not removed from the query and gets looked up in the search index, where it has no match.

What do you think of adding an optional language hint to the search QUERY command, which would force the lexer to use the language passed instead of the detected one? This would fix your issue.
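
The suspected mismatch can be reproduced with a toy model. This is a minimal sketch, not Sonic's implementation: the stopword samples, the `"nld"` mis-detection (borrowed from the "Nederlands" guess in the first query log), and the helpers `lex`, `ingest` and `search` are all assumptions made for illustration.

```python
# Toy model of the suspected bug; none of this is Sonic's real code or data.
STOPWORDS = {
    "eng": {"this", "is", "the", "from", "our", "last"},  # from the ingest log
    "nld": {"de", "het", "een"},  # small Dutch sample; "last" is not in it
}


def lex(text, locale):
    """Split text into terms, dropping the stopwords of the given locale."""
    return [w for w in text.lower().split() if w not in STOPWORDS[locale]]


def ingest(index, object_id, text, locale):
    """Add every non-stopword term of the text to the inverted index."""
    for term in lex(text, locale):
        index.setdefault(term, set()).add(object_id)


def search(index, text, locale):
    """Intersect the object sets of all query terms, stopping on an empty set."""
    result = None
    for term in lex(text, locale):
        ids = index.get(term, set())
        result = set(ids) if result is None else result & ids
        if not result:
            return set()
    return result or set()


index = {}
ingest(index, "testing:english:locale",
       "this is the report from our last weeks meeting", "eng")

wrong_locale = search(index, "last weeks meeting", "nld")  # "last" is kept
hinted_locale = search(index, "last weeks meeting", "eng")  # "last" is dropped
print(wrong_locale, hinted_locale)
```

With the wrong locale, "last" stays in the query, finds nothing in the index, and the intersection ends up empty. Forcing the locale used at ingestion drops "last" from the query and the object is found again.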

@andersonsantos
Author

@valeriansaliou I went ahead and logged the output for when I ingest the text:

(DEBUG) - got mode response: ingest
(DEBUG) - got channel message: PUSH messages default testing:english:locale "this is the report from our last weeks meeting"
(DEBUG) - will dispatch search command: PUSH
(DEBUG) - parsed text parts (still needs post-processing): "this is the report from our last weeks meeting"
(DEBUG) - parsed text parts (post-processed): this is the report from our last weeks meeting
(DEBUG) - dispatching ingest push in collection: messages, bucket: default and object: testing:english:locale
(DEBUG) - ingest push has text: this is the report from our last weeks meeting
(DEBUG) - detecting locale from lexer text: this is the report from our last weeks meeting
(DEBUG) - will detect locale for lexer safe text: this is the report from our last weeks meeting
(DEBUG) - lexer text is shorter than 60 characters, using the slow method
(INFO) - [slow lexer] locale detected from text: this is the report from our last weeks meeting (English from Latin at 1/1; 0s + 0ms)
(INFO) - kv store not in pool for collection: messages <4f744a98>, opening it
(DEBUG) - opening key-value database for collection: <4f744a98>
(DEBUG) - configuring key-value database
(DEBUG) - opened and cached kv store in pool for collection: messages (pool key: <4f744a98>)
(INFO) - fst store not in pool for collection: messages <4f744a98> / bucket: default <6aa915e1>, opening it
(DEBUG) - opening finite-state transducer graph for collection: <4f744a98> and bucket: <6aa915e1>
(DEBUG) - opened and cached fst store in pool for collection: messages (pool key: <4f744a98>/<6aa915e1>)
(DEBUG) - store get oid-to-iid: '2:6aa915e1:59d4519' [2, 225, 21, 169, 106, 25, 69, 157, 5]
(DEBUG) - no oid-to-iid found: '2:6aa915e1:59d4519' [2, 225, 21, 169, 106, 25, 69, 157, 5]
(INFO) - must initialize push executor oid-to-iid and iid-to-oid
(DEBUG) - store get meta-to-value: '0:6aa915e1:0' [0, 225, 21, 169, 106, 0, 0, 0, 0]
(DEBUG) - got meta-to-value: '0:6aa915e1:0' [0, 225, 21, 169, 106, 0, 0, 0, 0]
(DEBUG) - store set meta-to-value: '0:6aa915e1:0' [0, 225, 21, 169, 106, 0, 0, 0, 0]
(DEBUG) - store set oid-to-iid: '2:6aa915e1:59d4519' [2, 225, 21, 169, 106, 25, 69, 157, 5]
(DEBUG) - store set oid-to-iid: '2:6aa915e1:59d4519' [2, 225, 21, 169, 106, 25, 69, 157, 5] with encoded value: [8, 0, 0, 0]
(DEBUG) - store set iid-to-oid: '3:6aa915e1:8' [3, 225, 21, 169, 106, 8, 0, 0, 0]
(DEBUG) - store get iid-to-terms: '4:6aa915e1:8' [4, 225, 21, 169, 106, 8, 0, 0, 0]
(INFO) - got push executor stored iid-to-terms: {}
(DEBUG) - lexer did not yield word: this because: word is a stop-word
(DEBUG) - lexer did not yield word: is because: word is a stop-word
(DEBUG) - lexer did not yield word: the because: word is a stop-word
(DEBUG) - lexer yielded word: report
(DEBUG) - store get term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98]
(DEBUG) - got term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98] with encoded value: [7, 0, 0, 0]
(DEBUG) - got term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98] with decoded value: [7]
(INFO) - has push executor term-to-iids: 8
(DEBUG) - store set term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98]
(DEBUG) - store set term-to-iids: '1:6aa915e1:62f3bc6e' [1, 225, 21, 169, 106, 110, 188, 243, 98] with encoded value: [8, 0, 0, 0, 7, 0, 0, 0]
(DEBUG) - lexer did not yield word: from because: word is a stop-word
(DEBUG) - lexer did not yield word: our because: word is a stop-word
(DEBUG) - lexer did not yield word: last because: word is a stop-word
(DEBUG) - lexer yielded word: weeks
(DEBUG) - store get term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206]
(DEBUG) - got term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206] with encoded value: [7, 0, 0, 0]
(DEBUG) - got term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206] with decoded value: [7]
(INFO) - has push executor term-to-iids: 8
(DEBUG) - store set term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206]
(DEBUG) - store set term-to-iids: '1:6aa915e1:cea0a9c6' [1, 225, 21, 169, 106, 198, 169, 160, 206] with encoded value: [8, 0, 0, 0, 7, 0, 0, 0]
(DEBUG) - lexer yielded word: meeting
(DEBUG) - store get term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48]
(DEBUG) - got term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48] with encoded value: [7, 0, 0, 0]
(DEBUG) - got term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48] with decoded value: [7]
(INFO) - has push executor term-to-iids: 8
(DEBUG) - store set term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48]
(DEBUG) - store set term-to-iids: '1:6aa915e1:304fa5c0' [1, 225, 21, 169, 106, 192, 165, 79, 48] with encoded value: [8, 0, 0, 0, 7, 0, 0, 0]
(INFO) - has push executor iid-to-terms commits: [1660140654, 3466635718, 810526144]
(DEBUG) - store set iid-to-terms: '4:6aa915e1:8' [4, 225, 21, 169, 106, 8, 0, 0, 0]
(DEBUG) - store set iid-to-terms: '4:6aa915e1:8' [4, 225, 21, 169, 106, 8, 0, 0, 0] with encoded value: [110, 188, 243, 98, 198, 169, 160, 206, 192, 165, 79, 48]
(DEBUG) - wrote response with no values: OK
(INFO) - took 11ms/11337us/11337565ns to process channel message
(DEBUG) - running a tasker tick...
(DEBUG) - scanning for kv store pool items to janitor
(DEBUG) - found non-expired kv store pool item: <4f744a98>; elapsed time: 15s
(INFO) - done scanning for kv store pool items to janitor, expired 0 items, now has 1 items
(DEBUG) - scanning for fst store pool items to janitor
(DEBUG) - found non-expired fst store pool item: <4f744a98>/<6aa915e1>; elapsed time: 15s
(INFO) - done scanning for fst store pool items to janitor, expired 0 items, now has 1 items
(DEBUG) - scanning for fst store pool items to consolidate
(INFO) - done scanning for fst store pool items to consolidate (move: 0, push: 0, pop: 0)
(INFO) - ran tasker tick (took 0s + 0ms)

@valeriansaliou valeriansaliou added this to the v1.1.2 milestone Mar 24, 2019
@valeriansaliou valeriansaliou self-assigned this Mar 24, 2019
@valeriansaliou valeriansaliou added the bug Something isn't working label Mar 24, 2019
@andersonsantos
Author

> Thanks for the debug logs, I have all I need 👍
>
> I'll handle this later today or next week. My guess is that the lexer treats "last" as a stopword when you ingest the text, so it never makes it into the index. If the search query is not long enough, the detected locale may be wrong, so the stopword "last" is not removed from the query and gets looked up in the search index, where it has no match.
>
> What do you think of adding an optional language hint to the search QUERY command, which would force the lexer to use the language passed instead of the detected one? This would fix your issue.

That would be perfect!
Maybe also add the option to force/set the locale for ingestion too.

@andersonsantos
Author

Really love Sonic's concept and plan to start learning Rust just to help you guys.

@valeriansaliou
Owner

Closing this, as this issue is now handled in #75

@valeriansaliou
Owner

valeriansaliou commented Mar 29, 2019

Hi @andersonsantos! Just a heads up to let you know this has been implemented in #75 and will be released today in v1.1.9. A locale code can be passed as a hint to QUERY and PUSH as LANG(<locale>), where <locale> is an ISO 639-3 locale code, e.g. eng for English.
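
As a rough illustration, a client could build the hinted commands as in the sketch below. The command shapes are copied from the debug logs in this thread. Appending `LANG(...)` after the quoted text is an assumption about the argument position, and `build_command` is a hypothetical helper, not part of any Sonic client library.

```python
# Hypothetical helper; the position of LANG(...) after the quoted text is an
# assumption, and quote escaping is deliberately left out of this sketch.
def build_command(verb, collection, bucket, text, object_id=None, lang=None):
    """Format a QUERY or PUSH channel command with an optional locale hint."""
    if '"' in text:
        raise ValueError("text containing double quotes is not handled here")
    parts = [verb, collection, bucket]
    if object_id is not None:
        parts.append(object_id)  # PUSH carries the object identifier
    parts.append('"' + text + '"')
    if lang is not None:
        parts.append("LANG(" + lang + ")")  # ISO 639-3 code, e.g. eng
    return " ".join(parts)


query_cmd = build_command("QUERY", "messages", "default",
                          "last weeks meeting", lang="eng")
push_cmd = build_command("PUSH", "messages", "default",
                         "this is the report from our last weeks meeting",
                         object_id="testing:english:locale", lang="eng")
print(query_cmd)
print(push_cmd)
```

Passing the same locale to both PUSH and QUERY should keep stopword removal consistent between ingestion and search, which is the mismatch this issue was about.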
