In my introduction to text search in MongoDB, we had a look at the basic features. Today we’ll take a closer look at the details.
API
You may have noticed that a text search is not executed with a find() command. Instead, you call:

    db.foo.runCommand( "text", {search: "bar"} )

Remember that text search is still an experimental feature. Adding it to the implementation of the find() command would have mixed critical production code with the new text search feature. When executed via a runCommand() call, text search can be run and tested in isolation.
I expect to see a new query operator like $text or $textsearch as soon as text search is integrated with the standard find() command.
Text Query Syntax
In the previous examples we just searched for a single word. We can do more than that. Let’s have a look at the following example:
    db.foo.drop()
    db.foo.ensureIndex( {txt: "text"} )
    db.foo.insert( {txt: "Robots are superior to humans"} )
    db.foo.insert( {txt: "Humans are weak"} )
    db.foo.insert( {txt: "I, Robot - by Isaac Asimov"} )
A search for “robot” will find two documents; the same is true for “human”:
    > db.foo.runCommand("text", {search: "robot"}).results.length
    2
    > db.foo.runCommand("text", {search: "human"}).results.length
    2
When searching for multiple terms, an OR search is performed, yielding three documents in our example:
    > db.foo.runCommand("text", {search: "human robot"}).results.length
    3
I would have expected that the given search words are AND-ed not OR-ed.
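If you need AND semantics, one option is to post-filter the OR results on the client. The following is only a sketch: filterAllTerms is a hypothetical helper, not part of MongoDB, the sample array is hand-written to mirror the shape of the “results” array (scores omitted), and a case-insensitive substring check stands in for real stemming.

```javascript
// Hypothetical helper (not a MongoDB API): keep only those results whose
// text field contains every search word. A case-insensitive substring
// check is a crude stand-in for the stemming MongoDB performs.
function filterAllTerms(results, field, words) {
  return results.filter(function (r) {
    var text = String(r.obj[field]).toLowerCase();
    return words.every(function (w) {
      return text.indexOf(w.toLowerCase()) !== -1;
    });
  });
}

// Hand-written sample data shaped like the "results" array of the text command
var sampleResults = [
  { obj: { txt: "Robots are superior to humans" } },
  { obj: { txt: "Humans are weak" } },
  { obj: { txt: "I, Robot - by Isaac Asimov" } }
];

var both = filterAllTerms(sampleResults, "txt", ["human", "robot"]);
console.log(both.length); // 1
```

Only the first document mentions both words, so it is the only one that survives the filter.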
Negation
By adding a leading minus sign to a search word, you can exclude documents containing that word. Let’s say we want all documents on “robot” but none on “humans”:
    > db.foo.runCommand("text", {search: "robot -humans"})
    {
        "queryDebugString" : "robot||human||||",
        "language" : "english",
        "results" : [
            {
                "score" : 0.6666666666666666,
                "obj" : {
                    "_id" : ObjectId("50ebc484214a1e88aaa4ada0"),
                    "txt" : "I, Robot - by Isaac Asimov"
                }
            }
        ],
        "stats" : {
            "nscanned" : 2,
            "nscannedObjects" : 0,
            "n" : 1,
            "timeMicros" : 212
        },
        "ok" : 1
    }
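When the included and excluded words come from user input, the search string can be assembled programmatically. This is a minimal sketch; buildSearchString is a hypothetical helper and only covers plain words and negated words.

```javascript
// Hypothetical helper: join the wanted words and prefix every unwanted
// word with a minus sign, as the text command expects.
function buildSearchString(include, exclude) {
  var negated = (exclude || []).map(function (w) {
    return "-" + w;
  });
  return include.concat(negated).join(" ");
}

var search = buildSearchString(["robot"], ["humans"]);
console.log(search); // robot -humans
```

In the mongo shell, the resulting string would be passed as the “search” value of the text command.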
Phrase Search
By enclosing multiple words in quotes (“foo bar”), you perform a phrase search. Inside a phrase, word order matters and stop words are taken into account as well:
    > db.foo.runCommand("text", {search: '"robots are"'})
    {
        "queryDebugString" : "robot||||robots are||",
        "language" : "english",
        "results" : [
            {
                "score" : 0.6666666666666666,
                "obj" : {
                    "_id" : ObjectId("50ebc482214a1e88aaa4ad9e"),
                    "txt" : "Robots are superior to humans"
                }
            }
        ],
        "stats" : {
            "nscanned" : 2,
            "nscannedObjects" : 0,
            "n" : 1,
            "timeMicros" : 185
        },
        "ok" : 1
    }
Please have a look at the “queryDebugString” field:

    "queryDebugString" : "robot||||robots are||"
It tells us that our search string contains one stem “robot” but also the phrase “robots are”. That’s the reason we have only one hit. Compare that to these searches:
    > // order matters inside phrase
    > db.foo.runCommand("text", {search: '"are robots"'}).results.length
    0
    > // no phrase search --> OR query
    > db.foo.runCommand("text", {search: 'are robots'}).results.length
    2
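Judging only from the outputs shown in this post, the queryDebugString appears to consist of four fields separated by “||”: search terms, negated terms, phrases and negated phrases. The following sketch splits it up accordingly. parseQueryDebugString is a hypothetical helper, and the single “|” as separator between several entries of one field is an assumption that the examples here do not confirm.

```javascript
// Hypothetical helper (not part of MongoDB): split a queryDebugString into
// the four fields that the examples in this post suggest.
// Assumption: multiple entries inside one field are separated by a single "|".
function parseQueryDebugString(s) {
  var fields = s.split("||");
  function items(field) {
    return field ? field.split("|") : [];
  }
  return {
    terms: items(fields[0]),
    negatedTerms: items(fields[1]),
    phrases: items(fields[2]),
    negatedPhrases: items(fields[3])
  };
}

var phraseQuery = parseQueryDebugString("robot||||robots are||");
console.log(JSON.stringify(phraseQuery));

var negationQuery = parseQueryDebugString("robot||human||||");
console.log(JSON.stringify(negationQuery));
```

For the phrase search, this yields the term “robot” and the phrase “robots are”; for the negation example, it yields the term “robot” and the negated term “human”.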
Multi Language Support
Stemming and stop word filtering are both language dependent. If you want to use a language other than the default, which is English, you have to tell MongoDB which language to use for indexing and searching. MongoDB uses the open source Snowball stemmer, which supports a number of languages.
To use another language for indexing and searching, specify it when creating the index:
    db.de.ensureIndex( {txt: "text"}, {default_language: "german"} )
With this setting, MongoDB assumes that all text in the field “txt” and all text searches on that collection are in German. Let’s see if it works:
    > db.de.insert( {txt: "Ich bin Dein Vater, Luke." } )
    > db.de.validate().keysPerIndex["text.de.$txt_text"]
    2
As you can see, there are only two index keys, so stop word filtering did occur, this time with a German stop word list. (“Vater” is the German word for father, not a typo for Vader.) Let’s try some searches:
    > db.de.runCommand("text", {search: "ich"}).results.length
    0
    > db.de.runCommand("text", {search: "Vater"}).results.length
    1
    > db.de.runCommand("text", {search: "Luke"}).results.length
    1
Please note that we don’t have to specify the search language because it is derived from the index. We have hits for the meaningful words “Vater” and “Luke”, but not for the stop word “ich” (which means “I”).
It is also possible to mix multiple languages in the same index. Each document can have its own language:

    db.de.insert( {language:"english", txt: "Ich bin ein Berliner" } )
If a field “language” is present, its content defines the language for stemming and stop word filtering for the indexed field(s) of that document. The word “ich” is not a stop word in English, so it is indexed now.
    // default language: german -> no hits
    > db.de.runCommand("text", {search: "ich"})
    {
        "queryDebugString" : "||||||",
        "language" : "german",
        "results" : [ ],
        "stats" : {
            "nscanned" : 0,
            "nscannedObjects" : 0,
            "n" : 0,
            "timeMicros" : 96
        },
        "ok" : 1
    }

    // search for English -> one hit
    > db.de.runCommand("text", {search: "ich", language: "english"})
    {
        "queryDebugString" : "ich||||||",
        "language" : "english",
        "results" : [
            {
                "score" : 0.625,
                "obj" : {
                    "_id" : ObjectId("50ed163b1e27d5e73741fafb"),
                    "language" : "english",
                    "txt" : "Ich bin ein Berliner"
                }
            }
        ],
        "stats" : {
            "nscanned" : 1,
            "nscannedObjects" : 0,
            "n" : 1,
            "timeMicros" : 161
        },
        "ok" : 1
    }
What happened here? The default language for searching is German, so the first search has no result (as before). In the second search we explicitly ask for English text (to be more precise: for index keys that were generated with an English stemmer and English stop words). That’s why we find the famous sentence from JFK.
What does that mean? Well, you have a real multi-language text search at hand. You can store text messages from around the world in one collection and still search them by language.
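The rule for choosing the indexing language of a document can be summarised in a few lines of plain JavaScript. This is just a sketch of the behaviour described above, not MongoDB code: effectiveLanguage is a hypothetical helper, and the fallbacks to a field named “language” and to English reflect the defaults mentioned in this post.

```javascript
// Hypothetical helper: decide which language is used to stem and
// stop-word-filter a document. A per-document language field wins,
// otherwise the index's default language applies.
function effectiveLanguage(doc, indexOptions) {
  var options = indexOptions || {};
  var overrideField = options.language_override || "language";
  var fallback = options.default_language || "english";
  return typeof doc[overrideField] === "string" ? doc[overrideField] : fallback;
}

var germanIndex = { default_language: "german" };

var first = effectiveLanguage({ txt: "Ich bin Dein Vater, Luke." }, germanIndex);
var second = effectiveLanguage({ language: "english", txt: "Ich bin ein Berliner" }, germanIndex);
console.log(first);  // german
console.log(second); // english
```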
Multiple Fields
A text index can span more than one field. If you index more than one field, each field can have its own weight. That lets you give the indexed text parts of your document different importance.
    > db.mail.ensureIndex( {subject: "text", body: "text"}, {weights: {subject: 10} } )
    > db.mail.getIndices()
    [
        ...
        {
            "v" : 0,
            "key" : {
                "_fts" : "text",
                "_ftsx" : 1
            },
            "ns" : "de.mail",
            "name" : "subject_text_body_text",
            "weights" : {
                "body" : 1,
                "subject" : 10
            },
            "default_language" : "english",
            "language_override" : "language"
        }
    ]
We created a text index spanning the fields “subject” and “body”, where the first got a weight of 10 and the latter the standard weight 1. Let’s see what impact these weights have:
    > db.mail.insert( {subject: "Robot leader to minions", body: "Humans suck", prio: 0 } )
    > db.mail.insert( {subject: "Human leader to minions", body: "Robots suck", prio: 1 } )
    > db.mail.runCommand("text", {search: "robot"})
    {
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
            {
                "score" : 6.666666666666666,
                "obj" : {
                    "_id" : ObjectId("50ed1be71e27d5e73741fafe"),
                    "subject" : "Robot leader to minions",
                    "body" : "Humans suck",
                    "prio" : 0
                }
            },
            {
                "score" : 0.75,
                "obj" : {
                    "_id" : ObjectId("50ed1bfd1e27d5e73741faff"),
                    "subject" : "Human leader to minions",
                    "body" : "Robots suck",
                    "prio" : 1
                }
            }
        ],
        "stats" : {
            "nscanned" : 2,
            "nscannedObjects" : 0,
            "n" : 2,
            "timeMicros" : 148
        },
        "ok" : 1
    }
The document with “robot” in the “subject” field has a much higher score because the weight of 10 is applied as a multiplier.
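The getIndices() output above lists a weight for every indexed field, with the standard weight 1 filled in for “body”, which was not mentioned in the weights option. That defaulting rule can be sketched in plain JavaScript; effectiveWeights is a hypothetical helper for illustration, not a MongoDB function.

```javascript
// Hypothetical helper: compute the weight of every text-indexed field.
// Fields without an explicit weight get the standard weight 1.
function effectiveWeights(keys, weights) {
  var result = {};
  Object.keys(keys).forEach(function (field) {
    if (keys[field] === "text") {
      var explicit = weights && weights[field] !== undefined;
      result[field] = explicit ? weights[field] : 1;
    }
  });
  return result;
}

var mailWeights = effectiveWeights(
  { subject: "text", body: "text" },
  { subject: 10 }
);
console.log(JSON.stringify(mailWeights)); // {"subject":10,"body":1}
```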
Filtering and Projection
You can apply additional search criteria via filtering:
    > db.mail.runCommand("text", {search: "robot", filter: {prio:0} } )
    {
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
            {
                "score" : 6.666666666666666,
                "obj" : {
                    "_id" : ObjectId("50ed22621e27d5e73741fb04"),
                    "subject" : "Robot leader to minions",
                    "body" : "Humans suck",
                    "prio" : 0
                }
            }
        ],
        "stats" : {
            "nscanned" : 2,
            "nscannedObjects" : 2,
            "n" : 1,
            "timeMicros" : 185
        },
        "ok" : 1
    }
Please note that filtering does not use an index.
If you are interested in only a subset of fields, you can use projection (similar to the aggregation framework):
    > db.mail.runCommand("text", {search: "robot", project: {_id:0, prio:0} } )
    {
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
            {
                "score" : 6.666666666666666,
                "obj" : {
                    "subject" : "Robot leader to minions",
                    "body" : "Humans suck"
                }
            },
            {
                "score" : 0.75,
                "obj" : {
                    "subject" : "Human leader to minions",
                    "body" : "Robots suck"
                }
            }
        ],
        "stats" : {
            "nscanned" : 2,
            "nscannedObjects" : 0,
            "n" : 2,
            "timeMicros" : 127
        },
        "ok" : 1
    }
Filtering and projection can be combined, of course.
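A combined call simply carries both keys in the same options document. The sketch below assembles such a document; textSearchOptions is a hypothetical helper, while the keys “search”, “filter” and “project” are the ones used in the examples above.

```javascript
// Hypothetical helper: assemble the options document for the "text"
// command, adding filter and projection only when they are given.
function textSearchOptions(search, filter, project) {
  var options = { search: search };
  if (filter) {
    options.filter = filter;
  }
  if (project) {
    options.project = project;
  }
  return options;
}

var options = textSearchOptions("robot", { prio: 0 }, { _id: 0, prio: 0 });
console.log(JSON.stringify(options));

// In the mongo shell, this document would be passed as:
//   db.mail.runCommand("text", options)
```

With the two mails from above, such a query should return only the document with prio 0, reduced to its “subject” and “body” fields.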
Examples
All examples can be found on GitHub. Try them yourself.
Summary
In this second part on MongoDB text search, we had a look at the more interesting features of the text search capability. For a start, that’s quite a good toolbox for implementing your own search engine. I’m looking forward to your feedback.
Blog author
Tobias Trelle
Software Architect