
MongoDB: Close Encounters of the Third Kind

11.11.2012 | 15 minutes of reading time

With the first keystrokes of this post I am entering the third week of my MongoDB developers course at 10gen. Luckily I found another film title that fits in here, one way or the other. Writing this blog series alongside the lectures is filling up a good amount of my time at the moment, and to keep up the writing I have had to sacrifice the Python homework so far (what a pity with respect to Python). That most likely means there is hardly any chance of passing the final certification. But writing about this is helping me a great deal to get a better understanding of MongoDB and, most importantly, it is a great deal of fun at the same time. And as I am confident that some alter ego of mine from a parallel universe will skip the blogging and do the certification instead, nothing is lost in the end :). I wonder what all my other selves are doing right now, but never mind.

Let’s jump to the topic of this week’s lecture from the class and thus the topic of this blog post, which is:

Schema Design

I am looking forward to this one, especially as I understood in the beginning that MongoDB is schema-less. But it has already turned out during the class that this only means MongoDB does not require a strict, pre-defined schema. Judging from the lecture topics, this week will also be about transactions (or probably the absence of them), atomic operations and joins (the Mongo way).

“If you are coming from the world of relational databases you know that there is a best ideal way to design your schema, which is the third normal form.” – Quotes from the course

Everyone developing software with relational databases knows how much time is spent on designing the table schema. You might have some real “hardliners” who follow the Third Normal Form to the letter. Others accept some redundant data if it eases access to that data, which may also have well-considered performance reasons. Overall one can, and normally does, spend a good amount of time discussing the “optimal” schema. There is nothing bad about this; as with other design decisions in application development, these things need to be discussed. The only bad feeling that sneaks up on me every now and then is that all of this has to be considered mainly from the viewpoint of the SQL database. It is a kind of science in itself that is mostly independent of how the data is accessed later on from any application (which we still have to program in the end). And that can be quite tough, despite the fact that we have object-relational frameworks nowadays that should help us close that gap. To be honest, I sometimes have the feeling that things become worse, as it gets harder and harder to understand what is really happening.

And these thoughts fit perfectly with the following quote from the course.

“When you are not biased towards any particular access pattern you are equally bad at all of them.” – Quotes from the course

This already came to my mind during last week’s class, but now it is getting much clearer and is also addressed more directly in the lectures: with the approach offered by MongoDB, the database schema is tailored to the needs of the application and not the other way around. In the class this is called Application-Driven Schema Design. I have to slow down my enthusiasm a little bit, as I am still lacking any real-life programming experience with MongoDB (especially in a bigger project), but it sounds like exactly the right thing to do. And it also means I urgently have to do some Java programming on this. Days (and nights) are simply too short.

It should be made clear here that this does not mean that everything will be put into one big document. Inside our application the data is certainly not held in one big object without references either. But it probably means that we will do more “pre-joins” by putting data into one document that would certainly be spread across different tables in a relational database. In the course the example of a blog application is used, and the comments are stored along with the blog post inside one document structure. That is something that would never be done in SQL, but it seems quite natural on second thought.

“Occasionally – for performance reasons – we’re gonna decide that we want to duplicate the data within the document. But that’s not gonna be the default.” – Quotes from the course

As there is no construct to join data in MongoDB the way we know it from relational databases, it is the application’s responsibility to perform any “joins”. It is really a rather big mind shift (and that is probably the reason I keep repeating myself on this), but we should check more carefully for possibilities to embed data in one document. This might be more natural from an application’s point of view anyway. Of course we have to put aside our “training” in normalising database structures. Another advantage is the potential performance gain. Reading data from three different documents may require MongoDB to load data from three different data files, requiring more disk seeks. Reading one document will be faster and, most of the time, probably also easier to implement. One thing to keep in mind is of course the 16 MB limit for documents. (And maybe it is not a good idea to go from one extreme to the other anyway.)
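To make the difference between an application-side “join” and a “pre-joined” document a bit more tangible, here is a minimal sketch in plain JavaScript. It does not talk to MongoDB at all: the arrays `posts` and `comments` are hypothetical in-memory stand-ins for two collections, and `loadPostWithComments` is a helper name I made up for this illustration.

```javascript
// Hypothetical in-memory stand-ins for two collections (not a MongoDB API).
const posts = [{ _id: 1, title: "Schema Design", author: "thomas" }];
const comments = [
  { _id: 10, postId: 1, author: "kirk", text: "Fascinating." },
  { _id: 11, postId: 1, author: "spock", text: "That is my line, Captain." },
  { _id: 12, postId: 2, author: "pike", text: "Belongs to another post." },
];

// Application-side "join": one lookup for the post, a second one for its comments.
function loadPostWithComments(postId) {
  const post = posts.find((p) => p._id === postId);
  if (!post) return null;
  const postComments = comments
    .filter((c) => c.postId === postId)
    .map(({ author, text }) => ({ author, text }));
  return { ...post, comments: postComments };
}

// The "pre-joined" alternative: the comments live inside the post document,
// so a single read returns everything the page needs.
const embeddedPost = {
  _id: 1,
  title: "Schema Design",
  author: "thomas",
  comments: [
    { author: "kirk", text: "Fascinating." },
    { author: "spock", text: "That is my line, Captain." },
  ],
};

const joined = loadPostWithComments(1);
console.log(JSON.stringify(joined) === JSON.stringify(embeddedPost)); // true
```

Both variants end up with the same object in the application; the embedded one simply gets there with one read instead of two.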

Pre-Joins are the Constraints of the MongoDB world

Database constraints are another interesting topic, and I would like to start with a short anecdote from a project I worked on quite some time back. You know constraints in relational databases. They are a great way to keep your data consistent by “gluing” things together. A record in table X with foreign key Z can only exist if there is a record in table Z that has a matching id value. OK, back in that project we had a really complex and complicated database design with dozens of foreign key constraints. But one of the problems was this: if you wanted to delete something, you needed to know the exact order of all these constraints, as otherwise deleting anything would fail. In the end there were scripts that disabled all the constraints, did the deletion and then enabled them again, while we kept our fingers crossed that everything was still intact. My feelings towards the heavy use of constraints have been a bit divided since then, even though they can still come in quite handy to let the (relational) database solve certain problems for you.

“One good rule of thumb with MongoDB schema design is, that if you see yourself doing it exactly the same way you would do it in a relational design, you’re probably not taking the best approach.” – Quotes from the course

How is it then with constraints and MongoDB? Well, the simple answer is that the concept of constraints does not exist in MongoDB. Either you embed the data that belongs together, or the constraints have to be enforced programmatically in the application. My takeaway on this so far is: if data can be embedded in a meaningful way inside one document, things will get much easier than in the relational world. If for whatever reason this pre-joining of data is not possible, the danger of getting inconsistent data increases, and the implementation will probably get more complex in order to avoid it.

No Transactions

Not really anything new here by now. As presented in the class, MongoDB does not support transactions. But it does support atomic operations on single documents, which means that one will always see either all changes to a document or none of them. Again, keeping the differences between MongoDB and relational databases in mind helps a lot to put the missing transactions into perspective. In relational databases transactions are often required to ensure that updates to one data set, which is spread over several tables, are done in a consistent way. Writing the next sentence I have more and more the feeling that I have already been completely assimilated into the MongoDB world. (Mental note: scan for MongoDB nanoprobes in my bloodstream!) Back to topic :-): If all the data fits well into one document (meaning it makes sense and does not come close to the 16 MB barrier), this and the fact that operations on documents are atomic will very well replace any need for additional transaction support.

So basically there are the following proposed approaches:

  • Structure the application in a way that it can work with a single document. This is pretty much what I mentioned before.
  • Implement transactions in software if updates to different collections are required. This depends highly on the application as such, but as the word semaphore was used in the class, I had to repeat it here.
  • Tolerate a little bit of inconsistency.

The last point from the list leads me to this quote from the class and some more discussion on it.

“Just tolerate a little bit of inconsistency (…) It does not matter if everyone sees your wall update (in Facebook) simultaneously.” – Quotes from the course

As Facebook was used here as an example, it is easy for me to agree. But for a lot of other systems this might not be a real option, for different reasons. Some systems might really require consistency all the time, and in a lot of other systems it is at least believed to be required all the time. I am not sure I would like to be the one to start a discussion about “a little bit of (temporary) inconsistency” ;-). Anyway, as there are two other possibilities to achieve transactional behaviour with MongoDB, this is not a real problem either.

Data Relation

I am still looking at this through the glasses of someone used to relational databases. As such I have my experience in modelling the different relations that can occur between entities: one-to-one, one-to-many and many-to-many. With MongoDB a new learning process is obviously needed.

One-To-One Relations

Let’s take a look at an example for a one-to-one relation.


captain {
    _id  : 'JamesT.Kirk',
    name : 'James T. Kirk',
    age  : 38,
    ...
    ship : 'ussenterprise'
}

starship {
    _id     : 'ussenterprise',
    name    : 'USS Enterprise',
    class   : 'Galaxy',
    ...
    captain : 'JamesT.Kirk'
}

This is obviously an example of a one-to-one relation. A starship can only have one captain at a time, and a captain is captain of exactly one starship. In the above example the two entities are put in different collections. To be able to “join” them (inside our program, not using MongoDB) we have created references from the captain collection to the starship collection and vice versa. This is perfectly OK and might make sense depending on the additional data that is stored in the different documents. The following aspects should be considered when modelling one-to-one relations:

  • Frequency of access of the different documents.
  • Size of the items, keeping the 16 MB limit in mind.
  • Atomicity of the data and thus data consistency.

“You should pre-join your data, you should embed the data in ways that make sense for your application. For lots of different reasons and one is that it helps keeping your data intact and consistent.” – Quotes from the course

In the above example it would probably be OK to embed the captain in the starship as follows:


starship {
    _id     : 'ussenterprise',
    name    : 'USS Enterprise',
    class   : 'Galaxy',
    ...
    captain : {
        name : 'James T. Kirk',
        age  : 38,
        ...
    }
}

Looks quite natural by now, doesn’t it? Well, it probably just takes some experience to get a good feeling for when to embed.

One-To-Many Relations

One starship has many crew members. That might be a good example to start with for the one-to-many relation. Does it make sense to embed the list of an entire crew inside the starship document? Well, in principle yes, I would say, but the problem could be that when extending our example to a Borg cube we might run into the 16 MB limit. That thing can have a crew complement of up to 130,000. (It just comes to my mind that here, instead of a name attribute, a designation attribute could be used in the document, thanks to MongoDB’s flexibility.)

“When it requires two collections then it requires two collections.” – Quotes from the course

In the case of one-to-many relations there will often be a need for real linking between collections. It is advisable to link from the collection storing the many values to the collection storing the one. Thus every Borg drone knows to which cube it belongs. What I found to be a very good rule of thumb for deciding on the data structure here is the question: is this really one-to-many, or is it just one-to-few? In the latter case an array inside the document is most likely the better choice.
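Linking from the “many” side to the “one” side could look roughly like the following sketch. Again this is plain JavaScript with in-memory arrays standing in for collections; the documents, the `cube` field and the helper names `crewOf` and `cubeOf` are assumptions of mine for illustration, not something taken from the class.

```javascript
// Hypothetical stand-ins for two collections.
const cubes = [{ _id: "cube-1", name: "Borg Cube 1" }];
const drones = [
  { _id: 1, designation: "Seven of Nine", cube: "cube-1" },
  { _id: 2, designation: "Three of Five", cube: "cube-1" },
  { _id: 3, designation: "One of Two", cube: "cube-2" },
];

// Each drone stores the _id of its cube, so the cube document never grows,
// no matter how many drones are assimilated.
function crewOf(cubeId) {
  return drones.filter((d) => d.cube === cubeId);
}

// Following the link in the other direction is a lookup by _id.
function cubeOf(droneId) {
  const drone = drones.find((d) => d._id === droneId);
  if (!drone) return null;
  return cubes.find((c) => c._id === drone.cube) || null;
}

// One-to-few, by contrast: a short list can simply live inside the document.
const starship = {
  _id: "ussenterprise",
  name: "USS Enterprise",
  seniorOfficers: ["James T. Kirk", "Spock", "Leonard McCoy"],
};

console.log(crewOf("cube-1").map((d) => d.designation));
console.log(cubeOf(1).name);
console.log(starship.seniorOfficers.length);
```

Note that drone 3 points to a cube that does not exist in `cubes`; nothing in this structure prevents that, so `cubeOf(3)` returns `null`.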

Many-To-Many Relations

Another example to start with. At Starfleet Academy there are candidates and instructors. Several candidates will be assigned to one instructor and one instructor will have several candidates to train.


candidates {
    _id  : 99,
    name : 'Harry Kim',
    ...
    instructors : [1, 2]
}

instructors {
    _id  : 1,
    name : 'Tuvok',
    ...
    candidates : [99, 100]
}

In the above example we have two different collections. We have a link in both directions by having for each candidate a list of instructors and for each instructor a list of candidates.

This is MongoDB, so no one but you and your program are responsible for ensuring consistency of the documents, e.g. that there really exists another candidate where _id equals 100.
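What such a consistency check in the application could look like is sketched below. It is plain JavaScript on in-memory arrays that mirror the two example documents above; `findDanglingRefs` is a hypothetical helper of mine, and a real program would of course have to read the documents from MongoDB first.

```javascript
// The two example documents from above, as in-memory stand-ins.
const candidates = [{ _id: 99, name: "Harry Kim", instructors: [1, 2] }];
const instructors = [{ _id: 1, name: "Tuvok", candidates: [99, 100] }];

// Report every reference in sourceDocs[field] that has no matching _id
// in targetDocs. This is the check a foreign key would do for us in SQL.
function findDanglingRefs(sourceDocs, field, targetDocs) {
  const knownIds = new Set(targetDocs.map((d) => d._id));
  const dangling = [];
  for (const doc of sourceDocs) {
    for (const ref of doc[field] || []) {
      if (!knownIds.has(ref)) {
        dangling.push({ from: doc._id, field, missing: ref });
      }
    }
  }
  return dangling;
}

console.log(findDanglingRefs(instructors, "candidates", candidates));
// [ { from: 1, field: 'candidates', missing: 100 } ]
console.log(findDanglingRefs(candidates, "instructors", instructors));
// [ { from: 99, field: 'instructors', missing: 2 } ]
```

With only these two documents loaded, candidate 100 and instructor 2 are both reported as missing, which is exactly the kind of inconsistency MongoDB itself will not complain about.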

Again it helps to ask whether this is really many-to-many or just few-to-few in order to decide if embedding is an option. One should also consider whether the entities need to be able to exist independently of each other. Embedding the candidates into the instructors collection, for example (also a bad idea for other reasons), would require creating an instructor document before a candidate could be created. They could no longer exist independently of each other. Keeping them separate and only linking them together is probably the better idea here.

I guess we agree that this blog post was now far too long without any real hacking, wasn’t it? So let’s take a look at the Mongo shell and something called “Multikey Indexes”:


> db.instructors.find()
{ "_id" : 1, "name" : "Tuvok", "candidates" : [ 99, 100 ] }
{ "_id" : 2, "name" : "Spock", "candidates" : [ 99, 100 ] }
{ "_id" : 3, "name" : "Pike", "candidates" : [ 100, 101 ] }
> db.candidates.find()
{ "_id" : 99, "name" : "Harry Kim" }
{ "_id" : 100, "name" : "Tom Paris" }
{ "_id" : 101, "name" : "Seven of Nine" }
> db.instructors.ensureIndex({candidates : 1})
> db.instructors.find({candidates : {$all : [99,100]}})
{ "_id" : 1, "name" : "Tuvok", "candidates" : [ 99, 100 ] }
{ "_id" : 2, "name" : "Spock", "candidates" : [ 99, 100 ] }

What have I done here? Obviously I created two collections and added some data. Let’s look at it from the perspective of the instructors, where Tuvok and Spock both have Harry and Tom as candidates. Christopher Pike has Tom and Seven of Nine. Now we can create an index on the candidates array in the instructors collection by issuing: db.instructors.ensureIndex({candidates : 1})

“The ability to structure and express rich data is one of the things that makes MongoDB so interesting.” – Quotes from the course

The following query would definitely have worked without the index, but let’s assume we really have a lot of data. Then let’s find all the instructors having both Harry and Tom as candidates. It is good to have a little refresher on the querying syntax that was covered quite extensively last week (and in the corresponding blog post). More details on indexing will follow later in the class (that was at least promised), but I cannot resist showing you the explain() command right away.


> db.instructors.find({candidates : {$all : [99,100]}}).explain()
{
    "cursor" : "BtreeCursor candidates_1",
    "isMultiKey" : true,
    "n" : 2,
    "nscannedObjects" : 2,
    "nscanned" : 2,
    "nscannedObjectsAllPlans" : 2,
    "nscannedAllPlans" : 2,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
        "candidates" : [
            [
                99,
                99
            ]
        ]
    },
    "server" : "Thomass-MacBook-Pro.local:27017"
}

Well, it should be visible from the “cursor” : “BtreeCursor candidates_1” entry that the index we created has been used. OK, we probably have to wait for next week’s lessons to learn more about this :-).
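Until then, here is my own rough mental model of a multikey index, as a sketch in plain JavaScript. It is an assumption on my side and not how MongoDB implements it: the index is a simple Map with one entry per array element, and the `$all`-style query narrows the search with the first value only (the index bounds in the explain output mention just 99, which is what made me think of it this way) and then checks the remaining values on the documents it found. `buildMultikeyIndex` and `findAll` are names I made up.

```javascript
// The instructors collection from the shell session, as an in-memory array.
const instructors = [
  { _id: 1, name: "Tuvok", candidates: [99, 100] },
  { _id: 2, name: "Spock", candidates: [99, 100] },
  { _id: 3, name: "Pike", candidates: [100, 101] },
];

// "Multikey": every element of the array gets its own index entry
// pointing back to the document that contains it.
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of doc[field]) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

// $all-like lookup: use the index for the first wanted value,
// then verify the complete list on the documents that were found.
function findAll(docs, index, field, wanted) {
  const hits = new Set(index.get(wanted[0]) || []);
  return docs.filter(
    (doc) => hits.has(doc._id) && wanted.every((v) => doc[field].includes(v))
  );
}

const candidatesIndex = buildMultikeyIndex(instructors, "candidates");
const result = findAll(instructors, candidatesIndex, "candidates", [99, 100]);
console.log(result.map((doc) => doc.name)); // [ 'Tuvok', 'Spock' ]
```

In this toy version the entry for 99 points to documents 1 and 2, so only two documents have to be looked at, which at least matches the “nscanned” : 2 from the explain output.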

To be continued …


The MongoDB class series

Part 1 – MongoDB: First Contact
Part 2 – MongoDB: Second Round
Part 3 – MongoDB: Close Encounters of the Third Kind
Part 4 – MongoDB: I am Number Four
Part 5 – MongoDB: The Fith Element
Part 6 – MongoDB: The Sixth Sense
Part 7 – MongoDB: Tutorial Overview and Ref-Card

Java Supplemental Series

Part 1 – MongoDB: Supplemental – GRIDFS Example in Java
Part 2 – MongoDB: Supplemental – A complete Java Project – Part 1
Part 3 – MongoDB: Supplemental – A complete Java Project – Part 2
