Mongo for relatively big amounts of documents

Traffic Member

I'm planning to use MongoDB as document storage for a project I'm working on. Data will rarely be read, but will be written in large volumes (>1 million documents per day).

Does anyone here have experience with such environments? What kind of setup/servers do you have for this?

Thanks in advance.

Thanked by 1rzlosty

Comments

  • Sharding is your friend.

    Thanked by 1Traffic
  • @albertdb said:
    Sharding is your friend.

    Thanks for your reply. Yes, I was thinking of using sharding, but I was hoping to get ideas on specific configurations suitable for handling this amount of data from real life experience.

    Of course, I will stress test the system before going live, but I want to have an idea of what would be a good starting point.
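    Not from the thread, but to make the sharding suggestion concrete: the usual starting point for an insert-heavy workload like this is a hashed shard key, because range-sharding on a monotonically increasing ID sends every new write to the same shard. A small simulation (plain Python, hypothetical shard count) shows how hashing spreads sequential IDs evenly:

```python
import hashlib

# Hypothetical illustration: why a hashed shard key spreads an insert-heavy
# workload. Range-routing sequential IDs would send every new write to the
# "last" shard; hashing the ID distributes writes across all shards.
NUM_SHARDS = 4  # assumed shard count, purely for the demo

def shard_for(doc_id: int) -> int:
    # Stand-in for a hashed shard key: hash the ID, then bucket it.
    digest = hashlib.md5(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

counts = [0] * NUM_SHARDS
for doc_id in range(100_000):  # simulate 100k sequential inserts
    counts[shard_for(doc_id)] += 1

# Each shard ends up with roughly a quarter of the writes.
print(counts)
```

    The same reasoning applies whatever the real shard count turns out to be; the point is that write throughput scales with shards only if the key actually distributes the inserts.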

  • Rarely read -> disk-based database. Cassandra is probably a good choice, but it depends on the specifics.

    Thanked by 1Traffic
  • ehab Member

    I had sensors pushing tens of values per minute, 24 hours a day; with sharding and replication, I didn't have any problems. A normal CentOS server with at least 4GB of RAM is good enough.

    Thanked by 1Traffic
  • Traffic Member
    edited June 2015

    @deadbeef said:
    Rarely read -> disk-based database. Cassandra is probably a good choice, but it depends on the specifics.

    It's just information about the client performing the request. The only things I require are being able to find a record by a unique ID, and being able to search on exact field matches - although those searches would only be made from time to time, manually, so it's no problem if results reach the user with a delay.

    The data is ephemeral (30 days), so 30-60 million rows should be handled without problems.

    @ehab said:
    I had sensors pushing tens of values per minute, 24 hours a day; with sharding and replication, I didn't have any problems. A normal CentOS server with at least 4GB of RAM is good enough.

    I think this data is a bit larger than just sensor data, but your reply still helps me get an idea of the kind of hardware I'd need.
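    As a rough sanity check on the numbers in this thread (my own arithmetic; the ~1 KB per click document is a guess, not something the OP stated): 1 million documents/day over a 30-day retention window is about 30 million live documents, under 30 GiB of raw data, and an average insert rate of only around 12 writes per second:

```python
# Back-of-envelope sizing for the workload described above.
docs_per_day = 1_000_000
retention_days = 30
avg_doc_bytes = 1024            # assumed document size, not from the thread

total_docs = docs_per_day * retention_days            # live docs at steady state
total_gib = total_docs * avg_doc_bytes / (1024 ** 3)  # raw data, before indexes
writes_per_sec = docs_per_day / 86_400                # average insert rate

print(total_docs, round(total_gib, 1), round(writes_per_sec, 1))
# → 30000000 28.6 11.6
```

    Peaks will be far above that average, but the steady-state numbers back up the replies above: this workload fits comfortably on modest hardware.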

  • @Traffic said:
    The data is ephemeral (30 days), so 30-60 million rows should be handled without problems.

    Ah! Well, Mongo will do then :)

    Thanked by 1Traffic
  • I've scaled Mongo quite large but you need to think a lot about memory.

    Another option is Elasticsearch

    Thanked by 2Traffic dlaxotn2
  • deadbeef Member
    edited June 2015

    @MarkTurner said:
    Another option is Elasticsearch

    Just be careful with it, because it is not a database in the sense that it does NOT guarantee data integrity. Its docs say it cannot be used as a "source of truth", and you need to keep the data in a real database for that.

    Thanked by 2Traffic dlaxotn2
  • raindog308 Administrator, Veteran

    @joepie91 will be along shortly to explain why using Mongo is always a mistake.

    Thanked by 3Traffic netomx GM2015
  • @raindog308 said:
    @joepie91 will be along shortly to explain why using Mongo is always a mistake.

    And what is your opinion? Is it a mistake? I think there's a use case for every storage system.

  • raindog308 Administrator, Veteran

    I personally have no use for a storage system that admits unreliability.

    Thanked by 1Traffic
  • @raindog308 said:
    I personally have no use for a storage system that admits unreliability.

    And what would be your suggestion?

  • dlaxotn2 Member
    edited June 2015

    @deadbeef said:

    I wouldn't use elasticsearch as a general purpose db. It's tailored to search.
    But if you had to search, es is awesome.

    Thanked by 2deadbeef Traffic
  • joepie91 Member, Patron Provider
    edited June 2015

    raindog308 said: @joepie91 will be along shortly to explain why using Mongo is always a mistake.

    Traffic said: And what is your opinion? Is it a mistake?

    Yes.

    Mongo...

    • ... loses data (1, 2)
    • ... in fact, for a long time, ignored errors by default and assumed every single write succeeded no matter what (which on 32-bit systems led to silently losing all data past some 3GB, due to MongoDB limitations)
    • ... is slow, even at its advertised usecases, and claims to the contrary are completely lacking evidence (3, 4)
    • ... forces the poor habit of implicit schemas in nearly all usecases (4)
    • ... has locking issues (4)
    • ... is not ACID-compliant (5)
    • ... is a nightmare to scale and maintain
    • ... isn't even exclusive in its offering of JSON-based storage; PostgreSQL does it too, and other (better) document stores like CouchDB have been around for a long time (6, 7)

    ... so realistically, there's nothing it's good at, and a bunch of stuff it's outright bad at.

    Traffic said: I think there's a use case for every storage system.

    That's nonsense. There is absolutely nothing that prevents a piece of software from being objectively bad. And MongoDB is such an objectively bad piece of software - there are no usecases that aren't better solved by alternative options. It lives purely off hype.

    Traffic said: And what would be your suggestion?

    PostgreSQL, most likely. That, or Cassandra. It depends on the kind of data.
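    For the PostgreSQL route, the usual shape is a table with indexed lookup columns plus the full document stored as JSON (a jsonb column in PostgreSQL). The sketch below uses Python's stdlib sqlite3 as a stand-in so it runs anywhere; the table and column names are made up, and a real deployment would use PostgreSQL with psycopg and jsonb instead:

```python
import json
import sqlite3
import uuid

# Sketch of the relational "document store" pattern: the full click record is
# kept as a JSON blob, while the fields you need to find it by get their own
# indexed columns. sqlite3 stands in here for PostgreSQL.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE clicks (
        id      TEXT PRIMARY KEY,   -- the unique ID to find a record by
        ip      TEXT,               -- example field for exact-match search
        payload TEXT                -- the full document, as JSON
    )
""")
db.execute("CREATE INDEX idx_clicks_ip ON clicks (ip)")

record = {"ip": "203.0.113.7", "ua": "Mozilla/5.0", "path": "/landing"}
click_id = str(uuid.uuid4())
db.execute("INSERT INTO clicks VALUES (?, ?, ?)",
           (click_id, record["ip"], json.dumps(record)))

# Lookup by unique ID...
row = db.execute("SELECT payload FROM clicks WHERE id = ?",
                 (click_id,)).fetchone()
found = json.loads(row[0])

# ...and exact field match: the two operations the OP asked for.
matches = db.execute("SELECT id FROM clicks WHERE ip = ?",
                     ("203.0.113.7",)).fetchall()
```

    The design choice worth noting: anything you'll filter on gets promoted to a real indexed column, while the rest of the document stays schemaless inside the JSON blob.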

  • dlaxotn2 Member
    edited June 2015

    i agree with ^ 100%
    i think the hype comes from the fact that it was written partially in js

    Thanked by 2Traffic netomx
  • @joepie91 Thanks a lot for your insights regarding MongoDB and your advice on storage systems. It's been really useful.

    I think I now have enough information to start making trials. More input is welcome of course, and thanks a lot to everyone who posted in this thread so far.

    Thanked by 2deadbeef netomx
  • black Member

    What are you actually writing? You said "documents." If they're actual documents, then Cassandra is not a good fit for you. If it's a location of a document, or some sort of URL that points to it, then you're fine. Here are the data types supported by Cassandra: http://docs.datastax.com/en/cql/3.0/cql/cql_reference/cql_data_types_c.html . Make sure you understand the difference between NoSQL and SQL-like query languages in terms of what you're able to query.

    Cassandra is pretty fast and scales linearly. For a distributed system, it's fairly easy to manage.

    Thanked by 1Traffic
  • Traffic Member
    edited June 2015

    black said: What are you actually writing?

    I need to save the data from each click (hit), and be able to recover it later by a unique ID. The click information is an array of data.

    Traffic said: It's just information about the client performing the request. The only things I require are being able to find a record by a unique ID, and being able to search on exact field matches - although those searches would only be made from time to time, manually, so it's no problem if results reach the user with a delay.

  • black Member

    Traffic said: I need to save the data from each click (hit), and be able to recover it later by a unique ID. The click information is an array of data.

    Cassandra should be fine then.

    Thanked by 1Traffic
  • jcaleb Member

    joepie91 said: .. so realistically, there's nothing it's good at, and a bunch of stuff it's outright bad at.

    it is good for doing small experiments. not good for anything to be used in real life

    Thanked by 1Traffic
  • joepie91 Member, Patron Provider

    jcaleb said: it is good for doing small experiments. not good for anything to be used in real life

    Well, no, not even really that.

    A small experiment is generally one of two things:

    • Something that might grow out to a 'real' project; if you're using MongoDB, you're going to have a bad time.
    • Experimenting with a new technology/concept to learn about it; you're basing your learning (indirectly) on a technology that you won't be able to deploy in production.

    In both cases, you're better off using something that is either already production-ready, or likely will be production-ready in the near future. There's not really a point in experimenting with something that you can't use in production anyway.

    Thanked by 1Traffic