How to scale when database gets huge?

sandro Member
edited January 2013 in General

I'm creating a project that could get a lot of users very quickly, and I wanted to know what approach I should consider from the start.

I'm developing it with a PHP framework and MySQL, both optimized as far as my knowledge allows :P I basically want to start small, I mean running on a simple VPS, and scale to better hardware as needed, but in the end I was thinking that one of the database's tables could become a real problem. Basically there will be a table that stores activities that every user could perform 100 times per day (I'm not going into details), so considering that in the future I might have, say, 10k users doing that every day, I would get to the point of millions of new rows per day (even if each just stores 2-3 numbers). And even if this table isn't queried to retrieve data on every visit but only for INSERTs, I would still need to read it and manipulate the data... and what happens with 1 trillion rows? I'm not an expert, but I'm pretty sure that having those numbers in a table cannot be good, especially if it grows every day non-stop.
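
Just to make it concrete, here's a rough sketch of what I imagine that table looking like (names and columns are made up, the real schema isn't final):

```sql
-- Hypothetical activity table: one row per action, just a few numbers per row.
CREATE TABLE user_activity (
    id         BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED     NOT NULL,
    action     TINYINT UNSIGNED NOT NULL,  -- which of the possible activities it was
    amount     INT              NOT NULL DEFAULT 0,
    created_at DATETIME         NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```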

So my questions are:

- Is there another database approach I should start with? Something in front of MySQL made for large tables, if there is such a thing?
- How do big websites handle these things? Do they store "part" of the rows on multiple servers?
- Is it OK to use a simple BIGINT primary key for this table, considering that it's not unlimited? I read that using varchars as a primary key slows down the database significantly.
- Could services like AWS do this for me transparently?

Thank you.


Comments

  • They use replication and sharding. As in, you'll have to build clusters of them.

  • This is an interesting topic, but please note that something like this always has to be tailored to the specific project, since there is no bulletproof solution that works everywhere.

    - Is there another database approach I should start with? Something in front of MySQL made for large tables, if there is such a thing?

    Theoretically there are Cassandra, Hadoop and some others, but Cassandra at least was designed to run on hundreds of nodes :)
    I'm sure that for your project some sharding and clever replication would help a lot.

    - How do big websites handle these things? Do they store "part" of the rows on multiple servers?

    See my previous answer :)
    But some years ago I was like that too and even wanted to use Cassandra for small projects (since they could have gotten a lot of users, though they never did). All of this is sooo much work. Just concentrate on your app (it makes no sense to specialize in scaling while the app doesn't have many features yet), and when the time comes that scaling is necessary, there are ways to do it.

    - Could services like AWS do this for me transparently?

    A service like AWS only helps if you need a lot of resources very quickly, otherwise dedicated servers are always cheaper and more powerful.

    If you'd like to talk about all of that regarding your app, you can send me a PM so I can take a look at it :)

  • Hi!
    I read something about Cassandra and Hadoop... do they just replicate, or do they also load balance? To be honest, I can't quite understand what they actually do :) If and when the time comes, will I need to change my code too (basically to interact with Cassandra)?

  • @sandro said: I'm creating a project that could get a lot of users very quickly, and I wanted to know what approach I should consider from the start.

    What's important is to **know** how to scale; that way, if/when the time comes, you'll be prepared. Until then, hang on tight with a LEB!

    There are two ways of scaling MySQL: horizontally and vertically.
    Scaling horizontally involves adding more servers; scaling vertically involves moving to a bigger server.
    Scaling horizontally only makes sense when you have a very large data set and you're maxing out a large-ish dedicated server.

    With that in mind, I'd personally stick to scaling vertically until you absolutely have to scale horizontally.
    I'd also recommend using an alternative MySQL-compatible server, such as MariaDB or Percona; they'll offer significant performance increases.

    There are services that offer managed MySQL; Xeround comes to mind immediately, but be aware that it's incredibly expensive.
    Since you're posting on this forum and you're developing something, I'll assume you're entirely capable of learning a bit of MySQL server management :)

  • @sandro said: When and if the time comes will I need to change my code too (basically to interact with Cassandra)?

    If you actually use MySQL, then yes, since the whole architecture is completely different. But as I said, with clever SQL replication, sharding and load balancing you can already do a lot (e.g. WordPress.com uses MySQL replication across several datacenters).

    do they just replicate or also load balance?

    It depends on the configuration: replication should be handled by the DB itself, while load balancing can simply be done by HAProxy.
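
    For example, classic MySQL master/slave replication is set up on the replica side with something like this (host, credentials and binlog coordinates are placeholders; the master also needs server-id and log_bin set in my.cnf first):

    ```sql
    -- Run on the replica (MySQL 5.x-era syntax); all values are placeholders.
    CHANGE MASTER TO
        MASTER_HOST     = 'db-master.example.com',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS  = 4;
    START SLAVE;

    -- Check replication health afterwards:
    SHOW SLAVE STATUS\G
    ```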

    But it's a big topic... just focus on your app; when the time for scaling comes, it can be handled by people who specialize in this area :)

  • Premature optimization kills you... just as @Amfy mentioned, focus on your app and the rest will fall into place.

    I have an app that has over 3000 users. I started out worrying about "what ifs" but I finally got through it and focused on delivering a tool that many people love to use. I'm nowhere near having to add a second DB server. Using the right tool from the get-go definitely helps. For example, I was going to store photos and docs on the local filesystem. I ended up using MongoDB's GridFS feature and haven't looked back.

  • Oh, now I get that Cassandra is actually a different thing from MySQL.

    I agree with you all. I'm not pre-optimizing; I'm just asking whether I should use "something" that helps me scale without redoing everything from scratch. But I guess that in the end I just want to know what my options are when the time comes, as @ElliotJ suggested.

    Replication + load balancing seems interesting, but would that really solve the problem? I mean, is MySQL (or similar) ready to query a table with trillions of rows? Couldn't the query itself become problematic, is there a "limit"? In my mind, even if I scale vertically I'll reach a point where more hardware can't do anything :)
    Maybe sharding is the only effective solution, but even there... BIGINT as a primary key has its limit (OK, we're talking about a HUGGEEEEE number, but still limited :) )

    Or maybe I should just archive that table periodically, summarize the totals per user in a temp table and start a new one. I just can't find a reason why keeping a quadrillion rows all together could be any good; it just looks like bad practice... but anyway, out of curiosity, how long does it take to query a table with a quadrillion rows? Do companies really do this (along with sharding of course), keeping tables so big?
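
    Something like this is what I have in mind (just a sketch, table and column names invented):

    ```sql
    -- Roll old rows up into a per-user totals table
    -- (assumes activity_totals has user_id as its primary key)...
    INSERT INTO activity_totals (user_id, total)
    SELECT user_id, COUNT(*)
    FROM user_activity
    WHERE created_at < '2013-01-01'
    GROUP BY user_id
    ON DUPLICATE KEY UPDATE total = total + VALUES(total);

    -- ...then move those rows to an archive table and drop them from the hot one.
    INSERT INTO user_activity_archive
    SELECT * FROM user_activity WHERE created_at < '2013-01-01';

    DELETE FROM user_activity WHERE created_at < '2013-01-01';
    ```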

    So do you really recommend starting with an alternative MySQL server? Do they use the same language? Can I move to one later with no problems?

  • I think you're massively overestimating the size of your database. You're never going to have trillions of records, at least not at your estimate of 100 records per user per day, even with 10 million users. MySQL can easily handle millions and millions of rows if it's configured properly (proper indexes, enough memory given to it, etc). Once you get beyond that, sharding is pretty much the only solution (at least that I know of). And by the time you run out of BIGINTs to use for primary keys, there will be a BIGGERINT or something... actually, you'll likely be long gone.

  • mikho Member, Host Rep

    Don't know about quadrillions of rows, but I once did a job for a company that saved everything in a log table.
    Some updated indexes worked wonders, but it's hard to say anything specific without knowing what the table looks like and what info you will write to/read from it.

  • thekreek Member
    edited January 2013

    Not the best way to do it, but some time ago I remember reading about splitting your database into multiple databases.

    This way you could set up multiple MySQL servers, each holding a portion of your database and with different resources.

    If one database is getting too big you can move it to another server; of course, your code will have to support this as well.

    Original database: (project.db)

    • users
    • blog-pages
    • blog-posts
    • posts-comments
    • sources

    Alternative way:

    • users-database
    • blog-posts-database
    • blog-comments-database
    • extra-data-database

    Each database has its own tables according to the project.
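
    As a rough illustration (database names are just examples), each functional area gets its own schema, and while they still share a server you can even query across them with database-qualified names:

    ```sql
    -- One database per functional area (example names).
    CREATE DATABASE users_db;
    CREATE DATABASE blog_db;
    CREATE DATABASE activity_db;

    -- While they live on the same server, cross-database joins still work:
    SELECT u.name, COUNT(p.id) AS posts
    FROM users_db.users AS u
    JOIN blog_db.posts  AS p ON p.user_id = u.id
    GROUP BY u.id, u.name;

    -- Once one database outgrows the box, move it to its own server and point
    -- the application's connection settings for that database at the new host.
    ```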

  • So the inserts are real-time while the rest is batch?

    If so, just archive the data to another (slower) server/database/table to do your queries. 100 million simple rows is nothing to worry about.

  • sandro Member
    edited January 2013

    @thekreek great suggestion, I never thought about it! Even though the big stress really comes from one table only. But yeah, it's another quick way to spread the load.

    @abaxas basically yes, the inserts on that table are performed on the fly whenever the user does a specific activity, while counting the TOTAL activity this user has performed is something I'd like to make non-real-time. Think about the Twitter follower count: you might want to show increments in real time down to the last digit while the number is in the hundreds, but there's no point in recounting on every visit once the total is in the millions... you would show the total in "thousands increments", like 1.2M and 1.3M.
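
    In other words (just a sketch, names invented), the pages would read a small per-user counter table that a cron job refreshes in batches, instead of COUNT(*)-ing the big table on every visit:

    ```sql
    -- Hypothetical counter table the app reads instead of counting live.
    CREATE TABLE user_activity_counts (
        user_id    INT UNSIGNED    NOT NULL PRIMARY KEY,
        total      BIGINT UNSIGNED NOT NULL DEFAULT 0,
        updated_at DATETIME        NOT NULL
    );

    -- Refreshed periodically by a cron job, not on every page view
    -- (could also be done incrementally rather than recounting everything).
    REPLACE INTO user_activity_counts (user_id, total, updated_at)
    SELECT user_id, COUNT(*), NOW()
    FROM user_activity
    GROUP BY user_id;
    ```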

    What I'd like to know is: does counting a result of 100 rows put the same stress on the system as counting a result of 1 million rows from the same table?

    @NickM sorry, it's 100 rows per hour. Of course it's all hypothetical.

  • mikho Member, Host Rep

    It depends on the query and the setup.
    Any filtering might need to go through every record in the table to get what you need.

  • It's the same filtering, just the user_id is different.

  • mikho Member, Host Rep
    edited January 2013

    You want to get the latest 100 entries by a user?

    Without proper indexing of the table, it would need to go through all records in that table. Every time.
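
    For example, for "the latest 100 entries by a user" a composite index lets MySQL read just that user's slice instead of scanning the whole table (table and column names assumed from the discussion above):

    ```sql
    -- Composite index so per-user, time-ordered lookups don't touch every row.
    ALTER TABLE user_activity ADD INDEX idx_user_created (user_id, created_at);

    -- This query can then be answered from a small index range, newest first:
    SELECT *
    FROM user_activity
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 100;

    -- EXPLAIN shows whether the index is actually being used.
    ```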

  • For MySQL

    • Lots of Memory for various caching i.e. memcached assistance
    • SSD/SSD PCI-E based disk I/O scalability i.e. Intel S3700 SSD, SLC SSD based Virident FlashMax PCI-E device which supports replaceable NAND modules in a raid 7 design or SLC SSD based Fusion ioDrive2 Duo PCI-E device
    • Sharding i.e. scalebase http://www.scalebase.com/products/product-architecture/
    • Better scalable MySQL versions like MariaDB 5.5/10.x and Percona 5.5/5.6 and eventually Oracle MySQL 5.6 etc
    • MySQL replication, Tungsten MySQL replication, Galera Cluster, Oracle MySQL Cluster etc

    just some ideas in general :)

  • I still prefer text documents for storage. I have my big webapp running, reading from text documents. It's much easier doing HA with text documents than with MySQL.

  • raindog308 Administrator, Veteran

    I attended an Oracle conference a year ago in which one presentation was titled "The Trillion-Row Table". It was by a guy from CERN.

    I'm skeptical you're going to have that much data...

  • Not to thread-hijack, but if I'm serving files from 2 KB to 200 MB from a local filesystem, would it be smarter to store them as blobs in the database? I'm just thinking ahead to when I plan to expand and replicate the database to another server.

  • sandro Member
    edited January 2013

    @winston said: Not to thread-hijack, but if I'm serving files from 2 KB to 200 MB from a local filesystem, would it be smarter to store them as blobs in the database? I'm just thinking ahead to when I plan to expand and replicate the database to another server.

    Why do you need a database to replicate files between servers?

  • @sandro said: Why do you need a database to replicate files between servers?

    Easier to back up, and eventually serve the files in a sort of CDN

  • mikho Member, Host Rep

    +1 NickM
    Bad practice

  • dragontamer Member
    edited January 2013

    @NickM said: I think you're massively overestimating the size of your database. You're never going to have trillions of records, at least not at your estimate of 100 records per user per day, even with 10 million users. MySQL can easily handle millions and millions of rows if it's configured properly (proper indexes, enough memory given to it, etc). Once you get beyond that, sharding is pretty much the only solution (at least that I know of). And by the time you run out of BIGINTs to use for primary keys, there will be a BIGGERINT or something... actually, you'll likely be long gone.

    Even ignoring that, when you get into trillions of rows, if you have a small string and a few BIGINTs per row, you need to build a database system that can support maybe ~50 TB.

    Just managing that server (cluster), no matter what server you use, is going to be a royal PITA. You need some serious experts to even design systems that can support that amount of data... let alone the expertise you'd need to design that kind of a database.

    Not to thread-hijack, but if I'm serving files from 2 KB to 200 MB from a local filesystem, would it be smarter to store them as blobs in the database? I'm just thinking ahead to when I plan to expand and replicate the database to another server.

    Why not just rsync the data between two servers? Simple master/slave "replication" going on there. Set a cron job to rsync the data every few minutes and then load balance between the two. All writes go to the "master" server, which is eventually replicated to the slaves.

    BTW, a bit of database theory for ya. Pick two of the three: Consistency, Availability, Partition tolerance (i.e. the CAP theorem). The system above would be Available and Partition tolerant, but not Consistent. It is mathematically impossible to have all three, so you've got to sacrifice something.


    That said, I think blobs, while extremely inefficient compared to a filesystem, make it easier to do server backups. When everything is in a database, you can just mysqldump and save off the data.

    When you have your data in a filesystem, you gotta keep the filesystem backup in sync with the SQL backup. You gotta set your group permissions correctly, gotta worry about directory traversals / security problems, etc.

    There is a definite case for simplicity by using SQL Blobs.

  • @sandro:

    Let's start with the low hanging fruit.

    and what happens with 1 trillion rows?

    Don't even try 1 trillion rows. Once you get into the tens of millions of rows of data in a table you are asking for problems. The data gets big on disk and slow to manipulate (doing updates, compacting, running checks against it, etc.).

    You have to figure out what is current and needed for the user. No need to store all the user data in the live table space.

    100 times per day (I'm not going into details), so considering that in the future I might have, say, 10k users doing that every day, I would get to the point of millions of new rows per day (even if each just stores 2-3 numbers)

    Are these 100 points of new data each time, or are they data that get changed, like, say, status, location, etc.?

    10k users x 100 new inserts = 1,000,000, i.e. 1 million pieces of data per day.

    60 seconds x 60 minutes x 24 hours = 86,400 seconds per day

    1,000,000 / 86,400 = 11.57 inserts per second.

    That's nothing to handle. Peak times, though, could be 100-1000 times higher. That gets more complicated.

    To start, you need to develop a database schema to deal with the data, i.e. a layout. These 100 points per user need a space in the live data tables, and older data needs to be swept out to a lesser-used table space in the same database. Do joins on the user ID, which should be one of the keys in both tables.
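
    A minimal sketch of that layout (names are illustrative): a small "live" table and an archive table with the same shape, both carrying the user's ID so either can be joined back to the users table:

    ```sql
    -- Live table holds only recent activity; the archive holds swept-out history.
    CREATE TABLE activity_live (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id    INT UNSIGNED    NOT NULL,
        value      INT             NOT NULL,
        created_at DATETIME        NOT NULL,
        KEY (user_id)
    );

    CREATE TABLE activity_archive LIKE activity_live;

    -- Joins work the same against either table, always on user_id:
    SELECT u.username, a.value, a.created_at
    FROM users AS u
    JOIN activity_archive AS a ON a.user_id = u.id
    WHERE u.id = 42;
    ```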

    Is it OK to use a simple BIGINT primary key for this table, considering that it's not unlimited? I read that using varchars as a primary key slows down the database significantly.

    No.

    Typically we just use INT for primary keys, especially in the case of users, where we often have an internal auto-incrementing number for each one.

    Everyone tends to have a finite number of users. The users' activity, however, is where the blowup happens and lots of records pile up. In that table you would also typically use the user's INT key to match the activity to the user.

    I read that using varchars as a primary key slows down the database significantly.

    Probably. A varchar primary key tends to be a place where people plug in generated UUID values; most folks throw similar on-the-fly, semi-secure sorts of IDs into varchars like that. Of course, comparisons are slower because you're dealing with a much longer varchar ID instead of a compact integer. It also blows up the database size unnecessarily.

    You HAVE to think about that big fat daily-data issue after, say, 90 days of storing data. Create a sweep routine, re-compress the table routinely when that happens, and have a means to match the old data back to the user when and if necessary.

  • @pubcrawler said: You have to figure out what is current and needed for the user. No need to store all the user data in the live table space.

    I know, I still have to figure out how to do this though.

    @pubcrawler said: Are these 100 points of new data or are the data that gets changes, like say status, location, etc.?

    No they're just data to store and they'll never be changed again.

    @pubcrawler said: Everyone tends to have a finite number of users. However the activity of the users is where the blowup happens and lots of records. In that table you also typically would use the users INT value key to match the activity.

    I didn't mean for the user id, but a BIGINT to distinguish every single activity from another. I doubt that not having an index/primary key is good for performance.

    @pubcrawler said: To start you need to develop a database schema to deal with the data. That's a layout. These points of 100 per user, they need a space in live data tables and older data needs swept out to a lesser used table space in same database. Do joins on the user ID which should be one of the keys in both tables.

    Yes, that's what I want to do. The thing is, I will have to determine which data has to go to the live table and for which user. To know this I have to query the BIG ONE many times per day. But I guess when that time comes I'll be able to afford more servers to deal with that.

  • I currently have a DB with about 200M records and 300 queries/s.
    Optimization was needed. After several trials, the main things I did were:
    1) split the main table(s) into 100 small tables (if the app allows it)
    2) try to have the tables read-only or write-only (separate I/O ops)
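
    A sketch of the splitting in 1) (purely illustrative; how you split depends on the app): the big table becomes N identically-shaped tables and the application picks one via user_id modulo N:

    ```sql
    -- One of 100 identically-shaped tables; the app computes user_id % 100
    -- and reads/writes activity_00 .. activity_99 accordingly.
    CREATE TABLE activity_00 (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id    INT UNSIGNED    NOT NULL,
        value      INT             NOT NULL,
        created_at DATETIME        NOT NULL,
        KEY (user_id, created_at)
    );
    -- ...activity_01 through activity_99 are created the same way.

    -- A user with id 4217 maps to 4217 % 100 = 17, so the app queries:
    SELECT * FROM activity_17 WHERE user_id = 4217 ORDER BY created_at DESC LIMIT 100;
    ```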

  • +1 for hostingwized_net's approach.

    You could have trillions of records when segmenting.

    Read-only tables are great. Write tables should be looked at almost like swap space: they require regular checksumming, compaction, etc.

    The read-only tables need the same checksumming and compaction, but only after bulk inserts (i.e. after moving data en masse from the write table over to the read-only table).
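
    In MySQL terms that maintenance is roughly the following (table name is just an example):

    ```sql
    -- After a bulk move into the read-only table: verify it and compact it.
    CHECKSUM TABLE activity_archive;
    ANALYZE TABLE  activity_archive;  -- refresh index statistics
    OPTIMIZE TABLE activity_archive;  -- reclaim space / defragment after big inserts or deletes
    ```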

    I'd take that even further where possible and put the read space and write space on different drives.

    I prefer to build bulk "pig" tables like this on SSDs. All of these improvements make a big difference in aggregate.

  • No they're just data to store and they'll never be changed again.

    That is a read-only table then. Sweep that data from the live write table to a read-only table every x hours or y days. Keep the write table small in size.

    I typically do the data move-overs on a set schedule, every 90 days in most instances. A simple query collects the data from the write table and inserts it into the read-only table, based simply on the current date minus 90 days.

    Inserts can take time, so we grab the low and high record IDs on the primary key. Then, once the data is imported into the read table, we just delete the records between that low ID and high ID from the write table. Simple as pie.
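
    A sketch of that sweep, assuming an auto-increment primary key `id` and a `created_at` column (names are illustrative):

    ```sql
    -- 1) Find the ID range of rows older than 90 days in the write table.
    SELECT MIN(id) INTO @lo FROM activity_live WHERE created_at < NOW() - INTERVAL 90 DAY;
    SELECT MAX(id) INTO @hi FROM activity_live WHERE created_at < NOW() - INTERVAL 90 DAY;

    -- 2) Copy that range into the read-only table.
    INSERT INTO activity_archive
    SELECT * FROM activity_live WHERE id BETWEEN @lo AND @hi;

    -- 3) Delete the same range from the write table, keeping it small.
    DELETE FROM activity_live WHERE id BETWEEN @lo AND @hi;
    ```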

    I doubt that not having an index/primary key is good for performance.

    IDs should be INT unless there's a compelling reason not to. The ID in your case appears to be the user/member/customer ID. All records related to the user should bear that same ID as a field for comparison/matching.

    Hope that helps.
