hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows the server becomes quite slow and the wait/buffer time climbs above 7000 ms/sec, which is dragging down our complete setup of various microservices that read, write and delete from the tables continuously. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.

the thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare batches against the SQL table. The other culprit is that our server runs on an HDD, which at 150 MB/s read and write is probably at its limit.

the question is: is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?

edit: I know that my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic itself is a lifetime of know-how by itself. Thanks for any feedback.

  • RagingHungryPanda@lemm.ee · 4 days ago

    yes? maybe, depending on what you mean.

    Let’s say you’re doing a job and that job will involve reading 1M records or something. Pagination means you grab N records at a time, say 1000, across multiple queries as the job progresses.

    Reading your post again to try and get context, it looks like you’re identifying duplicates as part of a job.

    I don’t know what you’re using to determine a duplicate, whether it’s structural or not, but since you’re running on HDDs, it might be faster to get that information into RAM, then do the job in batches and update in batches. This also lets you write to the DB while doing CPU processing.

    BTW, your hard disks are going to be your bottleneck unless you’re reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That’ll blow any other suggestion I have out of the water.

    BUT! there are ways to help things out. I don’t know what language you’re working in. I’m a dotnet dev, so I can answer some things from that perspective.

    A couple of things you may want to do, especially if there’s other traffic on this server:

    • use WITH (NOLOCK) so that you’re not blocking other reads and writes on the tables you’re looking at
    • use pagination, with windowing or OFFSET/FETCH (SQL Server’s equivalent of LIMIT/SKIP), to grab only a certain number of records at a time (both shown in the sketch below)
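
    For example, here’s a minimal sketch of a paginated batch read with a NOLOCK hint. It assumes Dapper and Microsoft.Data.SqlClient, and the dbo.Products table, its columns, and the Product record are made-up names:

        // a minimal sketch: page through a table in fixed-size batches,
        // reading with NOLOCK since dirty reads are acceptable here
        using Dapper;
        using Microsoft.Data.SqlClient;

        async Task<List<Product>> ReadBatchAsync(string connString, long offset, int limit)
        {
            const string sql = @"
                SELECT Id, Sku, Hash
                FROM dbo.Products WITH (NOLOCK)
                ORDER BY Id
                OFFSET @Offset ROWS FETCH NEXT @Limit ROWS ONLY;";

            await using var conn = new SqlConnection(connString);
            var rows = await conn.QueryAsync<Product>(sql, new { Offset = offset, Limit = limit });
            return rows.ToList();
        }

        record Product(long Id, string Sku, string Hash);

    One caveat: OFFSET pagination gets slower the deeper you page into a big table, since the engine still has to skip all the earlier rows; keyset pagination (WHERE Id > @lastId ORDER BY Id) stays fast at any depth.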

    Use a HashSet (this works well if you have record types) or some other method of equality that’s property-based. Most Dictionary/HashSet types can take some kind of equality comparer, like the sketch below.
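
    A minimal sketch of a property-based comparer (Product and the fields that define a “duplicate” are placeholders):

        // two rows with the same Sku + Hash count as the same product here;
        // which properties define a "duplicate" is up to your data model
        class ProductEqualityComparer : IEqualityComparer<Product>
        {
            public bool Equals(Product? x, Product? y) =>
                x is not null && y is not null && x.Sku == y.Sku && x.Hash == y.Hash;

            public int GetHashCode(Product p) => HashCode.Combine(p.Sku, p.Hash);
        }

        // usage: Add() returns false for anything the set has already seen
        var seen = new HashSet<Product>(new ProductEqualityComparer());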

    So, what you can do is asynchronously read from the disk into memory and start some kind of processing job. If that job doesn’t also require the disk, you can kick off another read while you’re processing. Don’t do a write and a read at the same time, since you’re on HDDs.

    This might look something like:

        var offset = 0;
        const int limit = 1000;

        var task = readBatchFromDb(offset, limit);
        var result = await task;

        // if you only care about the equality and not the data after use,
        // you can just store the hash codes instead of the full records
        var data = new HashSet<YourType>(new YourTypeEqualityComparer());

        while (!result.IsEmpty)
        {
            offset += limit;
            task = readBatchFromDb(offset, limit); // start the next read; it runs while we do CPU work

            var dataToWork = result.Where(r => !data.Contains(r)).ToList(); // don't rework objects we've already seen
            data.UnionWith(result);

            var dataToWrite = doYourThing(dataToWork);

            // don't write while reading
            result = await task;

            await writeToDb(dataToWrite); // one disk operation at a time; the lost optimization is that no CPU work happens during this write
        }
    
    
    
        // Let's say you can set up a read/write job queue to keep the disk busy
        abstract class IoJob
        {
            public sealed class ReadJob(int offset, int limit) : IoJob
            {
                public int Offset { get; } = offset;
                public int Limit { get; } = limit;
                public Task<Data>? ReadTask { get; set; }
            }

            public sealed class WriteJob(Data data) : IoJob
            {
                public Data Data { get; } = data;
                public Task? WriteTask { get; set; }
            }
        }

        IoJob executeJob(IoJob job) => job switch
        {
            IoJob.ReadJob rj => readBatchFromDb(rj), // let's say this assigns rj.ReadTask and returns the job
            IoJob.WriteJob wj => writeToDb(wj),      // likewise assigns wj.WriteTask and returns the job
            _ => throw new ArgumentOutOfRangeException(nameof(job))
        };

        Queue<IoJob> jobs = new(); // a queue rather than a stack, so jobs run in the order they're enqueued

        jobs.Enqueue(new IoJob.ReadJob(offset, limit));
        jobs.Enqueue(new IoJob.ReadJob(offset + limit, limit)); // get the second read ready to start

        executeJob(jobs.Peek()); // start the first job's IO

        while (jobs.TryDequeue(out var job))
        {
            // kick off the next job so the disk stays busy while we do CPU work
            if (jobs.TryPeek(out var next)) executeJob(next);

            if (job is IoJob.ReadJob rj)
            {
                var result = await rj.ReadTask!;
                if (result.IsEmpty) continue;

                // queue the read that follows the one already in flight
                jobs.Enqueue(new IoJob.ReadJob(rj.Offset + 2 * rj.Limit, rj.Limit));

                var dataToWork = result.Where(r => !data.Contains(r)).ToList(); // don't rework objects we've already seen
                data.UnionWith(result);

                jobs.Enqueue(new IoJob.WriteJob(doYourThing(dataToWork)));
            }
            else if (job is IoJob.WriteJob wj)
            {
                await wj.WriteTask!; // make sure the write finishes before the next disk job starts
            }
        }
    
    
    • PoisonedPrisonPanda@discuss.tchncs.de (OP) · 4 days ago

      BTW, your hard disks are going to be your bottleneck unless you’re reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That’ll blow any other suggestion I have out of the water.

      Yes, we are currently in the process of migrating to PostgreSQL and to new hardware. Nonetheless, the approach we are using is a disaster, so we will refactor it as well. Appreciate your input.

      I don’t know what language you’re working in.

      All processing and SQL-related transactions are executed via Python. But that shouldn’t have any influence, since the SQL server is the bottleneck.

      WITH (NOLOCK)

      Yes, I have already considered this for the next update, since our setup can accept dirty reads - but I have not tested/quantified any benefit yet.

      Don’t do a write and a read at the same time since you’re on HDDs.

      While I understand the underlying issue here, I do not know yet how to control this, since we have multiple microservices connected to the DB which either fetch (read), write or delete from different tables. But to my understanding, since I am currently not using NOLOCK, such conflicts should be handled by SQL, no? What I mean is that during a transaction the object is locked, so no other process can interfere with the SQL object?

      Thanks for putting this together, I will review it again tomorrow (Y).

      • RagingHungryPanda@lemm.ee · 4 days ago

        Thanks for giving it a good read-through! If you’re getting onto NVMe SSDs, you may find some of your problems just go away. The difference can be insane.

        I was reading something recently about databases/disk layouts meant for transactional business applications vs ones meant for reporting, and one difference was that on disk the data was laid out either by row or by column.