
  • MS SQL is trash. Running anything on an HDD is a joke.

    Thank you. I convinced management to migrate to modern hardware, and we are switching to PostgreSQL while also refactoring our design and approach.

    You read write and compare continuously? Did you try to split it into smaller chunks?

    Locks are handled by SQL Server. But yes, multiple tables are read and written, and the final table is compared against, by multiple requests/transactions (connections?) simultaneously. Splitting into smaller chunks would still mean the query loops through the whole table, just in chunks? How would that help with simultaneous transactions?
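
    If I understand the chunking suggestion correctly, it would look roughly like this (table and column names are hypothetical, PostgreSQL-flavoured syntax since that is where we are migrating): each pass handles a bounded slice and remembers where it stopped instead of one query walking the whole table.

        -- Hypothetical sketch: keyset-paginated chunks over a "products" table
        -- with a monotonically increasing primary key "id".
        SELECT id, name
        FROM products
        WHERE id > :last_seen_id   -- parameter: highest id from the previous chunk
        ORDER BY id
        LIMIT 10000;               -- chunk size, tuned to the workload

        -- The application stores the highest id it processed and feeds it back in
        -- as :last_seen_id. Each chunk is a short transaction, so locks are held
        -- briefly and concurrent transactions interfere with each other less.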

  • A hot DB should not run on HDDs. Slap some nvme storage into that server if you can. If you can’t, consider getting a new server and migrating to it.

    Did this because of the convincing replies in this thread. We are migrating to modern hardware and replacing SQL Server with PostgreSQL (because it is already used by the other system we work with, and there is know-how available in this domain).

    You should avoid scanning an entire table with a huge number of rows when possible, at least during requests.

    But how can we then ensure that we are not adding/processing products which are already in the "final" table, when we have no knowledge of ALL the products in this final table?

    Create an index and a table constraint on the relevant columns. ... just so that the DB can do the work for you. The DB is better at enforcing constraints than you are (when it can do so).

    This is helpful and also matches what I experienced. At the peak of the overload period the CPU load was pretty much zero - all the time was spent on disk reads/writes. Which came down to our poor query design/architecture. (See the sketch at the end of this comment.)

    For read-heavy workflows, consider whether caches or read replicas will benefit you.

    Could you elaborate on what you mean by read replicas? Storage in memory?

    And finally back to my first point: read. Learn. There are no shortcuts. You cannot get better at something if you don’t take the time to educate yourself on it.

    Yes, I will swallow the pill. But thanks to the replies here I now have many starting points.

    RTFM is nice - but starting with page 0 is overwhelming.
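
    Coming back to the index/constraint suggestion above: if I understand it correctly, it would be something like this (table and column names are hypothetical).

        -- Hypothetical sketch: let the database enforce deduplication.
        -- Assumes the "final" table is keyed by an external product_id column.
        ALTER TABLE final_products
            ADD CONSTRAINT final_products_product_id_key UNIQUE (product_id);

        -- The constraint is backed by an index, so "does this product already
        -- exist?" becomes an index lookup instead of a scan of the whole table.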

  • With MSSQL, the first thing you should check is your indexes. You should have indexes on commonly queried fields and any foreign keys. It’s the best place to start because indexing alone can often make or break database performance.

    Indexing is the most common answer. But foreign keys, to my understanding (I apologize if this is wrong), would mean splitting the data into separate tables all related by this key, right? What would be the difference compared to splitting the columns of a table into multiple tables, all related by a mutual column, let's say "id"? (Rough sketch below.)
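
    To check my understanding of foreign keys versus splitting tables, a minimal sketch with hypothetical "countries" and "products" tables: the foreign key is just a constraint on an existing column, and the index on that column is created separately.

        -- Hypothetical sketch: a foreign key does not split the data; it only
        -- declares that products.country_id must reference an existing country.
        CREATE TABLE countries (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );

        CREATE TABLE products (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            country_id INTEGER NOT NULL REFERENCES countries (id)
        );

        -- The foreign key alone does not index products.country_id;
        -- adding one makes joins and lookups on that column fast.
        CREATE INDEX idx_products_country_id ON products (country_id);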

  • So yes. Stack Overflow is going to tell you to RTFM. Because someone needs to sit down with this mess, determine the pros and cons of the system design, and figure out where to start overhauling.

    Yes, that's me. But thanks to the numerous replies to this thread, I now have a clearer picture of the culprits and where to start.

    The tradeoffs you mentioned are exactly why we are in this mess. In the beginning, with no knowledge, we thought certain measures would help us, but it turned out those poor decisions led us in the wrong direction.

    Thank you for the reply.

  • To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

    This can be the new slogan of our development. :')

    I have convinced management to switch to a modern server. In addition, we hope that refactoring our approach (no random reads, no dedupe passes over the whole table, etc.) will lead us somewhere.

    As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

    Actually, no. We are adding a product-processing layer to a system that is already in production and already handles multiple millions of products daily. Since we not only have to process new/updated products but also have to catch up on the historical (older) products, it is a massive amount of data. We thought that, since order is not important, we could use a random approach to catch up. But I see now that this is a major bottleneck in our design.

    If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

    So no, no narrowing.

    Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

    Also no, IMO. Since we don't want a product to be processed twice, we want to ensure deduplication, and that requires knowledge of all already processed products; therefore we compare against the whole table every time. (The separate-drives idea is sketched below anyway.)
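
    For my own reference, the separate-drives idea would apparently look something like this in PostgreSQL (the mount point, table and columns are made up):

        -- Hypothetical sketch: put a rarely needed detail table on its own drive.
        -- '/mnt/disk2/pgdata' is a made-up mount point for a second physical disk.
        CREATE TABLESPACE detail_disk LOCATION '/mnt/disk2/pgdata';

        CREATE TABLE product_details (
            product_id       BIGINT PRIMARY KEY,
            long_description TEXT,
            raw_payload      JSONB
        ) TABLESPACE detail_disk;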

  • First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

    Currently about ~50. But around 30 of them are the result of splitting one dataset by a common column like "country". In the beginning I assumed this amounts to the same thing as partitioning one large table? (See the sketch at the end of this comment.)

    Also, look at your slowest queries

    The individual queries don't take long because of the query itself, but because of the limitations of the HDD: SQL Server reads as much as possible from disk to go through a table, and given that there are multiple connections all querying multiple tables, this overloads the server. While I now see the issue with our approach, I hope that migrating from SQL Server to PostgreSQL on modern hardware, plus refactoring our approach in general, will give us a boost.

    They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query?

    Actually no JOIN. The most "complex" query is an INSERT INTO guarded by a WHERE NOT EXISTS check.

    But thank you for your advice. I will incorporate the tips in our new design approach.
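
    If I understand partitioning correctly, our 30 per-country tables could become something like this (hypothetical columns, PostgreSQL syntax):

        -- Hypothetical sketch: one logical table, partitioned by country.
        -- PostgreSQL routes rows to the right partition automatically and can
        -- prune partitions at query time.
        CREATE TABLE products (
            id         BIGINT      NOT NULL,
            country    TEXT        NOT NULL,
            name       TEXT        NOT NULL,
            updated_at TIMESTAMPTZ NOT NULL,
            PRIMARY KEY (country, id)
        ) PARTITION BY LIST (country);

        CREATE TABLE products_de PARTITION OF products FOR VALUES IN ('DE');
        CREATE TABLE products_fr PARTITION OF products FOR VALUES IN ('FR');
        -- ... one partition per country, instead of 30 hand-maintained tables.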

  • First of all, many thanks for the bullet points. Good to have some guidance on where to start.

    2nd level cache shared between services

    I have read about this in relation to how FB does it. In general this means fetching from the DB and keeping it in memory to work with, right? So we accept that the cached data may be outdated to some extent?

    faster storage/cpu/ram faster storage/cpu/ram faster storage/cpu/ram

    I was able to convince management to put money into a new server (SSD, thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL, and refactoring the whole approach and design in general.

    generate indexes

    How would indexes help me when I want to ensure that no duplicate row is added? Is this some sort of internal SQL constraint, or what is the difference compared to comparing a given list of rows against an existing table (let's say on a column "id")? (Rough sketch below.)
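
    So if I follow, with a unique index/constraint on the id column the insert itself can reject duplicates, roughly like this (hypothetical table names, PostgreSQL syntax, building on the constraint sketched earlier in this thread):

        -- Hypothetical sketch: a UNIQUE constraint on product_id makes the
        -- database reject duplicates itself; no comparison of the incoming batch
        -- against the whole final table in application code.
        INSERT INTO final_products (product_id, name, price)
        SELECT product_id, name, price
        FROM staging_products
        ON CONFLICT (product_id) DO NOTHING;  -- each check is an index lookup, not a scan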

  • Sort of harsh approach, but I get it.

    Yeah. To me it feels like we used a powertool as a hammer. Brute force in the wrong way.

    As an update: I was able to convince my people to migrate to a modern server; along with that we are also switching from SQL Server to PostgreSQL. During this migration we are also trying to refactor our workflow, since it was flawed by design.

    So, many thanks for the input.

  • Realistically, this setup is 10 years too old

    Thanks for this input. This was the winning argument for my boss to migrate to a modern server. While I admit I see many flaws in our design, we are now working on refactoring our architecture and approach as well.

    Thanks also to the numerous other answers pointing me in the right direction (hopefully).

  • Why aren’t you de-duping the table before processing?

    I need to dedupe the to-be-processed data against the data that's already in the "final" table. We are working with hundreds of millions of products, therefore we thought about "simply" using random batches from the data to be processed. But thanks to the many replies I've already learned that our approach was wrong from the beginning.

  • Yes. The problem is, we are fetching products from an API, and since we thought processing power would be a limiting factor, we assumed that sorting out duplicates would reduce load.

    But since the different microservices which process the data take different amounts of time, we are using the SQL tables as a work pool. This should help with upscaling by using multiple microservices. (A rough sketch of the table-as-pool pattern is below.)

    Cloud services are not yet a solution, as we are still in development.
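
    From what I've read, the table-as-pool idea is usually done with something like this in PostgreSQL so that workers don't grab the same rows (table and column names are hypothetical):

        -- Hypothetical sketch: each worker claims a small batch from a shared pool.
        -- FOR UPDATE SKIP LOCKED lets concurrent workers skip rows that another
        -- worker has already claimed, so they do not block each other.
        WITH claimed AS (
            SELECT id
            FROM work_pool
            WHERE status = 'pending'
            ORDER BY id
            LIMIT 100
            FOR UPDATE SKIP LOCKED
        )
        UPDATE work_pool w
        SET status = 'processing'
        FROM claimed c
        WHERE w.id = c.id
        RETURNING w.id, w.payload;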

  • That's true. When I tried it I even found it more difficult because of the limits. You must trust the scripts, and if something does not work it's a lot more complicated than setting it up yourself.