The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.
Ever wondered how much work might go into the seemingly small and relatively vague bullet points on the update list? In particular, the ones related to network optimizations?
Well, either way, I’m here to tell you a little about it. It’s actually a very big and pretty complicated topic, even though the results can be described in relatively simple terms.
As you all probably know, Satisfactory is a game about managing and handling tons of buildings and resources. This creates a unique problem that most multiplayer games don’t have to handle, as they only need to keep track of a few other players (or nowadays, about 100 other players) and whatever comes out the other end of their gun. That is very different from handling a base with over 2000 conveyors, transporting thousands upon thousands of items every second. On top of that, those items move in clear-to-see patterns, meaning that it is easy to see when they end up wrong. For us this means there is less room for simplifications, as opposed to, for example, when a player at a distance of 100 meters happens to aim their gun 10 degrees wrong.
This is just a portion of a base, but pretty much everything you can see here has states that change over time. These states need to replicate for all players simultaneously.
There are essentially two ways to solve this. Solution A would be to fully synchronize the whole map to all clients up front, and then run a local simulation on every client’s machine. This relies on everything staying 100% predictable at all times, and only sends and syncs information when unpredictable events, such as player input, happen.
Solution B would instead run everything primarily on the server while clients send their input to the server, which would then replicate its state to all clients, similar to how most shooters handle this.
Both methods have their pros and cons. So let’s dive a little bit deeper and explain which solution we are actually using, and why.
As we said, solution A relies on simulating everything locally after the initial full sync. This requires us to keep everything 100% predictable/deterministic at all times, which in itself is a bit of work, but doable, and something we want to strive for either way. In this case it would not only be a goal, but an absolute requirement. The slightest inaccuracy could cause a chain reaction, resulting in the entire game state being desynchronized beyond repair for everyone. Keeping things deterministic for factories and such is one thing, but keeping the physics simulations for complex actors such as vehicles perfectly deterministic on all clients is really hard, if not impossible, without rewriting parts of the engine, meaning those would likely require a special workaround.
What makes this even more complicated, however, is sending and syncing information when unpredictable events, such as player input, happen. To make this work you would also have to run everything in lockstep. Lockstep ensures the game does not advance a frame before everyone has confirmed that they received the input messages from all players for the given frame, which results in stutters for all other players if just one player is running behind.
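To make the stall concrete, here is a minimal lockstep sketch in Python. It is not the game's code: the class `LockstepSession` and its method names are made up for illustration, and a real implementation would also deal with timeouts and disconnects.

```python
from collections import defaultdict


class LockstepSession:
    """Toy lockstep: frame N runs only once every player's input for N has arrived."""

    def __init__(self, player_ids):
        self.player_ids = set(player_ids)
        self.frame = 0
        self.pending = defaultdict(dict)  # frame -> {player_id: input}

    def receive_input(self, player_id, frame, player_input):
        self.pending[frame][player_id] = player_input

    def try_advance(self):
        """Return the inputs applied for the current frame, or None if stalled."""
        inputs = self.pending.get(self.frame, {})
        if set(inputs) != self.player_ids:
            return None  # at least one player is behind, so everyone waits
        self.pending.pop(self.frame, None)
        self.frame += 1
        return inputs
```

As long as one player's input for the current frame is missing, `try_advance` keeps returning `None` for everybody, which is exactly the stutter described above.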
An alternative to lockstep is requiring all players to keep a copy of the entire game state for every frame up to 2 seconds back in time. This enables them to rewind and simulate forward again when input from a laggy client arrives. Multiple copies of the game state would eat away memory, while rewinding and simulating forward multiple frames in a single frame would cause a lot of CPU overhead and potentially performance spikes.
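The rewind approach can be sketched as follows. This is a toy model under heavy assumptions: the "game state" is a single number, `step` is a stand-in for a deterministic simulation step, and `RollbackSim` is a hypothetical name. It only shows where the memory (one snapshot per stored frame) and the CPU cost (re-simulating every frame since the late input) come from.

```python
from collections import deque


def step(state, inputs):
    """Stand-in for one deterministic simulation step."""
    return state + sum(inputs)


class RollbackSim:
    def __init__(self, history_frames):
        self.state = 0
        self.frame = 0
        # Each entry: [frame, state_before_frame, inputs_for_frame].
        # maxlen bounds the memory: older snapshots are discarded.
        self.history = deque(maxlen=history_frames)

    def advance(self, inputs):
        self.history.append([self.frame, self.state, list(inputs)])
        self.state = step(self.state, inputs)
        self.frame += 1

    def late_input(self, frame, value):
        """Apply a late input, rewind, re-simulate. Returns frames re-simulated."""
        entries = [e for e in self.history if e[0] >= frame]
        if not entries or entries[0][0] != frame:
            raise ValueError("input is outside the stored history")
        entries[0][2].append(value)
        state = entries[0][1]              # rewind to the snapshot
        for entry in entries:
            entry[1] = state               # refresh the stored snapshot
            state = step(state, entry[2])  # simulate forward again
        self.state = state
        return len(entries)
```

The return value of `late_input` is the number of frames that had to be re-simulated within a single frame, which is where the performance spikes would come from.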
All of these are solvable problems, but they require a lot of work. It also makes adding new features more difficult, as you need to consider all these cases and systems.
We didn’t choose this method. We instead went with solution B, which, like most multiplayer games, runs everything on the server and then replicates a representation of that to all clients when needed. But this requires us to be really smart with how we use the available bandwidth, since simply replicating everything is impossible in a game like this.
So what we’ve been working on in the months since launch has been the following:
Minimizing what data is sent, by figuring out when data is actually needed and only sending it at that time. Things the player doesn’t notice are not needed.
Reducing how often the data is sent. If things don’t change often or are more predictable, they can be sent less often.
Compressing and minimizing the amount of data needed to represent a game state.
On top of all this there is also the problem of evaluating all the objects separately for each client on the server PC, since which data is prioritized and needed differs for every client. This means a lot more work for the server and a greater CPU load, which needs to be minimized to keep the server performing as well as possible.
If these things are not done properly, the server can perform a lot worse than the clients. Clients can experience a lot of extra latency (much greater than the actual ping), there can be a lot of dropped packets (I will explain what this is shortly), things will start to get choppy, and things might look strange in general for the clients. Essentially all kinds of issues can occur.
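A rough sketch of why this costs server CPU: the selection below has to run once per client per update, over every object with pending changes. The function name `select_updates`, the dictionary fields, and the use of distance as the priority are all assumptions made for illustration, not how the game actually prioritizes.

```python
import math


def select_updates(client_pos, objects, byte_budget):
    """Pick which changed objects to send to one client this tick, nearest first."""
    ranked = sorted(objects, key=lambda o: math.dist(client_pos, o["pos"]))
    chosen, used = [], 0
    for obj in ranked:
        if used + obj["size"] <= byte_budget:
            chosen.append(obj["id"])
            used += obj["size"]
    return chosen
```

Two clients standing in different places get different results from the same object list, so the work cannot simply be done once and shared.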
To what extent this can be addressed or fixed varies greatly from case to case, so to give a better understanding of these concepts I plan to dive into a few concrete examples of improvements we have made for this patch.
To get us all on the same page, let’s cover a few common and maybe less common terms that we are going to use.
Packet: A bunch of data sent over the network as a chunk. This is how any and all information is sent between clients and servers.
Packet loss: When data is not being received or sent properly, resulting in holes in the intended data. Which can cause all kinds of issues if not handled.
Ping: The time it takes to send a data packet to another PC and get a response back. Usually measured in ms (milliseconds).
Latency: The time it takes from initiation of an action until it’s actually performed/the user gets feedback of it happening.
Send Rate: How much data we aim to send per second.
Send Buffer: A chunk of memory that is written to, which the PC will read from and consume when sending packets.
Bufferbloat: When too much data is written to the buffer, making it take too long to consume and work through the data. Think of it like a queue. Sometimes this leads to an extreme latency increase.
Buffer Overflow: When you try to write too much data to a buffer so some of it doesn’t fit and actually gets dropped. Can result in what seems like dropped packets.
Replication: The act of replicating the state of something on a client. It’s often not the whole state, but it’s enough to give the correct impression of the server’s state.
Serialization: The act of taking a complex data structure and writing it to a linear stream of data, which can be sent in a packet to later be turned back into a data structure.
We will use a few other terms as well, but let’s cover those as we encounter them.
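As a small illustration of the serialization term above, here is a sketch using Python's standard `struct` module. The wire layout (a 16-bit item count, then a 16-bit item type and a 32-bit float offset per item) is invented for this example and is not the game's format.

```python
import struct

# Hypothetical wire format: little-endian uint16 item type + float32 offset.
ITEM_FORMAT = "<Hf"


def serialize_items(items):
    """Flatten a list of (item_type, offset) pairs into one bytes object."""
    payload = struct.pack("<H", len(items))  # item count first
    for item_type, offset in items:
        payload += struct.pack(ITEM_FORMAT, item_type, offset)
    return payload


def deserialize_items(payload):
    """Rebuild the list of (item_type, offset) pairs from the bytes."""
    (count,) = struct.unpack_from("<H", payload, 0)
    size = struct.calcsize(ITEM_FORMAT)
    return [
        struct.unpack_from(ITEM_FORMAT, payload, 2 + i * size)
        for i in range(count)
    ]
```

The bytes returned by `serialize_items` are the kind of linear stream that can be put in a packet and turned back into a data structure on the other side.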
What we’ve done
So, when starting on this update we had a pretty big plate of things to fix. There were many issues with the net code at launch, resulting in all kinds of problems. We’ve attempted to solve most of them now, but there are a few that we plan to follow up on a little later. Soon™.
The first overarching and biggest problem was simply the raw amount of data that needed to be sent, which grew as the factory grew. Large bases and factories would cause an always-full send buffer, meaning bufferbloat, and even worse, possibly causing overflow.
To understand what a bloated buffer means for the game, think of the buffer as a queue that data needs to go through before it is actually sent. So a full/long buffer means data has to wait around longer before it is actually sent, thus increasing latency, even on our highest priority messages (like player input and movement). Ideally, a responsive game should write little enough data that it is all consumed before the next update, so that new packets can start writing at the start of the buffer (the front of the queue) and the newest data from the latest frame gets through as soon as possible.
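The queue behaviour can be modelled in a few lines. This is a simplified model with made-up parameters (a fixed number of bytes written and drained per update), not a measurement of the game.

```python
def simulate_send_buffer(written_per_tick, drained_per_tick, capacity, ticks):
    """Return (backlog_bytes, dropped_bytes) after `ticks` updates."""
    backlog = 0
    dropped = 0
    for _ in range(ticks):
        backlog += written_per_tick
        if backlog > capacity:  # buffer overflow: the excess is lost
            dropped += backlog - capacity
            backlog = capacity
        backlog = max(0, backlog - drained_per_tick)  # consumed by the network
    return backlog, dropped


def queue_delay_ticks(backlog, drained_per_tick):
    """Updates the newest data waits behind the backlog before being sent."""
    return backlog / drained_per_tick
```

Writing 800 bytes per update against a drain of 1000 keeps the backlog at zero. Writing 1200 per update lets the backlog grow until the buffer is full, after which every new message waits several updates and some data is dropped, regardless of its priority.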
So, our number one goal was to reduce the amount of data sent during gameplay as much as possible. To describe that work we can take a look at two particular cases that perfectly encapsulate the issues we had and the work we’ve been doing.
Only sending what can be seen
First off, we have a low-hanging fruit with huge gains: all the factories and buildings with inventories.
With the initial replication the whole building was seen as one object, meaning if we replicated any part of it, like the production indicator lamp, we needed to replicate all of it. In turn this meant we ended up having all buildings’ inventories replicated to all clients at all times, even if they could not be seen.
For some buildings this was not a big issue, as their state rarely changed (meaning no data or at least very little data was needed). But most buildings have their inventory connected to a conveyor for input and output, meaning it is likely to change all the time. In many cases, in fact, many times a second.
As you can see in the menu for this building here, there is an inventory for the building that constantly changes and updates. This applies to every building that stores items or produces them and is connected to one or more Conveyor Belts. That’s a lot of data!
It would be a big gain to stop replicating the inventory when it’s not viewed, which is essentially what we did, but the method of doing so was a bit complicated and required a lot of rework. Not only for the underlying code, but also for GUI and other aspects, considering there would now be a small chance that we had a replicated building (or player/vehicle) which, due to latency, had not received its inventory yet. All systems need to be aware of this case, and preferably also inform players about it.
Doing this also helps to reduce CPU time, as an inventory is a big state to compare, and look for changes in. If we can reduce that to a maximum of 4x the number of players it is a huge gain, compared to the hundreds, if not thousands, that would otherwise be present in a big base.
There is, of course, a trade-off. As I mentioned, there is a chance the inventory is not there when you first open a building to view it, as it has yet to arrive over the network. However, this time should be very short, as long as we keep the bufferbloat to a minimum and you are playing with a reasonable ping.
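The idea can be sketched like this. The `Building` class, its fields, and the "Loading..." placeholder are hypothetical; the point is only that cheap state goes to everyone, the inventory goes to clients that are looking at it, and the client side has to cope with the inventory not having arrived yet.

```python
class Building:
    def __init__(self, building_id):
        self.building_id = building_id
        self.lamp_on = False   # cheap state, replicated to everyone
        self.inventory = {}    # expensive state, replicated on demand
        self.viewers = set()   # clients with this building's menu open

    def open_menu(self, client_id):
        self.viewers.add(client_id)

    def close_menu(self, client_id):
        self.viewers.discard(client_id)

    def build_update(self, client_id):
        """Server side: what gets sent to one particular client."""
        update = {"id": self.building_id, "lamp_on": self.lamp_on}
        if client_id in self.viewers:
            update["inventory"] = dict(self.inventory)
        return update


def render_inventory(update):
    """Client side: the inventory may not have arrived yet."""
    inventory = update.get("inventory")
    return "Loading..." if inventory is None else sorted(inventory.items())
```

Only the clients in `viewers` pay the bandwidth for the inventory, and the server only needs to compare inventories for buildings that somebody has open.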
This all works very well at the moment, but there is always more we can do. In the future we could work to prefetch inventory data, trying to predict potential targets before you interact with them and fetch them just in case, minimizing or even removing latency. But that is work for another update and can cause serious data spikes if we are not careful.
Minimizing data size and rate
So that was related to the “Minimizing what data is sent, by figuring out when data is actually needed” method that was mentioned before. So how about the other methods?
To cover this, the next biggest issue is a perfect fit: the conveyor belts. There are literally thousands of them, each sending and receiving many items a second. Often you can clearly see them with all their items even at a great distance. There is no avoiding sending data in this case, since everything is visible, unlike with the inventory, so we need another solution.
I mentioned that the code at release was not very good, but it was not all brute force or anything like that. It was already trying to reduce the amount of data used by conveyors in many ways. So, before we cover the improvements and changes we made, let’s see what we were already doing.
The conveyors have a lot of items and data, but only a small part of that changes between updates in a way that can’t be easily predicted (items entering and leaving). This is a clear case for something called delta serialization. This is not normal serialization like most objects use. The key word is delta. Delta means that we use a previous state and only extract the difference, which usually is a lot smaller, and send that instead. Think of it like this: instead of sending the 25 items currently on the conveyor, we only send the 2 new items that were added and the 1 item that was removed since the last update. That is 3 item records instead of 25, meaning we use only 12% of the data. On top of that we don’t replicate the movement of items on a conveyor, and instead simulate that on each client (only sending the initial position when the item is added).
This is a very simplified view of it all. In reality we have to send a little extra data and have a few more systems on top of it to ensure everything stays in sync, but that’s not what we want to look at right now.
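In the same simplified spirit, delta serialization can be sketched in Python like this. Giving every item a unique id, and the functions `make_delta` and `apply_delta`, are assumptions for illustration; the real system is far more involved.

```python
def make_delta(previous, current):
    """Items are (item_id, item_type) pairs; returns what changed since `previous`."""
    prev_ids = {item_id for item_id, _ in previous}
    curr_ids = {item_id for item_id, _ in current}
    added = [item for item in current if item[0] not in prev_ids]
    removed = [item_id for item_id, _ in previous if item_id not in curr_ids]
    return {"added": added, "removed": removed}


def apply_delta(previous, delta):
    """Client side: rebuild the current state from the last known state."""
    removed = set(delta["removed"])
    kept = [item for item in previous if item[0] not in removed]
    return kept + delta["added"]
```

For a belt with 25 items where 1 item left and 2 entered, the delta holds 3 entries instead of 25, which matches the 12% figure. The catch is that both sides must agree on the previous state, which is why extra data and systems are needed to keep everything in sync.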
This might already sound like it could be good enough, but in our measurements the conveyors still made up a clear majority of all network data and often resulted in the buffer getting bloated and potentially overflowing. Meaning it was not enough. We had to go deeper.
In this case we have to write our own custom serialization, looking at and evaluating every bit and byte we write. It’s a very time-consuming task, and it’s always full of compromises and trade-offs. We want to minimize data, but using more data always gives better accuracy, so the trick is to find the line where we can get just enough accuracy for as little data as possible. On top of that, if we can predict some behaviours or patterns, we can skip sending any data at all, which is an even bigger gain.
However much I would like to, I won’t go into the details of what we actually write in the delta packets here, as it’s a really complicated system and could easily cover a whole page just explaining it. In fact, the functions making up the delta serialization are over 2000 lines of code, so it’s quite a lot to explain.
It’s hard to give concrete examples here, but we can use an average-case packet for a standard conveyor under standard load to compare the size difference. In this case the old system actually varied in size but landed around 48 bytes per delta, compared to just 3 bytes for the new system. This is just the average case. The difference is smaller in other cases, though our new method should always be much smaller than the old method.
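One common way to trade accuracy for data is quantization: storing a value in fewer bits than a full float. The sketch below is a generic example of that trade-off, assuming an item position stored as an offset along a belt of known length. It is not a description of what the delta packets actually contain.

```python
def quantize(offset, belt_length, bits=8):
    """Map an offset in [0, belt_length] to an integer code of `bits` bits."""
    levels = (1 << bits) - 1
    return round(offset / belt_length * levels)


def dequantize(code, belt_length, bits=8):
    """Recover an approximate offset from the integer code."""
    levels = (1 << bits) - 1
    return code / levels * belt_length


def max_error(belt_length, bits=8):
    """Worst-case position error: half of one quantization step."""
    return belt_length / ((1 << bits) - 1) / 2
```

With 8 bits instead of a 32-bit float, the offset costs a quarter of the data, at the price of a bounded position error given by `max_error`. Finding the bit count where that error stops being visible is the kind of line-drawing described above.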
On top of this, we also reduced how often a conveyor tries to send an update, from over 20 times a second to just 3. This makes them accumulate a few more changes into one packet/delta, meaning we spend less data on the packet overhead. Overhead is a constant cost which doesn’t really contribute to the end result but is needed for the system to work. After our reduction, this overhead turned out to be the largest part of most of the packets. So reducing how many of the packets are sent (how many of the overheads we “pay” for) ends up saving us a lot of data, at the low cost of a little extra latency on items’ movement on belts.
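The arithmetic behind this is simple enough to write down. The overhead and per-change sizes used in the example below are made-up numbers chosen only to show the shape of the saving.

```python
def bytes_per_second(updates_per_second, changes_per_second,
                     overhead_bytes, bytes_per_change):
    """Total data rate: a fixed overhead per update plus the actual changes."""
    overhead = updates_per_second * overhead_bytes
    payload = changes_per_second * bytes_per_change
    return overhead + payload
```

Assuming 6 changes per second at 3 bytes each and 8 bytes of overhead per update, 20 updates per second cost 178 bytes per second while 3 updates per second cost 42. The payload is the same 18 bytes in both cases, and only the overhead shrinks.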
This graph demonstrates a pretty common test scenario we set up, and how much data it used before and after we applied some of the optimizations we were testing. As you can see, the results of our changes were quite promising.
But nothing is without trade-offs, and the reduction in data had some costs as well. For example, the accuracy of item placements on the conveyors took a small hit, but we have added complicated systems in order to compensate for that. These can recognize and use common patterns to avoid accuracy problems entirely, or send a little extra data in the cases it is needed.
Overall the conveyors should not only use less data compared to before, they should appear more accurate, be a closer match to the server’s state, as well as feel more responsive, even though there is a bit of built-in latency. Since that latency is compensated for on the client side, it should not affect your playing experience and rather make things feel even more responsive than before.
A lot of these optimizations are things we could only do by knowing the problem area in-depth and designing everything specifically for this case and this case only. In general, data optimizations like this are manual work that takes a lot of time and is generally done at the end of a project, when the problem area is well defined and tested.
There is some future work remaining on the conveyors, which could help reduce the data a great deal more, and there are some things we could improve in the looks department. Overall the new system should be a lot cheaper for both the network as well as the CPU, while looking and feeling better than before. Essentially: wins all around. Though, like with all new and complicated systems, there can be some undiscovered issues and bugs that we will try to iron out as they pop up.
This is probably all the optimization-related talk I can cover in this blog, but there is a lot more work that we have done, and we will spend a few more weeks after Update 2 looking over these systems to make sure everything is running as it should.
There are also two other systems that we are considering adding, but which we didn’t dare to add so close to the patch going live. So we’ll see when we get time for that. But if you are a player and still experiencing a lot of lag, delays, and/or other issues when joining a game with a large factory, know that it is being looked at. It’s all related to too much data flooding the send buffer and creating a prolonged period of bufferbloat and overflow on startup. There is no easy way to manage this in the engine, so we have to write our own system for gradually streaming the initial world state to clients.
For this update we prioritized everything that affected the moment-to-moment gameplay after the loading is already done, but the start up/loading is next on the list.
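To give an idea of what gradual streaming of the initial state could look like, here is a minimal sketch that spreads objects over several updates under a byte budget. It is an illustration only: the function `stream_initial_state` and the fixed per-update budget are assumptions, and the planned system is not described here.

```python
def stream_initial_state(object_sizes, bytes_per_tick):
    """Yield lists of object indices, one list per update, within the byte budget."""
    batch, used = [], 0
    for index, size in enumerate(object_sizes):
        if batch and used + size > bytes_per_tick:
            yield batch
            batch, used = [], 0
        batch.append(index)  # an oversized object still gets an update of its own
        used += size
    if batch:
        yield batch
```

Instead of writing the whole world to the send buffer at once, each update only writes one batch, so the buffer has a chance to drain in between.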
After this we will not focus on network optimizations for a while, as we’ve noticed that the biggest issue for running smooth multiplayer in large factories is not the network traffic anymore. It’s rather the general performance of the PC acting as a server, which is where we will focus our optimization efforts next.
For more info about the game, and potentially more blogs, please visit satisfactorygame.com or follow Satisfactory on Twitter: @SatisfactoryAF
For questions, feedback or more info about the topic here, you can use the comments below, or reach me on Twitter: @Gafgar_D.
Thanks for reading!