Every programming language has its quirks and flaws, just as every operating system and hardware architecture does. When developing a large project such as LiveBox, and having to manage a robust private cloud system installed on client machines, all of this has to be taken into consideration.

A customer might need to handle hundreds of users with hundreds of files taking up dozens of gigabytes, whereas another might have thousands of users with thousands of files scaling up to terabytes.
Nobody can really predict how much the system usage or load will escalate in the long term, but the code and the infrastructure can be built to be scalable and to bear heavy load situations.
PHP recently upgraded to version 5.6, allowing file uploads over 2 GB in size, meaning LiveBox too will have to prepare itself to manage this new potential load.
Inside a LiveBox server there is a difference between the logical structure of folders and files and the way they are actually stored on the filesystem. We are going to analyze how the storage is structured and what has been implemented to manage data in the smoothest and safest way.

Each user has a logical root folder which is actually mapped on the filesystem as well. Since LiveBox is built out of several sub-applications (we’ll not discuss those here) this root folder will have a logical sub-folder to act as root for each application. One of these applications is the actual storage, and will be the home where the user will be able to create new folders and manage his files.
Every file and folder is mapped as a logical entity in a tree structure, with each element pointing up to a parent until the home root folder is reached. This way we are always able to show a view of the contents of a specific path, and also to render an actual tree map of the folder structure.
For each file element we have a match in a versioning table, allowing us to manage file versioning and to store the actual path on the filesystem. While the user explores his folders, lists his files, looks at their information, moves them around, creates new folders and so on, only the database is queried, because the storage layout is not relevant for folder tree information. This drastically decreases the execution time of many operations, since the data is not actually read or written, and we know we might have to handle huge amounts of data. Only when we need to access the contents of a file is the versioning table queried to get the physical path, and the file is actually accessed.
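To make the split between the logical tree and the physical storage concrete, here is a minimal sketch of the idea in Python. The table layout, field names and paths are hypothetical, not LiveBox's actual schema; the point is that browsing only walks parent pointers in the database, and the versioning table is touched only when file content is needed.

```python
# Logical tree: each node points to its parent; only metadata lives here.
# (Hypothetical structure, for illustration only.)
nodes = {
    1: {"name": "home", "parent": None, "type": "folder"},
    2: {"name": "docs", "parent": 1, "type": "folder"},
    3: {"name": "report.pdf", "parent": 2, "type": "file"},
}

# Versioning table: maps a file node to its physical location on disk.
versions = {
    3: [{"version": 1, "path": "/storage/user42/f3a9c1/"}],
}

def logical_path(node_id):
    """Rebuild the logical path by walking parent pointers (database only)."""
    parts = []
    while node_id is not None:
        node = nodes[node_id]
        parts.append(node["name"])
        node_id = node["parent"]
    return "/" + "/".join(reversed(parts))

def physical_path(node_id):
    """Only when file content is needed do we consult the versioning table."""
    return versions[node_id][-1]["path"]
```

Listing, moving or renaming entries would only ever touch `nodes`, which is why those operations stay fast regardless of file sizes.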

While folders are mapped on the database, they are not real folders on the filesystem, they exist only at a logical level, to help us map the tree structure of the user, but nothing more. The only important folders (for this post) you will find inside a LiveBox storage are the root folder and the single file folders. Yes, because each file is physically stored inside a separate folder, and those folders are where the database versioning records are pointing.
Why use a folder for each file? The server must sometimes adapt to client needs, and one of those was the necessity to get file contents in base64-encoded form. Then came encryption and the need to encrypt huge files. Then the requirement for resumable downloads came along, sometimes of an encrypted, base64-encoded file.
So why a folder for each file? We’re almost there. The LiveBox development team decided to split every file into smaller parts (10 MB) to manage atomic operations in a smoother way, leaving the door open for operation parallelization in future development.
As soon as a file is uploaded correctly, a script starts splitting it into chunks, storing them with numerically progressive names inside a dedicated folder. When the file is needed, the server reads the chunks in order from the target folder, outputting the content sequentially so that on the receiving side the data stream will contain the whole file. This allows resumable downloads should the download request drop before it completes: the client only needs to check how many bytes it has downloaded, truncate to the lower 10 MB multiple and then ask the server for the file again, adding a parameter stating from which chunk the download should start.
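The splitting and the resume arithmetic described above can be sketched as follows. This is an illustrative Python version, not LiveBox's actual script; the 10 MB chunk size is the one stated in the post, and the chunk naming is an assumption.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, as described above

def split_file(data, chunk_size=CHUNK_SIZE):
    """Split a byte payload into numerically progressive chunks
    (chunk 0, chunk 1, ...), mirroring the on-disk chunk names."""
    count = (len(data) + chunk_size - 1) // chunk_size
    return {i: data[i * chunk_size:(i + 1) * chunk_size] for i in range(count)}

def resume_chunk(bytes_downloaded, chunk_size=CHUNK_SIZE):
    """Truncate the downloaded byte count to the lower chunk multiple:
    this is the chunk index the client asks the server to restart from."""
    return bytes_downloaded // chunk_size
```

For example, a client that received 25 MB before the connection dropped would keep the first two complete chunks and request the download again starting from chunk 2.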
When encryption comes into the picture, each chunk is encrypted separately after the file has been split, so again we can start a download from any point, because we don't need the whole encrypted file to start the decryption. Each chunk is decrypted and output, providing once more a seamless stream of file data.

Since some of these features were added at different points in LiveBox's history, a high degree of backwards compatibility had to be ensured.
When file splitting was introduced, the server had to manage the old whole-file storage along with the new split one. Each time a file was read or written, if it was not yet in the split format the server took a little extra time to convert it to the new format, so the filesystem became progressively more up to date as each request was processed.
Aside from the standard user storage we have a couple of other folders storing mostly temporary files, for example when a user wants to send an email link pointing to a file. Since this feature can lead to a rapid exhaustion of the disk quota (even though the temporary files are cleaned out regularly), there are some implementations to improve performance. For example, when a file has already been shared via mail and a request is made to share it again, the server detects that it has already been copied into the temporary folder (by checking the file hash) and only the database record pointing at it is duplicated.
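The hash-based dedupe for shared files boils down to a lookup before the copy. Here is an illustrative sketch; the store structures, the SHA-256 choice and the temp path are assumptions, since the post only says a file hash is checked.

```python
import hashlib

# Hypothetical temp-share state: one physical copy per content hash,
# plus one database-style record per outgoing mail link.
temp_copies = {}    # content hash -> path of the single physical copy
share_records = []  # one record per share, pointing at a copy by hash

def share_via_mail(content, source_path):
    """Copy the file into the temp area only if an identical copy is not
    already there; otherwise just duplicate the pointing record."""
    digest = hashlib.sha256(content).hexdigest()
    if digest not in temp_copies:
        temp_copies[digest] = "/tmp/shares/" + digest  # first physical copy
    share_records.append({"hash": digest, "source": source_path})
    return temp_copies[digest]
```

Sharing the same file twice thus costs one disk copy and two cheap database records, which is what keeps the temporary area from filling up faster than the cleanup can empty it.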
All things considered, the balancing between database load and hard drive load helps us keep LiveBox as fast and stable as possible, and we are currently evolving our server code to fix each script whose performance doesn't meet our high standards and expectations.
Every innovation in the hardware and software we use is a jumping platform to improve our code, both server-side and client-side, to ensure the best performance we can give.

(my daily quote: “We may not be perfect, but we’ll get closer each day.” – me, now)


Performance, splitting and encrypting
