Migrating from Nextcloud to Azure S3

Thomas.G (@fraxken) · May 20, 2025

Hello 👋

Back for a new MyUnisoft technical article, this time with the help of my colleague Nicolas MARTEAU. Today, we will share our journey to completely refactor our document management architecture and how we migrated from Nextcloud to Azure S3 as our storage technology.

We weren’t able to cover every detail—both for security reasons 🛡️ and to protect sensitive data 🔒—but I hope you will enjoy what I could share. 😊

👀 Why move away from Nextcloud?

Performance 🤖

Until now, we have managed several tens of millions of documents with Nextcloud. However, stability and performance had become an issue, with regular downtime 🕒 and, at times, delays ⏳ of several minutes for a simple document upload.

💬 These upload delays sometimes led to misunderstandings among users. For example, in certain integrations, it was not uncommon for users to delete their Accounting entries 🧾 after a few seconds because they thought the attachment was missing.

The complexity and limited functionality of the existing APIs quickly became a significant obstacle 🛑. Simply making a document available in a specific folder could require four or five separate HTTP requests. We needed a more robust storage solution that could scale effectively 📈 and provide consistent, fast response times. 🚀

Infrastructure 🏢

Furthermore, we needed to reduce Nextcloud's impact on our infrastructure. Unlike Azure, Nextcloud doesn't scale well and required too much maintenance from our DevOps team.

😬 Architectural issues

In the past, users accessed documents stored directly on Nextcloud, with some files displayed through the platform’s built-in viewers.

(Image: GED architecture 1)

This initial choice was certainly made for simplicity, but it evolved into a significant architectural challenge once we began exposing storage directly to customers. Changing a storage server without affecting our users became complex, and it also complicated the management of certain security and observability concerns.

(Image: GED architecture 2)

The primary issue is with PDF documents, such as ledgers, which contain hardcoded URLs pointing to specific storage servers. This requires us to maintain these URLs for years to ensure continued access.


As part of our migration to S3, we are addressing these issues by routing all requests through the same service (GED).

(Image: new GED architecture)

This approach enables us to resolve several issues and enhance the product’s functionality:

  • Requiring authentication for sensitive documents.
  • Providing full observability over who uploads or downloads specific documents.
  • Enabling updates to storage capabilities without impacting customers.
  • Integrating new storage technologies seamlessly and transparently—for instance, through potential future integrations with services like Microsoft OneDrive.

📢 The plan

The first step was to draft an action plan 📝 and thoroughly document the existing setup. After several weeks of work, we established the key steps:

  1. Route all document downloads through the GED service.
  2. Route all document uploads through the GED service.
  3. Migrate all existing documents to our new Azure storage, ensuring zero impact 🚫 on the end user.
  4. Manage the Nextcloud links found in PDFs already exported by our clients before the migration. Since these links pointed directly to our Nextcloud servers 💀, we had to find a reliable way to route these calls through the GED.

Each stage comes with its own set of challenges, which we’ll examine in detail later in the article.

Our primary concern, however, was to correct previous architectural missteps 🔍.

1️⃣ Download

The first step was to re-abstract downloads and previews, routing them through our backend. This required us to manage both legacy documents 📜 still stored on Nextcloud and new documents that would be hosted on Azure storage.

One challenge we faced was that the tokens generated by Nextcloud lack any information about the tenant associated with the document. Without this, our backend cannot identify the relevant database cluster and tenant.

(Image: token/tenant relationship)

To resolve this, we created a new opaque token that embeds the tenant ID:

import crypto from "node:crypto";

const tenantId = 1; // tenant identifier, resolved earlier from the request context
const token = `${tenantId}-${crypto.randomBytes(16).toString("hex")}`;
console.log(token); // => 1-f82158a508b8bfbed82b601e2ed60edd

🔮 Previews

Nextcloud offered automatic previews of uploaded files, a feature we relied on extensively, so we needed to re-implement an equivalent ourselves.

(Image: download and preview usage)

We decided not to generate previews at upload, as this would have added significant complexity and cost, along with the challenge of handling asynchronous generation.

For PDFs, we just return an optimized preview of the first page, and for images, we use the Sharp library.

import sharp from "sharp";

function getImageTransformer(
  query: { x?: string; y?: string; },
  ext: string
): sharp.Sharp {
  const { x, y } = getDimensions(query);

  // Images with an alpha channel stay in PNG; everything else is re-encoded
  // as a lighter mozjpeg-compressed JPEG.
  const transformer = Azure.isAlphaImage(ext)
    ? sharp().png()
    : sharp({ failOn: "none" }).jpeg({ mozjpeg: true, quality: 50 });

  // Resize to fit within the requested dimensions, never upscaling the source.
  return transformer.resize({
    fit: "inside",
    withoutEnlargement: true,
    height: y,
    width: x
  });
}
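
For PDFs, a minimal sketch of what rasterizing the first page can look like with Ghostscript; this illustrates the idea rather than our exact implementation:

import { spawn } from "node:child_process";

// Illustrative sketch: rasterize only the first page of a PDF with Ghostscript
// and stream the resulting JPEG back to the client (gs must be installed).
const gs = spawn("gs", [
  "-q", "-dBATCH", "-dNOPAUSE", "-dSAFER",
  "-dFirstPage=1", "-dLastPage=1", // render page one only
  "-sDEVICE=jpeg", "-r96",         // JPEG output at screen resolution
  "-sOutputFile=-",                // write the image to stdout
  "-"                              // read the PDF from stdin
]);

// hypothetical usage: azureDownloadStream.pipe(gs.stdin); gs.stdout.pipe(reply.raw);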

📦 Headers and encoding

When returning documents, it’s essential to set the correct HTTP headers and apply proper encoding to values like file names.

import path from "node:path";
import mime from "mime-types"; // exposes contentType(); any equivalent lookup works

const contentType = mime.contentType(
  path.extname(request.body.document)
);
const { body, contentLength } = await getFileFromAzure(request);
// pipe body to reply/response

// filename is assumed to be derived from the requested document path
const filename = path.basename(request.body.document);

reply.header(
  "Content-Disposition",
  `attachment; filename="${encodeURIComponent(filename)}"`
);
reply.header("Content-Type", contentType);
reply.header("Content-Length", contentLength);

I frequently see implementations that forget to re-inject the original file name.

Monitoring

Being able to monitor how usage evolves, and to detect misuse, is critical to guaranteeing the stability of our infrastructure.

(Image: GED download monitoring)

Built-in file viewer

Since Nextcloud could display multiple documents within a viewer, we chose to re-implement a minimal yet functional viewer to retain this capability.

While our front-ends offer more advanced display modules, this lightweight viewer remains useful in several scenarios:

  • Replacing or rewriting legacy URLs in PDFs.
  • External links shared via APIs.
  • Providing quick access for debugging purposes.

(Image: GED viewer)

2️⃣ Upload

Successfully prototyping an upload wasn’t as complex as expected… but, as always, the devil is in the details.

For inter-service uploads between Node applications, another Fastify plugin was added to our workspace package, providing methods to interact with the GED API 🔀.
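
To give an idea of the shape of such a plugin, here is a minimal sketch; the decorator name and the uploadDocument method are hypothetical, not the actual workspace API:

import fp from "fastify-plugin";
import type { FastifyInstance } from "fastify";

// Minimal sketch of a workspace plugin exposing GED helpers to other services;
// the decorator name, route and payload shape are hypothetical.
export default fp(async function gedClient(
  fastify: FastifyInstance,
  opts: { baseUrl: string }
) {
  fastify.decorate("ged", {
    async uploadDocument(file: Buffer, filename: string) {
      const response = await fetch(`${opts.baseUrl}/document`, {
        method: "POST",
        headers: {
          "content-type": "application/octet-stream",
          "x-filename": encodeURIComponent(filename)
        },
        body: file
      });

      return response.json();
    }
  });
});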

📉 Optimize PDFs and images

Many of the PDFs and images submitted by our users are quite large and can be optimized. For this, we use Ghostscript 👻 to optimize PDFs and the Sharp package for images.

To date, we’ve reduced the size of received PDFs and images by an average of 50%, with no loss in quality ✨.

(Image: Ghostscript and Sharp)

Compression is performed asynchronously using setImmediate to ensure fast server response times.
A compression value of "null" indicates that the compression ratio is below 5% 🤷‍♀️, in which case the gain is too small to justify updating the file on Azure.
Otherwise, the file is updated in the cloud ✔.
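
Roughly, that decision looks like the following sketch (buffers are used here for brevity, while the real pipeline is stream-based; optimize() and updateOnAzure() stand in for internal helpers):

// Simplified sketch of the deferred optimization flow.
function deferCompression(
  original: Buffer,
  optimize: (file: Buffer) => Promise<Buffer>,
  updateOnAzure: (file: Buffer) => Promise<void>
): void {
  // The HTTP response has already been sent; the heavy work runs on a later tick.
  setImmediate(async () => {
    const optimized = await optimize(original); // Ghostscript or Sharp under the hood
    const ratio = 1 - optimized.byteLength / original.byteLength;

    // Below ~5% the gain is negligible (compression stored as "null"): skip the update.
    if (ratio >= 0.05) {
      await updateOnAzure(optimized);
    }
  });
}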

Most of these optimizations are carried out via streams, so that the file or image is never completely buffered.

However, we needed to remain vigilant about rising CPU consumption and enhance our infrastructure setup 🏗️ to handle increased workloads effectively.

🖼️ HEIC/HEIF

Apple's proprietary HEIC format 📱 presented a significant challenge, often requiring conversion to JPG or PNG for compatibility.

Given that Python bindings to libheif showed much better performance, we initially opted to create our own N-API Node.js binding for libheif, using low-level libraries for rapid JPG and PNG conversion.

(Image: HEIF-converter)

For maintenance reasons, we eventually chose to use Sharp instead, building libvips directly on our machines and installing the necessary tools (libheif, mozjpeg, libpng, etc.).
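
Once libvips is built with libheif support, the conversion itself stays short. A minimal sketch:

import sharp from "sharp";

// Minimal sketch: convert an HEIC buffer to JPEG with Sharp.
// This only works if the underlying libvips build includes libheif.
async function heicToJpeg(input: Buffer): Promise<Buffer> {
  return sharp(input)
    .rotate() // honour the EXIF orientation stored by iOS devices
    .jpeg({ mozjpeg: true, quality: 80 })
    .toBuffer();
}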

🔒 Security

When managing file uploads and storage, vigilance is essential 🕵️‍♂️ in several areas:

  • Monitor for spoofed HTTP headers 🛡️, such as altered content-type headers.
  • Scan files for viruses and malicious content 🦠.

Otherwise, an attacker could misuse your brand and storage capabilities to distribute malicious content and compromise users 🚨.

Make it a habit to consult the OWASP cheat sheets to ensure maximum protection against errors and oversights: OWASP File Upload Cheat Sheet.

We used clamscan (which relies on ClamAV) to scan the files 👁️, and file-type to accurately identify the file type instead of relying solely on the request headers 🧨.
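
A simplified sketch of those two checks, using file-type's magic-byte detection and the clamscan package (the allowed extensions and the temporary file path are illustrative):

import { fileTypeFromBuffer } from "file-type";
import NodeClam from "clamscan";

// Simplified sketch of the upload validation; the allowed list is illustrative.
async function validateUpload(file: Buffer, tmpPath: string): Promise<void> {
  // 1. Detect the real type from the magic bytes, never from Content-Type alone.
  const type = await fileTypeFromBuffer(file);
  if (!type || !["pdf", "jpg", "png", "heic"].includes(type.ext)) {
    throw new Error(`unsupported file type: ${type?.ext ?? "unknown"}`);
  }

  // 2. Scan the file with ClamAV through the clamscan package.
  const clamscan = await new NodeClam().init();
  const { isInfected, viruses } = await clamscan.isInfected(tmpPath);
  if (isInfected) {
    throw new Error(`malicious content detected: ${viruses.join(", ")}`);
  }
}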

📊 Monitoring

As we regain control, it’s essential not to overlook usage monitoring through logs and other metrics.

(Image: GED upload monitoring)

3️⃣ Migrating Nextcloud documents

To gradually phase out our Nextcloud servers, we developed a temporary Node.js API 🦾 responsible for transferring resources from Nextcloud to Azure. This service handled upload concurrency, which we limited to 64 simultaneous uploads to avoid overloading the server 🔥.
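
The concurrency cap itself does not need much code; it can look like the following sketch, using the p-limit package (an illustration, not necessarily what the internal tool relies on):

import pLimit from "p-limit";

// Illustrative sketch: cap the number of concurrent Nextcloud -> Azure transfers.
const limit = pLimit(64);

async function migrateAll(
  tokens: string[],
  transfer: (token: string) => Promise<void>
): Promise<void> {
  // allSettled lets failed transfers be counted without aborting the whole batch.
  await Promise.allSettled(
    tokens.map((token) => limit(() => transfer(token)))
  );
}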

(Image: Albatros tool)

Without detailing every feature of this internal tool, it was designed to support key functionalities such as pausing and resuming the migration process, as well as monitoring the status of each transfer (successes ✅, errors ❌, totals, etc.).

(Image: transfer totals)

Step 1: Data Extraction

We extracted from the Nextcloud database all the tokens 📄 (used to retrieve document data from the database) and the file paths on the server (to transfer the resources), saving them into .csv or .txt files.

(Image: Nextcloud export)

Step 2: Environment Setup

We then set up a NAS server to run the Node.js tool and directly access the file system 🦄, bypassing the Nextcloud API. This approach was chosen to maximize performance and enable efficient stream-based, parallel processing of the document transfers.

(Image: Nextcloud NAS setup)

Step 3: Create the DB and go 🧨

All that remained was to create the SQLite databases (we chose to generate one database per firm to avoid excessively large files), using the Nextcloud exports that contained tens of millions of rows, and then start the transfers ✅.
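
To illustrate the kind of build involved, here is a sketch with better-sqlite3 (both the driver and the columns are assumptions made for the example):

import Database from "better-sqlite3";

// Illustrative sketch: one database per firm, filled from the Nextcloud export.
function buildFirmDatabase(file: string, rows: [string, string][]): void {
  const db = new Database(file);
  db.exec(`CREATE TABLE IF NOT EXISTS documents (
    token TEXT PRIMARY KEY NOT NULL,
    path  TEXT NOT NULL
  ) WITHOUT ROWID;`);

  // Wrapping millions of inserts in a single transaction keeps the build fast.
  const insert = db.prepare("INSERT INTO documents (token, path) VALUES (?, ?)");
  const insertMany = db.transaction((entries: [string, string][]) => {
    for (const [token, path] of entries) insert.run(token, path);
  });

  insertMany(rows);
  db.close();
}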

Let’s just say we ran into a few surprises along the way 🤫, and the migration ended up taking us several days 🤭.

4️⃣ Legacy URLs in PDF

Some URLs are permanently embedded in PDFs 📄, so we need to consider strategies for rewriting them using the information available.

Since Nextcloud tokens didn’t contain any tenant information, we created a minimal API (microservice) supported by an SQLite database to maintain the relationship between a token and its corresponding tenant ID.

-- Journaling and fsync are disabled and the file is locked exclusively:
-- this trades durability for raw speed in a single-process database.
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA locking_mode = EXCLUSIVE;

CREATE TABLE IF NOT EXISTS "tokens" (
  "token" TEXT PRIMARY KEY NOT NULL,
  "schema" INTEGER NOT NULL
) WITHOUT ROWID;

We can manage thousands of tokens within just a few milliseconds ⏱️ using purely synchronous I/O. Additionally, we implemented an LRU cache to ensure that repetitive requests are handled even more quickly.
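
A lookup along these lines can look like the sketch below, assuming better-sqlite3 for the synchronous access and the lru-cache package (both are assumptions made for the example):

import Database from "better-sqlite3";
import { LRUCache } from "lru-cache";

// Sketch of the token -> tenant lookup: a synchronous SQLite read behind an LRU cache.
const db = new Database("./tokens.sqlite3", { readonly: true });
const query = db.prepare('SELECT "schema" FROM tokens WHERE token = ?');
const cache = new LRUCache<string, number>({ max: 50_000 });

function resolveTenant(token: string): number | null {
  const cached = cache.get(token);
  if (cached !== undefined) return cached;

  const row = query.get(token) as { schema: number } | undefined;
  if (row === undefined) return null;

  cache.set(token, row.schema);
  return row.schema;
}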

The final step is to configure HAProxy 🔀 to redirect Nextcloud viewer requests to a specific GED endpoint, where the URL is parsed to retrieve tokens and correlate them with their respective tenants, using the project setup described above.

const { link } = request.query;

if (!URL.canParse(link)) {
  // Problem
}
// Other URL validation here

const tokens = link.match(
  /(?<=\/)([1-9]{1,4}-\w{15,32}|\w{15})(?=\W|$)/g
);
// Correlate tokens with our microservice database

reply.redirect(`/ged/document/view?tokens=${correlatedTokens.join("|")}`);

This is only a partial overview of the implementation. We use a combination of the WHATWG URL API and regular expressions to extract tokens, ensuring sufficient security to mitigate any ReDoS attack vectors.

We then redirect the request with all tokens to our built-in viewer.

(Image: built-in viewer)

🔬 What we've learned

This project taught us that errors in URLs saved within PDF documents are hard to forgive. Due to some technical debt and a lack of foresight, we ended up with an unintended /ged/ged prefix. Today we're having a bit of a laugh about it, and if you see this prefix you'll know it wasn't meant to be 😆.

Managing files with proper streaming while handling errors proved far more challenging than anticipated, plaguing us for weeks with ghost files, memory leaks, and other unexpected bugs. At this level of usage, it’s technical excellence or nothing.

❤️ Credits

A migration project of this scale doesn’t happen overnight—it took us well over a year to complete all the steps outlined above. A big thank you 🙏 to everyone involved for their dedication and effort ❤️.

  • Nicolas 👨‍💻, for leading the project development from A to Z.
  • The infrastructure team 🏗️ (Vincent, Jean-Charles, and Cyril) for their consistent support throughout the project.
  • Aymeric, for managing and leading the migration of downloads and uploads for his team's services.
  • Many others 👥 for their reviews and support 📝.

This project was incredibly rewarding 🏆, both for its challenges and the range of architectural issues it addressed 📐.


Thank you, see you soon for another technical adventure 😉😊

👋👋👋
