MinIO Tiering Warning: Data Loss and Fault Tolerance Issues
Julien Laurenceau (@julienlau)

Publish Date: Oct 30 '24

Introduction

In the world of big data, storage solutions are the backbone of any architecture. As a solutions architect, I have relied on various storage solutions to ensure data integrity and availability. Recently, I discovered a critical flaw in MinIO's tiering feature, introduced in RELEASE.2022-11-10T18-20-21Z, that poses a significant risk to data integrity. The feature is inspired by AWS S3 lifecycle transitions.

MinIO is very simple to use, which is why it is frequently chosen for testing purposes to simulate S3 in CI/CD pipelines. In these cases, data loss is not an issue. On the other hand, MinIO is also highly appreciated in on-premises environments for providing an S3 API that is very similar to AWS S3 and much simpler to set up than the alternative Ceph + Rados Gateway. In such cases, MinIO is often a critical part of the enterprise data lake and is operated with very high resilience and durability targets. Data loss is not tolerable.

Note: This is an ongoing investigation, and I have not received any insights from the MinIO team. However, I experienced the issue on a production cluster and was able to reproduce it in a test setup. I tested two different versions of MinIO; both exhibited the issue.

About tiering

See the MinIO documentation on object lifecycle management (ILM) and tiering.

Important Facts About the Tiering Feature:

  • Once tiering is configured, all requests must go through the single hot tier.
  • Backend tier(s) cannot be used, even for read-replica purposes. Under the hood, MinIO uses UUIDs to name transitioned objects.
  • The hot tier holds the metadata of all objects (from all tiers). As such, there are no lookups to cold tiers on LIST requests. This is particularly powerful for on-premise setups, where one can have a small SSD cluster as the front tier and a large HDD cluster as the cold tier, achieving great performance.
  • The transition strategy is dictated by AWS S3 lifecycle semantics: transitions are made in daily batches starting at 00:00 UTC.
  • The transition delay cannot be shorter than one day.

More details about the philosophy behind MinIO's tiering feature can be found in GitHub issue 18821.
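
For context, the kind of configuration under discussion can be set up with mc along these lines. This is a minimal sketch: the tier name COLD, the endpoint, the credentials, and the cold bucket name are placeholders, not the exact values from my cluster; the alias "local" and bucket "buc" match the ones used later in this article.

# Register a remote S3-compatible bucket as a cold tier named "COLD"
mc ilm tier add minio local COLD --endpoint http://cold-vm:9000 --access-key ACCESSKEY --secret-key SECRETKEY --bucket cold-bucket

# Transition objects under the "adir/" prefix after 1 day (the minimum delay)
mc ilm rule add local/buc --prefix adir/ --transition-days 1 --transition-tier COLD

# Inspect the configured rules
mc ilm rule ls local/buc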

The Issue at Hand

MinIO's tiering feature was designed to optimize storage by transitioning objects between different storage classes. However, enabling this feature can lead to severe data loss. The root of the problem lies in the MinIO scanner's inability to repair the metadata of transitioned objects. This flaw amounts to a missing anti-entropy algorithm (hello, Cassandra friends!), where data consistency cannot be repaired once compromised.

The Consequences

Without the ability to repair metadata, every outage or drive replacement risks losing quorum. Over time, this inevitably leads to data loss. For organizations relying on MinIO for critical data storage, this flaw could have catastrophic consequences.

Technical Details

To provide a comprehensive understanding, I will delve into the technical specifics of the issue as documented on GitHub (Issue #20559).

MinIO uses Erasure Coding to ensure fault tolerance. Without tiering, replacing faulty drives is a routine operation. Once a faulty drive is replaced, MinIO immediately detects that it is empty and initiates its healing process to rebuild the data.
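
For reference, the routine drive-replacement workflow in a non-tiered cluster boils down to the following; a sketch using the alias and bucket names from this article.

# Check cluster and drive status after the replacement
mc admin info local

# Trigger a deep-scan heal and let MinIO rebuild the missing shards
mc admin heal -r --scan=deep local/buc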

Recently, I had to replace a faulty server in an on-premise cluster using tiering. To my surprise, many users complained about LIST consistency issues, which prompted this analysis.

The smallest test setup involves 5 virtual machines (VMs). The hot tier needs to actually use erasure coding, which requires a minimum cluster size of 4 nodes. The cold tier only needs to exist so the transition can be configured, so I used a single VM, although it could equally have been an AWS S3 bucket.
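
For reproducibility, the deployment looked roughly like this. The hostnames u20-1 through u20-4 and the drive path /data/1 are assumptions inferred from the node names and paths appearing in the captures below; the cold tier path is illustrative.

# Hot tier: 4-node distributed MinIO, one drive per node (same command on every node)
minio server http://u20-{1...4}:9000/data/1

# Cold tier: standalone single-node MinIO on the 5th VM
minio server /data/cold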

Test setup

These terminal captures show the version used, the tiering configuration, and the files stored in the test bucket named "buc." As can be seen, all objects from the directory "adir" transitioned to the cold tier. We can still find the metadata of all these transitioned objects by searching for the "xl.meta" files directly on the MinIO drives.

Test setup version

Test setup tiering

Test setup files
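
To check this on a node, the metadata files can be located directly on the drive; the path below matches my setup.

# Every object, transitioned or not, keeps an xl.meta file on the hot tier
find /data/1/buc -name xl.meta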

There are 21 objects in this bucket "buc," split across 2 directories (adir: 11 objects cold + bdir: 10 objects hot), but the scanner sees:

  • 11 grey objects: the metadata files of the transitioned objects (one metadata file per object).
  • 14 green objects: 10 of these are certainly the recent objects in the "STANDARD" hot tier; the other 4 make no sense.

Sidenote: MinIO inlines small objects: to avoid losing too many IOPS on small files, if the data size is smaller than 256KiB, MinIO does not create a data file; the data is inlined in the metadata file "xl.meta".
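
A quick way to observe this behavior is to write one object on each side of the 256KiB threshold and inspect the resulting on-disk layout; a sketch reusing the alias, bucket, and drive path from this setup.

# ~100KiB object: inlined, so its directory holds only xl.meta
dd if=/dev/urandom bs=1K count=100 of=small_obj
mc cp small_obj local/buc/bdir/
ls -R /data/1/buc/bdir/small_obj/

# ~1MiB object: a separate data part appears next to xl.meta
dd if=/dev/urandom bs=1K count=1024 of=big_obj
mc cp big_obj local/buc/bdir/
ls -R /data/1/buc/bdir/big_obj/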

In the next step, the file named "testfile_241028_idx20" is directly removed from the MinIO drive of the node u20-1 to simulate a faulty drive. In a non-tiered setup, the healing would be immediate upon the next read of this object. In this example, I forced a heal and read this object, but as can be seen, the "xl.meta" was never healed on u20-1. At this stage, there is still no data loss.

Simulate faulty drive on node u20-1

In the next step, I performed the same action on the node u20-2 and observed a catastrophic failure: data was lost.

Simulate faulty drive on node u20-2

At this stage, I am still able to list the object despite using the MinIO setting "list_quorum=strict." However, I am no longer able to read the data.
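
For reference, that setting is applied through the admin configuration API; a sketch (the server must be restarted for it to take effect).

# Require strict quorum agreement on LIST results
mc admin config set local api list_quorum=strict
mc admin service restart local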

After performing the same action on u20-3, the data is completely gone, and I am not even able to list the lost object. If you do not use a delta table or some other catalog, you might never notice that data was lost.

Simulate faulty drive on node u20-3

Once you reach this point, there is no solution to heal the cluster. You are forced to use a backup and rewrite the data to the hot tier to ensure consistent metadata across all nodes.
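
A sketch of that recovery, assuming the backup lives in another S3-compatible location registered under a hypothetical alias named "backup":

# Re-write the affected prefix to the hot tier from an external backup copy
mc mirror --overwrite backup/buc-backup/adir/ local/buc/adir/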

This situation is even more frustrating because the data remains intact on the cold tier! However, since MinIO uses internal names and directory structures on the cold tier, it is impossible to heal the data using the cold tier, as can be seen below.

Data still there on the cold tier

Shell snippet

# Note: "ll" is the usual alias for "ls -l".

# Create 10 test files of increasing size (11..20 KiB) and upload them
rm testfile_*idx*
for i in {11..20}; do dd if=/dev/urandom bs=1K count=$i of=testfile_$(date '+%y%m%d')_idx$i; done
mc cp testfile_*idx* local/buc/adir/
mc ls -r local/buc/
mc admin info local | tail -12
date

# Pick one object and locate its metadata on the local drive
f='testfile_241028_idx20'
mc ls local/buc/adir/$f
find /data/1/buc -type f | grep $f
ll /data/1/buc/adir/$f/xl.meta

# Force a deep heal, then run 10 consecutive LISTs to check consistency
mc admin heal -r --scan=deep -a --force local/buc
for i in {1..10}; do printf "$(mc ls -r local/buc/ | wc -l) "; done; echo ""

# Simulate a faulty drive: stop MinIO, remove the object's files on this node, restart
mc ls local/buc/adir/$f; date; sudo systemctl stop minio; date; sudo rm -f $(find /data/1/buc -type f | grep $f); ll /data/1/buc/adir/$f/xl.meta; sudo systemctl start minio; date; ll /data/1/buc/adir/$f/xl.meta; mc ls local/buc/adir/$f

# Check LIST consistency again, then try to heal and read the object back
for i in {1..10}; do printf "$(mc ls -r local/buc/ | wc -l) "; done; echo ""
mc admin info local | tail -2
ll /data/1/buc/adir/$f/xl.meta
mc stat local/buc/adir/$f
mc admin heal -r --scan=deep -a --force local/buc
ll /data/1/buc/adir/$f/xl.meta
mc ls local/buc/adir/$f
mc cp local/buc/adir/$f /tmp/
mc stat local/buc/adir/$f

Speculations

In this section, I will allow myself to go beyond the facts presented earlier.

Firstly, bugs of this nature should not occur in a production-grade storage solution. Their existence suggests that the developers did not adequately test the product.
Technically, the fact that the tiering feature uses scheduled batches rather than streaming complicates testing: you must wait (or manipulate the clock) until 00:00 UTC the next day for the transition to occur. Regardless, resilience and durability tests must be conducted!

Secondly, from an architectural perspective, I believe the tiering feature has a major flaw on the operational side. In MinIO, the anti-entropy mechanism is the "scanner", which runs continuously to perform a full scan of all objects on a regular basis. Additionally, there is a read-repair mechanism triggered when some shards are unavailable during read queries. This system works well when tiering is not used. However, when tiering is enabled, the hot tier only holds the metadata of transitioned objects. Consequently, the scanner perceives these as "grey objects", which means "unrecoverable objects". This makes monitoring for data loss nearly impossible: you can no longer monitor your hot tier for data loss, as transitioned objects in a normal state appear with the same status as lost objects.

Thirdly, MinIO does not use SemVer and does not support version rollback. There is no point in trying to bisect versions to identify a non-broken one.

Recommendations

While it's crucial to highlight this flaw, it's equally important to offer solutions or workarounds. Here are some recommendations for MinIO users:

  • Disable Tiering: Until a fix is released, consider disabling the tiering feature to prevent potential data loss (see the sketch after this list).
  • Regular Backups: Ensure that regular backups are in place to mitigate the risk of data loss.
  • Monitor Updates: Keep an eye on MinIO's updates and patches for a resolution to these issues.
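
For the first recommendation, removing the lifecycle rules on a bucket stops further transitions (objects that already transitioned stay on the cold tier); a sketch against the test bucket used in this article.

# List the configured lifecycle rules and their IDs
mc ilm rule ls local/buc

# Remove all lifecycle rules on the bucket
mc ilm rule rm --all --force local/buc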

Conclusion

Raising awareness about this critical flaw is essential for the community. While MinIO has been a reliable storage solution, this issue underscores the importance of thorough testing and validation of new features. I urge MinIO to address this flaw promptly to restore user confidence.

However, from my perspective, MinIO is not suited to production workloads where data loss is unacceptable. MinIO's release process, with no SemVer and without extensive tests, is a problem. In addition, MinIO does nothing to inform the operator of potential data loss. In my case, this only became obvious after losing ~10% of our data and dealing with a lot of client issues.

Edit November 4th 2024

A few days after the initial post of this article, MinIO promptly proposed a bugfix for this issue: thanks to them!
However, I have not tested it yet, since it is merged but not yet included in any official release. The GitHub issue is now closed, and unfortunately I will not be able to share the results of my own tests there.


Alarming point: MinIO's tests for healing were known to be broken for more than six months (see the opening of regression ticket 19797). Despite that, MinIO's continuous release cycle was not interrupted, and multiple versions were rolled out to production without appropriate tests.
This is a red flag for production usage.


Edit November 15th 2024

MinIO released version RELEASE.2024-11-07T00-52-20Z which fixes the healing of transitioned objects and also fixes their internal testing suite, as stated in the release note:

add tests for ILM transition and healing (#166) by @harshavardhana in #20601

The bugfix to the heal algorithm is not mentioned in the release note, but I verified this version myself: it is getting better, but there is still room for improvement:

  • If fewer shards than the parity count were lost on the hot tier, the metadata can be healed.

  • If more shards than the parity count were lost on the hot tier, the object is lost.

  • Even if the data is still present on the cold tier, MinIO is not able to recover it from another tier.
    For example, I was not able to heal the file 'testfile_241028_idx20' (the only file with a size of 200KiB) from the first example, even though it is still present on the cold tier (on the right side of the terminal there is a file of size 200KiB).


  • BUG warning! The heal process was not even able to mark the missing object as GREY; it marks all objects as GREEN. MinIO does not tell you when data is lost.

  • Warning! I did not take the time to validate that MinIO fixed the LIST consistency issue reported on GitHub, but they closed the issue very quickly, before I could even test it.

Edit December 9th 2024

The bug on inconsistent LIST is still there in RELEASE.2024-11-07T00-52-20Z.

Edit January 9th 2025

No news from MinIO, and the bug on inconsistent LIST is still there in RELEASE.2024-12-18T13-15-44Z.

Alternative object storage solutions with S3 API

Open source alternatives:

  • Ceph RadosGW -> production-ready and battle-tested
  • Apache Ozone -> the S3 gateway is not mature (as of 1.4.0). I do not recommend it unless your clients use the Hadoop API (Hadoop, Spark, etc.)
  • Garage (garagehq) -> a young project, but AGPL-licensed
  • SeaweedFS -> a young project, but my personal favorite

Comments (5 total)

  • Oumnya benhassou, Nov 16, 2024

    Thank you for this thorough technical investigation and documentation of the MinIO tiering bug. This kind of detailed analysis is extremely valuable for the community. The fact that MinIO's healing tests were broken for 6 months while releases continued is particularly concerning from a production reliability standpoint. Really appreciate you taking the time to reproduce and document this issue in detail.

  • Prakash S, May 30, 2025

    The transition delay cannot be shorter than one day.

    this is incorrect. MinIO supports immediate tiering

    Also please test/validate with latest MinIO

    • Julien Laurenceau, Jun 13, 2025

      EDIT: if you are working at Minio (I know a Prakash Senthil Vel is working there), please tell us.
      theorg.com/org/minio/org-chart/pra...

      Please provide references on immediate tiering support. If by immediate you mean next day at midnight UTC, then we agree but in my view this is not immediate.

      To my knowledge Minio relies on AWS library for transition.
      github.com/minio/minio/issues/12730
      github.com/minio/minio/issues/10255

      "Yes in the S3 spec, an object can be allowed to be removed only in the next midnight after the object expiration time comes. For example, if you create an object at 3pm, it won't be removed after 7 days at 3pm, you still need to wait for the next midnight to come. Also it won't be removed at midnight exactly, you still need to wait for some time (MinIO lifecycle doesn't want to lower server performance when the lifecycle it is triggered)"

      AWS S3 spec for transition is :
      docs.aws.amazon.com/AmazonS3/lates...
      "Date
      Indicates when objects are transitioned to the specified storage class. The date value must be in ISO 8601 format. The time is always midnight UTC.

      Type: Timestamp

      Required: No"

      • Prakash Senthil Vel, Jun 13, 2025

        In a lifecycle rule, transition days can be set to 0, making it effectively an immediate transition.

        • Julien Laurenceau, Jun 16, 2025

          No it's not. If you set it to 0 days it will start at midnight UTC.
