sorry my data is too dirty for your model
Datasets for pre-training large models have been expanded to the scale of (a sizable fraction of) the internet, on the premise that “scale averages out noise”. These datasets scrape whatever is available online, then get “cleaned” with a human-not-in-the-loop, cheaper-than-cheap-labor method: heuristic filtering... Heuristics in this context are basically a set of rules that some software engineers came up with, from their imagination and estimation, deemed “good enough” to remove “dirty data” from their perspective, with no guarantee of being optimal, perfect, or rational...
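To make the idea concrete, here is a minimal Python sketch of what such a heuristic filter can look like. The specific rules and thresholds below are hypothetical illustrations, loosely in the style of published web-scrape pipelines, not the actual rules of any particular dataset:

```python
def passes_heuristics(text: str) -> bool:
    """Return True if `text` survives a (hypothetical) cleaning pass."""
    words = text.split()
    # Rule 1: drop very short documents (threshold is an assumption)
    if len(words) < 5:
        return False
    # Rule 2: drop documents with an implausible mean word length
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:
        return False
    # Rule 3: drop documents where too few characters are letters/spaces
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    if alpha / len(text) < 0.8:
        return False
    return True
```

Each rule is cheap to compute over billions of documents, which is exactly why this style of filtering is preferred over human review.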
If we know, even partially, what is considered dirty, doesn’t that mean we can make our data “dirty” and get it filtered out by playing to their rigid estimations? Can we opt out of being trained on by becoming unqualified? sorry my data is too dirty for your model came up with a set of anti-heuristic heuristics, based on the filtering rules of 23 datasets, to have our texts and images mingle with and stay close to “dirty data”. purity is never an option 😈
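The anti-heuristic move can be sketched the same way: mutate a text just enough that it trips a common symbol-ratio rule while staying readable to humans. The function below is a hypothetical illustration of this tactic, not the project’s actual method; the junk marker and insertion rate are arbitrary choices:

```python
def dirty(text: str, junk: str = "�", every: int = 3) -> str:
    """Insert a junk token after every few words so the share of
    alphabetic characters drops and typical filters reject the text."""
    words = text.split()
    out = []
    for i, w in enumerate(words, 1):
        out.append(w)
        if i % every == 0:
            out.append(junk)
    return " ".join(out)
```

A human still reads through the noise; a rule-based cleaner sees an unqualified document and throws it away, which is the point.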