danpalmer 13 hours ago

> One organization I knew of had 1,500 terabytes of data, with less than 2% ever having been accessed after it was first stored.

On a related note, probably a similar percentage of people claim on their car insurance. If only the rest realised they had "crap insurance" and were paying for nothing, they could save so much money!

This is obviously sarcasm, but I think it's important to remember that much of the data is stored because we don't know what we will need later. Photos of kids? Maybe that one will be The One that we end up framing? Miscellaneous business records? Maybe those will be the ones we have to dig out for a tax audit? Web pages on government sites? Maybe there will suddenly be an interest in obscure pages on public health policy if a global pandemic happens.

Complaining that data is mostly junk is not a particularly interesting conclusion without acknowledging this. Is there wastage? Yeah sure, but accuracy about what needs storing is directly traded off against the time spent figuring that out, and often it's cheaper to just store the data.

  • ivraatiems 12 hours ago

    This is exactly the problem. Consider an order processing system that processes a million orders a day, and retains them for ninety days. What percentage of those 89-day-old orders are actually needed at 89 days old? It could be quite low, maybe a couple thousand out of a million.

    But if those orders aren't there, shit hits the fan. PCI compliance audits fail. The ability for customers to reconcile their charges with their purchases breaks. In that 0.01% of cases where the order was fraudulent, placed by mistake, or just didn't have what the customer thought it had in it, not having that data makes the order processor read as, if not malicious, at least incompetent.

    The real question is, how much data do we need to store inefficiently, in a way that uses a lot of power and space?

    • makeitdouble 10 hours ago

      > The real question is, how much data do we need to store inefficiently, in a way that uses a lot of power and space?

      This is indeed the critical question, and it's far from being trivial.

      One issue we all hit is moving data from the higher tier of storage to the cheaper, more efficient one. That usually means a sync and paying for the transfer, but also maintaining two separate access and authorization processes, plus backup and recovery systems for data that absolutely needs to be accessible for the few years of legal retention, and can/must completely disappear afterwards.

      In most orgs I've seen, the cost of going through all that complexity is just not worth it compared to "just" paying for the higher tier of storage for the few-years-long lifetime of the data.

    • numpad0 3 hours ago

        probably a similar percentage of people claim on their car insurance. 
        In that 0.01% of cases where the order was fraudulent, placed by mistake, 
        how much data do we need to store inefficiently, in a way that uses a lot of power and space?
      
      I feel the real question, as sci-fi as it gets, is this: is the winning-ticket data even data on its own, or is it more like a thumbnail of "the whole data" that is 98%+ worthless, rather than a standalone piece of data?

      The winning ticket ID, e.g. a "0x-636e-7461-4e7b", only makes sense in the context of the entire cohort of contestants; I can make one up like I just did, but I can't walk out with the payout if the rest of the lottery doesn't exist.

      Statistically, philosophically, technically, and all sorts of *-cally speaking: is that 2% of data, the winning-ticket datum, even data?

    • AStonesThrow 10 hours ago

      I was just pondering this today, in terms of how much data and how many objects I create in a Google or Microsoft account, for example, which then become a burden of cost and maintenance for me down the road. Especially deleting old emails and photos: that's arduous, and sometimes poignant, as I flush away my personal life story.

      Cloud services make the processes for deleting stuff difficult, sometimes even byzantine, and it's often impossible to operate en masse to clean up swaths of stuff quickly and efficiently. It's in their interest to retain everything at all costs, because more used storage can mean more profit. Cloud services also profit from unused storage: if they're charging $20/year to 100,000 users who each use 2% of their storage space, ka-ching!

      It irks me to this day that standard or even advanced filesystems don't include "expiration dates" or "purge dates" on file objects. Wouldn't it be logical, if an organization has a "data-retention policy" that mandates destruction after X date, for the filesystem to simply purge files automatically? Why does this always get delegated to handmade userland cron jobs? Moreover, to my knowledge, nobody is really interested in devising a way to comb through backup media in order to [selectively] destroy data that shouldn't exist anymore. Not even the read-write media!

      Google is now auto-deleting stuff like OTP SMS messages. I'd love it if auto-delete could be configurable account-wide, for more than just our web histories and Maps Timeline and stuff. Unfortunately, to "delete" cloud data means it still exists on backups anyway. But without deleting data in your cloud account, it becomes a juicier hacker target as it ages and accumulates personal stuff that shouldn't fall into the wrong hands. Likewise for any business, it behooves them to delete and destroy data that shouldn't be stolen. At least move it offline so that only physical access can restore it?

      I will say that modern encryption techniques can make it easy to "destroy" data, simply by destroying the encryption keys. You can quickly render entire SSDs unreadable in the firmware itself with such a command. Bonus: sometimes it's even done on purpose!
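
      A toy sketch of that principle (illustrative only, nobody's actual firmware): encrypt at rest, and "deleting" is just forgetting the key.

        # Toy crypto-erase: once the key is gone, the data is gone.
        # Requires the third-party 'cryptography' package.
        from cryptography.fernet import Fernet

        key = Fernet.generate_key()            # the only way back in
        token = Fernet(key).encrypt(b"sensitive order history")

        del key                                # "destroy" the key

        # The ciphertext still occupies disk space, but it is now
        # unrecoverable: effectively instant erasure of any volume
        # encrypted this way.
        print(token[:20], "... remains, but cannot be decrypted")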

      But even deleting data presents a maintenance cost. So if 90% of an org's data is indeed crap, then 90% or more of your processing resources are going to be wasted on sifting through it at some later date. Imagine when your file formats and storage devices are obsolete, and some grunt needs to retrieve some record that's 30 years old, and 90% of your data was always crap. That grunt is hopefully paid by the hour. We really had this happen at a few of my jobs, where we had old reel-to-reel backup tapes and it was difficult enough to load the data into a modern SunOS machine.

      https://m.xkcd.com/1683/

      • jl6 5 hours ago

        Unfortunately a lot of data retention policies aren’t so mechanical, but are of the form “delete after 7 years unless there is a legal hold in force”, which is usually just rare enough of an edge case that orgs evaluate it manually and hence only do periodic manual purges. But probably the main reason auto-delete isn’t popular is because a process that can delete your old data is one bug/misconfig away from deleting your new data too.

      • faust201 8 hours ago

        > Especially deleting old emails and photos.

        Google or Apple could put a big "delete everything" button in their phones/accounts, but then some prankster would press it on a family member's device, and that gives bad PR. Let's be pragmatic.

        > It's in their interest to retain everything at all costs, because more used storage can mean more profits.

        As a user, I get the reverse incentive. When I had a local NAS, I dumped anything and everything, assuming it cost next to nothing and I'd clean up later. Once I moved to the cloud, that changed to: if I store crap, it costs me money! Keep it clean.

        Once upon a time, Google gave generous unlimited storage to all education and Workspace accounts. They stopped that in the last two years, and you can see that most educational institutions and companies are now running a tight ship.

        Due to backup costs in our organisation, we are forcing people to a maximum of 100GB for email.

        > advanced filesystems don't include "expiration dates" or "purge dates" on file objects. Wouldn't it be logical, if an organization

        Totally agree.

        Outlook's email service has some kind of "keep only the latest newsletter from this sender" feature.

        > auto-delete could be configurable account-wide, for more than just our web histories

        Features are designed with the majority in mind, and most users have some sort of nostalgia for reading or keeping old emails, SMS, etc.

        • ghaff 2 hours ago

          Assuming you're somewhat methodical about labeling email in your non-primary tabs in Gmail, you can pretty much do periodic purges while not worrying much about your primary tab which, in my case at least, gets relatively little traffic.

      • zanecodes 8 hours ago

        I would love a feature like retention policies at the filesystem level. If it could somehow take data provenance into account, that would be even better, i.e. data that I've created directly is irreplaceable and should never be deleted unless I say so and should always be backed up (photographs, writing, project source code, art, game save files, my list of installed applications, etc.); data that I've created indirectly is high priority but may be deleted under certain circumstances or may be omitted from backups (browsing history, shell history, recently opened file lists, frequently used app lists); data that can be easily replaced is low priority and may be cleaned up at any time and need not be backed up at all, contingent on how replaceable it is (application files, cached data).
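
        A rough sketch of what such a policy table could look like, with hypothetical provenance classes (the names and tiers are made up for illustration):

          # Hypothetical provenance-based retention policy (illustrative only).
          from dataclasses import dataclass

          @dataclass
          class Policy:
              backed_up: bool
              deletable: bool

          POLICIES = {
              "user-authored":   Policy(backed_up=True,  deletable=False),  # photos, writing, source
              "user-incidental": Policy(backed_up=False, deletable=True),   # shell/browsing history
              "replaceable":     Policy(backed_up=False, deletable=True),   # caches, application files
          }

          def may_clean(provenance: str) -> bool:
              # A cleanup tool would consult a file's provenance tag
              # (e.g. stored as an extended attribute) before touching it.
              return POLICIES[provenance].deletable

          print(may_clean("replaceable"))      # True
          print(may_clean("user-authored"))    # False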

        • palata 6 hours ago

          Not at the filesystem level, but Android does something that feels like that: an app can write files in its own space, and when you delete the app, it deletes its files.

          Except when the app writes files outside of its own space, which is meant for stuff that should stay (like pictures).

          Of course, some apps store the pictures in their private space, and you lose the pictures when you remove the app. And some apps write crap in the shared space. But that seems like a fundamental limitation (even if it was done at the filesystem): how do you make sure that the apps/software you use is doing it right?

      • nn3 10 hours ago

        I suspect that for most cloud providers, you deleting data is cheaper for them, since the data is not charged by the byte. But they like having the data anyway, maybe to train their AI models or for bragging rights to their investors.

        As for expiration dates, most modern file systems support arbitrary extended attributes per file. It's quite easy to add metadata like this yourself.
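
        For instance, on Linux you could tag files with an expiry xattr and have a periodic sweep enforce it. A minimal sketch (the "user.expires" attribute name is my own invention):

          # Sketch: expiry dates via extended attributes plus a sweep job.
          # Linux-only; os.setxattr/os.getxattr are not available everywhere.
          import os, time, pathlib

          def set_expiry(path, epoch_seconds):
              os.setxattr(path, b"user.expires", str(int(epoch_seconds)).encode())

          def sweep(root):
              for p in pathlib.Path(root).rglob("*"):
                  if not p.is_file():
                      continue
                  try:
                      expires = int(os.getxattr(p, b"user.expires"))
                  except OSError:
                      continue             # no expiry tag, leave it alone
                  if time.time() > expires:
                      p.unlink()           # past its purge date

          # e.g. retain for 7 years, then let the sweep remove it:
          # set_expiry("orders-2018.csv", time.time() + 7 * 365 * 86400)

        Of course, the sweep is still a userland cron job, which is exactly the complaint below.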

        • AStonesThrow 8 hours ago

          Unused services are always pure profit, and storage space is no exception. Providers can offer 100GB for like $24/year because only like 2% of subscribers will ever approach the limit, so the extra space is never wasted; it can be allocated to someone or something else.

          It's like gym memberships, ISP/telco service bundles, and amenities at your apartment complex. Anyone not using every possible service is wasting money, but it's impossible to purchase a bespoke service, so essentially everyone will waste money because they're chipping in money for services that someone else uses more than they do.

          Here at home I don't ever use the gym, the racquetball courts, the doggy-doo supplies, the laundry room, or a parking space, and yet my rent (everyone's rent) includes upkeep for all of those things. I'm subsidizing all my neighbors and all the wear-and-tear they put on those common amenities. Likewise, everyone paying $24/year for storage, and any business buying big multi-terabyte storage media, is paying for unused storage space and handing over profit. It's practically impossible to rightsize your storage media: you never want it undersized, you can't simply shrink it and reclaim the resources you invested, you just keep adding new units and replacing the malfunctioning ones. So nearly everyone always owns or rents more space than they can realistically utilize.

          Furthermore, you'll notice that I specified "automatic" destruction of data by expiration date. Of course it is trivial to tag any arbitrary file with arbitrary metadata, but the challenge is a filesystem that executes automatic data purges on schedule, rather than pushing it into a rickety old handmade cron job in userland. I've never seen a filesystem with such a feature, nor does it seem that anyone's interested in building one.

          And here I thought that computers were useful for automating business logic and easing the burden on human effort. Yet here I am, manually sifting through emails and photos, deleting each one through 3 intervening dialog boxes. It takes hours, days, weeks.

  • InsideOutSanta 5 hours ago

    The article touches on this issue by pointing out that:

    >The Cloud is what happens when the cost of storing data is less than the cost of figuring out what to do with the crap

    But I think that's wrong. The actual issue is that you often can't figure out "what to do with the crap" because the difference between useful data and crap data is determined at the point in time when you need it, not when you store it.

    I'm relatively careful about deleting data, but even so, there have been countless instances where I thought something was no longer needed and deleted it, only to figure out a month later that I needed it after all.

    Also, the article has a few anecdotes like this:

    >Scottish Enterprise had 753 pages on its website, with 47 pages getting 80% of visits

    But that is completely orthogonal to the question of what is "crap." The fact that some data is rarely needed does not mean it's crap. If anything, the fact that it is rarely needed might make it more valuable because it might make it less likely that the same data can be found elsewhere.

    • ghaff 5 hours ago

      I do cull photos and docs but it takes effort and you will make mistakes. On the other hand, intelligent culling makes it easier to find things.

      It’s a trade off especially with digital where there’s not that much cost associated with holding onto things relative to the physical world. My whole house is basically in storage because of a kitchen fire and I’m planning to be pretty selective about what I move back in.

      • InsideOutSanta 4 hours ago

        >intelligent culling makes it easier to find things

        That's true, but things get easier to find over time. I have thousands of digital photographs from the 1990s that I burned on CDs and then copied to a NAS later. Today, for the first time, they're actually searchable because they're now indexed in Immich. So they're sorted by the person in them and searchable by their contents.

        If I had culled the photos back then, I would have lost many photos that have become searchable (and thus valuable) only recently.

        • ghaff 3 hours ago

          And geoencoding, especially on phone photos, makes manual metadata entry less important. I still think there's value in culling redundant and bad photos.

  • mrexroad 12 hours ago

    I have ~2.5TB of photos in iCloud, via Apple Photos. Excluding the various sized previews, I doubt many originals have been accessed in quite a while. I also have about 1TB Lightroom library archived to a different service; representing countless hours of photo processing work spanning over a decade. Haven’t touched that one in years. Neither are crap. (Yes, both have other backups and, yes, I’ve probably forgotten to sync one of them).

    • eloisius 10 hours ago

      Sometimes I wish there was a feasible system to audit and reduce redundant photos in my iCloud. I have an embarrassing ton of pictures that I never look at, just like everyone else. I wouldn't mind a tool that said "here are 10^x photos that are the most similar to all your other photos; would you like to put them in the trash?"

      • mrexroad 5 hours ago

        If you’re using Apple Photos, the feature is there. It will detect both exact copies and (nearly) identical visual duplicates. Look under Utilities.

        • eloisius 2 hours ago

          Maybe it's different on the desktop, but on the iPhone it only detects the same photo saved at different resolutions and the like. That's useful, but I really want something that can, say, use an embedding to group all 10 photos I shot of friends around the campfire and help me select one keeper and delete the other 9.
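
          In principle that's a small amount of code on top of any image-embedding model. A rough sketch, assuming a hypothetical embed() that maps a photo to a vector (e.g. from CLIP):

            # Sketch: group near-duplicate photos by embedding similarity.
            import numpy as np

            def cosine(a, b):
                return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

            def group_similar(photos, embed, threshold=0.9):
                groups = []                   # each group: [(path, vector), ...]
                for path in photos:
                    v = embed(path)
                    for g in groups:
                        if cosine(v, g[0][1]) >= threshold:
                            g.append((path, v))    # joins an existing cluster
                            break
                    else:
                        groups.append([(path, v)])
                # groups with >1 member are "10 shots of the same campfire";
                # keep one, offer the rest for deletion
                return [[p for p, _ in g] for g in groups if len(g) > 1]

          Picking the keeper is still the human part, of course.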

      • makeitdouble 10 hours ago

        Thing is, the tool needs you to pick the one instance that survives the purge, which means looking through 10^x mostly similar photos to pick one.

        Removing duplicates isn't that complex already, and these tools exist, so anyone can try it. It's just a truly grueling process.

        • secabeen 9 hours ago

          Yeah, there have been a number of companies that say they'll offer AI Photo Culling, but I have yet to find one that actually does it.

    • prawn 7 hours ago

      I have about the same in iCloud and then dozens of drives of video footage. Beyond the "keep it in case you need it" aspect, another angle is that it's often cheaper to just keep everything than spend all the time it would take to do a thorough cull. Video is much slower to review, but also chews up far more space, so maybe that balances out.

  • mlinhares 9 hours ago

    If I knew what data I was going to need in the future, I'd be in the future-predicting business.

    But there's an important piece there about data that should not have been stored in the first place. All the big data bullshit made people believe that every single piece of data you could possibly store about a user is data you should store, and this is both a huge waste of resources and a huge liability, because now a data leak of PII that was useless to you could completely bankrupt your business.

  • userbinator 7 hours ago

    "Better to have and not need, than need and not have."

    Storage is cheap, very cheap.

  • 2muchcoffeeman 8 hours ago

    > Photos of kids? Maybe that one will be The One that we end up framing?

    Look through the pictures of your kids (and with your kids) and pick out the best ones. Delete the rest.

    • stephen_g 7 hours ago

      I go back and look at old photos all the time (like when I think of a trip, look back at it, and then just keep scrolling for a while). The pictures bring back more memories, and that to me is a great joy. On the other hand, I find no value in deleting the photos that are not "the best" (I'm talking about good-enough photos; I delete blurry, useless or duplicate ones immediately). What would it gain? A few dollars less storage cost over the next decade? The satisfaction of digital cleaning?

    • philips 8 hours ago

      The photo apps really don't have a good "sorting through the shoebox" UX. It would be great if, in Google Photos for example, I could:

      1. Bulk-favorite photos for long-term archiving.

      2. Set a retention policy of months/years for photos based on metadata.

      3. Quickly sort the week's or month's photos via a swipe left/right UX.

    • Double_a_92 3 hours ago

      The problem with that is that what you might find important or beautiful changes with time.

      That random photo of your messy living room might be trash right now, but beloved in 20 years when you want to remember your old home.

  • Pooge 5 hours ago

    Are you going to be able to find the data you're looking for if judges demand it?

    Data that ends up in that sea of crap is very often poorly labeled.

    Data that you cannot find again is useless.

    If you take more than 3 minutes to find a picture you wanted to show me, it doesn't deserve to be shown anymore.

  • palata 5 hours ago

    I think you miss the point of the article, which is precisely that we store a ton of data we don't need to store in the first place.

    Photos of kids are obviously not part of that "crap" (even if we have too many of them and it would be worth triaging for our own sake).

    The question is: what makes up most of that data? Is it all business records, or is it stuff stored just because we can? I've worked at multiple startups; all were tracking users as much as they could. Adding tools that collect data is easy, and storing that data is cheap. "We may need it later." We never needed any of that crap, and it was invading our users' privacy.

    In any case, I think the article raises a good point: it's so cheap to store crap that we don't even think about it. And it's bad for the environment. Just like it's so cheap to take a plane that we don't think about it, even when taking the train takes almost the same time.

  • jongjong 10 hours ago

    True, when it comes to data there is a strong case for being a compulsive hoarder. It's cheap to store, often free, and data is often marginally more useful and interesting than the generic mass-produced old junk that sits in a garage gathering dust.

    • InsideOutSanta 5 hours ago

      Technology for sorting through unordered piles of data also gets better over time. When I started storing digital photographs, face detection wasn't a thing, but it is now, so those tens of thousands of photos from the 90s suddenly became more useful to me.

manyturtles 13 hours ago

About a decade and a half ago I worked on a large data migration project at a FAANG. Multi-exabyte scale, many clusters across many countries. Once everyone was moved the old storage platform wasn't completely empty, because the number of migrations was large and users were (naturally) more focused on ensuring their data was in place and available on the target platform rather than ensuring every last thing was deleted on the legacy platform. We weren't initially concerned about it because it would all get deleted when we turned down the old setup.

As we were gearing up to declare victory and start turning down the several dozen legacy storage clusters someone mused that given some users were subject to litigation holds -- not allowed to delete any data -- that at least some of the leftover data on the old system might be subject to litigation hold, and we'd need to figure that out before we could delete it or incur legal risk. IIRC the leftover 'junk' data amounted to a few dozen petabytes spread across multiple clusters around the world, in different jurisdictions. We spent several months talking with the lawyers figuring that out. It was an interesting dance, because on the one hand we were quite confident that there was unlikely to be anything in the leftovers which was both meaningful and not migrated to the new platform, while on the other hand explaining that it wasn't practical to just "go and look" through a few dozen PB of data. I recall we ended up somewhere in between, coming up with ways to distinguish categories of data like caches and working data from various pipelines. It added over six months to the project, but was quite an interesting problem to work through that hadn't occurred to any of us earlier on, as we were thinking entirely in technical terms about infrastructure migration.

  • isaacremuant 5 hours ago

    That does sound very interesting. Any insights on what you would do differently if you had to do it again? Any way to accelerate things now that you know the pain, or do you think it's quite unavoidable "legal time"?

mrb 5 hours ago

Some fun math: according to some estimates there are 175 zettabytes of data worldwide. Assuming 20-terabyte harddrives, this could be stored on 8.8 billion drives. Assuming 10 drives per rack unit, 42 RU per rack cabinet, and 16 square feet per cabinet (including aisle space), you would need about 330 million square feet of data center space to host this data. If it were hosted in a single square data center, it would be 3.5 miles long and wide. (I always like to picture the physical space something would occupy.) Energy-wise, assuming 5 watts per drive, it would consume 44 gigawatts, so it could be powered by about two large hydro dams similar to the Three Gorges Dam (22 gigawatts capacity). I am assuming a PUE close to 1.0 for simplicity. Of course one would not be able to spin up all these drives at once, since a drive spinning up consumes about three times more power (15 watts), so you would definitely want staggered spin-up when booting the servers :-)
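
The arithmetic, for anyone who wants to fiddle with the assumptions:

  # Back-of-envelope: floor space and power for 175 ZB on 20 TB drives.
  ZB = 1e21
  drives = 175 * ZB / 20e12           # ~8.75e9 drives
  racks = drives / (10 * 42)          # 10 drives/RU, 42 RU per cabinet
  sqft = racks * 16                   # 16 sq ft per cabinet incl. aisles
  miles = (sqft ** 0.5) / 5280        # side of a single square data center
  gw = drives * 5 / 1e9               # 5 W per drive, PUE ~1.0

  print(f"{drives:.2e} drives, {sqft:.2e} sq ft, "
        f"{miles:.1f} mi on a side, {gw:.0f} GW")
  # -> 8.75e+09 drives, 3.33e+08 sq ft, 3.5 mi on a side, 44 GW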

If 90% of this data is "crap" and could be cut down, it would still be just a drop in the bucket compared to worldwide energy use.

  • andyp-kw 2 hours ago

    I wonder how it compares to electric car usage, or street lights in the USA.

kdamica 5 hours ago

To repurpose an old saying: "90% of my data is crap. I just don't know which 90%"

bnewbold 10 hours ago

I agree with the general sentiment here, but don't like the examples. 200 photos per person per year isn't very much! That is all fine.

What really bloats things out is surveillance (video and online behavioral) and logging/tracking/tracing data. Some of this ends up cold, but a lot of it is also warm, for analytics. It bloats CPU/RAM/network, which is pretty resource intensive.

The cost is justified because the margins of big tech companies are so wildly large. I'd argue those profits are mostly due to network effects and rentier behavior, not the actual value of the data being stored. If there were more competitive pressure, these systems could be orders of magnitude more efficient without any significant difference in value/quality/outcome, or really even productivity.

boznz 14 hours ago

Don't forget emails... I have everything I ever sent or received, and I have it backed up. I expect 90% of my inbox is the jpg signature logo attached to the bottom of my clients' emails rather than hyperlinked.

  • duxup 14 hours ago

    I ended up working on some software and I was deemed the email guy (it's a very small % of my job but it is the biggest pain).

    "I need an email when this happens.. and when this happens."

    The requests are endless, and I'm convinced there are people who, if they could, would do their entire job from their inbox and get everything and anything an application can do via email.

    The insidious problem is that it never solves anything. "I didn't get the email!" is a constant refrain (yes they did, they always did). "Oh someone didn't do the thing so can you add this to the email too." and so on.

    It is such an abused tool.

    • mrweasel an hour ago

      We had a similar request from a sales team once: "If this fails we want an email." Okay, but we've never seen this fail, so you'll never get the email. As expected, they never got an email, because there were no failures. So instead they wanted an email every time the job ran (once every night), so they'd know the job had failed if they DIDN'T get the email. The only time that happened, I think, was because the email got trapped in a spam filter.

      There must be thousands of copies of that email sitting in inboxes saying: Job X ran successfully @ 04:30.

    • Cthulhu_ 6 hours ago

      I can see how that would work (email centric workflow), not unlike how some people now try to have a chat / Slack centric workflow.

      • saalweachter 2 hours ago

        Yeah, everyone knows the proper thing to do is orchestrate it all through Emacs.

    • Dylan16807 13 hours ago

      > The requests are endless and I'm convinced there are people who if they could would do their entire job from their inbox and get everything and anything an application can do via email.

      That sounds like a reasonable goal for a whole lot of job duties. And yes some entire office jobs. (Excluding some direct human communication but a lot of jobs already have too much of that in situations that could have been an email.)

      > "I didn't get the email!" is a constant refrain (yes they did, they always did).

      Well having to manually check wouldn't improve that, would it?

  • HPsquared 13 hours ago

    It's probably deduplicated on the server, though, so the millions and millions of messages with that logo likely share the same piece of disk space. That's probably one reason free providers don't tend to offer end-to-end encryption: it prevents deduplication (and otherwise compressing redundant information).

    • palata 6 hours ago

      > Probably one reason why free providers don't tend to offer End-to-End encryption.

      I would think that email providers tend to not offer E2EE just because it fundamentally isn't practical with email. Providers like Proton try to do it, but it works only if you talk to someone else on Proton (at which point you may as well do it on Signal, which has better encryption).

    • justsomehnguy 12 hours ago

      Nope.

      Back in the day, Exchange offered SIS (single instance storage), but they ditched it in 2010; it's plainly not effective any more. Even the OP's "jpg signature logo" is part of the multipart body of the message, not a separate file.

      And one more thing: you can't just turn on dedup and be dandy. Now you need to check hashes to determine whether each incoming chunk is unique or already stored, and with TBs of data that's a huge index of hashes to check against. Unless you have something like 99% dedup efficiency, i.e. 99% of your incoming data is literally data you already have, it isn't worth it.

      https://techcommunity.microsoft.com/blog/exchange/dude-where...
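
      For reference, a minimal sketch of the mechanism being weighed here (content-hash dedup over fixed-size chunks):

        # Store each chunk once; reference it by hash thereafter. A real
        # system must keep this index hot and consult it on every
        # incoming write, which is the cost described above.
        import hashlib

        CHUNK = 64 * 1024
        store = {}                            # hash -> chunk bytes

        def write_deduped(data):
            refs = []
            for i in range(0, len(data), CHUNK):
                chunk = data[i:i + CHUNK]
                h = hashlib.sha256(chunk).hexdigest()
                store.setdefault(h, chunk)    # only the first copy is kept
                refs.append(h)
            return refs                       # a "file" = list of chunk hashes

        write_deduped(b"same logo " * 100_000)
        write_deduped(b"same logo " * 100_000)
        print(len(store), "unique chunks stored for two identical ~1 MB writes")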

      • Sprite_tm 11 hours ago

        Not sure about Microsoft solutions, but modern file systems like ZFS and Btrfs can do this at the filesystem level; no support from the mail server is needed.

        • tored 2 hours ago

          NTFS and ReFS both have deduplication support (only on Windows Server).

      • LegionMammal978 11 hours ago

        Also, I can imagine such schemes getting ditched for the same reasons of "now people can detect whether we already have a copy of the file, which is a privacy hole!" that killed cross-origin caching in browsers.

        Or if everything in the year 20XX gets pushed into using E2E encryption, since that's pretty much antithetical to deduplication.

  • jeffbee 14 hours ago

    Well "the cloud" will generally store exactly one logical copy of your static jpg, one of the reasons why clouds are pretty good for efficiency.

    There is really no sense at all in the article's claims that "we are destroying the environment" to do the x, y, and z things the author whines about. We are destroying the environment to drive a Dodge Ram to the Circle K to buy a 52oz Polar Pop. The information sector doesn't even show up on top-ten lists of things that are destroying the environment.

  • hobs 14 hours ago

    That's not even crap data; that's archived data that might be useful someday (though dedupe is probably a great idea, and email sigs are definitely wasteful trash). Most of the crap data is things that would never have been useful under any circumstances.

    I have cleaned up dozens of product databases in cost-management efforts and found that anywhere from 50-99% of the data stored in them is crap, because they are not well managed and any single mistake can have a huge, outsized impact on storage.

    Want to log all those HTTP requests for just a day? Might as well turn that on for all time...

otterley 5 hours ago

90% of libraries consist of books that are never opened. These books were all produced by destroying and processing trees, sometimes with toxic chemicals, and their information density is orders of magnitude lower than that of a hard disk or SSD. Same with photo processing, where 90% of photos taken are discarded, and the toxicity of the chemicals is even higher.

So the question isn't simply whether storage is wasted; it's how much waste there is relative to the environmental impact. Granted, books and photographs don't need to be continuously fed energy to make the information available. However, the cost of storage is now so cheap that even with 90% waste, it's economically viable to keep it online. So the problem, if you can call it one, is that energy is too cheap, and externalities are not accounted for in the cost.

  • mrweasel an hour ago

    Now we're also getting into the topic of "what is waste", because the majority of the books that are opened are rehashes of the same murder mystery over and over and over.

    I'd guess that 75% of all new books sold here are variations on: "Someone is murdered in a brutal fashion. An old drunken cop from somewhere in Scandinavia is assigned the case. He's helped by a young woman, who may be his daughter or with whom he'll form a father-daughter relationship. They solve the case, maybe. The end." You just tweak the details a little, but it's the same bloody story over and over.

    That seems like such a waste of paper in my mind.

  • mvc 3 hours ago

    > 90% of libraries consist of books that are never opened.

    Citation required. But don't bother, because it's a meaningless statistic, or at least one designed to make it look like there's a lot more wastage in libraries than there actually is.

    The statistic could be true, and yet it could still be the case that the vast majority of library books are well utilized.

    • otterley an hour ago

      I should have been less specific than “library” as such. Libraries aren’t only institutional. Consider every book, magazine, and newspaper that sits today in everyone’s home, office, and in every institution in the world. The vast majority of them have been read once and then left on a shelf or in a box somewhere, taking up more or less valuable space.

  • plasticeagle 3 hours ago

    > 90% of libraries consist of books that are never opened

    I'm reasonably certain that this statistic is completely made up. The best number I can find for the proportion of library books that are never borrowed was from a university library, and was 25%.

  • foobahify 5 hours ago

    Yes. Rather than compare to paper, compare to yesterday's HDDs. Do we need more HDDs today than then?

tbrownaw 12 hours ago

Storage being cheap enough that it's not worth policing doesn't seem very consistent with it being expensive enough to involve much energy use (which I assume is what the "destroying the environment" hyperbole is referencing).

Jean-Papoulos 4 hours ago

The other day I had to go through 15-year-old PowerPoint files to grab the originals of pictures made by a now-retired guy who was extremely proficient at creating detailed artwork from PowerPoint shapes. We can now render them to full-HD PNGs instead of the 256x256 BMP files we were using before.

Storing "useless" data makes financial sense.

sali0 10 hours ago

When traveling, a funny thought I always have is watching other tourists take the same photos, from the same exact location, knowing it will be backed up to iCloud. I can't help but imagine how much disk space is taken up by duplicates of the exact same photo.

  • pants2 9 hours ago

    That and videos of concerts / fireworks shows.

    For a fun example, every time I'm on the Las Vegas strip I see dozens of people taking videos of the Bellagio Water Show.

    There are 30 shows per night; if 50 people take videos in 4K 60fps (the default on new iPhones), that's around 60 GB of data per show, or ~600 TB per year of just videos of the Bellagio fountain show!
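
    Rough numbers behind that, assuming roughly 400 MB per minute for iPhone 4K60 HEVC and a ~3-minute clip:

      # Back-of-envelope for the fountain-show videos.
      gb_per_min = 0.4          # ~iPhone 4K60 HEVC
      minutes = 3               # a typical clip
      people, shows, days = 50, 30, 365

      per_show_gb = people * minutes * gb_per_min        # ~60 GB
      per_year_tb = per_show_gb * shows * days / 1000    # ~657 TB

      print(f"{per_show_gb:.0f} GB per show, ~{per_year_tb:.0f} TB per year")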

srmarm 4 hours ago

Some fair points, but different kinds of "crap data" have storage requirements orders of magnitude apart from each other.

One short video can equal a year's worth of emails for someone. Similarly, the many webpages that don't get viewed often probably require only a negligible amount of resources to keep online, and they might help someone who'd otherwise be faced with linkrot.

Best to focus on the low hanging fruit.

nameless912 13 hours ago

There's another dimension to this: storage is so cheap that being wasteful with it isn't really disincentivized. I know, for example, of a portal at work that accepts uploads of large files from external clients and permanently stores both the initial upload and every subsequent transformation of the file (of which there are 4-6). It's extremely useful for debugging: one of the bits of metadata we shove on the zip archive is the git hash of the code that was running, so it's trivial to pull down any failed step and diagnose what happened.
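
Something along these lines, presumably; zip archives carry a free-form comment field that fits this kind of tag (the filename here is made up):

  # Tag an artifact with the code version that produced it.
  import subprocess, zipfile

  sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).strip()
  with zipfile.ZipFile("step-output.zip", "a") as z:
      z.comment = sha        # bytes; written when the archive is closed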

We are using 4-6 times as much storage as we need to, and these are often not small files (on the order of 100 MB - 5 GB, several dozen times a day) but fixing this overuse is so far down the priority list that I don't think it survived the great Jira purge of mid-2024.

  • eadmund 10 hours ago

    > storage is so cheap that being wasteful with it isn't really disincentivized

    I think another way of phrasing that is that usage is correctly incentivised. In the example you give, the value for debugging exceeds the cost of storage; and even if that's not the case, fixing it is so low-priority that it might not even be on your list of priorities anymore!

    That literally means that it’s worth your limited, valuable time to do something else.

hannob 6 hours ago

I have seen takes about the environmental footprint of unused data before, and I have severe doubts this is a relevant issue.

I mean, sure, there is some impact. Storage media has to be produced. But there's a reason storage is cheap, it's not a whole lot of resources going into it. And hard drives that are idle in some data center without being accessed don't consume a lot of electricity.

There are very real and concerning problems with the environmental impact of IT. But they are primarily found in other areas. Energy consumption is mostly a function of "how much you compute with data", not "how much data you have".

In other words: be concerned about so-called "AI", be concerned about Bitcoin. Don't worry about unused data too much.

alright2565 10 hours ago

It's pretty simple: sorting through the data to determine which 90% is crap is more expensive and uses more of our scarce resources than just storing it all.

mproud 10 hours ago

There are a few arguments here, and yeah, I get it, some of the shit being stored is shit — or shitty versions of content already stored. But the argument that “these pages haven’t been reviewed in 20 years!” is exactly the reason for preservation! We want to be able to read, listen, and review content that is rare.

wvenable 8 hours ago

I hate searching for vintage computer stuff only to find broken hyperlinks to lost pages that once contained useful information or software that is now gone forever.

  • Borg3 5 hours ago

    Yeah, because we do NOT mirror useful information. Only crap :) I remember that years ago (20 years, even 10 years ago) it was easy to mirror stuff. These days web pages are loaded with JS junk, so it's hard; I just extract the interesting text and put it into my storage.

umutisik 12 hours ago

Make the cost of operating data centers reflect the damage to the environment and see how quickly people optimize. I don’t know how damaging storage is, but that’s the only way to make a difference.

  • jeffbee 10 hours ago

    The cost to operate a data center is already, to a close approximation, the cost of its electricity. The data center operator already pays for every joule.

    Data centers used ~4% of USA electricity last year, about .15 quads, while the chemical industry used 10 quads, virtually all in the form of fossil fuel. If you are going to start reflecting the externality of energy consumption in the price of goods, information sector will probably not need to adjust anything, while the chemical sector will be fundamentally impaired.

jodrellblank 12 hours ago

It's a problem for us when it comes to GDPR and the right to be forgotten. Companies will say they store your data carefully, will say they have shown you everything they have, will say they delete it, but "the company" in aggregate has no idea there are a thousand SharePoint sites, ex-employees' mailboxes and file stores, copies of old file servers from before a migration, and test databases containing copies of real data from a half-abandoned project where new management fired the contractors and then never got around to hiring new ones.

  • otterley 4 hours ago

    The right to be forgotten was never a practically enforceable right to begin with. It’s a right that burdens others with perpetual effort and cost, unlike most rights, which only require others not to do things. I’m not sure why Europeans are so obsessed with it.

tromp 5 hours ago

> the cost of storing data is less than the cost of figuring out what to do with the crap.

Or the cost of figuring out that it's not worth saving...

zekenie 11 hours ago

Not my area of expertise but I don’t think storage is a big deal, relatively speaking. It’s all about the compute

SturgeonsLaw 10 hours ago

90% of everything is crap. I think there's a law about that or something

smetj 8 hours ago

> Why were they created?

Proof of work. Look at all this data I/we created.

And the article didn't talk about logs and other operational data yet

RobotToaster 13 hours ago

I imagine there's a similar statistic for book shops and public libraries.

donatj 9 hours ago

Burn the libraries, they're full of books no one's read in years /sarcasm

floppiplopp 2 hours ago

That's a pretty optimistic estimate, tbh.

matt-p 13 hours ago

Must be more, surely.

hoseja 5 hours ago

"We’re destroying our environment to (...)"

No we're not. I really dislike this "environmental" anti-technologist angle. A single steel plant in China has tenfold the "environmental impact" of all the photos stored on platters everywhere.

Would you prefer the photos are a cocktail of weird chemicals on a negative and printed on glossy photo paper?

Digital data is the most ephemeral we are able to make it, through vast effort.

qingcharles 12 hours ago

Photos never seen by humans again may still have plenty of value for the AI overlords to examine. These things have value again.

Didn't Facebook start to move most of their least-used data onto optical arrays a long time ago?

ctoth 13 hours ago

Surely a few more nines if the base rate is already Sturgeon.

the_real_cher 4 hours ago

> we're destroying our environment

citation needed

imcritic 3 hours ago

WON'T SOMEBODY, PLEASE, THINK ABOUT THE ENVIRONMENT!?

ein0p 13 hours ago

A gross underestimate, IMO. When I was in big data, fewer than 5% of the data written was ever touched again, and only a single-digit number of our large customers (out of tens of thousands) actually made real use of their "big data"; they created most of the load. That's the trouble with "checkbox-driven development": 10 years ago you were required to have a "big data strategy" for anyone to take you seriously, even if your strategy boiled down to ETL-ing a bunch of crap you're never going to need into the cloud and never touching it again. Now I'm in AI, and the same thing is happening with AI. It's great if you're selling shovels, so to speak, but not so great if you plan on selling them for an extended period of time.

This, by the way, has implications for storage system design. You want something that's cheap yet dense to encode, potentially at a slight expense in decode speed. Normally people lose sleep over decode speed first and foremost, which, while important, does not minimize the overall resource bill.

bocytron 6 hours ago

It seems you're all missing the point here: it's not about storing useless data, it's about destroying the environment in the process. I understand you all want to keep everything, just in case, because it's cheap and you don't see the externalities. But there are externalities, and they are big.

  • palata 5 hours ago

    This.

    We need to think about the data we store before we store it, store only the data we actually need, and keep it only as long as needed.

    It reminds me of CI. It's now so easy to throw 40 jobs onto GitHub Actions that people don't think about them. I was at a startup where people would debug in CI: they wouldn't have e.g. Windows on their machine (maybe they should have, given that their product was supposed to run there) and would fix compilation issues by sending patch after patch to the CI. Every single push would trigger all 40 jobs. Sometimes you could see a patch sent every 5 minutes for 3 days (when reproducing the issue locally would take 3 seconds, not 5 minutes). They did not even bother disabling the 39 uninteresting jobs.

    For open source projects it's just wasted energy; for private repos it was costing the company a lot. This was just malpractice, but nobody cared. The finance person would say "GitHub is expensive", the CEO "well, we need it", and the engineers "I don't want that Windows crap on my computer", I suppose.

  • andybak 5 hours ago

    In which case we should be reading an article about how important it is to correctly price externalities. And that is not this article.

    • palata 5 hours ago

      Well, that article clearly says "it's hurting the environment, and in my experience the vast majority of that data is useless".

      Which I believe is not uninteresting, given the number of answers here where people say "Is that data useless? I don't know, I could imagine that it's not; I think it's a hard problem." Well, here we have one person saying "I have experience with this, and I can tell you that most of it is useless." Just a data point, but that's still interesting.

ltbarcly3 11 hours ago

> We’re destroying our environment to store copies of copies of copies of stuff we have no intention of ever looking at again

While I agree that most of the stuff in data centers is probably crap, that's because most of everything people do is crap. It's not for me to decide, though; people save things because they find value in them. Most of what has value to another person won't have value to you. Most of what people treasure in their life is thrown away after they die because nobody wants it, not even their closest family members. Who gets to tell everyone the bad news that, objectively, their memories are trash and they don't have a right to keep them anymore? Gerry Fuckin' McGovern?

Secondly, we aren't destroying the environment for any of this. Data centers use like 5% or less of overall electricity. It's a lot, but we don't have to put data centers in random locations; we can (and do) put them where electricity is cheap. That generally means the 5% of electricity used for data centers is, kWh for kWh, less impactful than an average kWh of end use. Large companies like Meta and Google claim net-zero carbon by obtaining offsets. So in general we aren't "destroying the environment" to store copies of photos.

  • cyberjerkXX 10 hours ago

    I couldn't agree more. The data stored in data centers has "some" value, otherwise people wouldn't pay to host it. The author comments anecdotally about his experience while pearl-clutching about the environment. Typical.

kkfx 4 hours ago

Not too related, but... have you ever tried to imagine a world where computing goes back to interconnected desktops/home servers? If you look at trends like {fog,edge}-computing or recent ideas for distributed LLM computing (BrianknowsAI DCI Network, for instance), it's clear the giants know well that they can't keep up the modern mainframe model, preferring to make users pay for the iron and bandwidth while they handle the software stack.

Now, it's clear this new deal could only be implemented in homes/sheds with domestic PV and storage. Smart cities keep failing, from the ancient Fordlandia on; see Neom, Songdo, Masdar, PlanIT Valley, Lavasa, Ordos, Santander city, Toronto Quayside (Google Sidewalk Labs), Amazon HQ2, Egypt's still-nameless new Cairo, Modi's Indian 100-smart-city program, Arkadag, Innopolis, Nusantara, Proton City, ... and they can't be powered by a smart grid at such scale.

So: new, well-insulated buildings, with ventilation of course, with PV and storage, with room for a domestic rack (or two), with FTTH. Anyone with such a settlement could have his/her own "datacenter" at home, following the same trend as medical devices, which keep getting cheaper and smaller. A LOM? Well, a NanoKVM PCIe or an external JetKVM costs MUCH less than classic LOM and does much more. We have all the gear for such "datacenter at home" assemblies, with everyone holding their own preferred crap and participating in distributed computing networks to pay off at least a bit of the gear and bandwidth.

It's not for everyone, of course. Some will be trapped in dense cities while some large owners dream of an obviously impossible conversion of offices into apartments and datacenters, like https://finance.yahoo.com/news/southern-californias-hottest-... or https://www.euronews.com/next/2024/02/29/madrid-to-convert-u... and https://www.theguardian.com/society/2025/jan/05/office-to-ho... or https://czechdaily.cz/half-of-pragues-office-buildings-are-a... etc. all over the developed world. That's even as we admit (https://doi.org/10.1073/pnas.2304099120) that we need a full-remote, DISTRIBUTED shift.

Food, meds, and general retail distributed by a single integrated logistics platform for maximum efficiency in a spread-out society: the IT evolution makes Distributism possible.

Doing so erases the datacenter problems of concentrated energy, dense networking, heat handling and water, and it also reduces the crap, because everyone keeps their own personal data, and since keeping it isn't free, they'll learn to be storage-conscious.

cactusplant7374 13 hours ago

Fortunately, with LLMs we can generate data and not store it. If you think about it, that means articles could always be relevant, with additional details added as time goes on.

  • tylerrobinson 13 hours ago

    No, the reality of this situation is that people would need to save a snapshot of what the LLM said when they read it, just in case they need to substantiate it or blame someone for what they read at the time.

    • cactusplant7374 an hour ago

      No, it requires a shift in thinking. There are no guarantees with information one reads as it is.

  • irjustin 13 hours ago

    At the moment I don't like this, for fear of revisionist history. The equivalent would be if Wikipedia didn't have to have source references.

    I'm not saying it's infallible, but yeah, we need some chain of audit, at least 1 or 2 layers deep.

xyst 11 hours ago

> because when it comes to technology, these managers exist on a whole other level of stupid vanity and narcissistic pursuit of their own selfish agendas

Regardless of what you think about the article, this rings so true at many Fortune 500 companies.

The number of times I have seen teams work through pointless bullshit to push some meaningless objective for the company, just so the middle manager (aka "Director or SVP of X product of Y branch") can get a bullet point in the quarterly "all hands".

Oh, and those 10 developers/offshore people who were just hired? It was all to pump his/her "head count" number to get promoted to the next grade/level.

Then when that person gets promoted, those people get scattered throughout the firm or just let go.

It’s truly just weaponized incompetence.

chiggsy 12 hours ago

> Having to deal with senior managers has always been the most unsavory part of my job, because when it comes to technology, these managers exist on a whole other level of stupid vanity and narcissistic pursuit of their own selfish agendas.

Whoever sent this dude made a mistake. People who don't share your worldview need to be persuaded, not insulted! Some dude stomps in, thinks all the snaps in the cloud are crap, thinks the big bosses are stupid for not instantly deleting the pictures they saved to the cloud... and then what? Download Lisp? Thought we got over this, pal.

WORSE IS BETTER.

P.S. do not erase our porn. WORSE IS BETTER.

eadmund 10 hours ago

> We’re destroying our environment to create and store trillions of blurred images, half-baked videos, rip-off AI ‘songs’, rip-off AI animations, videos and images, emails with mega attachments, never-to-be-watched-again presentations, never-to-be-read-again reports, files and drawings from cancelled projects, drafts of drafts of drafts, out of date, inaccurate and plain wrong information, and gigabytes and gigabytes of poorly written, meandering content.

Storing the files for Mr. McGovern’s website requires plastics, metals, power and physical space, yet I assume he believes that environmental effect is worthwhile. Who is he to decide for others that their choice to pay for the storage of data is not equally worthwhile to them?

That’s the beauty of a price system: each of us gets to decide what we will buy, and what we will not buy.

Now, perhaps his argument should be that the price of storing digital data does not adequately reflect the true cost. Perhaps there are unaccounted-for externalities. If so, then he should make that argument, perhaps arguing for a tax to align prices with costs.

Someone else might argue that data is a liability as well as an asset. That’s another argument he could make.

But haranguing folks for spending their money in ways he doesn’t like doesn’t seem likely to produce the outcome he appears to wish.