Locating your Data

Robert Anderson’s tongue is only partly in his cheek when he says, “The greatest thing ever for the HDD industry was the smartphone.”

Extending the life of a technology that has been around since the 1950s was probably not the primary motivation when phone makers put a content creation device in the hands of, potentially, every person on the planet.

But all those selfies and cat videos have contributed greatly to the 64.2 ZB of data created or replicated last year, according to IDC’s DataSphere and StorageSphere reports, which also predict 23 percent CAGR in data growth between now and 2025.

Amazingly, less than two percent of this data was actually stored and retained into 2021, but the StorageSphere of installed storage still hit 6.72 ZB of capacity last year and is set to grow at 19.2 percent CAGR to 2025. And, IDC warns, organizations – and individuals – should consider retaining even more data as they embark on digital transformation, increase resiliency, and accelerate data analytics initiatives.

So data creation is growing, and a bigger percentage of data will be stored permanently. But where? If you took a look inside end user devices, and the high-performance data arrays that make the headlines, you’d be forgiven for thinking that flash will soak up this tsunami of data.

But you would be wrong, argues Western Digital marketing director Robert Anderson. In fact, almost two thirds of data is stored on hard disk. That’s because most of the data being created and retained is heading to datacenters and the cloud – and things get cold there. The lifecycle of data is illustrated by what happens to the vast oceans of consumer-generated content created over the last decade and a half by smartphones, Anderson says. “When you’re getting it ‘for free in the cloud’, why would I ever delete anything?”

But of course, the cloud providers are monetizing this content, via advertising or by developing insights that can be sold on. The problem is, over time, that data is rarely if ever accessed. As Anderson points out, consider how much cell phone footage there is of concerts “that no one ever watches.”

The same cycle applies to corporate data. The deluge of information created in the normal run of business is swollen further by the flood of data from sensors, smart manufacturing, and video. Organizations assume that at some point, the right analytics software will come along and allow them to make sense of it and create value. So, they shrug and say, “Why delete it? Maybe I’ll use it later.” In the meantime, Anderson continues, they reason that “if it’s cheap enough to store it, I don’t want to spend the time to figure out what to keep and what to delete. That’s a waste of somebody’s time.”

THE CLOUD IS WHERE YOUR DATA GOES TO COOL OFF

Likewise, the cloud provider reasons: “I’ll take it all. And I’ll hold it for you. Because I can do that relatively inexpensively, and I can now monitor what you access, how frequently you access it, and I can then get more optimized around the structure of how I store your data … but if you want it back, then I’ll charge you [because] then I lose my monetization capabilities.”

The result, Anderson explains, is “this ever-growing anchor of legacy data sitting in these big public clouds.”

Anchor might have negative connotations, but really it just means that this data is moving from being a profit center to being a cost center, as providers face the challenge of reducing the cost of holding this data that maybe, one day, might need to be accessed again.

Traditionally, data that needed to be kept in perpetuity had a safe, low-cost home – tape. “There’s still a robust market for tape for archive purposes, especially for businesses,” says Anderson. “Government uses a lot of tape.”

The problem for the cloud services – and commercial and research organizations – is that there is always the chance someone will want to access that ancient data and will want it quickly. Tape doesn’t lend itself to this, even when it’s not stored down a salt mine. And for this reason, says Anderson, “it breaks the implied service level agreement.”

TAPE’S LONG TAIL

Instead, it is hard disk drives that are soaking up cooling data. The evolution of hard disks towards ever larger capacities is further cementing them into this role, Anderson explains.

As HDDs get bigger, “your absolute performance on a spec sheet is faster. But your access density is lower… I used to have five 4 TB drives working simultaneously to mine through 20 TB of data. Now I’ve got one 20 TB drive.”
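Anderson’s access-density point can be reduced to simple arithmetic: a nearline hard drive delivers roughly the same random IOPS whatever its capacity, so consolidating spindles dilutes IOPS per terabyte. The drive counts below come from his example; the per-spindle IOPS figure is an illustrative assumption, not a measured number.

```python
# Access density: random IOPS available per terabyte stored.
# Assumption: ~100 random IOPS per spindle, a typical order of
# magnitude for a 7200 RPM nearline drive of any capacity.
IOPS_PER_SPINDLE = 100

def access_density(num_drives: int, tb_per_drive: int) -> float:
    """Random IOPS per TB for a set of identical drives."""
    total_iops = num_drives * IOPS_PER_SPINDLE
    total_tb = num_drives * tb_per_drive
    return total_iops / total_tb

old = access_density(num_drives=5, tb_per_drive=4)   # five 4 TB drives
new = access_density(num_drives=1, tb_per_drive=20)  # one 20 TB drive

print(f"5 x 4 TB: {old:.1f} IOPS/TB")   # 25.0 IOPS/TB
print(f"1 x 20 TB: {new:.1f} IOPS/TB")  # 5.0 IOPS/TB
```

Same 20 TB of data either way, but one fifth of the spindles means one fifth of the random IOPS per terabyte – fine for cold data, hostile to hot data.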

If this means it might take seconds or minutes to retrieve data, that’s still a world away from the hours or days it may take to retrieve a tape from a remote location, never mind retrieve the data on it. “So, disk is starting to fill more and more of that gap down closer to tape,” says Anderson. “Does it eat tape for lunch? Not really. Because tape’s cost effectiveness is another tier.”

Does that mean that enterprises will no longer think in terms of traditional storage archives? “I’d say the exact opposite,” says Anderson. And although tape is the cheapest media for cold data, large capacity HDD also delivers TCO advantages at scale in the datacenter, whether that’s in the enterprise or at a cloud hyperscaler. “Assuming the operator needs 100 units of storage – terabytes, petabytes, whatever – well, if I can fit that in half as many racks, I’ve got half as much metal,” Anderson says. Then, by extension, “I’ve got fewer node systems in the rack, I’ve got half as much memory dedicated to each of them. I’ve got less networking because I’ve got fewer total appliances connected and less wiring. Even though the power per device goes up, my power per terabyte declines, so my total power declines and the space I take up declines.”

Counterintuitively perhaps, he says, “the higher your non-storage costs are, the more impactful these TCO savings are.”
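The rack-halving argument can be sketched as a toy TCO model. Every cost figure below is a hypothetical placeholder chosen only to show the shape of the effect: doubling per-drive capacity halves drives, racks, and the infrastructure hung off them, and even with higher power per device, total power falls.

```python
# Toy TCO sketch of the capacity-scaling argument above.
# All dollar and wattage figures are hypothetical, for illustration only.

def datacenter_tco(total_tb: float, tb_per_drive: float,
                   drives_per_rack: int = 1000,
                   rack_cost: float = 50_000,     # metal, nodes, memory, network per rack (assumed)
                   watts_per_drive: float = 8.0,  # per-device power (assumed)
                   cost_per_watt: float = 20.0) -> dict:
    drives = total_tb / tb_per_drive
    racks = drives / drives_per_rack
    infra = racks * rack_cost           # non-storage costs scale with rack count
    power = drives * watts_per_drive * cost_per_watt
    return {"drives": drives, "racks": racks,
            "infra_cost": infra, "power_cost": power,
            "total": infra + power}

# Same 100 PB of capacity, built from 10 TB vs 20 TB drives;
# the bigger drive is assumed to draw slightly more power per device.
small = datacenter_tco(total_tb=100_000, tb_per_drive=10)
large = datacenter_tco(total_tb=100_000, tb_per_drive=20, watts_per_drive=9.0)
```

Infrastructure cost halves outright with rack count, while power cost falls despite the higher per-device draw – and the larger `rack_cost` is relative to power, the bigger the proportional saving, which is Anderson’s “higher non-storage costs” point.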

WHAT DOES PERFORMANCE MEAN NOW?

This also explains why the traditional premium tier of enterprise hard disk drives – 10K RPM and 15K RPM devices – is effectively “an endangered species”. It is harder to grow capacity with these devices, Anderson says. At the same time, their advantage in performance is eaten away by SSDs.

Although SSDs might be taking more of the performance end of the market, they will not displace HDD technology overall, even in the medium-term. The performance demands of AI and HPC do indeed favour SSDs – but only while they’re running simulations or training or inference. When it comes to that rapidly cooling data, SSDs remain a hugely expensive option, says Anderson.

“The idea that I use flash for mainstream, large capacity, long term storage? It’s just not true. I mean, you can, and do some people do it? Yes. But it’s very, very expensive. Anything that needs a large dataset that you don’t want to pay a fortune to maintain storage of will end up on HDD. You’ll store your big AI and ML datasets on HDD, and then you’ll migrate them while you do your computation and move them back once you’re done.”

And although SSD shipments might be growing faster than HDD, they are coming from a much smaller base, while the easy “binary” switches, such as in client devices, have already been made, he says.

“So in absolute terms, the gap still widens for a long, long time,” says Anderson. “And it takes an incredibly long period of time for that to close.”

And during that period, both flash and HDD are likely to evolve. Shingled magnetic recording drives, for example, are still not mainstream. But with HDD increasingly soaking up colder, unchanging, rarely accessed data, the advantages of a format geared towards sequential data are obvious. The advantages are even more apparent when it comes to object data formats such as video, which almost by definition, are sequential.

So, flash might be the hot storage format that grabs the headlines, while tape, like a diamond, is forever.

But whether it’s selfies, corporate data, video streams, or the data lakes that enable recommendation engines or climate simulations, it’s the hard disk that will be soaking up the bulk of our data for years and possibly decades to come.
