Kyoto University in Japan has lost about 77TB of research data due to an error in the backup system of its Hewlett Packard supercomputer.
The incident occurred between December 14 and 16, 2021, and resulted in 34 million files from 14 research groups being wiped from both the system and its backup.
After investigating the impact of the loss, the university concluded that the work of four of the affected groups could no longer be restored.
All affected users have been individually notified of the incident via email, but no details were published on the type of work that was lost.
For now, the backup process has been suspended. To prevent this from happening again, the university has scrapped the backup system and plans to re-introduce an improved version in January 2022.
The plan is to also keep incremental backups - covering files that have changed since the last backup - in addition to full backup mirrors.
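As a rough illustration of how such a scheme might look (the university has not disclosed what tooling the new system will use, and the paths below are placeholders), incremental snapshots can be combined with full mirrors using a standard utility such as rsync, which hard-links unchanged files against the previous snapshot so that only changed files consume new space:

  # Hypothetical sketch, not Kyoto University's actual configuration.
  # Periodic full mirror of the storage volume:
  rsync -a /data/ /backup/full/
  # Daily incremental snapshot: files unchanged since the 2021-12-15 snapshot
  # are hard-linked rather than copied, so each snapshot stays small but still
  # presents a complete view of the data.
  rsync -a --link-dest=/backup/2021-12-15 /data/ /backup/2021-12-16/

With several independent snapshots on hand, an older copy can still be restored even if the most recent backup run misbehaves.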
Supercomputing is expensive
While the details of the type of data that was lost weren't made public, supercomputer time costs several hundred US dollars per hour, so this incident has likely caused considerable distress to the affected groups.
Kyoto University is considered one of Japan's most important research institutions and receives the second-largest amount of scientific research investment from national grants.
Its research excellence is particularly distinctive in chemistry, where it ranks fourth in the world, and it also contributes to biology, pharmacology, immunology, materials science, and physics.
We have asked Kyoto University to share more details on the incident and its impact on the research groups, but we haven't heard back yet.
Japan leading the field
Japan happens to have the most powerful supercomputer in the world at the moment, called "Fugaku", operated by the RIKEN Center for Computational Science in Kobe.
Fugaku is an exascale system made by Fujitsu, capable of computational performance of 442 PFLOPS. The second in the global list, IBM's "Summit", can reach a much smaller figure of 148 PFLOPS.
Fugaku cost $1.2 billion to build and has so far been used for research on COVID-19, diagnostics, therapeutics, and virus spread simulations.
Comments
midimusicman79 - 2 years ago
Does anyone other than Kyoto University's IT administrators know which backup and disk imaging software the university used?
And which alternative software will they purchase and use instead?
Or will these details remain secret?
In any case, given the severity of the error, the university will very likely sue the software vendor for millions in compensation.
yak_ex - 2 years ago
Hewlett Packard Japan, G.K. published a report (written in Japanese), which is linked from the official release (https://www.iimc.kyoto-u.ac.jp/ja/whatsnew/trouble/detail/211216056978.html) by the university.
The report says (roughly translated): "A bash script was overwritten by an update process while it was still running. This caused a find command to be invoked with undefined variables, which deleted data in /LARGE0."
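For illustration, here is a hypothetical reconstruction of the failure mode the report describes (the variable name, path, and retention period are invented; the actual script has not been published in full):

  # Cleanup script meant to delete only old log files.
  # If LOGDIR is unset -- for example because bash re-read the script after it
  # was overwritten mid-execution -- "${LOGDIR}/" expands to just "/", and find
  # then removes matching files across the whole mounted storage.
  LOGDIR=/LARGE0/project/logs        # hypothetical path
  find ${LOGDIR}/ -type f -mtime +10 -print0 | xargs -0 rm -f

Defensive options such as set -u or "${LOGDIR:?}" would make the script abort on an unset variable instead of silently expanding it to an empty string.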
The report also says that Hewlett Packard Japan, G.K. takes 100% responsibility for this incident.
I assume the backup process is proprietary to the university's supercomputer system.
CapHenning - 2 years ago
You would hope the researchers who generated the data have their own copies of it that could be used to rebuild the server.
midimusicman79 - 2 years ago
Thank you, yak-ex and CapHenning! :)
chadf - 2 years ago
"... due to an error in the backup system ..."
Are we sure a BOFH didn't just symlink the tape device to /dev/null again, to simplify backups? =)
PK88 - 2 years ago
You can never have too many backups. It looks like they learned that lesson a little late.
The_Ffakr - 2 years ago
"Fugaku is an exascale system made by Fujitsu, capable of computational performance of 442 PFLOPS."
It's not an exascale system if it can't do an ExaFlop.
Edit: ...and as soon as I hit post on this, I saw that Fugaku apparently hit an exaflop running a benchmark, so I stand corrected. But I still think we shouldn't be calling systems exascale until they can achieve that with real workloads. :-P
@CapHenning ... from my experience, the researchers probably don't have additional copies of the data. It's possible that, with 14 research groups affected, some of them bothered to copy their own data somewhere else for safekeeping.
But we're talking about 5.5 TB per group (on average), and it takes a long time to pull terabytes of data off over the wire, especially when they're spread across millions of files.
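(As a rough back-of-the-envelope figure, and assuming a sustained 1 Gbps link, 5.5 TB is about 44,000 gigabits, or roughly 12 hours of continuous transfer per group, before any per-file overhead.)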
My guess is most of them figured there was no point keeping it anywhere else, because you need it 'near' the supercomputer to do anything particularly useful with it, so they probably just trusted that it was safe there.
If that 77TB was all output, they can probably regenerate it, but as the author noted, CPU time on a supercomputer isn't cheap.
It costs a fortune just to keep all that equipment powered up, and keep it all from melting down. Even my rinky-dink data center (NOT designed for HPC) costs hundreds of thousands of USD in power to run every year.
JoeSimonson - 2 years ago
I want to read more news like this.