Peter Wayner
Contributing writer

11 technologies improving database security

Feature
Jul 12, 2021 | 8 mins
Data and Information Security, Data Privacy, Encryption

The database does not have to be a security and privacy liability. These technologies can reduce risk and help ensure regulatory compliance.


Databases hold vast amounts of personal information, including some very sensitive tidbits, creating headaches for the companies that must curate them. Now, sophisticated tools and technologies are making it possible for database developers to have their cake and eat it too: to put the information to work while still keeping it private.

The solutions depend upon a clever application of math. Some of the simplest mechanisms are just modern versions of secret codes, essentially digital versions of the classic decoder wheel. Others are more complex extensions that push the math to deliver more flexibility and accountability. Many are practical versions of ideas that have been circulating in labs for decades but are finally stable enough to be trusted.

The algorithms are becoming the foundation for cementing business relationships and ensuring accurate and fraud-free workflow. These approaches are making it simpler for companies to deliver personalized service to customers while protecting their secrets. And they’re enabling better compliance with regulations that govern the flow of data without hampering the delivery of service. 

Here are 11 tools and technologies that are making it simpler to trust databases.

1. Basic encryption

The simplest solution is sometimes sufficient. Modern encryption algorithms lock up data with one key so it can only be read by someone who possesses the key. Many databases can encrypt data using standards like AES. These solutions are strongest against the loss of hardware, perhaps by theft. Without the right encryption key, the data remains secure.

There are limits, though, to how much symmetric encryption can protect a running system if an attacker manages to sneak in: the same key that allows the database to process legitimate operations could be found by the attacker. Many databases offer an option to encrypt information “at rest.” Oracle, for example, calls its option “transparent data encryption” to emphasize how little the developer must do.
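
To make the idea concrete, here is a minimal sketch of encrypting a record before it is stored, using AES-GCM from the third-party Python cryptography package. The key handling is deliberately simplified; a real deployment would load the key from a key-management service.

```python
# A minimal sketch of encrypting a record at rest with AES-GCM,
# using the third-party "cryptography" package (pip install cryptography).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # in practice, load from a key-management service
aesgcm = AESGCM(key)

record = b'{"name": "Alice", "ssn": "123-45-6789"}'
nonce = os.urandom(12)  # AES-GCM requires a unique nonce for every encryption
ciphertext = aesgcm.encrypt(nonce, record, None)

# Without the key, stolen ciphertext is useless; with it, decryption is trivial.
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == record
```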

2. Differential privacy

This technique deploys math in a different way. Instead of locking up the information in a digital safe, it adds a carefully tuned amount of noise, making it hard to figure out which record corresponds to a particular person. If the noise is added correctly, it won’t distort many statistics, like averages. If you add or subtract a few years at random from the ages in a dataset, the mean age will remain essentially the same, but it becomes difficult to identify a person by their age.
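
As a rough sketch of how the noise is added, the snippet below perturbs each age with draws from a Laplace distribution, the mechanism most commonly used in differential privacy. The scale parameter is simplified for illustration and is not a carefully calibrated privacy budget.

```python
# A toy illustration of differential privacy: add Laplace noise to each age.
# The noise scale (sensitivity / epsilon) is simplified for illustration only.
import numpy as np

rng = np.random.default_rng(seed=42)
ages = np.array([23, 31, 44, 52, 29, 61, 38, 47])

epsilon = 1.0        # privacy budget: smaller means more noise, more privacy
sensitivity = 1.0    # how much one person's record can change the statistic
noisy_ages = ages + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=ages.shape)

print(ages.mean())        # the true mean
print(noisy_ages.mean())  # close to the true mean, but no single age is reliable
```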

The utility of the solution varies. It’s best for releasing data sets to untrusted partners who want to study the data, usually by calculating averages and cluster sizes. Many algorithms do a good job of adding noise in a way that doesn’t distort many of the aggregated statistics. Understanding which machine learning algorithms can still work well with distorted bits is an active area of research.

Microsoft and Google offer tools for integrating the algorithms with data stores and machine learning algorithms. Google’s Privacy-On-Beam, for example, integrates the noise-adding mechanism with Apache Beam pipeline processing.

3. Hash functions

These computations, sometimes called “message digests” or “one-way functions,” boil down a big file to a much smaller number in a way that is practically impossible to reverse. Given a particular result or code, finding a file that produces that particular code would take far too long to be feasible.

These functions are an essential part of blockchains, which apply them to every change to the data in a way that can track and identify tampering. They prevent fraud in cryptocurrency transactions, and many teams are applying the same techniques to other databases that need assurance that the data is consistent. Adding them can help with compliance challenges.

The Secure Hash Algorithms (SHA) from the National Institute of Standards and Technology (NIST) are a collection of widely used standards. Some of the earlier versions, like SHA-0 and SHA-1, have known weaknesses, but the newer versions, like SHA-2 and SHA-3, are considered very secure.
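
Python’s standard hashlib module ships these standards, so producing a tamper-evident fingerprint of a record takes only a line or two:

```python
# Fingerprinting a record with SHA-256 from Python's standard library.
import hashlib

record = b"account=alice;balance=100"
digest = hashlib.sha256(record).hexdigest()
print(digest)  # a 64-hex-character fingerprint of the record

# Any change to the record, however small, yields a completely different digest.
tampered = b"account=alice;balance=900"
assert hashlib.sha256(tampered).hexdigest() != digest
```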

4. Digital signatures

Digital signature algorithms like RSA or DSA are more sophisticated computations that tie the tamper-detection properties of hash functions to a particular person or institution who certifies the information. They rely on a secret key that only the responsible party knows. Cryptocurrencies, for example, tie ownership of wealth to the person who knows the right key. Databases that track personal responsibility can include digital signatures validating particular transactions.
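
Here is a brief sketch using Ed25519, a modern signature scheme available in the Python cryptography package, chosen here for brevity; RSA and DSA follow the same outline.

```python
# Signing a transaction with Ed25519 via the "cryptography" package.
# The article names RSA and DSA; Ed25519 is used here as a modern example.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()  # known only to the responsible party
public_key = private_key.public_key()       # shared with anyone who must verify

transaction = b"transfer 100 from alice to bob"
signature = private_key.sign(transaction)

# Anyone holding the public key can check who authorized the transaction.
try:
    public_key.verify(signature, transaction)
    print("signature valid")
except InvalidSignature:
    print("tampered or forged")
```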

5. SNARKs

A succinct non-interactive argument of knowledge (SNARK) is a more sophisticated version of a digital signature that can attest to complex personal information without revealing the information itself. This sleight of hand relies on more sophisticated mathematics, sometimes called a “zero-knowledge proof” (ZKP).

Databases incorporating SNARKs and similar proofs can protect the privacy of users while ensuring that they’re complying with regulations. A simple example might be a form of digital driver’s license that certifies a person is old enough to drink alcohol without revealing their birthdate. Some are exploring applying the technology to vaccine passports.

SNARKs and other non-interactive proofs are an active area of research. Dozens of implementations in various programming languages make a good foundation for new projects.
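
Production SNARK libraries are too involved to show briefly, but the zero-knowledge idea behind them can be illustrated with a toy Schnorr-style interactive proof, in which a prover convinces a verifier it knows a secret without revealing it. The parameters below are deliberately tiny and insecure, chosen only to make the algebra easy to follow.

```python
# A toy Schnorr-style zero-knowledge proof of knowledge of a secret x,
# where y = g^x mod p. NOT a SNARK and NOT secure: the parameters are tiny.
import secrets

p, q, g = 23, 11, 2   # g has order q in the multiplicative group mod p

x = 7                 # the prover's secret
y = pow(g, x, p)      # public value the verifier already knows

# Prover commits to a random r; verifier issues a random challenge c;
# prover responds with s = r + c*x mod q.
r = secrets.randbelow(q)
t = pow(g, r, p)            # commitment
c = secrets.randbelow(q)    # verifier's challenge
s = (r + c * x) % q         # response (reveals nothing about x on its own)

# Verifier checks g^s == t * y^c (mod p) without ever learning x.
assert pow(g, s, p) == (t * pow(y, c, p)) % p
```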

6. Homomorphic encryption

The only way to work with data locked up with traditional encryption algorithms is to decrypt it, a process that can expose it to anyone with access to the computer doing the work. Homomorphic encryption algorithms are designed to make it possible to perform computations on encrypted information without unscrambling it. The simplest algorithms permit one arithmetic operation like, say, adding two encrypted numbers. More elaborate algorithms can do arbitrary computations but often at a dramatically slower rate. Finding the most efficient approach for a particular problem is an area of active research.
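
The classic example of that simplest case, adding two encrypted numbers, is the Paillier cryptosystem. Here is a sketch using the third-party phe (python-paillier) package:

```python
# Adding two encrypted numbers without decrypting them, using the Paillier
# cryptosystem via the third-party "phe" package (pip install phe).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

a = public_key.encrypt(42)
b = public_key.encrypt(58)

# The database can compute on ciphertexts it cannot read.
encrypted_sum = a + b

# Only the key holder can see the result.
assert private_key.decrypt(encrypted_sum) == 100
```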

IBM, one of the pioneers of research in this area, has released a toolkit for integrating its homomorphic encryption with applications for iOS and macOS.

7. Federated processing

Some developers are splitting their data sets into smaller parts, sometimes dramatically smaller, and then distributing them to many independent computers. Sometimes the locations are scrambled so that it is impossible to predict which computer will hold which record. These solutions are often built upon software packages designed to speed up work with so-called big data by running search or analysis algorithms in parallel. The original intent was speed, but increased resilience to attack can be a side effect.
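
Here is a schematic sketch of the pattern: records are scattered across shards by a hash, and an aggregate is assembled from per-shard partial results, so no single node ever holds the full data set. The records and shard count are hypothetical.

```python
# A schematic sketch of federated processing: records are scattered across
# independent shards by a hash, and aggregates combine per-shard partial sums.
import hashlib

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]

records = [("alice", 100), ("bob", 250), ("carol", 75), ("dave", 310)]

for user, amount in records:
    # The hash scrambles placement, so a shard cannot be guessed from the name.
    idx = int(hashlib.sha256(user.encode()).hexdigest(), 16) % NUM_SHARDS
    shards[idx].append((user, amount))

# In a real deployment each shard computes its partial result in parallel;
# only the partial sums, not the raw records, leave the shards.
partials = [sum(amount for _, amount in shard) for shard in shards]
print("total:", sum(partials))
```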

8. Fully distributed databases

If splitting a data set into several pieces can protect privacy, why not split it into a billion or more? An increasingly common solution is storing data directly where it is created and used. A user’s smartphone often has plenty of extra computational power and storage. If there’s little need for centralized analysis and processing, it can be faster and more cost-efficient to avoid shipping the data to a server in the cloud.

Many browsers, for example, support local storage of complex data structures. The W3C standards include a key-value store for document-style data (localStorage) as well as an indexed version (IndexedDB) for more structured, relational-style models.
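
As a rough on-device analog of that key-value model, the sketch below uses Python’s standard shelve module; the point is that the profile data lives in a file on the user’s machine and never reaches a server unless the application explicitly ships it.

```python
# An on-device key-value store sketch using Python's standard "shelve" module,
# standing in for the browser's key-value local storage: the data stays on
# the user's device unless the application explicitly sends it elsewhere.
import shelve

with shelve.open("local_profile") as db:   # creates a file on the user's device
    db["preferences"] = {"theme": "dark", "language": "en"}
    db["history"] = ["search: databases", "search: encryption"]

with shelve.open("local_profile") as db:
    print(db["preferences"]["theme"])      # read back with no server round trip
```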

9. Synthetic data

Some researchers are creating fully synthetic data sets, generating new values at random in a way that follows the same patterns and is essentially statistically identical to the original. A research think tank known as RTI, for instance, created a version of the 2010 US Census data filled with random people living at random addresses. The people were completely imaginary, but their home addresses and personal information were chosen to match the basic statistical profile of the real values. In many cases, researchers can test algorithms and generate solutions that are just as accurate as if they had worked with real data.
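
A minimal single-column sketch of the idea appears below: it samples a synthetic age column matching the real column’s mean and standard deviation. Real synthetic-data generators model much richer joint distributions, but the principle of matching the statistical profile is the same.

```python
# A minimal sketch of synthetic data: sample a fake "age" column that matches
# the real column's mean and standard deviation. Real tools model far richer
# joint distributions than this single-column toy.
import numpy as np

rng = np.random.default_rng(seed=0)
real_ages = np.array([23, 31, 44, 52, 29, 61, 38, 47])

synthetic_ages = rng.normal(loc=real_ages.mean(),
                            scale=real_ages.std(),
                            size=len(real_ages)).round().clip(0, 110)

print(real_ages.mean(), real_ages.std())
print(synthetic_ages.mean(), synthetic_ages.std())  # similar profile, imaginary people
```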

10. Intermediaries and proxies

Some researchers are building tools that limit data collection and preprocess the data before storing it. Mozilla’s Rally, for instance, tracks browsing habits for researchers who want to study the flow of information through the internet. It installs a special add-on for the duration of the investigation and then removes it at the end. The tool formalizes the relationship and enforces rules about collection and aggregation. 
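
Here is a sketch of the general pattern, with hypothetical event fields: the intermediary drops identifiers and stores only coarse aggregates.

```python
# A sketch of an intermediary that preprocesses events before storage:
# identifying fields are dropped and only coarse aggregates are kept.
# The event fields are hypothetical, for illustration only.
from collections import Counter

def collect(events):
    aggregates = Counter()
    for event in events:
        # Drop user identifiers entirely; keep only the coarse category.
        aggregates[event["category"]] += 1
    return dict(aggregates)

raw_events = [
    {"user_id": "u123", "category": "news"},
    {"user_id": "u456", "category": "sports"},
    {"user_id": "u123", "category": "news"},
]

print(collect(raw_events))  # {'news': 2, 'sports': 1}; no user IDs stored
```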

11. No data

Stateless computing is the basis for much of the web, and many drives for efficiency succeed by reimagining the work so that it requires as little record keeping as possible. In some extreme cases, when compliance makes it possible and users are willing to accept less personalized service, deleting the database can do the most for privacy.