HyperAI超神经

By Super Neuro

Today, rumors surfaced that a hacker is openly selling "Huazhu Hotel room booking data" on the dark web. According to the seller's post, the data covers Huazhu's own hotels and also includes user data from hotels under AccorHotels, Huazhu's partner. The asking price is 8 bitcoins (about 56,000 US dollars, or nearly 380,000 RMB). Huazhu Hotels has since responded publicly that it has reported the case to the police.

Huazhu Hotels Group (NASDAQ: HTHT), formerly known as Hanting Hotel Group, is the first full-brand hotel chain management group in China.

It was founded in 2005 and listed on the NASDAQ in the United States in March 2010. It currently operates more than 3,000 hotels, covering all levels of the market from high to low end.

Among them, the brands targeting the high-end market include Grand Mercure, VUE, and Joya; the mid-range brands include Ji Hotel, Crystal Orange, Orange Hotel Select, and Ibis Styles; the mass-market brands include Ibis, Hanting Premium, Hanting, and Hi Inn.

The data sold on the dark web this time includes three parts:

Registration information from Huazhu Hotels' official website, including:

Name, mobile phone number, email address, ID number, and login password, totaling 53 GB and covering the identity information of about 120 million people;

Identity registration information recorded when guests check in to a Huazhu hotel, including:

Name, ID number, home address, birthday, and internal ID number, totaling 22.3 GB and covering the identity information of about 130 million people;

Huazhu Hotel room booking records, including:

Internal ID number, room association number, name, payment card number, mobile phone number, check-in time, hotel ID number, room number, consumption amount, etc., totaling 66.2 GB, or about 240 million records.

Although Huazhu has announced that it has called the police, dark web transactions are very difficult to trace and to collect evidence on, and the data has presumably already leaked, so it is unclear what remedial measures can still be taken.

Data Hacking: A Gray Area Under the Sun

In fact, this is not the first time that such a large-scale leak of citizen information has occurred.

As early as July this year, a major suspected data leak case was exposed in China. As many as 11 companies were involved, and 4,000 GB of data containing tens of billions of citizen information records was seized.

The data involved in this case is highly private. The Internet URL data includes more than 40 information elements, such as mobile phone numbers and Internet base station codes, which record the specific online behavior of mobile phone users. Some of the data can even be used to directly open the home pages of citizens' personal accounts.

What is even more surprising is that the buyers of this data are not just fraud rings and online lenders. Many large Internet companies at home and abroad, including Google and Huawei, are important revenue customers of the company involved, which means that they all hold various kinds of private citizen data.

For R&D engineers at any AI company in the world, being able to obtain a large amount of real data is very helpful for developing AI models. It would be even better if the data is of high purity.

They can process data more conveniently and compare and evaluate models more efficiently, thereby coming up with correct solutions to real-life problems.

Let’s talk about GAN encryption, starting with the leakage of Huazhu Hotel room booking information

However, because of data confidentiality issues, the data these giants can share with each other is quite limited. So it is actually common in the industry for big companies to buy data.

This is not only true in China: users around the world lack a clear understanding of data privacy and confidentiality. To use various Internet products, they have no choice but to click "yes" on the user agreement.

The big guys buy the data, and then what?

The big guys spent a lot of money to buy the data, so of course they will make efficient use of this data.

They buy data, collect more data through their own products, and develop more secure encryption methods to protect the data they hold.

It is true that the weak will always be weak, and the strong will always be strong

As engineers, let’s talk about several commonly used data encryption methods and how to understand their properties and principles.

Inherently insufficient protection mechanism for anonymized data

Currently, the most common way to keep shared data confidential is to anonymize the data set, but in most cases this is still not a good solution.

Data anonymization provides some confidentiality by masking sensitive fields, but it cannot stop data experts from inferring what was hidden. In practice, the masked sensitive data can often be recovered by reasoning backward from related information.

Previously, a German researcher published a paper titled "Build your own NSA". The paper describes how to reverse data anonymization and recover the original information.

The researcher obtained a month's worth of web clickstream data from about 3 million Germans for free through a fictitious company. The data was anonymized, for example by replacing each user's real name with a random string such as "4vdp0qoi2kjaqgb".

Using each user's browsing history and other related information, the researcher successfully deduced the real names of users. It follows that data anonymization cannot ensure complete confidentiality.
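The article does not describe the researcher's exact method, so the following is only a minimal sketch of one plausible kind of linkage attack: some URLs (for example, an account's own settings page) are normally visited only by the account owner, so they reveal who is behind a pseudonym. The sample records, the URL layout, and the function name `reidentify` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical pseudonymized clickstream: (pseudonym, visited URL).
clickstream = [
    ("4vdp0qoi2kjaqgb", "news.example/politics"),
    ("4vdp0qoi2kjaqgb", "social.example/profile/anna.schmidt/settings"),
    ("4vdp0qoi2kjaqgb", "shop.example/cart"),
    ("x91kd02mmq7a", "news.example/sports"),
    ("x91kd02mmq7a", "social.example/profile/jonas.weber/settings"),
]

def reidentify(records, marker="/profile/", suffix="/settings"):
    """Map each pseudonym to the account name leaked by owner-only URLs."""
    candidates = defaultdict(set)
    for pseudonym, url in records:
        if marker in url and url.endswith(suffix):
            # Assumption: only the account owner opens their settings page,
            # so the account name in the URL identifies the visitor.
            name = url.split(marker, 1)[1][: -len(suffix)]
            candidates[pseudonym].add(name)
    # Keep only pseudonyms that resolve to exactly one candidate name.
    return {p: next(iter(names))
            for p, names in candidates.items() if len(names) == 1}

identities = reidentify(clickstream)
print(identities)
```

Replacing the name with a random string does nothing here, because the identifying information sits in the behavior itself rather than in the name field.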

The Chaos Communication Congress is hosted by the Chaos Computer Club, the largest hacker alliance in Europe. It mainly discusses computer and network security issues and aims to promote computer and network security.

Thus, homomorphic encryption was born

Homomorphic encryption is one of the breakthrough achievements in cryptography. Whoever decrypts learns only the final result of a computation and cannot obtain the individual values behind each ciphertext.

Homomorphic encryption can effectively improve the security of information and may become a key technology in the field of AI in the future, but for now, its application scenarios are limited.

To put it simply, homomorphic encryption means that my data can be used by you according to your needs, but you cannot see what the data is specifically.
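To make the idea concrete, here is a minimal sketch based on a toy version of the Paillier scheme. Paillier is not named in this article and is only additively homomorphic, so treat it as one illustrative example rather than the general technique. The tiny primes, the function names, and the sample amounts are assumptions chosen for readability; real keys are far larger.

```python
import math
import secrets

def keygen(p=1789, q=1861):
    """Toy Paillier key pair (illustrative primes only, far too small to be secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (phi, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext from ciphertext c with the private key."""
    phi, mu = priv
    u = pow(c, phi, n * n)
    return ((u - 1) // n * mu) % n

n, priv = keygen()
c1 = encrypt(n, 320)          # first guest's bill, encrypted
c2 = encrypt(n, 458)          # second guest's bill, encrypted
c_sum = (c1 * c2) % (n * n)   # the analyst adds them without any key
total = decrypt(n, priv, c_sum)
print(total)
```

The party computing `c_sum` never sees 320 or 458; only the key holder can decrypt, and what comes out is the sum. Note also that each ciphertext lives modulo `n * n`, so it is already about twice as long as the modulus, which hints at why ciphertext expansion becomes a cost.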

Although this encryption method is effective, its computational cost is too high.

Basic homomorphic encryption can expand 1 MB of data to 16 GB, which is very costly in AI scenarios. Moreover, homomorphic encryption (like most encryption algorithms) is usually not differentiable, which makes it a poor fit for mainstream AI training algorithms such as stochastic gradient descent (SGD).

At present, homomorphic encryption technology basically remains at the conceptual level and is difficult to put into practical application, but there is hope in the future.

Learn more about GAN encryption technology

Google published a paper in 2016 called "Learning to Protect Communications with Adversarial Neural Cryptography". The paper describes in detail a GAN-based encryption technique that can effectively address the data protection problem in the data sharing process.

This is an encryption technique based on neural networks, which are usually considered difficult to use for encryption because they have difficulty performing XOR operations.
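For context on why XOR matters so much here: in the classical one-time pad, both encryption and decryption are a bitwise XOR with a shared key. The sketch below is a standard textbook construction, not something taken from the paper; the sample message and the helper name `xor_bytes` are made up for illustration.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(data) != len(key):
        raise ValueError("key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"room 1208"
key = secrets.token_bytes(len(plaintext))  # shared secret, used only once
ciphertext = xor_bytes(plaintext, key)     # sender encrypts
recovered = xor_bytes(ciphertext, key)     # receiver decrypts with the same key
print(recovered)
```

Applying the same XOR twice returns the original bytes, which is why a model that cannot represent XOR-like functions was long assumed to be a poor building block for ciphers.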

But it turns out that neural networks can learn how to keep data secret from other neural networks: they can discover forms of encryption and decryption on their own, without being given any specific algorithm for either.

How GAN encryption protects data

GAN encryption involves three parties, which we can call Alice, Bob, and Eve. As usual, Alice and Bob are the two ends of a secure communication, and Eve eavesdrops on them and tries to recover the original data.

Alice wants to send Bob a secret message P, which is Alice's input. When Alice processes this input, she produces an output C ("P" stands for "plaintext" and "C" stands for "ciphertext").

Bob and Eve both receive C and try to recover P from C (we denote their outputs by P_Bob and P_Eve, respectively).

Bob has an advantage over Eve: he and Alice share a secret key K.

Eve's goal is simple: to reconstruct P exactly (in other words, to minimize the error between P and P_Eve).

Alice and Bob want to communicate clearly (to minimize the error between P and P_Bob), but also want to hide their communication from Eve.
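These competing goals can be written as loss functions. The following is a minimal sketch, loosely modeled on the form described in the paper and assuming that each bit is encoded as -1.0 or +1.0 and that errors are measured with an L1 distance. The function names are hypothetical, and the exact weighting used by the authors may differ.

```python
def bit_error(p, p_guess):
    """L1 distance scaled to count wrong bits, for bits encoded as -1.0 / +1.0."""
    return sum(abs(a - b) for a, b in zip(p, p_guess)) / 2.0

def eve_loss(p, p_eve):
    """Eve simply wants to reconstruct P as accurately as possible."""
    return bit_error(p, p_eve)

def alice_bob_loss(p, p_bob, p_eve):
    """Bob should be accurate, and Eve should do no better than chance."""
    n = len(p)
    bob_term = bit_error(p, p_bob)
    # Eve being wrong on exactly half of the bits is equivalent to random
    # guessing, so this term is 0 there and grows as Eve moves away from it.
    eve_term = (n / 2 - bit_error(p, p_eve)) ** 2 / (n / 2) ** 2
    return bob_term + eve_term

p = [1.0, -1.0, 1.0, -1.0]
eve_at_chance = [1.0, -1.0, -1.0, 1.0]   # two of four bits wrong
print(alice_bob_loss(p, p, eve_at_chance))
print(alice_bob_loss(p, p, p))
```

One reason to penalize distance from half-wrong rather than simply maximizing Eve's error: an Eve that is wrong on every bit could just flip her output and be completely right.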

Through GAN technology, Alice and Bob are trained together to successfully transmit information while learning to avoid Eve's monitoring. The whole process does not use any pre-set algorithm. Under the principle of GAN, Alice and Bob are trained to beat the best Eve, not a fixed Eve.

In the reported training run, at about 8,000 training steps both Bob and Eve begin to reconstruct the original message. At about 10,000 training steps, the Alice and Bob networks seem to detect Eve and begin to interfere with her, causing Eve's error rate to rise. In other words, Alice and Bob adapt to Eve's behavior and protect the communication, achieving accurate message reconstruction while fending off the attack.

Back to AI applications: GAN encryption can be used to exchange information between companies and neural networks while maintaining a high degree of privacy. For AI applications, it is a practical data protection solution.

This is because the model can learn to protect information selectively, leaving some elements of the data set unencrypted while preventing any form of inference from recovering the sensitive elements, which effectively circumvents the shortcomings of data anonymization.

The Google team adapted the GAN encryption architecture into a model where Alice and Bob still share a key, but Alice now receives the data fields A, B, and C (here C is a sensitive field, not the ciphertext) and produces D-public in addition to the ciphertext.

Both Bob and Eve have access to Alice's outputs. Bob uses them to generate an improved estimate of D, while Eve tries to recover C from them. The goal is to show that adversarial training allows D to be approximated without revealing C, and that this approximation can be combined with the encrypted information and the key to better confuse Eve.

To verify that the system hides information correctly, the researchers created an evaluator called "Blind Eve". It knows only the distribution of C and has no access to Alice's outputs, whereas the real Eve does see those outputs.

If Eve's reconstruction error equals Blind Eve's reconstruction error, Eve has not extracted any useful information. After several rounds of training, Eve no longer has an advantage over Blind Eve. This shows that Eve cannot reconstruct any more information about C than could be obtained simply by knowing the distribution of C values.
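The Blind Eve comparison can be sketched as a simple baseline test. This is an illustrative assumption about how such an evaluation might look, not the authors' code: here the blind baseline always guesses the median of previously observed C values, and the sample numbers and function names are hypothetical.

```python
import statistics

def mean_abs_error(truth, guesses):
    """Average absolute difference between true values and guesses."""
    return sum(abs(t - g) for t, g in zip(truth, guesses)) / len(truth)

def blind_eve(known_c, count):
    """Guess using only the distribution of C: always predict its median."""
    return [statistics.median(known_c)] * count

def leakage(truth_c, eve_guesses, known_c):
    """Positive result: Eve beats the blind baseline, so information leaked."""
    baseline = mean_abs_error(truth_c, blind_eve(known_c, len(truth_c)))
    return baseline - mean_abs_error(truth_c, eve_guesses)

known_c = [0.2, 0.8, 0.5, 0.3, 0.7]   # samples revealing only C's distribution
truth_c = [0.1, 0.9, 0.4, 0.6, 0.5]   # the hidden values Eve is attacking

eve_no_signal = [0.5, 0.5, 0.5, 0.5, 0.5]    # Eve learned nothing useful
eve_with_leak = [0.15, 0.85, 0.4, 0.6, 0.5]  # Eve's inputs leak C

print(leakage(truth_c, eve_no_signal, known_c))
print(leakage(truth_c, eve_with_leak, known_c) > 0)
```

When the leakage stays at or near zero over training, Eve's access to Alice's outputs is giving her nothing beyond what the distribution alone provides.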

GAN cryptography is a relatively new technology in mainstream AI applications, but conceptually, it could allow companies to share datasets with data scientists without disclosing sensitive data.

In the long run, for companies that want to earn user trust and reduce legal risk, encryption technology is secondary. What matters most is that Internet companies respect user privacy and use user data responsibly.
