Source: Lawyer Xiao Sa
Last month, the Italian privacy regulator Garante issued findings that OpenAI had committed one or more violations of EU regulations, and that the technology ChatGPT uses to collect user data breached the country's privacy laws. ChatGPT, which set off the generative artificial intelligence craze, has once again landed in a data compliance controversy.
Data and computing power are the core of generative artificial intelligence, and data security is the core issue of generative AI compliance. As artificial intelligence grows ever more dependent on data, the covert collection of data by generative AI poses a serious challenge to the principles of "informed consent" and "minimum necessity". At the same time, generative AI carries a substantial risk of data leakage during its runtime phase, which seriously threatens the protection of personal information. Today, the Sajie team will discuss the challenges generative artificial intelligence poses to personal information security, and the corresponding compliance requirements.
01 Collection and use of corpus data
By data source, data involving personal information can be roughly divided into corpus data that contains personal information and user-uploaded data that contains personal information.
Generative artificial intelligence is highly dependent on data and requires vast amounts of it to meet training requirements. This means generative AI routinely collects and processes both public and non-public data at scale, with pre-training corpora supporting models of billions or even tens of billions of parameters. Where such data contains personal information, Article 27 of the Personal Information Protection Law provides: "Personal information processors may, within a reasonable scope, process personal information that the individual has disclosed on their own or that has otherwise been lawfully disclosed, unless the individual explicitly refuses. Where the processing of disclosed personal information has a significant impact on individual rights and interests, the processor shall obtain the individual's consent in accordance with this Law." Article 7 of the Interim Measures for the Administration of Generative Artificial Intelligence likewise emphasizes: "Generative artificial intelligence service providers (hereinafter, 'providers') shall lawfully carry out pre-training, optimization training, and other training data processing activities, and shall comply with the following provisions: ... (3) where personal information is involved, obtain individual consent or satisfy other circumstances provided for by laws and administrative regulations." In practice, however, the sheer size of these corpora makes obtaining consent from information subjects one by one effectively impossible.
Since obtaining the consent of information subjects is difficult, can the personal information simply be deleted from the corpus directly? That is difficult too. On the one hand, effective personal-information-cleaning algorithms are still lacking, and a degree of technical paradox remains; on the other hand, the enormous scale of the corpus makes manual cleaning prohibitively expensive, and the cleaning process itself carries a risk of secondary leakage of personal information. Studies of clinical health data have reported that cleaning based on named entity recognition achieves a recall of 97% for names but only 80% for nursing unit numbers. In other words, when personal information is present in corpora and databases, cleaning at the training stage remains imperfect, and technology companies bear compliance risk. The Sajie team advises that when technology companies use corpus data for training, they should prefer datasets that contain no personal information, improve the accuracy of recognition algorithms as far as possible, and anonymize or redact whatever personal information is identified. Combining a machine filtering mechanism with a manual review mechanism on the audit side is likewise a compliance measure whose advantages outweigh its costs.
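To illustrate the kind of machine filtering described above, here is a minimal, hypothetical sketch of rule-based PII masking over corpus text. It is not any company's actual pipeline: production systems rely on trained named-entity-recognition models precisely because regexes cannot catch names or free-form addresses, which is why the recall figures cited above matter.

```python
import re

# Hypothetical, simplified PII-masking pass for corpus cleaning.
# Real pipelines use trained NER models; regexes alone miss personal
# names and free-form addresses, so recall is the key quality metric.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b1[3-9]\d{9}\b"),      # mainland-China mobile format
    "ID_CARD": re.compile(r"\b\d{17}[\dXx]\b"),   # PRC resident ID number
}

def mask_pii(text: str) -> str:
    """Replace detected personal information with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact Zhang San at 13812345678 or zhang@example.com."
print(mask_pii(sample))
# The phone number and email are masked; the name "Zhang San" is NOT
# caught -- the blind spot that NER-based cleaning is meant to close.
```

A filter like this would sit on the machine side of the "machine filtering plus manual review" arrangement: anything the rules flag is masked automatically, while sampled residual text goes to human reviewers.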
02 Collection and use of information uploaded by users
User-uploaded data can be divided into "data users actively feed" and "data users passively feed". Data actively fed by users refers to specific data a user uploads in order to obtain feedback from a generative AI system. Data passively fed by users refers to data generated and uploaded in the course of using applications, or other device functions, that embed generative AI algorithms.
The operation of generative artificial intelligence usually requires users to actively "feed" it certain data, which the system then analyzes algorithmically to produce feedback. During this process, human-computer interaction data is recorded, stored, and analyzed, and may become material for further training of the model. Yet where the service provider fails to fulfill its duty to warn and the user lacks security awareness, the data users feed in is likely to include personal information such as the user's likeness, address, and contact details. The complex service models and diverse application scenarios of generative AI exacerbate this risk. As digital technology develops, users' identities become deeply bound to their contact information, facial data, fingerprints, and the like, and generative AI often collects large volumes of such personal information. For example, a well-known AI company's chatbot is applied across teaching, scientific research, finance, media, entertainment, and many other fields, and users' chat records with it contain a great deal of sensitive information: personal identity, preferences, habits, and so on. If this data falls into the wrong hands, it can lead to privacy violations, identity theft, financial fraud, and other direct harms to users.
In addition, generative artificial intelligence has a wide range of usage scenarios and is often embedded in major applications and even devices. For example, in January this year, one browser announced the introduction of three generative AI capabilities, and one company launched the world's first smartphone equipped with generative AI technology. Even when users are not deliberately using generative AI, they inevitably generate and upload data while using such applications and devices, and that data is likely to contain content that constitutes personal information.
Article 11 of the Interim Measures for the Administration of Generative Artificial Intelligence stipulates: "Providers shall lawfully fulfill their obligations to protect users' input information and usage records; shall not collect unnecessary personal information; shall not unlawfully retain input information or usage records that can identify a user; and shall not unlawfully provide users' input information or usage records to others. Providers shall promptly accept and handle individuals' lawful requests to access, copy, correct, supplement, or delete their personal information." Laws and regulations such as the Personal Information Protection Law and the Regulations on the Online Protection of Children's Personal Information also impose mandatory provisions on data retention periods. On this basis, whether information actively fed in by users that may constitute personal information can lawfully be recorded, stored, and retained by the service provider is open to question.
At the same time, whether this kind of information may be used to train algorithms is also contested. Article 7 of the Interim Measures for the Administration of Generative Artificial Intelligence emphasizes: "Generative artificial intelligence service providers (hereinafter, 'providers') shall lawfully carry out pre-training, optimization training, and other training data processing activities, and shall comply with the following provisions: ... (3) where personal information is involved, obtain individual consent or satisfy other circumstances provided for by laws and administrative regulations." The authorization obtained from the user at first use is not sufficient to cover data use at the algorithm-training stage: technology companies must obtain clearer authorization for that use, or may use this type of data only under other circumstances that comply with laws and regulations. Otherwise, they may violate relevant provisions of civil law, administrative law, or even criminal law. Moreover, even with users' explicit authorization, generative AI carries a substantial risk of data leakage during operation; technology companies may use personal information only if they can ensure its security.
To improve the quality of generation, many technology companies do their utmost to enrich data retention and enhance data aggregation. For example, Article 2 of one AI company's privacy policy states: "We may aggregate or de-identify personal information so that it can no longer be used to identify you, and use such information to analyze the effectiveness of our services, improve and add features to our services, conduct research, and for other similar purposes." This is a workable approach, but under the principle of "informed consent" the service provider bears a duty to inform: it must explain to the information subject in advance what data will be collected, for what purpose, and with what possible risks, and may collect only after obtaining the subject's consent. Technology companies should also give users the option to refuse the use of their personal information, rather than turning this clause into a rigid, mandatory notice. In addition, under the "minimum necessity" principle, personal information should be collected in the manner most relevant to the stated goal and least intrusive to the user, and the scope of collection should be clear and specific.
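The aggregation and de-identification described in such policy clauses can be sketched as pseudonymization of direct identifiers plus generalization of quasi-identifiers. The following is a hypothetical illustration under those assumptions, not any provider's actual procedure; field names and the salt are invented for the example.

```python
import hashlib

# Hypothetical de-identification step: replace direct identifiers with
# salted one-way hashes (pseudonymization) and coarsen quasi-identifiers
# (generalization), so records support aggregate analysis of service
# effectiveness without pointing back to a specific user.
SALT = b"rotate-this-secret-regularly"  # illustrative placeholder

def pseudonymize(user_id: str) -> str:
    """One-way, salted hash of a direct identifier."""
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:16]

def generalize(record: dict) -> dict:
    """Keep only city-level location and a ten-year age band."""
    return {
        "uid": pseudonymize(record["user_id"]),
        "city": record["address"].split(",")[0],       # drop street-level detail
        "age_band": f"{(record['age'] // 10) * 10}s",  # e.g. 34 -> "30s"
        "query_topic": record["query_topic"],          # non-identifying payload
    }

raw = {"user_id": "u-8812",
       "address": "Shanghai, Huangpu District, 25 Nanjing Rd",
       "age": 34, "query_topic": "finance"}
print(generalize(raw))
```

Note that pseudonymized data is generally still personal information under the Personal Information Protection Law, since re-identification remains possible with additional information; only properly anonymized data falls outside its scope, which is why the informed-consent and refusal options discussed above still apply.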
03 Final remarks
Compared with traditional artificial intelligence, generative artificial intelligence typically collects information more aggressively and carries a higher risk of data abuse. Because generative AI must continuously strengthen its contextual understanding through large-scale corpora and datasets in order to upgrade and optimize itself, every stage of its operation, including data collection, storage, processing, and generation, inevitably involves large amounts of personal information and gives rise to many legal and compliance risks. In the big data era, the blurring of the scope and boundaries of personal information, the lag of laws and regulations, and the race for technological achievement have led some technology companies to overlook these risks. The Sajie team reminds readers that compliance is the prerequisite for, and the guarantee of, the industry's healthy development: in the pursuit of success, do not treat legal red lines lightly.