What Should an Ordinary Engineer Consider Before Opening an Open Source Project?

By Super Neuro
Before OpenAI released GPT-2, it probably did not anticipate that its decision about open sourcing would cause such a stir in academia and industry. Of course, that is also largely a consequence of the strength of their research results.
As an ordinary developer, what are the risks and benefits of open source? This article lists several issues to consider before open sourcing, along with some of the author's experience.
What happened when OpenAI (partially) open sourced GPT-2?
Last week, OpenAI introduced GPT-2, the most advanced text generation model in NLP, but ultimately decided not to make everything public. The reason given:
“We will not release the trained models due to concerns about malicious applications of the technology.”

From the moment OpenAI announced GPT-2 to its decision to open source only part of the results, the controversy was intense. Some argue that if everything were released, the model would inevitably be used maliciously, perhaps even criminally; supporters of openness counter that withholding the full model makes it difficult for other researchers to reproduce the results.
Anima Anandkumar, who works at the intersection of machine learning theory and applications, responded to OpenAI's decision on Twitter:

This is a black-and-white issue. You are using the media to hype language models. There is plenty of existing research on this topic. You claim the results are amazing, yet only reporters get to know the details. Researchers should be the ones to know, not reporters.
Stephen Merity summarized the response on social media by lamenting that the machine learning community doesn’t have much experience in this area:

Summary of the day (about OpenAI): We don’t have any consensus on responsible disclosure, dual use, or how to interact with the media. This should be of great concern to each of us, both inside and outside the field.
Many people have benefited from open source. So, as independent engineers, or as engineers working at companies or institutions, should we open source our own models?
Someone has put together a guide to help you think one step further when you are hesitating.
Hardcore open source advice for ordinary engineers
Should you consider open sourcing your own model?
Of course!
Whatever you ultimately decide, at least consider the possibility of open sourcing your model; don't dismiss it out of hand. However, if your model was trained on private data, you must weigh the risk that bad actors could probe the released model to recover pieces of the original data (a toy check follows below).
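To make that risk concrete, here is a minimal memorization-check sketch, assuming the Hugging Face transformers library, with GPT-2 as a stand-in model; the candidate strings are invented for illustration. A suspected training string scoring a much lower loss than a comparable control is a (weak) signal of memorization.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_token_loss(text: str) -> float:
    """Average negative log-likelihood per token of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

suspect = "John Doe, 42 Elm Street, phone 555-0134"  # hypothetical training record
control = "Jane Roe, 17 Oak Avenue, phone 555-0199"  # comparable control string
print(avg_token_loss(suspect), avg_token_loss(control))
# A suspect loss far below the control's suggests the model memorized,
# and could leak, that record.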
What should I worry about if the model is built entirely on public datasets?
Even when all the data comes from public datasets, differences between your research direction and others' can create new risks.
So ask yourself: even if you only use public datasets, could your particular research direction change what the data or the model reveals?
For example, during the Arab Spring, roads in some areas were frequently blocked because of unrest, and local young people complained about it on Twitter. Certain organizations then mined those public tweets to monitor and infer opposing forces' movement routes.
A single data point may look harmless, but once many points are combined, the result can be highly sensitive.
So consider this question: is the data, as aggregated by the model, more sensitive than any single data point? (The toy sketch below illustrates this.)
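Here is a minimal Python sketch with entirely invented data: each post alone is harmless, but grouping posts by coarse grid cell and hour reveals where and when activity concentrates.

from collections import Counter

posts = [  # invented illustrative data
    {"user": "a", "lat": 30.044, "lon": 31.236, "hour": 6},
    {"user": "b", "lat": 30.041, "lon": 31.239, "hour": 6},
    {"user": "c", "lat": 30.048, "lon": 31.233, "hour": 6},
    {"user": "d", "lat": 29.976, "lon": 31.131, "hour": 22},
]

# Round coordinates to coarse cells and count activity per (cell, hour).
clusters = Counter(
    (round(p["lat"], 1), round(p["lon"], 1), p["hour"]) for p in posts
)
print(clusters.most_common(1))
# -> [((30.0, 31.2, 6), 3)]: a spike no single post exposes on its own.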

How should you assess the risks of open sourcing?
From a security standpoint, which outcome is worse: not open sourcing at all, or open sourcing and having the work abused?
The cost of treating every security policy as permanent may exceed the value of the data being protected. Some information is private only for a limited time: once that window passes, it is no longer sensitive, yet it still has great research value.
Outdated security policies should therefore be retired promptly, so that the real value of a dataset can be identified and preserved.
Also weigh how hard the model is to use legitimately against how hard it is for bad actors to exploit. Which is easier? Once you understand that trade-off, decide whether to open source.
In OpenAI's case, they may have believed that withholding the full model would be enough to prevent malicious use on the internet.
In truth, though, even with the full model released, many practitioners could not reliably reproduce the paper, and anyone intent on malicious use would still face substantial costs.
Should I believe what the media says about the risks of open source?
No.
Media coverage always shapes public opinion. Journalists want readership, and sensational headlines and opinions attract it. A journalist may prefer that you open source because it gives them an easier story; conversely, a decision not to open source can fuel sensational rumors (as with OpenAI, where either choice would have been exaggerated by reporters).
Should we trust government agencies' opinions on open source risks?
Obviously not.
Of course, you must first make sure your research is legal and legitimate. But the staff of government agencies are often not domain experts; they may care more about public-opinion pressure, on the principle that "no trouble is good news." Their opinions are therefore not the deciding factor in whether to open source.
That said, as with journalists, we should treat the government as an important partner while recognizing that each side has different interests.

Should we think about solutions to negative use cases after open source?
Yes!
This is where OpenAI fell short this time. If the model can be used to create fake news, then that fake news can also be detected: for example, you could build a text classification task that distinguishes human-written text from the model's output (see the sketch below).
Facebook, WeChat, and various media sites have long been fighting fake news and rumors, and OpenAI's research could clearly help: can the model's output be detected and used against fake news?
Logically, OpenAI could have produced such a solution in short order, but they didn't.
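For illustration, here is a minimal sketch of such a detector, assuming scikit-learn and two small corpora, one human-written and one sampled from the released model; the inline strings are placeholders, not real data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder corpora; in practice, load real human articles and
# real samples generated by the released model.
human_texts = ["human text 1 ...", "human text 2 ...", "human text 3 ...", "human text 4 ..."]
model_texts = ["generated 1 ...", "generated 2 ...", "generated 3 ...", "generated 4 ..."]

texts = human_texts + model_texts
labels = [0] * len(human_texts) + [1] * len(model_texts)  # 1 = machine-generated

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Character n-grams are a common, robust choice for authorship-style tasks.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(X_train, y_train)
print("held-out accuracy:", detector.score(X_test, y_test))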
Should we pay attention to balancing the negative and positive use cases of the model?
Yes.
Releasing models with positive applications, such as in healthcare, security, or environmental protection, makes it easier to contribute to many aspects of society.
Another failure of OpenAI is the lack of diversity in their research. The research OpenAI publishes applies only to English and a handful of other languages, yet English accounts for only about 5% of the world's conversations. What holds for English may not hold for other languages, in terms of word order within sentences, standardized spelling, and how well "words" serve as atomic units for machine learning.
As a research pioneer, OpenAI also has a responsibility to extend its work to other language families and to help the languages and regions that need it most.
To what extent should data be anonymized before open sourcing a model?
Anonymize down to the field level, or at least start your evaluation at the field level.
For example, when I worked at AWS on the named entity recognition service, I had to decide whether street-level addresses should be recognized as an explicit field, and whether specific coordinates should be mapped back to addresses.
This is inherently sensitive private information, especially once a commercial company productizes it. So every research project should ask: has the key data been anonymized? (A redaction sketch follows.)
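As one illustration of field-level anonymization, here is a minimal redaction sketch using spaCy's pretrained English NER; this is an assumption for illustration, not the AWS service's actual approach, and which entity labels count as sensitive is a per-project judgment call.

# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
SENSITIVE = {"PERSON", "GPE", "LOC", "FAC"}  # adjust per project

def redact(text: str) -> str:
    """Replace sensitive entity spans with their label, e.g. [PERSON]."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in SENSITIVE:
            out.append(text[last:ent.start_char])
            out.append(f"[{ent.label_}]")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(redact("Alice met Bob at 42 Elm Street in Cairo."))
# e.g. "[PERSON] met [PERSON] at 42 Elm Street in [GPE]."
# (exact spans depend on the model; street addresses may need a custom field)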
Should I open source my model when others say it can be open sourced?
No, you should use your own judgement.
Whether or not you agree with OpenAI's decision, they made the final call themselves rather than blindly following the opinions of netizens.

Original article: Robert Munro
Compiled by: Nervous Miss