Should I Open-Source My Model?

I ultimately decided to take my current position for reasons unrelated to the role itself, but I think that fighting fake news on Facebook is one of the most important things that anyone could do right now, and this additional study from OpenAI would help that fight.

Even better, if you can create a pool of models that can identify generated content, it is going to be harder to create generated content that defeats all models and gets past an automatic detection system.
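As a minimal sketch of that idea, assuming each detector in a hypothetical pool exposes a `score` method returning the probability that a text is machine-generated:

```python
# A minimal sketch of pooled detection. The detectors and their
# score(text) method are hypothetical placeholders, not a real API.

def is_generated(text, detectors, threshold=0.5):
    """Flag the text if ANY detector in the pool scores it above threshold.

    To get past this check, generated content has to defeat every
    detector at once, which is harder than defeating a single model.
    """
    return any(detector.score(text) > threshold for detector in detectors)
```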

If you can demonstrate, quantitatively, that a negative use case for the data is easier or harder to combat, then that will be one factor in your decision-making process.
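One rough way to do that, sketched below, is to measure how well your best available detector separates human-written text from generated text; `human_texts`, `generated_texts`, and `detector` are hypothetical placeholders for your own data and model:

```python
# A rough sketch of quantifying how easy generated content is to combat.
from sklearn.metrics import roc_auc_score

labels = [0] * len(human_texts) + [1] * len(generated_texts)
scores = [detector.score(t) for t in human_texts + generated_texts]

auc = roc_auc_score(labels, scores)
print(f"Detectability AUC: {auc:.3f}")  # ~0.5: undetectable; ~1.0: easy to combat
```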

Is this a new problem in Machine Learning?

No, and you can learn a lot from past experience.

In 2014–2015, I was approached by the Saudi Arabian government on three separate occasions to help them monitor social media for dissidents.

At the time I was CEO of Idibon, a ~40-person AI company that had the most accurate Natural Language Processing technology in a large number of languages, so our technology was naturally seen as the best fit for their use case.

We were first approached directly by a Saudi Arabian ministry, and then indirectly, once through a boutique consulting company and once through one of the five biggest consulting companies in the world.

In every case, the stated goal was to help the people complaining about the government.

After careful consultation with experts on Saudi Arabia and Machine Learning, we decided that a system that identified complaints would be used to identify dissidents.

As Saudi Arabia is a country that persecutes dissidents without trial, often violently, we declined to help.

If you are facing a similar dilemma, look for people who have the depth of knowledge to talk about the community who would be most affected (ideally people from within that community) and people who have faced similar Machine Learning problems in the past.

Is fake news a new problem?

No.

Propaganda is probably as old as language itself.

In 2007, I was escorting journalists reporting on the elections in Sierra Leone when we kept hearing reports of violence.

We would follow those reports to find no actual violence.

It turned out that a pirate radio station was broadcasting fake news, some of which was picked up by legitimate radio stations. The intention of the fake news was to portray the supporters of one or more political parties as violent, and possibly to scare people away from voting altogether.

In the most recent elections in Sierra Leone, I saw messages going around social media with similar types of fake news about violence and election tampering.

The people responsible for fighting fake news at large social media companies have all quietly admitted to me that they can’t identify fake news in the majority of languages spoken in Sierra Leone, or in many other countries.

So, propaganda has been here for a long time, and it has used every available technology to scale the distribution of its message.

The biggest gap is in ways to fight that propaganda, and in the majority of cases this means better AI outside of English.

Should I focus on balancing the bad use cases for Machine Learning with ones that are more clearly good?

Yes.

It is easy to have a positive impact on the world by releasing models that have mostly positive application areas.

It is difficult to have a positive impact on the world by limiting the release of a model with many negative application areas.

This is the other failing of OpenAI: their lack of diversity.

More than any other research group, OpenAI has published models and research that apply only to English and (rarely) a handful of other high-privilege languages.

English makes up only 5% of the world’s daily conversations.

English is an outlier in how strict its word order needs to be, in how standardized its spellings are, and in how useful ‘words’ are as atomic units for Machine Learning features.

OpenAI’s research relies on all three: strict word order, words as features, and consistent spellings.
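To make the ‘words as features’ point concrete, here is a small sketch using scikit-learn: word-level tokenization produces nothing useful for a language written without spaces between words, while character n-grams still do. The Thai example sentence is just an illustration:

```python
# A small illustration of why "words as features" is an English-centric
# assumption: Thai is written without spaces between words, so word
# tokenization returns one giant token, while character n-grams still
# produce usable features.
from sklearn.feature_extraction.text import CountVectorizer

thai = "ฉันรักภาษาไทย"  # "I love the Thai language", written without spaces

word_analyzer = CountVectorizer(analyzer="word").build_analyzer()
char_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)).build_analyzer()

print(word_analyzer(thai))  # the whole sentence comes back as a single "word"
print(char_analyzer(thai))  # many overlapping character n-grams
```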

Would it even work for the majority of the world’s languages? We don’t know, as they didn’t test it.

OpenAI’s research tells me that we need to worry about this kind of content generation for English, but it tells me nothing about the risk in hundreds of other languages where fake news circulates today.

To be frank, OpenAI’s diversity problems run deep.

When I was among dozens of people who noted that an AI conference featured 30+ speakers who were all men, and that OpenAI’s Chief Scientist was the first featured speaker, OpenAI ignored the complaints.

Despite several public and private messages from different people, I am not aware of any action that was taken by OpenAI to address this problem in diversity representation.

I personally decline all speaking invitations where I believe that the conference lineup is perpetuating a bias in the Machine Learning community, and I know that many people do the same.

It is likely that OpenAI’s more relaxed attitude to diversity in general is leading to research that isn’t diverse.

In practice, I generally don’t trust English-only results to hold for the 95% of the world’s conversations that are not in English.

There is a lot of good fundamental research at OpenAI, like how to make any model more lightweight and therefore usable in more contexts, but their English-language focus is limiting the positive use cases.
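As one concrete example of that kind of lightweight-model work (a common technique, not OpenAI’s specific method), post-training dynamic quantization in PyTorch stores a model’s Linear weights as 8-bit integers; the toy model below stands in for any trained network:

```python
# A minimal sketch of making a model more lightweight via dynamic
# quantization in PyTorch. The toy model is a stand-in for any
# trained float32 network.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)

quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # layer types to quantize
    dtype=torch.qint8,  # store weights as 8-bit integers
)
```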

If you don’t want to step into the grey area of applications like fake news, then pick a research area that is inherently more impactful, like language models for health-related text in low resource languages.

How deeply do I need to consider the sensitivity of the use case?

Down to the individual field level.

When I was running product for AWS’s Named Entity Resolution service, we had to consider whether we wanted to identify street-level addresses as an explicit field, and potentially map coordinates to that address.

We decided that this was inherently sensitive information and shouldn’t be productized in a general solution.
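Here is a minimal sketch of what that decision looks like at the field level, with a hypothetical entity schema rather than the actual AWS API:

```python
# A minimal sketch of field-level sensitivity filtering: drop entity
# types you have decided are too sensitive to productize, before the
# results ever leave the service. The schema is hypothetical.

SENSITIVE_TYPES = {"STREET_ADDRESS", "GPS_COORDINATES"}

def redact_sensitive(entities):
    """Keep only entity predictions whose type is not on the blocklist."""
    return [e for e in entities if e["type"] not in SENSITIVE_TYPES]

entities = [
    {"type": "PERSON", "text": "Jane Doe"},
    {"type": "STREET_ADDRESS", "text": "123 Main St"},
]
print(redact_sensitive(entities))  # only the PERSON entity survives
```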

Consider this in any research project: are you identifying sensitive information in your models, implicitly or explicitly?

Should I open-source my model, just because everybody else does?

No.

You should always question your own impact.

Whether or not you agree with OpenAI’s decision, they were right in making an informed decision rather than blindly following the trend of releasing the full model.

What else should I be worried about?

Probably many things that I haven’t covered here today! I wrote this article as a quick response to OpenAI’s announcement yesterday.

I will spend more time sharing best practices if there is demand for it!
