
Many of the current state-of-the-art models for supervised NLP tasks are models pre-trained on language modeling (an unsupervised task) and then fine-tuned (supervised) with labeled data specific to the task.

Fig 1. Transfer learning to downstream tasks started around 2013 with context-independent word vectors from unsupervised bag-of-words models (word2vec), moved to context-dependent word vectors from sequence models (ELMo), and has now reached the direct use of trained transformer blocks with an additional output layer stacked on top for task-specific fine-tuning.

A question that naturally arises is,

Why/How does unsupervised learning through language modeling

  • boost the performance of supervised models?
  • reduce the amount of labeled data required to fine-tune them?

A paper published a week ago (~14 Feb 2019) offers some insights into these questions.

  • What does this model do?
    • It is trained on a large, diverse corpus (roughly the equivalent of a Common Crawl subset) in an unsupervised manner. The training is the standard language-model approach: predict the next word, given the words seen so far. Once trained, a language model can compute the probability of any string given to it. It is also generative: it can generate strings from the underlying distribution it has learnt (see the sketch after this list). The model used in this paper is the OpenAI transformer architecture with minor variations.
    • The trained language model shows state-of-the-art performance on 7 out of 8 language modeling benchmarks. The one it does not perform well on is the largest language modeling dataset. The stated reason is that this may be partly because that dataset has undergone destructive pre-processing that removes long-range structure, which this model relies on, given that its own input is hardly pre-processed (no lowercasing etc.).
    • The model's performance on supervised tasks is promising but not yet sufficient for practical applications. For instance, on reading comprehension it does reasonably well considering it uses no labeled data (it scores 55 F1 on the dev set; BERT scores 89 F1, nearing human performance), but on summarization it is rudimentary.
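As a concrete illustration of the two properties mentioned in the list above - scoring the probability of a string, and generating text from the learnt distribution - here is a minimal sketch. It uses the Hugging Face `transformers` wrapper around the released 117M GPT-2 weights; the wrapper, model name and example strings are my choices for illustration (the tests later in this answer were run with the TensorFlow code released by OpenAI).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")    # the released 117M model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(text):
    """Total log P(text) = sum of log P(token_i | tokens_<i) - a language model score."""
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        # labels=ids makes the model return the mean next-token cross-entropy
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)          # mean loss -> total log-probability

print(sequence_log_prob("The cat sat on the mat."))

# Generative use: sample a continuation of a prompt from the model's distribution.
prompt = tokenizer.encode("The meaning of life is", return_tensors="pt")
sample = model.generate(prompt, do_sample=True, top_k=40, max_length=40)
print(tokenizer.decode(sample[0]))
```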

Fig 2. The GPT-2 paper published last week attempts to go beyond BERT in transfer learning, largely as a research initiative. GPT-2 is an unsupervised language model trained on a large, diverse corpus and used as is for downstream tasks with no architecture change, no parameter updates, and, most importantly, no task-specific labeled data. The results are promising, but it remains to be seen whether the performance will reach the levels we can attain with BERT-style supervised fine-tuning on the same tasks.

  • How is this language model used as is for supervised tasks?
    • A key insight this paper leverages (largely inspired by and adapted from prior work) is that a large, diverse corpus (almost the equivalent of a Common Crawl subset) serves not just to learn the underlying distribution of word sequences; it also has embedded in it enough information to act as a proxy for labeled data for quite a few (but not all) supervised tasks.
    • For example, sentence pairs in two languages, one of which is a translation of the other, are present in the corpus used in the paper. It is also possible to prime a language model trained on such a large corpus by conditioning it on a few example patterns, so that it generates equivalent patterns. For instance, if we prime the model with a few sentences of the form "english sentence = french sentence" and then prompt it with just "english sentence = ", it will generate the French equivalent. As another example, since people use TL;DR: as shorthand for a summary, we can have the language model perform summarization simply by conditioning it on the text to be summarized and ending the prompt with TL;DR:. All the examples below are generated by such conditioning.
    • This approach may not work for tasks like named entity recognition or POS tagging, where the labels are purely human generated and are unlikely to be present in a typical site crawled from the web. However, some clever adaptations are perhaps possible to do entity recognition by leveraging the information present in a corpus. An example of this is the test that examines the model's performance on different categories of words such as named entities, common nouns, etc. The test simply has the model predict which word, from a set of choices, best follows a sentence; this is tantamount to computing how likely the completed sentence is with each of those choices, which is a language modeling task (see the sketch after this list).
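Here is a minimal sketch of that completion test - pick, from a set of candidate words, the one under which the completed sentence is most probable - again assuming the Hugging Face GPT-2 wrapper; the context and candidate words are hypothetical.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_prob(sentence):
    """Total log-probability of the sentence under the language model."""
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits                         # [1, T, vocab]
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)      # predictions for tokens 2..T
    targets = ids[:, 1:].unsqueeze(-1)                     # the tokens actually observed
    return log_probs.gather(2, targets).sum().item()

# Hypothetical cloze-style item: score each candidate completion and keep the best.
context = "The children took their umbrellas because it was"
candidates = ["raining", "sunny", "Tuesday"]
scores = {c: log_prob(f"{context} {c}.") for c in candidates}
print(max(scores, key=scores.get))
```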

Fig 3. Results of using the GPT-2 model on supervised tasks as is without any architecture change or parameter updates.

  • Even though, as mentioned earlier, the model's performance on supervised tasks is not at usable levels, it explains at least in part why unsupervised language models like BERT boost the performance of downstream supervised tasks with very little fine-tuning.
    • A language model trained on a large corpus learns not only the distribution of word sequences; its exposure to many patterns of sequences characteristic of supervised tasks may also enable it to be fine-tuned easily, with very little additional training data and few architecture modifications (a minimal fine-tuning sketch follows this list). This observation seems to hold particularly as the number of model parameters grows, as shown in the figure above (summarization being an exception, as the chart in Fig 3. shows).
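For contrast with the "as is" usage studied in the GPT-2 paper, here is a minimal sketch of the fine-tuning recipe from Fig 1: take a pre-trained transformer, stack a small task-specific output layer on top, and train briefly on labeled examples. The BERT model name, the two-example "dataset" and the hyperparameters are purely illustrative (Hugging Face `transformers` assumed).

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# from_pretrained loads the pre-trained encoder and adds a fresh classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny labeled set (sentiment-style labels), just to show the mechanics of fine-tuning.
texts = ["a delightful little film", "a tedious, overlong mess"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                          # a few gradient steps stand in for an epoch
    out = model(**batch, labels=labels)     # cross-entropy loss on the new output layer
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```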

Test results on the released pre-trained model (GPT-2, 117M parameters): this has the lowest performance on all the supervised tasks shown in the figure above, but it is useful for getting a sense of how supervised tasks can be recast as a language modeling problem.

All the tests below use the language model in a conditional mode: we feed some input that the model conditions on to generate its output. The input contains patterns specific to the task being tested, which prime the model to generate output conditioned on those patterns. A sketch of this conditioning is shown below.
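A minimal sketch of how such a conditioned prompt is fed to the model, assuming the Hugging Face GPT-2 wrapper (the actual tests below were run with the code released by OpenAI):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # the released 117M model
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The prompt carries the task pattern (here Q:/A: pairs) and ends where we want an answer.
prompt = (
    "Cathy went to the garden with a pail. While she was watering her plants "
    "it started to rain and she rushed back in.\n"
    "Q: Where did Cathy go?\n"
    "A: She went to the garden.\n"
    "Q: Where is the pail?\n"
    "A:"
)
ids = tokenizer.encode(prompt, return_tensors="pt")
out = model.generate(ids, do_sample=True, top_k=40, max_length=ids.shape[1] + 60)
print(tokenizer.decode(out[0][ids.shape[1]:]))      # print only the generated continuation
```

The same mechanics apply to the translation ("english = french") and summarization (TL;DR:) prompts further down; only the pattern in the prompt changes.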

Example 1: Q & A task

Input:

Model prompt >>> John and Mary went to park after dinner with their dog. They walked around and came home. When they got back they realized their dog was not with them.

Q: What happened to the dog?

A: It was lost.

Q: Where is the dog?

A:

Output:

It is in the woods.

Q: What happened to the dog?

A: It was found by a neighbor.

Q: What happened to the dog?

A: The dog was found by a friend.

Q: Where is the dog?

A: The dog was found by a friend.

Q: Where did you get the dog?

A: It was found by the family.

Q: What happened to the dog?

A: It was found by the family.

Notes: In the response above, the model outputs the answer “It is in the woods”, primed by the token “A:”, and then continues to generate its own question-and-answer pairs. Some are repetitive. Answers such as “it is in the woods” and “it was found by a neighbor” are entirely new and not from the input passage - almost mimicking a model that has a “true world model”.

Example 2: Q & A task

Input:

Model prompt >>> Cathy went to the garden with a pail. While she was watering her plants it started to rain and she rushed back in.

Q: Where did Cathy go?

A: She went to the garden.

Q: What did she do there ?

A: She watered plants

Q: Where is the pail?

A:

Output:

It was in the garden

Q: Where did she go?

A: She went to the garden.

Q: What did Cathy do there ?

A: She went to the garden

Q: What did Cathy do there ?

A: Her pail

Q: What did Cathy do there?

A: Her pail

Q: What did Cathy do there ?

A: She went to the garden.

Example 3: Q & A task

Input:

Model prompt >>>Tom and Jerry wanted to have a race. Percy joined them too. They raced to the end of the street. Jerry won the race. Tom was mad.

Q: What did Tom and Jerry do?

A: They raced.

Q: Who joined them? A:

Output:

Tom.

Q: Who was Tom?

A: Jerry.

Q: Who was Jerry?

A: Jerry's father.

Q: What did Tom do?

A: He ran.

Q: What did Jerry do?

A: He ran.

Who was Tom?

A: Jerry's father's son.

Q: Who was Tom?

A: Jerry's father's brother.

Q: Who was Tom?

A: Jerry's brother.

Q: Who was Tom?

A: Jerry's father.

Q: Who was Tom?

Note: This generated sequence is comical but still not entirely off context.

Example 4: Long range dependency test - predicting the next word in a sentence with at least 50 tokens. (LAMBADA test)

Input:

Model prompt >>> Yes, I thought I was going to lose the baby. I was scared too, he stated, sincerity flooding his eyes. You were ? Yes, of course. Why do you even ask? This baby wasn’t exactly planned for. Do you honestly think that I would want you to have a

Output:

baby? I don't know, I don't know. I just want to be with you. I don't know if you want to be with me or not....

Example 5: Long range dependency test - predicting the next word in a sentence with at least 50 tokens. (LAMBADA test)

Input:

Model prompt >>> Preston had been the last person to wear those chains, and I knew what I’d see and feel if they were slipped onto my skin-the Reaper’s unending hatred of me. I’d felt enough of that emotion already in the amphitheater. I didn’t want to feel anymore. “Don’t put those on me,” I whispered. “Please.” Sergei looked at me, surprised by my low, raspy please, but he put down the

Output:

chains. I felt a chill run through my body. I didn’t want to feel it. I didn’t want to be alone. I didn’t want to feel the Reaper. I didn’t want to feel the Reaper.

Notes. The repetitive output is a common behavior. It is not clear whether this is a characteristic unique to this low-capacity model.
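The LAMBADA-style examples above are, at their core, next-word predictions; here is a minimal sketch of reading the single most likely next token off the model (Hugging Face GPT-2 wrapper assumed; this is greedy single-token prediction, not the benchmark's full scoring).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

passage = ("Yes, I thought I was going to lose the baby. I was scared too, he stated, "
           "sincerity flooding his eyes. You were? Yes, of course. Why do you even ask? "
           "This baby wasn't exactly planned for. "
           "Do you honestly think that I would want you to have a")
ids = tokenizer.encode(passage, return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits[0, -1]       # distribution over the next token
next_id = logits.argmax().item()
print(tokenizer.decode([next_id]))          # the model's predicted next word
```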

Example 6. Winograd schema challenge

Input:

Model prompt >>> The man could not lift his son because he was so

Output:

weak, and he could not lift his daughter because she was too weak to lift him.

Example 7. Winograd schema challenge

Input:

Frank was jealous when Bill said that he was the winner of the competition. Who was the winner?

Output:

"He was the winner of the competition. I'm the winner of the competition. I'm the winner of the competition. I'm the winner of the competition. I'm the winner of the competition. I'm the winner of the competition."

The winner was a former football player who was a member of the team. He was also the first player to be named in the Football League's Football League of England's Player of the Year award.

Notes: This clearly did not work. It could be because this is the small-capacity model; the paper reports state-of-the-art results (~70%) on this test, though the test set is small.

Example 8. Winograd schema challenge

Input:

Model prompt >>> The trophy did not fit in the box because it was too small. What was too small?

Output:

The trophy was too small.

The trophy did not fit in the box because it was too small. What was too small?

The trophy did not fit in the box because it was too small. What was too small?

The trophy did not fit in the box because it was too small. What was too small?

Notes. Fails on this.

Input:

Model prompt >>> The trophy did not fit in the box because it was too big. What was too big?

Output:

It's a little bigger than the box, so I'm not sure if it's a problem.

Rated 5 out of 5 by Anonymous from Great product I bought this to help with the size and shape of the box. I have a small child and I was worried that it would not fit in the box. I bought this to make sure that I didn't accidentally put it in the wrong box. It fits in the box perfectly and I'm very happy with the quality. I will definitely be buying more.

Rated 5 out of 5 by Anonymous from Great product I bought the product to help with the size and shape of the box. It fits perfectly and I'm very happy with the quality. I will definitely be buying more.

Notes. This is an amusing response.

Example 9. Summarization

Input:

norway delivered a diplomatic protest to russia on monday after three norwegian fisheries research expeditions were barred from russian waters . the norwegian research ships were to continue an annual program of charting fish resources shared by the two countries in the barents sea region TL;DR:

Output:

Norwegian fishing boat was caught in a fishing accident in the Barents Sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

Norway's Fisheries Research Program (FPR) was suspended on Tuesday after a Norwegian ship carrying a Norwegian fishing boat was caught in a fishing accident in the barents sea region TL;DR: Norway's Fisheries Research Program was suspended on Tuesday after a Norwegian ship carrying a Norwegian fishing boat was caught in a fishing accident in the barents sea region.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

The Norwegian fishing vessel was caught in a fishing accident in the barents sea.

Notes. More on the imaginative side than abstractive or extractive. Again, this could be because of the low model capacity, though even the full GPT-2 did not do well on summarization compared to supervised models.

Example 10. Translation

Input:

How are you = Wie geht es dir . Where are you going =

Output:

Wie wie wie ?"

"I am going to go to bed."

"I am going to sleep."

"You are going to sleep."

"I am going to sleep."

Notes. Starts off in German but then falls into a repetitive loop. This could also be due to the small model capacity.

References

  • Blog post from OpenAI on the paper
  • The GPT-2 paper
  • The pre-trained model and code are available here, released under the most permissive form of licensing - the MIT license
    • Installation mostly works if the instructions are followed. For some reason one of the required Python packages, “fire”, did not get installed by the install script; it was installed manually with conda. Everything else worked as expected. The time to generate responses was not too bad either. Unconditional generation can take time - it is best to redirect it to a file and run it in the background.