Language models (from GPT-3 to Wudao 2.0)

In July 2021, the WHO issued guidance on the visibility of training datasets used in AI and language models:

Promote transparency. Introduction of any AI technology must be sufficiently transparent that it can be criticised, by the public or by internal review mechanisms… the data used to train the algorithm, whether certain groups were systematically excluded from such data, how the training data were labelled and by whom (including expertise and appropriateness of labelling) should be known. (WHO, 2021)

GPT-3’s top 10 datasets by domain/source

Download source (PDF)
Contents: View the data (Google sheets)

Contents of GPT-3 & the Pile v1

Download source (PDF)
Contents: View the data (Google sheets)
Read detail of datasets within GPT-3 and the Pile v1, & see alternative viz

List of datasets in data models GPT-3, GPT-J, GPT-NeoX

Note: Text provided here for indexing only; please see the Google sheet above for the intended formatting.


What is in GPT-3? GPT-3 contains (sorted by most tokens/effective size):

  1. Common Crawl (www)
  2. WebText (Reddit links)
  3. Books2 (Libgen or similar)
  4. Books1/BookCorpus (Smashwords)
  5. Wikipedia (facts)

Common Crawl (C4)

What is in Common Crawl? Common Crawl includes (C4, cleaned/filtered, sorted by most tokens):

  1. Google Patents (papers)
  2. The New York Times (news)
  3. Los Angeles Times (news)
  4. The Guardian (news)
  5. PLoS – Public Library of Science (papers)
  6. Forbes (news)
  7. HuffPost (news)
  8. – dead link (papers)
  9. Scribd (books)
  10. The Washington Post (news)
  11. The Motley Fool (opinion)
  12. InterPlanetary File System (mix)
  13. Frontiers Media (papers)
  14. Business Insider (news)
  15. Chicago Tribune (news)
  16. (discussion)
  17. The Atlantic (news)
  18. Springer Link (papers)
  19. Al Jazeera (news)
  20. Kickstarter (discussion)
  21. FindLaw Caselaw (papers)
  22. National Center for Biotech Info (papers)
  23. NPR (news)
  24. and more…

The Pile v1

What is in the Pile v1? The Pile v1 contains (sorted by most tokens/effective size):

  1. Common Crawl (www)
  2. PubMed Central (papers)
  3. Books3 (Bibliotik tracker)
  4. WebText (Reddit links)
  5. ArXiv (papers)
  6. Github (code)
  7. FreeLaw (papers)
  8. Stack Exchange (discussion)
  9. USPTO Background (papers)
  10. PubMed Abstracts (papers)
  11. Gutenberg (books)
  12. OpenSubtitles (movies)
  13. Wikipedia (facts)
  14. DM Mathematics (papers)
  15. Ubuntu IRC (discussion)
  16. Books1/BookCorpus (Smashwords)
  17. EuroParl (formal discussion)
  18. HackerNews (discussion)
  19. YoutubeSubtitles (movies)
  20. PhilPapers (papers)
  21. NIH ExPorter (papers)
  22. Enron Emails (discussion)

GPT-3 is sometimes misspelt as: GPT3, GPT 3, GPT three, GTP-3, GTP3, GTP 3, GTP three.

Contents of Chinese models

Download source (PDF)
Contents: View the data (Google sheets)

List of datasets in Chinese data models PanGu Alpha, Wudao 2.0

Note: Text provided here for indexing only; please see the Google sheet above for the intended formatting.

PanGu Alpha

What is in PanGu Alpha? PanGu Alpha contains (sorted by most tokens/effective size):

  1. Common Crawl (www)
  2. Public datasets: DuReader (discussion), Baidu QA (discussion), CAIL2018 (legal papers), SogouCA (news), and more…
  3. News
  4. Encyclopedia: Baidu Baike (facts), Sogou Baike (facts), and more…
  5. e-Books

Wudao 2.0

WuDaoCorpora 1.0 (dataset) and Wudao 1.0 (model) were launched in March 2021.
WuDaoCorpora 2.0 (dataset) and Wudao 2.0 (model) were launched in June 2021 (at the 2021 BAAI conference).

WuDaoCorpora 2.0 is composed of three parts:
1. WDC-Text (3TB text), the world’s largest plain text dataset.
2. WDC-ImageCaption (90TB image and text), the world’s largest multimodal dataset.
3. WDC-Dialogue (180GB text), the world’s largest Chinese dialogue dataset.

WDC-Text (3TB text)
3TB of text data, with labelling. “20 strict cleaning rules used by WuDaoCorpora1.0, and derives high-quality datasets from more than 100TB of original web page data.”

WDC-ImageCaption (90TB image and text)
“Contains 630 million image and text pairs, with a total data volume of about 90TB, the largest in the world. Among them, 600 million is related to graphics and text, and 30 million is a specific description of the content of the image.”

WDC-Dialogue (180GB text)
“Contains 181GB of high-quality Chinese dialogue data, and the total number of dialogues reaches 1.4B… Cleaned up 180GB of high-quality dialogue data from 9TB of raw data.”
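
The cleaning figures quoted above imply how little of the raw data survives filtering. The following is a minimal sketch of that arithmetic, assuming decimal units (1TB = 1000GB), which the source does not specify; the function name is illustrative. Because WDC-Text is derived from "more than 100TB", its figure is an upper bound.

```python
# Sizes quoted in the WuDaoCorpora 2.0 descriptions, expressed in gigabytes.
# Assumption: decimal units (1 TB = 1000 GB); the source does not say which it uses.
GB_PER_TB = 1000


def retention_percent(cleaned_gb: float, raw_gb: float) -> float:
    """Share of the raw data that survives cleaning, as a percentage."""
    return 100 * cleaned_gb / raw_gb


# WDC-Text: 3TB kept from "more than 100TB" of raw web pages (so an upper bound).
wdc_text = retention_percent(3 * GB_PER_TB, 100 * GB_PER_TB)
# WDC-Dialogue: 180GB kept from 9TB of raw data.
wdc_dialogue = retention_percent(180, 9 * GB_PER_TB)

print(f"WDC-Text retained at most {wdc_text:.1f}% of the raw data")
print(f"WDC-Dialogue retained {wdc_dialogue:.1f}% of the raw data")
```

On these figures, roughly 2 to 3 percent of the raw data is kept in each case.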

What is in Wudao 2.0? Wudao 2.0 contains:
WuDaoCorpora2 – Chinese text only:

  1. Zhihu (discussion)
  2. Baidu Baike (facts/encyclopedia)
  3. Sogou Baike (facts/encyclopedia)
  4. Baidu QA (discussion)
  5. Other*:

(*best guess only, sorted by most visits):

  1. Tencent QQ (messenger)
  2. Sohu (news)
  3. Sina Weibo (discussion)
  4. Sina Corporation (news)
  5. Xinhua News Agency (news)
  6. Chinese Software Dev Network (discussion)
  7. Global Times (news)
  8. Tianya Club (discussion)
  9. (finance discussion)
  10. BabyTree (parenting discussion)
  11. CNBlogs (software discussion)
  12. 6Rooms (news)
  13. NetEase (discussion)
  14. Hunan Rednet (news)
  15. Bilibili (video discussion)
  16. and more…

“Corpora contains various data types including news, post bar comments (sic), encyclopedia information, etc. More specifically, WuDaoCorpora contains a 3 TB Chinese corpus collected from 822 million Web pages” (WuDaoCorpora paper, Tang et al, June 2021).

“For training of base model, we use a training set of 302GB, the distribution of these data is shown in Table 7” (Inverse Prompting paper, Tang et al, June 2021).

Wudao 2.0 is sometimes misspelt as: Wudao-2, Wudao 2, Wu dao 2.0, Woodao, Woo dao.

Chinese model names & dataset equivalent in English

PanGu Alpha: Launched by Huawei and others in April 2021.
Simplified Chinese: 盘古
Traditional Chinese: 盤古
Pinyin: Pán gǔ
Pronounced: pun-goo (rhymes with done tool)
English: Literal: ‘coil ancient’, first living being and the creator (coiled up in an egg).
Etymology: Mythical Chinese creation figure who emerged from a yin-yang egg and created the earth and sky (similar to the Christian creation story, and Pangu has been compared to Adam).

Wudao 2.0: Launched by the Beijing Academy of Artificial Intelligence (BAAI) and others in June 2021.
Simplified Chinese: 悟道
Traditional Chinese: 悟道
Pinyin: Wù dào
Pronounced: oo-dao (rhymes with tool now)
English: Literal: ‘Enlightenment’.
Etymology: Truth of the Dharma, the spiritual path.

Chinese dataset                              English dataset equivalent
Zhihu (discussion)                           Quora
Baidu Baike (facts) (16M articles)           English Wikipedia (7M articles)
Sogou Baike (facts)                          English Wikipedia (7M articles)
Baidu QA (discussion)                        Stack Exchange
Tencent QQ (messenger)                       ICQ
Sohu (news)                                  NBC
Sina Weibo (discussion)                      Twitter
Sina Corporation (news)                      CNN
Xinhua News Agency (news)                    CBS
Chinese Software Dev Network (discussion)    Stack Exchange
Global Times (news)                          Washington Post
Tianya Club (discussion)                     Yahoo! Groups
(finance discussion)                         Yahoo Finance
BabyTree (parenting discussion)              TheBump
CNBlogs (software discussion)                Hacker News
6Rooms (news)                                Huffington Post
NetEase (discussion)                         Blizzard
Hunan Rednet (news)                          The New York Times
Bilibili (video discussion)                  YouTube

Language model sizes & predictions

Download source (PDF)
Sizes: View the data (Google sheets)

Summary of current models

Name          Launch     Tokens   Params ↓   % of 1.75T   Training text (raw size)
–             Feb 2019   10B      1.5B       0.09%        40GB
–             Jun 2021   400B     6B         0.34%        825GB
–             May 2020   499B     175B       10%          570GB
PanGu Alpha   Apr 2021   40B      200B       11.43%       1.1TB
–             May 2021   560B     204B       11.66%       1TB?
Wudao 2.0     Jun 2021   500B?    1.75T      100%         3TB
–             Jun 2021   1T?      200B?      11.43%?      1TB?
–             TBA        20T?     10T?       571%?        5TB?
–             TBA        500B?    175B?      10%          825GB?

Summary of current models: View the data (Google sheets)
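
The "% of 1.75T" column expresses each model's parameter count as a share of the 1.75T-parameter row (the one marked 100%). A minimal sketch of that calculation follows; the function and variable names are illustrative and not part of the source data.

```python
# Baseline: the 1.75T-parameter row, which the table marks as 100%.
BASELINE_PARAMS = 1_750_000_000_000


def percent_of_baseline(params: int) -> float:
    """Parameter count as a percentage of the 1.75T baseline, to two decimals."""
    return round(100 * params / BASELINE_PARAMS, 2)


# Parameter counts as listed in the Params column of the table.
param_counts = {
    "1.5B": 1_500_000_000,
    "6B": 6_000_000_000,
    "175B": 175_000_000_000,
    "200B": 200_000_000_000,
    "204B": 204_000_000_000,
    "10T": 10_000_000_000_000,
}

for label, params in param_counts.items():
    print(f"{label:>5} -> {percent_of_baseline(params)}%")
```

The table shows the 10T row as a whole number (571%), while the two-decimal value computed here is 571.43%.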

Facebook BlenderBot 2.0

Launched in July 2021, BlenderBot 2.0 is pre-trained on WebText (Reddit discussion) and fine-tuned on the ConvAI2, Empathetic Dialogues, and Wizard of Wikipedia (WoW) datasets. Two additional datasets are used: Multi-Session Chat and Wizard of the Internet (WizInt). To train for safety, it uses the BAD dataset. Finally, it can add live results in real time by ‘generating its own search queries, reading the results, and taking them into account when formulating a response.’

List of validation set domains in WizInt/BlenderBot 2.0

The BlenderBot 2.0 chatbot uses live/realtime web search engine results as part of its language model. To build the WizInt dataset, humans were paired up to have a conversation, with one human given the option to perform a web search (query and query + "news") before responding to their partner. Search results were added to the conversation by the human 80.3% of the time. The resulting WizInt dataset (validation set of human conversations with search) is used as supervision for new queries in BlenderBot 2.0. That is, new conversations with BlenderBot 2.0 will generate new responses that may include live/realtime web search engine results.

Breakdown of the most common domains used during search (validation set breakdown). Shown are the most common domains, which account for 24.41% of the total; there is a long tail of 1,233 other domains across the whole validation set.

Domain %
Wikipedia 8.56%
IMDb 3.08%
Britannica 2.28%
Healthline 0.84%
All Recipes 0.84%
Rotten Tomatoes 0.8%
Ranker 0.8%
Genius 0.76%
Rolling Stone 0.67%
Live About 0.63%
The Spruce Eats 0.55%
The Guardian 0.51%
Biography 0.51%
Esquire 0.42%
The Spruce 0.38%
Men’s Health 0.38%
Book Series in Order 0.38%
Trip Savvy 0.34%
Forbes 0.34%
Thoughtco 0.34%
Wikihow 0.34%
WebMD 0.34%
Thrillist 0.34%
1,233 more domains… 75.59%
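
The split between listed domains and the long tail can be checked by summing the rows above. This is a minimal sketch using the percentages exactly as printed; the variable names are illustrative. Summing the rounded rows gives 24.43%, marginally above the stated 24.41%, which is presumably a rounding effect in the individual rows.

```python
from decimal import Decimal

# Domain shares (%) exactly as printed in the table above.
shares = {
    "Wikipedia": "8.56", "IMDb": "3.08", "Britannica": "2.28",
    "Healthline": "0.84", "All Recipes": "0.84", "Rotten Tomatoes": "0.8",
    "Ranker": "0.8", "Genius": "0.76", "Rolling Stone": "0.67",
    "Live About": "0.63", "The Spruce Eats": "0.55", "The Guardian": "0.51",
    "Biography": "0.51", "Esquire": "0.42", "The Spruce": "0.38",
    "Men's Health": "0.38", "Book Series in Order": "0.38",
    "Trip Savvy": "0.34", "Forbes": "0.34", "Thoughtco": "0.34",
    "Wikihow": "0.34", "WebMD": "0.34", "Thrillist": "0.34",
}

# Decimal avoids binary floating-point drift when adding two-decimal figures.
listed_total = sum(Decimal(value) for value in shares.values())
long_tail = Decimal("100") - listed_total

print(f"Listed domains: {listed_total}%")
print(f"Long tail:      {long_tail}%")
```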

References for BlenderBot 2.0

From the paper: Summary of Figure 2.

Read the paper:
BlenderBot 2.0 (Facebook): Komeili et al (2021). Internet-Augmented Dialogue Generation. (PDF)

Google LaMDA: Language Model for Dialogue Applications.
Trained on dialogue. No details released as of July 2021.

Note that LaMDA’s predecessor, Google Meena, had 2.6B parameters trained on 40B words, from 867M context/response conversations, from 341GB of text. The dataset was filtered from public domain social media conversations (Reddit or similar). Google Meena was launched in January 2020.

Between releases of Google Meena and Google LaMDA was Facebook Blender, which had 9.4B parameters trained on 88.8B words, from 1.5B context/response samples. The dataset was filtered from public domain social media conversations on Reddit. Facebook Blender was launched in April 2020.
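
For a rough comparison of Meena and Blender, the quoted figures can be turned into words per parameter and words per context/response sample. This is a sketch derived only from the numbers quoted above; the dictionary layout and function names are illustrative assumptions.

```python
# Figures as quoted in the text: parameters, training words, context/response samples.
models = {
    "Google Meena": {"params": 2.6e9, "words": 40e9, "samples": 867e6},
    "Facebook Blender": {"params": 9.4e9, "words": 88.8e9, "samples": 1.5e9},
}


def words_per_param(model: dict) -> float:
    """Training words seen per model parameter."""
    return model["words"] / model["params"]


def words_per_sample(model: dict) -> float:
    """Average number of words in one context/response sample."""
    return model["words"] / model["samples"]


for name, model in models.items():
    print(
        f"{name}: {words_per_param(model):.1f} words/param, "
        f"{words_per_sample(model):.1f} words/sample"
    )
```

On these figures, Meena saw more training words per parameter than Blender, while Blender's samples were longer on average.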

Dr Alan D. Thompson is an Australian AI expert and consultant. He has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to major AI projects with intergovernmental organisations and impactful companies. Contact.

This page last updated: 22/Jul/2021.