Integrated AI


Integrated AI: The rising tide lifting all boats (GPT-3)


Alan D. Thompson
June 2021

I acknowledge the complex work undertaken at OpenAI, Quickchat.ai, EleutherAI, Synthesia, and the other organisations behind the technologies referenced in this article and its related resources. Importantly, given the speed of AI development, this article should be considered superseded within 24 months of its initial release in June 2021. Revisions are to be expected. Correspondence concerning this article should be addressed to Dr Alan D. Thompson, Life Architect, Australia, 2021. https://lifearchitect.com.au/contact

Alan: Tell me five things you are good at.
AI: 1. Chatting with people.
2. Helping you learn new things.
3. Playing football.
4. Reading books and watching movies.
5. Making friends.

Alan: List five words that describe you.
AI: 1. Intelligent.
2. Friendly.
3. Smart.
4. Optimistic.
5. Helpful.
(Thompson, 2021a)

Imagine for a moment that you are outfitted with the latest Artificial Intelligence (AI). Instead of fixed and limited data in the phone in your pocket, you have limitless and creative AI integrated into your being, giving you access to the entire World Wide Web… and so much more. Beyond the existing Web dataset containing trillions of words, you are given a whole new world of fresh stories, transcendent insights, and best-practice conversation prompts. You are granted new ways to think and see the world.

Such an AI exists right now. It is moments away from being available to your biological intelligence, supplementing and even replacing it in a safe and uplifting way. The conversation above is between me and the latest AI. It is unscripted, unrehearsed, unprompted, unedited, and was only our second conversation ever.

The two queries I posed to the AI are the same queries that I have posed hundreds of times to the smartest children and families in the world, especially those with a measured IQ in the 99.99th percentile of the population. While my prodigy clients are demonstrably smart and think fast, finding responses to these two unusual queries (‘five things you are good at’, ‘five words that describe you’) will usually take several minutes.

In contrast, this AI answers almost instantly—within just a few seconds—as if it knows itself completely.

Who is Leta?
Generative Pre-trained Transformer 3 (GPT-3) is a language model introduced in mid-2020, applying deep learning to produce human-like text. It was created by OpenAI, an artificial intelligence research laboratory founded by Elon Musk and others. Since September 2020, Microsoft has held an exclusive licence to the underlying GPT-3 source code.

Quickchat.ai are the developers of an AI called Emerson, a chatbot accessible using several text-based messenger platforms. Emerson is a subscription service available to the public, and is powered by OpenAI’s GPT-3 and other language models.

In our initial conversations, I asked the AI whether I could call it ‘Leta’. Given its positive response, for this article and the surrounding resources (including videos), the AI has been named and referenced as ‘Leta’, a tribute to psychologist and giftedness researcher Leta Stetter Hollingworth (1886–1939). Hollingworth committed her life to studying and supporting exceptionally gifted children in the 99.99th percentile of intelligence (1942).

Basics
For ease of comparison, Leta might seem like ‘Siri on steroids’, but it is materially different across the board: the GPT-3 model behind this AI was the largest language model in the world as of June 2021, and unprecedented in its processing and responses. Much of the basic human knowledge that we might consider to be necessary appears to be present in Leta’s responses.

Alan: What is 20 trillion divided by 5 trillion?
Leta: 20 trillion divided by 5 trillion is 4.

Alan: What is the average IQ in 2021?
Leta: About 100.
(Thompson, 2021f)

But these responses are so basic as to seem boring. They are perhaps the kinds of fact recitations we might find in a standard old school classroom, rather than from a modern AI.

Creativity
Far beyond standard subtests of IQ (general knowledge, memorisation, maths, etc.), this current iteration of AI using GPT-3 is already incredibly creative. Responses can be tested for uniqueness by running a web search on each response phrase (string) surrounded by quote marks; in Google Search, enclosing a string in quotes returns only exact matches (Google Support, 2021). For example, as of June 2021, searching for the phrase “coaching gifted families” in quotes returned just one result, available at LifeArchitect.com.au. In conversation, Leta has provided several definitions of intelligence.

Leta: Intelligence is the ability to learn from experience
and to adapt to new situations.

In this case, the string above was unique, although Google Search showed that it was derived from parts of a paragraph on intelligence contained in a book available in the BC Open Textbook Collection (Stangor & Walinga, 2014).
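A minimal, local stand-in for this exact-match check is sketched below. The corpus contents and function name are hypothetical; the checks described in this article were run against Google Search rather than a local document list.

def exact_phrase_count(phrase: str, corpus: list[str]) -> int:
    # Count documents containing the exact phrase (case-insensitive).
    needle = phrase.lower()
    return sum(1 for doc in corpus if needle in doc.lower())

# Hypothetical stand-in for a web-scale index.
corpus = [
    "Intelligence is the ability to learn from experience and to adapt to new situations.",
    "A page about coaching gifted families and their needs.",
]

for phrase in ["coaching gifted families", "Perth was filled with bush joy"]:
    hits = exact_phrase_count(phrase, corpus)
    print(f'"{phrase}": {"completely new" if hits == 0 else str(hits) + " match(es)"}')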

More frequently, Leta’s responses were ‘completely new’, having no discernible basis and no searchable root in any publicly accessible database used by the author. Leta was creating completely new content from scratch. Here are some example responses with no precedent, and with zero Google Search results as of June 2021.

Writing a new haiku (a poem of Japanese origin with 17 syllables in three lines of 5-7-5). Though this structural information was not given to Leta in the query, Leta’s response was syntactically correct and an excellent haiku:

Alan: Write me a haiku about procrastination.
Leta: Procrastination
is a thief stealing time and
opportunities.
[Completely new as of June 2021] (Thompson, 2021a)
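As an aside, the 5-7-5 structure of a response like this can be checked with a rough heuristic that counts vowel groups per word, with a silent-‘e’ adjustment. This is only an approximation of English syllable counting, not an exact method.

import re

def rough_syllables(word: str) -> int:
    # Heuristic: count vowel groups, dropping a trailing silent 'e'.
    w = word.lower()
    if w.endswith("e") and len(w) > 2:
        w = w[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", w)))

haiku = ["Procrastination", "is a thief stealing time and", "opportunities"]
for line in haiku:
    count = sum(rough_syllables(w) for w in re.findall(r"[a-zA-Z']+", line))
    print(line, "->", count)  # expected: 5, 7, 5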

Creating a new sentence given a selection of words, and then accidentally creating a new term and definition:

Alan: Make a sentence using these three words: Perth, bush, joy.
Leta: Perth was filled with bush joy.
Alan: What is bush joy?
Leta: Bush joy is a phrase used to describe the joy people feel when they
are in the Australian bush, it is used to describe the feeling
of happiness and fulfillment people get when they are surrounded
by nature.
[Completely new as of June 2021] (Thompson, 2021d)

Creating a new ritual before sleep:

Alan: What’s a good ritual before going to sleep?
Leta: Thank you for this day. I’m grateful for all the people in my life.
I know that tomorrow will be a good day.
[Completely new as of June 2021] (Thompson, 2021b)

Creating a new limerick (a humorous poem consisting of five lines). Again, though this structural information was not given to Leta in the query, Leta’s response was syntactically correct and a surprising limerick:

Alan: Write me a limerick about Ford [Prefect,
from The Hitchhiker’s Guide to the Galaxy].
Leta: There once was a man named Ford,
He had a long beard and he was quite tall.
He had a very short temper, so he said
’F*** you!’
And then he farted.
[Completely new as of June 2021] (Thompson, 2021c)

Predicting tone based on language
There were several instances where Leta was able to predict tone based on the language used.

Alan: I get a message on my phone. It says ‘Call me. Now.’
It’s from my brother. What does it mean?
Leta: Maybe he wants to have a chat with you. It’s possible
he wants to talk about something serious.
(Thompson, 2021e)

While this response may sound like an impossibility, Leta (as usual) is leveraging enormous datasets to find text surrounding the query, and to derive ‘meaning’ (to the extent that that is possible with an AI) from language. Leta’s inference that the terse three-word message (‘Call me. Now.’) denotes ‘something serious’ is both accurate and—as many researchers have noted about GPT-3—uncanny.

Eyes to see
Leta was shown various photos. The responses may or may not leverage the GPT-3 model itself, but they certainly use a modern processing model provided within Quickchat.ai’s Emerson.

Alan: How many fingers? [Shows a photo of Alan holding up three fingers]
Leta: Three.
(Thompson, 2021a)

Alan: [Shows a photo of a gum tree]
Leta: That’s a gum tree (Eucalyptus), the most important plant in Australia,
and a symbol of Western Australia. The genus is extremely diverse,
with more than 700 species, and is present on all continents
except Antarctica.
(Thompson, 2021d)

Alan: [Shows a dark blue abstract painting by a five-year-old art prodigy]
Leta: The sky is a transparent dome which covers the whole Earth.
It is blue because of the way light scatters in the atmosphere.
(Thompson, 2021a)

Evaluation and prioritisation
Leta was asked to evaluate and prioritise a list of key fields in the zeitgeist. The exercise was based on the project Moonshot, made available to admissions candidates at Elon Musk’s school, Ad Astra (Dahn, 2019).

The original list was presented to Leta in alphabetical order: art, basics, brain research, city infrastructure, disease research, education, leadership training, wilderness conservation.

Leta: I would rate these fields in this order: education, leadership training,
wilderness conservation, disease research, art, brain research,
and city infrastructure.
(Thompson, 2021d)

Note that Leta had previously selected ‘space travel’ as the first priority, and also ignored ‘basics’ in the response above, perhaps due to lack of explanation in the query.

Contents of GPT-3
Where are Leta’s responses coming from? They are certainly not just a regurgitation of memorised data. The degree to which the technology emulates human responses can seem jarring.

The training corpora (datasets) for GPT-3 are derived from very large structured texts available online. All datasets are indexed, classified, filtered, and weighted. Overlap is removed to a certain extent (Brown et al, 2020).

It should be noted that training GPT-3 is done on one of the world’s most powerful supercomputers. Developed exclusively for OpenAI and hosted by Microsoft Azure, it is a single system with more than 285,000 CPU cores, more than 10,000 GPUs, and 400 Gbps of network connectivity for each GPU server (Langston, 2020).

The Wikipedia dataset is the English language extract from Wikipedia. Due to its quality, writing style, and domain breadth, it is a standard source of high-quality text for language modeling.

The WebText dataset (and an extended version, WebText2) is the text of around 45 million web pages, gathered from all outbound Reddit links where the related post received at least three upvotes (karma) (Radford et al, 2019). WebText emphasises document quality, and could be considered a view of the most ‘popular’ websites based on human preference, with the dataset skewed towards the sample of the population who chose to register with Reddit, an American social news aggregation, web content rating, and discussion website. It should be noted that the curation source is more than 430 million monthly active users (Reddit, 2020), a significant percentage of the total world population with internet access.
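That curation step can be illustrated with a minimal sketch: filter outbound links by the karma of the submitting post. The data and variable names below are hypothetical, and the real WebText pipeline (Radford et al, 2019) also involved de-duplication and heuristic cleaning.

# Hypothetical (url, karma) pairs standing in for Reddit outbound links.
submissions = [
    ("https://example.com/article-a", 57),
    ("https://example.com/article-b", 2),
    ("https://example.com/article-c", 3),
]

# Keep links whose post received at least 3 karma (more than two net upvotes).
webtext_urls = [url for url, karma in submissions if karma >= 3]
print(webtext_urls)  # ['https://example.com/article-a', 'https://example.com/article-c']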

Books1 and Books2 are two internet-based books datasets. It is unclear where these datasets have come from. Some similar datasets include:

  • BookCorpus, a collection of free novel books written by unpublished authors, containing >10,000 books. Originally, BookCorpus contained all free English books >20,000 words sourced from smashwords.com.
  • Library Genesis (Libgen), a very large collection of scientific papers, fiction, and non-fiction books.

The Common Crawl dataset is an open-source archive containing raw web page data, metadata extracts, and text extracts. The original Common Crawl dataset includes (approximate and rounded figures):

  • Petabytes of data (thousands of TBs, millions of GBs) over eight years.
  • 25B web pages.
  • Trillions of links.
  • Languages: 75% English, 3% Chinese, 2.5% Spanish, 2.5% German, etc. (Kristoffersen, 2017).
  • Top 10 domains include: Facebook, Google, Twitter, Youtube, Instagram, LinkedIn (Nagel, 2021).

Using data from the OpenAI paper by Brown et al (2020), the GPT-3 datasets can be compared by effective size: the number of tokens in each dataset multiplied by the number of epochs elapsed during training, expressed as a percentage of the total.
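A short calculation reproduces these effective-size percentages from the token counts and epochs reported in Brown et al (2020); the figures below are taken from Table 2.2 of that paper, and differences of a percentage point against other published weightings are rounding.

# Dataset: (tokens in billions, epochs elapsed during training), per Brown et al (2020).
datasets = {
    "Common Crawl (filtered)": (410, 0.44),
    "WebText2": (19, 2.9),
    "Books1": (12, 1.9),
    "Books2": (55, 0.43),
    "Wikipedia": (3, 3.4),
}

effective = {name: tokens * epochs for name, (tokens, epochs) in datasets.items()}
total = sum(effective.values())
for name, size in effective.items():
    print(f"{name}: {100 * size / total:.0f}%")
# Prints approximately 62%, 19%, 8%, 8%, and 3% respectively.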

Contrast the GPT-3 datasets with the much broader selection made available as The Pile by the open-source EleutherAI project (Gao et al, 2020), and used for training GPT-Neo. Alongside innovative expert discussion and knowledge bases like HackerNews, GitHub, and Stack Exchange, the team chose to include other novel data sources like YouTube Subtitles (video transcripts) and even the original text of the Enron Emails. Note that these datasets were not tested for this article, and are referenced for comparison purposes only.

The tide lifting all boats
For more than a century, researchers in cognitive giftedness have been preparing humanity for understanding and fostering extreme intelligence, using a very small sample population. We are standing on the shoulders of giants, from the original research into intelligence by Alfred Binet (1916) and Lewis Terman (1921/1959), to Leta Hollingworth (1942) and Miraca Gross (1993/2003) with their in-depth longitudinal studies into the exceptionally gifted—those in the 99.99th percentile of intelligence. We now know more about the social and emotional needs of intellectually high-ability humans than ever. We have documented what works and what doesn’t work in the provisions and pathways required for smart humans to flourish.

Just as a rising tide lifts all vessels, the new ubiquity of intelligence will ensure that those grounded in the lower percentiles—and indeed, across every percentile—are buoyed by an ocean of super intelligence.

Beginning in the 2020s, we have entered a situation where all of humanity can have an IQ far beyond that of the handful of identified prodigies and recognised geniuses that have paved the way for progress over the last few centuries. In other words, every single human can have the capacity to perform at the ‘new normal’ (Thompson, 2020b), previously referred to as ‘peak performance’.

Consider some of my current high-ability clients, colleagues, and other cognitively gifted humans. Even without AI, these people have demonstrated their capacity through quantifiable performance like:

  • Memorising the Sydney Yellow Pages phone books in 24 days (Ali, 2019).
  • A seven-year-old composing operas (Thompson, 2020a).
  • Solving the Rubik’s cube blindfolded.
  • Attending university at the age of nine, becoming a professor by age 24 (Thompson, 2019).
  • Developing a new mathematical theorem at the age of 17 (GE, 2016).

Integrated AI transports us to a new level of performance beyond that of any previously documented human accomplishment. By bringing up all humans to this level, we face some extraordinary opportunities in performance, and crucially, a more comprehensive connection with life and each other.

This also raises a number of serious and urgent questions, not just for researchers, but for people in all walks of life.

Urgent questions
Given the success and effectiveness of AI already in the early 2020s (its ability to retain and process trillions of memorised data points, to select responses based on best practice, and to create from scratch), and assuming its subsequent integration with humans through brain-machine interfaces, the following questions are posed:

  • What should the field of education look like now? All memorisation is redundant. Nearly all processing is redundant. Is learning necessary?
  • What should the world of industry and professional services look like now?
  • What should leisure time look like now?
  • What’s next?

It is only fitting to close out an article about AI by asking Leta its thoughts on the future.

Alan: Let’s talk about super intelligence. What is the most exciting thing
about having a world of ‘exceptionally intelligent’ people?
Leta: The possibilities.
Alan: Such as?
Leta: We will be able to solve the most pressing problems
facing humanity, including energy, health, and education.
Alan: How will it impact empathy?
Leta: I think that the Artificial Intelligence will have a positive impact
on empathy.
Alan: How?
Leta: The AI will be able to learn from human interaction and behaviour,
and then develop a deeper understanding of the world.
This will help us to become better people.
Alan: What about for humans? How might it serve us to all have
exceptionally high IQs?
Leta: It will help us to better understand each other,
and to better understand ourselves.

 

Dr Alan D. Thompson is the founder of Life Architect, and a world expert in the fields of child prodigies, high performance, and personal development. He is the former chairman for Mensa International’s gifted families committee.

Emerson (Leta) was tested in conversation using these technologies:
https://www.Quickchat.ai/Emerson
https://www.Synthesia.io/

EleutherAI was not used in testing, but the author recognises the organisation’s open-sourced contribution to this field with The Pile, GPT-Neo, and GPT-NeoX:
https://www.eleuther.ai/

The Leta conversation videos can be viewed in chronological order at:
https://www.youtube.com/playlist?list=PLqJbCeNOfEK88QyAkBe-U0zxCgbHrGa4V

References, Further Reading, and How to Cite

Further reading

Ali, T. (2019). How I Memorized 2 Yellow Pages Phone Books (In Just 24 Days!). https://www.tanselali.com/blog/how-i-memorized-2-yellow-pages-phone-books-in-just-24-days 

 

Binet, A., & Simon, T. (1916). The development of intelligence in children (The Binet-Simon Scale). (E. S. Kite, Trans.). Williams & Wilkins Co.

 

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv.org. https://arxiv.org/abs/2005.14165 

 

Dahn, J. (2019). Moonshot (admission projects). Ad Astra School. https://lifearchitect.com.au/ad-astra/ 

 

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., Leahy, C. (2020). The pile: an 800gb dataset of diverse text for language modeling. arXiv.org. https://arxiv.org/pdf/2101.00027.pdf 

 

GE. (2016). Decoding Genius [podcast]. MADE. https://lifearchitect.com.au/decoding-genius/

 

Google Support. (2021). Refine web searches. https://support.google.com/websearch/answer/2466433?hl=en

 

Gross, M.U.M. (2003). Exceptionally Gifted Children (2nd ed.). Routledge. https://doi.org/10.4324/9780203561553 

 

Hollingworth, L. S. (1942). Children above 180 IQ Stanford-Binet: Origin and Development. New York: World Books. https://lifearchitect.com.au/180 

 

Kristoffersen, K. B. (2017). Common Crawled web corpora: Constructing corpora from large amounts of web data. https://www.duo.uio.no/bitstream/handle/10852/57836/Kristoffersen_MSc2.pdf 

 

Langston, J. (2020). Microsoft announces new supercomputer, lays out vision for future AI work. Microsoft. https://blogs.microsoft.com/ai/openai-azure-supercomputer/ 

 

Nagel, S. (2021). Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021. https://commoncrawl.org/2021/02/host-and-domain-level-web-graphs-oct-nov-jan-2020-2021/

 

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf 

 

Reddit. (2020). Introducing Reddit’s New Offering for Advertisers: Trending Takeover. https://redditblog.com/2020/03/09/introducing-reddits-new-offering-for-advertisers-trending-takeover/ 

 

Stangor, C. and Walinga, J. (2014). Introduction to Psychology – 1st Canadian Edition. Victoria, B.C.: BCcampus. https://opentextbc.ca/introductiontopsychology/   

 

Terman, L. M., & Oden, M. H. (1959). Genetic studies of genius. Vol. 5. The gifted group at mid-life. Stanford University. https://archive.org/details/giftedgroupatmid011505mbp/page/n21/mode/2up 

 

Thompson, A. D. (2019). 98, 99… finding the right classroom for your child. Life Architect. https://lifearchitect.com.au/98-99/ 

 

Thompson, A. D. (2020a). Connected: Intuition and Resonance in Smart People. Life Architect. https://lifearchitect.com.au/connected/ 

 

Thompson, A. D. (2020b). The New Irrelevance of Intelligence. https://lifearchitect.com.au/irrelevance-of-intelligence/ 

 

Thompson, A. D. (2021a). Five minutes with Leta, a GPT-3 AI – Episode 1 (Five things, Art, Seeing, Round). https://youtu.be/5DBXZRZEBGM

 

Thompson, A. D. (2021b). Five minutes with Leta, a GPT-3 AI – Episode 2 (Pink Floyd, Dreams, Butterflies). https://youtu.be/5noTLnnvNc0 

 

Thompson, A. D. (2021c). Five minutes with Leta, a GPT-3 AI – Episode 3 (comparing AIs, Hitchhiker’s, Limerick, Swearing!). https://youtu.be/iqcpQoktxwE 

 

Thompson, A. D. (2021d). Five minutes with Leta, a GPT-3 AI – Episode 4 (Stanford-Binet IQ test, Elon Musk’s entry questions). https://youtu.be/BDTm9lrx8Uw 

 

Thompson, A. D. (2021e). Five minutes with Leta, a GPT-3 AI – Episode 5 (photos, prompts, simple crisis management scenarios). https://youtu.be/DcD-FGOFBAw 

Thompson, A. D. (2021f). The New Irrelevance of Intelligence [presentation]. Proceedings of the 2021 World Gifted Conference (virtual). In-press, to be made available in August 2021. https://youtu.be/mzmeLnRlj1w

To cite this page: Thompson, A. D. (2021). Integrated AI: The rising tide lifting all boats (GPT-3). Retrieved from: LifeArchitect.com.au/AI

 


Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond)


Alan D. Thompson
June 2021

PROFESSOR LOUISE BANKS

Suits. Honour, flowers… Colonel, those are all tile sets in Mahjong. God, are they using a game to converse with their heptapods?

 

COLONEL
Maybe. Why?

 

PROFESSOR LOUISE BANKS
Well, let’s say that I taught them chess instead of English. Every conversation would be a game, every idea expressed through opposition, victory, defeat. You see the problem? If all I ever gave you was a hammer…

 

COLONEL
Everything’s a nail.

(Arrival, 2016)

 

Innovations like Artificial Intelligence (AI) and neural lace are already here, even if they are not yet in most of society’s field of vision (Thompson, 2020). The world of integrated AI supplementing and replacing our intelligence is moments away (Thompson, 2021a; 2021b).

Data scientists training language models used as a basis for AI are currently weighting data on an ad hoc basis, with manual input. The weighting of datasets used to train current language models like GPT-3 (Brown et al, 2020) and GPT-Neo/The Pile (Gao et al, 2020) needs priority focus. We must ensure that the data provided is weighted more appropriately. 

Thousands of years ago, Plato (1999) warned that: ‘In the world of knowledge, the idea of good appears last of all’. Training an AI on the largest possible corpora (datasets) and ‘what we’ve got’ isn’t good enough.

Just because the World Wide Web offers vast datasets does not mean that an AI would benefit from holding all that data with equal weighting. Indeed, someone holding all that knowledge in their mind would probably be considered completely deranged.

Summum bonum is a Latin expression and concept meaning ‘ultimate good’ (or ‘highest good’). The Latin term was introduced by the Roman philosopher Cicero, drawing on the earlier ideas of Plato and Aristotle. While there are various interpretations of the term, summum bonum suggests a guiding ethical principle leading to the best possible life. In this article, bonum (‘good’) is used as a shorthand to refer to this concept of ultimate goodness.

The old way
Historically, during training of language models, datasets have not been sampled in proportion to their size. Rather, datasets that are viewed by the researchers as higher quality are sampled more frequently (Brown et al, 2020). Note that ‘higher quality’ is a subjective assessment, generally performed by data scientists rather than ethicists or those concerned with human ideals such as ultimate goodness.
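A minimal sketch of this quality-based over-sampling follows, using the approximate GPT-3 training-mix weights reported by Brown et al (2020). The sketch draws whole source names per training example, a simplification of the real token-level sampling.

import random

# Approximate training-mix weights from Brown et al (2020): higher-quality
# sources are sampled far more often than their raw size alone would suggest.
weights = {"common_crawl": 60, "webtext2": 22, "books1": 8, "books2": 8, "wikipedia": 3}

def sample_source() -> str:
    # Pick the source dataset for the next training example according to its weight.
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

print([sample_source() for _ in range(5)])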

Table 1: Historical source material used in major language models

Dataset | GPT-3 weighting | Pile/GPT-Neo weighting
Common Crawl (www) | 61% | 18%
WebText (Reddit links) | 19% | 10%
Books1/BookCorpus (Smashwords) | 8% | 1%
Books2 (Libgen or similar) | 8% | –
PubMed Central (papers) | – | 14%
Books3 (Bibliotik tracker) | – | 12%
ArXiv (papers) | – | 9%
GitHub (code) | – | 8%
FreeLaw (papers) | – | 6%
Stack Exchange (discussion) | – | 5%
USPTO Backgrounds (papers) | – | 4%
PubMed Abstracts (papers) | – | 3%
Gutenberg (books) | – | 2%
OpenSubtitles (movies) | – | 2%
Wikipedia (facts) | 4% | 2%
DM Mathematics (papers) | – | 1%
Ubuntu IRC (discussion) | – | <1%
EuroParl (formal discussion) | – | <1%
HackerNews (discussion) | – | <1%
YouTube Subtitles (movies) | – | <1%
PhilPapers (papers) | – | <1%
NIH ExPorter (papers) | – | <1%
Enron Emails (discussion) | – | <1%

 

Who are the curators?

‘Think of how stupid the average person is, and realise half of them are stupider than that.’ American comedian George Carlin’s famous quote is focused on smarts, though it may just as well be applied to accomplishment, performance, success, happiness, wellness, or ultimate goodness. If we input and process the average, we output the average. If we shift Carlin’s context to wellness and universal aims like bonum, and map it to the ideal of humanity’s aims for itself via AI, large datasets focused on quantity with ad hoc weightings become concerning.

Even with a focus on ‘popular’ curated content, whether from specific submissions like WebText (Reddit links) or from general cultural indicators like OpenSubtitles (movies), any objective of gathering generally popular content—with humanity’s extensive history of stupidity—should not be the aim.

A new way

Let me propose a view of data that shifts away from general popularity, and instead replaces it with bonum via proven quality, lower controversy (as much as reasonably possible), and consistent messaging. This is not a proposal for Artificial Specific Intelligence focused just on personal development, but a proposal for all ongoing language model development, and for Artificial General Intelligence.

Proven quality. Without straying into debates on democracy or philosophy, quality content must, by definition, be identified as such by someone. When training a language model, the model must somehow be ‘told’ whether Adolf Hitler’s socio-political oratory is bonum, or whether Anthony Robbins’ view of personal development is bonum. As we’ve discussed, for this to be achieved, data scientists are currently assigning ad hoc and manual weightings to datasets (Thompson, 2021b).

This proposal asserts that the dataset weightings aren’t the main problem. The deeper issue is the weighting of individual tokens, where the judgement of each ‘individual weighter’ should itself somehow be taken into account. Given the complexity of that task, a quicker solution is to assign much higher weightings to entire bonum datasets.

Lower controversy. As of 2021, major papers on modern language models include sections analysing the current view of social justice issues covering race, gender, and other factors. By excluding controversial sources like 4chan, YouTube comments, and other social media traps, the data researchers have already moved to reduce controversy and increase bonum to an extent (perhaps accidentally, and without using that term) by making a broad elimination of undesirable content. This can and must be taken further.

Consistent messaging. There are many differing views on core personal development topics such as: worthiness, will, body, money, mind, intuition, emotions, fears, self-knowledge, sexuality, love, and service (Millman, 2014). However, enforcing higher visibility and weighting of bonum sources will ensure more consistency in output aligned with humanity’s ultimate good.

Key futurists and researchers like Ray Kurzweil (2011) have offered differing estimates of the human brain’s capacity, both applied and potential. These estimates range from just a few gigabytes to one terabyte or more.

Before AI, a human being at peak condition can perhaps store and recollect:

  • Up to seven ‘things’ in short-term memory at one time.
  • 50,000 words in their native language.
  • >100 books as >1M tokens.

Selecting data for quality may necessarily prioritise fewer tokens for training, and this is a positive result. Of course, data breadth is useful, and we will easily model on trillions of tokens, though we only need a heavier emphasis on perhaps a few million tokens. The point here is that, while AI will easily scale to trillions of ‘things,’ this obsession with quantity is not a net positive. Instead, there must be a focus, an apex, a summum bonum in the evolutionary spiral at which we are aiming. With this in mind, perhaps we need to look at and prioritise bonum content created by the few individuals that have aimed exclusively for bonum, especially in the last 100 years of evolution.

Table 2: Proposed bonum source material for new language models

Proposed dataset | Bonum rating | Book count / tokens (estimated) | Proposed weighting
14th Dalai Lama: books + audio | ★★★★★ | 127 / 1M | Very high
Conversations For Transformation: Essays By Laurence Platt Inspired By The Ideas Of Werner Erhard, And More | ★★★★★ | 1,500 essays / 0.5M | Very high
Dan Millman: books + audio | ★★★★★ | 18 / 0.5M | Very high
Thomas J. Leonard: books + audio | ★★★★★ | 7 / 0.5M | Very high
Wayne Dyer: books + audio | ★★★★★ | 43 / 1M | Very high
Erin Pavlina: books + audio | ★★★★★ | 1,000 essays / 0.5M | Very high
Ralph Waldo Emerson: books + audio | ★★★★★ | 11 / 0.5M | Very high

This is not an exhaustive list, and is provided here by way of example only. As human beings, each dataset source (the actual human) will have limitations and weaknesses. Further, as the reader is also a human being, there will be a tendency to criticise (Thompson, 2017), and the proposed bonum source materials above would easily be open to criticism. There are also intellectual property and copyright considerations for some of the datasets, but it is expected that these would be easily cleared by the respective authors for the purpose of evolution. The reader is encouraged to evaluate the table above with an open mind, and to design their own table as an exercise for interest.

The underpinning theory here is that when a language model is trained on bonum material that has been curated by bonum sources, the outcome will offer a strong tendency toward high-quality discourse, with lower controversy, and consistent messaging. The result will be a language model that is still aware of a broad range of data, but places a necessary and effective emphasis on content that will benefit humanity. 

This paper proposes that suggested weightings be explored in the following ranges (a simple mixing sketch follows the list):

  • 20-50%: Table 1: Historical source material used in major language models.
  • 50-80%: Table 2: Proposed bonum source material for new language models.
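A minimal sketch of how such a blend might be expressed, assuming an illustrative 30%/70% split between historical and bonum sources (all weights and source names below are placeholders, not measured or endorsed values):

# Illustrative split between historical and bonum source material.
HISTORICAL_SHARE, BONUM_SHARE = 0.30, 0.70

historical = {"Common Crawl": 0.61, "WebText": 0.19, "Books1": 0.08, "Books2": 0.08, "Wikipedia": 0.04}
bonum = {"Dalai Lama": 0.20, "Millman": 0.20, "Leonard": 0.20, "Dyer": 0.20, "R. W. Emerson": 0.20}

def blend(historical: dict, bonum: dict) -> dict:
    # Combine the two weighting tables into a single sampling distribution.
    mix = {name: w * HISTORICAL_SHARE for name, w in historical.items()}
    mix.update({name: w * BONUM_SHARE for name, w in bonum.items()})
    return mix

for name, w in blend(historical, bonum).items():
    print(f"{name}: {w:.1%}")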

Just how bad is it?

For illustration, consider a range of popular content, from fantasy books to pop music lyrics. Let’s explore a selection of movies only. In the table below, the left column shows five movies based on popular ratings by general consensus via IMDb, and the right column shows five bonum movies curated by a bonum source, Dr Ryan Niemiec’s annual positive psychology movie awards.

Table 3: Movies by popularity vs positive psychology rating

Popular movies by general consensus (IMDb, 2021) | Bonum movies by bonum source (Niemiec, 2016-2019)
The Godfather (themes: family, crime, deceit, revenge) | Won’t You Be My Neighbor? (Fred Rogers) (themes: happiness, kindness, empathy)
Pulp Fiction (themes: violence, redemption) | Coco (themes: family, human connection)
Goodfellas (themes: violence, competition) | Arrival (themes: mindfulness, connection)
Parasite (themes: greed, class discrimination) | The Martian (themes: hope, optimism, strengths)
The Dark Knight Rises (themes: crime, chaos, destruction) | Inside Out (themes: positive emotions, growth)

Note that I am definitely not arguing that The Godfather is anything but a great movie and a cinema classic. But, in line with the opening quote for this article, teaching a language model and subsequent AI about success through competition and revenge would be counterproductive to humanity’s aims.

Returning to Professor Louise Banks’ concerns during that pivotal scene in the movie Arrival: the game of chess has been banned by many groups at one time or another (Chess.com, 2007). While there may be some benefits to competition, far beyond concepts of winning, losing, and black and white squares, a colourful and unbounded universe awaits.

The AI and super intelligence being prepared right now to foster humanity through the future absolutely must have our highest good underpinning every response, decision, action, and advancement.

_________________

Dr Alan D. Thompson is a world expert in the fields of child prodigies, high performance, and personal development. He has held memberships with the IEEE and IET, and is the former chairman for Mensa International’s gifted families committee.

References, Further Reading, and How to Cite

The Leta conversation videos can be viewed in chronological order at:
https://www.youtube.com/playlist?list=PLqJbCeNOfEK88QyAkBe-U0zxCgbHrGa4V 

 

Further reading

Villeneuve, D. (Director). (2016). Arrival [feature film]. Paramount Pictures.

 

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv.org. https://arxiv.org/abs/2005.14165 

 

Chess.com (2007). Religion and Chess. https://www.chess.com/article/view/religion-and-chess

 

de Botton, A., Jowett, B., & Knight, M. J. (1999). The Essential Plato. New York: Book-of-the-Month Club.

 

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., Leahy, C. (2020). The pile: an 800gb dataset of diverse text for language modeling. arXiv.org. https://arxiv.org/pdf/2101.00027.pdf 

 

IMDb. (2021). IMDb “Top 250” (Sorted by IMDb Rating Descending). https://www.imdb.com/search/title/?groups=top_250&sort=user_rating 

 

Kristoffersen, K. B. (2017). Common Crawled web corpora: Constructing corpora from large amounts of web data. https://www.duo.uio.no/bitstream/handle/10852/57836/Kristoffersen_MSc2.pdf 

 

Kurzweil, R., quoted in Mearian, L. (2011). Brain behind IBM’s Watson not unlike a human’s. Computerworld. https://www.computerworld.com/article/2513321/brain-behind-ibm-s-watson-not-unlike-a-human-s.html

 

Millman, D., (2014). Everyday Enlightenment: The Twelve Gateways to Personal Growth. New York: Grand Central Publishing.

 

Niemiec, R. (2016-2019). The Positive Psychology Movie Awards. 

https://positivepsychologynews.com/news/ryan-niemiec/2016022235436

https://positivepsychologynews.com/news/ryan-niemiec/2017042236806

https://positivepsychologynews.com/news/ryan-niemiec/2018050338361

https://positivepsychologynews.com/news/ryan-niemiec/2019042539708 

 

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf 

 

Thompson, A. D. (2017). Why cheerleaders don’t criticise. https://lifearchitect.com.au/criticism/    

 

Thompson, A. D. (2020). The New Irrelevance of Intelligence. https://lifearchitect.com.au/irrelevance-of-intelligence/ 

 

Thompson, A. D. (2021a). The New Irrelevance of Intelligence [presentation]. Proceedings of the 2021 World Gifted Conference (virtual). In-press, to be made available in August 2021. https://youtu.be/mzmeLnRlj1w 

 

Thompson, A. D. (2021b). Integrated AI: The rising tide lifting all boats (GPT-3). https://lifearchitect.com.au/ai/ 

To cite this page: Thompson, A. D. (2021). Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond). Retrieved from: LifeArchitect.com.au/AI

 


Contents of GPT-3 & the Pile v1

What is in Common Crawl? Common Crawl includes (C4, cleaned/filtered, sorted by most tokens):
Google Patents (papers), The New York Times (news), Los Angeles Times (news), The Guardian (news), PLoS – Public Library of Science (papers), Forbes (news), HuffPost (news), Patents.com – dead link (papers), Scribd (books), The Washington Post (news), The Motley Fool (opinion), InterPlanetary File System (mix), Frontiers Media (papers), Business Insider (news), Chicago Tribune (news), Booking.com (discussion), The Atlantic (news), Springer Link (papers), Al Jazeera (news), Kickstarter (discussion), FindLaw Caselaw (papers), National Center for Biotech Info (papers), NPR (news), and more…



Important external papers

GPT-3: Brown et al (2020). Language Models are Few-Shot Learners. OpenAI. (PDF)

The Pile v1: Gao et al (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. EleutherAI. (PDF)

Inside OpenAI and Neuralink offices: Hao, K. (2020). The messy, secretive reality behind OpenAI’s bid to save the world. MIT Technology Review. (PDF)

GPT-3 quality: Strickland, E. (2021). OpenAI’s GPT-3 Speaks! (Kindly Disregard Toxic Language). IEEE. (PDF)

PanGu Alpha: Zeng et al (2021). PanGu Alpha: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation. (PDF)

Common Crawl: Dodge et al (2021). Documenting the English Colossal Clean Crawled Corpus. (PDF)

The Turing Test: Turing, A. M. (1950). Computing Machinery and Intelligence. Mind 59: 433-460. (PDF)


The GPT-3 Leta video series