July 3, 2024

No word begins with a Q and ends with a J, plus other cases and analysis

"No word begins with a(n) _ and ends with a(n) _" is a claim I occasionally see in memes and short-form videos. Typically, the claim the creator presents is false, and a counterexample can be found in the top comments. In most cases, the creator makes the false claim on purpose to increase engagement, because that's how the internet works.

In this article, I will analyze an English dictionary and corpus data to see which combinations of letters never, or very rarely, start and end words. The data from this analysis is available later on this page.

Click here to jump to the results.

Couple of notes

I'm inviting everyone to adapt this into a short-form video, as long as you credit me and follow the guidelines of fair use. It'd be funny.

The data below is derived from all of English usage. Depending on the section, it may contain swear words, racial slurs, and other socially unacceptable words. None of the data is censored, so don't be surprised if you see something weird.

Sources of data

All the analysis below uses data from two sources:

A list of 370104 English words containing only letters, from dwyl/english-words on GitHub. The data seems to originate from Infochimps, and some from the Unix word list, which is a mirror of The Moby Project. The GitHub repository attaches the Unlicense to it, so the information is in the public domain.
The Google Web Trillion Word Corpus, a corpus compiling about 1.02 trillion words from the surface web, published by UPenn's LDC. The specific data I'm using comes from Peter Norvig, who compiled the 1/3 million most frequent words along with their frequencies into a file with 588124220187 total occurrences. Link here. I'm not very sure of the copyright status of this dataset (it may be ineligible for copyright). Norvig put his code under the MIT license, which allows me to use it, but this work is not software, so... if you have any concerns, contact me.

I will refer to the dwyl word list as the dictionary and the trillion word dataset as the corpus.

Age of data

Keep in mind that all the data I'm using is pretty old. The Trillion Word Corpus is 18 years old and the dictionary is 10 years old, so some contemporary words aren't in there. For example, "rizz", "goated", and "incel" appear in neither the dictionary nor the corpus.

What counts as a word?

Whatever is in the dataset.

I'm neither a linguist nor an aristocrat (more of an engineering major living off a Pell grant), so I'm going for a descriptivist approach here. There are some nuances that I will address later with what is basically an arbitrary decision, but generally I'll go with the dataset unless an entry is true gibberish (another arbitrary line).

Since this is a natural language corpus, it contains misspelled words; this shouldn't affect the data too much.

Why not use an actual dictionary?

They're not free. There is a version of Webster's on Project Gutenberg, but it'd've taken some time to write a parsing script.

I originally wanted to also use the NASPA Scrabble word list, but it is copyrighted.

Script for analysis

I'm not gonna go into detail here because it's incredibly boring, even for me. The analysis is done in a Node.js script that outputs a big glob of JSON, which is then parsed by your browser on this page.

All the code is here. Links to the data are at the top.

If you choose to run it yourself, you'll need to download the data too. Run it with node and it should output a JSON file. You may also uncomment some of the console.table calls for a console readout.
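For a rough idea of what the core counting step could look like, here is a minimal sketch. It is not the actual analysis script: the function name countPairs, the two-letter Map key, and the sample words are all assumptions made up for illustration.

```javascript
// Hypothetical helper (not from the real script): tally words by their
// first and last letter. The Map key is the two letters joined together,
// e.g. "ca" for words that start with c and end with a.
function countPairs(words) {
  const counts = new Map();
  for (const raw of words) {
    const word = raw.trim().toLowerCase();
    // Skip blank lines and anything that is not purely a-z.
    if (!/^[a-z]+$/.test(word)) continue;
    const key = word[0] + word[word.length - 1];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Tiny made-up sample standing in for the lines of the dictionary file.
const sample = ["cobra", "camera", "quiz", "a", "Zebra"];
const pairCounts = countPairs(sample);
console.log(pairCounts.get("ca")); // 2 (cobra, camera)
```

A one-letter word such as "a" counts as both starting and ending with that letter, so it lands under "aa". Whether the real script treats it that way is a guess on my part.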

Also, here is JSON of the corpus data and JSON of the dictionary, in case anyone wants to avoid the parsing headache. These are all GitHub Gist links, and the files are a couple of megabytes each, so be careful.

In summary

Raw corpus data (Norvig). Find count_1w.txt
Raw corpus data (mirror)
Raw dictionary data (dwyl). Find words_alpha.txt
Analysis code
JSON corpus data
JSON dictionary data

Anyways, here are the analysis results

Data guide

Each row represents a first letter, columns represent last letter.

For example, in the first table below, row c, column a, with the value 1731, represents words starting with c and ending with a.

On ranking tables, the same condition is represented with "ca"

Click on table entries to see examples and/or more data
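If it helps to think of the tables as code, the row/column layout and the two-letter keys could be modeled roughly like this. This is only a sketch: LETTERS, emptyGrid, and keyToCell are names I made up here, and the page's real JSON may be shaped differently.

```javascript
const LETTERS = "abcdefghijklmnopqrstuvwxyz";

// Build a 26x26 grid of zeros, indexed as grid[row][col], where
// row = first letter and col = last letter.
function emptyGrid() {
  return Array.from({ length: 26 }, () => new Array(26).fill(0));
}

// Convert a ranking-table key such as "ca" into [row, col] indices.
function keyToCell(key) {
  return [LETTERS.indexOf(key[0]), LETTERS.indexOf(key[1])];
}

const grid = emptyGrid();
const [row, col] = keyToCell("ca");
grid[row][col] = 1731; // the value quoted in the example above
console.log(row, col); // 2 0
```

So "ca" in a ranking table and row c, column a in a grid table point at the same cell.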

This page loads about 260 KB of JSON, which may take a while on a slow connection. This page does not work on Internet Explorer or older browsers.

Result: dictionary

Based on dwyl/english-words on commit 94dbea5 on June 16, 2024. Number of entries: 370104.

Number of entries for each beginning and ending letter

Letter combinations with the most and least entries

Top and bottom values of above table. Click on the letters, not numbers, for the additional info dialog.
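A ranking like this can be produced by listing all 676 letter pairs, including the ones with zero entries, and sorting by count. The sketch below assumes the counts live in a Map keyed by two-letter strings. That structure and the name rankPairs are my own illustration, not necessarily what the analysis code does, and apart from 1731 the sample numbers are made up.

```javascript
// Hypothetical helper: list every letter pair with its count, highest first.
// Pairs missing from the Map are included with a count of 0, so that
// "never happens" combinations show up at the bottom of the ranking.
function rankPairs(counts) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  const rows = [];
  for (const first of letters) {
    for (const last of letters) {
      const key = first + last;
      rows.push({ key, count: counts.get(key) ?? 0 });
    }
  }
  // Array.prototype.sort is stable, so ties stay in alphabetical order.
  rows.sort((a, b) => b.count - a.count);
  return rows;
}

// "ca" uses the value quoted in the data guide; the other two are made up.
const demoCounts = new Map([["ca", 1731], ["zz", 12], ["qj", 0]]);
const ranked = rankPairs(demoCounts);
console.log(ranked.slice(0, 3));  // top of the ranking
console.log(ranked.slice(-3));    // bottom of the ranking
```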

Result: corpus

Based on the aforementioned corpus. 333333 words appearing a total of 588124220187 times; truncated from a total of 1.025 trillion words.
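The corpus differs from the dictionary in that every word carries a frequency, so each letter pair gets two numbers: how many times it occurs in total, and how many distinct words it covers. Here is a rough sketch of that tally, assuming each line of the corpus file has the form word, tab, count. The function name tallyCorpus and the sample counts are made up for illustration.

```javascript
// Assumption: each corpus line looks like "word<TAB>count".
// Returns, per first/last letter pair, the summed occurrences and the
// number of distinct words, plus the grand total of occurrences.
function tallyCorpus(lines) {
  const occurrences = new Map(); // pair -> total occurrences
  const distinct = new Map();    // pair -> number of distinct words
  let total = 0;
  for (const line of lines) {
    const [word, countText] = line.trim().split("\t");
    const count = Number(countText);
    // Skip blank or malformed lines.
    if (!word || !Number.isFinite(count)) continue;
    const key = word[0] + word[word.length - 1];
    occurrences.set(key, (occurrences.get(key) ?? 0) + count);
    distinct.set(key, (distinct.get(key) ?? 0) + 1);
    total += count;
  }
  return { occurrences, distinct, total };
}

// Made-up counts, only to show the shape of the input.
const demoLines = ["the\t100", "take\t40", "qj\t1"];
const tally = tallyCorpus(demoLines);
console.log(tally.occurrences.get("te"), tally.distinct.get("te"), tally.total);
```

The real total of 588124220187 is well below Number.MAX_SAFE_INTEGER, so plain JavaScript numbers are enough here and BigInt is not needed.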

Number of occurrences in the corpus for each beginning and ending letter

Common log of above table

Rounded to two decimal places.
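As a small illustration, the common log and rounding can be done as below. The name commonLog is hypothetical, and returning null for a count of zero (where the log is undefined) is my assumption about how to handle empty cells, not necessarily what the table does.

```javascript
// Common (base-10) log of a count, rounded to two decimal places.
// Assumption: a count of zero has no logarithm, so return null for it.
function commonLog(count) {
  if (count <= 0) return null;
  return Math.round(Math.log10(count) * 100) / 100;
}

console.log(commonLog(1000));         // 3
console.log(commonLog(588124220187)); // log of the corpus total
console.log(commonLog(0));            // null
```

The log scale makes the table easier to read because the raw counts span many orders of magnitude, from single digits up to billions.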

Letter combinations with the most and least corpus occurrences

Top and bottom values of above table. Click on the letters, not numbers, for the additional info dialog.

Number of words in the corpus for each beginning and ending letter

Again, only the top 333333 unique words

Letter combinations with the most and least number of words in the corpus

Top and bottom values of above table. Click on the letters, not numbers, for the additional info dialog.

Conclusion

There are several rare combinations of beginning and ending letters. Most of the less common ones are occupied by phonetic transliterations from other languages, typically non-Germanic ones. In the end, it seems like Q...J is the least used combination, with no entries in the dictionary and only one in the corpus: "qj", which seems to come from an abbreviation of Qianjiang Motorcycle, a Chinese motorcycle company. If the only word is an abbreviation of a proper name in a foreign language, I feel like I can crown Q...J as the least used starting and ending letter pair, earning the title... of this page, I guess.

This was a pretty cool small project. The results probably have no value outside of awe. You're welcome to check back every time you see a meme or short-form video with these claims. I'd encourage you to download the code and explore the data if this is something that interests you. I found some pretty interesting concepts while researching these unusual words.

Special thanks to all the providers of the data.

Write an email to me if you have any questions, comments, or concerns. Look for a link to the address on my site homepage.