Phase Zero of Human Name Detection for Border Observer


Objective
Find a person's full name within a press release or news story.

Reason
Currently, we see about 200 news stories per weekday on average. Clustering identical or similar news stories is helpful for readers, but too often very similar stories carry headlines that do not match. With even slightly dissimilar headlines, the task of clustering becomes tedious, but there is a solution.

It is well known (to internet search engines) that some internal "markers" (like a person's name) help with the clustering of similar webpages, press releases, or news stories. That is the focus of this work, which is ALMOST done.

BOTTOM LINE
As we have stated more than once, MOST news stories start with a press release. So, if we detect a person's name in the press release, we catch that name early and can then detect it in news stories as they go online.

So, we start here — clustering news stories based on the "full name" of the person in the news story.

In addition, we plan to publish this list of names with links to the related news stories. On average, we see at least 100 press releases per day from the Federal Government; a fraction of them relate to the border and immigration.
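The clustering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cluster_by_name`, the example links, and the input shape (a mapping from a story link to the full names already detected in that story) are all hypothetical and not part of the actual Border Observer tooling.

```python
from collections import defaultdict

def cluster_by_name(stories):
    """Group story links under each full name detected in them.

    `stories` maps a link to the list of full names found in that story
    (an assumed input shape, for illustration only).
    """
    clusters = defaultdict(list)
    for link, names in stories.items():
        # set() avoids listing a link twice when a name repeats in a story.
        for name in set(names):
            clusters[name].append(link)
    return dict(clusters)

# Hypothetical example data: one press release and two news stories.
stories = {
    "https://example.com/press-release": ["John Smith"],
    "https://example.com/story-a": ["John Smith", "Maria Garcia"],
    "https://example.com/story-b": ["Maria Garcia"],
}
clusters = cluster_by_name(stories)
for name in sorted(clusters):
    print(name, clusters[name])
```

The resulting dictionary doubles as the publishable list: each name is a key, and its value is the list of links to related stories.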

Method
Using an unattended computer with a list of forenames (first names) and a list of surnames (last names), detect the full names (an adjacent forename and surname) of people mentioned in a news story or press release. This assumes, of course, that at least one person's full name is written therein.

For this discussion, we are going to ignore official or honorific titles, middle names, nicknames, aliases, and hyphenated names.
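A minimal sketch of the adjacent-pair check described above, written in Python. The function name `find_full_names`, the tiny word lists, and the sample sentence are assumptions made for illustration. The capitalization test is also an added assumption, meant to reduce matches on ordinary English words; it is not something the method above specifies.

```python
import re

def find_full_names(text, forenames, surnames):
    """Return (forename, surname) pairs for adjacent capitalized words."""
    # Tokens are runs of letters only, so titles with periods, hyphens,
    # and apostrophes are not handled, in line with the stated scope.
    tokens = re.findall(r"[A-Za-z]+", text)
    found = []
    for first, last in zip(tokens, tokens[1:]):
        # Assumption: both words of a name start with a capital letter.
        if not (first[0].isupper() and last[0].isupper()):
            continue
        if first.lower() in forenames and last.lower() in surnames:
            found.append((first, last))
    return found

# Tiny stand-in lists; the real lists would be loaded from files.
forenames = {"maria", "john"}
surnames = {"garcia", "smith"}
text = "Agent John Smith met with Maria Garcia near the border."
names = find_full_names(text, forenames, surnames)
print(names)
```

Storing both lists as sets keeps each lookup fast even with more than 160,000 surnames. Note that this sketch pairs words across sentence boundaries, so a forename ending one sentence and a surname starting the next would be reported as a name.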

Technical Details
Word lists for testing and development
162,253    surnames (aka last names) from U.S. Census Bureau
 11,925    forenames (first names) from U.S. Social Security Admin.

370,105    words in the English Language (from github)
  9,894    10,000 most common English words (from github)


Phase Zero Test Results
The following six (6) words are also surnames: Center, Jeff, Johnson, Too, To, Li

There are 5,815 forenames that are also on the list of surnames; that is, about 48.8% of forenames are also surnames.

  • 821 forenames are on the list of ~10,000 most common English words~, or about 8.3% of the 9,894 common words (about 6.9% of the 11,925 forenames).
  • 3,096 surnames are on the list of ~10,000 most common English words~, or about 31.3% of the 9,894 common words (about 1.9% of the 162,253 surnames).
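The overlap counts above can be reproduced with set intersections. Because the percentage depends on which list serves as the denominator, the sketch below reports both. The function name `overlap_report` and the tiny stand-in lists are hypothetical; the real lists are far larger.

```python
def overlap_report(a, b):
    """Count shared entries and express them as a share of each list."""
    shared = a & b
    pct_of_a = 100 * len(shared) / len(a)
    pct_of_b = 100 * len(shared) / len(b)
    return len(shared), pct_of_a, pct_of_b

# Tiny stand-in lists, lowercased so the comparison ignores case.
forenames = {"jeff", "li", "maria", "grace"}
common_words = {"center", "grace", "to", "too", "li"}

count, pct_of_forenames, pct_of_common = overlap_report(forenames, common_words)
print(count, pct_of_forenames, pct_of_common)
```

With these stand-in lists, the same two shared entries amount to half of the forenames but a smaller share of the common words, which is why each percentage should name its denominator.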


The list of 370,105 English words should not be considered complete or all-encompassing.

Twitter's Grok informs us (about ~how many words in the English language~):
Estimates vary, but the English language has around 170,000 to 220,000 words in current use, per major dictionaries like Oxford and Merriam-Webster. Including obsolete, technical, and variant forms, the total can exceed 1 million.

Given that disclaimer:

  • 3,764 forenames were found on the ~370,105 English Language~ list (or about 1.02% of the words were forenames)
  • 17,540 surnames were found on the same list (or about 4.74% of the words were surnames)

FINAL NOTE
While the work will be boring, mundane, and tedious, once it gets rolling, the clustering will be extremely useful. However, we expect many Hispanic surnames to be missed, which will require manual detection of the proper name.

In the future, the same technique can be used with other key phrases and keywords.

--

The Anatomy of a Large-Scale Hypertextual Web Search Engine

by Sergey Brin and Lawrence Page

http://infolab.stanford.edu/pub/papers/google.pdf
