AI code snippets

satya - 1/12/2024, 4:37:38 PM

gensim api

satya - 1/12/2024, 4:38:09 PM

Similar word set


import gensim.downloader as api

# Load a pre-trained Word2Vec model (or any other KeyedVectors model)
word_vectors = api.load("word2vec-google-news-300")

# Given set of words
given_words = ["king", "queen", "man"]

# Calculate vectors for each word in the given set
word_vectors_set = [word_vectors[word] for word in given_words]

# Calculate the mean vector of the set
mean_vector = sum(word_vectors_set) / len(word_vectors_set)

# Find similar words to the mean vector
similar_words = word_vectors.similar_by_vector(mean_vector, topn=10)

# Print the similar words and their similarity scores
for word, score in similar_words:
    print(f"{word}: {score:.4f}")

satya - 1/12/2024, 7:23:44 PM

is there a word2vec model online that I can run queries against in a browser?

satya - 1/13/2024, 12:48:04 PM

word embeddings

word2vec

GloVe

Gensim

satya - 1/13/2024, 1:18:26 PM

NLTK

satya - 1/13/2024, 1:18:50 PM

Getting stem words using NLTK


from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import word_tokenize

text = "Long sentences"

# list of strings
words = word_tokenize(text)
print(words)


stemmer = EnglishStemmer()
# list of strings
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
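
For the sample text above, this prints ['Long', 'sentences'] and then ['long', 'sentenc']; the Snowball stemmer lowercases and strips suffixes, so stems are not always dictionary words.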

satya - 1/13/2024, 1:20:56 PM

Lemmatizing with NLTK


from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "some sentence"
words = word_tokenize(string_for_lemmatizing)
print(words)

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
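
Note that lemmatize() treats every word as a noun unless told otherwise; pass a part of speech for other forms, e.g. lemmatizer.lemmatize("running", pos="v") returns "run".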

satya - 1/13/2024, 1:22:11 PM

Key classes and functions of NLTK

  1. EnglishStemmer
  2. word_tokenize
  3. WordNetLemmatizer

satya - 1/13/2024, 1:29:14 PM

NLTK uses

  1. Text Processing: Tokenization, stemming
  2. Part-of-Speech Tagging: label each word with its grammatical role
  3. Parsing: for sentence structure
  4. Named Entity Recognition (NER): names of people, organizations, places, dates
  5. Sentiment Analysis
  6. Machine Learning for Text Classification
  7. Text Corpora and Lexical Resources: Brown corpus, WordNet, linguistic databases
  8. Text Summarization
  9. Concordance Analysis: word occurrences in context, frequency analysis (see the sketch after this list)
  10. Language Learning and Teaching
  11. Research and Experimentation in Linguistics and NLP
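
A hedged sketch of item 9 (and the corpus resources of item 7), using the built-in Gutenberg corpus; the corpus file and the word "whale" are just illustrative choices:

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg

# Concordance: show each occurrence of a word with its surrounding context
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.concordance("whale", lines=5)

# Frequency analysis: the most common words in the same text
fdist = nltk.FreqDist(w.lower() for w in gutenberg.words('melville-moby_dick.txt') if w.isalpha())
print(fdist.most_common(10))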

satya - 1/13/2024, 1:45:37 PM

Useful Python snippet for library files


def localTest():
    print("Starting local test")
    print("End local test")

# Runs only when this file is executed directly, not when it is imported
if __name__ == '__main__':
    localTest()

satya - 1/13/2024, 1:52:46 PM

For some of NLTK to work, these downloads are needed first


import nltk
nltk.download('punkt')    # tokenizer models used by word_tokenize
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer

satya - 1/13/2024, 4:09:59 PM

Additional NLTK initializations


from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

import nltk
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('wordnet')                     # lexical database for lemmatizing
nltk.download('averaged_perceptron_tagger')  # POS tagger model for pos_tag
nltk.download('maxent_ne_chunker')           # named entity chunker for ne_chunk
nltk.download('words')                       # word list used by the chunker

satya - 1/13/2024, 4:20:50 PM

Here is how you navigate a chunked Tree


import nltk

text = "John and Mary are living in New York City since 2020."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
chunked = nltk.ne_chunk(tagged)

for subtree in chunked:
    if isinstance(subtree, nltk.Tree):
        label = subtree.label()
        entity = " ".join([word for word, pos in subtree.leaves()])
        print(f"Named Entity: {entity}, Label: {label}")

satya - 1/13/2024, 4:21:24 PM

Here is an example of a chunk tree


(S
  (PERSON John/NNP)
  and/CC
  (PERSON Mary/NNP)
  are/VBP
  (GPE living/VBG)
  in/IN
  (GPE New/NNP York/NNP)
  City/NNP
  since/IN
  (DATE 2020/CD)
  ./.)

satya - 1/13/2024, 4:23:21 PM

More on the chunk tree

  1. The structure of the tree returned by ne_chunk() typically consists of nodes and leaves, where nodes represent named entity chunks and leaves represent individual words or tokens.
  2. Each node in the tree has a label indicating the type of named entity, and it can have children nodes and leaves that form a hierarchical structure.
  3. (S ...): Represents the top-level sentence.
  4. (PERSON John/NNP): Represents a named entity "John" classified as a person (PERSON).
  5. (PERSON Mary/NNP): Represents a named entity "Mary" classified as a person.
  6. (GPE New/NNP York/NNP): Represents a named entity "New York" classified as a geopolitical entity (GPE).
  7. (DATE 2020/CD): Represents a named entity "2020" classified as a date (DATE).
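
Putting items 1 and 2 together, here is a small sketch (the function name is mine) that walks both the chunk nodes and the plain word leaves:

import nltk

def walk_chunk_tree(chunked):
    for node in chunked:
        if isinstance(node, nltk.Tree):
            # A named entity chunk: a labeled node whose leaves are (word, tag) pairs
            words = " ".join(word for word, tag in node.leaves())
            print(f"chunk: {node.label()} -> {words}")
        else:
            # A plain leaf outside any chunk: a single (word, tag) tuple
            word, tag = node
            print(f"leaf: {word}/{tag}")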

satya - 1/13/2024, 4:29:06 PM

Sample code NLTK name recognition and chunking


# *********************
# Import and download some stuff!
# You have to do this only once per session, I believe
# *********************


import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Key functions
# *********************
from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

def example():
    ner_text = "Some sentence with people and company names"

    #word list: strings
    tokens = word_tokenize(ner_text)
    print(tokens)

    #A list of key/value tuples
    pos_tagged = pos_tag(tokens)
    print(pos_tagged)

    # Chunked tree object
    result = ne_chunk(pos_tagged)

    print(result)
    # result.draw()
    # this will open a new window with the tree rendering
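
Following the localTest pattern shown earlier, this library file can be run directly to exercise example():

if __name__ == '__main__':
    example()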

satya - 1/13/2024, 4:49:36 PM

Parts of speech


def posTaggingExercise():
    text = """
    We hold these truths to be self-evident, that all men are created equal, 
    that they are endowed by their Creator with certain unalienable Rights, 
    that among these are Life, Liberty and the pursuit of Happiness.
    """
    words = word_tokenize(text)
    taggedWords = pos_tag(words)
    print(taggedWords)
    return taggedWords

# Example

[('We', 'PRP'), ('hold', 'VBP'), ('these', 'DT'), ('truths', 'NNS'),..]
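
To look up what a tag such as PRP or VBP means, NLTK ships help for the Penn Treebank tagset (it needs the 'tagsets' download):

import nltk
nltk.download('tagsets')

# Explain a single tag
nltk.help.upenn_tagset('PRP')

# Or every tag matching a regular expression
nltk.help.upenn_tagset('NN.*')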

satya - 1/13/2024, 4:50:54 PM

NLTK parts of speech attributes

satya - 1/13/2024, 4:51:04 PM

NLTK home page: https://www.nltk.org/

satya - 1/13/2024, 4:54:15 PM

NLTK documentation page

satya - 1/13/2024, 7:48:23 PM

What we have done so far with word2vec

  1. How to distinguish between a script (executable) and a library file in Python, and use this mechanism to test functions in library files by executing them directly
  2. Use list comprehensions to process each record in a list and build a new list from the results
  3. Find the vector representation of a word (gensim word2vec)
  4. Find the most similar words in the corpus for a given word
  5. Find the most similar words for a set of words in a list
  6. Find the dissimilar word in a list
  7. Average words by averaging their vectors (items 4-7 are sketched below)
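
A minimal sketch of the gensim calls behind items 4 through 7, assuming the same word2vec-google-news-300 model as earlier (the word lists are made up):

import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")

# Item 4: most similar words for a given word
print(word_vectors.most_similar("king", topn=5))

# Items 5 and 7: most similar words for a set; gensim averages the vectors internally
print(word_vectors.most_similar(positive=["king", "queen", "man"], topn=5))

# Item 6: the word that does not belong with the others
print(word_vectors.doesnt_match(["breakfast", "lunch", "dinner", "king"]))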

satya - 1/13/2024, 7:49:38 PM

What we have done so far with NLTK

  1. Find the stem word for many variations of the same word
  2. Find the lemmatization of a word
  3. Tokenize a sentence into words
  4. Categorize or tag the words in a sentence with their grammatical "parts of speech"
  5. Identify nouns and their classification in a sentence (e.g., proper names, organizations, geopolitical entities, dates), using a concept called "chunking"

satya - 1/13/2024, 8:04:55 PM

Idea of a list comprehension in python


A language critique:

#
# Conceptual record processing
# in any language, with python as an example
# In python these are called list comprehensions.
#

# Take this procedural idea for an example

for every-record in a list
   do-something with that record
   store that record in a list

#
#This is expressed in python as
#
[do-something for every-record] #put each processed record in a list

#
# Now your target container can be set or a dictionary as well
# in addition to a list
#

{do-something for every-record} #put each record in a set
{do-something-for-a-key: do-something-for-a-value for every-record} #Put it in a dictionary
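
For concreteness, a runnable version of the three forms above, with a made-up list of numbers:

records = [1, 2, 3, 4]

# List comprehension: process each record into a new list
squares_list = [r * r for r in records]      # [1, 4, 9, 16]

# Set comprehension: same processing, collected into a set
squares_set = {r * r for r in records}       # {1, 4, 9, 16}

# Dict comprehension: one expression for the key, one for the value
squares_dict = {r: r * r for r in records}   # {1: 1, 2: 4, 3: 9, 4: 16}

print(squares_list, squares_set, squares_dict)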

satya - 1/19/2024, 5:46:16 PM

Hugging Face home page: https://huggingface.co/

satya - 1/19/2024, 6:23:18 PM

Where to get the API keys

  1. As a link: /settings/account, then access tokens
  2. Or in the UI: profile icon, then settings
  3. Or directly: /settings/tokens

satya - 1/19/2024, 6:23:31 PM

You have to verify your email first for this to work

satya - 1/21/2024, 7:28:32 PM

Where is hugging face text inference api request and response documented?

satya - 1/21/2024, 7:29:57 PM

Seemed to be documented here

satya - 1/21/2024, 7:34:26 PM

Here is a list of inputs and outputs to the api

satya - 1/21/2024, 7:42:04 PM

Each type of task has different inputs and outputs to the API

satya - 1/21/2024, 7:43:14 PM

Some task names

  1. Question Answering
  2. Summarization
  3. Text Generation
  4. Text Classification
  5. Named Entity Recognition
  6. Translation
  7. etc.

satya - 1/21/2024, 7:43:51 PM

Inputs and outputs to the text generation task are documented here

satya - 1/21/2024, 7:44:57 PM

Example input


import requests

API_TOKEN = "YOUR_API_TOKEN"  # paste your Hugging Face access token here
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query({"inputs": "The answer to the universe is"})

# There are other parameters besides inputs
# See the API docs
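
The task docs also list optional parameters; a hedged example payload (the parameter and option names below are as documented for the text generation task, the values are made up):

data = query({
    "inputs": "The answer to the universe is",
    "parameters": {
        "max_new_tokens": 50,       # cap on the number of generated tokens
        "temperature": 0.7,         # sampling temperature
        "return_full_text": False,  # return only the completion, not the prompt
    },
    "options": {"wait_for_model": True},  # wait while the model loads instead of erroring
})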

satya - 1/21/2024, 7:45:49 PM

Return value is either a dict or a list of dicts if you sent a list of inputs

satya - 1/21/2024, 7:46:25 PM

Example


data == [
  {"generated_text": "hello"}
]
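
Given that, a small sketch that normalizes both shapes before reading generated_text:

result = query({"inputs": "The answer to the universe is"})

# The API returns a dict for a single input, a list of dicts for a batch
items = result if isinstance(result, list) else [result]
for item in items:
    print(item.get("generated_text"))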

satya - 2/3/2024, 4:11:30 PM

Hugging Face API keys