AI code snippets

gensim api

Search for: gensim api


import gensim.downloader as api

# Load a pre-trained Word2Vec model (or any other KeyedVectors model)
word_vectors = api.load("word2vec-google-news-300")

# Given set of words
given_words = ["king", "queen", "man"]

# Calculate vectors for each word in the given set
word_vectors_set = [word_vectors[word] for word in given_words]

# Calculate the mean vector of the set
mean_vector = sum(word_vectors_set) / len(word_vectors_set)

# Find similar words to the mean vector
similar_words = word_vectors.similar_by_vector(mean_vector, topn=10)

# Print the similar words and their similarity scores
for word, score in similar_words:
    print(f"{word}: {score:.4f}")

is there a word2vec model online that I can run queries against in a browser?

Search for: is there a word2vec model online that I can run queries against in a browser?

NLTK


from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import word_tokenize

text = "Long sentences"

# list of strings
words = word_tokenize(text)
print(words)


stemmer = EnglishStemmer()
#list of strings
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "some sentence"
words = word_tokenize(string_for_lemmatizing)
print(words)

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
  1. EnglishStemmer
  2. word_tokenize
  3. WordNetLemmatizer
  1. Text Processing: tokenization, stemming
  2. Part-of-Speech Tagging: labeling each word with its grammatical category
  3. Parsing: sentence structure
  4. Named Entity Recognition (NER): names of people, organizations, places
  5. Sentiment Analysis
  6. Machine Learning for Text Classification
  7. Text Corpora and Lexical Resources: Brown corpus, WordNet, other linguistic databases
  8. Text Summarization
  9. Concordance Analysis: frequency analysis of words and their contexts (see the sketch after this list)
  10. Language Learning and Teaching
  11. Research and Experimentation in Linguistics and NLP
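
A minimal sketch of item 9 (concordance and frequency analysis). The sample sentence is made up, and this assumes the 'punkt' tokenizer data has been downloaded as shown elsewhere in these notes.

from nltk import FreqDist, Text
from nltk.tokenize import word_tokenize

sample = "The quick brown fox jumps over the lazy dog. The dog barks."
tokens = word_tokenize(sample.lower())

# Frequency analysis: count how often each token appears
freq = FreqDist(tokens)
print(freq.most_common(5))

# Concordance: show each occurrence of a word with its surrounding context
Text(tokens).concordance("dog")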

def localTest():
    print("Starting local test")
    print("End local test")

# __name__ is '__main__' only when this file is run directly as a script,
# not when it is imported as a library, so localTest() runs only in the script case.
if __name__ == '__main__':
    localTest()

import nltk

# Resources needed for tokenization, lemmatization, tagging, and NER
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

text = "John and Mary are living in New York City since 2020."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
chunked = nltk.ne_chunk(tagged)

for subtree in chunked:
    if isinstance(subtree, nltk.Tree):
        label = subtree.label()
        entity = " ".join([word for word, pos in subtree.leaves()])
        print(f"Named Entity: {entity}, Label: {label}")

(S
  (PERSON John/NNP)
  and/CC
  (PERSON Mary/NNP)
  are/VBP
  (GPE living/VBG)
  in/IN
  (GPE New/NNP York/NNP)
  City/NNP
  since/IN
  (DATE 2020/CD)
  ./.)
  1. The structure of the ChunkTree returned by ne_chunk() typically consists of nodes and leaves, where nodes represent named entity chunks, and leaves represent individual words or tokens.
  2. Each node in the tree has a label indicating the type of named entity, and it can have children nodes and leaves that form a hierarchical structure.
  3. (S ...): Represents the top-level sentence.
  4. (PERSON John/NNP): Represents a named entity "John" classified as a person (PERSON).
  5. (PERSON Mary/NNP): Represents a named entity "Mary" classified as a person.
  6. (GPE New/NNP York/NNP): Represents a named entity "New York" classified as a geopolitical entity (GPE).
  7. (DATE 2020/CD): Represents a named entity "2020" classified as a date (DATE).

# *********************
# Import and download some stuff!
# You have to do this only once per session, I believe
# *********************


import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Key functions
# *********************
from nltk import ne_chunk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

def example():    
    ner_text = "Some sentence with people and company names"

    #word list: strings
    tokens = word_tokenize(ner_text)
    print(tokens)

    #A list of key/value tuples
    pos_tagged = pos_tag(tokens)
    print(pos_tagged)

    # Chunked tree object
    result = ne_chunk(pos_tagged)

    print(result)
    #result.draw() 
    #this will open a new window with the tree rendering

def posTaggingExercise():
    text = """
    We hold these truths to be self-evident, that all men are created equal, 
    that they are endowed by their Creator with certain unalienable Rights, 
    that among these are Life, Liberty and the pursuit of Happiness.
    """
    words = word_tokenize(text)
    taggedWords = pos_tag(words)
    print(taggedWords)
    return taggedWords

# Example

[('We', 'PRP'), ('hold', 'VBP'), ('these', 'DT'), ('truths', 'NNS'),..]
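
The tags in this output ('PRP', 'VBP', 'DT', 'NNS', ...) come from the Penn Treebank tagset. One way to look up what a tag means, assuming the 'tagsets' help resource is downloaded (an extra download not listed above):

import nltk
nltk.download('tagsets')

# Print the description and examples for one Penn Treebank tag
nltk.help.upenn_tagset('VBP')

# Or dump the whole tagset
nltk.help.upenn_tagset()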

NLTK parts of speech attributes

Search for: NLTK parts of speech attributes

NLTK home page

NLTK documentation page

  1. How to distinguish between a script (executable) and a library file in Python, and use this mechanism to test functions in a library file by executing it directly
  2. Use list comprehensions to process each record in a list and make a new list from the result
  3. Find vector representations for a word (gensim word2vec)
  4. Find most similar words in the corpus for a given word
  5. Find most similar words for a set of words in a list
  6. Find the dissimilar words in a list (see the gensim sketch after these lists)
  7. Average words by averaging their vectors
  1. Find the stem words for very many variations of the same word
  2. Find the lemmatization of a word
  3. Tokenize a sentence into words
  4. Categorize or tag words in a sentence as to their grammatical "parts of speech"
  5. Identify nouns and their classification in a sentence (Ex: Proper names, organizations, Geopolitical entities, dates etc.). Uses a concept called "chunking"
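
For items 4 and 6 in the gensim list above (most similar words for one word, and the odd one out in a list), a minimal sketch that assumes the word_vectors model from the first snippet in these notes is already loaded:

# Most similar words in the vocabulary for a single word (item 4)
print(word_vectors.most_similar("king", topn=5))

# The word that does not fit with the rest of the list (item 6)
print(word_vectors.doesnt_match(["breakfast", "cereal", "dinner", "king"]))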

A language critique:

#
# Conceptual record processing
# in any language, with python as an example
# In python these are called list comprehensions.
#

# Take this procedural idea for an example

for every-record in a list
   do-something with that record
   store that record in a list

#
#This is expressed in python as
#
[do-something for every-record] #put each processed record in a list

#
# Now your target container can be set or a dictionary as well
# in addition to a list
#

{do-something for every-record} #put each processed record in a set
{do-something-for-key: do-something-for-value for every-record} #put key/value pairs in a dictionary
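
As a concrete version of the sketches above, here is the same record-processing idea with a real list, set, and dictionary comprehension (the data is just an example):

records = ["Alpha", "Beta", "Gamma"]

# List comprehension: process each record, keep the results in a list
lengths = [len(r) for r in records]                      # [5, 4, 5]

# Set comprehension: same processing, results stored in a set (duplicates collapse)
unique_lengths = {len(r) for r in records}               # {4, 5}

# Dict comprehension: compute a key and a value for each record
length_by_name = {r.lower(): len(r) for r in records}    # {'alpha': 5, 'beta': 4, 'gamma': 5}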

Hugging face home Page: https://huggingface.co/

  1. as a link: /settings/account, then access tokens
  2. or in the UI: profile icon, profile, settings
  3. or directly via the link: /settings/tokens

You have to verify your email first for this to work

Where is hugging face text inference api request and response documented?

Search for: Where is hugging face text inference api request and response documented?

Seemed to be documented here

Here is a list of inputs and outputs to the API

Each type of task has different inputs and outputs to the API

  1. Question Answering
  2. Summarization
  3. Text Generation
  4. Text Classification
  5. Named entity recognition
  6. Translation
  7. ...
  8. etc.

Inputs and outputs to the text generation task are documented here


import requests

# Your Hugging Face access token (placeholder value; get yours from /settings/tokens)
API_TOKEN = "hf_xxxxxxxxxxxxxxxx"

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

data = query({"inputs": "The answer to the universe is"})

# There are other parameters besides "inputs"
# See the API docs

The return value is a dict, or a list of dicts if you sent a list of inputs


data == [ 
  {"generated_text": 'hello'}
]
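
To illustrate the point above that each task has different inputs and outputs, here is a sketch of the same query pattern pointed at a summarization model. The model name and the "summary_text" response field are my assumptions; check the task documentation before relying on them.

# Same pattern, different task: summarization
# (model name and response field are assumptions; see the task docs)
SUMMARIZATION_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"

def summarize(text):
    response = requests.post(SUMMARIZATION_URL, headers=headers, json={"inputs": text})
    return response.json()   # typically a list like [{"summary_text": "..."}]

print(summarize("A long article goes here ..."))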

Hugging face API keys