In a previous article, we talked about using Python to scrape stock-related articles from the web. As an extension of this idea, we’re going to show you how to use the NLTK package to figure out how often different words occur in text, using scraped stock articles.
Initial Setup
Let’s import the NLTK package, along with requests and BeautifulSoup, which we’ll need to scrape the stock articles. We’ll also import the standard library’s string module, which we’ll use later to strip out punctuation.
'''load packages'''
import nltk
import requests
import string
from bs4 import BeautifulSoup
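If this is your first time using NLTK on your machine, you may also need to download the tokenizer model and stop word corpus that the code below relies on. This is a one-time step, and the exact resource names can vary slightly between NLTK versions:

'''one-time downloads of the NLTK data used below'''
nltk.download('punkt')
nltk.download('stopwords')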
Pulling the data we’ll need
Below, we’re copying code from my scraping stocks article. This gives us a function, scrape_all_articles (along with two other helper functions), which we can use to pull the actual raw text from articles linked to from NASDAQ’s website.
def scrape_news_text(news_url):

    news_html = requests.get(news_url).content

    '''convert html to BeautifulSoup object'''
    news_soup = BeautifulSoup(news_html , 'lxml')

    paragraphs = [par.text for par in news_soup.find_all('p')]
    news_text = '\n'.join(paragraphs)

    return news_text


def get_news_urls(links_site):
    '''scrape the html of the site'''
    resp = requests.get(links_site)

    if not resp.ok:
        return None

    html = resp.content

    '''convert html to BeautifulSoup object'''
    soup = BeautifulSoup(html , 'lxml')

    '''get list of all links on webpage'''
    links = soup.find_all('a')

    urls = [link.get('href') for link in links]
    urls = [url for url in urls if url is not None]

    '''Filter the list of urls to just the news articles'''
    news_urls = [url for url in urls if '/article/' in url]

    return news_urls


def scrape_all_articles(ticker , upper_page_limit = 5):

    landing_site = 'http://www.nasdaq.com/symbol/' + ticker + '/news-headlines'

    all_news_urls = get_news_urls(landing_site)

    current_urls_list = all_news_urls.copy()

    index = 2

    '''Loop through each sequential page, scraping the links from each'''
    while (current_urls_list is not None) and (current_urls_list != []) and \
          (index <= upper_page_limit):

        '''Construct URL for page in loop based off index'''
        current_site = landing_site + '?page=' + str(index)
        current_urls_list = get_news_urls(current_site)

        '''Append current webpage's list of urls to all_news_urls'''
        all_news_urls = all_news_urls + current_urls_list

        index = index + 1

    all_news_urls = list(set(all_news_urls))

    '''Now that we have a list of urls, we need to actually scrape the text'''
    all_articles = [scrape_news_text(news_url) for news_url in all_news_urls]

    return all_articles
Let’s run our function to pull a few articles on Netflix (ticker symbol ‘NFLX’).
articles = scrape_all_articles('NFLX' , 10)
Above, we use our function to search through the first ten pages of NASDAQ’s article listings for Netflix. This gave us a total of 102 articles at the time of this writing. The variable articles contains a list of the raw text of each article. We can view a sample by printing the first element:
print(articles[0])
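We can also check how many articles were pulled; the exact count will depend on when you run the scrape:

'''number of articles scraped'''
print(len(articles))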
Now, let’s set article equal to one of the articles we have.
article = articles[0]
To get word frequencies of this article, we are going to perform an operation called tokenization. Tokenization effectively breaks a string of text into individual words, which we’ll need to calculate word frequencies. To tokenize article, we use the nltk.tokenize.word_tokenize method.
tokens = nltk.tokenize.word_tokenize(article)
Now, if you print out tokens, you’ll see that it includes a lot of words like ‘the’, ‘a’, and ‘an’. These are known as ‘stop words.’ We can filter these out of tokens using stopwords from nltk.corpus. Let’s also make all the words uppercase, which will help us avoid case-sensitivity issues when we compute word frequency distributions.
from nltk.corpus import stopwords

'''Get list of English stop words'''
take_out = stopwords.words('english')

'''Make all words in tokens uppercase'''
tokens = [word.upper() for word in tokens]

'''Make all stop words uppercase'''
take_out = [word.upper() for word in take_out]

'''Filter out stop words from tokens list'''
tokens = [word for word in tokens if word not in take_out]
*NLTK also includes stop word lists for other languages.
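For example, you can request another language’s list by name; which languages are available depends on the stop words corpus that ships with your NLTK data:

'''stop word lists exist for other languages, too'''
spanish_stop_words = stopwords.words('spanish')
print(spanish_stop_words[:10])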
In addition to filtering out stop words, we also probably want to get rid of punctuation (e.g. commas, periods, etc.). This can be done by filtering out any elements of tokens that appear in string.punctuation, which is a string of common punctuation characters.
tokens = [word for word in tokens if word not in string.punctuation]
tokens = [word for word in tokens if word[0] not in string.punctuation]
Now, we’re ready to get the word frequency distribution of the article in question. This is done using nltk.FreqDist, like below. nltk.FreqDist returns a dictionary-like object, where each key is a uniquely occurring word in the text and the corresponding value is how many times that word appears. Setting this object equal to word_frequencies, we sort its items (word_frequencies.items()) into a list of tuples ordered by the frequency of each word, in descending order.
'''Returns a dictionary of words mapped to how often they occur'''
word_frequencies = nltk.FreqDist(tokens)

'''Sort the above result by the frequency of each word'''
sorted_counts = sorted(word_frequencies.items() , key = lambda x: x[1] , reverse = True)
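Because nltk.FreqDist builds on Python’s collections.Counter, you can also peek at the top entries with its most_common method, which gives the same ordering as the sort above:

'''the ten most common words and their counts'''
print(word_frequencies.most_common(10))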
Getting a function to calculate word frequency…
Let’s wrap what we did above into a function that takes a single article and returns its sorted word frequencies.
def get_word_frequency(article):

    tokens = nltk.tokenize.word_tokenize(article)

    '''Get list of English stop words'''
    take_out = stopwords.words('english')
    take_out = [word.upper() for word in take_out]

    '''Convert each item in tokens to uppercase'''
    tokens = [word.upper() for word in tokens]

    '''Filter out stop words and punctuation'''
    tokens = [word for word in tokens if word not in take_out]
    tokens = [word for word in tokens if word not in string.punctuation]
    tokens = [word for word in tokens if word[0] not in string.punctuation]

    '''Get word frequency distribution'''
    word_frequencies = nltk.FreqDist(tokens)

    '''Sort word frequency distribution by number of times each word occurs'''
    sorted_counts = sorted(word_frequencies.items() , key = lambda x: x[1] , reverse = True)

    return sorted_counts
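As a quick sanity check, we can run the function on the single article from earlier and look at its ten most frequent words:

'''top ten (word, count) pairs for one article'''
print(get_word_frequency(article)[:10])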
Now, we could run our function across every article in our list, like this:
articles = [article for article in articles if article != '']

results = [get_word_frequency(article) for article in articles]
The results variable contains word frequencies for each individual article. Using this information, we can get the most frequently occurring word in each article.
'''Get the top (word, count) pair from each article's sorted list'''
most_frequent = [pair[0] for pair in results]

'''Keep just the word from each (word, count) pair'''
most_frequent = [x[0] for x in most_frequent]
Next, we can figure out the most common top-occurring words across the articles.
most_frequent = nltk.FreqDist(most_frequent)

most_frequent = sorted(most_frequent.items() , key = lambda x: x[1] , reverse = True)
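Printing the first several entries shows which top words appear most often across the articles:

'''most common top-occurring words across articles'''
print(most_frequent[:10])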
Filtering out articles using word frequency
If you print out most_frequent, you can see that the words ‘NETFLIX’, ‘PERCENT’, and ‘STOCK’ are at the top of the list. Word frequencies can serve as a quick check on whether an article actually has much to do with the stock it’s listed under. For instance, some of the Netflix articles may be linked to the stock because they mention it in passing, or in a minor part of the text, while actually having more to do with another stock or stocks. Using our frequency function above, we could filter out articles that mention the stock name infrequently, like in the snippet below.
'''Create a dictionary that maps each article to its word frequency distribution'''
article_to_freq = {article:freq for article,freq in zip(articles , results)}

'''Filter out articles that don't mention 'Netflix' at least 3 times'''
article_to_freq = {article:freq for article,freq in article_to_freq.items()
                   if dict(freq).get('NETFLIX', 0) >= 3}
Note, this isn’t a perfect form of topic modeling, but it is something you can do quickly to make an educated guess about whether an article actually covers the topic you want. You can also improve this process by filtering out articles that don’t contain certain other words. For instance, if you’re looking for articles specifically about Netflix’s stock, you might not want to include articles about new shows on Netflix, so you could filter out articles that don’t mention words like ‘stock’ or ‘investing’, as sketched below.
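Here is a minimal sketch of that idea, reusing the article_to_freq dictionary from above; the required words and the threshold of a single mention are purely illustrative choices:

'''Illustrative example: keep only articles that mention at least one required word'''
required_words = ['STOCK', 'INVESTING']

article_to_freq = {article: freq for article, freq in article_to_freq.items()
                   if any(dict(freq).get(word, 0) >= 1 for word in required_words)}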
One last note…
Another way of thinking about word frequency in our situation would be to get word counts across all articles at once. You can do this easily enough by concatenating (or joining together) each article in our list.
overall_text = ' '.join(articles)

top_words = get_word_frequency(overall_text)
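Slicing the result shows the most frequent words across the whole collection of articles:

'''most frequent words across all articles combined'''
print(top_words[:15])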
This type of analysis can go much deeper into the world of natural language processing, but that would go well beyond a single blog post, so that’s the end for now!
Originally posted on TheAutomatic.net blog.