In this blog post, we introduce an important field of artificial intelligence known as sentiment analysis: the task of discovering an individual's beliefs, emotions, and feelings about a product or a service. As we proceed through this tutorial, you will see how the approach is implemented, illustrated with a flow diagram. To keep things practical, live code is included in the relevant sections.

At the end, we present our approach to customizing the basic sentiment analysis algorithm and provide an API where users can test the customized approach in practice.

Now, let's define sentiment analysis through customer reviews. Taking customer feedback as an example, sentiment analysis of the review text measures the user's attitude towards the aspects of a product or service described in that text.


What is Sentiment Analysis?

Sentiment analysis is the process of using natural language processing, text analysis, and statistics to analyze customer feedback and sentiments. The most reputable businesses pay close attention to the sentiment of their customers: what people are saying, how they’re saying it, and what they mean.

In theory, it is the computational study of opinions, attitudes, views, emotions, and sentiments expressed in a particular text. That text can come in a variety of formats, such as reviews, news, comments, or blogs.

Why is sentiment analysis important for giant e-commerce websites?

In today's world, marketing and branding have become the strength of colossal businesses, and to build a connection with their customers such businesses leverage social media. The major aim of establishing this connection is to encourage two-way communication, where everyone benefits from online engagement. At the same time, two huge platforms have emerged in this space, and in what follows we'll see why analyzing the sentiments of their customers has become so important to them.

Flipkart and Amazon India are emerging as the two colossal players in the swiftly expanding online retail industry in India. Although Amazon started its operations in India much later than Flipkart, it is giving tough competition to Flipkart.

A generalized approach to sentiment analysis

Sentiment analysis uses various Natural Language Processing (NLP) methods and algorithms. Two processes illustrate how machine learning classifiers are implemented. Take a look.

  1. The training process: In this process, the model learns to associate a particular text input with the corresponding output, which is recognized as a tag. The tags are based on the samples used for training. The feature extractor transforms the text input into a feature vector. Pairs of feature vectors and tags (e.g. positive, neutral, or negative) are fed into the machine learning algorithm to generate a model.
  2. The prediction process: The feature extractor is used to transform unseen text inputs into feature vectors. Each feature vector is then fed into the model, which generates the predicted tag (positive, negative, or neutral).
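The two processes above can be sketched end to end in a few lines. Everything here (the feature extractor, the frequency-based classifier, the sample texts) is a toy illustration, not the approach implemented later in the post:

```python
from collections import Counter

# Toy feature extractor: bag-of-words counts (illustrative only)
def extract_features(text):
    return Counter(text.lower().split())

# Training process: associate feature vectors with tags
def train(samples):
    """samples: list of (text, tag) pairs. Returns per-tag word frequencies."""
    model = {}
    for text, tag in samples:
        tag_counts = model.setdefault(tag, Counter())
        tag_counts.update(extract_features(text))
    return model

# Prediction process: score unseen text against each tag's vocabulary
def predict(model, text):
    features = extract_features(text)
    scores = {
        tag: sum(counts[word] for word in features)
        for tag, counts in model.items()
    }
    return max(scores, key=scores.get)

model = train([
    ("great product loved it", "positive"),
    ("terrible waste of money", "negative"),
])
print(predict(model, "loved this great phone"))  # -> positive
```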

This kind of representation makes it possible for words with similar meaning to have a similar representation, which can improve the performance of classifiers. In the later section, you'll get to learn the sentiment analysis with the bag-of-words model, data collection and so on.

We are explaining the general approach for implementing sentiment analysis using a predefined library. We will implement three phases of this approach with live code: data gathering, data cleaning, and prediction. Each line of the code is explained in its respective section. The general approach is also shown in the flowchart below.

Steps for sentiment analysis using predefined library

Data collection via web scraping

Data scraping and web scraping are similar: both extract data from a specific URL. Data scraping is the technique of collecting data with a computer program and turning it into human-readable content so that you can read, store, and access it easily.

Selenium is an open-source tool primarily used for testing web applications, but it is also used for web scraping, particularly of pages that are rendered dynamically.

Scrapy is a Python framework that provides developers with a complete scraping package. It serves a similar purpose to Beautiful Soup.

We’re going to use one of the best-known libraries, Beautiful Soup. Let’s understand what Beautiful Soup actually is.

Beautiful Soup is a Python library used to parse HTML and XML files.

Here, we will import Beautiful Soup along with ‘urllib’ and parse the source with the ‘lxml’ parser.

To begin, we need to import Beautiful Soup and ‘urllib’. The source code would be:

In ‘source’ we mention the path of the particular URL.

Then we save the scraped data in the ‘soup’ variable. If you want to read the complete file, apply the ‘print’ action to the ‘soup’ variable.

If you want to check a specific tag, you can apply the ‘print’ action with that specific tag:
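The steps above can be sketched as follows, assuming Beautiful Soup is installed. To keep the example self-contained, it parses an inline HTML string with the built-in 'html.parser' (instead of 'lxml') and only comments where 'urlopen' would fetch a live page:

```python
from urllib.request import urlopen  # used when fetching a live page
from bs4 import BeautifulSoup

# For a live page you would write:
#   source = urlopen("https://example.com").read()
# Here we parse an inline string so the example runs offline.
source = "<html><body><h1>Product reviews</h1><p>Great phone!</p></body></html>"

# Save the parsed data in the soup variable ('lxml' also works if installed)
soup = BeautifulSoup(source, "html.parser")

print(soup.prettify())   # read the complete file
print(soup.find("p"))    # check a specific tag
```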

The scraper can then replicate or store the complete website data or content elsewhere and use it for further processing. Web scraping is also used for illegal purposes, including the undercutting of prices and the theft of copyrighted content. An online entity targeted by a scraper can suffer severe financial losses, especially if it’s a business strongly relying on competitive pricing models or deals in content distribution.

Data Preprocessing (Cleaning)

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. If your data hasn’t been cleaned and preprocessed, your model will not perform well.

Using Regex

Line 1: \W matches non-word characters (such as punctuation) and \d matches digits; both are stripped out.

Line 2: Removes the link from the text.

Line 3: Values are returned by the function to the part of the program where the function is called.
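A minimal sketch of the cleaning function described by Lines 1–3 (the function name is illustrative; the link-removal pattern runs first so the URL's punctuation is still intact when it is matched):

```python
import re

def clean_text(text):
    text = re.sub(r"https?://\S+", "", text)  # Line 2: remove links from the text
    text = re.sub(r"[\W\d]+", " ", text)      # Line 1: \W strips punctuation, \d strips digits
    return text.strip()                       # Line 3: return the value to the caller

print(clean_text("Loved it!!! 5 stars https://example.com/review?id=42"))
```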

Using nltk

Line 4: This function converts a string into a list based on the splitter mentioned in the argument of the split function. If no splitter is mentioned, whitespace is used by default. Join converts a list into a string; one can say that join is the reverse of split.

Line 5: So the first step is to convert the string into a list of tokens; then each token is iterated over, and any stop words among the tokens are removed.

Line 6: In the final step, the filtered tokens are joined back together.

Line 7: Values are returned by the function to the part of the program where the function is called.
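A sketch of the stop-word removal described by Lines 4–7. To avoid a corpus download, a small hardcoded set stands in for nltk.corpus.stopwords.words("english"):

```python
# Small stand-in for nltk.corpus.stopwords.words("english"),
# hardcoded so the sketch needs no corpus download.
STOP_WORDS = {"a", "an", "the", "is", "was", "this", "it", "and", "of"}

def remove_stop_words(text):
    tokens = text.split()                                          # Line 4: string -> list (whitespace splitter)
    filtered = [t for t in tokens if t.lower() not in STOP_WORDS]  # Line 5: iterate and drop stop words
    result = " ".join(filtered)                                    # Line 6: join the filtered tokens back
    return result                                                  # Line 7: return to the caller

print(remove_stop_words("this is the best phone"))  # -> best phone
```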

Predicting the scores using predefined library SentimentIntensityAnalyzer

Line 8: sid is an object of the class SentimentIntensityAnalyzer. This class comes from nltk.sentiment.vader.

Line 9: It will return four sentiment fields (negative, neutral, positive, and compound) along with their scores. The compound score is a metric that sums all the lexicon ratings and normalizes the result between -1 and 1. If the compound value is greater than or equal to 0.05, the sentence is classified as positive; if it is less than or equal to -0.05, the sentence is negative; and if it lies in neither range, the sentence is neutral.

Line 10: Function will return the output to the calling function.

Code Snippet for General Approach

Two files are used: one, index.html, for getting the text from the end user, and another, result.html, for rendering the response. This small application is built using Django, and this code snippet shows how the predefined approach works. In the next section we discuss how to create a custom approach for a better sentiment analysis API.

Discussion of a custom sentiment analysis approach

In this approach, data gathering remains the same, as this basic step is needed for any approach. After the data is gathered, different regex patterns can be applied to clean it, and the data can be subjected to nltk operations such as stemming, stop-word removal, and lemmatization to clean it more effectively. Here, custom functions can be developed based on the requirements and structure of the data set. After this step you get refined text, which is passed to a mechanism that converts the text into tensors or integers. This could be word embeddings, a bag of words, or tf-idf. The benefit of word embeddings over the latter two methods is that they maintain semantic relationships between words and help capture context better. The output is then passed to a deep learning or machine learning model. I would suggest plotting a graph of the text features: if the graph shows a non-linear relationship, it is good to opt for deep learning; otherwise, classical machine learning.

Once the choice of model is made, it is time to feed the tensors to the model for training. Training time depends on the amount of data you have. Once this step is complete, I recommend saving the model so that for the prediction phase you only need to load it instead of training it again. If you are using Keras, follow the steps below to save the model.
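A minimal save-and-reload sketch, assuming TensorFlow's bundled Keras. The tiny model and the file name are illustrative stand-ins for your trained sentiment model:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in for a trained sentiment model (architecture is illustrative)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(4,)),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

model.save("sentiment_model.h5")  # persist the model after training

# Later, in the prediction phase, load it instead of retraining
restored = tf.keras.models.load_model("sentiment_model.h5")
x = np.zeros((1, 4), dtype="float32")
print(restored.predict(x, verbose=0).shape)  # the restored model predicts like the original
```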

If you are building the model with PyTorch, then execute the code below to save the model.
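A corresponding PyTorch sketch; again, the tiny model and checkpoint name are illustrative stand-ins for your trained sentiment model:

```python
import torch
import torch.nn as nn

# Tiny stand-in for a trained sentiment model (architecture is illustrative)
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())

torch.save(model.state_dict(), "sentiment_model.pt")  # save only the learned weights

# Recreate the architecture, then load the weights from the checkpoint
restored = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
restored.load_state_dict(torch.load("sentiment_model.pt"))
restored.eval()  # switch to inference mode before predicting
```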

Once the model is saved, it is time to load it for real-time prediction. Saving the model lets you load it from a checkpoint instead of training it again for each prediction. For the prediction phase, it is important to create features from the real-time test data, which are fed to the saved model. The model may overfit or underfit, in which case you need to tweak the hyperparameters while creating the model.

Conclusion

In this short tutorial, we have seen what sentiment analysis is and why it is used. Amazon and Flipkart use it extensively to increase their sales and productivity. We also implemented a general approach with code and explained every line of it. In the end, we discussed a custom approach that can make the code more robust.

PS: If you liked this post, kindly share your reviews in the comments section below. To stay in touch and not miss any of our articles/blogs, subscribe to our newsletter, and check out our blog page https://blog.paradisetechsoft.com/

PPS: Follow us on our Social media handles: Medium: medium.com, Facebook: https://www.facebook.com/ParadiseTechSoftSolution/, LinkedIn https://www.linkedin.com/company/3302119/admin/, GitHub: Do check out our recent repositories at https://github.com/puneet-kaushal/

Appendix

APIs to web scraping

Some website providers offer Application Programming Interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML and instead access the data directly in formats like JSON or XML. HTML is primarily a way to visually present content to users.
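For instance, a JSON response can be consumed directly with the standard library; the response body below is made up for illustration, standing in for what a reviews API might return:

```python
import json

# Canned response body, standing in for a real API's JSON reply
response_body = '''
{
  "product": "Acme Phone X",
  "reviews": [
    {"rating": 5, "text": "Great phone!"},
    {"rating": 2, "text": "Battery dies fast."}
  ]
}
'''

data = json.loads(response_body)  # structured data, no HTML parsing needed
for review in data["reviews"]:
    print(review["rating"], review["text"])
```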

Scrape HTML Content From a Page

Then open up a new file in your favourite text editor. All you need to retrieve the HTML are a few lines of code:

Storing the content in page object
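A self-contained sketch of those few lines: a tiny page is served locally (so the example runs offline), then its HTML is retrieved and stored in the `page` object; in practice you would point `urlopen` at the real site's URL:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

HTML = b"<html><head><title>Demo shop</title></head><body>Reviews here</body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(HTML)
    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve the page locally; in practice you'd fetch the real site's URL
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Retrieve the HTML and store the content in the page object
page = urlopen(f"http://127.0.0.1:{server.server_address[1]}/").read()
server.shutdown()

print(page[:40])
```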

Bags of Words


The bag-of-words model usually relies on a large list, perhaps better thought of as a dictionary, of words that are considered to carry sentiment. Each of these words has its own "value" when found in the text. The values are typically all added up, and the result is a sentiment valuation.

The equation to add and derive a number can vary, but this model mainly focuses on the words and does not attempt to understand language fundamentals.
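The model can be sketched in a few lines; the lexicon and its values below are made up for illustration:

```python
# A tiny sentiment "dictionary": each word carries its own value
LEXICON = {"great": 2, "love": 2, "good": 1, "bad": -1, "terrible": -2, "hate": -2}

def bag_of_words_score(text):
    # Add up the value of every lexicon word found in the text
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(bag_of_words_score("great phone but terrible battery"))  # -> 0 (they cancel out)
```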

TFIDF

This is another method of converting strings into numbers, and it is more effective than bag of words. It looks at a word's frequency both in the current document and across the whole collection of documents, which lets us give higher scores to the more meaningful words in a text. The equation for TF-IDF is as follows:

tf-idf(t, d) = tf(t, d) * log(N / df(t))

where N is the number of documents and df(t) is the number of documents containing the term t.
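A direct translation of the formula, using the natural log (the log base varies by convention); the toy documents are made up:

```python
import math

docs = [
    "great phone great camera",
    "bad battery",
    "great value",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)             # term frequency within this document
    df = sum(1 for d in docs if term in d.split())  # number of documents containing the term
    return tf * math.log(len(docs) / df)            # tf * log(N / df)

# "great" is common across documents, so it scores lower than the rarer "battery"
print(round(tf_idf("great", docs[0], docs), 3))    # -> 0.203
print(round(tf_idf("battery", docs[1], docs), 3))  # -> 0.549
```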