Sentiment Analysis | My Assignment Tutor

Assignment 2: Tweet Sentiment Analysis
version 1.02
Mark Dras
May 17, 2021

1 Introduction

Sentiment analysis is an application of Natural Language Processing (a branch of Artificial Intelligence) that is concerned with detecting the sentiment of text. A common dimension for measuring sentiment uses the labels positive, negative and neutral; there are many other possibilities as well (e.g. how strong the sentiment is, how active vs subdued it is, etc). Figure 1 contains two sample tweets about the current series Falcon and the Winter Soldier, one positive and one negative. Social media is a particularly popular arena for deploying sentiment analysis: companies want to know how their products are being perceived, etc. Consequently, there are many organisations offering apps or services for building them; a screenshot from a demo of such an app is given in Figure 2.

The earliest and simplest techniques for carrying out sentiment analysis (although this type of approach is still in fact widely used) just carried out keyword matching in the text, based on words from a source of words that have known sentiment (a sentiment lexicon). Often, these lexicons don’t have extensive coverage: there are many words with sentiment that aren’t included in them, particularly in the case of social media text, where misspellings, abbreviations and slang are common. Consequently, there are other approaches to the task: there’s a large class of machine learning techniques applied, as well as other techniques like label propagation, where sentiment labels are propagated through a graph structure.

In this assignment, you’ll work with a set of real tweets collected by researchers who developed one of the first approaches to sentiment analysis of tweets, and build your own tweet sentiment analyser.
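As a rough illustration of the keyword-matching idea described above, here is a minimal Java sketch. Everything in it is an assumption made for the example: the class name KeywordSentimentSketch, the four-word lexicon, the tokenisation (lowercase, split on non-letters) and the rule of comparing positive and negative counts. It is not the assignment's code or word list.

```java
import java.util.Map;

// Illustrative only: a tiny hand-made lexicon, not the assignment's word list.
final class KeywordSentimentSketch {

    // Hypothetical lexicon entries chosen for the example.
    static final Map<String, String> LEXICON = Map.of(
            "good", "positive",
            "better", "positive",
            "boring", "negative",
            "bad", "negative");

    // Counts lexicon hits in the text and returns "positive", "negative",
    // "neutral" (a tie) or "none" (no lexicon word found).
    static String label(String text) {
        int pos = 0;
        int neg = 0;
        // Assumption: lowercase the text and split on anything that is not a letter.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            String sentiment = LEXICON.get(token);
            if ("positive".equals(sentiment)) {
                pos++;
            } else if ("negative".equals(sentiment)) {
                neg++;
            }
        }
        if (pos == 0 && neg == 0) return "none";
        if (pos > neg) return "positive";
        if (neg > pos) return "negative";
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(label("This show is better than the last one"));
        // A stretched spelling is not in the lexicon, so nothing matches.
        System.out.println(label("soooo booooorrring"));
    }
}
```

The second call shows the coverage problem mentioned above: a stretched spelling such as "booooorrring" matches no lexicon entry, so a pure keyword matcher cannot label the text.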
Early stages of the assignment just use a keyword-based approach, building up to a simple version of label propagation later.

2 Data

There are four sorts of data you’ll be using.

    [email protected] 20, 1:07pm: Imma say it…… The Falcon and The Winter Soldier is better than WandaVision, I’m sorry it’s just the [email protected]
    Apr 20, 1:31pm: This last episode of falcon and winter soldier is sooooo booooorrring. Also ugh with all the America is evil and bad crap

Figure 1: Sample tweets about Falcon and the Winter Soldier

Figure 2: Sentiment app from NCSU, target “Falcon and the Winter Soldier”

Tweets. Early sentiment analysis work included the collection of a set of tweets, some for training a machine learning model for sentiment analysis, and some for evaluating how good that model is. We’ll be using that same data; it includes the following information for each tweet:

• the gold polarity of the tweet (0 = negative, 2 = neutral, 4 = positive, = not given)
• the id of the tweet (2087)
• the date of the tweet (Sat May 16 23:58:44 UTC 2009)
• the query (lyx)
• the user that tweeted (e.g. robotickilldozr)
• the text of the tweet (e.g. Lyx is cool)

We’ll be ignoring the query. I’ve written code to read in the CSV file that the data is stored in. The starting sample data you’ll be working with consists of 10 tweets, with details as in Figure 3.

Basic Sentiment Words. There’s a widely used subjectivity and sentiment lexicon that I’ve extracted data from. Each line consists of a word, followed by the typical sentiment of that word without any additional context, indicated by the string positive or negative, e.g.

    abandoned negative
    abandonment negative
    abandon negative
    . . .

    gold polarity: NEG
    predicted polarity: NONE
    ID: 1467810369
    user: _TheSpecialOne_
    date: Mon Apr 06 22:19:45 PDT 2009
    @switchfoot – Awww, that’s a bummer. You shoulda got David Carr of Third Day to do it. ;D
    ===========================================
    gold polarity: NEG
    predicted polarity: NEG
    ID: 1467810672
    user: scotthamilton
    date: Mon Apr 06 22:19:49 PDT 2009
    is upset that he can’t update his Facebook by texting it… and might cry as a result School today also. Blah!
    ===========================================
    gold polarity: NEG
    predicted polarity: NONE
    ID: 1467810917
    user: mattycus
    date: Mon Apr 06 22:19:53 PDT 2009
    @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
    ===========================================
    gold polarity: NEG
    predicted polarity: POS
    ID: 1467811184
    user: ElleCTF
    date: Mon Apr 06 22:19:57 PDT 2009
    my whole body feels itchy and like its on fire
    ===========================================
    gold polarity: NEG
    predicted polarity: NEG
    ID: 1467811193
    user: Karoli
    date: Mon Apr 06 22:19:57 PDT 2009
    @nationwideclass no, it’s not behaving at all. i’m mad. why am i here? because I can’t see you all over there.
    ===========================================
    gold polarity: NEG
    predicted polarity: NONE
    ID: 1467811372
    user: joy_wolf
    date: Mon Apr 06 22:20:00 PDT 2009
    @Kwesidei not the whole crew
    ===========================================
    gold polarity: NEG
    predicted polarity: NEUT
    ID: 1467811592
    user: mybirch
    date: Mon Apr 06 22:20:03 PDT 2009
    Need a hug
    ===========================================
    gold polarity: NEG
    predicted polarity: POS
    ID: 1467811594
    user: coZZ
    date: Mon Apr 06 22:20:03 PDT 2009
    @LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I’m fine thanks , how’s you ?
    ===========================================
    gold polarity: NEG
    predicted polarity: NONE
    ID: 1467811795
    user: 2Hood4Hollywood
    date: Mon Apr 06 22:20:05 PDT 2009
    @Tatiana_K nope they didn’t have it
    ===========================================
    gold polarity: NEG
    predicted polarity: NONE
    ID: 1467812025
    user: mimismo
    date: Mon Apr 06 22:20:09 PDT 2009
    @twittera que me muera ?
    ===========================================

Figure 3: The 10 tweets from the small sample file. The predicted polarities are after running predictTweetSentimentFromBasicWordlist() with basic-sent-words.txt.

Figure 4: Graph defined by connecting tweets with shared words, using sample tweets from Fig 3 and inverse index inv-index-50.txt (the vertices are the 10 tweets of Fig 3, each labelled with its ID and user)

Finegrained Sentiment Words. The full lexicon from above also includes information about the strength of the sentiment: weaksubj indicates weak sentiment, and strongsubj strong.

    type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
    type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
    type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
    type=strongsubj len=1 word1=abase pos1=verb stemmed1=y priorpolarity=negative
    . . .

Inverse Index. The credit-level tasks and above will require constructing a graph, linking tweets that share words. I’ve constructed some inverse indexes that, for each word, give the IDs of tweets that contain that word.

    sleep 1467814783,1467816665,1467818603
    thanks 1467811594
    falling 1467819022
    go 1467810917,1467815924
    . . .

3 Your Tasks

For your tasks, you’ll be adding attributes and methods to existing classes given in the code bundle accompanying these specs. Where it’s given, you should use exactly the method stub provided for implementing your tasks. Don’t change the names or the parameters. You can add more functions if you like.

The two classes provided are Tweet and TweetCollection.
The former represents an individual tweet, and the latter a collection of them.

Note that the Tweet class contains two enumerated types: Polarity represents the possible sentiment polarity values for a tweet (POSitive, NEGative, NEUTral or NONE); and Strength, for the strength of polarity (WEAK, STRONG), for the Distinction-level tasks.

3.1 Pass Level

To achieve at least a Pass (≥ 50%) for the assignment, you should do all of the following. You’ll basically be implementing a simple keyword-based method for sentiment analysis of tweets, counting up the numbers of positive and negative words in a tweet to determine the predicted polarity of the tweet. (This differs from the gold polarity, which is what has been decided as the true polarity of the tweet; you’re going to try to see how well you can predict it based on the content of the tweet.)

T1 You will choose appropriate representations for the Tweet class. You may or may not choose to base it on other classes I’ve supplied (Vertex, VertexIDList). Material from weeks 9–11 of lectures will be particularly relevant in helping you decide.

You’ll need to write a constructor based on your chosen representation that instantiates an empty tweet.

    public Tweet(String p, String i, String d, String u, String t) {
        // Constructor
        // TODO
    }

T2 You’ll also need to do the same for the TweetCollection class. You might want to look ahead at the Credit-level tasks, which require the class to have some graph-like properties, to make the decision here. (Alternatively, you can just start with some underlying representation that will let you implement all of the Pass-level task functions, and then revise later.)

    public TweetCollection() {
        // Constructor
        // TODO
    }

Also write the following two functions. (Note that the code bundle also includes return null; statements in these functions. This is so that the class still compiles even when there are functions not yet implemented.)

    public Tweet getTweetByID (String ID) {
        // PRE: –
        // POST: Returns the Tweet object with that tweet ID
        // TODO
    }

    public Integer numTweets() {
        // PRE: –
        // POST: Returns the number of tweets in this collection
        // TODO
    }

T3 Write some getter functions for the properties of the tweet passed in via the constructor, implemented using your chosen representation for tweets.

    public Polarity getGoldPolarity() {
        // PRE: –
        // POST: Returns the gold polarity of the tweet
        // TODO
    }

    public String getID() {
        // PRE: –
        // POST: Returns ID of tweet
        // TODO
    }

    public String getDate() {
        // PRE: –
        // POST: Returns date of tweet
        // TODO
    }

    public String getUser() {
        // PRE: –
        // POST: Returns identity of tweeter
        // TODO
    }

    public String getText() {
        // PRE: –
        // POST: Returns text of tweet as a single string
        // TODO
    }

Also write a getter and setter function for predicted polarity, which you will use when trying to predict the polarity of a tweet based on the content of its text.

    public Polarity getPredictedPolarity() {
        // PRE: –
        // POST: Returns the predicted polarity of the tweet
        // TODO
    }

    public void setPredictedPolarity(Polarity p) {
        // PRE: –
        // POST: Sets the predicted polarity of the tweet
        // TODO
    }

Note that I have provided the implementation of a function public String[] getWords(). This takes the text of a tweet and splits it into words (‘tokenises’ it) in a standard way, which is returned as an array of String.
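The supplied getWords() is not reproduced in these specs, so the following is only a stand-in to show what whole-word matching on a token array looks like. The class name TokeniseSketch, both method names and the tokenisation rule (lowercase, split on anything other than letters, digits and apostrophes) are assumptions; the real getWords() may split text differently.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in tokeniser for illustration; the supplied getWords() may behave differently.
final class TokeniseSketch {

    // Assumption: lowercase the text and treat every run of characters that is
    // not a letter, digit or apostrophe as a separator.
    static String[] tokenise(String text) {
        List<String> words = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^a-z0-9']+")) {
            if (!piece.isEmpty()) { // split() can yield a leading empty string
                words.add(piece);
            }
        }
        return words.toArray(new String[0]);
    }

    // True if the word occurs as a whole token, not merely as a substring.
    static boolean containsWord(String text, String word) {
        String wanted = word.toLowerCase();
        for (String token : tokenise(text)) {
            if (token.equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

Matching on tokens rather than with a substring search means that a text containing only "wholesome" would not count as containing the word "whole".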
You will want to use getWords() in other functions when you check whether a tweet contains a particular word.

T4 I’ve supplied most of the content of a function public void ingestTweetsFromFile(String fInName) that will read in the content from a specified .csv file fInName, and for each line in that file it instantiates a new Tweet using the constructor. You need to add code to insert that into whatever representation you have chosen for your collection of tweets from Task T2. (The csv reader requires the opencsv jarfile, which you’ll have to include as a library in the Eclipse Java project, along with one of its dependencies. See the notes on the iLearn page with the code bundle for help with doing this.)

T5 Write a function in TweetCollection that will read in a file of basic sentiment words (i.e. words paired with their sentiment, as described in Sec 2), and store them in whatever representation you choose for sentiment words.

    public void importBasicSentimentWordsFromFile (String fInName) throws IOException {
        // PRE: –
        // POST: Read in and store basic sentiment words in appropriate data type
        // TODO
    }

Also write a getter function. If w represents a word that does not have an associated sentiment, the function should return NONE.

    public Polarity getBasicSentimentWordPolarity(String w) {
        // PRE: w not null, basic sentiment words already read in from file
        // POST: Returns polarity of w
        // TODO
        return null;
    }

T6 In TweetCollection, write the following function that will assign predicted sentiments based on the content of the tweet. To assign sentiment, use the following rule:

• If there are no positive or negative words in the tweet, assign predicted sentiment NONE.
• If there are more positive than negative words, assign predicted sentiment POS.
• If there are more negative than positive words, assign predicted sentiment NEG.
• Otherwise, assign NEUT.

    public void predictTweetSentimentFromBasicWordlist () {
        // PRE: Basic word sentiment already imported
        // POST: For all tweets in collection, tweet annotated with predicted sentiment
        //       based on count of sentiment words in sentWords
        // TODO
    }

Consider the text of tweet 1467811594 (see Figure 3), with text “@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I’m fine thanks , how’s you ?”. Words from the file basic-sent-words.txt are indicated as negative or positive. This tweet would then get the predicted polarity POS. (Note that the matching with words in the sentiment words file should be done using the tokenisation provided by getWords().)

T7 Write a function that calculates the accuracy of your sentiment predictions for each tweet. Accuracy is defined as follows:

• Count up the number of tweets for which the predicted polarity is the same as the gold polarity, as long as this is not NONE (numCorrect).
• Count up the number of tweets for which a prediction is made, i.e. not NONE (numPredicted).
• Accuracy is the proportion numCorrect / numPredicted.

If numPredicted is 0, the function should return 0.

    public Double accuracy () {
        // PRE: –
        // POST: Calculates and returns accuracy of labelling
        // TODO
    }

For the small sample of 10 tweets, and the sentiment words in basic-sent-words.txt, you should get an accuracy of 0.4. (See Fig 3.)

(Note: Don’t include tweets in either numerator or denominator that have gold polarity NONE.)

T8 Write a function that calculates the coverage of your sentiment predictions for each tweet. Coverage is defined as follows:

• As in Task T7, count up the number of tweets for which a prediction is made, i.e. not NONE (numPredicted).
• Coverage is the proportion numPredicted / total number of tweets.

    public Double coverage () {
        // PRE: –
        // POST: Calculates and returns coverage of labelling
        // TODO
    }

For the small sample of 10 tweets, and the sentiment words in basic-sent-words.txt, you should get a coverage of 0.5. (See Fig 3.)

(Note: Don’t include tweets in either numerator or denominator that have gold polarity NONE.)

3.2 Credit Level

To achieve at least a Credit (≥ 65%) for the assignment, you should do the following. You should also have completed all the Pass-level tasks.

In the approach to sentiment labelling from the Pass-level tasks, you’ll notice that you ended up with quite a few unlabelled tweets, because several of them didn’t contain words from the sentiment lexicon. As noted in Sec 1, another kind of technique is to propagate sentiment labels from one tweet to another (similar) one.

In the Credit-level tasks, essentially what you’ll be doing is building a graph that links together ‘similar’ tweets (for some definition of similarity), and then identifying connected components in the graph with a view to propagating sentiment labels via the edges in the graph within those connected components.

T9 Implement functions in class Tweet for handling neighbours.
You may need to augment your representation of tweets to do this.

    public void addNeighbour(String ID) {
        // PRE: –
        // POST: Adds a neighbour to the current tweet as part of graph structure
        // TODO
    }

    public Integer numNeighbours() {
        // PRE: –
        // POST: Returns the number of neighbours of this tweet
        // TODO
    }

    public void deleteAllNeighbours() {
        // PRE: –
        // POST: Deletes all neighbours of this tweet
        // TODO
    }

    public Vector<String> getNeighbourTweetIDs () {
        // PRE: –
        // POST: Returns IDs of neighbouring tweets as vector of strings
        // TODO
    }

    public Boolean isNeighbour(String ID) {
        // PRE: –
        // POST: Returns true if ID is neighbour of the current tweet, false otherwise
        // TODO
    }

T10 You’ll be constructing a graph of tweets by adding an edge between two tweets if they share a word. For example, in the sample tweets of Fig 3, the tweets with IDs 1467811184 (“my whole body . . .”) and 1467811372 (“@Kwesidei not the whole crew”) share the word whole, and so there will be an edge between the two.

I have constructed an inverse index (see Sec 2) that contains all relevant words from a set of tweets, and following that the list of tweets in which each word occurs. (Note that the inverse index might refer to some tweets that are not actually in the graph. For the specific files used in the running example, i.e. training-10.csv and inv-index-50.txt, you’ll find some tweet IDs in the inverse index that don’t appear in the graph; the ones of use to you for this task are the ones that do appear in the graph. inv-index-50.txt was in fact constructed from training-50.csv, which training-10.csv is a subset of.)

In TweetCollection, write a function that reads in the contents of this file, and returns this information as a map from strings (the words) to a vector of strings (a vector of the IDs of the tweets that contain the word).

    public Map<String, Vector<String>> importInverseIndexFromFile (String fInName) throws IOException {
        // PRE: –
        // POST: Read in and returned contents of file as inverse index
        //       invIndex has words w as key, IDs of tweets that contain w as value
        // TODO
    }

T11 Now write the function that constructs that graph in TweetCollection.

    public void constructSharedWordGraph(Map<String, Vector<String>> invIndex) {
        // PRE: invIndex has words w as key, IDs of tweets that contain w as value
        // POST: Graph constructed, with tweets as vertices,
        //       and edges between them if they share a word
        // TODO
    }

For the running example, the graph should look as in Fig 4.

After this function is run, queries to tweets about neighbours should return appropriate responses. For example, d.getTweetByID("1467810672").numNeighbours() should return 1.

T12 As noted above, you’ll be propagating sentiment labels across connected components. Write a function that, according to whatever graph representation you have chosen, annotates tweets as belonging to a particular connected component.

    public void annotateConnectedComponents() {
        // PRE: –
        // POST: Annotates graph so that it is partitioned into components
        // TODO
    }

(Note: This won’t be tested directly; it will just be tested indirectly via the functions below.)

T13 Write a function that, after components have been identified as in T12, counts the number of connected components.

    public Integer numConnectedComponents() {
        // PRE: Connected components have been annotated
        // POST: Returns the number of connected components
        // TODO
    }

For the running example, the answer would be 7.

T14 Write a function that, after components have been identified as in T12, counts the number of times a particular sentiment label appears in a connected component, where the particular connected component is identified by a tweet ID contained in that component.

    public Integer componentSentLabelCount(String ID, Polarity p) {
        // PRE: Graph components are identified, ID is a valid tweet
        // POST: Returns count of labels corresponding to Polarity p in component containing ID
        // TODO
    }

For the running example, componentSentLabelCount("1467811372", Polarity.POS), for instance, would give the value 1. There are two tweets in that component, with tweet IDs 1467811372 and 1467811184.
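The component tasks T12 to T14 can be pictured with a generic sketch like the one below. It is a hedged illustration, not the expected solution: the class ComponentSketch, its three methods, the adjacency-map representation and the use of plain strings for labels are all hypothetical, and breadth-first search is just one of several ways to find connected components.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and representation are hypothetical.
final class ComponentSketch {

    // Labels every vertex with a component number using breadth-first search.
    // adjacency: vertex ID -> IDs of its neighbours (undirected, so each edge
    // is expected to appear in both directions).
    static Map<String, Integer> annotate(Map<String, List<String>> adjacency) {
        Map<String, Integer> componentOf = new HashMap<>();
        int next = 0;
        for (String start : adjacency.keySet()) {
            if (componentOf.containsKey(start)) {
                continue; // already reached from an earlier start vertex
            }
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            componentOf.put(start, next);
            while (!queue.isEmpty()) {
                String v = queue.remove();
                for (String w : adjacency.getOrDefault(v, List.of())) {
                    if (!componentOf.containsKey(w)) {
                        componentOf.put(w, next);
                        queue.add(w);
                    }
                }
            }
            next++;
        }
        return componentOf;
    }

    // Number of distinct components in an annotation.
    static int count(Map<String, Integer> componentOf) {
        return (int) componentOf.values().stream().distinct().count();
    }

    // Counts vertices carrying `label` in the component that contains `id`.
    // Assumes `id` is present in the annotation.
    static int labelCount(Map<String, Integer> componentOf,
                          Map<String, String> labelOf, String id, String label) {
        int target = componentOf.get(id);
        int n = 0;
        for (Map.Entry<String, Integer> e : componentOf.entrySet()) {
            if (e.getValue() == target && label.equals(labelOf.get(e.getKey()))) {
                n++;
            }
        }
        return n;
    }
}
```

With a component of two vertices where one is labelled POS and the other NONE, labelCount for POS is 1 whichever of the two IDs is used to identify the component, which is the same shape as the T14 example.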
You’ll see from Figure 3 that the latter tweet (1467811184) has predicted polarity POS and there is no prediction (NONE) for the former tweet (1467811372).

3.3 (High) Distinction Level

To achieve at least a Distinction (75 – 100%) for the assignment, you should do the following. You should also have completed all the Credit-level tasks.

The main goal for this level is to propagate sentiment labels via the edges in the graph defined above, and the majority labels in those connected components. Additionally, there will be a task on using a richer sentiment labelling scheme.

T15 Write a function to propagate a particular polarity p across a particular component. The component can be identified by the ID of any tweet in that component; and the function has a binary flag to indicate whether the tweet should only be labelled with polarity p if its existing label is NONE, or whether it should always be labelled with p regardless of its existing label.

    public void propagateLabelAcrossComponent(String ID, Polarity p, Boolean keepPred) {
        // PRE: ID is a tweet id in the graph
        // POST: Labels tweets in component with predicted polarity p
        //       (if keepPred == T, only tweets w pred polarity None; otherwise all tweets)
        // TODO
    }

For example, propagateLabelAcrossComponent("1467811184", Polarity.NEUT, Boolean.TRUE) would result in the tweet with ID 1467811184 keeping its predicted polarity of POS and the tweet with ID 1467811372 being labelled with polarity NEUT.

T16 The rule for propagating sentiment across a connected component involves determining the majority sentiment of that component, as follows (analogous to the tweet-labelling rules of T6):

• If there are no positive or negative tweets in the component, majority sentiment is NONE.
• If there are more positive than negative tweets, majority sentiment is POS.
• If there are more negative than positive tweets, majority sentiment is NEG.
• Otherwise, NEUT.

Then propagate that across the component as in T15.

    public void propagateMajorityLabelAcrossComponents(Boolean keepPred) {
        // PRE: Components are identified
        // POST: Tweets in each component are labelled with the majority sentiment for that component
        //       Majority label is defined as whichever of POS or NEG has the larger count;
        //       if POS and NEG are both zero, majority label is NONE
        //       otherwise, majority label is NEUT
        //       If keepPred is True, only tweets with predicted label None are labelled in this way
        //       otherwise, all tweets in the component are labelled in this way
        // TODO
    }

In the running example, for keepPred == True, the only tweet to gain a new predicted polarity is the one with ID 1467811372, which becomes POS.

T17 There is a file available of finegrained sentiment (see Sec 2). Write functions, analogous to those of T5, as follows.

    public void importFinegrainedSentimentWordsFromFile (String fInName) throws IOException {
        // PRE: –
        // POST: Read in and store finegrained sentiment words in appropriate data type
        // TODO
    }

    public Polarity getFinegrainedSentimentWordPolarity(String w) {
        // PRE: w not null, finegrained sentiment words already read in from file
        // POST: Returns polarity of w
        // TODO
    }

    public Strength getFinegrainedSentimentWordStrength(String w) {
        // PRE: w not null, finegrained sentiment words already read in from file
        // POST: Returns strength of w
        // TODO
    }

Note that in the file of finegrained sentiment, there may be multiple occurrences of individual words with different sentiment. (For example, fun is both negative and positive.) This can be because when words are used as different parts of speech (e.g. nouns, verbs) they have different sentiment. Since we’re ignoring parts of speech, just use the last sentiment mentioned in the file for a particular word.

T18 There are two strengths of finegrained sentiment, STRONG and WEAK.
Write a function that adapts the method from T6, but that assigns weights to negative and positive words depending on whether they are strong or weak.

    public void predictTweetSentimentFromFinegrainedWordlist (Integer strongWeight, Integer weakWeight) {
        // PRE: Finegrained word sentiment already imported
        // POST: For all tweets, tweet is annotated with predicted sentiment
        //       based on weighted count of sentiment words in sentWords
        // TODO
    }

In the running example, for the tweet with ID 1467810672 (“is upset that . . .”), and assigning weight 2 to strong sentiment and 1 to weak sentiment, the negative words would have weight 5 in total (upset weak, cry and blah strong), and the positive words 2 in total (might strong).

3.4 Bonus

This section is not worth any marks: you can get 100% by completing the above functions correctly. This is only an additional task for anyone interested.

T19 Write your own function to assign predicted sentiment to tweets. The goal is to produce an assignment that has high accuracy and high coverage. (Obviously, it should not use the tweet’s gold sentiment. The marking of this function will use data that does not make available the gold sentiment.)

    public void myTweetSentimentPredictor () {
        // PRE: –
        // POST: All tweets are annotated with sentiment
        // TODO
    }

4 What To Hand In

In the submission page on iLearn for this assignment you must include the following:

Submit a zip file consisting of all the Java classes (i.e. the .java files) in the package from the original assignment code bundle.

Instructions that you should follow on how to create the zipfile are available in iLearn: you’ll find them with all the assignment 2 material.

Your file must leave unchanged the specification of already implemented functions, and include your implementations of your selection of method stubs outlined above.

Do not change the names of the method stubs because the auto-tester assumes the names given. Do not change the package statement.
You may however include additional auxiliary methods if you need them.

Please note that we are unable to check individual submissions and so it is very important to abide by the above submission instructions.

5 Changelog

• 6/5/21: Assignment released.
• 11/5/21:
  – Added edge case for Task T7.
  – Corrected Tasks T15 and T16 to be consistent with code bundle.
  – Corrected which tweet was POS and which NONE in Task T14.
  – Made the final task a bonus one rather than one counting for marks.
• 17/5/21: Added footnote with more detail on inverse indexes.

