Santa’s Super Sorter: Naughty or Nice?

This post is part of the fantastic F# Advent Calendar 2017 series organized by Sergey Tihon (@sergey_tihon) to bring the community together to celebrate the holiday season.  Be sure to check out the other posts!

The goal:  help Santa sort out who has been naughty or nice this Christmas !!

During this time of year Santa is overwhelmed, preparing presents for Christmas, and he is looking for helper elves to outsource some very important tasks.  He is asking for our help to design a program that can efficiently decide who has been naughty or nice this year.

The idea for this program is to evaluate a person’s tweets to determine if a given Twitter user has been Naughty or Nice.  This program will analyze the last year of tweets generated by the member.  The program uses the IBM-Watson Tone Analyzer service to interpret and analyze the tweet text making a qualitative determination of either naughty or nice.

 

Note: this program uses the last year of tweets, but you should be able to plug any other source of historical data; such as Facebook.  The program analyzes the emotions and tones of what people write online to identify whether they are happy, sad, hungry, and more. In this example Twitter was used because of its importance as a social media channel, the prolific availability of content from posts, reviews and messages from just about every well-known person on the planet.

 

This application gets data from Twitter and then reviews the data leveraging Watson’s cognitive linguistic analysis in order to identify a variety of tones at both the sentence and document level.  This service is able to detects three types of tones, including emotions anger, disgust, fear, joy and sadness.  It can also identify social propensities; such as, openness, conscientiousness, extroversion, agreeableness, and emotional range.  Finally, it detects language styles in the text such as analytical, confident or tentative. Using these metrics, the program fetches the historical tweets, runs the analysis and If it determines that there is a majority of anger or disgust compared to the level of joy, for example, it will put you on the naughty list.

 

ibmTo implement this project, we will use VSCode (with Ionide) and the dotnet-cli for cross-platform compatibility.  You can download the latest .NET Core SDK and find more details here.

 

The IBM-Watson services provides SDK, but this program will access the API directly using a regular HTTP request to remove any extra dependencies. This decision is based on the fact the IBM-Watson SDK are not yet fully compatible with the dotnet-core, and furthermore, because the API exposed are provided only in synchronous version, which adds more complexity to running multiple request at the same time. This program is written to leverage the asynchronous programming model, accessing the capabilities of the Tone Analyzer service via an HTTP Representational State Transfer (REST) API calls, to analyze multiple chunks of tweets in parallel.

The historical tweets are fetched using the Tweetinvi package, which provides a dotnet-core version of the library.

 

Here the steps to create the project.  The final implementation can be found here

  1. Create a new project with dotnet-core

                 dotnet new console -lang F# -o SantaListAnalyzer

  1. Then add the packages needed to run the program

                dotnet add package System.Net.Http

               dotnet add package System.Configuration.ConfigurationManager

               dotnet add package System.Text.RegularExpression

               dotnet add package Newtonsoft.Json

               dotnet add package TweetinviAPI

  1. Build the project, which allows ultimately to get code completion in VSCode

              dotnet build

 

Before you start coding, lets take care of the account set up.

 

IBM-Watson account set up:

IBM has tied Watson to the cloud development environment “Blue Mix”, which is a full-service suite of applications. We are interested in the Tone Analyzer, but there are other options available, you should take a look here

To have free access to the IBM cloud development environment, you need to register and create an account here

After the registration, you will receive an email to confirm the account generated, and then you are good to start.

The account gives you 2500 free calls per month, which is quite a large number to experiment and have fun with the API. You can find more information about the Tone Analyzer here

 

 

Twitter development account set up:

This tutorial uses tweets as the data source. To pull tweets programmatically from Twitter, you use the Twitter APIs, for which you need OAuth credentials. If you do not have a Twitter account, you can sign up here

Once you have a valid account, the next thing you need to do is to create an API key for your application. With your log-in active, head here

Get the access tokens by following the instructions here . When you are done, you should have the following four credentials: consumer key, consumer key secret, access token, and access token secret.

 

Tweetnvi library set up:

Tweetinvi is a .NET library used to access the Twitter REST API.

We will use the dotnet-core version, which can be used for development on different platforms. This library can be found here

We have already added the Tweetnvi package to the library using the dotnet command line.

 

Now you have all that you need to start coding.

Below is the implementation of the program broken out in parts for ease of explanation. The full code implementation can be found here

The ToneEmotionScores is a record type that carries the tone score for different emotions. The Zero static member and the (+) overloading are used to define “monoid”[1] properties. These properties ease the aggregation of multiple ToneEmotionScores types resulting from the execution of parallel operations. The pattern used is called fork/join, which will be further explained. In this case, the plus (+) operator is associative (and commutative), which guarantees a deterministic aggregation of parallel operations, because the order of computation is irrelevant. You can find more information in chapter 5 “Data Parallelism” my book (link here).

 

 

 

NOTE In functional programming, the mathematical operators are functions. The + (plus) is a binary operator, which means that it performs on two values and manipulates them to return a result. A function is associative when the order in which it is applied does not change the result. This property is important for reduction operations. The + (plus) operator and the * (multiply) operator are associative because:

(a + b) + c = a + (b + c)

(a * b) * c = a * (b * c)

A function is commutative when the order of the operands does not change its output, so long as each operand is accounted for. This property is important for combiner operations. The + (plus) operator and the * (multiply) operator are commutative because:

a + b + c = b + c + a

a * b * c = b * c * a

Why does this matter?

This matters because, using these properties, it is possible to partition the data and have multiple threads operating independently on their own chunks, achieving parallelism, and still return the correct result at the end. The combination of these properties permits the implementation of a parallel pattern like divide-and-conquer, Fork/Join, and MapReduce.

 

Monoids:  Using math to simplify parallelism

The property of association leads to a common technique, known as a monoid (https://wiki.haskell.org/Monoid), which works with many different types of values in a simple way. The term monoid (not to be confused with monad https://wiki.haskell.org/Monad) comes from mathematics, but the concept is applicable to computer programming without requiring any background in math. Essentially, monoids are operations whose output type is the same as the input, and which must satisfy some rules: associativity, identity, and closure.

 

Here we have defined the discriminated unions Emotions and SantaList, which are used to handle the different tone emotion cases and to apply the final result of the evaluation; whether you have been Naughty or Nice.

Next, we define a few helper functions used to clean up the Tweets to facilitate the tone analysis. The maxSizeSentence is a value used to constrain the max text length allowed to send to the IBM-Watson API.  The other values are used to initialize the credential to access both the IBM-Watson and Twitter development API.

This function naughtyOrNiceSelector, as you can imagine by the name, is responsible to calculate if a twitter handle has been either naughty or nice. In this code, we are simply verifying if the last year of tweets for a user contains more anger and disgust in the tone as compared to a joyful tone.

The following function getTweetHistory fetches the tweets from the last 12 months. Because the API can fetch a max of 3200 tweets per request, we are recursively sending a request keeping track of the time for the previous request, and then subtracting the time of the last tweet obtained until all the tweets from one year period have been retrieve. In this way, we are able to collect all the tweets, even if the given twitter handle has produced more than 3200 tweets in the past year. The result of this function is a set of tweet chunks, where each chunk could have a max number of 3200 tweets.

The function tweetCleanup aims to clean up the mined tweet data. In fact, the tweet text could potentially, and most likely, contain several strings that should not be considered for any textual analysis. For example, the word “RT” is not relevant for sentiment analysis, neither are phrases that represent URLs. In this step, clean up the tweet text so that the Watson services can analyze the text better.

The following functions tweetToAnalyzeMerger and tweetPartitioner, respectively aggregate all the tweet chunks together and then partition the merged data into tailored string chunks. This size of each portion of text is scoped not exceed the max value “maxSizeSentence” that the IBM-Watson service can analyze at one time.

The following function emotionAnalyzer is the final and the core piece of the program. In fact, this function connects and feeds the Tone Analyzer to get an analysis of all the tweets passed and derive scores for the sentiments expressed in the text.

The set of “tweetChunks” are passed into a Seq.map function to transform the input sequence into a sequence of Async types, which are processed in parallel using the Async.Parallel operator.

As mentioned, because the result type of these operations is represented by a ToneEmotionScores type, which has monodoial properties, we can merge the results simply with the Array.reduce(+) function.

The asynchronous tone analysis operations run in a fork/join fashion, which is a pattern that aims to split, or fork, a given data set into chunks of work so that each individual chunk of work is executed in parallel. After each parallel portion of work is completed, the parallel chunks are then merged, or joined, together..The Fork/Join pattern splits a task into subtasks that can be executed independently in parallel. Then, when the operations complete, the subtasks are joined again. It is not a coincidence that this pattern is often used to achieve data parallelism. In fact, there are clearly some similarities. Ultimately, the last a point-free function [2] naughtyOrNiceAnalisys composes all the previous defined functions together to run the analysis.

In this case the tweets are retrieved in batches of 1000, but this is an arbitrary number that can be changed.

Well done!  Now, let’s run the program and check who has been Naughty or Nice this year….

 

What about Don Syme  ??? try it yourself to check ..!!!

 

Enjoy and Happy holidays!!!!

 

[1] https://en.wikipedia.org/wiki/Monoid

[2] https://wiki.haskell.org/Pointfree

Leave a Reply

Your email address will not be published. Required fields are marked *