BigData.be project n°2 – Real-time analytics with Storm – workshop 2
Tonight, in Ghent [0], we have been invited by Wim [9], Daan, Kenny (and certainly many others) to tackle several challenges regarding real-time analytics with tools enabling Big Data analysis:
- Running Storm [6] in distributed mode (a minimal submission sketch follows this list)
- Creating a web service for sentiment analysis [7]
- Defining business logic for positive/negative market movements
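To make the first challenge concrete, here is a minimal sketch of how a Storm topology can be wired and then submitted either to a local in-process cluster (for development) or to a real cluster in distributed mode. The TwitterSpout and SentimentBolt classes are placeholders for components we still have to write; they are assumptions for the example, not code that exists in the repository today.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class TweetScoringTopology {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Hypothetical components: a spout reading the Twitter stream and a
        // bolt scoring the tweets. Both names are placeholders.
        builder.setSpout("tweets", new TwitterSpout(), 1);
        builder.setBolt("scoring", new SentimentBolt(), 4)
               .shuffleGrouping("tweets");

        Config conf = new Config();

        if (args.length > 0) {
            // Distributed mode: ship the topology to the cluster defined in
            // storm.yaml and spread it over several worker processes.
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Local mode: run everything in-process, handy while developing.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("tweet-scoring", conf, builder.createTopology());
            Thread.sleep(60000);
            cluster.shutdown();
        }
    }
}
```

The same jar can thus be tested locally first and later submitted to the cluster with the standard storm jar command once distributed mode is up and running.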
For non-IT people, this means that we want to:
- Read tweets published on the Internet [2]
- Filter the tweets containing the names of publicly traded companies like Neuhaus (chocolate), AB InBev (beers), (French fries), or Umicore (the most sustainable company in the world), to name a few Belgian companies (a stream-reading sketch follows this list)
- Evaluate the content of the tweet based on 3 criteria:
- a sentiment based on the words used inside the tweet
- the number of retweets of that initial tweet
- the number of followers of the person who retweets the initial tweet
- Compute a global score from our analysis that might be used to act upon such information, e.g. call the police, phone the parents when their children have reached their destination, or buy or sell a stock…
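As an illustration of the first two bullets, the sketch below reads the public Twitter stream with the Twitter4J client for the Streaming API [2] and keeps only tweets that mention a few company names. The keyword list and the printed fields are assumptions made for the example; the real project will define its own list and will hand the tweets to the Storm topology instead of printing them.

```java
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class CompanyTweetReader {

    public static void main(String[] args) {
        // OAuth credentials are expected in twitter4j.properties;
        // they are deliberately not shown here.
        TwitterStream stream = new TwitterStreamFactory().getInstance();

        stream.addListener(new StatusListener() {
            public void onStatus(Status status) {
                // In the real pipeline the tweet would be emitted into the
                // Storm topology; here we only print the fields we care about.
                System.out.println("@" + status.getUser().getScreenName()
                        + " (" + status.getUser().getFollowersCount() + " followers, "
                        + status.getRetweetCount() + " retweets): "
                        + status.getText());
            }

            public void onDeletionNotice(StatusDeletionNotice notice) { }
            public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }
            public void onScrubGeo(long userId, long upToStatusId) { }
            public void onStallWarning(StallWarning warning) { }
            public void onException(Exception ex) { ex.printStackTrace(); }
        });

        // Illustrative keyword list with the Belgian companies mentioned above.
        stream.filter(new FilterQuery().track(new String[] { "Neuhaus", "AB InBev", "Umicore" }));
    }
}
```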
If you are interested in the early results, check the source code repository [4]: https://github.com/bigdatabe
The repository will grow until we deliver an open-source, working implementation of the current functional analysis (I may be dreaming here).
Here is the result of a bit more than an hour of functional analysis…
The functional analysis
- The Twitter stream is gathered (e.g. “NSA & FBI have tapped into servers of tech firms to gain intelligence data; firms include $GOOG, $YHOO, $MSFT, $AAPL, $FB.” – @WashingtonPost)
- The system determines which tickers are mentioned in the tweet (e.g. $AAPL is Apple, $GOOG is Google); a ticker-extraction sketch follows this analysis
- Determine the language of the tweet (e.g. EN)
- Filter the tweets based on our list of supported languages; we reject all the tweets that can’t be understood (i.e. only NL, FR and EN tweets will be analysed)
- In parallel, we perform the following tasks in order to come up with a final score for the tweet:
- (A) Split the tweet by language
- Split the tweet into words
- Value the sentiment of each word using a text corpus [3]. Note that Google provides the largest English corpus at 155 billion words, and also has corpora for other languages [5]. Each valuation produces:
- Tweet id
- Company id (i.e. AAPL)
- Word id: position of the word in the tweet (e.g. 1, 2, 3…)
- Value: the value can be positive or negative.
- Calculate the tweet’s score; i.e. take the total of the word values and divide it by a predefined value (e.g. the total number of relevant words like retina display, chromebook pixel…). This step emits:
- Tweet id
- Company id
- Score type: [Sentiment score]
- Sentiment value (e.g. a positive or negative floating point: 0.8)
- (B) Calculate the retweet score
- Tweet id
- Company id
- Score type: [Retweet score]
- Score value (e.g. a positive or negative floating point: -0.2)
- (C) Calculate the follower score
- Tweet id
- Company id
- Score type: [Follower score]
- Score value (e.g. a positive or negative floating point: 0.6)
- We combine the Sentiment, Retweet and Follower scores into a final score using a predefined computation model (e.g. a weighted sum with one weight per score type: (0.8 * 2) + (-0.2 * 0.3) + (0.6 * 1) = 2.14); the scoring sketch after this analysis reproduces this computation
- The outcome of this process will be transmitted to another component:
- Company id (e.g. AAPL)
- Score: (2.14 in our case)
- Now the analyst can decide and act upon the score, e.g. call the police, phone the parents when their children have reached their destination, or buy or sell a stock…
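Here is a small, self-contained sketch of the steps above: detect the tickers with a regular expression, value each word against a hard-coded and purely illustrative sentiment map (the real analysis would use a corpus [3][5]), normalise the tweet score by the number of relevant words, and combine the three score types with the weights 2, 0.3 and 1 from the example. Class and method names are our own assumptions, not the repository’s implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetScoring {

    // Cashtags such as $AAPL or $GOOG embedded in the tweet text.
    private static final Pattern TICKER = Pattern.compile("\\$([A-Z]+)");

    // Tiny illustrative word valuation; the real analysis would use a corpus.
    private static final Map<String, Double> WORD_SENTIMENT = new HashMap<String, Double>();
    static {
        WORD_SENTIMENT.put("gain", 0.9);
        WORD_SENTIMENT.put("intelligence", 0.7);
        WORD_SENTIMENT.put("tapped", -0.4);
    }

    /** Determine which tickers are mentioned in the tweet. */
    public static List<String> extractTickers(String tweet) {
        List<String> tickers = new ArrayList<String>();
        Matcher matcher = TICKER.matcher(tweet);
        while (matcher.find()) {
            tickers.add(matcher.group(1));
        }
        return tickers;
    }

    /** (A) Sum the word values and divide by the number of relevant words. */
    public static double sentimentScore(String tweet) {
        double total = 0.0;
        int relevantWords = 0;
        for (String word : tweet.toLowerCase().split("\\W+")) {
            Double value = WORD_SENTIMENT.get(word);
            if (value != null) {
                total += value;
                relevantWords++;
            }
        }
        return relevantWords == 0 ? 0.0 : total / relevantWords;
    }

    /** Final step: one weight per score type, as in the 2.14 example. */
    public static double finalScore(double sentiment, double retweet, double follower) {
        return (sentiment * 2) + (retweet * 0.3) + (follower * 1);
    }

    public static void main(String[] args) {
        String tweet = "NSA & FBI have tapped into servers of tech firms to gain "
                + "intelligence data; firms include $GOOG, $YHOO, $MSFT, $AAPL, $FB.";

        System.out.println("Tickers:   " + extractTickers(tweet));       // [GOOG, YHOO, MSFT, AAPL, FB]
        System.out.println("Sentiment: " + sentimentScore(tweet));       // (0.9 + 0.7 - 0.4) / 3 ~ 0.4
        System.out.println("Final:     " + finalScore(0.8, -0.2, 0.6));  // (0.8*2) + (-0.2*0.3) + (0.6*1) ~ 2.14
    }
}
```

In the Storm implementation each of these methods would live in its own bolt, so the sentiment, retweet and follower scores can be computed in parallel before the final combination.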
During the analysis we didn’t consider…
- The semantic linkage between words (e.g. “retina display” [8]) and the fact that those words relate to a certain company (e.g. Apple)
- The absolute and relative values we should consider before acting on the information
The early mistakes we made …
- Inserting too much complexity from the beginning (what if… what if… what if…)
- Including technical requirements in the functional requirements (IT guys never change)
The functional team
- Stijn https://twitter.com/stijnbe
- Lode https://twitter.com/lodeblomme
- Wouter
- Daan https://twitter.com/daangerits/
- Abdelkrim https://twitter.com/abdelkrim
Further links
[0] Ghent is located in the Flemish Region of Belgium: https://en.wikipedia.org/wiki/Ghent
[1] Big Data meetup: http://www.meetup.com/bigdatabe/events/122726552/
[2] Twitter Streaming APIs: https://dev.twitter.com/docs/streaming-apis
[3] What is a text corpus? http://en.wikipedia.org/wiki/Text_corpus
[4] GitHub repository of Big Data BE: https://github.com/bigdatabe
[5] Google’s text corpora, freely available: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
[6] Storm is a free and open-source distributed real-time computation system: http://storm-project.net/
[7] What is sentiment analysis? https://en.wikipedia.org/wiki/Sentiment_analysis
[8] What is the Retina Display? https://en.wikipedia.org/wiki/Retina_Display
[9] Wim, organiser of the Belgian Big Data meetup: http://www.meetup.com/bigdatabe/members/17216461/