Topic analysis and trend summarization in enterprise social networks


October 2012

Dmitry Brusilovsky, Lana Bogouslavski, Amir H. Razavi, Diana Inkpen


Summary



Business Intelligence Solutions, together with University of Ottawa researchers Diana Inkpen and Amir H. Razavi, developed a framework for text analytics aimed at analyzing topics and summarizing trends in enterprise social media/collaboration platforms. We focused on the automatic detection of topics in the text of posts and their follow-up comments (that is, whole threads), as well as the automatic extraction of key phrases that summarize each thread. The framework was implemented in a software tool that will later be integrated by Business Intelligence Solutions into the call center support systems of their clients. This will allow better decision making and better customer support services, since call center agents will have real-time access to similar or relevant problems, solutions, and opinions previously discussed by clients and/or other agents.

1. Introduction


Today, social networks and services have become an increasingly important part of how users spend their time online, and social networks have become a vehicle for people to share their interests, thoughts, or pictures with their friends or even with the public. Posting messages, photos, videos, and audio has become a daily activity in many developed and developing countries. These postings are normally made through social networks via personal computers, tablets, or cellular phones. Social networking sites are increasingly transforming into social networking services, bringing more information to users through all available means. At the same time, in order to present the best set of features of a social network or service, and to maintain proper control over such a vast interface, social network analysis plays an important role.
In parallel, the concepts of social media are being actively adopted by the enterprise, and many companies are implementing enterprise social media platforms. The market for enterprise social media collaboration software is growing fast. In this project, we focused our efforts on the unstructured part of enterprise social networks: textual postings and their related comments. We designed methods for automatically extracting key phrases and general topics, in order to allow for structured analysis of the text. Although a human could recognize these key phrases and topics with some degree of certainty, it is not an easy task for a computer system.
We implemented our methods for topic detection and for summarization (via key phrases) in Java. For the first task (topic detection), we embedded several machine learning algorithms from the freely available Weka Java library. We provide an already-trained model file, so that the system can classify a social network thread (a posting and its follow-up comments) into one of the general topics.

2. Data sets


For this project, we were looking for a source of social network data that includes postings and their related comments, in which each main posting can be connected to its corresponding comments in order to form a thread; the connection is made via parent/posting identifier (id) information items. After an extensive search for such a dataset, we settled on two different sources of data, described in the following sub-sections.

2.1. Friendfeed Dataset


We located and downloaded a large amount of data from Friendfeed.com, from which we extracted the information items useful to our project (only the main entries and the related comments) through multi-level filtering. The source data was in more than 11 different languages; therefore we initially applied a language identification procedure, removed threads that were even partially commented in other languages, and, in the final stage, kept only threads that were entirely in English. The data collection process was performed in the following steps:
  • Downloading the Friendfeed raw data (~ 23 GB in compressed format) from the Internet;

  • Decompressing the data in order to extract three files of Main posts and one file of comments which are linked based on Post_id;

  • Writing some scripts in order to load those four files in a structured format in a database;

  • Creating tables and loading data into the tables;

  • Creating all the necessary indexes in order to cope with the large size of the data;

  • Filtering out all threads that are missing main posts (Main_Post equals Null);

  • Joining the Posts table (12,450,658 records) with the Comments table (3,749,890 records) in order to combine the main posts with their related comments (this was such a time- and memory-consuming task that we had to break it into smaller parts);

  • Filtering out all the threads without comments (with Null comments);

  • Writing a server-side SQL function to integrate all the comments of a post (from many rows) into one row (thread) in the format: Post_id, Main_Post, all comments separated by commas;

  • Writing Java code to identify the language of each message/comment in a thread using JTCL (http://textcat.sourceforge.net/) and filtering out all postings not in English. (We tried more than one language identifier; in the end we chose JTCL as the most appropriate for the task.) Some postings/comments mixed English and Turkish, which represented an additional challenge at this stage; we filtered them out as well;

  • Filtering out all threads shorter than 120 characters;

  • Filtering out all threads with fewer than three comments (a sketch of this final filtering pass is shown below).
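As an illustration of the last three filtering steps, the following minimal Java sketch keeps only threads that are entirely in English, have at least three comments, and total at least 120 characters. The identifyLanguage() helper is hypothetical; it stands in for the JTCL call, whose actual API may differ.

    import java.util.List;

    public class ThreadFilter {

        /** Hypothetical stand-in for the JTCL call; returns e.g. "english". */
        static String identifyLanguage(String text) {
            return "english"; // delegate to JTCL here
        }

        static boolean isUsable(String mainPost, List<String> comments) {
            if (comments.size() < 3) return false;            // fewer than three comments

            StringBuilder all = new StringBuilder(mainPost);
            for (String c : comments) all.append(' ').append(c);
            if (all.length() < 120) return false;             // thread too short

            if (!"english".equals(identifyLanguage(mainPost))) return false;
            for (String c : comments)                         // every comment must be English
                if (!"english".equals(identifyLanguage(c))) return false;
            return true;
        }
    }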

Finally, we came up with more than 24,000 usable threads to feed to our key-phrase extraction and generic topic detection tasks. We randomly selected a subset of 500 threads for manual topic annotation, to be used for training a classifier, in addition to a random subset of ~4,000 threads selected for estimating the LDA models needed for topic detection, as explained later in Section 4.2.

2.2. SociaLabra Dataset


The second data set consisted of real enterprise social network data, provided by SociaLabra Inc. (www.socialabra.com). The data was initially in .xml format, including all social network information items.

After XML markup filtration and the message/comment extraction process, we obtained a total of 845 threads (main posts, each followed by its related comments). We performed almost all of the preparatory stages listed for the first dataset; however, because of the small size of this data set and the existence of many postings with fewer than three comments (as opposed to the Friendfeed data set, in which each thread has at least three comments per posting), we decided to keep all of them rather than prune them. We also skipped the size filtration for this dataset. The remaining steps of the data preparation process are as follows:

  • We linked the two original files received from the company into one input file. The linkage was based on the Post_id and the Parent_id of the postings;
  • Since the title of the main posting in a thread is repeated at the beginning of each follower posting (comment), we omitted the repetitions and kept the title only once per thread;
  • After integrating the main postings and the follower postings (comments), we came up with 722 threads.

3. Manual Annotation

We randomly selected a subset of 500 threads from the Friendfeed dataset for manual key-phrase extraction and general topic annotation; this subset was used for training and testing a classifier in the next step (for the second task, the identification of the general topic). The class labels were selected based on topics extracted by the LDA method, but they were manually mapped into a set of general topics. The final set of general topics consists of the following categories:

{consumers, education, entertainment, life_stories, lifestyle, politics, relationships, religion, science, social_life, technology}.

All 722 threads of the SociaLabra dataset were also selected for manual key-phrase extraction, for additional testing of our automatic key-phrase extraction module. The manual annotation with topics and expected key phrases was done by our partner Business Intelligence Solutions.

4. Methodology

After preprocessing the datasets, our prototype system executes the following steps: LDA topic model estimation, classification into the general topics, and key-phrase extraction. Each step is explained in the next sub-sections.

4.1. Preprocessing

In the preprocessing stage, all the different headers, internet addresses, email addresses, and tags are first filtered out. Then all the delimiters, such as spaces, tabs, and newline characters, in addition to characters such as \r : ( ) ` ' , ; = [ ] / < > { } | ~ @ # $ % ^ & * _ and the digits 0-9, are removed from each post, whereas expressive characters such as quotation and punctuation marks are kept. Punctuation marks can be useful for determining the scope of speaker messages. This step prevents the system from producing a large number of unrealistic tokens as features for our classifier and for the LDA estimation/inference.

There are two kinds of stop-word removal, performed in two steps: static stop-word removal and domain-specific dynamic stop-word removal. First, we apply a general predefined stop-word list appropriate for the domain we are working with (i.e., social networks). At this stage, we tokenize the posts/comments individually and send them to the static stop-word removal step, which is based on an extensive list of stop words collected specifically for the dataset at hand.
Second, for some steps, such as key-phrase extraction, the stop words are determined dynamically, based on their frequency and distribution inside the input data, for a given tokenization strategy (i.e., unigrams, bigrams, 3-grams, or 4-grams). We remove tokens with a very high frequency relative to the corpus size: because they appear in every topic class, they are not relevant for the topic identification task. A sketch of this step appears at the end of this sub-section.
Depending on the downstream application of the data (e.g., LDA estimation/inference, classification, or key-phrase extraction), the output of this stage may also be stemmed using the Snowball stemming algorithm.
The final output is either an .arff file, used as our training/testing dataset for the classification task, or a standard formatted .txt file, fed to the LDA topic estimation/inference modeling or to the key-phrase extraction procedure.
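The dynamic stop-word step can be sketched as follows: tokens whose document frequency exceeds a relative threshold are treated as domain stop words. The helper below is a minimal illustration; the report does not state the exact threshold, so the maxDocFraction parameter is an assumption.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class DynamicStopWords {

        /** Tokens that occur in more than maxDocFraction of all documents
         *  are treated as domain-specific stop words. */
        static Set<String> findStopWords(List<List<String>> tokenizedDocs,
                                         double maxDocFraction) {   // e.g., 0.5 (assumed value)
            Map<String, Integer> docFreq = new HashMap<>();
            for (List<String> doc : tokenizedDocs)
                for (String token : new HashSet<>(doc))             // count once per document
                    docFreq.merge(token, 1, Integer::sum);

            Set<String> stopWords = new HashSet<>();
            int n = tokenizedDocs.size();
            for (Map.Entry<String, Integer> e : docFreq.entrySet())
                if (e.getValue() > maxDocFraction * n)              // too common to discriminate
                    stopWords.add(e.getKey());
            return stopWords;
        }
    }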

4.2. LDA Topical Modeling

As our first approach to topic extraction from social network threads, we applied the well-known Latent Dirichlet Allocation (LDA) algorithm. LDA was first introduced by David Blei et al. [1]. We used code originally written by Gregor Heinrich [2], based on the theoretical description of Gibbs sampling.
An interesting property of the method is that it lets a word participate in more than one topical subset, based on the word's different senses/usages in context.
The dataset initially selected for the LDA algorithm consisted of 4,000 threads that had already passed the preparation and filtration processes. Each thread is represented by a number of topics, where each topic contains a small set of words; a word can be assigned to more than one topic across the entire input data if it is polysemous. The maximum number of topics and the number of words per topic are two parameters of the system that can be adjusted according to the input data. For our project, the values were set empirically to 50 topic clusters, with a maximum of 15 words per cluster. A minimal sketch of the estimation step follows.
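The following sketch shows how the estimation step can be invoked, written against the interface of Gregor Heinrich's published LdaGibbsSampler [2]; the wrapper code used in this project may differ, and the toy corpus and the alpha/beta values below are illustrative only (K = 50 topics, as set for this project).

    public class LdaEstimation {
        public static void main(String[] args) {
            // Documents as arrays of integer word IDs (cf. wordmap.txt below);
            // a real run would load the ~4,000 preprocessed threads here.
            int[][] documents = {
                {1, 4, 3, 2, 3, 1, 4, 3, 2},
                {2, 2, 4, 2, 4, 2, 2, 2, 1},
                {5, 6, 6, 2, 3, 3, 6, 5, 6}
            };
            int V = 7; // vocabulary size of the toy corpus

            LdaGibbsSampler lda = new LdaGibbsSampler(documents, V);
            lda.configure(10000, 2000, 100, 10); // iterations, burn-in, thin interval, sample lag
            lda.gibbs(50, 2.0, 0.5);             // K topics, alpha, beta (illustrative values)

            double[][] theta = lda.getTheta();   // topic-document distributions (cf. .theta)
            double[][] phi   = lda.getPhi();     // word-topic distributions (cf. .phi)
        }
    }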
The LDA method assigns groups of words (50 groups of 15 words each) as topics, with different weights for each thread. A topical group of words can then be manually interpreted as a real topic. For example, {"Google", "email", "search", "work", "site", "services", "image", "click", "page", "create", "contact", "connect", "buzz", "Gmail", "mail"} is a real topical group extracted by the LDA model estimation process; it was initially interpreted (manually) as the Internet topic, and at the next level of topic generalization it was placed under the technology and social_life categories.
The 50 topics extracted by the LDA method were manually mapped to a set of 10 generic, more human-readable topics. We observed that the 10 class labels (topics) are distributed unevenly over the dataset: 21 threads for consumers, 10 threads for education, 92 threads for entertainment, 28 threads for incidents, 90 threads for lifestyle, 27 threads for politics, 58 threads for relationships, 31 threads for science, 115 threads for social_activities, and 94 threads for technology. Thus, the baseline for any further classification experiment over the dataset may be considered to be 18.8%, the frequency percentage of the major class, technology.
One of the output files of the method (<model_name>.theta2) contains the topic-document distribution: each line is a document and each column is a topic, with the cells of each row sorted in descending order of topic relevancy. This file can easily be used for topical extraction and interpretation of the entries (threads), and can be considered one of the contributions of the current project.
However, LDA modeling does not assign one general topic (e.g., entertainment) or even a key phrase to a thread; hence the assignment of general topics (e.g., one of the 10 class labels) and of key phrases are two further tasks, carried out through separate classification and key-phrase extraction processes.
The output of the LDA method includes the following files:

<model_name>.others

<model_name>.phi

<model_name>.theta

<model_name>.tassign

<model_name>.twords

in which:

  • <model_name>: the name of an LDA model, corresponding to the time step at which it was saved on the hard disk. For example, the model saved at the 400th Gibbs sampling iteration is named model-00400; similarly, the model saved at the 1200th iteration is model-01200. The model of the last Gibbs sampling iteration is named model-final.
  • <model_name>.others: contains some parameters of the LDA model, such as: alpha=? (LDA parameter), beta=? (LDA parameter), ntopics=? (the number of topics), ndocs=? (the number of documents), nwords=? (the vocabulary size), liter=? (the Gibbs sampling iteration at which the model was saved)
  • <model_name>.phi: contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic; each column is a word in the vocabulary
  • <model_name>.theta: contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document; each column is a topic
  • <model_name>.theta2: contains the same topic-document distributions, p(topic_t | document_m), with each line a document and each column a topic, but with the cells of each row sorted in descending order of topic relevancy. Each cell starts with the topic number, then ':', then the relevancy weight of the topic, which can easily be used for topic extraction and interpretation of the entries (threads); a parsing sketch appears at the end of this sub-section.
  • <model_name>.tassign: contains the topic assignments for the words in the training data. Each line is a document consisting of a list of <word_ij>:<topic of word_ij> pairs
  • <model_name>.twords: contains the most likely words of each topic, i.e., the most expressive words of each topic extracted by the LDA method. The number of topics was set to 50 as one of the parameters of the method in this project.

    The method also saves a file called wordmap.txt that contains the mappings between words and their integer IDs, because the system internally works with integer word/term IDs instead of text strings.
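Given the cell format described above (topic number, then ':', then the relevancy weight), the .theta2 file can be consumed programmatically. The following is a minimal sketch under the stated file format; the class and method names are ours, not part of the LDA package.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class Theta2Reader {

        /** Returns, per document, the (topicId, weight) pairs in the
         *  descending-relevancy order in which the file stores them. */
        static List<List<double[]>> read(String path) throws IOException {
            List<List<double[]>> docs = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    List<double[]> topics = new ArrayList<>();
                    for (String cell : line.trim().split("\\s+")) {
                        String[] parts = cell.split(":");   // "topicId:weight"
                        topics.add(new double[] {
                            Double.parseDouble(parts[0]),   // topic number
                            Double.parseDouble(parts[1])    // relevancy weight
                        });
                    }
                    docs.add(topics);
                }
            }
            return docs;
        }
    }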



4.3. Topic Classification


As mentioned before, the classification training dataset consisted of 500 threads. For this data, we initially applied a variety of Bag-of-Words (BOW) representations (i.e., binary, frequency-based, and TF-IDF-based) to create the most discriminative representation over the entire 500-thread dataset. The TF-IDF method (term frequency times inverse document frequency) is a classic information retrieval weighting scheme that gives higher weights to terms that are frequent in a document but rare in the whole text collection. The intuition behind it is that a term that appears in few documents is very specific, and is therefore relevant for the topic of those documents.
For evaluation and comparison purposes, several representations were selected. First, the BOW representation for which we also applied the Snowball stemming algorithm in order to reduce the feature space. After removing stop-words and stemming, we obtained 6573 words as the feature set for the classification.
As the second representation of the same data, we transformed the entire BOW feature space into a set of 50-topic distribution vectors (that is, 50 features for classification) using the LDA technique (the output data extracted at this stage included only the 50 most important transformed dimensions, though LDA can provide more). We evaluated this second representation and reserved it for the final comparison between the two representations. Then we integrated the two representations into one .arff file, consisting of 6,623 features (words and topics), to test the classification performance over the integrated representation. We applied almost the same procedure to the SociaLabra dataset.
The last step was applying the Synthetic Minority Oversampling TEchnique (SMOTE) [4] to the class labels with frequencies lower than average, and an under-sampling method to those with frequencies higher than average, in order to obtain an evenly distributed dataset.
As part of the supervised machine learning core of the system, we trained a variety of classifiers in order to increase the general topic detection ability of the system.
As an additional experiment, we ran the LDA method over the FriendFeed dataset integrated with the SociaLabra dataset as a background resource, in order to improve the representation power of the LDA topical distribution vectors.
Since most of the computational load runs prior to detection (offline), the classification system can easily be applied in online interactive applications. A minimal sketch of the evaluation setup is shown below.
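The classification experiments can be reproduced with a few lines of Weka code. The following sketch shows the evaluation loop for one representation/classifier pair; the .arff file name and the seed value are illustrative, and the SMOTE/under-sampling filters are omitted for brevity.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TopicClassification {
        public static void main(String[] args) throws Exception {
            // Illustrative file name; the class label is the last attribute.
            Instances data = DataSource.read("threads_bow_tfidf.arff");
            data.setClassIndex(data.numAttributes() - 1);

            SMO svm = new SMO();                          // the SVM(SMO) classifier
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(svm, data, 10, new Random(1)); // seed varied across runs

            System.out.println(eval.toSummaryString());
            System.out.printf("Weighted F-measure: %.3f%n", eval.weightedFMeasure());
        }
    }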

4.4. Key Phrase Extraction


After the stages of data cleansing (data preparation and pre-processing) on both data sets (Friendfeed and SociaLabra), we ran a key-phrase extraction program to extract a summary of each thread.
The key-phrase extraction method extracts 1- to 4-grams (one to four consecutive words occurring in a thread) as potential key phrases. The candidate key phrases are assigned a specificity weight and ranked in descending order, so that the first key phrases listed for each thread are the most specific candidates. The number of key phrases extracted per thread is not constant: it varies based on a combination of a set threshold and the length and content of each thread. The maximum number of key phrases is a parameter of the system that can be adjusted to the input data; the maximum value set in this project was 20.
The core specificity weight calculation is based on the TF-IDF algorithm. The TF-IDF value of a word i in document j is calculated as tfidf(i, j) = tf(i, j) × log(N / df(i)), where tf(i, j) is the frequency of word i in document j, N is the total number of documents in the collection, and df(i) is the number of documents containing word i. On top of this basic weighting, the following refinements are applied (a scoring sketch follows the list):
  • Multi-gram tokenization versus unigrams;

  • Domain-specific dynamic stop-word removal (based on local frequencies and/or class labels);

  • Document length normalization;

  • Stemming to the base form of the word, versus simple truncation of the suffixes;

  • Using a related rich corpus as background to improve the TF-IDF weights (if the dataset size is not large enough).
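To make the core weighting concrete, the following minimal sketch scores the candidate n-grams of one thread with the basic TF-IDF formula given above. The class and parameter names are illustrative; the refinements from the list and the longer-phrase boost described next are omitted.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class KeyPhraseScorer {

        /** Basic TF-IDF scoring of candidate n-grams:
         *  tfidf(i, j) = tf(i, j) * log(N / df(i)). */
        static Map<String, Double> score(List<String> candidateNgrams,  // 1- to 4-grams of one thread
                                         Map<String, Integer> docFreq,  // df over the whole corpus
                                         int totalDocs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String ngram : candidateNgrams)
                tf.merge(ngram, 1, Integer::sum);

            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                int df = docFreq.getOrDefault(e.getKey(), 1);
                weights.put(e.getKey(),
                            e.getValue() * Math.log((double) totalDocs / df));
            }
            return weights; // rank candidates by descending weight, keep the top <= 20
        }
    }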

At the final stage of our key-phrase extraction process, we add a filtration part that boosts the weights of longer key phrases relative to other key phrases contained inside the longer ones (e.g., Roger rabbit doll vs. Roger, Rabbit, Doll, Roger rabbit, Rabbit doll). This part improves the performance by increasing the specificity and variety of the output key phrases.

The key-phrase extraction has been applied to the FriendFeed and SociaLabra datasets, and the results were evaluated manually.

4.5. Approximate Matching for Evaluation


In order to have a realistic automatic evaluation of the key-phrase extraction method, we also developed an approximate string matching program, based on a Viterbi (Bellman) dynamic programming algorithm, to be used for matching different words with the same stems. The program can also match key phrases with different internal word orders, as well as sub-phrases of a longer key phrase. A minimal sketch of a dynamic program of this kind is shown below.
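As an illustration of the dynamic-programming core, the sketch below computes the classic Levenshtein edit distance between two (stemmed) phrases. This is a minimal stand-in for the matcher described above, which additionally handles word reordering and sub-phrase matching.

    public class ApproxMatch {

        /** Classic Levenshtein edit distance via dynamic programming. */
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int subst = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                                d[i][j - 1] + 1),   // insertion
                                       d[i - 1][j - 1] + subst);    // substitution
                }
            return d[a.length()][b.length()];
        }
    }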

5. Results and Discussion


We initially ran our comparative classification experiments on the original data, to get a sense of the performance on unbalanced data. This dataset consisted of the 500 filtered Friendfeed threads described above, manually labeled with one of the 10 topical categories (class labels) for training and testing purposes. We conducted classification evaluations using stratified 10-fold cross-validation (the classifier is trained on nine parts of the data and tested on the remaining part; this is repeated 10 times for different splits, and the results are averaged over the 10 folds) and on a separate test set of 250 threads. We performed several experiments with a range of classifiers and parameters for each representation to check the stability of classifier performance, and we varied the seed parameter of the 10-fold cross-validation in order to avoid accidental over-fitting. The evaluation measures calculated for each classifier are shown in the following table.



Representation / Classifier                                    TP Rate   FP Rate   Precision   Recall   F-Measure   Accuracy %
-------------------------------------------------------------------------------------------------------------------------------
LDA Topics / CompNB                                             0.422     0.093     0.413       0.422    0.402        42.20
BOW (TF-IDF) / CompNB, over- & under-sampling                   0.772     0.025     0.744       0.772    0.743        77.22
BOW (TF-IDF) / MNNB, over- & under-sampling                     0.797     0.023     0.786       0.797    0.778        79.66
BOW (TF-IDF) / SVM (SMO), over- & under-sampling                0.800     0.022     0.786       0.800    0.790        80.00
LDA Topics / CompNB, over- & under-sampling                     0.488     0.057     0.489       0.488    0.467        48.77
LDA Topics / MNNB, over- & under-sampling                       0.508     0.055     0.514       0.508    0.500        50.77
LDA Topics / SVM (SMO), over- & under-sampling                  0.539     0.051     0.533       0.539    0.529        53.88
LDA Topics / J48, over- & under-sampling                        0.482     0.058     0.475       0.482    0.475        48.22
LDA Topics / AdaBoost (J48), over- & under-sampling             0.693     0.034     0.679       0.693    0.684        69.33
LDA Topics / CompNB, 250-thread unseen set, original distrib.   0.396     0.084     0.408       0.396    0.381        39.60

All rate measures (TP Rate, FP Rate, Precision, Recall, F-Measure) are weighted averages over the 10 classes.

Table 1. Comparison of the classification evaluation measures for different representation/classification methods.



The automatic key-phrase extraction on the Friendfeed data achieved an accuracy of a bit over 47%, measured by the overlap (agreement) between the 7,093 key phrases automatically extracted by the system and the manually assigned key phrases. The estimated accuracy of the key phrases extracted from the SociaLabra data was 64%, measured with the help of the approximate matching algorithm. This performance is quite competitive with similar studies in the related literature, because key-phrase extraction is a difficult task, and one that is also difficult to assess.
In the above classification results, the performance is clearly influenced by the distribution of the data (the imbalance rate of the class labels). This is why balancing the data with under-sampling/oversampling worked well, raising the accuracy to 80%.
The other issue to take into account is the degree of uncertainty in the annotation task (even for human beings), which has roots in three important aspects:
  • the topics in our list of 10 could sometimes be too general;

  • the nature of the scattered social network postings (informal and incorrect text, unexplained abbreviations, etc.);

  • the subjectivity of human annotations. Discrepancies between human annotations (of the same problem definition) can stem from the annotators' different personalities, moods, backgrounds, and other subjective conditions. Human judgment is subjective and is not necessarily the same across different people for the same case.

According to the related literature, when threads are annotated by more than one human annotator, the expected agreement between judges is normally around 60-85% on different datasets. It is therefore helpful to have a standard annotation system that always annotates based on the same definitions, patterns, and rules, as our automatic system does.

6. System Advantages

  • The LDA method automatically assigns topics to the input documents (via small groups of words clustered together). We manually interpreted and generalized these into 10 high-level classes (shown in Section 4.2). This manual annotation can be projected onto more documents, in order to increase the amount of annotated data. When moving to a new domain (dataset), this is a considerable advantage of the method, since the number of topical groups that need to be manually mapped is far smaller than the number of corpus entries that would need to be annotated to build a training set manually.

  • Each document (thread) can be represented by the LDA weighted membership distribution over the topical word groups; hence any other high-dimensional vector representation of the documents can be replaced by its LDA weighted membership distribution in order to reduce the dimensionality. The result can then be fed to supervised/unsupervised machine learning algorithms that cannot cope with high-dimensional data.

  • The performance of the topic and key-phrase extraction systems can be improved simply by adding a similar source of data as background.

  • The system can be used for either post-level or thread-level analysis.

  • Our program is not very sensitive to punctuation and grammatical mistakes.

  • The major part of the computational load of the system is on the batch-processing side (offline); therefore the system could also be adapted for online/interactive internet applications.

  • The system is implemented in Java; therefore it can benefit from the many available related libraries for further customization and expansion.

  • The system produces step-by-step intermediate and final output files from all available functions, to improve the ability to monitor the system.


7. System Limitations

  • The system does not explicitly exploit the syntactic structure of the messages; it could be equipped with modules for noun/verb phrase detection or even named-entity recognition (NER) based on the corresponding lexicons (in this case, we would have to take into account that the short length of each post would be a limitation for the system).

  • The current system's design (including the LDA topical estimation, classification, and key-phrase extraction) is based on case-insensitive text. It could be extended to case-sensitive text for more precise information extraction and higher system quality.


8. Future Work


In the future, we could replace the manual interpretation of the LDA topical word groups with automatic topic assignment, by leveraging dictionaries such as WordNet Domains. We are also going to integrate the distributional information outputs of the LDA model to improve the classification features/models and, consequently, the general topic assignment task. We could also add opinion analysis, if opinions related to the extracted key phrases appear in the threads.

9. Applications


Our topic detection and key-phrase extraction systems can be applied for summarization or trend detection purposes on any collaborative writing web site in which people add or modify content in the style of posts/comments. They could also be handy for web logs (blogs) or specialist forums. The system could also be adapted for message categorization or even spam detection, for any type of text messaging service on the internet or on cellular phones. It could also be useful for comment submission portals on network builder sites, or for review collection services over the web.

10. Conclusions


We designed and implemented an efficient topic detection and key-phrase extraction system over two social network textual datasets. The system applies LDA topical modeling estimation/inference for topic detection, in addition to the well-known TF-IDF method. The system also benefits from a range of available classification algorithms for the general topic detection task. The performance of the system can be boosted by using a similar textual collection as a background resource.
The system is useful for standard topic annotation and key-phrase extraction applications, mostly in messaging services and collaborative writing web sites.