From time to time, we all face the same problem - data analysis: how to find useful information in a shapeless mass of data. It is a typical problem when retrofitting spreadsheet-like data storage.

[Image: a sieve. Source: https://www.solidswiki.com/index.php?title=Sieving]

Importing it into a relational database is quite easy, but the resulting data quality and queryability are usually very low.

I faced exactly this problem recently. In my case it was analyzing feedback responses. Essentially, there was a substantial quantity (50k) of entries that were mainly plain text.

Look into your data

My first step was to look into the data itself - just a brief look around to gather as much knowledge as possible about the nature of the data I was going to work with: quality, typical typos, data types, points of interest - anything that would be advantageous during further phases of work.

During this phase I found a few interesting facts:

  • entries were written without any spell-check,
  • data structure was tree-like rather than table-like,
  • data was contextually redundant, but plain text analysis would not show it.

Starting with a naked-eye analysis was really worthwhile - human pattern recognition is quite fast at spotting these things.

Importing data

As usual, I had limited time for this exercise. My idea was to use text data mining as a set of techniques for data processing and data modeling.

Due to the data's specificity (its tree-likeness), I started by importing the data into MongoDB using mongoimport from a CSV text file.

mongoimport -d misc-data -c <collection-name> --type csv --file <filename.csv> --headerline

This gave me the ability to query the data using the find function with some basic filters and projections, because not all of the data was interesting from my point of view.

db.getCollection('collection-name').find({<some filters here>},{ <some projection here>}).map( function(u) { return u["Some prop"]; } )

The next step was to clean up the data entries. This was more data exploration using the Node REPL than a carefully designed approach, but overall it came out pretty nicely.

I decided that, as the end result, I needed plain statistical data about the usage of some keywords, based on a structural model.

Cleaning up the mess

I started with each entry as a single entity (joined with .reduce()) and then applied regexes to remove URLs, whitespace and other unimportant parts.

var str = line.reduce((prev, curr) => prev + " " + curr);
var lower = str.toLowerCase();
var no_url = lower.replace(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#$?&//=]*)/gi, " ");
var clean = no_url.replace(/(\.|:|\(|\)|\\|\/|\-|<|>|\&)/g, " ");

The last thing was to produce tokens from the cleaned-up entry.

var tokens = clean.split(/\s+/).filter(Boolean); // split on whitespace runs and drop empty tokens

Removing common words

The next thing was to remove the top 100 common English words like 'a', 'the', 'in' etc. from the token list. I used a library called common-words. The code looks more or less like this.

var common = require('common-words');

function removeCommonWords(words, common) {
  common.forEach(function (obj) {
    var word = obj.word;
    while (words.indexOf(word) !== -1) {
      words.splice(words.indexOf(word), 1);
    }
  });
  return words;
};
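
Calling it looks more or less like this, with tokens being the array from the previous step (the filteredTokens name is mine):

var filteredTokens = removeCommonWords(tokens, common);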

Look into your data - again

The rest was pretty standard. I just created a Map with all tokens and a count for each of them.
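
A minimal sketch of that counting step, assuming the filteredTokens array from above, could look like this:

var counts = new Map();
filteredTokens.forEach(function (token) {
  counts.set(token, (counts.get(token) || 0) + 1);
});

// top stats - tokens sorted by count, descending
var topStats = Array.from(counts.entries()).sort(function (a, b) { return b[1] - a[1]; });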

Now it was time to look into my data again. Based on the top stats I prepared a model - in my case it was a JS object with a tree-like structure mapping groups to the tokens that were important to my research. It looked like this:

var groups = [
  { name: 'groupName1', list: [ 'token11', 'token12', 'token13'] },
  { name: 'groupName2', list: [ 'token21', 'token22', 'token23'] }
]

Final stats

Using my model, derived directly from the data, I went through all the tokens and matched them against the model. Based on this I was able to gather basic stats with additional group information.
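
A rough sketch of that matching step, assuming the counts Map from before and the groups model above, could look like this:

var stats = groups.map(function (group) {
  // sum the counts of all tokens that belong to this group
  var total = group.list.reduce(function (sum, token) {
    return sum + (counts.get(token) || 0);
  }, 0);
  return { name: group.name, count: total };
});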

The whole exercise took me about 3-4 hours and the end result was "good enough" for me.

Afterthoughts

As usual in the JS ecosystem, there is a library to help you go through this process without reinventing the wheel. It's called text-miner and was written by Philipp Burckhardt @burckhap.

It has APIs like .removeInterpunctuation() and .removeWords(words[, case_sensitive]) that will help you automatically remove unwanted content.

The great thing is that it also supports more sophisticated models using Terms for a term-document matrix.
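
A rough sketch of how that could look, based on the API names above (please check the library docs for the exact signatures):

var tm = require('text-miner');

var corpus = new tm.Corpus(entries);     // entries - an array of plain-text strings
corpus.toLower();
corpus.removeInterpunctuation();
corpus.removeWords(tm.STOPWORDS.EN);     // built-in English stop-word list
var terms = new tm.Terms(corpus);        // term-document matrix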

I drew some inspiration from a great conference talk called "From Power Chords to the Power of Models" from OreDev 2016 by Ali Kheyrollahi @aliostad - https://vimeo.com/191120690

Due to the fact that the tokens I gathered are basically context-less, additional depth of data could be gained by producing word vectors with deep learning, described in this article and here. The library for these methods is available on npm - node-word2vec.
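
A minimal sketch with node-word2vec could look like this (the file names and the sample word are placeholders of mine; training expects a plain-text file with the cleaned-up entries):

var w2v = require('word2vec');

// train a model on the cleaned-up entries, then query the nearest tokens of a sample word
w2v.word2vec('clean-entries.txt', 'vectors.txt', { size: 100 }, function () {
  w2v.loadModel('vectors.txt', function (error, model) {
    console.log(model.mostSimilar('refund', 10));
  });
});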