From time to time we all face the same problem: data analysis, or how to find useful information in a shapeless mass of data. It is quite a typical problem when retrofitting spreadsheet-like data storage.
Image source: https://www.solidswiki.com/index.php?title=Sieving
Importing it into a relational database is quite easy, but the resulting data quality and queryability are usually very low.
I faced exactly this problem recently. In my case it was analyzing feedback responses: a substantial quantity (50k) of entries that were mainly plain text.
Look into your data
My first step was to look into the data itself, just a brief look around to gather as much knowledge as possible about the nature of the data I was going to work with: quality, typical typos, data types, points of interest, anything that would be advantageous during later phases of the work.
During this phase I found a few interesting facts:
- entries were written without any spell-check,
- data structure was tree-like rather than table-like,
- data was contextually redundant, but plain text analysis would not show it.
It was really worthwhile to start with a naked-eye analysis; human pattern-recognition software is quite fast at spotting such things.
Importing data
As usual, I had limited time for this exercise. My idea was to use text data mining as a set of techniques for data processing and data modeling.
Due to the data's specificity (its tree-like shape), I started by importing it into MongoDB using mongoimport from a CSV text file.
mongoimport -d misc-data -c <collection-name> --type csv --file <filename.csv> --headerline
This gave me the ability to query the data using the find function with some basic filters and projections, because not all of the data was interesting from my point of view.
db.getCollection('collection-name').find({<some filters here>},{ <some projection here>}).map( function(u) { return u["Some prop"]; } )
The next step was to clean up the data entries. This was more like data exploration in the Node REPL than a carefully designed approach, but overall it turned out pretty nicely. I decided that, as the end result, I needed plain statistical data about the usage of certain keywords, based on a structural model.
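The post does not show how the entries got from MongoDB into Node; a minimal sketch using the official mongodb driver could look like this (database, collection and field names are the placeholders from the examples above, not real ones):
const { MongoClient } = require('mongodb');

async function loadEntries() {
  // connect to a local MongoDB and pull only the field we care about
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const entries = await client
    .db('misc-data')
    .collection('collection-name')
    .find({ /* some filters here */ }, { projection: { 'Some prop': 1 } })
    .map(function (u) { return u['Some prop']; })
    .toArray();
  await client.close();
  return entries;
}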
Cleaning up the mess
I started by collapsing each entry into a single string (with .reduce()) and then applied regexes to remove URLs, extra whitespace and other unimportant parts.
// join the fields of a single entry into one string
var str = line.reduce((prev, curr) => prev + " " + curr);
// normalize case
var lower = str.toLowerCase();
// strip URLs
var no_url = lower.replace(/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#$?&//=]*)/gi, " ");
// replace punctuation and other separators with spaces
var clean = no_url.replace(/(\.|:|\(|\)|\\|\/|\-|<|>|\&)/g, " ");
The last thing was to produce tokens from the cleaned-up entry.
var tokens = clean.split(' ');
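One small refinement of my own (not in the original code): the replacements above leave runs of spaces behind, so splitting on a single space produces empty tokens. Splitting on a whitespace regex and dropping empty strings avoids that:
// alternative tokenisation: split on any whitespace run and drop empty strings
var tokens = clean.split(/\s+/).filter(Boolean);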
Removing common words
The next thing was to remove the 100 most common English words ('a', 'the', 'in', etc.) from the token list. I used a library called common-words. The code looks more or less like this.
var common = require('common-words');

// remove every occurrence of the most common English words from the token list;
// common-words exports an array of objects with a `word` property
function removeCommonWords(words, common) {
  common.forEach(function (obj) {
    var word = obj.word;
    while (words.indexOf(word) !== -1) {
      words.splice(words.indexOf(word), 1);
    }
  });
  return words;
}
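A usage sketch, assuming tokens is the array produced in the previous step:
// strip the common English words from the token list
var filtered = removeCommonWords(tokens, common);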
Look into your data - again
The rest was pretty standard. I just created a Map with all tokens and the count of each of them.
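The counting itself is not shown in the post; a minimal sketch, assuming the filtered token list from the previous step, could look like this:
// count occurrences of each token ...
var counts = new Map();
filtered.forEach(function (token) {
  counts.set(token, (counts.get(token) || 0) + 1);
});

// ... and sort descending by count to get the top stats
var topTokens = Array.from(counts.entries())
  .sort(function (a, b) { return b[1] - a[1]; });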
Now it was time to look into my data again. Based on the top stats I prepared a model; in my case it was a JS object with a tree-like structure mapping groups to the tokens that were important to my research. It looked like this:
var groups = [
  { name: 'groupName1', list: ['token11', 'token12', 'token13'] },
  { name: 'groupName2', list: ['token21', 'token22', 'token23'] }
];
Final stats
Using my model, derived directly from the data, I went through all the tokens and matched them against the model. Based on this I was able to gather basic stats with additional group information.
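The matching code is not in the post either; a minimal sketch, assuming the counts map and the groups model from the previous steps, could look like this:
// sum, per group, the counts of all tokens that belong to it
var groupStats = groups.map(function (group) {
  var total = group.list.reduce(function (sum, token) {
    return sum + (counts.get(token) || 0);
  }, 0);
  return { name: group.name, count: total };
});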
The whole exercise took me about 3-4 hours and the end result was "good enough" for me.
Afterthoughts
As usual in the JS ecosystem, there is a library that helps you go through this process without reinventing the wheel. It's called text-miner, written by Philipp Burckhardt (@burckhap).
It has APIs like .removeInterpunctuation() and .removeWords(words[, case_sensitive]) that help you automatically remove unwanted content.
The great thing is that it also supports a more complicated model, using Terms for a term-document matrix.
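A rough sketch of how that might look, based on my reading of the text-miner README (method names and the STOPWORDS constant should be double-checked there):
var tm = require('text-miner');

// build a corpus from the raw feedback entries and clean it up
var corpus = new tm.Corpus(entries);
corpus.toLower();
corpus.removeInterpunctuation();
corpus.removeWords(tm.STOPWORDS.EN);

// term-document matrix: which term appears how often in which document
var terms = new tm.Terms(corpus);
console.log(terms.vocabulary); // distinct terms across the corpus
console.log(terms.dtm);        // per-document term counts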
I drew some inspiration from a great conference talk, "From Power Chords to the Power of Models", given at OreDev 2016 by Ali Kheyrollahi (@aliostad): https://vimeo.com/191120690
Since the tokens I gathered are basically context-free, additional depth could be added by producing word vectors with deep learning, as described in this article and here. A library for these methods is available on npm: node-word2vec.
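A rough sketch of training and querying such vectors with the word2vec package (file names and parameters are placeholders, and the API is quoted from memory, so check the package README):
var w2v = require('word2vec');

// train vectors from a plain-text file of cleaned-up entries,
// then ask the model which tokens are closest to one of the group keywords
w2v.word2vec('entries.txt', 'vectors.txt', { size: 100 }, function () {
  w2v.loadModel('vectors.txt', function (err, model) {
    if (err) throw err;
    console.log(model.mostSimilar('token11', 10));
  });
});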