On BigData and Data Mining

A few years ago people shrugged off our technology, which is based on a model of the user psyche, without batting an eye. The hearts of analysts and managers at IT companies were fully occupied by BigData. It seemed that in a couple of years large amounts of data would make it possible to compute a user's behavior, just like in sci-fi. Everyone was waiting for a miracle. But not me. The functional dependencies recently extracted from BigData are quite interesting, but when it comes to analyzing and predicting human behavior (as opposed to, say, analyzing failure statistics for complex technical systems), the psychoanalyst in me starts having strong doubts.

Following up on an argument I had with the chief analyst of one of the largest gaming companies (hello there, omegian!), I want to offer a comparison. For human behavior and existence, the fitting metaphor is ‘writing a novel’. It is not perfectly accurate but, on the whole, it fits well. Both have periods with distinct features (e.g. ages), both have plot twists (career, personal life, leisure), both have characters who are not accidental, supporting ones included (friends, enemies, relatives), and finally both have an atmosphere and a genre (worldview, lifestyle). And, most importantly, in both cases earlier events cause the later ones.

How do data mining algorithms function within this metaphor? Words are randomly selected out of the unfinished novel (in our case the words stand for the registered internet activity of a user) and then arranged into a list. Then take millions more (tens, hundreds of millions) of such word lists: the registered internet activity of other users. These lists can be written in different languages, by people from different countries, with different climates, of different ages, in different current situations. In order to predict the next ‘line’ or ‘plot twist’ of a particular ‘novel’, one has to examine the entire dataset and look for correlations.
Fear not, they are entirely findable. Then again, these correlations have little to no relevance to predicting the behavior of a particular user at a particular stage of life. And the accuracy of the event forecast itself does not exceed the 60% threshold. That is nearly fifty-fifty: the event will either take place or it will not. It is not often that we already have ‘the gun on the wall’ in the first chapter.

Now let’s compare this process to user psyche modeling. We know the principles of novel writing: the various genres, the ways a plot unravels, the most likely characters. How difficult can it be to determine the subsequent chapters? Genre, protagonist and setting almost completely determine the remainder of the story. Start reading any novel. In general, its resolution won’t surprise you.

Stepping aside from metaphors, we can safely claim that any person is characterized by a complex of congenital and acquired qualities. By identifying them and knowing how they manifest themselves in different situations, you can predict this person’s behavior very accurately. Look at your good friends: you can guess their answers to the questions “Do they like this movie?”, “Will he buy this car?” and “What brands does she prefer?” almost without fail, even if you have never discussed these subjects with them.

Of course, attempts to identify implicit groups (that is, characteristics) are constantly being made in BigData, by means of Latent Dirichlet allocation (LDA), for instance, but I haven’t heard of such algorithms being highly effective. I believe psyche modeling will be the most effective way to predict user reactions for the next couple of decades.