
Introduction

So, the buzz is data science. It is being discussed everywhere, and there has been big hype in the media around it. Looking back to the early 2000s, you can see people starting to talk about it regularly and frequently; yet the idea was originally introduced back in the sixties, in academia. In 2012, Harvard Business Review, the prestigious and very popular magazine of Harvard Business School, named data scientist “The Sexiest Job of the 21st Century”. Now you can imagine how the field grabbed all its recent attention and limelight.

Now, what is data science?

In short, it is a scientific way to look at and analyze data, which in turn opens up a new dimension of available information. It is a sophisticated discipline that requires both statistical knowledge and computer programming skill; a complex, multidisciplinary field that has become a new and famously “sexy” profession. Behind this success lies a long story: the mature discipline of statistics combined with computer science, and lately a new phase driven by big data. Many kinds of people have worked to make sense of data: scientists, computer scientists, librarians, statisticians and others. The following gives a brief idea of how the term data science evolved, the related terms, and how they are used.

Early days

Data science was not formally introduced overnight; it evolved to become the sexiest job in the world. Several mathematicians, scientists and international organizations played key roles, directly or indirectly. Interestingly, those contributions were not always about data science itself; rather, they defined a few building blocks that proved very important for the discipline.

The International Federation for Information Processing (IFIP), established in 1960 under UNESCO, set out key guidelines and concepts on data: how it should be processed and what standards should be maintained. This was not data science at all, but it defined a systematic way of processing and presenting data. Before those guidelines, data presentation and processing were limited to individual domains and were very difficult to interpret in other domains. The term “datalogy” was introduced in 1968 to formalize this data-analysis practice.

John W. Tukey was an American mathematician, famous for the development of the FFT algorithm and the box plot. In 1962 he published “The Future of Data Analysis”, in which he first brought up the idea of the relationship between statistics and data analysis. Earlier, data analysis was considered an “applied” sub-discipline of statistics, which kept its scope very limited and confined to the business area. In it he writes:

“For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…

I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier…

Large parts of data analysis are inferential in the sample-to-population sense, but these are only parts, not the whole… Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.”

This paper has been cited many times in research as the formal introduction of data analysis outside the statistical discipline. At a later stage, researchers came up with several hypotheses to derive other dimensions from the same data, resulting in better decision making.

One important thing to notice: Tukey was a mathematician, not a statistician, and he blended statistical analysis and mathematics together to make data analysis more “scientific” and acceptable.

In 1977 Tukey published another major work: Exploratory Data Analysis. Here he brought in another major idea, about how “exploratory” and “confirmatory” data analysis should be done, stressing a “side by side” approach: new or revised hypotheses should be explored and confirmed side by side. Why was this idea so important? As mentioned before, contemporary data analysis was a statistical discipline limited to a specific domain. For example, a particular hypothesis might be useful for finding a type of health issue in a population, but the same hypothesis might not be applicable to another area, such as assessing the quality of a particular crop.
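
To make the distinction concrete, here is a minimal Python sketch of the “side by side” idea; the data and group names are synthetic, invented purely for illustration. It pairs an exploratory step (a box plot, Tukey's own invention) with a confirmatory step (a two-sample t-test):

    # Exploratory and confirmatory analysis side by side.
    # The data below is synthetic, purely for illustration.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(42)
    group_a = rng.normal(loc=50, scale=5, size=100)  # e.g. one population
    group_b = rng.normal(loc=53, scale=5, size=100)  # e.g. another population

    # Exploratory: look at the data first, without committing to a hypothesis.
    plt.boxplot([group_a, group_b])
    plt.xticks([1, 2], ["Group A", "Group B"])
    plt.title("Exploratory: compare the distributions visually")
    plt.show()

    # Confirmatory: test the hypothesis the exploration suggested.
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")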

Another important name that came after Tukey is Peter Naur. In 1974 he published “Concise Survey of Computer Methods” in Sweden and the United States, a collection of contemporary data processing methods from various domains, used worldwide in a variety of applications. Another important foundation of the book was the set of data standards and guidelines defined by the International Federation for Information Processing, which made its ideas more acceptable and interpretable for a wide range of audiences; the ideas detailed in the book come with short surveys or example data processing tasks. In this book he used the term “data science” several times, and at a later stage he produced a formal definition: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.” From this time on, the term “data science” was used more frequently, though it took a long time to catch on; after his work, data science was pushed forward more and more.

In 1977 the International Association for Statistical Computing (IASC) was founded as a section of the International Statistical Institute (ISI). The main aim of the IASC is to connect statisticians, computing professionals, educational institutions, researchers and governments worldwide, and to exchange work on statistical computing across subjects and domains. It went on to publish the journal “Computational Statistics & Data Analysis”. This was a tremendous move, as it helped knowledge sharing and spread new ideas on computational statistics and data analysis. Notice that by this time data analysis had been accepted as an important discipline.

In 1989 the first Knowledge Discovery in Databases workshop, also known as KDD-89, was organized by Gregory Piatetsky-Shapiro. KDD-89 discussed these areas:

  • Expert Database Systems
  • Scientific Discovery
  • Fuzzy Rules
  • Using Domain Knowledge
  • Learning from Relational (Structured) Data
  • Dealing with Text and other Complex Data
  • Discovery Tools
  • Better Presentation Methods
  • Integrated Systems
  • Privacy

KDD-89 has been cited in several research papers at later stages and is considered a pioneer of the formal improvement of data representation. Following this session, scientists and researchers started exploring these options for data representation and data storage, which in turn helped the DBMS deliver better storage, retrieval and presentation of data. In fact, today's “data science”, or data analysis, extends to various areas: health, retail, manufacturing, services and government organizations. This wide acceptance became possible with the improvement of the DBMS as a general solution for data management rather than a mere tool. KDD-89 eventually became the ACM SIGKDD annual conference on Knowledge Discovery and Data Mining.

Over the next couple of years, data science gained another dimension with the improvement of the database management system, DBMS, or simply the “database”. The DBMS changed the way we store, view and review data. Following KDD-89, researchers came up with easier representations, and DBMS technology allowed data to be stored and shared more easily and effectively.

By this time it was clear that more computational power was needed to continue with such data analysis. As researchers improved computer processing power, data analysis became easier, allowing deeper and deeper “dives”, and it kept improving day by day.

Both of these developments made a significant, if indirect, contribution to today's shape of “data science”. The term itself had still not been formally accepted, but everyone was discussing data analysis and data management.

In September 1994, BusinessWeek published a cover story on “Database Marketing”, considered the first presentation of data analysis for commercial gain. The story hinted at how we should reuse our data. For example, companies collect data on product sales, customers, products and so on; from this we can predict users' buying patterns and how likely they are to buy another product, and hence what the next marketing approach should be and whether it is a good time for a promotion. The majority of today's commercial data analysis is used, or reused, for marketing purposes. The cover story has been cited by several universities and prestigious business schools, which have conducted research to derive better hypotheses. Nowadays this key concept has evolved and is available in several forms, from advertising to predicting future sales or optimizing the supply chain.

The International Federation of Classification Societies (IFCS) included the term “data science” in a conference title for the first time in 1996, at its biennial conference in Kobe, Japan. This was very important: “data science” had been adopted as a key term by a prestigious organization. At later stages it used the term alongside other key terms such as data mining and data analysis. The IFCS was founded in 1985 as an association of six country- and language-specific classification societies.

We discussed KDD-89 earlier; by this time, multiple papers on applications of KDD had been published and discussed by researchers and scientists. One of the key papers was “From Data Mining to Knowledge Discovery in Databases” by Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth. They discuss “data mining”, which they describe as the application of specific algorithms for extracting patterns from data.

From the paper “Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing… In our view, KDD [Knowledge Discovery in Databases] refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data… the additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.”

Interestingly, they never used the term “data science” at all; instead they stressed data extraction, yet the paper was later cited many times as a key building block of today's data science. Before this paper, data extraction was treated as a generic process, one that failed to surface all the available data.
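
As a rough illustration of the KDD process the paper describes (selection, preparation and cleaning, data mining, interpretation), here is a minimal Python sketch; the table, column names and choice of algorithm are all invented for the example:

    # A toy walk-through of the KDD steps from the paper:
    # selection -> cleaning/preparation -> data mining -> interpretation.
    # Data, column names, and parameters are hypothetical.
    import pandas as pd
    from sklearn.cluster import KMeans

    # Selection: pick the relevant slice of a (hypothetical) sales table.
    raw = pd.DataFrame({
        "customer_id": [1, 2, 3, 4, 5, 6],
        "orders":      [12, 1, 7, None, 25, 3],
        "avg_spend":   [80.0, 15.0, 55.0, 40.0, 120.0, None],
    })
    selected = raw[["orders", "avg_spend"]]

    # Cleaning / preparation: drop incomplete rows.
    clean = selected.dropna()

    # Data mining: one specific algorithm (here, k-means clustering).
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(clean)

    # Interpretation: inspect the patterns instead of applying them blindly.
    clean = clean.assign(segment=model.labels_)
    print(clean.groupby("segment").mean())

Skipping that final interpretation step is exactly the “blind application” of data-mining methods the authors warn against.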

In 1997, Professor C. F. Jeff Wu of the Georgia Institute of Technology first proposed that statistics be renamed data science and statisticians renamed data scientists, in his inaugural lecture for the H. C. Carver Chair in Statistics at the University of Michigan. This drew huge attention to the subject of “data science”. Professor Wu received a mixed response, including heavy criticism, but people started thinking about, and spending time on, how data science differs from statistics, resulting in revised hypotheses and growing acceptance of data science.

In 1999 Jacob Zahavi noted that new tools were needed to manage the huge amounts of information available to business. In “Mining Data for Nuggets of Knowledge” he wrote:

“Scalability is a huge issue in data mining… Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address website decisions.”

In 2001 William S. Cleveland published the paper “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” There are two key takeaways: it was probably the first research paper to put “data science” front and center, and it proposed ideas for closing the gap between statisticians and computer scientists. That mattered, because analyzing data needs both statistical input and computing knowledge; as mentioned earlier, today's data science is a specialized, still-evolving blend of statistics and computer-engineering methods. Cleveland also laid out a plan for training data scientists to meet future needs, proposing six technical areas to be studied in university departments and pointing to research resources for each; his ideas have since been taken up in both the government and corporate sectors. The paper influenced many researchers from different domains to come up with new hypotheses, and it enriched data science overall.

In 2001, “software-as-a-service” also appeared, a primary precursor of cloud applications.

The Committee on Data for Science and Technology (CODATA) of the International Council for Science began publishing the “Data Science Journal” in 2002. The main aim of the publication is to focus on areas such as the description of data systems, their publication on the internet, applications, and legal issues.

In January 2003 the “Journal of Data Science” was launched. In its own words: “By ‘Data Science’ we mean almost everything that has something to do with data: collecting, analyzing, modeling… yet the most important part is its applications, all sorts of applications. This journal is devoted to applications of statistical methods at large… The Journal of Data Science will provide a platform for all data workers to present their views and exchange ideas.”

In 2006, Hadoop 0.1.0 was released as an open-source framework for distributed storage and processing. It grew out of Nutch, an open-source web search engine.

In 2008 the term “data scientist” became part of the language and turned into a buzzword; DJ Patil of LinkedIn and Jeff Hammerbacher of Facebook are generally credited with coining the job title.

In 2009 the term NoSQL was re-established by Johan Oskarsson when he organized a discussion on “open source, non-relational databases”, although variations of the term had been in use since 1998.

Data scientist jobs increased in 2011, and conferences and seminars about data science and big data multiplied alongside them. Data science became a source of profit and part of corporate culture. In the same year James Dixon, then CTO of Pentaho, introduced the concept of the data lake and promoted the difference between the data lake and the data warehouse. According to him, a data warehouse is a place where data is pre-categorized at the point of entry, wasting energy and time, while a data lake simply stores the data without categorizing it, accepting detail as-is, typically in a non-relational store.
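
To illustrate the contrast, here is a small hypothetical Python sketch (not Dixon's own code; the event records and field names are invented): a warehouse-style “schema on write” validates and shapes records at load time, while a lake-style “schema on read” keeps the raw records and imposes structure only when querying:

    # Hypothetical sketch: warehouse-style "schema on write"
    # versus lake-style "schema on read".
    import json

    raw_events = [
        '{"user": "a1", "item": "headphone", "price": 49.9}',
        '{"user": "b2", "item": "case", "price": 9.5, "coupon": "X10"}',
    ]

    # Warehouse: categorize/validate at the point of entry (schema on write).
    def load_into_warehouse(line):
        # Fields outside the fixed schema (e.g. "coupon") are dropped here.
        record = json.loads(line)
        return {"user": record["user"], "price": float(record["price"])}

    warehouse = [load_into_warehouse(line) for line in raw_events]

    # Lake: store everything untouched; impose structure only when reading.
    lake = list(raw_events)  # raw strings, nothing lost
    coupons = [json.loads(line).get("coupon") for line in lake]
    print(warehouse, coupons)

Note how the “coupon” field survives in the lake and can still be queried later, while the warehouse discarded it at load time.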

In 2013, IBM reported that 90% of the world's data had been created in the previous two years alone. In 2015, using deep learning techniques, Google's voice and speech recognition made a dramatic leap in accuracy, an impressive performance.

In the past 10 years, data science has developed very quickly worldwide, across business and organizations of all kinds, and it has become a most important part of academic research.

Other factors that made it the sexiest job in the world

So what other factors made data science so important and popular today? In short, data became the most valuable resource in the world; as Forbes put it, “data is the new oil”. But how is so much data generated? Let me give a small example:

  1. You purchased a product (say, an XXXX headphone) from the online store ABC.COM.
  2. You shared that “breaking news” on social media, tagging both ABC.COM and XXXX.
  3. Finally, you reviewed the product, sharing your experience with it.

These three activities have generated some data; now consider how it can be reused. I will discuss a few high-level points (see the sketch after this list):

  • From the XXXX price, ABC.COM will try to work out which spending category you fall into.
  • Why did you choose XXXX over another brand? Say you did some research, comparing other products before buying; they will review which products you looked at.
  • Based on your product review, they will see whether you share the same opinions as other customers.
  • If you are a frequent customer, they may try to identify your buying pattern and predict what your next purchase would be.
  • Meanwhile, your friends start asking about the product on social media. From those comments, the product vendor can try to gauge previous customers' reactions and whether there is any new demand at all.
  • Since both ABC.COM and the XXXX manufacturer were tagged in your post, friends who were not aware of them might show interest, visit those social media pages, and “like” or comment on something; from this, the XXXX product company can try to identify potential customers.
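
To give a flavor of the first and fourth points, here is a minimal hypothetical Python sketch; the order history, column names and spend thresholds are all invented for illustration:

    # Hypothetical sketch: bucket customers into spend categories and list
    # each headphone buyer's other purchases as a crude "next purchase"
    # signal. All data and thresholds are invented.
    import pandas as pd

    orders = pd.DataFrame({
        "customer": ["you", "you", "you", "ann", "ann", "bob"],
        "item":     ["headphone", "case", "cable", "headphone", "cable", "case"],
        "price":    [49.9, 9.5, 5.0, 49.9, 5.0, 9.5],
    })

    # Spending category: which bucket does each customer's total fall into?
    totals = orders.groupby("customer")["price"].sum()
    category = pd.cut(totals, bins=[0, 20, 60, float("inf")],
                      labels=["low", "mid", "high"])
    print(category)

    # Buying pattern: items most often bought by customers who also bought
    # a headphone -- a naive "you may also like" list.
    headphone_buyers = orders.loc[orders["item"] == "headphone", "customer"]
    also_bought = orders[orders["customer"].isin(headphone_buyers)
                         & (orders["item"] != "headphone")]
    print(also_bought["item"].value_counts())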

Now, this list could get much bigger, perhaps endless. The scenarios mentioned above require a systematic approach to extract that information. Most importantly, the data will not be available in the same form everywhere; on a social network, for example, it may appear rather differently than on ABC.COM. And ABC.COM and the XXXX manufacturer have two distinct targets when extracting insights from your activity. Say you posted a bad review of the headphone on social media while praising ABC.COM for prompt delivery, or vice versa, perhaps because your delivery experience with ABC.COM has been good or bad on every previous occasion; each of them will still try to understand how that experience affects you.