Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information




Author: Jules J. Berman

Publisher: Morgan Kaufmann


Publish Date: June 18, 2013

ISBN-10: 0124045766

Pages: 288

File Type: PDF

Language: English


Book Preface

Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that’s 300 billion gigabytes). Data storage increases at about 28% per year. The data stored is peanuts compared to data that is transmitted without storage. The annual transmission of data is estimated at about 1.9 zettabytes (1900 billion gigabytes; see Glossary item, Binary sizes).1 From this growing tangle of digital information, the next generation of data resources will emerge.
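As a quick sanity check on these figures, here is a short Python sketch. The starting totals and the 28% growth rate come from the paragraph above; the ten-year projection horizon is an arbitrary choice for illustration, and the conversions use decimal units (1 exabyte = 10^9 gigabytes), matching the parentheticals in the text.

    # Back-of-the-envelope check of the storage figures quoted above.
    STORED_EB = 300          # total stored data, in exabytes
    TRANSMITTED_ZB = 1.9     # annual transmitted data, in zettabytes
    GROWTH_RATE = 0.28       # annual growth in stored data

    print(f"Stored: {STORED_EB * 10**9:,.0f} GB")                   # 300 billion GB
    print(f"Transmitted/year: {TRANSMITTED_ZB * 10**12:,.0f} GB")   # 1,900 billion GB

    # Compound the 28% yearly growth over an (arbitrary) ten-year horizon.
    stored = STORED_EB
    for year in range(10):
        stored *= 1 + GROWTH_RATE
    print(f"Projected stored data after 10 years: {stored:,.0f} EB")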

As the scope of our data (i.e., the different kinds of data objects included in the resource) and our data timeline (i.e., data accrued from the future and the deep past) are broadened, we need to find ways to fully describe each piece of data so that we do not confuse one data item with another and so that we can search and retrieve data items when needed. Astute informaticians understand that if we fully described everything in our universe, we would need an ancillary universe to hold all the information, and the ancillary universe would need to be much, much larger than our physical universe. In the rush to acquire and analyze data, it is easy to overlook the topic of data preparation.
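One common way to keep data items distinct and retrievable, sketched below in Python, is to wrap each raw value in a small self-describing record carrying a unique identifier and a timestamp. The field names and example values here are illustrative choices, not a standard prescribed by the book.

    import uuid
    from datetime import datetime, timezone

    def describe(value, name, units, source):
        """Wrap a raw value in a minimal self-describing record."""
        return {
            "id": str(uuid.uuid4()),   # unique identifier, so items are never confused
            "name": name,
            "value": value,
            "units": units,
            "source": source,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    item = describe(37.2, "body_temperature", "celsius", "clinic_A")
    print(item["id"], item["name"], item["value"], item["units"])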

If data in our Big Data resources (see Glossary item, Big Data resource) are not well organized, comprehensive, and fully described, then the resources will have no value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval, and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.

Perhaps the greatest potential benefit of Big Data is the ability to link seemingly disparate disciplines for the purpose of developing and testing hypotheses that cannot be approached within a single knowledge domain. This book reviews methods by which analysts can navigate through different Big Data resources to create new, merged data sets.
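A toy illustration of such a merge in Python: records from two hypothetical, independently maintained resources are joined on a shared identifier to create a new data set. The record fields, identifiers, and values are all invented for the example.

    # Records from two hypothetical resources, keyed on a shared identifier.
    clinical = {"p001": {"diagnosis": "type 2 diabetes"},
                "p002": {"diagnosis": "hypertension"}}
    environmental = {"p001": {"zip_code": "10001", "air_quality_index": 54},
                     "p003": {"zip_code": "60601", "air_quality_index": 88}}

    # Keep only identifiers present in both resources (an inner join),
    # combining the fields from each into one merged record.
    merged = {pid: {**clinical[pid], **environmental[pid]}
              for pid in clinical.keys() & environmental.keys()}

    print(merged)   # {'p001': {'diagnosis': 'type 2 diabetes', 'zip_code': ...}}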

What exactly is Big Data? Big Data can be characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).2 Those of us who have worked on Big Data projects might suggest throwing a few more V’s into the mix: vision (having a purpose and a plan), verification (ensuring that the data conforms to a set of specifications), and validation (checking that its purpose is fulfilled; see Glossary item, Validation).
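Verification, in the sense used above, lends itself to a small illustration: the Python checker below tests whether records conform to a simple field specification. Both the specification and the records are invented for the example.

    # A toy specification: each required field maps to the type it must have.
    SPEC = {"id": str, "age": int, "country": str}

    def verify(record, spec=SPEC):
        """Return a list of violations; an empty list means the record conforms."""
        problems = [f"missing field: {f}" for f in spec if f not in record]
        problems += [f"bad type for {f}: expected {t.__name__}"
                     for f, t in spec.items()
                     if f in record and not isinstance(record[f], t)]
        return problems

    print(verify({"id": "a1", "age": 42, "country": "US"}))  # [] -> conforms
    print(verify({"id": "a2", "age": "forty-two"}))          # missing field, bad type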

Many of the fundamental principles of Big Data organization have been described in the “metadata” literature. This literature deals with the formalisms of data description (i.e., how to describe data), the syntax of data description (e.g., markup languages such as eXtensible Markup Language, XML), semantics (i.e., how to make computer-parsable statements that convey meaning), the syntax of semantics (e.g., framework specifications such as Resource Description Framework, RDF, and Web Ontology Language, OWL), the creation of data objects that hold data values and self-descriptive information, and the deployment of ontologies, hierarchical class systems whose members are data objects (see Glossary items, Specification, Semantics, Ontology, RDF, XML).
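The statement form underlying RDF is the subject-predicate-object triple. A minimal, library-free Python sketch of a few triples and a query over them follows; the URIs, predicate names, and values are invented placeholders, not part of any published vocabulary.

    # Subject-predicate-object triples, the statement form that RDF formalizes.
    triples = [
        ("http://example.org/item/123", "has_name",  "blood_glucose"),
        ("http://example.org/item/123", "has_value", "5.4"),
        ("http://example.org/item/123", "has_units", "mmol/L"),
        ("http://example.org/item/123", "member_of", "http://example.org/class/LabResult"),
    ]

    # A computer-parsable query: collect everything asserted about one subject.
    subject = "http://example.org/item/123"
    for s, p, o in triples:
        if s == subject:
            print(f"{p}: {o}")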

