Guest Column | May 24, 2017

How Genomics Researchers Deal With Big Data


By Greg Hoffer, Vice President of Engineering, Globalscape

While the term Big Data was, according to the New York Times’ Steve Lohr, coined in 1998, it hit the mainstream as a hot business concept about a decade ago when major advancements in processing power, storage, and bandwidth coincided with the trend toward cloud computing. As with anything new in tech, there were skeptics, but Big Data is real and an important part of business today — especially in medical science and research.

But how big is Big Data? According to VCloud News, the human race generates 2.5 quintillion bytes of data every day. That’s Big Data to the tune of a 25 with 17 zeros tacked on the end. In the field of genomics, Big Data management and analysis are vital. The size of a single human genomic data set can be as large as 700 megabytes (MB) and the data set generated by a sequencer can be even bigger — around 200 gigabytes (GB). A Washington Post report stated, “The amount of data being produced in genomics daily is doubling every seven months, so within the next decade, genomics is looking at generating somewhere between 2 and 40 exabytes a year.”
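
For a rough sense of scale, the arithmetic behind those figures is straightforward. The short sketch below (plain Python, using only the numbers cited above) converts the daily global figure into exabytes per year and a single sequencer run into terabytes; it is purely illustrative.

```python
# Scale check for the figures cited above (illustrative only).
DAILY_BYTES = 2.5e18        # 2.5 quintillion bytes generated worldwide per day
SEQUENCER_RUN_GB = 200      # approximate raw output of one sequencing run, in GB

exabytes_per_day = DAILY_BYTES / 1e18
exabytes_per_year = exabytes_per_day * 365

print(f"Global data: {exabytes_per_day:.1f} EB per day, ~{exabytes_per_year:.0f} EB per year")
print(f"One sequencer run: {SEQUENCER_RUN_GB / 1000:.1f} TB of raw data")
```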

That’s a lot of data, so it’s probably a good idea to know what is meant by Big Data before we go much further.

Gartner's IT Glossary describes Big Data as “high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” That means data comes at you fast and it doesn’t stop. You need to be able to unlock relevant patterns and trends that might otherwise be lost in the deluge and turn that data into useful information — breakthroughs that can lead to new therapies and more positive medical outcomes.

The Three Vs

The resulting business decisions might be as simple as a price tweak or as sweeping as a complete rethinking of your organization’s business model. Given Big Data’s potential, let’s examine the “Three Vs”:

  1. Volume: Businesses generate and gather high volumes of data every day, and in ways you probably haven’t considered. Any time you sell a product there is a lot of data involved, such as: Who made it? When did you put it in inventory and for what price? Who shipped it and from where? Who purchased it and from what industry? Was the purchase a cash transaction or credit? How was it marketed? Was it a replacement? How long did the original last? Do your competitors carry the product or similar products? How much do they charge?
    You get the idea, and that is just a superficial look at the level of detail that can be examined with the purchase of a single product. Many pieces of data are generated or shared during a day, and volume is just one aspect of understanding Big Data.
  2. Velocity: The pervasiveness of technology — and especially mobile communications — means nearly every aspect of our businesses, whether local or global, is generating more information and sending it home faster than ever before. It’s difficult to keep up with the analog world, never mind the digital realm with all its devices and global interconnectedness. Not only are there large amounts of data being created, shared, and stored, but the speed at which data is generated has also increased, giving new meaning to the metaphor “drinking from a fire hose.”
    That fire hose is pumping non-stop. Unfortunately, most businesses aren’t equipped to track where the data comes from, how it can be used, or where it's stored. For those that do (or who partner with an organization that does), volume plus velocity means there’s a need for new ways of collecting and storing Big Data.
  3. Variety: Data is still often thought of as the kind of thing that populates the tables in a database, but these days almost everything can be rendered and transferred in some digital form. It is customer buying habits, website traffic, HR records, and financial or research information. It is image files, audio files, social media interactions, emails, texts, and documents of various types and more. That means there is no standard format and no standard means of searching or analyzing it all for its deeper meaning. It is unstructured data, and it makes up most of a business’s data, requiring specialized processes to mine it for the valuable nuggets it contains.

The ability to handle the vast variety of data is vital to distinguishing “Big Data” from “a lot of data.”
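
To make the point about variety concrete, the sketch below (illustrative Python; the file types and field names are hypothetical, not a prescribed pipeline) shows the kind of routing step that mixed sources force on an analysis pipeline: tabular formats map cleanly to fields, while free text has to be wrapped and tagged before it can sit alongside them.

```python
import csv
import json
from pathlib import Path

def load_record(path: Path) -> dict:
    """Route a file to an appropriate parser based on its type."""
    if path.suffix == ".json":
        # Structured: assumes one JSON object per file.
        return json.loads(path.read_text())
    if path.suffix == ".csv":
        # Structured: take the first row as a field/value mapping.
        with path.open(newline="") as f:
            return next(csv.DictReader(f))
    # Unstructured fallback (emails, notes, transcripts): wrap the raw
    # text and tag its source so it can be indexed alongside the rest.
    return {"source": path.name, "raw_text": path.read_text()}
```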

Insight And Decision Making

Once you’ve got a handle on the Three Vs, you can get down to the business of separating the wheat from the chaff, creating useful intelligence from the previously unintelligible. And you’ve got to do it while following all the various regulatory guidelines. These include the Health Insurance Portability and Accountability Act and its complement, the Health Information Technology for Economic and Clinical Health Act (HIPAA-HITECH); the Food and Drug Administration’s Title 21 CFR Part 11; the Federal Policy for the Protection of Human Subjects; the Genetic Information Nondiscrimination Act (GINA); and the National Institutes of Health Genomic Data Sharing Policy. And those are just the U.S. regulations.

Consider one case from our experience working with a global pharmaceutical company heavily invested in genomics. This company regularly generates huge files — as large as 200 terabytes (TB) — that they need to manage efficiently and securely to conduct their research. That means not only working within regulatory boundaries, but also ensuring the security of their systems to preserve the integrity of the data, research, and resulting intellectual property. The threats are numerous and real. Check the headlines and you’ll find reports of ransomware, unethical insiders, hackers, and other threats just about every day.
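
Part of preserving that integrity is being able to prove a file arrived exactly as it left the sequencer. A minimal sketch of the idea, in plain Python and assuming nothing about any particular vendor’s tooling, is a streamed checksum comparison:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so even very large files can be
    fingerprinted without loading them into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(received_copy: Path, checksum_from_sender: str) -> bool:
    """True only if the received file matches the sender's fingerprint;
    any mismatch means the data was corrupted or altered in transit."""
    return sha256_of(received_copy) == checksum_from_sender
```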

Once the plumbing and the policies are in place, the data must be made available to the company’s researchers around the world. That means organizing the data in a central repository, with aggregators and tools that can collect it into usable containers and apply the proper algorithms. This is where speed is essential. Day-to-day use of the internet’s popular sites and streaming media services can give the impression that the digital world is instantly available on demand, but the reality in business is much different. Consider, for example, that it is faster to store 1TB of data on physical media and fly it from New York to London on a jet liner than it is to transfer it over a standard 100Mbps connection. The flight takes about six hours, while the transfer takes nearly a full day even at full line rate, and a 200TB data set would take months.
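
The arithmetic behind that comparison is worth spelling out. The sketch below assumes an ideal, uncontended link running at full line rate; real-world transfers are slower still.

```python
def transfer_hours(size_terabytes: float, link_mbps: float) -> float:
    """Time to move a data set over a link, ignoring protocol overhead."""
    bits = size_terabytes * 1e12 * 8        # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)      # bits / (bits per second)
    return seconds / 3600

print(f"1 TB over 100 Mbps:   ~{transfer_hours(1, 100):.0f} hours")        # ~22 hours
print(f"200 TB over 100 Mbps: ~{transfer_hours(200, 100) / 24:.0f} days")  # ~185 days
```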

Using systems with the bandwidth and acceleration technology necessary to complete such a transfer (remember, our customer regularly generates 200TB files) in hours or days rather than months is a huge competitive advantage in the race to file patents and be first to market with new products.
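
Inverting the same arithmetic shows what a given deadline implies about sustained throughput, which is the gap acceleration technology is meant to close. The figures below again assume an ideal link with no protocol overhead.

```python
def required_gbps(size_terabytes: float, deadline_hours: float) -> float:
    """Sustained throughput (Gbps) needed to move a data set within a deadline."""
    bits = size_terabytes * 1e12 * 8
    return bits / (deadline_hours * 3600) / 1e9

# Moving 200 TB overnight (8 hours) calls for roughly 56 Gbps sustained;
# doing it within a single hour calls for about 444 Gbps.
print(f"{required_gbps(200, 8):.0f} Gbps to move 200 TB in 8 hours")
print(f"{required_gbps(200, 1):.0f} Gbps to move 200 TB in 1 hour")
```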

Conclusion

Systems and solutions designed to move Big Data and business intelligence securely from place to place are vital to protecting your business’s interests. Given the expectations for the ways we access and receive information and interact with the world around us, security is (or should be) a top priority to minimize the risk of a leak due to human error, process error, or intentional acts.

About The Author

Greg Hoffer is Vice President of Engineering at Globalscape, where he leads the product development teams responsible for the design and engineering of all of Globalscape’s products. In more than 12 years of service to the company, Greg has overseen the creation of products such as the Enhanced File Transfer suite and Secure FTP Server, established technology partnerships that helped achieve Federal Information Processing Standards (FIPS) compliance, and developed features and modules such as the DMZ Gateway, Auditing and Reporting, OpenPGP, Ad Hoc Large File Transfer, Advanced Workflow Engine, Workspaces, and the Web Transfer Client.