How data usage defines categorization
Did you know that your data has a temperature? It’s true.
However, that temperature has nothing to do with Fahrenheit or Celsius, or even how healthy your data is. The “temperature” of your data refers to two things:
- How soon after the data is created it is used, transformed, or processed
- How it’s stored
A brief history
The temperature references for data were initially developed when data location and the type of physical storage being used were often the limiting factors in accessing and using data quickly. Data that needed to be accessed immediately was typically stored in a data center located near where the processing took place, which reduced data access times. Data not needed as frequently was typically stored off-site, where it took longer to access but cost less to store. However, in the digital era, traditional file storage systems have been replaced with newer, more efficient file systems and virtual storage, so data access times based on location have been significantly reduced.
Today, data is separated into temperature categories based on the level of criticality for a business. The more critical the data is to the business operation, the more frequently the data is accessed and the hotter the data temperature. As such, three categories are used to classify data – hot, warm, and cold.
Let’s take a brief look at each of these.
Data categories
Hot data refers to data that needs to be accessed immediately upon being created (i.e., in real time), or data that is accessed very frequently. The data may also need to be transferred from where it’s created to another location for processing. As this type of data is typically used as soon as it’s created, or soon thereafter, it will not be stored for a long period of time.
Since hot data often needs to be instantly accessible, another critical issue is the speed at which the data can be accessed. That’s why today, hot data is often stored in the cloud, where availability and accessibility are typically 99.9% or higher. Data latency is another critical issue for hot data, so the closer to the source the data processing takes place, the better. Examples of hot data include data used for real-time decision-making for online transactions in a retail environment and telemetry data captured from a satellite for computations.
Warm data refers to data accessed within a relatively short period of time after being created, or data accessed on a regular basis. Unlike hot data, warm data is not typically being transferred to another location, so it may even be archived when it is not in use. Note that the time defined as short will be different for each company and will depend on the usage; there is no industry standard. An example of warm data is daily point-of-sale information from a cash register, summarized at the end of the day or on a weekly basis to provide weekly update reports.
Cold data refers to data that has been created but is accessed rarely or infrequently, so it doesn’t need to be immediately available. Cold data is generally older data that is no longer active or up to date, so latency is less critical when accessing it. An example of cold data is compliance records that may only need to be accessed annually for end-of-year reporting.
Data categories may be further refined by implementing policy-driven procedures that determine how long data will stay in one category before being moved to the next. This categorization process helps ensure that each category holds only the data that is actually required there.
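As a rough illustration of what a policy-driven procedure might look like, the sketch below assigns a category based on how long ago a data set was last accessed. The thresholds (one day for hot, 30 days for warm), the DataSet class, and the categorize function are all assumptions made up for this example. As noted above, there is no industry standard, so each business would choose its own values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds chosen only for illustration; each business
# would set its own values based on how its data is actually used.
HOT_WINDOW = timedelta(days=1)
WARM_WINDOW = timedelta(days=30)


@dataclass
class DataSet:
    """Hypothetical record describing a data set and its last access time."""
    name: str
    last_accessed: datetime


def categorize(dataset: DataSet, now: datetime) -> str:
    """Return 'hot', 'warm', or 'cold' based on time since last access."""
    idle = now - dataset.last_accessed
    if idle <= HOT_WINDOW:
        return "hot"
    if idle <= WARM_WINDOW:
        return "warm"
    return "cold"


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    examples = [
        DataSet("satellite_telemetry", datetime(2024, 1, 30, 23)),
        DataSet("daily_pos_summary", datetime(2024, 1, 20)),
        DataSet("compliance_records", datetime(2023, 1, 15)),
    ]
    for ds in examples:
        print(ds.name, "->", categorize(ds, now))
```

Running a check like this on a schedule is one way data could be moved from one category to the next as it ages. A real policy would likely also consider access frequency and business criticality, not just the time since last access.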
Data storage
In addition to the data categories, the other component for determining the temperature of your data is how it is stored. This determination is important because the type of storage used directly affects how accessible the data is and how readily it can be transformed.
The type of data storage used is strictly up to the business based on its unique data access requirements, but some general recommendations are discussed below.
Hot data storage – Because this type of data needs to be immediately available and accessible, whether for processing or transfer, it requires short access times with no latency issues. Data storage typically used for this category includes cloud storage with solid-state drives optimized for accessibility and low latency. Another option may be a storage area network (SAN) located near where the data processing will take place.
Warm data storage – This type of data is accessed within a relatively short period of time after being created, or on a regular basis. Data storage for this type of data may include NAS drives or cloud storage.
Cold data storage – Data of this type is not accessed on a regular basis, so it does not need to be immediately accessible. This data can be located offsite in a data warehouse. Because access and latency are not critical, this type of data storage often uses a storage medium such as tape or hard disk drives.
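The recommendations above can be captured in a simple lookup table. The following is a minimal sketch, not a definitive mapping: the STORAGE_OPTIONS dictionary and the storage_options function are hypothetical names invented for this example, and the entries simply restate the general recommendations given here.

```python
from typing import Dict, List

# Hypothetical lookup table restating the general recommendations above;
# the right choice for a given business depends on its own requirements.
STORAGE_OPTIONS: Dict[str, List[str]] = {
    "hot": [
        "cloud storage on solid-state drives",
        "storage area network (SAN) near the processing location",
    ],
    "warm": ["NAS drives", "cloud storage"],
    "cold": ["offsite tape", "offsite hard drives"],
}


def storage_options(category: str) -> List[str]:
    """Return the suggested storage options for a data temperature category."""
    key = category.strip().lower()
    if key not in STORAGE_OPTIONS:
        raise ValueError(f"unknown data category: {category!r}")
    return STORAGE_OPTIONS[key]


if __name__ == "__main__":
    for category in ("hot", "warm", "cold"):
        print(category, "->", ", ".join(storage_options(category)))
```

Keeping the mapping in one place makes it easier to review and adjust as storage requirements change.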
Why categorize data?
As you have seen, each of the three categories of data indicates the timeframe in which the data will be accessed (immediate, short-term, long-term) and processed. Knowing how data will be used in your business will make the assignment of the correct category easy.
Why is it important to assign data to the proper category?
Because if your data is not correctly categorized, access to the data may be compromised, which could impact your business. In today’s digital economy, data is the new lifeblood of business and is what drives insight and innovation. It’s what provides product teams with direction for new product development. It’s what informs the business. Ensuring access to that data is a mission-critical step.
If your data isn’t categorized correctly, it may be stored incorrectly or routed incorrectly, leading to delays in access, delays in processing, and an inability to derive insights and information from the data when your business needs them. Incorrectly categorized data can also impact your business financially, for example when rarely accessed data occupies storage provisioned for hot data.
In summary, the best way to ensure your business is getting all the value out of your data is to know what temperature your data is, understand when and how you will use it, and make sure it is categorized correctly.
About the Author
Richard Hatheway is a technology industry veteran with more than 20 years of experience in multiple industries, including computers, oil and gas, energy, smart grid, cyber security, networking and telecommunications. At Hewlett Packard Enterprise, Richard focuses on GTM activities for HPE Ezmeral Software.