Structural approach to geophysical data access

by Naum Derzhi, Eugene Lichman, and Eric Lanning

Summary

Moving data between applications and different stages of processing has been a long-standing problem within the geophysical industry, causing long delays when changing from one file format to another and when re-sorting information for different processing steps. The problem has grown over the last decade with the increasing number and sophistication of software packages, the increasing number of formats for different types of data, and the expanding size of the data volumes at every stage of the oil and gas business. We discuss how the problem of data access can be structured and solved in a way that would allow a large variety of data file formats to be accessed transparently, and would allow data to be read from a file in any sort order, independent of the order of the data within the file. The variety of data file organizations is finite and can be broken into three logical groupings. A general solution to data access is proposed that allows files to be written or read independent of format and sort order. A method for describing data file formats and capturing these in “format templates” is presented. Using this information to understand the structure of a data file, any trace, or gather of traces, can be accessed directly without the need to create intermediate disk files. Modifications to standard formats can be handled by editing the associated format template or by creating a new format template.

Introduction

Throughout the history of computing, one of the most limiting factors to productivity has been the difficulty of getting data into the applications required to do the job. Problems in changing data files from one format to another, reordering existing data for different processes, and creating multiple copies of the same information have in most cases been accepted as the status quo when working with computers. The following problems hinder the process of accessing seismic data and moving them between applications.

  1. The existing data file format does not match the application input file format.
    As the number and sophistication of applications required for operations have increased over the last several years, so has the number of file formats. At the same time, the sizes of the data volumes used by these applications have increased dramatically. These factors have led to larger amounts of time taken up in the reformatting process, increased the complexity and the amount of software required by the process, resulted in storing multiple copies of the same data, and increased the amount of on-line disk space required to support the ongoing work.
  2. The input data file must be reordered before the application can run.
    Different applications using the same information may require the data to be in a different sort order. (Re-sorting may be handled at the same time as reformatting, but it can be required even if the formats are the same.) This is certainly the case in many seismic processes where the data may be in CSP (Common Shotpoint) order while the application requires the data in CMP (Common Midpoint) order. The same may also be true for 2-D and 3-D grids input into different analysis and visualization systems, each requiring a different ordering of the data at input. In addition to being time consuming, these processes lead to multiple copies of the same data and add to the problem of exploding data volumes on disk.
  3. A barrier exists between prestack and poststack data.
    Another problem area in data access is the barrier between prestack data and poststack data. In recent years the oil and gas industry has seen a surge in the popularity of seismic analysis techniques that require access to prestack data, such as AVO and velocity analysis. Yet conventional software lacks the data access architecture necessary to form an interactive link between prestack and poststack data sets.

In most business operating environments these difficulties related to data access are often the limiting factor in determining what work will and will not be done. Reformatting solutions are considered too complex and time consuming to learn, or may simply be unavailable. Replicating huge data volumes by creating reformatted or reordered data files may be impractical or impossible, and data access methods for creating links between information at different stages of processing have not been available.

Data file organization & solutions to data access

The variety of data file organizations is finite and can be broken into three logical groupings. A general solution to data access is proposed that allows files to be written or read independent of format and sort order. A method for describing data file formats and capturing these in “format templates” is presented. Using this information to understand the structure of a data file, any trace, or gather of traces, can be accessed directly without the need to create intermediate disk files. Modifications to standard formats can be handled by editing the associated format template or by creating a new format template.

Classes of data file organization
There are three basic data organization structures. The descriptions below go from the simplest to the most complex.

  • Self-Contained: Example — SEG-Y seismic data
    In a self-contained data organization the contents and ordering of information within the file are defined and fixed. This is not to say that different companies might not come up with their own versions of a particular format, as has been the case with SEG-Y data. For each of these variations there exists a definition of what information is to be stored and where it is to be stored within the file.
  • Self-Describing: Example–Many companies’ proprietary formats
    In a self-describing data organization, certain fields or information within the file imply what data are in the file and how they are ordered. Routines that access this kind of information typically read some of the information to understand what is next within the file before reading another portion of data.
  • Database
    In a database organization the data and information about the data are stored in different files instead of being kept together in a single file. Information within each group is often organized in a fashion similar to either self-contained or self-describing format structures. Databases are most often proprietary, and their organizational structure is unknown to the outside world. Data access is typically accomplished through a layer of linking software (a data access utility) that has an intimate knowledge of the database structure.

Aspects of a general solution to data access
What are the necessary features of any general purpose data access solution? Above all, the solution should be capable of handling existing formats, modifications to existing formats, and new data formats, without requiring any additional software development or software enhancements. The main components of the solution are:

  1. A general method of describing data file formats. It is known that information is stored as individual parameters or groups of numbers. For a given type of data, these parameters and groups of numbers are mostly the same except for how they are stored. By establishing a method for describing file formats one does not have to create a different piece of software for accessing every type of data file format.
  2. A set of tools for inputting, storing and retrieving data file format information. By setting up such a set of tools it becomes possible to store and then later use the information describing the formats.
  3. A data access system that:
    • reads the information that describes the data file formats.
    • uses the description of the format to access the data file.
    • can read or write in any format for which a description exists.
    • can access data from anywhere within a file independent of the order it is stored.

With these developments it is possible to store descriptions of all the data file formats of interest and use these to convert from one format to another, or to use the data access methods within an application to read data independent of the file format.

Format templates and format template building blocks
A format template is a set of rules and specifications describing a particular data structure or organization. Even though the methods of data organization discussed are quite different, each method organizes data into multiple groups of related information. The main difference between these organization structures is the way the groups are stored and accessed. The information within each group is a collection of fields. Each field can contain either (a) values unrelated to other fields (for example, the value “sampling rate”), or (b) information which describes certain properties of other fields (for example, one field contains the value representing the length of another field). The fundamental building blocks for the format templates then are these groups and fields, along with a number of attributes associated with their definition.

The following list briefly describes the elements of a format template:

  • Field
    A Field is a unit of information or a single value. As previously mentioned,
    it can contain a value unrelated to other fields, or can contain information
    used to describe certain attributes of another field. It is described by such
    attributes as name, position, data type, access and field keywords.
  • Group
    A Group is a structured unit containing a certain number of related Fields. The Group can be used to describe itself or to describe another Group (for example, a trace header group is used to describe a trace). The main difference between formats is the way Groups are related to one another. Among the attributes used to describe a Group is a list of the Fields of which it is composed.
  • Field Keyword
    The Field Keyword is a description of the interrelations between different Fields. As mentioned above, the values stored within some Fields are used to define the attributes of other Fields.
  • Access Keyword
    The Access Keyword is a description of the desired internal representation of the value contained within the Field, and its meaning to the application. It is used as a bridge between external (encoded into a file) and the desired internal representation of the Field value. Using the Access Keyword, the application requests the value (or a set of values) stored within the data structure. The application does not need to know where and how the requested value is stored. The value will always be received in the desired internal format.
  • Overlay Box
    An Overlay Box is a collection of all the possible Access Keywords for a given type of data along with their values, whether or not the particular format template has Fields associated with every keyword. The Overlay Box contains all of the information which is considered to be essential for a proper data description. It is essential for data reformatting purposes. When a target format requires more information than is provided in a source format, the missing elements are supplied through the Overlay Box. Also, using the Overlay Box, needed information can be passed into an application that uses a format-template-based data access system.
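
As a minimal sketch of how these elements might fit together, the Python fragment below models a Field, a Group, and a format template with an Overlay Box. The class names, the toy 8-byte trace header layout, and the keyword names are assumptions made for illustration only; they are not the authors' implementation and do not correspond to any standard format. The Field Keyword is omitted for brevity.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str            # human-readable field name
    offset: int          # byte position of the value within its Group
    fmt: str             # struct code describing the external encoding
    access_keyword: str  # what the value means to the application


@dataclass
class Group:
    name: str
    fields: list

    def decode(self, raw: bytes) -> dict:
        """Return {access_keyword: value} for every Field in this Group."""
        return {
            f.access_keyword: struct.unpack_from(f.fmt, raw, f.offset)[0]
            for f in self.fields
        }


@dataclass
class FormatTemplate:
    name: str
    groups: dict                                     # group name -> Group
    overlay_box: dict = field(default_factory=dict)  # keyword -> fallback value

    def get(self, group_name: str, raw: bytes, access_keyword: str):
        """Fetch a value by Access Keyword; the caller never sees offsets."""
        values = self.groups[group_name].decode(raw)
        if access_keyword in values:
            return values[access_keyword]
        # The format has no Field for this keyword: use the Overlay Box.
        return self.overlay_box[access_keyword]


# Hypothetical toy header: int32 trace number, int16 sample count,
# int16 sample interval, all big-endian.
toy_template = FormatTemplate(
    name="toy",
    groups={
        "trace_header": Group(
            "trace_header",
            [
                Field("trace number", 0, ">i", "trace_number"),
                Field("number of samples", 4, ">h", "sample_count"),
                Field("sample interval", 6, ">h", "sample_interval"),
            ],
        )
    },
    overlay_box={"cmp_number": 0},  # not stored in the toy format
)

raw_header = struct.pack(">ihh", 7, 1500, 4)
trace_number = toy_template.get("trace_header", raw_header, "trace_number")
cmp_number = toy_template.get("trace_header", raw_header, "cmp_number")
```

In this sketch the application asks only for "trace_number" or "cmp_number"; whether the value comes from the file or from the Overlay Box is decided by the template.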

Direct data access
The format templates provide the necessary information to understand, access sequentially, and manipulate information within a given file. How do we then directly access information in any order from the file without creating an intermediate or re-sorted data set?

  • Pointer Structure
    Data files are generally composed of related groups of numbers known as records or, in the case of seismic data, traces. In order to access information from a specific record or trace, the exact beginning position or address must be known. A pointer structure containing the addresses of the beginning of each trace within a particular file can be derived from the data file structure (format template). Once these are known any trace within a file can be accessed.
  • Access Files
    Records within a file can be grouped together in a number of different orders based on aspects of acquisition or processing geometry. Each of these different groupings and orderings of traces is a single “access” of the data. To process the data in a particular order without re-sorting the entire file, all the records belonging to a particular grouping must be known. This information can be obtained at the initial access of a file by reading the record (trace) headers and storing the trace numbers associated with each grouping into a separate file known as an “access file”. By creating one Access File for each different access of the data file, new accesses can be defined at any time. Once the number of the record in the file is known, the beginning address can be obtained from the Pointer Structure.
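
The two structures above can be sketched as a single sequential pass over the file. The fragment below is an illustration under stated assumptions: the 12-byte header (shot number, CMP number, sample count) is a hypothetical toy layout rather than a real format, traces have variable length so their addresses cannot be computed by multiplication, and the access files are kept as in-memory dictionaries instead of being written to disk.

```python
import io
import struct
from collections import defaultdict

# Hypothetical toy trace header: shot number, CMP number, sample count.
HEADER = struct.Struct(">iii")


def make_toy_file(traces):
    """Build an in-memory data file from (shot, cmp, samples) tuples."""
    buf = io.BytesIO()
    for shot, cmp_number, samples in traces:
        buf.write(HEADER.pack(shot, cmp_number, len(samples)))
        buf.write(struct.pack(f">{len(samples)}f", *samples))
    buf.seek(0)
    return buf


def scan(stream):
    """Read every trace header once; return (pointers, access_files)."""
    pointers = []  # pointers[i] = byte address where trace i begins
    access = {"shot": defaultdict(list), "cmp": defaultdict(list)}
    stream.seek(0)
    while True:
        address = stream.tell()
        raw = stream.read(HEADER.size)
        if len(raw) < HEADER.size:
            break  # end of file
        shot, cmp_number, n_samples = HEADER.unpack(raw)
        trace_number = len(pointers)
        pointers.append(address)
        # Record which traces belong to each gather, for each access type.
        access["shot"][shot].append(trace_number)
        access["cmp"][cmp_number].append(trace_number)
        stream.seek(4 * n_samples, io.SEEK_CUR)  # skip the float32 samples
    return pointers, {name: dict(groups) for name, groups in access.items()}


toy_file = make_toy_file([
    (1, 10, [0.0, 1.0]),
    (1, 11, [2.0]),
    (2, 11, [3.0, 4.0, 5.0]),
    (2, 12, [6.0]),
])
pointers, access_files = scan(toy_file)
```

Only headers are read during the scan; the sample data are skipped, which is why building a new access is much cheaper than re-sorting the file.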

Components of a general purpose data access system
Using the format template method for defining individual file structure, Pointer Structures for capturing the position of data within a file, and Access Files for determining the geometry for accessing the data, we can now consider the main components of a general purpose data access system using the format template approach.

  • Template Library — This is the storage warehouse of all the data format descriptions (format templates) of interest.
  • Template Editor — This contains the facilities for development and manipulation of the format templates. The functionality of the Template Editor includes (a) creating new format templates (b) editing existing format templates, (c) saving old templates into a format template library, and (d) loading a pre-existing format template library.
  • Data Access System — This is the portion of the software that collects all the information needed on the data file to be accessed and allows for information to be passed transparently from the initial to the target location without direct knowledge of the initial data file format.
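
A Template Library with the four Template Editor operations might look like the following sketch. The dictionary-based template layout, the JSON storage, and all names and byte offsets are assumptions chosen for illustration; the source does not specify how templates are stored.

```python
import copy
import json
import os
import tempfile


class TemplateLibrary:
    """Hypothetical store of format templates, keyed by template name."""

    def __init__(self, templates=None):
        self.templates = dict(templates or {})

    def create(self, name, fields):
        """Create a new template from a list of field descriptions."""
        self.templates[name] = {"fields": fields}

    def derive(self, base, name, **new_offsets):
        """Copy an existing template, moving the named fields to new offsets."""
        template = copy.deepcopy(self.templates[base])
        for f in template["fields"]:
            if f["name"] in new_offsets:
                f["offset"] = new_offsets[f["name"]]
        self.templates[name] = template

    def save(self, path):
        """Write the whole library to a JSON file."""
        with open(path, "w") as fh:
            json.dump(self.templates, fh, indent=2)

    @classmethod
    def load(cls, path):
        """Load a previously saved library."""
        with open(path) as fh:
            return cls(json.load(fh))


library = TemplateLibrary()
library.create("standard", [
    {"name": "trace_number", "offset": 0, "type": "int32"},
    {"name": "cmp_number", "offset": 20, "type": "int32"},
])
# A company variant that stores the CMP number somewhere else.
library.derive("standard", "company_variant", cmp_number=180)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "templates.json")
    library.save(path)
    reloaded = TemplateLibrary.load(path)
```

Deriving a variant leaves the base template untouched, so existing processing sequences that rely on it keep working.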

Using seismic data as an example, all of the different accesses (ways of grouping, or gathers) that will be desired from the data file are specified, along with the appropriate format template. The first time the data file is opened the system uses this information to find the byte addresses of all traces within the file, and the number within the file of each trace in every gather. All of this can be used to fill in the following information structures:

  • Access Types — The name for each type of access such as Common-Midpoint or Common-Shotpoint.
  • Pointer Structure — The physical disk locations of each trace within the file.
  • Grouping Structures — The fields within the trace headers that describe the groupings or gathers. For each type of access, an access file can be written which contains the trace numbers within the data file for every trace in each of the gathers associated with that access.

All the information needed to access any trace, or grouping of traces, without the need for them to be sequential, is now in place. When a particular gather is requested, the numbers of all the traces within the gather are read from the associated access file, the byte addresses within the file of each trace are found in the pointer structure, and then all the traces are read from the data file. The gather can be passed to the calling application in the requested data type and format.
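
The retrieval step can be sketched as follows, assuming a pointer structure and an access file are already available. The 8-byte header (trace identifier, sample count), the function names, and the hand-written CMP access table are hypothetical and serve only to show the lookup sequence: gather key, then trace numbers, then byte addresses, then direct reads.

```python
import io
import struct

# Hypothetical toy trace header: trace identifier, sample count.
HEADER = struct.Struct(">ii")


def read_trace(stream, address):
    """Seek directly to a trace and return (trace_id, samples)."""
    stream.seek(address)
    trace_id, n_samples = HEADER.unpack(stream.read(HEADER.size))
    samples = struct.unpack(f">{n_samples}f", stream.read(4 * n_samples))
    return trace_id, list(samples)


def read_gather(stream, pointers, access_file, key):
    """Return the traces of one gather, in the order the access file lists them."""
    return [read_trace(stream, pointers[number]) for number in access_file[key]]


# Build a small in-memory file, recording each trace's start address.
stream = io.BytesIO()
pointers = []
for trace_id, samples in [
    (100, [0.0, 1.0]),
    (101, [2.0]),
    (102, [3.0, 4.0, 5.0]),
    (103, [6.0]),
]:
    pointers.append(stream.tell())
    stream.write(HEADER.pack(trace_id, len(samples)))
    stream.write(struct.pack(f">{len(samples)}f", *samples))

# Access file for a CMP ordering: gather key -> trace numbers in the file.
cmp_access = {10: [0, 3], 11: [2, 1]}

gather = read_gather(stream, pointers, cmp_access, 11)
```

The traces of gather 11 are neither adjacent nor in file order, yet no intermediate sorted copy of the data is created.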

Conclusions

Using the data access concepts described here, reformatting, re-sorting, and manipulating data from files spanning a wide range of data formats can be accomplished within a single body of software. Modifications to existing formats can be handled easily by editing previously developed format templates without disrupting existing processing sequences or work flows. New data formats can be incorporated just as easily by developing new format templates and storing them for reuse. These data access methodologies will allow applications to read and write data directly in a wide variety of formats without the need to develop internal or proprietary ones of their own. The direct access capabilities of such a system will be essential for processes such as interactive 3-D AVO, in which re-sorting and duplicating huge data volumes is not an option.