Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open-source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is for Drill to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. Drill is an Apache top-level project.
|Developer(s)||Apache Software Foundation|
|Stable release||1.16.0 / May 2, 2019|
|License||Apache License 2.0|
Drill supports a variety of NoSQL databases and file systems, including Alluxio, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
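The cross-datastore join described above could look roughly like the following sketch. It only builds the JSON body that would be posted to Drill's REST query endpoint (`/query.json`, by default on a Drillbit's web port 8047) and does not contact a server. The storage plugin names `mongo` and `dfs`, the `app.users` collection, the `/logs/events` directory, and all column names are illustrative assumptions rather than part of any real deployment.

```python
import json

# Hypothetical names: the `mongo` and `dfs` storage plugins, the `app.users`
# collection, the `/logs/events` directory, and every column name below are
# illustrative assumptions, not taken from a real deployment.
JOIN_QUERY = """
SELECT u.name, e.event_type, e.event_time
FROM mongo.app.`users` AS u
JOIN dfs.`/logs/events` AS e
  ON u.user_id = e.user_id
""".strip()


def build_drill_request(sql: str) -> dict:
    """Build the JSON body for a POST to Drill's REST endpoint /query.json."""
    return {"queryType": "SQL", "query": sql}


if __name__ == "__main__":
    # Print the payload instead of sending it, so the sketch runs without
    # a Drill cluster. Assumed target: http://localhost:8047/query.json
    print(json.dumps(build_drill_request(JOIN_QUERY), indent=2))
```

Each table in the `FROM` and `JOIN` clauses is qualified by a storage plugin name, so a single SQL statement can reference both datastores.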
Drill's datastore-aware optimizer automatically restructures a query plan to leverage the datastore's internal processing capabilities. In addition, Drill supports data locality when Drill and the datastore run on the same nodes.
Apache Drill 1.9 added dynamic user-defined functions.
Apache Drill 1.11 added cryptography-related functions and PCAP file format support.
- All Hadoop distributions (HDFS API 2.3+), including Apache Hadoop, MapR, CDH and Amazon EMR
- NoSQL: MongoDB, Apache HBase
- Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift
- Diverse data formats, including Apache Avro, Apache Parquet and JSON
- RDBMS storage plugins (using JDBC to connect)
Several papers influenced Drill's creation and design. Here is a partial list:
- 2005 From Databases to Dataspaces: A New Abstraction for Information Management. The authors highlight the need for storage systems to accept all data formats and to provide data-access APIs that evolve with the storage system’s understanding of the data.
- 2010 Dremel: Interactive Analysis of Web-Scale Datasets