
Big data is often stored in a data lake. While data warehouses are commonly built on relational databases and contain structured data only, data lakes can support various data types and typically are based on Hadoop clusters, cloud object storage services, NoSQL databases or other big data platforms.
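
To make that concrete, here's a minimal sketch of how raw files of mixed types might land in a data lake built on Amazon S3, using the boto3 library. The bucket name and file paths are hypothetical:

```python
import boto3

# A hypothetical bucket serving as the data lake's raw landing zone.
s3 = boto3.client("s3")
BUCKET = "example-data-lake"

# Object storage imposes no schema, so structured, semi-structured and
# unstructured files can land side by side in their original formats.
s3.upload_file("events.json", BUCKET, "raw/clickstream/events.json")
s3.upload_file("orders.csv", BUCKET, "raw/sales/orders.csv")
s3.upload_file("call_recording.wav", BUCKET, "raw/audio/call_recording.wav")
```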

Many big data environments combine multiple systems in a distributed architecture; for example, a central data lake might be integrated with other platforms, including relational databases or a data warehouse. The data in big data systems may be left in its raw form and then filtered and organized as needed for particular analytics uses. In other cases, it’s preprocessed using data mining tools and data preparation software so it’s ready for applications that are run regularly.
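
As a rough illustration of the first approach, often called schema-on-read, the following sketch uses pandas to apply structure to a raw file only at analysis time. The file layout and column names are assumptions:

```python
import pandas as pd

# Schema-on-read: the raw file stays untouched in the lake; structure and
# filtering are applied only when a particular analysis needs them.
events = pd.read_json("events.json", lines=True)  # newline-delimited JSON

# Organize just the slice this analysis requires.
purchases = (
    events[events["event_type"] == "purchase"]
    .loc[:, ["user_id", "timestamp", "amount"]]
    .sort_values("timestamp")
)
```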

Big data processing places heavy demands on the underlying compute infrastructure. The required computing power often is provided by clustered systems that distribute processing workloads across hundreds or thousands of commodity servers, using technologies like Hadoop and the Spark processing engine.
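
The sketch below, which assumes hypothetical S3 paths and column names, shows how a PySpark job expresses a rollup once and lets the cluster manager spread the work across however many executors are available:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The same code runs unchanged on a laptop or on a cluster of hundreds of
# nodes; the cluster manager decides where each data partition is processed.
spark = SparkSession.builder.appName("clickstream-rollup").getOrCreate()

# Reading a directory of files splits the data into partitions that are
# scanned in parallel across the cluster's executors.
events = spark.read.json("s3a://example-data-lake/raw/clickstream/")

# Assumes an ISO-formatted timestamp field in the raw records.
daily_counts = (
    events.groupBy(F.to_date("timestamp").alias("day"), "event_type")
          .count()
)
daily_counts.write.mode("overwrite").parquet(
    "s3a://example-data-lake/curated/daily_counts/"
)
```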

Getting that kind of processing capacity in a cost-effective way is a challenge. As a result, the cloud is a popular location for big data systems. Organizations can deploy their own cloud-based systems or use managed big-data-as-a-service offerings from cloud providers. Cloud users can scale up the required number of servers just long enough to complete big data analytics projects. The business only pays for the storage and compute time it uses, and the cloud instances can be turned off until they’re needed again.
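
As one example of that pattern, the following sketch uses boto3 to request a transient Amazon EMR cluster that runs a Spark step and shuts itself down afterward. The cluster name, instance sizes and job script path are hypothetical:

```python
import boto3

emr = boto3.client("emr")

# Request a transient cluster sized for the job; it terminates itself when
# the work is done, so compute is billed only while the job runs.
response = emr.run_job_flow(
    Name="nightly-analytics",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the last step
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-data-lake/jobs/rollup.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```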

How big data analytics works

To get valid and relevant results from big data analytics applications, data scientists and other data analysts must have a detailed understanding of the available data and a sense of what they’re looking for in it. That makes data preparation, which includes profiling, cleansing, validation and transformation of data sets, a crucial first step in the analytics process.
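
Here's a minimal pandas sketch of those preparation steps, profiling, cleansing, validation and transformation, applied to a hypothetical orders file (the column names are assumptions):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")

# Profiling: understand shape, types and missing values before analysis.
print(orders.describe(include="all"))
print(orders.isna().sum())

# Cleansing: drop exact duplicates and rows missing key fields.
orders = orders.drop_duplicates()
orders = orders.dropna(subset=["order_id", "amount"])

# Validation: enforce simple business rules, e.g. no negative amounts.
assert (orders["amount"] >= 0).all(), "negative order amounts found"

# Transformation: normalize types and derive fields the analysis needs.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["month"] = orders["order_date"].dt.to_period("M")
```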

Once the data has been gathered and prepared for analysis, various data science and advanced analytics disciplines can be applied to run different applications, using tools that provide big data analytics features and capabilities. Those disciplines include machine learning and its deep learning offshoot, predictive modeling, data mining, statistical analysis, streaming analytics, text mining and more.
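
As a small predictive modeling sketch, the following uses scikit-learn on synthetic stand-in data; in practice, the features would come out of the preparation step above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared customer-feature table with a churn label.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate how well the model ranks at-risk customers on held-out data.
probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))
```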

Using customer data as an example, the different branches of analytics that can be done with sets of big data include the following:

Comparative analysis. This examines customer behavior metrics and real-time customer engagement to compare a company's products, services and branding with those of its competitors.

Social media listening. This analyzes what people are saying on social media about a business or product, which can help identify potential problems and target audiences for marketing campaigns.

Marketing analytics. This provides information that can be used to improve marketing campaigns and promotional offers for products, services and business initiatives.

Sentiment analysis. All of the data that’s gathered on customers can be analyzed to reveal how they feel about a company or brand, customer satisfaction levels, potential issues and how customer service could be improved.
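
As a rough sketch of how sentiment scoring works, the following uses NLTK's VADER analyzer, a rule-based model suited to short social media text; the sample reviews are invented:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

reviews = [
    "The new app update is fantastic, support was quick and helpful.",
    "Checkout keeps failing and nobody answers my emails.",
]

# The compound score ranges from -1 (most negative) to +1 (most positive).
for text in reviews:
    scores = analyzer.polarity_scores(text)
    print(f"{scores['compound']:+.2f}  {text}")
```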

Big data management technologies

Hadoop, an open source distributed processing framework released in 2006, initially was at the center of most big data architectures. The development of Spark and other processing engines gradually pushed MapReduce, the engine built into Hadoop, to the sidelines. The result is an ecosystem of big data technologies that can be used for different applications but often are deployed together.

Big data platforms and managed services offered by IT vendors combine many of those technologies in a single package, primarily for use in the cloud. Currently, that includes these offerings, listed alphabetically:

Amazon EMR (formerly Elastic MapReduce)

Cloudera Data Platform

Google Cloud Dataproc

HPE Ezmeral Data Fabric (formerly MapR Data Platform)

Microsoft Azure HDInsight

For organizations that want to deploy big data systems themselves, either on premises or in the cloud, the technologies that are available to them in addition to Hadoop and Spark include the following categories of tools:

storage repositories, such as the Hadoop Distributed File System (HDFS) and cloud object storage services that include Amazon Simple Storage Service (S3), Google Cloud Storage and Azure Blob Storage;

cluster management frameworks, like Kubernetes, Mesos and YARN, Hadoop's built-in resource manager and job scheduler (YARN stands for Yet Another Resource Negotiator but is commonly known by the acronym alone);

stream processing engines, such as Flink, Kafka, Samza, Storm and the Spark Streaming and Structured Streaming modules built into Spark (see the streaming sketch after this list);

NoSQL databases that include Cassandra, Couchbase, CouchDB, HBase, MarkLogic Data Hub, MongoDB, Neo4j, Redis and various other technologies;

data lake and data warehouse platforms, among them Amazon Redshift, Delta Lake, Google BigQuery, Hudi, Kylin and Snowflake; and

SQL query engines, like Drill, Hive, Impala, Presto and Trino.
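
To illustrate the stream processing category, here's a minimal Spark Structured Streaming sketch that consumes a hypothetical Kafka topic and counts events per minute as they arrive; it assumes the Spark Kafka connector package is available on the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Consume a hypothetical Kafka topic as an unbounded table; broker address
# and topic name are placeholders.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "clickstream")
         .load()
)

# Count events per minute continuously, rather than in a nightly batch.
# The watermark bounds how long late-arriving records are waited for.
counts = (
    events.withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "1 minute"))
          .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```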
