Although structured data remains the backbone for many data platforms, increasingly unstructured or semistructured data is used to enrich existing information or to create new insights. Amazon Athena enables you to analyze a wide variety of data. This includes tabular data in comma-separated value (CSV) or Apache Parquet files, data extracted from log files using regular expressions, and JSON-formatted data. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Amazon Redshift suggests to define primary key or foreign key constraint wherever applicable. But they are information only. Redshift does not enforce these constraints. In other words, a column with primary key accepts duplicate values as well as a foreign key column also allows such a value that does not exists in the referenced table. So why does Redshift consider defining primary or foreign key as their best practice list? Because query optimizer uses them to choose the most suitable execution plan. But you need to be very careful while defining these constraint in your table. Let us see why with some real-life example.
At Sqreen we use Amazon Kinesis service to process data from our agents in near real-time. This kind of processing became recently popular with the appearance of general use platforms that support it (such as Apache Kafka). Since these platforms deal with the stream of data, such processing is commonly called the “stream processing”. It’s a departure from the old model of analytics that ran the analysis in batches (hence its name “batch processing”) rather than online.