4 Pillars of Analytics

Data acquisition, processing, surfacing and actioning are key to an effective analytics initiative.

There are four key pillars that rule the lifecycle of analytics projects, from data acquisition and processing to surfacing and actioning on the data. What is required from an analytics professional is very domain dependent.

Types of data

With respect to data acquisition, we can consider four main types of data: clickstream, databases, APIs and logs. Each has its own challenges and its own ways of handling data collection.

Clickstream data is generally obtained through integration with a tool such as Google Analytics or Adobe Analytics. The default approach is to extract raw data from such a source through Google BigQuery (for Analytics 360); if this is unavailable, open source tools such as Snowplow or Divolte can help collect raw clickstream data onto a big data platform.

The role of the analytics practitioner here is to define the metrics for collection, set up goals within the analytics tools, help set up extra logging, analyze user paths, and handle experiment setup and deep dives.

Databases are normally the source of internal system information that needs to be persisted. Datasets are usually queried and extracted either as a snapshot or as a sequence of events, depending on the form of the data.

The role of the analytics practitioner with respect to data acquisition in this domain is to model and structure the data that needs to be exported from these databases. The practitioner works with the engineering team on capturing essential information within these data streams, or on making attributes directly available, to make data collection a more efficient endeavour.

Architecture for Data Collection

Clickstream data acquisition: Analytics practitioners trying to acquire new clickstream data typically define new events or attributes to be collected within the tag management system.

Database data acquisition: In order to ingest the information for analytics purposes, production databases often need to be replicated, and operations need to be performed on them to extract the right kind of information.

API data acquisition: Calls to external servers need to be made; in this case, a worker needs to be built that calls the different API endpoints and structures the data for ingestion.

Log data acquisition: Logs are typically shipped onto an event bus; once there, they can be pushed to a big data platform, or potentially a database, through a data sink connector.

Technology Knowledge

Each of these data domains requires a specific type of knowledge in order to fully execute a data acquisition process.

Clickstreams: For obtaining clickstream data, JavaScript knowledge, especially of jQuery and tag management systems, is useful in order to define what types of events or attributes need to be ingested by the systems.

Databases: In order to extract data from a database, thorough knowledge of SQL is needed; for more advanced operations, knowledge of ETL tools such as Airflow is good to have (a minimal scheduling sketch follows this list).

APIs: Knowledge of how to interact with APIs, including authorization, SOAP and REST, as well as general programming knowledge, is needed to interface with these types of data sources (see the worker sketch after this list).

Logs: Interacting with log data tends to be a bit more technical than the other data sources mentioned before.
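To make the database extraction pattern above concrete, here is a minimal sketch of a daily snapshot export scheduled with Airflow. The DAG name, the callable body and the target system are illustrative assumptions, not part of the original article.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def export_snapshot(**context):
    """Placeholder callable: query the replicated database and write a
    daily snapshot to the big data platform. The actual query and the
    target location depend entirely on the systems in use."""
    # e.g. SELECT * FROM orders -> parquet files on the data lake
    pass


with DAG(
    dag_id="orders_daily_snapshot",   # hypothetical DAG name
    start_date=datetime(2018, 12, 1),
    schedule_interval="@daily",       # one snapshot per day
    catchup=False,
) as dag:
    export = PythonOperator(
        task_id="export_orders",
        python_callable=export_snapshot,
    )
```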
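And a minimal sketch of the API acquisition worker described under Architecture for Data Collection, assuming a hypothetical REST endpoint with token-based authorization; in a real pipeline the structured records would be published to an event bus or written to the big data platform rather than printed.

```python
import json
import os

import requests

# Hypothetical REST endpoint and token-based authorization.
API_URL = "https://api.example.com/v1/orders"
API_TOKEN = os.environ.get("API_TOKEN", "")


def fetch_records(page: int) -> list:
    """Call the API endpoint and return one page of raw records."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"page": page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["results"]  # assumed response shape


def structure(record: dict) -> dict:
    """Reshape a raw API record into the schema expected downstream."""
    return {
        "order_id": record["id"],
        "amount": float(record["amount"]),
        "created_at": record["created_at"],
    }


if __name__ == "__main__":
    for raw in fetch_records(page=1):
        # Printing stands in for the hand-off to the ingestion layer.
        print(json.dumps(structure(raw)))
```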
Data Processing

Data processing consists of different sub-tasks performed on datasets: cleansing, combining and structuring the datasets, handling aggregation, as well as performing any additional advanced analytics processing on top of the data.

Cleansing

Data cleansing is a task everybody working in the field of analytics must do. It requires deep diving into the data, looking for potential gaps or anomalies, and trying to structure the data in a way that tackles most of the problems.

At the heart of data cleansing, a few types of operations need to be performed:

Missing values: Identify missing values and impute them when needed; when full sets of data are missing, determine how they should be treated.

Text normalization: Text fields need to be normalized across dataset(s); within free-form fields, data needs to be formatted into commonly identified words.

Categorization: Certain inputs need to be matched against certain metadata.

Id matching: Ids that differ depending on the source of information need to be matched together in order to allow merging and resolving to a single identity.

Deduplication: Certain events or data could be duplicated within a dataset; identifying these occurrences and removing them from the dataset is part of the cleaning process.

Misattribution: In some cases, some rows of the data could be misattributed to a given source.

The sketches below illustrate a few of these cleansing steps.
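First, a minimal pandas sketch covering three of these steps, missing-value imputation, text normalization and deduplication, on a made-up dataset; the column names and the median-imputation choice are assumptions for illustration, not prescriptions.

```python
import pandas as pd

# Made-up dataset exhibiting the issues listed above.
df = pd.DataFrame(
    {
        "user_id": [1, 2, 2, 3],
        "city": ["  Amsterdam", "amsterdam ", "amsterdam ", None],
        "amount": [10.0, None, None, 25.0],
    }
)

# Missing values: impute the numeric field with the median (one of
# several possible strategies; the right one depends on the data).
df["amount"] = df["amount"].fillna(df["amount"].median())

# Text normalization: trim whitespace and lowercase the free-form field
# so values referring to the same city compare as equal.
df["city"] = df["city"].str.strip().str.lower()

# Deduplication: drop rows that are exact duplicates after normalization.
df = df.drop_duplicates()

print(df)
```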

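Second, id matching: in the simplest case it reduces to joining through a mapping between source-specific ids. A sketch assuming a made-up mapping table:

```python
import pandas as pd

# Events keyed by a clickstream cookie id, CRM data keyed by customer id.
events = pd.DataFrame({"cookie_id": ["a1", "b2"], "page": ["/home", "/cart"]})
crm = pd.DataFrame({"customer_id": [10, 20], "segment": ["new", "loyal"]})

# Made-up id mapping resolving both ids to a single identity.
id_map = pd.DataFrame({"cookie_id": ["a1", "b2"], "customer_id": [10, 20]})

# Resolve identities by joining through the mapping table, so events and
# CRM attributes can be merged on one identity.
resolved = events.merge(id_map, on="cookie_id").merge(crm, on="customer_id")
print(resolved)
```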