Data Flow, the new export solution from Piano Analytics, offers three main ways of exporting data. The first of these, "Scheduled Exports", lets you configure your exports by selecting their content and target storage, and push all tracked events every 15, 30 or 60 minutes in a single export!
How Data Flow scheduled exports work
A dedicated interface allows you to configure your exports. Among other things, you will be able to:
- Multisite: select one site, several sites, or all the sites you are entitled to.
- Properties: choose the properties (standard and custom) that make up your export.
- Format: choose the export format (CSV, JSON or Parquet) and the field separator.
- Schedule: choose the export frequency: 15, 30 or 60 minutes.
- Storage: choose the destination of the export: sFTP or Amazon S3 (Azure and, subsequently, GCP).
Access to the Data Flow interface - Scheduled exports
To connect to the Data Flow interface, go to the Export app:
From there, you can create a scheduled export directly in the export interface.
Creating a scheduled Data Flow export
Click on "Create a Data Flow export" to create a new scheduled data export.
You will then go through the following two-step interface:
STEP 1: What do you want to include in your export?
The scope of your export
Do you need information about all your existing and future websites?
Or only one or several of your sites?
The Data you want to export
Do you want to exclude non-consented data? Do you want to export all your existing and future properties, or only some properties from your data model?
STEP 2: What are the characteristics of your export?
Export stream name
Name your export stream
Choice of the export format
Choose your export format: CSV, JSON or Parquet. The configuration options change automatically for each format you select.
Note that JSON exports are actually NDJSON (Newline Delimited JSON); see http://ndjson.org/ for more information.
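Since each line of the file is one standalone JSON object, the export can be parsed line by line without loading the whole file. A minimal sketch (the file name is hypothetical):

```python
import json

def read_ndjson(path):
    """Yield one event dict per line of an NDJSON export file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank trailing lines
                yield json.loads(line)
```

This streaming approach keeps memory usage flat even on large exports, since only one event is held in memory at a time.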
Export frequency & save
Configuring Amazon S3 connections
You can find all the technical documentation to configure Amazon S3 connections here.
Configuring FTP or sFTP connections
You can find all the technical documentation to configure FTP/sFTP connections here.
We strongly advise you to send your exports to Amazon S3 rather than an FTP account: FTP is an aging protocol and does not guarantee fast transfers. In the near future you will also be able to send your exports to Google Cloud Platform and Microsoft Azure.
If you need to send your exports to an FTP server, we advise a 15 or 30 minute sending frequency so as to limit the size of the transferred files.
Availability of the Data
Some data will not be available in Data Flow. The reason is simple: these properties are either calculated "on the fly" and never stored, or they are "uncertain" properties. For example, we do not know in advance whether a visit will end up counted as a bounce, so the visit_bounce property cannot be offered in Data Flow.
The list of the properties not available in Data Flow can be consulted here.
The Data Flow export scheduling system is not based on complete periods. At each loop (every 15, 30 or 60 minutes), all new data inserted into the database since the last export is extracted. In our new processing chain (the NDF), this system is what guarantees that the complete set of your events is sent to you.
Data Flow includes a new feature: partitioning. Since files do not cover complete time periods, partitioning allows you to organize your files according to the time period they contain. This period is based on the UTC date of data collection (hit_time_utc property).
You can choose your level of partitioning:
- Date / Time
- Date / Time / Half hour
- Date / Time / Quarter hour
Let's take a concrete example. Imagine you have created an export with date/time-level partitioning. The export is called myExport and is sent to a folder named "myFolder".
The content of the table at the time of extraction is as follows:
The data spans two different hours, so it will be split into separate files. In the case of an S3 export:
The first file will contain the extracted events whose hit_time_utc property is between 10:00:00 and 10:59:59, the second one those whose hit_time_utc property is between 11:00:00 and 11:59:59. If in the next loop there is new data between 11:00:00 and 11:59:59, new files will be created in the /hour=11/ folder.
In the case of an (s)FTP upload, the logic is the same; only the naming changes. Folders cannot be created dynamically, so the file names themselves carry the period information:
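To make the partitioning concrete, here is a sketch that computes the S3 partition prefix an event lands in from its hit_time_utc. The /hour=HH/ segment appears in the example above; the exact date segment and the minute segment for half/quarter-hour levels are assumptions for illustration:

```python
from datetime import datetime, timezone

def s3_prefix(folder, hit_time_utc, level="hour"):
    """Build the Hive-style partition prefix for an event.

    The /hour=HH/ segment matches the documented S3 layout; the
    date= and minute= segment names are illustrative assumptions.
    """
    t = hit_time_utc
    parts = [folder, f"date={t:%Y-%m-%d}", f"hour={t:%H}"]
    if level in ("half", "quarter"):
        step = 30 if level == "half" else 15
        # round the minute down to the start of its half/quarter hour
        parts.append(f"minute={(t.minute // step) * step:02d}")
    return "/".join(parts) + "/"
```

For (s)FTP deliveries the same period information would be embedded in the file name instead of a folder path.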
Once your feed is created, the files generated by Data Flow are delivered directly to your Amazon S3 bucket or sFTP server. Each generated file is compressed in GZ format. File names end with a suffix managed by Snowflake, as follows:
You will find examples in the previous paragraph.
To speed up processing, Snowflake executes the query in several parts on several machines, so a single export generated at a time T can produce several files.
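Because one export can arrive as several GZ-compressed files, a consumer typically decompresses and concatenates them. A minimal sketch, assuming the files contain NDJSON and have been downloaded to a local folder:

```python
import gzip
import json
from pathlib import Path

def iter_export_events(folder):
    """Yield events from every .gz NDJSON file of one export, in name order."""
    for path in sorted(Path(folder).glob("*.gz")):
        # "rt" opens the gzip stream in text mode for line-by-line reading
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```

Since the split across files is a Snowflake execution detail, no meaning should be attached to which file an event lands in; treat the set of files from one generation as a single batch.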
For CSV exports, we recommend basing your processing on the column headers rather than on column indexes, as we cannot guarantee a fixed order of the properties.
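Header-based access can be done with a dictionary reader, which makes the code immune to column reordering. A sketch, in which the property name visit_id is purely hypothetical and the delimiter depends on the separator chosen in the export setup:

```python
import csv
import io

def visit_ids(csv_text, delimiter=";"):
    """Read a CSV export by header name, never by column index.

    'visit_id' is a hypothetical property name for illustration;
    the delimiter must match the separator configured in the export.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [row["visit_id"] for row in reader]
```

Because rows are looked up by header, the same code works even if a later export delivers the columns in a different order.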
In JSON exports, properties with a null value are simply omitted from the affected event.
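Because a null-valued property is absent rather than present with a null, events should be read defensively. A small sketch (the property names are hypothetical):

```python
import json

def property_or_none(event_line, prop):
    """Return a property's value, or None when it was null and thus omitted."""
    return json.loads(event_line).get(prop)
```

Using .get avoids a KeyError on events where the property happened to be null.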
After each file generation on your Amazon S3 bucket or sFTP server, you will also receive a ".report" file containing the names of all the files you have just received.
If you receive a file whose name is not listed in any delivery report, that file is incomplete and should not be taken into account.
Receiving the delivery report confirms the generation and delivery of the exports for a given time period.
E.g.: at 10:20 UTC a generation is launched, resulting in three generated data files:
Once these three files have been fully uploaded to your S3 bucket or sFTP server, a delivery report named #timestamp#_delivery.report will be sent to you; it will list the three delivered file names and confirm that you can import their content with confidence.
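A consumer can automate this check by only processing data files whose names appear in some delivery report. A sketch, assuming each *_delivery.report lists one delivered file name per line (consistent with the behaviour described above, but not an official format specification):

```python
from pathlib import Path

def validated_files(folder):
    """Split received data files into (validated, orphans).

    Assumes each *_delivery.report lists one delivered file name per
    line; files absent from every report are incomplete and must be
    ignored, per the delivery-report contract described above.
    """
    folder = Path(folder)
    reported = set()
    for report in folder.glob("*delivery.report"):
        reported.update(
            line.strip() for line in report.read_text().splitlines() if line.strip()
        )
    data_files = [p.name for p in sorted(folder.iterdir())
                  if not p.name.endswith(".report")]
    ok = [n for n in data_files if n in reported]
    orphans = [n for n in data_files if n not in reported]
    return ok, orphans
```

Orphan files can simply be re-checked on the next loop: if their delivery report arrives later, they become valid; otherwise they should never be imported.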
History generation (or regeneration) exports for past dates use the same level of partitioning as your production export. The regeneration can be delivered to a separate folder.
In order to be able to generate history from date X to date Y and to minimize the risk of duplicates or data gaps, here is the procedure to follow:
- If you have not already done so, put the export into production with the desired frequency.
- At Y+1, delete the data you have already ingested from your database, with a query such as: DELETE FROM #mytable# WHERE hit_time_utc < Y+1
- The Piano Analytics team then creates a history generation ticket for the Export team, specifying:
- Your customer account on which the production export was created
- The name of your export
- The exact period to be regenerated (X to Y)
Some production exports may contain duplicates. This is not inherent to Data Flow but to the data collection itself. These duplicates also appear in Data Query in the current day's real-time data. Duplicates are removed on D+1 in the real-time data table, but this has no impact on Data Flow files already consumed.
In case of a real-time delay, events not included in a file will be included in the following files, always based on their insertion date in the table. As a result, files may be lighter than usual for one period and then heavier for the following period(s).
AV Insights / Rich Media
To calculate your media playback durations in the AV Insights and Rich Media products, you must include in the export (if you select properties manually) the AV - Inter-Event Duration property (av_duration), which gives the duration of each Rich Media or AV Insights event.
Be careful: in production (real time), this property's value is sent with the very next event, because in streaming mode an event's duration cannot be known before the next event is received.
Durations are correctly repositioned on D+1 in the consolidated data table: if a Data Flow export is regenerated for past dates, the event durations will be correctly positioned.
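A typical use of av_duration is summing playback time per content. A hedged sketch: av_duration comes from the text above (milliseconds is assumed here), while the av_content grouping key is purely an assumption for illustration:

```python
from collections import defaultdict

def playback_totals(events):
    """Sum av_duration per content across exported events.

    'av_duration' is the AV - Inter-Event Duration property; treating
    it as milliseconds and grouping by an 'av_content' key are both
    assumptions. On real-time files the duration was reported on the
    following event, so D+1 regenerated files give accurate values.
    """
    totals = defaultdict(int)
    for e in events:
        if e.get("av_duration") is not None:
            totals[e.get("av_content", "unknown")] += e["av_duration"]
    return dict(totals)
```

Null durations are omitted from JSON events entirely, which is why the sketch reads them with .get before accumulating.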