I’m trying to set up a connection from Azure Data Lake Store (ADLS) to Power BI through an HDInsight Spark cluster. My exact data pipeline is as follows:
- CSV file with time series data stored in ADLS
- From the Spark cluster, run a Jupyter notebook that pulls in the CSV data and writes it out to a table (a rough sketch of this step follows the list)
- From the Power BI cloud service, connect to the Spark cluster and query the tables created in the previous step
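
For reference, the notebook step is essentially the following (a minimal PySpark sketch; the `adl://` path and table name are placeholders, not my real ones):

```python
# Minimal sketch of the notebook step: read the CSV from ADLS and
# persist it as a table that the Power BI Spark connector can query.
# The adl:// path and table name below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-to-table").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("adl://myaccount.azuredatalakestore.net/data/timeseries.csv"))

# Save as a persistent table so it shows up when Power BI connects
df.write.mode("overwrite").saveAsTable("timeseries")
```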
I got this pipeline to work in a test environment with a different set of data, but now that I’ve switched to more realistic data, it isn’t working. Everything behaves the same up until I go to create the visualizations in Power BI: the metadata for the table is imported correctly, but when I drag a column from the column list onto the reporting canvas, no data loads. I just see the spinning dots for a few minutes, and then nothing. No errors are shown.
I’ve successfully connected to this same Spark cluster from the Power BI Desktop client and loaded the data there, so I’m confident everything is set up correctly.
As a reference:
- My test dataset: 26,729 rows x 10 columns, 2.8 MB total. Queries were a little slow, but not too bad.
- My current dataset (the one timing out): 104,274 rows x 2 columns, 6.5 MB total.
So the current dataset has about four times as many rows and is a little over twice the size of the test data I was working with. Is that really big enough to hit such drastic performance limits?