DMA too slow for Fastino #1423
Comments
I believe this is a symptom of the more general issue #946.
@jordens we discussed getting data on RTIO/DMA throughput for Kasli. Were there any measurements in particular you wanted?
I am pretty sure that the optimization I discussed in #946 will yield good results, and getting measurements should not be in the critical path - though it is good to make them at some point to quantify the improvement.
@pathfinder49 you can also try with the core analyzer disabled in the gateware. That should speed things up a decent amount (see #946).
@hartytp You should make the measurements that tell you whether your use case will be limited by RTIO fabric/DMA throughput. You know best what those are.
I would describe this as just a design flaw in Fastino, if I understand correctly. If you want Kasli to stream 32 channels of 16-bit data at 2.5 MS/s, this is 160 MB/s sustained data transfer for EACH Fastino card, using a bus that is shared for many other ARTIQ purposes, and with latency constraints that are much more stringent than for typical computing applications. It seems to me -- and please correct me if I am wrong -- that modifications to DMA may provide a bit of a patch, or adding much larger buffers to reduce the impact of all the other bus traffic on the ability to guarantee samples at the DAC on time, but that really what one should consider is having a dedicated SDRAM on Fastino for waveform playback, using a dedicated bus that is optimized for the task, and properly sized buffering queues on the Fastino FPGA to allow for memory refresh. Then have the DMA recording process place samples into the Fastino memory rather than Kasli memory.
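For reference, the 160 MB/s figure quoted above follows directly from the channel count, sample width, and update rate. The sketch below only reproduces that arithmetic; the function name is hypothetical and the calculation counts raw sample payload, with no allowance for RTIO timestamps, addressing, or other per-event overhead.

```python
def stream_rate_bytes_per_s(channels: int, bits_per_sample: int, sample_rate_hz: int) -> int:
    """Raw payload rate in bytes per second (assumes no protocol overhead)."""
    bytes_per_frame = channels * bits_per_sample // 8  # one sample on every channel
    return bytes_per_frame * sample_rate_hz


# Figures quoted in the comment above: 32 channels, 16-bit samples, 2.5 MS/s.
rate = stream_rate_bytes_per_s(channels=32, bits_per_sample=16, sample_rate_hz=2_500_000)
print(f"{rate / 1e6:.1f} MB/s per Fastino card")
```

Using the 2.55 MS/s rate mentioned elsewhere in this thread instead of 2.5 MS/s raises the result slightly, to about 163 MB/s.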
You are wrong. The limitation is definitely not Fastino. Fastino has no problem handling all the samples you throw at it. Hardware is not the limitation: the SDRAM on Kasli can already sustain an order of magnitude more than 160 MB/s.
That's not just "a bit of a patch", I expect the improvement to be very significant. For linear access and a fine-tuned command pattern, one can use ~99% of the peak I/O bandwidth of a SDRAM, i.e. the Kasli SDRAM could do ~15.8 Gbps on Kasli 2.0 with the -3 speed grade FPGA, and more if we do not phase-lock the SDRAM clock to the CPU clock (not doing it increases latency and complexity). Not counting RTIO-DMA overhead and analyzer writeback (maybe the analyzer could ignore Fastino data channels?), a Fastino at maximum speed is just 1.3 Gbps.
Out of interest, how do you expect DMA to perform on Kasli-SOC? Will #946 also be applicable to Kasli-SOC?
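To put the two numbers in that comment side by side, here is a rough sketch comparing one Fastino's raw data rate with the ~15.8 Gbps SDRAM figure. The constant names are illustrative only, and, as in the comment, RTIO-DMA overhead and analyzer writeback are not counted, so the result is an upper bound on headroom rather than a measured throughput.

```python
# Values taken from the thread; names are hypothetical.
CHANNELS = 32
BITS_PER_SAMPLE = 16
SAMPLE_RATE_HZ = 2_550_000      # Fastino update rate per channel
SDRAM_USABLE_BPS = 15.8e9       # ~99% of peak, as estimated in the comment above

# Raw sample payload only: no RTIO-DMA overhead, no analyzer writeback.
fastino_bps = CHANNELS * BITS_PER_SAMPLE * SAMPLE_RATE_HZ
fraction = fastino_bps / SDRAM_USABLE_BPS

print(f"{fastino_bps / 1e9:.2f} Gbps per Fastino")
print(f"{fraction:.1%} of the estimated usable SDRAM bandwidth")
```

On these assumptions a single Fastino would take less than a tenth of the SDRAM bandwidth, which is consistent with the claim that the memory itself is not the bottleneck.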
Like this: https://www.embeddedrelated.com/showarticle/988.php
The #946 optimizations are not applicable to Zynq, and would result in the same or higher performance than Zynq can possibly achieve (at least without adding a dedicated SDRAM chip for fabric DMA, which sounds complicated and expensive). I suspect higher. FPGA fabric is efficient at moving parallel/pipelined data around, and allows the fine-tuning of low-level SDRAM command sequences - there is no clear advantage to the hard SDRAM controller used in Zynq and plenty of disadvantages (it is sometimes faster to design something from scratch in the fabric than get the quirky/buggy Zynq hardware to behave).
I can't seem to find how to disable the core analyser. Could someone please give me a hint?
OK, I stand corrected.
I'm not trying to "peddle" Shuttler. It's aimed at a different use case than Fastino, and I am not trying to suggest that people should choose Shuttler instead. I agree that the problem of feeding the DACs their samples will be even worse on Shuttler than Fastino. And maybe the answer for Shuttler is to stream reduced-representation samples out from a Kasli over EEM, and have the FPGA on Shuttler just be in charge of turning that into DAC samples in whatever the appropriate manner is (CIC, spline, etc). This would certainly allow Shuttler to reuse/build on developments made for Fastino, which is good.
I agree that improving general RTIO DMA is much more useful than doing some specific optimized use case. I didn't understand that the proposal from @sbourdeauducq in #946 (comment) would have such a dramatic impact as to completely alleviate all concerns for running multiple Fastinos off a single Kasli. This changes my thoughts on whether Shuttler would need its own DMA.
I would contend while the hardware layout/debugging may be a "unique success story", this label seems inconsistent with the issue that started this post, namely the inability to run even a single one of the 32 output channels at the spec'ed max update rate. But if making significant improvements to RTIO DMA fixes this completely, then sure, it has the potential to be a "unique success story". Likewise, once a PHY exists to generate the samples from reduced representations rather than needing to stream them from memory, the issue would be solved. According to @hartytp there is no funding for the required improvements to RTIO DMA -- again, please correct me if I am wrong here. So it remains to be seen when Fastino can actually realize its potential, right? The PHY for sample generation has been funded by Hannover, is that right?
Yes I agree, and this may be the way to do Shuttler, for example.
I have been running 32 channels with full update rate just fine. This has been shown. The noise measurements have been done using this mode. There is not a single aspect of the Fastino project that would prevent this. Fastino can fully realize its potential. Therefore it's completely accurate to label it a unique success story. Maybe your confusion is just a lack of knowledge of how Fastino works: The only mode of operation currently is to constantly stream all 32 channels at 2.55 MS/s from Kasli. Fastino either works at full rate or it doesn't work at all. The fact that RTIO and DMA would generally limit high event rates was clear months ago (in fact years since #946). It was clearly communicated and explicitly accepted that feeding a (any) RTIO PHY with arbitrary waveforms on many channels would likely not work. But this is in no way implied or triggered by or attributable to Fastino. You can just use the labels we place on these issues for this purpose. There is no secret or back room dealing going on.
OK. Then what's the difference with what @pathfinder49 is doing? And yes, I have a lack of knowledge about Fastino, I have not been following the development due to bandwidth limitations. My apologies for coming in late and not having all the background story.
Sure, not debating that, and I was not trying to say this was "caused" by Fastino. But it seemed from this initial post like it was not possible to run Fastino by streaming DMA samples, and I had the impression from @hartytp in an email yesterday that the funding was not in place to address this issue.
He is trying to generate arbitrary samples using RTIO and/or DMA on Kasli.
If it's not attributable to Fastino, and if it furthermore can and should be resolved elsewhere, then it's certainly wrong to claim that there is a design flaw in Fastino.
That's correct for this literal use case. Maybe the wide RTIO interface addresses it, maybe it won't. The interpolator will certainly address it for its use cases.
Indeed. Not to mention that #946 was an issue long before Fastino/Shuttler/anything else. DMA throughput is already a significant bottleneck. You just start to notice it even more when you try to do things like fast shuttling. Anyway, fun as it's been, I think this is a duplicate of #946 so closing.
Indeed. But issues are still likely if one wants to update, say, 10 or more DAC channels more or less simultaneously for any length of time at anything close to the max rate the hardware can support. As you say though, that's not new or surprising.
Bug Report
One-Line Summary
Updating multiple Fastino channels @2.55 MS/s via DMA results in RTIO underflow.
Issue Details
Using the Fastino single-channel update functionality, the maximum sample rate cannot be achieved on all channels. Without DMA, Kasli can only update a single channel at ~1.3 MS/s. Using DMA I found the following:
Steps to Reproduce
The experiment below demonstrates the bug. Underflow time was determined by measuring the Fastino output and/or finding after which sequence length underflows stopped occurring.
Your System (omit irrelevant parts)