State space models for language modelling

Today we're joined by Dan Fu, a PhD candidate at Stanford University. In our conversation with Dan, we discuss the limitations of state space models for language modelling and the search for alternative building blocks that can help extend context length without becoming computationally impractical. Dan walks us through the H3 architecture and FlashAttention, which can reduce a model's memory footprint and make fine-tuning feasible. We also discuss his work on improving language models with synthetic languages, the challenges that long sequence lengths pose for both training and inference, and the search for sub-quadratic methods that process language more efficiently than the brute-force approach of attention.
