Let me set the scene.
I’m three months into a military-fantasy campaign. My character is a battlefield medic embedded with a greenskin warband — completely against her will, obviously; this is a hostage situation with increasingly complicated feelings about it. Grak, the orc warlord I’ve been running this campaign around, is a study in controlled brutality. Very little wasted movement. Economy of speech. The kind of character who communicates most clearly when she’s not talking at all.
We’re at a pivotal scene. The night before a major assault. I’ve been trying to convince her that the plan is suicide. She’s been listening. Letting me make my case.
There was an image from earlier in the scene — Grak standing in the torchlight, warband assembled behind her. I hit generate video on it.
Five seconds. She turns to face the assembled warband, raises a war cry, and the entire camp answers her back. The sound of it — multiple voices, the crackle of fire, the metallic clatter of weapons — cuts through my headphones like I’m actually standing there.
The argument I was making evaporated. My character shut up and stood at attention. Because what am I going to say after that.
I sat there for a solid minute just processing.
okay so ai video with sound is a thing now
I knew Soulkyn had video generation — you take any image of your character and generate a video from it. 5-10 second AI clips. I’d used it before. Take a fight scene image, generate video, get a quick animation. Cool feature. Useful.
What I didn’t fully register until that orc moment: the video model has sound built into it.
Not added in post. Not a separate audio layer. The model generates video AND audio as one thing — ambient sound, speech, music, effects, whatever fits the moment. Soulkyn’s running LTX-2.3, a 22-billion-parameter model they host and train themselves. Not a third-party API they’re reselling. Their own infrastructure, their own training.
The practical result is that when you generate a video from a scene image, you get a clip where the sound belongs to the image. Grak’s war cry sounds like it’s coming from the people in the frame. Torchlight scenes sound like torchlight scenes. A storm sounds like a storm, not like a stock audio bed dropped on top of stock footage.
It’s a small technical distinction that makes an enormous experiential difference.
the thing about immersion that sound unlocks
Here’s what I’ve been trying to articulate since that session: images in roleplay fill in the visual gap, but they don’t close the loop on presence. You can look at an image and still feel like you’re reading a book with illustrations. It’s evocative but it’s still somewhat external.
Sound does something different to your brain. It’s the thing that makes you forget you’re sitting at a desk.
We’re conditioned to this from horror movies — why do you think half the scares land on sound cues? Why does removing audio from a scary scene make it seem almost funny? Sound triggers a different kind of attention. More automatic. Harder to intellectualize your way past.
When Grak screams and the warband screams back and it comes through your headphones, that’s not immersion as a metaphor. That’s your autonomic nervous system going oh, something is happening.
I cannot overstate how different the scene felt. Same character I’ve been running for three months. Same campaign. One video clip with ambient crowd noise and I’m actually in the camp.
different scenarios where this is going to break me repeatedly
Let me just think through the use cases because my brain has been doing this since that session and I might as well write it down.
horror campaigns: The slow sounds of something moving in the walls. The AI companion in your gothic-manor investigation generating a five-second clip of the empty corridor — and the sound of it is wrong in a way you can’t name. Wind that doesn’t quite match the movement. A door settling too loudly. Horror has always been about what you can almost-hear. Video with ambient sound is perfect for this. I am going to be absolutely destroyed.
sci-fi / space opera: Your companion character showing you a transmission from the enemy fleet. The buzz of comms static, the translated speech over it, the echo of a ship interior you’ve never been on. Establishing alien environments through sound before your character has even seen them. The hum of an FTL drive powering down. The specific way silence sounds in a vacuum-seal chamber. My science-fiction RP brain is going completely feral.
fantasy battles: Already covered this with Grak, but: war cries, clashing weapons, the acoustic difference between a forest ambush and an open field engagement. A character showing you what happened to a village before you arrived. Five seconds of aftermath sound. That’s going to hit.
romance / intimate scenarios: And okay, yes, this one too. (Soulkyn does NSFW, it’s an adult platform, it’s relevant.) A character playing you a song they wrote. Not describing the song — generating it, or something close to it, so you can actually hear it. The difference between reading “they sang softly” and actually hearing something. I have not tested this yet but I have a minstrel character in a separate campaign and I am scheduling this interaction for this weekend.
the sound is doing memory work too
One thing I’ve noticed in longer campaigns: sound signatures become associated with characters. When Grak is about to do something decisive, there’s a quality to the ambient audio in her clips — quieter, more deliberate — that I’ve started to pattern-match before anything actually happens. It’s Pavlovian and I’m embarrassed about it and I don’t care.
Soulkyn’s memory system tracks everything — full unlimited history, the AI knows the entire campaign arc. That context enriches what the video model produces from the image. I chose to generate the video at the exact moment my character’s argument peaked, and the result felt like it knew the scene. The sound fit what Grak would choose to communicate.
Choosing the right image at the right moment in the story — that’s where the video feature becomes more than a tech demo. The model works with what the image gives it, and an image from a charged moment produces a charged video.
what multi-character scenes are going to do with this
Soulkyn’s group chat goes up to three AI characters simultaneously on the Deluxe tier. I’ve used this for ensemble casts — running a whole party instead of a solo protagonist.
The video + sound feature in a multi-character scene is something I haven’t fully broken yet but have started poking at. Imagine a scene where two characters in your party disagree about strategy. You generate a video from one character’s perspective image — the terrain they’re describing, the wind in the pass. Then you generate one from the other character’s view. Competing visual evidence with different ambient sound.
You’re not adjudicating an argument anymore. You’re experiencing it from both sides.
That’s a different quality of collaborative storytelling. I’m a little obsessed with it.
the technical side for the nerds in the room
For people who care about this stuff: Soulkyn runs LTX-2.3 (22B params) on their own infrastructure. Self-hosted, self-trained. This matters because it means they can train the model toward their specific use cases — including the NSFW range that third-party APIs won’t touch — and the quality of the generation reflects months of their own training decisions rather than whatever defaults came with a rented service.
Videos are 5-10 seconds. That’s short enough to generate fast, long enough to establish a moment. Sound generation is integrated into the video model itself, not a post-processing layer. SFW and NSFW both supported.
You generate videos from any character image right in the chat interface. Pick an image — a scene the AI generated, a character portrait, whatever — and hit generate video. The model takes the image and produces a clip with synchronized sound. The timing is up to you, which means you can choose the narrative moment that’ll hit hardest.
pricing, briefly, because people ask
Four tiers: Just Chatting (€11.99/month), Premium (€24.99), Deluxe (€49.99), Deluxe Plus (€99.99). Videos are pay-per-use on the lower tiers. Deluxe Plus includes a 50-video monthly quota, which is the tier to be on if you’re doing video-heavy campaigns.
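If you want to sanity-check where Deluxe Plus starts paying for itself, the break-even arithmetic is simple: subscription fee plus any videos beyond the quota. A quick sketch — note that the per-video price used here is a placeholder I made up for illustration, not Soulkyn’s actual rate:

```python
# Break-even sketch for video-heavy months. PER-VIDEO PRICE IS A
# PLACEHOLDER, not Soulkyn's actual rate -- check current pricing.
# Tier fees from the post: Deluxe EUR 49.99, Deluxe Plus EUR 99.99
# (Deluxe Plus includes a 50-video monthly quota).

def monthly_cost(base_fee: float, videos: int, per_video: float,
                 included: int = 0) -> float:
    """Subscription fee plus pay-per-use cost for videos beyond the quota."""
    extra = max(0, videos - included)
    return base_fee + extra * per_video

def cheaper_tier(videos: int, per_video: float) -> str:
    """Compare Deluxe (no quota) against Deluxe Plus (50 videos included)."""
    deluxe = monthly_cost(49.99, videos, per_video)           # all pay-per-use
    deluxe_plus = monthly_cost(99.99, videos, per_video, 50)  # 50 included
    return "Deluxe Plus" if deluxe_plus < deluxe else "Deluxe"
```

At a placeholder price of €2 per video, for example, Deluxe Plus overtakes Deluxe at 26 videos a month (49.99 + 26 × 2 = 101.99 vs. 99.99); at a different real price the crossover moves accordingly.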
Group chat (up to 3 characters simultaneously) starts at Deluxe. Unlimited messages at Premium and above, which matters because you want the AI generating rich scene images — those images are your raw material for video generation.
one last thing about that war cry
I went back and watched the clip again just now, writing this.
The thing that gets me is how Grak looks before she turns. A half-second where she’s just standing there with her back to the camera, and you can hear the camp — fire, distant conversation, someone sharpening a blade. Ordinary. Then she turns. Then the volume of the moment becomes something else entirely.
The AI generated that beat. The quiet before it. The character knowing that the quiet matters.
Three months of campaign history, one character, a five-second video clip. And I understand something about Grak that I couldn’t have gotten from a text description, even a very good one.
That’s what video with sound does to roleplay. It doesn’t replace the text. It makes the text matter more, retroactively, because now you know what the words sound like when they’re real.
Browse existing characters on Soulkyn or build your own. If you’ve got a character who’s been living in text for a while, give them something to say out loud. See what they sound like.
I’ll be in Grak’s warband indefinitely. Send help. (Don’t.)
