Can you forecast a seizure with a spreadsheet and an LSTM?

A simulation-first rehearsal for forecasting a visual epileptic aura from the small visual events that may precede it — and a stricter measurement of what a model would actually be allowed to claim.


When people hear "epilepsy," they usually think of the big obvious version: the full seizure, the body involved, the thing everyone immediately recognizes as epilepsy.

But that is not the whole space.

A lot of epilepsy is much more local than that. A small region of the brain can become unstable for a while, and depending on where that happens, the output looks completely different. If it spreads into motor areas, you get the version people know. If it stays near the visual system, you may get something much quieter from the outside, but very observable from the inside.

This project lives in that second regime.

The relevant event is a bright point in the field of view, planted where the gaze wants to go, like a small, invasive white sun refusing to move. Reading, coding, writing, whatever was happening before, that is over for a while.

I am going to avoid pretending this is a neurology lecture. I am not a neurologist, and for this project the exact boundary between migraine aura, epileptic aura, cortical spreading depression, inhibition, excitation, GABA, glutamate, etc. is not really the point of this article.

For the data, the useful version is much simpler: before the real aura, there are often smaller visual events. Tiny flashes. Brief interruptions. Little points of light. The proper words are something like phosphenes or elementary pre-aura visual events — EPVs from here on.

Most EPVs go nowhere. A flash appears, disappears, and the system returns to baseline. But sometimes they stop looking like independent little glitches. They begin to cluster. The spacing between them shortens. The density goes up.

That is the whole premise. The useful signal, if it exists, is not in any single EPV, but in the local statistics of the sequence: event density, clustering, and shortening inter-arrival times.

In statistical terms, this is close to a Hawkes-process problem: a sequence of events where each event can temporarily raise the probability of the next one. The bet is that EPVs are weakly self-exciting. One event makes the next one slightly more likely, the sequence becomes denser, and the system moves closer to a transition.

The question is whether that density contains a learnable pre-ictal signal, or whether it only feels predictive because the human brain is very good at drawing a line backwards after something scary has happened.

The important word is "whether." Seizure prediction is famously not solved, and the existence of a clean pre-ictal state is still the disputed part. So the goal is not to prove predictability. The goal is to measure how much predictability is actually there.

And there is one honest thing to say before anything else: this article does not settle that clinical question, because the data below are synthetic event streams, generated from a deliberately small phenomenological model. That caveat weakens exactly the right claim and preserves the useful one. It stops the result from masquerading as physiology. It does not make the engineering question trivial, because the narrower question is still hard enough: if a world contains sparse self-exciting visual events, rare transitions, session-level fragility, and nuisance context, can a small forecasting pipeline recover the part of the signal that is actually recoverable without fooling itself?

That is the experiment.

1. Simulation first

The project starts not with a model but with a world for the model to be wrong in — and an argument for why that world is allowed to be fake.

Why the first pass is a rehearsal

Given the nature of the target — rare, subjective, noisy, only partly observable — the first pass cannot honestly be a discovery paper. It has to be a modeling rehearsal.

That sounds like a demotion. It is actually the useful version of the project, because a rare-event predictor fails in boring ways long before it fails in interesting ones. It leaks across time. It learns a session-specific artifact and calls it physiology. It overuses a plausible feature because the feature sounds like a trigger. It reports ROC-AUC as if it were a usable warning system. It adds a neural network because the word "sequence" appears in the problem statement.

So the first thing to build is not the predictor. It is the little world in which the predictor is allowed to embarrass itself.

A microscope shows you something about the world. A simulator shows you something about your assumptions. Confuse the two and every conclusion becomes fake. But keep the boundary clean and a simulator earns its keep for exactly the reason a microscope cannot: you can know the data-generating process and still ask whether your analysis is disciplined enough to recover only the claims it deserves.

That discipline is not paranoia — it is what the literature keeps demonstrating (I come back to the specific papers near the end; here I only need their headline). The serious seizure-forecasting work is full of models that beat a random predictor in one setting and quietly deflate in another. Temporal models can find signal in EEG, but the signal is patient-specific, validation-sensitive, and far easier to overstate than to use. Payne et al. ran CNN/LSTM models over long-term intracranial EEG and reported performance above chance with strong patient-to-patient variability; Chambers et al. later tested LSTMs on unprocessed EEG and again framed the benefit as real but bounded, not magic. The lesson is not that LSTMs are bad. The lesson is that the validation protocol carries almost as much weight as the architecture.

What synthetic data can and cannot do

Synthetic data can test a pipeline, compare modeling choices, expose leakage, estimate sample-size pressure, and define which claims would be forbidden. It cannot discover a clinical fact. A model that "works" on data shaped by my own assumptions has not learned the brain; it has passed a sanity floor: in a world where the signal exists, the pipeline can see some of it. That is a low bar. It is also the bar most personal ML projects quietly skip.

Building the simulator

The simulator has three moving parts.

The first is the baseline rate. EPVs can happen even when nothing is building. A flash appears, disappears, and the system returns to baseline. In point-process language, this is the boring part.

The second is self-excitation. After an EPV, the probability of another EPV is briefly higher — not forever, not deterministically, and not enough to make every flash meaningful. Just enough to create bursts. This is the Hawkes intuition from the intro: each event leaves a small temporary dent in the future, and the dents can stack.

The third is session-level fragility. Some sessions are simply more loaded than others. Poor sleep, longer wake time, fatigue, or an already-irritable visual system do not create the next event directly — they change the slope of the hill. A burst on a fresh day and the same burst on a loaded day should not be treated as the same object.

That gives the simulator a simple causal shape: EPVs raise short-term intensity; accumulated intensity raises fragility; fragility raises the probability that a dense local cluster tips into a full aura. Then random timing noise decides whether the possibility actually lands in this window or not.

That last clause is the whole game. There is noise in the generator because there is noise in the kind of question being simulated. Without it, the task is too clean; with too much of it, no learner can do anything. The interesting regime is the annoying middle: enough structure that a simple model can beat chance, enough randomness that no model gets to feel prophetic. The simulator is built so the model can see the weather forming, not the lightning bolt itself.
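A minimal sketch of the first two moving parts (baseline rate plus self-excitation), under assumptions that are mine for illustration and not the settings used for the results here: an exponential excitation kernel, Ogata-style thinning, and made-up parameter values. The names `simulate_epvs`, `mu`, `alpha`, `beta` and `gap_cv` are hypothetical, and both session fragility and the step from cluster to aura are left out.

```python
import numpy as np

def simulate_epvs(mu, alpha, beta, t_max, rng):
    """Event times on [0, t_max) for a Hawkes process with an exponential kernel.

    Intensity: lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)).
    mu is the baseline rate; alpha / beta is the expected number of follow-up
    events each event triggers and must stay below 1 for the stream to be stable.
    """
    events = []
    t = 0.0
    excitation = 0.0  # current value of the self-exciting sum
    while True:
        # Between events the intensity only decays, so its current value is a valid upper bound.
        upper = mu + excitation
        wait = rng.exponential(1.0 / upper)
        t += wait
        if t >= t_max:
            break
        excitation *= np.exp(-beta * wait)
        # Thinning: accept the candidate with probability lambda(t) / upper.
        if rng.uniform() * upper <= mu + excitation:
            events.append(t)
            excitation += alpha  # each accepted EPV leaves a temporary dent in the future
    return np.array(events)

def gap_cv(times):
    """Coefficient of variation of inter-arrival times: about 1 for a plain
    Poisson stream, above 1 when events cluster."""
    gaps = np.diff(times)
    return gaps.std() / gaps.mean()

rng = np.random.default_rng(0)
quiet = simulate_epvs(mu=0.5, alpha=0.0, beta=1.6, t_max=2000.0, rng=rng)   # baseline only
bursty = simulate_epvs(mu=0.5, alpha=0.8, beta=1.6, t_max=2000.0, rng=rng)  # weakly self-exciting

print(len(quiet), len(bursty), round(gap_cv(quiet), 2), round(gap_cv(bursty), 2))
```

With `alpha` set to zero the stream is the boring baseline; turning it on produces more events and burstier spacing without making any single event decisive.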

Four views of the toy world. The risk map and phase portrait trace how density and load evolve; the raster stacks the EPV streams session by session; and the session scatter spreads every window across its two clocks — the rare positive windows are the bright cluster drifting up and to the right.

What the model actually estimates

The model is not trying to predict the next flash. It is trying to estimate a local hazard: given the recent density of flashes and the slower state of the session, how dangerous is this window compared with a quiet one? That distinction matters because hazard can be real while exact timing stays stubbornly uncertain.

Why more synthetic data isn't more evidence

The tempting objection writes itself: if the data are synthetic, why simulate scarcity? Why not generate a million windows and let the LSTM win or lose cleanly?

Because that would confuse two different uncertainties.

More synthetic rows reduce Monte Carlo error. They do not reduce epistemic error. A million windows from the same simulator tell you the simulator more precisely; they do not make the simulator more true. If the question were "what happens under this exact data-generating process?" then yes, generate endlessly. But that is not the question I care about.

The relevant question is harsher and more practical: under the kind of evidence budget a real logging project would actually have — few positives, correlated windows, uneven sessions, imperfect context — which claims survive? So the synthetic cohort is deliberately kept in the small-data regime. The sample size is not a limitation accidentally inherited from reality; it is part of the stress test. The generator can make as much data as I ask for, but large runs are used only to check the stability of the comparison — never treated as extra evidence about biology.

Once the generator exists, synthetic rows are nearly free — and that is exactly why they are not evidence. The bottleneck is not the number of rows Python prints; it is the number of assumptions anchored tightly enough that a future real dataset could challenge them.
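The two uncertainties can be separated in a few lines. This is a toy sketch with invented numbers: `assumed_rate` stands for whatever the simulator was built to believe, and `unknown_truth` for a real-world value that no synthetic run ever gets to see. Both values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
assumed_rate = 0.06    # positive-window rate written into the simulator (hypothetical)
unknown_truth = 0.10   # rate in the real stream, invisible to any synthetic run (hypothetical)

results = []
for n_rows in (1_000, 100_000, 1_000_000):
    sample = rng.random(n_rows) < assumed_rate
    estimate = sample.mean()
    std_error = np.sqrt(estimate * (1.0 - estimate) / n_rows)  # Monte Carlo error
    gap = abs(estimate - unknown_truth)                        # error inherited from the assumption
    results.append((n_rows, estimate, std_error, gap))
    print(f"{n_rows:>9,d} rows  estimate={estimate:.4f}  std_error={std_error:.5f}  gap_to_truth={gap:.4f}")
```

The standard error shrinks with every extra row, while the gap to the truth stays where the assumption put it.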

2. The features

Three sections on the numbers themselves — how they get built from raw events, which tempting one is a deliberate trap, and how the set is cut down to something a person can still argue with.

Turning events into features

You cannot hand a model "a flash happened at 14:32." You hand it numbers sampled on a clock. So the first modeling decision is not which model to use; it is what number should stand in for "the system is getting excited right now."

The main feature is a recency-weighted event rate: an exponentially-weighted moving average over the EPV stream. Each event bumps it up; between events it decays. Recent flashes count for a lot, old ones fade. A single isolated flash barely moves it; a cluster pushes the curve up and keeps it there.

This is not an arbitrary convenience. In the language of the simulator it is an observable proxy for the Hawkes intensity — the running estimate of how self-excited the local stream has become. The premise that "events matter when they cluster" becomes one number the model can read.

That number is paired with a slower clock: accumulated load. The fast clock asks whether a burst is happening now. The slow clock asks whether the session has been simmering long enough for the same burst to matter more.

The dual clock

There are two timescales, and the model needs both. A fast clock tracks recent density: is there a burst happening now? A slow clock tracks accumulation: how loaded is the session before this burst arrives? A short dense burst on a fresh system is not the same object as the same burst after an hour of subthreshold activity: one is a spark, the other may be a spark in dry grass.
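A minimal sketch of the two clocks, assuming both can be written as exponentially decayed event counts that differ only in their time constant. That is an assumption made here for illustration: the text does not pin down how accumulated load is computed, and the helper `decayed_count`, the time constants, and the example session are all hypothetical.

```python
import numpy as np

def decayed_count(event_times, grid, tau):
    """Sum of exp(-(t - t_i) / tau) over events at or before each grid time t.

    Only past events contribute, so the value is available at prediction time.
    """
    lag = np.asarray(grid, dtype=float)[:, None] - np.asarray(event_times, dtype=float)[None, :]
    weights = np.where(lag >= 0.0, np.exp(-np.maximum(lag, 0.0) / tau), 0.0)
    return weights.sum(axis=1)

# Hypothetical session, in minutes: one isolated EPV, then a tight cluster an hour later.
events = np.array([10.0, 70.0, 71.0, 71.5, 72.0, 73.0])
grid = np.arange(0.0, 91.0, 1.0)   # one reading per minute, so grid[k] is minute k

fast_clock = decayed_count(events, grid, tau=5.0)    # forgets within minutes
slow_clock = decayed_count(events, grid, tau=60.0)   # carries load across the session

print("after the isolated flash:", round(fast_clock[11], 2))
print("at the end of the cluster:", round(fast_clock[73], 2))
print("17 minutes later, fast vs slow:", round(fast_clock[90], 2), round(slow_clock[90], 2))
```

The isolated flash barely lifts the fast clock, the cluster pushes it several times higher, and by the end of the session the fast clock has mostly forgotten the cluster while the slow clock still carries it.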

The two clocks reading one EPV stream (top), with six more sessions for scale (bottom). Each tick is an EPV; the fast clock (recency-weighted density, cyan) spikes and decays, while the slow clock of accumulated load (amber) creeps up underneath. Aura sessions climb into a streak of closely spaced EPVs before the red bar; calm ones never do. This is the hypothesis drawn legibly, not evidence for it: that auras follow a streak is built into the simulator, so the figure shows the assumption the model is asked to recover, not a discovery.

Around those clocks sit a few context variables — sleep, wake duration, screen exposure, session metadata. They are allowed in because a real logger would collect them. They are not allowed to dominate, because the simulated mechanism is intentionally event-centric. The claim being stress-tested is not "all of life predicts aura." It is narrower: local EPV density may contain a weak pre-aura signal. A small hypothesis is easier to kill.

Adding a negative control

The most dangerous feature in the project is also the most psychologically satisfying: screen-context switching.

Laptop to phone. Phone to monitor. IDE to browser. White page to dark terminal. It feels like the villain — modern, visual, jittery, exactly the sort of thing a person who works on screens expects to blame. So it goes into the simulator, hard. But it goes in as a negative control.

A negative control is not a useless column. It is a feature plausible enough to tempt you, constructed so that it should carry no real predictive information. Context switching is allowed to look realistic as a behavioral stream, but it is not wired into the aura hazard. If the model gives it weight anyway, the pipeline has a problem: it is grabbing correlation-shaped noise because the dataset is small and the feature sounds good.

The pipeline rejects it. It receives no stable weight, does not survive feature selection, and does not improve session-blocked validation — though here I am getting ahead of the project, because both of those procedures come in the next two sections. I am giving the decoy's verdict before the machinery that delivers it, because a negative control is easier to hold as a goal — a feature that should fail — than as a result buried three sections later.

You can also just look at it. Stacked one row per simulated day, with every aura drawn as a red bar, the behavioral streams and the auras share a screen but not a rhythm: the blue typing trace and the amber window-switch marks wander everywhere, and the red bars fall where they fall. Nothing in the behavior lines up with the moments that matter — which is exactly what a decoy is supposed to look like.

That does not mean "screen switching is not a real trigger." A synthetic benchmark cannot tell you that — I already wrote the answer into the world. The correct conclusion is more modest and more useful: given a plausible decoy, the feature-selection procedure does not automatically crown it. A future real version of this project will be full of features that feel meaningful — brightness, caffeine, sleep debt, stress, notifications, posture, ambient light. The question is whether the pipeline can say no.

A variable can be useful precisely because it should not work: it is a trap you set for your own model. If the model starts believing in the decoy, that is telling you something ugly about leakage, feature selection, or sample size — and refusing it is one of the few ways a small pipeline can earn trust.

Pruning the feature set

At the start the candidate table has more columns than the final model deserves: the clocks, the context, simple counts, recent maxima, inter-arrival summaries, session age, and the decoy. Eleven candidates is not excessive for exploration.

But eleven is excessive for the evaluation you actually want to believe — and this is the part people get backwards with synthetic data. Because the generator can produce unlimited rows, it is tempting to let the feature set stay wide and drown the uncertainty in volume. But the benchmark is tied to a future logging regime where positives are rare. The effective sample size is not "number of rows"; it is closer to "number of independent sessions and positive transitions." Consecutive windows from the same session do not give you eleven independent chances to learn eleven effects. They give you one correlated stretch of evidence sliced into many rows.

So the pruning is done before looking at the outcome. Redundant features go, by correlation and by meaning: if two columns are two ways of saying "recent density," one survives. If a feature is a post-hoc summary not available at prediction time, it is out. If a feature exists only because it was easy to log, it is out too: "because the spreadsheet has it" is not a reason. What is left is five knobs: the fast clock, the short-term density slope, accumulated session load, sleep debt, and time awake. The exact count matters less than the discipline: the model is forced to be small enough that a human can still argue with it.
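The correlation half of that pruning can be sketched as a greedy filter that never sees the outcome label. The cutoff of 0.9, the function name `prune_redundant`, and the toy table are assumptions for illustration; the "by meaning" half (dropping post-hoc summaries and convenience columns) is a judgment call that code does not make.

```python
import numpy as np

def prune_redundant(X, names, max_abs_corr=0.9):
    """Greedy filter: walk the columns in priority order and keep a column only if
    its absolute correlation with every already-kept column is below the cutoff.
    The outcome label is never consulted."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < max_abs_corr for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

rng = np.random.default_rng(0)
n = 500
fast = rng.gamma(2.0, 1.0, n)
table = np.column_stack([
    fast,
    fast + rng.normal(0.0, 0.1, n),    # "burst": nearly the same reading
    -fast + rng.normal(0.0, 0.1, n),   # "seconds_since_event": the same reading, sign flipped
    rng.normal(0.0, 1.0, n),           # "sleep_debt": unrelated by construction
])
names = ["fast_clock", "burst", "seconds_since_event", "sleep_debt"]

print(prune_redundant(table, names))   # ['fast_clock', 'sleep_debt']
```

Column order encodes priority here: the reading listed first survives and its synonyms are dropped, so the order should be fixed by meaning before the filter runs.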

The eleven candidates correlated against each other. The fast-clock family in the top-left block is almost the same column written six ways — burst and seconds-since-event are anti-correlated at -0.98. That redundancy is why eleven collapses to five: keep one reading of recent density and drop its synonyms. The decoy, window-switching, correlates with nothing, exactly as it was built to.

In rare-event work, feature selection is not a beauty contest; it is debt control. Every extra variable is a loan against evidence you probably do not have — and in a simulation study it is worse, because extra features are extra assumptions with column names.

3. Validation and models

With features in hand, the real work is scoring them without fooling yourself — the split first, then the models, then a diagnostic only a simulator allows.

Splitting the data

The easiest way to make this project look impressive is to split the data by row. It is also the easiest way to make the number meaningless.

Rows inside the same session are siblings. A window ending at 14:45 and a window ending at 15:00 share the same simulated day, the same slow load, the same context, and often most of the same events. If one lands in training and its near-twin lands in testing, the model is not forecasting a new situation — it is recognizing a cousin. The metric comes back beautiful and means nothing.

So the split is by session. A whole session goes into training or into testing, never both. Five folds, stratified so the rare positives do not all pile into one fold. The bootstrap resamples whole sessions too, because rows are not independent evidence units. Resampling rows would produce narrow intervals and false confidence; resampling sessions gives uglier uncertainty and a number worth reading.

This is unglamorous, and it is the whole game. Once the split is honest, the scores get less exciting and more useful.

Why the split is by session

In a time series, leakage does not need to look like cheating. It can look like a clean random split. But if near-duplicate windows from the same session appear on both sides, the model has already seen the shape of the test case. Session-level splitting is not conservative; it is the minimum condition for the metric to mean anything.
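A minimal sketch of a session-blocked, stratified split, written by hand in NumPy so that nothing is hidden. The function name `session_folds` and the toy cohort are hypothetical; scikit-learn's `StratifiedGroupKFold` is a library route to the same idea, though its exact balancing rule differs from the round-robin used here.

```python
import numpy as np

def session_folds(session_ids, labels, n_folds=5, seed=0):
    """Assign whole sessions to folds. Sessions that contain a positive window are
    dealt out round-robin first, so the rare positives do not pile into one fold."""
    session_ids = np.asarray(session_ids)
    labels = np.asarray(labels).astype(bool)
    rng = np.random.default_rng(seed)
    sessions = np.unique(session_ids)
    has_positive = np.array([labels[session_ids == s].any() for s in sessions])
    fold_of = {}
    for group in (sessions[has_positive], sessions[~has_positive]):
        for i, s in enumerate(rng.permutation(group)):
            fold_of[s] = i % n_folds
    row_fold = np.array([fold_of[s] for s in session_ids])
    # One (train_rows, test_rows) pair per fold.
    return [(np.flatnonzero(row_fold != f), np.flatnonzero(row_fold == f))
            for f in range(n_folds)]

# Hypothetical cohort: 20 sessions of 30 windows; 6 sessions end in two positive windows.
session_ids = np.repeat(np.arange(20), 30)
labels = np.zeros(session_ids.size, dtype=int)
for s in range(6):
    labels[s * 30 + 28 : s * 30 + 30] = 1

folds = session_folds(session_ids, labels)
for train, test in folds:
    shared = set(session_ids[train]) & set(session_ids[test])
    print(len(test), "test rows,", int(labels[test].sum()), "positives,", len(shared), "shared sessions")
```

Every fold reports zero shared sessions, which is the property a row-level split cannot give.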

From simple baselines to the LSTM

Before the LSTM, there is a threshold. If recent EPV density crosses a fixed line, raise the flag. No memory cell, no hidden state, no learned representation — just the claim in its most primitive form: dense EPVs are more dangerous than sparse ones. It is too stupid to be the final answer, which is exactly why it belongs first. It gives every later model something humiliatingly simple to beat.

The second model is logistic regression over the five retained features. Still boring, still readable: you can look at the coefficients and ask whether the model uses the world the way the simulator actually works. The fast clock should matter. The slow clock should matter. The negative control should not. If the signs are strange, the fix is not an LSTM — the fix is that the representation or the validation is wrong.

In this scarce synthetic benchmark, the logistic model lands around ROC 0.72.

Here is what that means, with no jargon. Take one pre-aura window and one calm window at random, and ask the model which is riskier. It picks correctly about 72 times out of 100. A coin gets 50. A perfect oracle gets 100. So there is real signal inside the toy world — clearly more than chance — and nowhere near something you would trust as a warning system.
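That pairwise reading is the definition of ROC-AUC, and it can be computed directly. A small sketch, with made-up scores and a hypothetical helper name `pairwise_auc`:

```python
import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    """Fraction of (pre-aura, calm) pairs in which the pre-aura window gets the
    higher score; ties count as half a win."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

# Two pre-aura windows and three calm windows give six pairs.
# Five are ranked correctly; the pair (0.4, 0.5) is not.
print(pairwise_auc([0.9, 0.4], [0.1, 0.5, 0.3]))
```

Nothing in this number knows how rare the positives are, which is why it has to be read next to the base rate.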

The logistic model's decision surface over the two clocks. Risk rises up and to the right — more recent density, more accumulated load — and the red pre-aura windows do lean that way. But look at the overlap: most of the field is mixed blue and red. That is what ROC 0.72 looks like up close — a real tilt, not a clean line.

The rarity makes it harsher than the number sounds. Even when the model concentrates risk, most alarms are still false, because most windows are non-events. The top-risk windows are enriched for positives, not purified. It is a metal detector that triples your odds in a field where almost every beep is still a bottle cap. A model can be directionally right and operationally useless at the same time.

That is the result in a single line: statistically above chance, operationally unusable.
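The metal-detector arithmetic can be worked through with invented numbers. The base rate of 3% below is a hypothetical value chosen for illustration, not a figure from this benchmark, and `alarm_precision` is a hypothetical helper; the only thing taken from the text is the idea of a flag that triples the odds.

```python
def alarm_precision(base_rate, odds_lift):
    """Probability that a flagged window is a true pre-aura window, given the
    base rate of positives and how much a flag multiplies the odds."""
    prior_odds = base_rate / (1.0 - base_rate)
    posterior_odds = prior_odds * odds_lift
    return posterior_odds / (1.0 + posterior_odds)

base_rate = 0.03   # hypothetical: 3 in 100 windows precede an aura
precision = alarm_precision(base_rate, odds_lift=3.0)

print(f"{precision:.3f} of alarms are real, {1.0 - precision:.3f} are bottle caps")
# 0.085 of alarms are real, 0.915 are bottle caps
```

Under those assumed numbers, tripling the odds still leaves more than nine alarms in ten false, which is what "enriched, not purified" means in practice.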

So you reach for the LSTM. It is the tempting model because the data are sequential, and that temptation is not stupid: a stream of events really does have temporal structure, and a recurrent model is built to consume it. If you want one assembled gate by gate from a single neuron, that is bits2bricks/LSTMs from scratch — here it is just the obvious next thing to reach for. You would bet on it — a sequence model reading the raw stream should beat a four-feature logistic regression. That is the entire promise of deep learning.

It does not pay rent.

The LSTM can learn the simulated stream — that is not the problem. The problem is generalization under the evidence budget. With few positive sessions, the extra capacity mostly buys more ways to memorize local accidents. Under session-blocked validation, the heavier model shows no reliable edge over the small logistic model, and its apparent advantage over the threshold rule is unstable. The handcrafted clocks are already close to sufficient statistics for the simulated mechanism, while the LSTM has to learn those clocks from scarce, correlated windows. It has more freedom than evidence, so it spends the freedom badly.

And to rule out that the network was simply mis-sized, I ran two LSTMs — a small one and one with roughly double the capacity. Neither earns its keep. Scored by PR-AUC (the metric that actually respects rare positives) and averaged over the session-blocked folds, the picture repeats across every forecast horizon: the four-feature logistic is the tallest bar, the one-feature rule sits just behind, and both LSTM sizes trail — worst in the hardest 5–15 minute window.

The bake-off. In all three horizons the four-feature logistic (cyan) is the highest bar; the two LSTM sizes (orange, red) never clear it and collapse hardest in the 5–15 minute window. Doubling the LSTM's capacity did not buy more skill — only more ways to overfit a scarce, correlated dataset. And mind the axis: even the winner sits near PR-AUC 0.1, which is real signal and still nowhere near a usable alarm.

The bootstrap interval on the logistic model's edge over the dumb threshold crosses zero, and the LSTM is no cleaner. Said plainly: in this finite simulated study, I cannot rule out that the cleverer models add nothing meaningful over a density rule.

That is not a verdict on LSTMs, or on brains, or on aura forecasting. It is a verdict on this evidence regime — and it is the cleanest result in the project precisely because it is not flattering, and because it travels. "At this sample size, capacity buys overfitting, not skill" is a fact about the volume of evidence, not about the body.
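The kind of interval behind that statement can be sketched as a bootstrap that resamples whole sessions and recomputes the edge of one scorer over another each time. Several pieces are assumptions made for this sketch: average precision stands in for PR-AUC, the interval is a plain percentile interval, and the toy scores are invented so that the "model" is only the rule plus noise. The names `average_precision` and `session_bootstrap_edge` are hypothetical.

```python
import numpy as np

def average_precision(y, score):
    """Step-wise area under the precision-recall curve (ties in score keep row order)."""
    order = np.argsort(-score, kind="stable")
    hits = y[order].astype(float)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def session_bootstrap_edge(y, score_a, score_b, session_ids, n_boot=500, seed=0):
    """Percentile interval for average_precision(a) - average_precision(b),
    resampling whole sessions with replacement."""
    rng = np.random.default_rng(seed)
    sessions = np.unique(session_ids)
    rows = {s: np.flatnonzero(session_ids == s) for s in sessions}
    edges = []
    for _ in range(n_boot):
        draw = rng.choice(sessions, size=len(sessions), replace=True)
        idx = np.concatenate([rows[s] for s in draw])
        if y[idx].sum() == 0:
            continue  # no positives drawn: the metric is undefined for this resample
        edges.append(average_precision(y[idx], score_a[idx])
                     - average_precision(y[idx], score_b[idx]))
    return np.percentile(edges, [2.5, 97.5])

# Hypothetical cohort: 24 sessions of 20 windows, one positive window in every fourth session.
rng = np.random.default_rng(1)
session_ids = np.repeat(np.arange(24), 20)
y = np.zeros(session_ids.size, dtype=int)
y[19::80] = 1
score_rule = 0.5 * y + rng.normal(0.0, 1.0, y.size)        # weak density-style rule
score_model = score_rule + rng.normal(0.0, 0.3, y.size)    # the rule plus noise, no new skill

low, high = session_bootstrap_edge(y, score_model, score_rule, session_ids)
print(f"edge of model over rule: [{low:.3f}, {high:.3f}]")
```

If the printed interval straddles zero, the honest reading is the one given above: at this sample size the data cannot distinguish the cleverer scorer from the dumb one.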

Two kinds of claim

Some claims travel out of a simulator: the validation must be session-blocked; the feature budget is small; a decoy feature can expose leakage; at this sample size, a complex model cannot be shown to beat a simple one; the uncertainty interval is wide. Other claims do not travel: context switching is not a real trigger; aura timing is biologically irreducible; this ROC estimates clinical performance. The whole discipline of the writeup is to say only the first kind out loud, and to treat the second kind as a hypothesis to test later, never a finding to bank now.

Revealing the hidden state

There is one more useful abuse of the simulator — the experiment no real study gets to run.

Because I built the world, it has a hidden variable a real logger would not have: the session-level latent fragility, the thing that decides whether a day is generally dangerous before any local window is considered. So I can cheat and hand that latent state directly to the model. Give it the variable it normally has to infer, and ask how much the score improves.

It barely moves.

This is not a statement about real brains. It is a statement about this generator. But it is still useful, because it tells you where the simulated difficulty lives. The model is not mainly failing for lack of a session-level column. It is failing because even after the session is known to be loaded, the exact window still has stochastic timing. A loaded die is not an oracle: knowing the die is loaded tells you the distribution, not the next roll. Said precisely: under this data-generating process, better access to session fragility does not solve event timing.

Fragility versus timing

There is a difference between predicting that a system is fragile and predicting the exact moment it breaks. Most "risk" models are good at the first kind; this project demands the second. In this simulator, most of the remaining error lives in the timing problem. That does not prove reality works that way; it is a map of the wall, not a measurement of how tall it stands. It tells the next real experiment where to push: if richer sensors suddenly improve timing, the simulator was missing a driver; if not, the bottleneck may really be short-horizon stochasticity.

4. Taking stock

What survives all of that, measured against the real literature and against the fair suspicion that none of it was necessary.

What the EEG literature contributes

I have saved this literature for last, but in the order things actually happened it came first — these are the papers that talked me out of a discovery paper and into a rehearsal. The serious seizure-forecasting literature comes at the problem from the other side — real neural recordings — and the comparison calibrates expectations rather than borrowing authority.

Payne et al. used long-term intracranial EEG from the NeuroVista dataset, passed one-minute segments through CNN/LSTM models, and forecast seizure-onset windows across several horizons. Performance was above a random predictor, but it varied patient to patient: better for some, worse for others. Chambers et al. later tested LSTMs directly on unprocessed intracranial EEG, again reporting above-chance prediction under a carefully defined framework rather than a solved problem.

That matters because this project works with a poorer signal — subjective visual events instead of electrodes. If deep models over real neural measurements still need careful patient-specific evaluation, a personal EPV stream has no right to expect a clean predictive dial. The honest comparison is structural, not evidential: both stories resist the fantasy of a green-to-red dial, and both produce the same flavor of answer — some signal, lots of uncertainty, heavy dependence on validation choices, and no permission to skip the hard prospective test.

It is also why I like the single-subject framing as a target. If subjective visual precursors are useful at all, they are likely to be personal. The relevant distribution is not "people with epilepsy" in the abstract; it is one person's private stream — their EPVs, their sleep, their wake duration, their threshold, their logging discipline. A population model may be scientifically richer. A personal hazard model may be practically closer to the thing a person actually wants — at the cost of one brutal requirement: enough personal data, collected prospectively, before the conclusion is desired.

What the synthetic pass actually buys

There is a fair objection to all of this: synthetic data cannot tell you whether a real aura is predictable — was that not obvious from the first line?

Yes. If the entire conclusion were "go get real data," this would be a long way to say nothing. But that is not the output. The useful residue is more specific.

First, the pipeline can reject a plausible decoy. Future real logs will be full of emotionally satisfying features; without negative controls you cannot know whether the model is learning physiology or your narrative.

Second, the simple clocks are strong baselines. Any future deep model has to beat them under session-level validation, not merely look better under a row split. That saves a lot of wasted neural-network theater.

Third, the relevant metric is not a single ROC-AUC. It is the score, the bootstrap interval, the base-rate behavior, and the edge over a dumb rule. A model that beats chance but fails the operational threshold is not a failed paper — it is a correct measurement.

Fourth, the sample-size question becomes concrete. The next pass does not need "more data" in the abstract. It needs more independent sessions, more positive transitions, cleaner event timestamps, and labels recorded before the analysis has a chance to rewrite them.

"Go get real data" is not an insight. "Go get this kind of real data, split this way, with these controls, and expect this much wobble" is. That is what the synthetic pass buys: not belief, but preparation.

What this was worth

So — can you forecast a visual aura from the little flashes that seem to precede it?

In the simulator: a little. Better than chance, not enough to trust, and not clearly improved by an LSTM under honest validation. In the world: not answered. That boundary is the whole point.

What the project actually produced is not a predictor. It is a pipeline that is harder to fool than the first version of the idea: it builds the right clocks, keeps the sample scarce on purpose, splits by session, uses a negative control, compares against a dumb threshold before trusting a neural network, bootstraps uncertainty at the right level, and — most importantly — knows which conclusions it is not allowed to make. That is the instrument. The predictor can wait.

I should be honest about the part that does not fit in a methods section. I wanted this to work. There is a particular pull to a problem like this — the sense that the signal is right there, that one more feature or one more layer will finally surface it — and that pull is exactly the thing a rare-event study has to be built to resist. Most of the discipline in this writeup, the session splits and the negative control and the dumb baseline I kept in the report even when it embarrassed the network, is really just scaffolding to protect the result from the person who wanted a different one. The number I trust most is the one I was least hoping for.

The next real version is straightforward and annoying, which is usually a good sign. One button per EPV. One session boundary. Sleep and wake duration recorded before the outcome is known. Context switching kept as a negative-control candidate, not a pet theory. The feature budget fixed in advance. The threshold baseline kept in the report even when it embarrasses the LSTM. Validation by session, or not at all. Then ask whether the real stream lands above the synthetic sanity floor.

I am a systems person and not a clinician, and a neurologist would probably frame the whole thing differently. But the engineering lesson travels past brains.

You do not get to build the dial until you have earned the right to trust the needle.

This is the writeup of trying to earn that right in a small simulated world first — and finding that even there, under honest scarcity, the needle barely moves. That is not a failure. It is the measurement doing its job. Most of seizure prediction is exactly this honest. Most of it just does not say so out loud.

26 · jun 26 (day zero): spent a day trying to make the LSTM win. it would not win. that turned out to be the most useful day of the project.

Simulation-first, experimental, personal, and not a medical device. Nothing here decides whether to keep going when symptoms appear, and none of it is clinical advice.

5. References

Payne, D. E., Chambers, J. D., Burkitt, A. N., Cook, M. J., Kuhlman, L., Freestone, D. R., & Grayden, D. B. (2023). Epileptic seizure forecasting with long short-term memory (LSTM) neural networks. arXiv:2309.09471.

Chambers, J. D., Cook, M. J., Burkitt, A. N., & Grayden, D. B. (2024). Using Long Short-Term Memory (LSTM) recurrent neural networks to classify unprocessed EEG for seizure prediction. Frontiers in Neuroscience, 18, 1472747.

Bernabeu, A., Zhuang, J., & Mateu, J. (2025). Spatio-Temporal Hawkes Point Processes: A Review. Journal of Agricultural, Biological and Environmental Statistics.

Reinhart, A. (2018). A Review of Self-Exciting Spatio-Temporal Point Processes and Their Applications. Statistical Science.