Main

Spontaneous behaviour exhibits structure. Ethologists have long argued that the self-motivated behaviour of animals in the wild is flexibly built from modular components that are linked together over time in a predictable yet probabilistic manner1. Many well-studied laboratory behaviours—including chemotaxis, grooming, prey seeking, courtship, birdsong and exploratory locomotion—are similarly characterized by modularity and predictability2,3,4,5. However, it remains unclear how the brain regulates the expression of individual behavioural modules at any given moment, or how it dynamically composes these modules into the fluid behaviours observed when animals act of their own volition in the absence of experimental restraint, task structure or exogenous reward.

Given that the loss of dopaminergic neurons from the substantia nigra pars compacta (SNc) causes diffuse deficits in action initiation and sequencing, it is likely that the neuromodulator dopamine influences the architecture of spontaneous behaviour6,7,8. Yet we know little about the precise relationship between dopamine and behaviour when animals freely explore an environment. Although dopamine is thought to motivate spontaneous behaviour and to influence the vigour with which actions are expressed, evidence is mixed as to whether phasic dopamine transients are permissive or causal for movements, whether dopamine rises or falls when animals initiate a movement, and whether dopamine fluctuations specify movement kinematics in freely behaving animals6,9,10,11,12,13,14,15,16,17,18,19. By contrast, during structured tasks in which animals seek explicit and often cued rewards, phasic dopamine clearly conveys information related to reward and reward-prediction errors, reinforces reward-associated actions, and influences the choices made between alternative actions20,21,22,23,24,25.

Dopamine may have distinct roles during spontaneous and task-structured behaviours, given the many ways in which they differ; for example, spontaneous behaviours generally exhibit a greater variety of expressed behavioural modules, include more complex behavioural sequences, and tend to emphasize self-initiated movements associated with active sensing2,4,26. Nevertheless, both spontaneous behaviour and structured tasks demand that animals choose actions on an ongoing basis from a distribution of possibilities, suggesting that dopamine may influence the continuous assembly of naturalistic sequences through mechanisms similar to those used to support goal-driven action selection in response to rewards.

To test this hypothesis, here we characterize mouse spontaneous behaviour using motion sequencing (MoSeq)—which uses 3D imaging and unsupervised machine learning to atomize behaviour into sub-second modules referred to as ‘syllables’—while simultaneously assessing and manipulating dopamine transients in the dorsolateral striatum (DLS), a region of the basal ganglia implicated in the composition of natural behaviours27,28,29. As SNc dopaminergic neurons acutely influence the population activity of DLS spiny projection neurons (SPNs) and induce plasticity in corticostriatal synapses30, DLS dopamine fluctuations may be particularly relevant to syllable expression and/or sequencing.

We find that DLS dopamine systematically fluctuates during the expression of behavioural syllables, and—through calibrated closed-loop manipulations of DLS dopamine—demonstrate that these fluctuations are causally related to syllable usage, sequencing and vigour. A simple computational model in which syllable-associated dopamine transients shape sequence composition effectively reproduces the observed behavioural choices made by mice during spontaneous behaviour. These results reveal that DLS dopamine transients act as a continuous teaching signal, one that affords spontaneous behaviour its moment-to-moment structure; our observations further suggest a broad model in which the composition of spontaneous behaviour from elemental components is supported by the same computations and circuits that govern action choices in more structured tasks.

Relating DLS dopamine to spontaneous behaviour

To characterize the relationship between striatal dopamine release and spontaneous behaviour, we virally expressed the dopamine reporter dLight1.1 in DLS neurons, and then assessed dopamine fluctuations via photometry as mice explored a featureless open field in the dark31. In concert with these neural measurements, we both quantified conventional movement parameters and performed MoSeq4 (Fig. 1a–d and Extended Data Fig. 1). In this setting, MoSeq identified 37 commonly used behavioural syllables, whose median duration was 400 ± 636 ms, ranging from pause syllables in which mice adopted different static poses, to dynamic syllables in which mice reared, groomed or explored (Extended Data Fig. 2a–c). Syllables varied in terms of how often they were used and the order in which they occurred as time unfolded during each experiment (Extended Data Fig. 2b,d–g).

Fig. 1: Behaviour is associated with dopamine transients in DLS.

a, dLight expression and fibre placement in DLS (Methods). b, The behavioural characterization pipeline using MoSeq (n = 14 mice for MoSeq, 216 experiments; Methods). c, Examples of measured kinematic variables. d, Aligned kinematic variables, MoSeq syllables and dLight fluorescence from an example experiment. e, Top, average correlation between kinematic variables and dLight transient rate. Bottom, correlations with dLight fluorescence. Coloured shading denotes bootstrapped s.e.m.; grey shading indicates the 95% shuffle confidence interval. Solid bars indicate statistical significance at P < 0.05 (shuffle test; Methods). f, Top, average fluorescence (z-scored to shuffle; Extended Data Fig. 4c and Methods) aligned to movement initiation or syllable onsets (n = 100 shuffles). The average syllable-associated dLight transient exceeds that associated with movement initiation (P = 0.0006, z = 3, effect size r = 0.8, two-sided Wilcoxon signed-rank test; n = 14 averages). Bottom, derivative of top panel. Green shading represents the 95% bootstrap confidence interval; grey shading represents the 95% shuffle confidence interval. g, The distribution of all syllable-associated dLight peaks. Bottom, the cumulative distribution. h, Left, the distribution of syllable-associated dLight peaks across all experiments. Right, z-scored average syllable-associated waveforms, sorted by peak fluorescence. The blue and red stars indicate the syllable waveforms shown in k. ***, Kruskal–Wallis H test on average syllable-associated fluorescence amplitudes: P < 10^−25, H = 209.29, n = 518 mouse–syllable pairs. y-axis syllable sorting is shared across panels. i, Left, example syllables with different average (across experiments) waveforms (top) but similar velocity (bottom). Right, example syllables with similar waveforms (top) but different velocities (bottom). Shading represents the 95% confidence interval.
j, Robust linear regression between syllable-associated dLight and velocity (top) or angular velocity (bottom; Methods). Each point is a sampled syllable instance (n = 28,000 points; n = 2,000 points per syllable drawn randomly from each mouse). Regression line (shading indicates the 95% bootstrap confidence interval) and kernel density estimate are shown. P-values estimated by shuffle test. k, Left, average syllable-associated waveforms for starred syllables in h (right). Coloured shading represents the 95% bootstrap confidence interval; grey shading represents the 95% shuffle confidence interval. Right, syllable-associated waveforms from the left panel binned into quartiles of peak magnitudes. l, Left, held-out classifier performance predicting syllable identity from syllable-associated dLight peak amplitudes (top) or waveforms (bottom). Right, dendrogram showing syllables organized by MoSeq distance (Methods). AU, arbitrary units.

Photometry revealed pervasive, fast dopamine fluctuations in DLS that occurred throughout each experiment regardless of whether the mouse was actively moving or relatively still (average rise time during spontaneous behaviour = 67 ms, decay time = 100 ms; Fig. 1d and Extended Data Fig. 1e,h). These dopamine fluctuations (measured as both dLight transient rates and average binned amplitudes) were only weakly correlated with many aspects of movement kinematics (for example, turning, rearing and acceleration) but were significantly negatively correlated with translational velocity at short timescales (<10 s); this correlation reversed and became positive at longer timescales, consistent with the idea that dopamine broadly invigorates and motivates movement6,9,17,32 (Fig. 1e). We validated the relationship between translational velocity and dopamine transients using 3D keypoint tracking, which also demonstrated that forelimb movement per se (that is, independent of translation) only negligibly correlated with dopamine transients (Extended Data Fig. 3).
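Correlations of this kind are compared against shuffled data (Fig. 1e). The following is a minimal Python sketch of such a shuffle test, not the procedure described in Methods. It assumes a circular-shift null (one signal is rotated by a random offset so that its autocorrelation is preserved), and the function names `pearson_r` and `shuffle_test` are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def shuffle_test(x, y, n_shuffles=1000, seed=0):
    """Two-sided shuffle test for the correlation between x and y.

    Assumption of this sketch: the null distribution is built by
    circularly shifting y by a random, non-zero offset.
    Requires len(y) >= 2.
    """
    rng = random.Random(seed)
    y = list(y)
    observed = pearson_r(x, y)
    exceed = 0
    for _ in range(n_shuffles):
        shift = rng.randrange(1, len(y))
        shifted = y[shift:] + y[:shift]
        if abs(pearson_r(x, shifted)) >= abs(observed):
            exceed += 1
    # Add-one correction keeps the estimated P-value strictly positive.
    p_value = (exceed + 1) / (n_shuffles + 1)
    return observed, p_value
```

Here `x` could be a binned velocity trace and `y` the matching binned dLight amplitude; repeating the test at several bin widths would give a timescale-resolved picture similar in spirit to Fig. 1e.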

As has been observed previously, dopamine systematically fluctuated when mice initiated a movement after pausing in the arena9 (Fig. 1f). However, we also observed that dopamine fluctuated as mice transitioned from one behavioural syllable into the next. These fluctuations exhibited a characteristic dip-then-peak pattern surrounding each syllable transition; as dLight fluorescence changes lag dopamine release by tens of milliseconds31 (Extended Data Fig. 1e), it is likely that the observed dopamine dip occurs at the end of the previous syllable, and the peak in dopamine occurs during expression of the subsequent syllable (Fig. 1f,g). Consistent with this hypothesis, time-warping the dopamine trace to accommodate differences in syllable duration revealed that on average, dopamine peaked near the middle of each syllable and decayed as the syllable ended, reminiscent of the ‘burst–pause’ firing pattern of dopamine neurons previously observed at movement initiations10,11,14,33 (Extended Data Fig. 4a). Nearly every syllable instance was accompanied by a positive dopamine transient, whose amplitude varied (Extended Data Fig. 4b–e).
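The time-warping step mentioned above can be illustrated with a short Python sketch. It assumes simple linear interpolation of each syllable-aligned trace onto a fixed number of samples before averaging; the warping actually used in the study may differ, and the names `time_warp` and `average_warped` are hypothetical.

```python
def time_warp(trace, n_points=50):
    """Linearly resample a variable-length trace onto n_points samples.

    Assumes n_points >= 2 and a non-empty trace.
    """
    if len(trace) == 1:
        return [float(trace[0])] * n_points
    last = len(trace) - 1
    warped = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional index into the trace
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        warped.append(trace[lo] * (1 - frac) + trace[hi] * frac)
    return warped

def average_warped(traces, n_points=50):
    """Average traces of different durations on a common, normalized time axis."""
    warped = [time_warp(t, n_points) for t in traces]
    return [sum(column) / len(warped) for column in zip(*warped)]
```

After warping, sample 0 corresponds to syllable onset and the final sample to syllable offset for every instance, so a peak near the middle of the averaged trace reflects a peak near the middle of the syllable regardless of its duration.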

DLS dopamine does not predict syllable identity

Syllable-specific dopamine waveforms exhibited relatively stereotyped shapes and amplitudes (when averaged either within or across mice and experiments), suggesting that dopamine waveforms predict either the identity of the associated syllable or the kinematics of movement associated with its expression (Fig. 1h and Extended Data Fig. 4c–h). However, kinematically similar syllables (for example, two rears) could exhibit very different average dLight waveforms, whereas kinematically different syllables (for example, turning and investigating) could exhibit similar average waveforms (Fig. 1i). Aggregating different syllables into categories (such as rearing, grooming or diving) revealed that different categories of behaviour exhibited broadly overlapping average dopamine transient amplitudes (Extended Data Fig. 5a). Furthermore, syllable-associated dopamine transient amplitudes only weakly correlated with the movements actually expressed during a given syllable (Fig. 1i,j and Extended Data Fig. 5b).

Consistent with a potential dissociation between DLS dopamine and syllable-related kinematics, dopamine waveforms often changed shape and amplitude across different instances of the same syllable (Fig. 1k). This was true even for syllables known to correlate with high SNc and striatal activity such as contralateral turns20,34,35 (Extended Data Fig. 5c). Indeed, random forest classifiers were unable to predict which syllable was being expressed by the mouse at any moment based on the coincident dLight waveform amplitudes or shapes9 (Fig. 1l). Syllable-associated dopamine transient amplitudes were also unrelated to many other features of behaviour, including the rendition quality of each syllable, differences in speed between syllables, the biomechanical difficulty of transitioning into a given syllable, the position of the mouse in the arena, and the specific identity of the preceding syllable (Extended Data Fig. 5d–i). Thus, although dopamine systematically fluctuates during the expression of behavioural syllables and syllables are associated on average with different dopamine waveform shapes and amplitudes, individual dopamine transients do not appear to reliably encode information about syllable identity, kinematics or related features of syllable-associated behaviour.

Dopamine predicts future syllable use and sequencing

We therefore considered whether DLS dopamine transients instead have a more flexible role in specifying ongoing patterns of syllable usage, given that the usage of specific syllables and sequences varies during the course of each experiment (Extended Data Fig. 2e–g). Notably, dopamine transient amplitudes observed during spontaneous behaviour roughly matched those observed while mice consumed an unexpected (and presumably rewarding) chocolate chip placed in the open field; this finding suggests that syllable-associated dopamine transients occurring during self-initiated behaviour, even in the absence of reward, may reinforce the expression of associated syllables (Fig. 2a).

Fig. 2: Endogenous dopamine transients predict average syllable use and sequence variability.

a, The unexpected food reward protocol. Top right, average spontaneous versus food reward-associated transients (Methods). Shading represents bootstrap s.e.m. Bottom right, probability density function of dLight transient amplitudes. The dotted line indicates the threshold for detecting dLight transient peaks. b, Left, robust linear regression between syllable-associated dLight and average syllable counts per syllable (each dot is a syllable–mouse pair). Regression line and kernel density estimate are shown (r is Pearson correlation between held-out predictions and actual data). Right, the distribution of Pearson correlations using models fit to shuffled data compared with the observed correlation (blue line). Shading indicates the 95% bootstrap confidence interval. P-values estimated by one-sided shuffle test. c, Schematic depicting the hypothesis that dopamine predicts changes in future behaviour. Blue star indicates the syllable-associated dopamine peak. d, Left, average dLight waveforms for each fluorescence quartile at syllable onset for an example syllable–experiment pair. Right, log2 fold change compared to average syllable counts after example syllable onset computed over increasing bin sizes (in syllables) after onset. e, The average Pearson correlation between syllable-associated dLight and syllable counts or velocity, and the dLight signal autocorrelation, computed using a set of increasing bin sizes after syllable onset. Grey shading represents the shuffled 95% confidence interval. The two x-axes reflect time in syllables and approximated in seconds. Solid bars indicate statistical significance (P < 0.05, one-sided shuffle test). f, The distributions of exponential decay timescales (τ) for the correlations plotted in Fig. 2e (n = 1,000 bootstrap samples). 
In all box plots in this Article, the horizontal line represents the median, box edges delineate the first and third quartiles, and whiskers include the furthest data point within 1.5 times the interquartile range of the first or third quartile. g, Average cross-correlation between binned syllable counts and syllable-associated dLight fluorescence (from all mice and experiments) across lags (P < 0.001, one-sided shuffle test; the arrow indicates average peak lag, error is 68% confidence interval). Grey shading represents the shuffled 95% confidence interval. h, Overall correlation between syllable-associated dLight and syllable usage for syllables temporally adjacent to the index syllable. Grey shading represents the shuffled 95% confidence interval. The solid bar denotes statistical significance (P < 0.05, one-sided shuffle test). i, As in b, but for average entropy per syllable. Nat, natural unit of information. j, As in d, but for sequence entropy for an example syllable–experiment pair. k, As in e, but for sequence entropy. l, Fitted τ values for the correlation curve in k (n = 1,000 bootstrap samples).

Consistent with this hypothesis, those syllables that were on average associated with more DLS dopamine during a given experiment tended to be used more during that experiment; furthermore, variation in the average level of DLS dopamine associated with a given syllable across experiments predicted variation in the use of that same syllable (relative to others) across experiments (Fig. 2b). Dopamine transient amplitudes also predicted the near-term use of associated behavioural syllables: if a given instance of a syllable coincided with a relatively high amplitude dopamine transient, that syllable was then used more over the next several minutes, whereas if an instance of that same syllable coincided with a relatively low amplitude transient it was used less (Fig. 2c–g). The relationship between dopamine transient amplitudes and syllable usage was not due to autocorrelation in the dLight signal or to correlations between dopamine and velocity, both of which declined sharply after a few hundred milliseconds (Fig. 2e). Furthermore, the possible consequences of dopamine transients were largely restricted to the specific syllable with which they were associated, as the size of a given dopamine transient did not substantially correlate with the use of syllables neighbouring in time (Fig. 2h).
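The near-term analysis described above can be sketched in Python as follows. The sketch counts how often a syllable recurs in a window after each of its instances, then compares instances paired with high versus low transient amplitudes. It uses a median split rather than the quartiles shown in Fig. 2d, and the names `future_counts` and `log2_fold_change_high_vs_low` are hypothetical; it is an illustration, not the analysis in Methods.

```python
import math

def future_counts(sequence, target, window):
    """For each instance of `target`, count its recurrences in the next `window` syllables."""
    sequence = list(sequence)
    counts = []
    for i, syllable in enumerate(sequence):
        if syllable == target:
            following = sequence[i + 1 : i + 1 + window]
            counts.append(following.count(target))
    return counts

def log2_fold_change_high_vs_low(amplitudes, counts):
    """log2 ratio of mean future counts after high- versus low-amplitude transients.

    Simplification of this sketch: instances are split at the median
    amplitude (the middle instance is dropped when the number is odd).
    Both group means must be positive.
    """
    if len(amplitudes) != len(counts) or len(amplitudes) < 2:
        raise ValueError("need at least two paired instances")
    order = sorted(range(len(amplitudes)), key=lambda i: amplitudes[i])
    half = len(order) // 2
    low = [counts[i] for i in order[:half]]
    high = [counts[i] for i in order[len(order) - half:]]
    mean_low = sum(low) / len(low)
    mean_high = sum(high) / len(high)
    return math.log2(mean_high / mean_low)
```

A positive value indicates that the syllable was used more often after instances that coincided with larger transients; repeating the calculation over increasing window sizes would trace out how long that relationship persists.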

Each syllable expressed during spontaneous behaviour is associated with a specific set of possible subsequent syllables whose likelihoods vary substantially—some subsequent syllables are very likely, and therefore contribute to creating predictable (that is, more deterministic) behavioural sequences, whereas others are much less likely and therefore participate in less predictable (that is, more probabilistic) behavioural sequences (Extended Data Fig. 2d). Given that treatment with dopaminergic agonists has been shown to increase the variability of ongoing behavioural sequences36,37,38, we tested whether syllable-associated dopamine transients can influence the choice of next expressed syllable. Examining the transition patterns between syllables revealed that mice tended to string together more deterministic behavioural sequences when average syllable-associated dopamine was relatively low rather than high (Fig. 2i and Extended Data Fig. 6a). Furthermore, syllable-associated dopamine levels correlated with sequence predictability on a moment-to-moment basis: if a given instance of a syllable was associated with a relatively high amplitude dopamine transient, the mouse tended to make less predictable syllable choices over the next several seconds, whereas if that same syllable was associated with a relatively low amplitude transient, syllable sequences were more deterministic in the near future (Fig. 2j–l).
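Sequence predictability of this kind can be quantified as the entropy of the distribution over next syllables. The Python sketch below computes that entropy in nats for each syllable from a labelled sequence; it is a minimal illustration under the assumption of first-order transitions, and the name `outgoing_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def outgoing_entropy(sequence):
    """Shannon entropy (in nats) of the next-syllable distribution for each syllable.

    Low entropy means the syllable is followed by a predictable
    (more deterministic) choice; high entropy means a more variable one.
    """
    sequence = list(sequence)
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    entropy = {}
    for current, next_counts in transitions.items():
        total = sum(next_counts.values())
        entropy[current] = -sum(
            (count / total) * math.log(count / total)
            for count in next_counts.values()
        )
    return entropy
```

For example, a syllable that is always followed by the same syllable has an entropy of 0 nats, whereas one followed equally often by two different syllables has an entropy of ln 2 nats.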

Both syllable usage and sequence variability were themselves correlated: those syllables that were used the most also tended to participate in the most variable behavioural sequences (Extended Data Fig. 6a). However, syllable usage and sequence variability contributed independently to the ability of an encoding model to predict dopamine fluctuations during behaviour (Extended Data Fig. 6b–g; Methods). Our findings regarding syllable usage and sequence variability were specific to DLS, as dLight recordings in dorsomedial striatum (DMS) revealed less frequent dopamine transients that do not predict future syllable usage (Extended Data Fig. 7). However, dLight signals were effectively predicted by velocity in both DMS and DLS, consistent with previous findings that dopamine fluctuations in dorsal striatum correlate with movement25 (Extended Data Fig. 7e). Taken together, these findings suggest that moment-to-moment fluctuations in DLS dopamine influence the usage of associated behavioural syllables over timescales of minutes, and the choice of what to do next—and thus the predictability of behavioural sequences—on timescales of seconds.

Syllable-associated Opto-DA influences behaviour

To test directly whether syllable-associated DLS dopamine is sufficient to drive increases in syllable usage and sequence variability—and to determine whether fast dopamine fluctuations also influence movements per se—we built a platform that enables us to trigger an optogenetic pulse during the expression of a specific, targeted syllable (Methods and Extended Data Fig. 8a–e). We used this approach to manipulate syllable-associated phasic dopamine levels (Opto-DA) by optically stimulating dopamine axons in the DLS in mice expressing channelrhodopsin-2 (ChR2) in all dopamine-releasing neurons in the midbrain (Fig. 3a,b and Methods). Our stimulation protocol was calibrated to mimic typical syllable-associated dLight amplitudes observed during spontaneous behaviour (Fig. 3c and Methods). We assessed spontaneous mouse behaviour before, during and after syllable-specific stimulation; in each mouse we serially repeated this stimulation protocol for six different target syllables, which collectively exhibited substantial variability in syllable kinematics, usage and sequencing (Fig. 3d and Extended Data Fig. 9a–e). In nearly all instances, optogenetic stimulation was limited to the targeted syllable and did not extend into subsequent syllables (Extended Data Fig. 9f).

Fig. 3: Optogenetically evoked dopamine release in DLS reinforces syllable use and increases sequence variability.

a, The closed-loop MoSeq pipeline. b, Schematic and representative brain section of fibre cannulae over DLS dopamine axons. c, Top, normalized optogenetically evoked dLight peak magnitude distribution (n = 842 peaks). Bottom, mean waveforms from spontaneous and Opto-DA transients (Methods). Shading represents the 95% bootstrapped confidence interval. Max, maximum. d, Experimental schedule describing baseline (rec.) and stimulation (stim.) sessions for ‘target’ syllables (Methods). e, log2 fold change in target counts compared with baseline, per mouse, averaged across targets (P = 0.002, U = 197, f = 0.82, one-sided Mann–Whitney U test) (see Methods for definition of ‘learner’). f, Cumulative increase in target counts relative to baseline in Opto-DA and control mice (P = 0.007, U = 184, f = 0.77, one-sided Mann–Whitney U test). g, Cumulative counts over concatenated stimulation sessions per target. Shading indicates bootstrap s.e.m. h, The relationship between target syllable usage changes from baseline during stimulation experiments versus post-stimulation experiments per mouse (Pearson r = 0.89, P = 0.005 for learners and P = 0.082 for controls, one-sided shuffle test). NS, not significant. i, Average transition entropy following stimulation. Shading indicates bootstrap s.e.m. The light grey band indicates the 95% confidence interval of the pre-stimulation average. Data are binned using five-syllable-wide non-overlapping bins. The bar indicates significance (P < 0.05 for Opto-DA mice, P > 0.05 for controls, two-sided Mann–Whitney U test comparing stimulation with catch trials). Syll, syllable. j, Sequence context changes from baseline to post-stimulation for an example mouse–target pair. Sequences proceed from left (incoming syllables) to right (outgoing syllables). Nodes are sorted by decreasing frequency at baseline. k, Average change in inbound and outbound transitions for target syllables on stimulation day sorted by the baseline rank of the transition.
Traces are smoothed with a five-point rolling average. Shading indicates bootstrap s.e.m. l, Average kinematic parameters aligned to stimulation in Opto-DA mice and controls. Shading as in i. No comparisons between stimulation and catch trials in any of the mice were significant (P > 0.05, one-sided Mann–Whitney U test). m, As in l, but following 3-s-long stimulation. The solid bar indicates significance (P < 0.05, one-sided Mann–Whitney U test).

Pairing syllable expression with Opto-DA rapidly increased the use of target syllables, evoking a stable increase in syllable usage per unit time (rather than continuously increasing the rate of syllable use) (Fig. 3e–g and Extended Data Fig. 9g,h). This increased usage persisted in experiments after stimulation was terminated, demonstrating that mice learned an association between dopamine stimulation and the specific targeted syllable (Fig. 3h). The reinforcing effects of Opto-DA were specific to the targeted syllable, as non-target syllables were not affected (Extended Data Figs. 8f and 10). Thus, Opto-DA is sufficient to reinforce the expression of targeted behavioural syllables in the absence of task structure or external sensory cues, suggesting that mice can recognize their own movements on a moment-to-moment basis and use this information to upregulate actions that are associated with exogenous dopamine39.

Consistent with the observed correlations between endogenous dopamine fluctuations and sequence variability, behavioural sequences observed immediately after Opto-DA stimulation were more unpredictable than those observed on catch trials (Fig. 3i). Of note, this increase in sequence variability was transient and only apparent during Opto-DA sessions; in the two subsequent sessions—after optogenetic stimulation had ceased but during which target syllable use remained upregulated—behavioural sequences surrounding the target syllable became more predictable, as the most likely transitions into and out of the target became even more likely (Fig. 3j,k). Thus Opto-DA increases sequence variability over seconds-long timescales, whereas sequence variability decreases over the longer timescales at which Opto-DA supports syllable reinforcement.

We also considered whether pairing Opto-DA with specific syllables changed the vigour with which syllables were expressed, given prior experiments demonstrating that dopamine can control the speed (that is, vigour) of future movements by reinforcing the expression of fast (or slow) versions of a given movement, or the pitch of a note in a zebra finch’s song40,41,42,43. To address whether Opto-DA influences syllable-associated vigour, we tailored our Opto-DA experiment such that optogenetic stimulation was delivered only on those target syllable instances in which syllable speed was in the top quartile of its overall distribution; this manipulation systematically increased the velocity with which the target syllable was later expressed (Extended Data Fig. 10d). Conversely, Opto-DA during the slowest quartile of the syllable velocity distribution caused future instances of the targeted syllable to slow down.

In contrast to its dynamic effects on syllable usage, sequence variability and syllable vigour, Opto-DA did not prompt switching from one syllable to the next, alter movement parameters associated with the targeted syllable, change movement during the stimulation experiment in general, or induce a preference for a spatial location in the arena (Fig. 3l and Extended Data Fig. 10e–k). However, extending the duration of optogenetic stimulation to several seconds (which caused dLight signals that are of substantially higher amplitude than typically observed during spontaneous behaviour) caused mice to increase their velocity, consistent with previous reports demonstrating that strong perturbations of dopamine are sufficient to cause movements9,10,14,20 (Fig. 3m and Extended Data Fig. 10l).

Dopamine–behaviour relationships predict learning

Only a fraction of mice successfully associated Opto-DA with targeted syllables, and among those that learned this association, the degree of learning varied (Fig. 3e). We wondered whether this distribution in learning reflected differences in the sensitivity of individual mice to dopamine (Fig. 4a). Consistent with this possibility, the ability of endogenous, syllable-associated dopamine transients to support syllable use and induce sequence variability was similar within an individual mouse but differed across mice (Fig. 4b). Furthermore, mice that were strongly influenced by endogenous dopamine fluctuations were also those that were particularly avid learners of the association between Opto-DA and targeted syllables (Fig. 4c). These findings suggest that inter-mouse variability in Opto-DA learning reflects differences in the sensitivity of DLS to dopamine transients in general (as reflected by behaviour).

Fig. 4: Optogenetic syllable reinforcement varies predictably across mice and syllables.

a, Schematic depicting the relationship between observed endogenous dopamine-syllable usage correlations and per-mouse dopamine sensitivity. Dopamine (DA) sensitivity refers to the ability of endogenous, syllable-associated dopamine peaks to influence changes in syllable counts (Endo-DA count influence) or sequencing (Endo-DA entropy influence) within an experiment (see Methods for how indices were computed). b, Top, the distribution of per mouse Endo-DA count influence (left) and Endo-DA entropy influence (right) averaged across all syllables. Bottom, scatter plot (including linear regression model fit) of per mouse average Endo-DA count influence and Endo-DA entropy influence (Pearson r = 0.69 computed from model predictions on leave-two-out held-out data; P = 0.001, P-value computed via one-sided shuffle test). Shading indicates the 95% bootstrap confidence interval. c, Scatter plot (including regression line) of per mouse average Opto-DA learning versus Endo-DA count influence (Pearson r = 0.51 (computed from model predictions on leave-two-out held-out data); P = 0.001, P-value computed via one-sided shuffle test). Shading indicates the 95% bootstrap confidence interval. d, Top, the distribution of per syllable average Endo-DA count influence (n = 296 mouse–syllable pairs). Bottom, Opto-DA learning plotted syllable-by-syllable for ‘learner’ mice (n = 9 mice). e, Top, average Endo-DA count influence across syllable categories. Bottom, average Opto-DA learning across syllable categories. f, Top, scatter plot (including regression line) of catch trial syllable-associated dLight and Opto-DA learning for each mouse–syllable pair (r = 0.32 over held-out data; P < 0.001, estimated via one-sided shuffle test). Bottom, model performance (evaluated with five times fivefold cross-validation) using actual versus shuffled data. g, Hypothesis that evoked dopamine release combines with ongoing endogenous release to alter behavioural choices. 
h, Model-based likelihood of predicting held-out syllable choices on Opto-DA stim experiments (blue) relative to control models (Methods) (P = 7 × 10^−18 across all model comparisons relative to dLight model, U = 2,500, f = 1, two-sided Mann–Whitney U test; n = 50 model restarts; Methods). The right y-axis indicates the model performance as a fraction of the maximum correlation. i, The relationship between average model accuracy (correlation between predicted syllable usage and actual usage) and the ‘extra dopamine’ free parameter (black). Shading indicates the 95% bootstrap confidence interval. The distribution of empirically measured optically evoked dLight fluorescence is shown in blue.

In addition, not all syllables were equally reinforceable in both the endogenous dopamine and Opto-DA experiments (Fig. 4d). There was no discernible relationship between the specific type of behaviour encoded by each syllable (for example, rearing, running or grooming) and the degree of reinforcement by endogenous dopamine or Opto-DA (Fig. 4e). By contrast, the ability of a particular syllable to be reinforced by exogenous dopamine was well predicted by that syllable’s average endogenous dopamine transient amplitude: syllables that on average were associated with high amplitude dopamine transients during spontaneous behaviour were more easily reinforced by Opto-DA than those typically associated with low amplitude dopamine transients (Fig. 4f).

This observation suggests that the endogenous dopamine fluctuations that naturally occur during expression of a targeted syllable sum together with the exogenous dopamine induced by Opto-DA to promote syllable usage. To test this idea, we built a dynamic decoding model that predicted moment-to-moment syllable expression on the basis of syllable-associated dopamine. In this model, syllable-associated dopamine was represented as the sum of two components: endogenous dopamine (that is, the syllable-associated dopamine observed at baseline) and a free parameter representing the extra dopamine afforded by Opto-DA. We then fit this model to data observed during catch trials in the Opto-DA experiments, in which optogenetic stimulation during the target syllable was not provided (Methods). This additive model could reliably predict moment-to-moment syllable choices during the Opto-DA experiment (which included stimulation trials), but made less effective predictions when dopamine transients were shuffled in time relative to syllables (Fig. 4g,h). In addition, the amount of extra dopamine required for the model to accurately predict actual syllable usage patterns closely matched the amount of dopamine elicited by Opto-DA in our calibration experiments (Figs. 3c and 4i). Together, these data support a model in which endogenous and exogenous DLS dopamine sum together to influence syllable expression, and provide a possible explanation for the differential reinforcement of syllables observed in the Opto-DA experiment.

An RL model reproduces spontaneous behaviour

The observation that both exogenously induced and endogenous fluctuations in DLS dopamine levels drive changes in syllable usage and sequencing raises the surprising possibility that, during spontaneous exploration, mice are optimizing their behaviour—as they do in more structured tasks—to maximize the amount of total dopamine obtained during an experiment. If so, this would suggest that mice interpret fast DLS dopamine fluctuations during spontaneous behaviour as a teaching signal capable of shaping action choices.

To test this hypothesis, we built a reinforcement learning (RL) model in which an agent is trained, on the basis of observed syllable sequences and dopamine transients, to predict the syllable choices expressed by actual mice during spontaneous behaviour (Fig. 5a). Whereas RL agents typically seek to maximize overall reward by optimizing their action choices in a particular state, here the RL agent seeks to maximize dopamine by optimizing the choice of which syllable to express next given its current syllable (Methods). We trained this RL agent using standard Q-learning rules, which govern how syllable-associated dopamine transients update the transition probabilities between syllables44, and formulated this model such that dopamine is both rewarding and injects transient variability into subsequent syllable choices.
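The Q-learning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation (see Methods for that): it assumes that the state is the current syllable, the action is the next syllable, and the reward is the dopamine amplitude associated with the chosen next syllable. The names `fit_q_table` and `softmax`, the learning rate `alpha`, the discount `gamma` and the toy data are all hypothetical, and the variability-injecting component of the full model is omitted.

```python
import math


def softmax(values, temperature=1.0):
    """Turn one row of Q-values into next-syllable probabilities."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fit_q_table(syllables, dopamine, n_syllables, alpha=0.1, gamma=0.9):
    """Replay an observed syllable sequence through tabular Q-learning.

    State = current syllable, action = next syllable and (by assumption)
    reward = dopamine amplitude associated with the chosen next syllable.
    """
    q_table = [[0.0] * n_syllables for _ in range(n_syllables)]
    for t in range(len(syllables) - 1):
        current, chosen = syllables[t], syllables[t + 1]
        reward = dopamine[t + 1]
        # Standard Q-learning target: reward plus discounted best future value.
        target = reward + gamma * max(q_table[chosen])
        q_table[current][chosen] += alpha * (target - q_table[current][chosen])
    return q_table


# Toy data: syllable 1 carries larger dopamine transients than syllable 2.
syllables = [0, 1, 0, 2, 0, 1, 0, 2]
dopamine = [0.0, 1.0, 0.0, 0.2, 0.0, 1.0, 0.0, 0.2]
q_table = fit_q_table(syllables, dopamine, n_syllables=3)
# A softmax over one row of the Q-table gives transition probabilities out of syllable 0.
transition_probs = softmax(q_table[0], temperature=0.1)
```

In this toy example the learned table favours the transition from syllable 0 to syllable 1 over the transition to syllable 2, because the former is followed by larger dopamine amplitudes.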

Fig. 5: RL models suggest that mice attempt to maximize dopamine during spontaneous behaviour.

a, Top, schematic describing modification of a standard RL model to explore relationships between DLS dopamine fluctuations and behavioural choices. Bottom, schematics of ‘reinforcement-only’ and ‘full’ model variants (Methods). b, Left, empirical transition matrix (TM) observed during open field behaviour. Centre, an example transition matrix learned by the full model (top), along with the squared difference between the empirically observed transition matrix and the example learned transition matrix (bottom). Right, as in centre, except for the reinforcement-only model. The average correlation between the observed transition matrix and the transition matrix learned by each model, along with the associated P-value computed via shuffle test, are given for each model type. For visualization, the model transition matrix is estimated by taking a softmax (see Methods) over the Q-table learned by the model. Here, the temperature parameter was set to 0.1 for visualization only. c, The distribution of correlations between the learned and observed transition matrices for both the reinforcement-only (blue) and full (orange) models, compared to a histogram of correlations between transition matrices learned with time-shuffled dLight traces (all models are statistically significant according to a shuffle test, defined as model fits exceeding 95% of shuffle correlation values, n = 100 shuffles). d, The performance of the full model after temporally shifting syllable-associated dLight amplitudes across syllables over various lags. e, The distribution of log likelihoods for models that consider dopamine as a reward versus a reward-prediction error (RPE) signal (Methods). The log likelihoods shown are for the best parameterization for each model type across 50 bootstraps of the dataset. On the basis of this relationship, we formulated models that treated dopamine transients as representing reward rather than reward-prediction error.

Inspection revealed that, after training, the model converged on a syllable transition matrix similar to that emitted by actual mice exploring an open field; comparing alternative formulations of this model revealed that maximal model performance depended on dopamine both reinforcing specific syllable transitions and briefly increasing the variability of syllable choices (Fig. 5b–d). These findings are consistent with a model in which mice structure spontaneous behaviour to maximize DLS dopamine. We note that although our models assume that dopamine acts as a reward, there are alternative formulations under which dopamine acts as a reward-prediction error that are also consistent with our data (Fig. 5e and Methods). This caveat is important given that mice rapidly modify syllable usage in response to Opto-DA stimulation, preventing us from formally distinguishing between these possibilities.

Discussion

Goal-oriented behaviours are purposive and yield explicit rewards, whereas spontaneous behaviour can often appear to be aimless and inscrutable. And yet, even the behaviour of mice placed in a dark empty bucket exhibits remarkable structure. Here we show that this structure is governed by ongoing fast fluctuations in the neuromodulator dopamine. DLS dopamine transients both correlate with and causally influence how often each syllable is used and in what order, despite the absence of an explicit task or exogenous reward. The ability of a simple RL model, in which dopamine transients are substituted for the reward signal, to predict syllable choices suggests that dopamine acts as a moment-to-moment teaching signal, one that enables DLS to dynamically assemble behavioural sequences. Together, our observations identify a neural mechanism that actively shapes the trajectory of spontaneous behaviour as it unfolds, and propose an unexpected functional role for DLS dopamine during self-motivated action that is centred around choice and reinforcement rather than movement initiation or instantaneous kinematic control.

Recordings of dopamine neurons during spontaneous behaviour have revealed a variety of correlations with movement initiation, speed and kinematics, and strong optogenetic stimulation of dopamine neurons elicits movement, suggesting that dopamine may cause movements per se9,11,12,14,20,33. The failure of our calibrated optogenetic experiments to causally influence movement initiation or execution at the moment of stimulation argues that DLS dopamine organizes—rather than commands—movement by influencing the overall statistical structure of behaviour. Notably, DLS dopamine bidirectionally influences future syllable vigour, demonstrating that dopamine regulates both discrete (that is, which syllable to express next) and continuous (that is, how fast to express a particular syllable) aspects of spontaneous behaviour.

Our data suggest four broad, non-mutually exclusive models for how DLS dopamine transients may arise and thereby influence future behavioural choices. First, dopamine may represent an output of the motor system that enables it to modulate (and invigorate) the future expression of some syllables relative to others. In this model, DLS dopamine acts to implement a motor plan articulated by the cortex, thalamus and basal ganglia (all of which send projections to SNc45); our observation that syllable-associated dopamine causally influences both future syllable use and subsequent syllable choices is consistent with this model. Although the mechanisms that enable dopamine to briefly increase sequence variability are not yet clear, it is possible that this phenomenon relates to the ability of dopamine to increase SPN excitability, which may decrease the fidelity with which cortical inputs are transformed into ensemble SPN activity30,46.

Second, DLS dopamine may represent the output of a circuit that evaluates the content of spontaneous behaviour. Although dopamine has classically been thought to report reward-prediction errors—which by definition require the provision of reward—it has recently been argued that dopamine may also encode action-prediction errors47,48 (APEs). APEs are proposed to occur as animals either execute or plan to execute a behaviour that is unexpected in a given context; in the setting of spontaneous behaviour, an APE-like model would predict that DLS dopamine represents the comparison between the expressed (or soon-to-be-expressed) behavioural syllable and that which would have been expressed at a particular moment given an idealized transition matrix. Our finding (similar to that in ref. 9) that syllable-associated dopamine transients reflect the probability of the next expressed behavioural syllable—but convey no information about syllable identity—is also consistent with the proposed role of APEs in conveying information about action errors that is independent of the specific identity of the expressed behaviour.

Third, DLS DA may encode an error signal unrelated to APEs. Given that most syllables are probably associated with some degree of active sensory sampling, it is possible that DLS dopamine reflects the difference between expected and experienced sensory cues in the environment that are encountered during each syllable. Similarly, DLS dopamine might evaluate the actual quality of execution of each syllable (akin to performance prediction errors observed in birdsong49,50), although our data suggest that this is not the case. Finally, the brain may be misinterpreting random fluctuations in DLS dopamine as a reward-like signal, thus inadvertently structuring behaviour; such stochastic fluctuations in cortical dopamine have recently been shown to support RL51. However, spontaneous behaviour in the open field evolves during our experiments with characteristic dynamics, arguing against this possibility. Future work will be required to arbitrate among these models.

We note that the midbrain dopamine system targets many brain areas with a diverse array of functions, and our experiments were deliberately designed to focus on the influence of dopamine on the DLS14. Given differences in their anatomical inputs and intrinsic timescales at which dopamine fluctuates52, it is likely that the relationship that we observe between DLS dopamine and syllable usage does not apply to the DMS or the ventral striatum, which vary in their functional roles in movement, motivation and value assignment24,53,54. Indeed, it is possible that phasic dopamine fluctuations in these areas are responsible for initiating new movements or controlling their kinematics; alternatively, it is possible that tonic dopamine is broadly permissive but not causal for movements per se, as is suggested by the ability of L-DOPA therapy to revert movement deficits in humans with Parkinson’s disease55. The causal relationship between dopamine and movement may also depend on the specific task in which an animal is engaged and the extent of its training; the observation that during goal-oriented tasks many individual dopamine neurons adopt task-related tuning to movements is consistent with this idea11,16,19. Note also that although both the endogenous dopamine and calibrated Opto-DA experiments argue that DLS dopamine is unrelated to movement initiation or kinematics (similar to observations in refs. 9,10,11,14,19), it is possible that our bulk measurements and manipulations obscure a subtle role for specific SNc dopaminergic axons in controlling instantaneous movement.

Despite training, animals often fail to learn to perform structured tasks, with some animals, tasks or behaviours being more resistant to learning than others; understanding brain–behaviour relationships under these circumstances often requires considering only those animals whose behavioural responses exceed some threshold for accuracy after training56. Although our experiments contain no overt goal or task structure, we also observe variability in Opto-DA learning, both across individual mice and across syllables. A substantial part of this variability can be explained by dopamine itself: mice that are more sensitive to endogenous dopamine fluctuations are better able to associate Opto-DA with targeted syllables, and syllables more effectively associated with Opto-DA are also associated with higher average endogenous dopamine transients. Our results also suggest that some mice are more sensitive to dopamine than others, despite being genetically identical and housed in similar conditions; further work is required to characterize the source of this inter-mouse variation.

Previous work has demonstrated that DLS SPNs encode information about the current syllable and the sequence context in which that syllable is expressed27,28,29. This information is probably inherited from thalamic and cortical inputs to DLS57,58. The ability of mice to upregulate syllables in response to DLS dopamine axon stimulation demonstrates that mice can recognize and reinforce the movements of their own body in the absence of external sensory cues (such as the click commonly used as a sensory trigger during clicker training); this self-recognition is remarkably specific, as dopamine stimulation does not substantially reinforce syllables that are kinematically related to the target, or syllables that are adjacent in time. Our findings naturally lead to a hypothesis in which dopamine encourages the use of associated syllables by inducing short-term plasticity in sensorimotor inputs to DLS59. Notably, our dopamine stimulation coincides with only a fraction of each targeted syllable, yet syllables remain temporally intact and are coherently reinforced as a whole; this observation supports the speculation that syllables are natural units of spontaneous behaviour used by the brain to structure action.

Methods

A list of reagents and resources is provided in Extended Data Table 1.

Ethical compliance

All experimental procedures were approved by the Harvard Medical School Institutional Animal Care and Use Committee (Protocol Number 04930) and were performed in compliance with the ethical regulations of Harvard University as well as the Guide for Animal Care and Use of Laboratory Animals.

MoSeq

Overview

MoSeq (described previously in refs. 4,27,60) is an unsupervised machine learning method that identifies brief, re-used behavioural motifs that mice perform spontaneously. MoSeq takes as its input 3D imaging data of mice and returns a set of behavioural ‘syllables’ that characterizes the expressed behaviour of those mice, and the statistics that govern the order in which those syllables were expressed in the experiment. MoSeq was used as originally described to explore relationships between endogenous DLS dopamine release and behaviour. This technology was further adapted to accommodate real-time syllable identification for closed-loop manipulations of neural activity as described below. Importantly, the underlying fitted autoregressive hidden Markov model (AR-HMM) is the same for both the ‘offline’ and ‘online’ variants of MoSeq used in this study, enabling comparisons of neural activity associated with syllables that were recognized and performed across multiple experiments.

Pre-processing

MoSeq consists of two essential workflows: one for pre-processing depth data and converting it into a low-dimensional time series that describes pose dynamics, and another for modelling the low-dimensional time-series data. As previously described, in order to focus on pose dynamics, raw depth frames were first background-subtracted to convert depth units from distance to height from the floor (in millimetres). Next, the location of the mouse was identified by finding the centroid of the contour with the largest area using the OpenCV findContours function. An 80 × 80 pixel bounding box was drawn around the identified centroid, and the orientation was estimated using an ellipse fit (with a previously described correction for ±180-degree ambiguities4,27). The mouse was rotated in the bounding box to face the right side. The 80 × 80 pixel depth video of the centred, oriented mouse was then used to estimate pose dynamics.

Size-normalizing deep network

To accommodate noise in online syllable estimation and other sources of variation in the depth images not due to changes in pose dynamics (for example, occluding objects such as fibre-optic cables), we designed a denoising convolutional autoencoder. The network was designed using TensorFlow to process images in <33 ms, the time between frame captures on the Microsoft Kinect V261. On the encoder side, 4 layers of 2D convolutions (ReLU activation) followed by max pooling were used to downsample the 80 × 80 images to 5 × 5. Another 4 layers of 2D convolutions with successive upsampling layers were used on the decoding side to reconstruct the 80 × 80 images (10,310,041 total parameters). Batch normalization was used during training with a batch size of 128. In order to train the network, we used a size- and age-matched dataset (7–8 weeks of age). Mouse images were corrupted through rotation, position jitter, zooming in and out (that is, changing size), and superimposing depth images of fibre-optic cables. The network was fed corrupted mouse images as input and was trained to minimize the reconstruction loss of the original, corresponding uncorrupted mouse images (Extended Data Fig. 8a–c). The model was trained for 100 epochs using stochastic gradient descent with early stopping. Both online and offline variants of MoSeq included the size-normalizing network to ensure results were comparable.

Dimensionality reduction and AR-HMM training

In order to represent pose dynamics in a common space for all experiments, principal components and an AR-HMM time-series model were trained offline on a sample dataset of genotype- and age-matched mice. The parameters describing the principal components and AR-HMM model were saved. All depth videos acquired for this paper were then projected onto these same principal components for all experiments, whether they used the online or offline variant. As previously described, principal components were estimated from cropped, oriented depth videos, and the AR-HMM was trained on the top 10 principal components. Since the denoising autoencoder was used for all experiments, mouse videos from the size-and-age-matched dataset were fed through the denoising autoencoder prior to principal component estimation.

Offline variant

In the offline variant, the Viterbi algorithm was used to estimate the most probable discrete latent state sequence according to the trained AR-HMM for each experiment post hoc. This variant was used to analyse all data except for the Opto-DA experiments shown in Figs. 3 and 4.
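As an illustration of the decoding step, the following is a generic log-domain Viterbi sketch for a discrete-state hidden Markov model. It is not the MoSeq code: the per-frame log likelihoods, which in an AR-HMM would come from the autoregressive emission model, are simply taken as an input here, and the function name `viterbi` and the two-state toy parameters are hypothetical.

```python
import math


def viterbi(log_init, log_trans, log_lik):
    """Return the most probable state sequence for one recording.

    log_init[k]     : log prior of starting in state k
    log_trans[j][k] : log probability of moving from state j to state k
    log_lik[t][k]   : log likelihood of frame t under state k
    """
    n_states = len(log_init)
    score = [log_init[k] + log_lik[0][k] for k in range(n_states)]
    backpointers = []
    for frame in log_lik[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor of state k at this frame.
            best_prev = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][k] + frame[k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy example: two "sticky" states; frames 0-2 favour state 0, frames 3-5 favour state 1.
log = math.log
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_lik = [[log(0.9), log(0.1)]] * 3 + [[log(0.1), log(0.9)]] * 3
path = viterbi(log_init, log_trans, log_lik)
```

Because the whole recording is available, the decoded label at each frame can depend on later frames, which is what distinguishes this post hoc decoding from frame-by-frame filtering.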

Online variant

In the online variant, syllable likelihoods were computed and updated by computing the forward probabilities of the discrete latent states for each frame as they arrived from the depth sensor. To avoid spurious syllable detections, the targeted syllable probability had to cross a user-defined threshold for three consecutive frames.
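The online procedure can be sketched as a forward-filtering update followed by a simple debounce on the target syllable's probability. This is a minimal illustration under stated assumptions: the per-frame likelihoods are taken as given, the 0.9 threshold is a placeholder for the user-defined value, and the trigger is assumed to fire once, on the frame at which the third consecutive supra-threshold frame arrives. `forward_step` and `SyllableTrigger` are hypothetical names.

```python
def forward_step(prev_probs, transition, likelihoods):
    """One step of HMM forward filtering; returns normalised state probabilities."""
    n_states = len(prev_probs)
    # Propagate the previous belief through the transition matrix.
    predicted = [
        sum(prev_probs[i] * transition[i][j] for i in range(n_states))
        for j in range(n_states)
    ]
    # Weight by the likelihood of the new frame and renormalise.
    unnormalised = [predicted[j] * likelihoods[j] for j in range(n_states)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


class SyllableTrigger:
    """Fire once when the target probability exceeds a threshold for N frames in a row."""

    def __init__(self, target, threshold=0.9, n_consecutive=3):
        self.target = target
        self.threshold = threshold          # placeholder for the user-defined value
        self.n_consecutive = n_consecutive
        self.count = 0

    def update(self, probs):
        if probs[self.target] > self.threshold:
            self.count += 1
        else:
            self.count = 0                  # any sub-threshold frame resets the run
        return self.count == self.n_consecutive


# One filtering step on a two-state toy model.
probs = forward_step([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [0.2, 0.8])

# Debounce behaviour on a stream of target-syllable probabilities.
trigger = SyllableTrigger(target=1)
target_probs = [0.95, 0.95, 0.5, 0.95, 0.95, 0.95, 0.95]
fired = [trigger.update([1.0 - p, p]) for p in target_probs]
```

In the toy stream, the dip at the third frame resets the counter, so the trigger fires only at the sixth frame.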

Histological verification

Mice were euthanized following completion of behavioural tests. Mice were first perfused with cold 1× PBS and subsequently with 4% paraformaldehyde. Fifty-micrometre sections of extracted brain tissue were sliced on a Leica VT1000 vibratome. All slices were mounted on glass slides using Vectashield with DAPI (Vector Laboratories) and imaged with an Olympus VS120 Virtual Slide Microscope.

dLight validation and variant selection

dLight1.1 was selected to visualize dopamine release dynamics in the DLS owing to its rapid rise and decay times, comparatively lower dopamine affinity (so as to not saturate binding), as well as its responsiveness over much of the physiological range of known DA concentrations in freely moving rodents31,62,63,64.

Since dopamine-free and dopamine-bound excitation spectra have yet to be reported for the dLight1.1 sensor, a series of in vitro experiments was performed to identify an excitation wavelength whose fluorescence was stable and independent of dopamine levels, and which therefore could be used for post hoc motion artefact correction. Like GCaMP, dLight1.1 uses cpGFP as a chromophore, and various generations of GCaMP have been shown to: (1) have an increase in ligand-free fluorescence when excited with 400 nm wavelengths and (2) have an isosbestic wavelength in the UV to blue region65,66,67. To test whether UV excitation could be a suitable reference wavelength for dLight1.1, HEK 293 cells (ATCC, cells were validated by ATCC via short tandem repeat analysis and were not tested for mycoplasma) were transfected with the dLight1.1 plasmid (Addgene 111067-AAV5) using Mirus TransIT-LT1 (MIR 2304). Cells were imaged using an Olympus BX51WI upright microscope and a LUMPlanFl/IR 60×/0.90W objective. Excitation light was delivered by an AURA light engine (Lumencor) at 400 and 480 nm with 50 ms exposure time. Emission light was split with an FF395/495/610-Di01 dichroic mirror and bandpass filtered with an FF01-425/527/685 filter (all filter optics from Semrock). Images were collected with a CCD camera (IMAGO-QE, Thermo Fisher Scientific), at a rate of one frame every two seconds, alternating the excitation wavelengths in each frame. Image acquisition and analysis were performed using custom-built software written in MATLAB68 (Mathworks). Cells were segmented from maximum-projection fluorescence images using Cellpose69. Cells with a diameter of less than 30 pixels were excluded from downstream analysis. Fluorescence traces were denoised using a Hampel filter (window size 10 and threshold set to 2 median absolute deviations from the median) and normalized to ΔF/F0. Cells were included if their maximum ΔF/F0 exceeded 5%. F0 was computed by fitting a bi-exponential function to the time series.
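The Hampel filtering step can be illustrated with a short, dependency-free sketch. Several details are assumptions rather than facts from the text: the window is taken to be centred on each sample with five samples on either side (the text gives only "window size 10"), the unscaled median absolute deviation is used, and outliers are replaced by the window median. `hampel_filter` and its parameter names are hypothetical.

```python
from statistics import median


def hampel_filter(trace, half_window=5, n_mads=2.0):
    """Replace samples that deviate from the local median by more than n_mads MADs."""
    filtered = list(trace)
    for i, value in enumerate(trace):
        # Window is clipped at the edges of the trace.
        window = trace[max(0, i - half_window): i + half_window + 1]
        centre = median(window)
        mad = median(abs(v - centre) for v in window)
        if abs(value - centre) > n_mads * mad:
            filtered[i] = centre
    return filtered


# Toy fluorescence trace with one artefactual spike at index 4.
trace = [1.0, 1.1, 0.9, 1.0, 10.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.1]
filtered = hampel_filter(trace)
```

Windows are always drawn from the original trace, not the partially filtered one, so one corrected sample does not influence the decision about its neighbours.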

Stereotaxic surgery for open field photometric recordings

Eight- to ten-week-old C57BL/6J (n = 6 mice, The Jackson Laboratory stock no. 000664) mice of either sex were anaesthetized using 1–2% isoflurane in oxygen, at a flow rate of 1 l min⁻¹ for the duration of the procedure. AAV5.CAG.dLight1.1 (Addgene #111067, titre: 4.85 × 10¹²) was injected at a 1:2 dilution (either sterile PBS or sterile Ringer’s solution) into the DLS (AP 0.260; ML 2.550; DV −2.40), in a total volume of 400 nl per injection. For all stereotaxic implants, AP and ML were zeroed relative to bregma, DV was zeroed relative to the pial surface, and coordinates are in units of mm. Injections were performed by a Nanoject II or a Nanoject III (Drummond) at a rate of 10 nl per 10 s, unilaterally in each mouse. A single 200-µm diameter, 0.37–0.57 NA fibre cannula was implanted 200 µm above the injection site at the DLS (DV −2.20) for photometry data collection. Finally, medical-grade titanium headbars (South Shore Manufacturing) were secured to the skull with cyanoacrylate glue (Loctite 454).

Mice were group-housed prior to stereotaxic surgery procedures, and following surgery were individually housed on a 12-hour dark–light cycle (09:00–21:00). All behavioural recordings were done between 10:00 and 17:00.

Stereotaxic surgery for simultaneous photometric recordings and optogenetic stimulation

Six- to 12-week-old DAT-IRES-cre mice (n = 10 mice, The Jackson Laboratory stock no. 006660) of either sex were injected with the same dLight1.1 virus described above into the right hemisphere DLS. Additionally, using the same previously described surgical procedure, 350 nl of AAV1.Syn.Flex.ChrimsonR.tdTomato (UNC Vector Core, titre: 4.1 × 10¹²) was injected into the right hemisphere SNc (AP −3.160; ML 1.400; DV −4.200 from pia), in a 1:2 dilution for calibration and stimulation experiments (see below). Mice were implanted unilaterally with a 200 µm core 0.37–0.57 NA fibre over the DLS for simultaneous stimulation and photometric data collection.

Two of the ten mice were used to calibrate optogenetic stimulation (see ‘dLight calibration experiments’). The other 8 mice injected with dLight and ChrimsonR were also run through the 3 complete closed-loop experiments described in ‘Closed-loop DLS dopamine stimulation experiments’ (one experiment with 250 ms continuous wave (CW) stimulation, one with 2 s CW stimulation, and another with 3 pulsed stimulation, 25 Hz frequency with 5 ms pulse width). Baseline data from these experiments were combined with data from the mice described in ‘Fibre photometry for dLight recordings’, thus yielding a total of n = 14 mice. Two of the 12 dLight-only mice did not pass our quality control criteria for dLight recordings and were thus excluded from all dLight analysis (note that they were included in Extended Data Fig. 2a–b,d only, which strictly used behavioural data). Baseline data were defined as data from the day prior to a stimulation day, or from the day after with the targeted syllable excluded (yielding n = 378 experiments total). If the targeted syllable could not be reasonably excluded, then data from the day after a stimulation day were excluded entirely.

dLight behaviour procedures

OFA experiments

Depth videos of mouse behaviour were acquired at 30 Hz with a Kinect 2 for Windows (Microsoft), using a custom user interface written in Python (similar to ref. 60) on a Linux computer. For all OFA experiments, except where noted, mice were placed in a circular open field (US Plastics 14317) in the dark for 30 min per experiment, for 2 experiments per day. As described previously, the open field was sanded and painted black with spray paint (Acryli-Quik Ultra Flat Black; 132496) to eliminate reflective artefacts in the depth video.

Food reward experiments

To assess whether spontaneous dLight transients in the DLS were of appreciable magnitude compared to reward consumption-related transients, a series of separate dLight photometry experiments was run to measure reward consumption-related transient magnitudes (n = 6 mice). For two days prior to the experiment, mice were habituated to the open field arena for two 30-min experiments on each day. On the morning of the experiment, to increase the salience of food reward, mice were habituated to the experimental room and food and water restricted for 3–5 h prior to beginning the experiment. Mice were placed in the arena, and behaviour and photometry data were simultaneously acquired. Chocolate chips (Nestle Toll House Milk Chocolate) were divided into quarters and introduced into the arena at random intervals and locations decided by the experimenter (with an average of 1 chocolate chip piece every 4 min) for mice to freely consume for a total of 30 min. To identify reward consumption-related responses, a human observer indicated each moment in time during the experiment at which mice began to consume the chocolate via post hoc inspection of the infrared video captured by the Kinect. Photometry signal peaks for Fig. 2a were identified at the onset of consumption. Spontaneous transient peaks had a mean magnitude of 2.12 ± 0.80 ΔF/F0 (z) (n = 5,247 transients). By comparison, reward consumption-associated transients had a mean magnitude of approximately 2.36 ± 0.92 ΔF/F0 (z) (n = 10 transients).

Fibre photometry for dLight recordings

Photometry and behavioural data were collected simultaneously. A digital lock-in amplifier was implemented using a TDT RX8 digital signal processor as previously described27. A 470 nm (blue) LED and a 405 nm (UV) LED (Mightex) were sinusoidally modulated at 161 Hz and 381 Hz, respectively (these frequencies were chosen to avoid harmonic cross-talk). Modulated excitation light was passed through a three-colour fluorescence mini-cube (Doric Lenses FMC7_E1(400-410)_F1(420-450)_E2(460-490)_F2(500-540)_E3(550-575)_F3(600-680)_S), then through a pigtailed rotary joint (Doric Lenses B300-0089, FRJ_1x1_PT_200/220/LWMJ-0.37_1.0m_FCM_0.08m_FCM) and finally into a low-autofluorescence fibre-optic patch cord (Doric Lenses MFP_200/230/900-0.37_0.75m_FCM-MF1.25_LAF or MFP_200/230/900-0.57_0.75m_FCM-MF1.25_LAF) connected to the optical implant in the freely moving mouse. Emission light was collected through the same patch cord, then passed back through the mini-cube. Light on the F2 port was bandpass filtered for green emission (500–540 nm) and sent to a silicon photomultiplier with an integrated transimpedance amplifier (SensL MiniSM-30035-X08). Voltages from the SensL unit were collected through the TDT Active X interface using 24-bit analogue-to-digital convertors at >6 kHz, and voltage signals driving the UV and blue LEDs were also stored for offline analysis.

The output of the PMT was then demodulated into the components generated by the blue and UV LEDs. The voltage signal was multiplied by each of the two driving signals, yielding the green emission due separately to blue and UV LED excitation, and low-pass filtered using a third-order elliptic filter (max ripple: 0.1; stop attenuation: 40 dB; corner frequency: 8 Hz). The UV component was used as a reference signal.
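The principle of this lock-in demodulation can be sketched in a few lines. The sketch below is a simplification, not the acquisition code: it uses a plain average over a one-second block as a crude stand-in for the third-order elliptic low-pass filter, an illustrative sampling rate of 6,000 Hz, and a simulated detector signal with made-up amplitudes. `lock_in_amplitude` is a hypothetical helper.

```python
import math


def lock_in_amplitude(signal, reference):
    """Estimate the amplitude of the component of `signal` at the reference frequency.

    Multiplying by the unit-amplitude reference shifts the matching component
    to DC; averaging then acts as a crude low-pass filter. The factor of 2
    undoes the 1/2 introduced by averaging sin^2.
    """
    product = [s * r for s, r in zip(signal, reference)]
    return 2.0 * sum(product) / len(product)


fs = 6000        # samples per second (illustrative; the text states >6 kHz)
n_samples = fs   # one second: an integer number of cycles of both carriers
t = [k / fs for k in range(n_samples)]
blue_ref = [math.sin(2 * math.pi * 161 * tk) for tk in t]
uv_ref = [math.sin(2 * math.pi * 381 * tk) for tk in t]

# Simulated detector voltage: DC offset plus both modulated emission components.
detector = [0.5 + 2.0 * b + 0.7 * u for b, u in zip(blue_ref, uv_ref)]

blue_amp = lock_in_amplitude(detector, blue_ref)  # recovers the 470 nm component
uv_amp = lock_in_amplitude(detector, uv_ref)      # recovers the 405 nm component
```

Because the two carriers are at different frequencies, each reference picks out only its own component from the shared detector signal. A real pipeline would replace the block average with a continuous low-pass filter to obtain a demodulated time series.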

Synchronizing depth video and photometry

To align photometry and behavioural data, a custom IR LED-based synchronization system was implemented. Two sets of 3 IR (850 nm) LEDs (Mouser part # 720-SFH4550) were attached to the walls of the recording bucket and directed towards the Kinect depth sensor. The signal used to power the LEDs was digitally copied to the TDT. An Arduino was used to generate a sequence of pulses for each LED set. One LED set transitioned between on and off states every 2 s while the other transitioned into an on state randomly every 2–5 s and remained in the on state for 1 s. The sequences of on and off states of each LED set were detected in the photometry data acquired with the TDT and IR videos captured by the Kinect. The timestamps of the sequences were aligned across each recording modality and photometry recordings were downsampled to 30 Hz to match the depth video sampling rate. This same mechanism was used to align photometry data to keypoints in Extended Data Fig. 3.

Photometry pre-processing

Demodulated photometry traces were normalized by first computing ΔF/F0. F0 was estimated by calculating the 10th percentile of the photometry amplitude using a 5-s sliding window to account for slow, correlated fluorescence changes between dLight and the UV reference channels. Both the dLight and reference channels were normalized using this procedure. Since the UV reference signal captures non-ligand-associated fluctuations in fluorescence (deriving from hemodynamics, pH changes, autofluorescence, motion artefact, mechanical shifts, and so on), a fit reference signal was subtracted from the dLight channel (see ‘Photometry active referencing’). Finally, referenced dLight traces were z-scored using a 20-s sliding window with a single sample step size slid over the entire experiment to remove slow trends in ΔF/F0 amplitudes due to long timescale effects—for example, photobleaching. Only experiments where the maximum percentage ΔF/F0 exceeded 1.5 and the dLight to reference correlation was below 0.6 were included for further analysis.
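The two sliding-window normalizations described above can be sketched as follows. This is an illustrative, dependency-free version under stated assumptions: windows are centred on each sample and clipped at the edges, the percentile uses linear interpolation, and a zero-variance window yields a z-score of 0. The function names and the toy traces are hypothetical; at 30 Hz, the 5-s and 20-s windows would correspond to roughly 75 and 300 samples either side.

```python
import math
from statistics import mean, pstdev


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1.0 - fraction) + ordered[upper] * fraction


def sliding_dff(trace, half_window):
    """Delta-F/F0, with F0 taken as the 10th percentile of a sliding window."""
    out = []
    for i, value in enumerate(trace):
        window = trace[max(0, i - half_window): i + half_window + 1]
        f0 = percentile(window, 10)
        out.append((value - f0) / f0)
    return out


def sliding_zscore(trace, half_window):
    """Z-score each sample against the mean and s.d. of its sliding window."""
    out = []
    for i, value in enumerate(trace):
        window = trace[max(0, i - half_window): i + half_window + 1]
        spread = pstdev(window)
        out.append((value - mean(window)) / spread if spread > 0 else 0.0)
    return out


# Toy traces with short windows so the arithmetic is easy to follow.
raw = [1.0] * 21
raw[10] = 2.0                                   # a single transient on a flat baseline
dff = sliding_dff(raw, half_window=2)
z = sliding_zscore([0.0, 0.0, 1.0, 0.0, 0.0], half_window=2)
```

Using a low percentile rather than the mean for F0 keeps the baseline estimate from being pulled upwards by the transients themselves.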

Photometry active referencing

In order to remove the effects of motion and mechanical artefacts from downstream analysis, a fit reference signal was subtracted from the demodulated dLight photometry trace as initially mentioned in ‘Photometry pre-processing’31,54 (Extended Data Fig. 1g). First, the reference signal was low-pass filtered with a second-order Butterworth filter with a 3 Hz corner frequency. Next, to account for differences in gain or DC offset, RANSAC ordinary least squares regression was used to find the slope and bias with which to transform the reference signal to minimize the difference between the reference and the dLight photometry traces. Finally, the transformed reference trace was subtracted from the dLight trace.
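A minimal sketch of the three referencing steps, assuming SciPy and scikit-learn are available. The function name, the use of zero-phase filtering (`filtfilt`), the default RANSAC settings and the 30 Hz sampling rate are assumptions that the text does not specify.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.linear_model import RANSACRegressor


def subtract_fit_reference(dlight, reference, fs=30.0, cutoff_hz=3.0, seed=0):
    """Low-pass the reference, fit it to dLight with RANSAC, and subtract."""
    # Second-order Butterworth low-pass with a 3 Hz corner frequency.
    b, a = butter(2, cutoff_hz, btype="low", fs=fs)
    ref_lp = filtfilt(b, a, reference)  # zero-phase filtering (assumption)

    # RANSAC around ordinary least squares (the default base estimator).
    ransac = RANSACRegressor(random_state=seed)
    ransac.fit(ref_lp.reshape(-1, 1), dlight)
    slope = float(ransac.estimator_.coef_[0])
    bias = float(ransac.estimator_.intercept_)

    # Transform the reference with the fitted slope and bias, then subtract.
    return dlight - (slope * ref_lp + bias), slope, bias
```

The robust fit is intended to keep large dopamine transients from dominating the slope and bias estimates.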

Capturing 3D keypoints

To capture 3D keypoints, mice were recorded in a multi-camera open field arena with transparent floor and walls. Near-infrared video recordings at 30 Hz were obtained from six cameras (Microsoft Azure Kinect; cameras were placed above, below and at four cardinal directions). Separate deep neural networks with an HRNet architecture were trained to detect keypoints in each view (top, bottom and side) using ~1,000 hand-labelled frames70. Frame-labelling was crowdsourced through a commercial service (Scale AI), and included the tail tip, tail base, three points along the spine, the ankle and toe of each hind limb, the forepaws, ears, nose and implant. After detection of 2D keypoints from each camera, 3D keypoint coordinates were triangulated and then refined using GIMBAL, a model-based approach that leverages anatomical constraints and motion continuity71. GIMBAL requires learning an anatomical model and then applying the model to multi-camera behaviour recordings. For model fitting, we followed the approach described in ref. 71, using 50 pose states and excluding outlier poses using the EllipticEnvelope method from sklearn. For applying GIMBAL to behaviour recordings, we again followed ref. 71, setting the parameters obs_outlier_variance, obs_inlier_variance, and pos_dt_variance to 1e6, 10 and 10, respectively, for all keypoints.

Computing 2D and 3D velocity

To compute 2D translational velocity, the centroid of the keypoints associated with the spine (approximating whole-body movement) was computed in the x and y dimensions (the z dimension was disregarded). Then, the velocity was computed from the difference in position between every 2 frames and divided by 2 (to provide a smoother estimate of velocity). 3D translational velocity was computed the same way, except the z dimension was included in the calculation. The average velocity of the keypoints associated with the forepaws was used to compute 3D forelimb velocity.
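The velocity computation can be illustrated with a short sketch. The array layout `(n_frames, n_keypoints, 3)`, the function name and the `spine_idx` argument are hypothetical, and the output is in position units per frame.

```python
import numpy as np


def translational_speed(keypoints, spine_idx, use_z=False, lag=2):
    """Speed of the spine centroid from a two-frame position difference.

    keypoints: array (n_frames, n_keypoints, 3) holding x, y, z coordinates.
    spine_idx: indices of the spine keypoints (hypothetical argument).
    """
    dims = 3 if use_z else 2
    # Centroid of the spine keypoints approximates whole-body position.
    centroid = keypoints[:, spine_idx, :dims].mean(axis=1)
    # Difference across `lag` frames, divided by `lag` for a smoother estimate.
    displacement = centroid[lag:] - centroid[:-lag]
    return np.linalg.norm(displacement, axis=1) / lag
```

Calling the function with `use_z=True` gives the 3D variant; passing forepaw indices instead of spine indices would give a forelimb estimate under the same assumptions.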

Partialing kinematic parameters from dLight

To compute the relationship between dLight and forelimb velocity, other kinematic parameters known to be correlated with dLight were partialed out of the dLight fluorescence signal. Specifically, 2D velocity, 3D velocity and height were partialed out of dLight using linear regression. Then, the correlation between the partialed dLight signal and 3D forelimb velocity was computed and compared to 1,000 bootstrapped shuffles.
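A minimal sketch of this partialing procedure using ordinary least squares in NumPy. The function names are hypothetical, and the null distribution here is built by permuting the residual signal, which is an assumption about how the shuffles were constructed.

```python
import numpy as np


def partial_out(signal, covariates):
    """Residual of `signal` after regressing out covariates (with intercept)."""
    X = np.column_stack([np.ones(len(signal)), covariates])
    beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return signal - X @ beta


def shuffle_test(residual, target, n_shuffles=1000, seed=0):
    """Compare the observed correlation with a permutation null."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(residual, target)[0, 1]
    null = np.array([
        np.corrcoef(rng.permutation(residual), target)[0, 1]
        for _ in range(n_shuffles)
    ])
    # Two-sided P-value with the usual +1 correction.
    p = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_shuffles + 1)
    return observed, null, p
```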

Movement initiation analyses

A changepoint detection algorithm was used to find moments where mice transitioned from periods of relative stillness to movement. To capture long bouts of movement, the velocity of the 2D centroid of the mouse was z-scored across each experiment and then smoothed with a 50-point (1.67 s) boxcar window. To find sharp changes in velocity, the derivative of the smoothed velocity trace was computed, and the result was raised to the third power. Peaks in this velocity changepoint score were discovered using SciPy’s find_peaks function with the following parameters: height 1, width 1, prominence 1 so that consecutive data points around each peak were disregarded.
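The changepoint score can be sketched as below. The function name and edge padding are assumptions, and so is the choice to express the derivative per second rather than per sample; the text does not state the units, and the thresholds passed to `find_peaks` depend on that choice.

```python
import numpy as np
from scipy.signal import find_peaks


def movement_initiations(speed, fs=30.0, window=50):
    """Return indices of stillness-to-movement transitions and the score."""
    speed = np.asarray(speed, dtype=float)
    # Z-score across the experiment, then smooth with a boxcar window.
    z = (speed - speed.mean()) / speed.std()
    left = window // 2
    padded = np.pad(z, (left, window - 1 - left), mode="edge")
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    # Cubing the derivative keeps its sign and emphasizes sharp increases.
    score = np.gradient(smoothed, 1.0 / fs) ** 3  # per-second units (assumption)
    peaks, _ = find_peaks(score, height=1, width=1, prominence=1)
    return peaks, score
```

Because the cube preserves sign, decelerations produce negative scores and are not returned as peaks.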

dLight time warping

To account for variability in syllable duration, dLight traces were time warped for Extended Data Fig. 4a. Here, all dLight traces were linearly interpolated using the numpy.interp function to a duration of 0.83 s, or 25 samples. Thus, syllables longer than 0.83 s were linearly compressed, and syllables shorter than 0.83 s were linearly expanded. We obtained similar results time warping dLight traces to 0.4 s; thus, the duration of time warped instances did not affect interpretation of subsequent analyses.
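Linear time warping with numpy.interp can be sketched as follows; the function name and the normalized time axis are illustrative choices.

```python
import numpy as np


def time_warp(trace, n_out=25):
    """Linearly resample one syllable-aligned trace to `n_out` samples."""
    trace = np.asarray(trace, dtype=float)
    src = np.linspace(0.0, 1.0, trace.size)  # original normalized time
    dst = np.linspace(0.0, 1.0, n_out)       # target normalized time
    return np.interp(dst, src, trace)
```

With 30 Hz data, `n_out=25` corresponds to the 0.83 s target duration, so longer syllables are compressed and shorter ones are expanded.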

dLight average waveform z-scoring

For dLight waveforms shown in Fig. 1f, top and bottom, h,i,k and Extended Data Figs. 4c–g, 5c,f and 7c, first onset-aligned waveforms were z-scored using the mean and s.d. of fluorescence values from 10 s prior to 10 s after onset. Next, to account for differences in the number of syllable instances (trials) in each average, waveforms were additionally normalized by z-scoring relative to the mean and s.d. of 1,000 shuffle averages, where individual trials were circularly permuted prior to averaging.
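The second normalization step can be sketched as follows, under the assumption that "circularly permuted" means each trial is rolled along the time axis by an independent random shift. The function name is hypothetical.

```python
import numpy as np


def shuffle_zscored_average(trials, n_shuffles=1000, seed=0):
    """Z-score a trial average against averages of circularly shifted trials.

    trials: array (n_trials, n_samples) of onset-aligned waveforms.
    """
    rng = np.random.default_rng(seed)
    n_trials, n_samples = trials.shape
    observed = trials.mean(axis=0)
    null = np.empty((n_shuffles, n_samples))
    cols = np.arange(n_samples)[None, :]
    for k in range(n_shuffles):
        # Independent circular shift for every trial (assumption).
        shifts = rng.integers(0, n_samples, size=n_trials)
        idx = (cols - shifts[:, None]) % n_samples
        null[k] = np.take_along_axis(trials, idx, axis=1).mean(axis=0)
    return (observed - null.mean(axis=0)) / null.std(axis=0)
```

Because the null averages use the same number of trials as the observed average, the resulting z-scores are comparable between syllables with few and many instances.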

Decoding syllable identity from dLight waveforms

To decode syllable identity from dLight waveforms or dLight peaks, a random forest classifier72 (cuRF = 1,000 trees, max depth = 1,000, number of bins = 128, with cross-validation on 5 folds of data) was trained to predict syllable and syllable group identity on held-out data (similar to ref. 27). Syllable groups were created by hierarchically clustering syllables based on their pairwise MoSeq distance (see below) and thresholds were increased with a distance cut-off in steps of 0.2. The inputs to the random forest classifier were either: (1) the maximum z-scored dLight value from syllable onset to 300 ms after syllable onset for each syllable instance or (2) dLight waveforms and their derivatives starting at syllable onset up to 300 ms into the future for individual syllable instances. Held-out accuracy was compared to 100 shuffles of syllable identity.

Decoding turning orientation from dLight waveforms

To decode turning orientation from dLight waveforms (Extended Data Fig. 5c), a linear support vector machine was trained to classify whether a particular syllable instance is a left- or rightward turning syllable using cross-validation on five folds of data. To sample the behaviour space of turning syllables, eight syllables with the largest angular velocities were chosen, four for each turning orientation. The model was fit to dLight waveforms and their derivatives starting at syllable onset up to 300 ms after onset for individual syllable instances and was tested on held-out data.

MoSeq distance

The MoSeq distance between two syllables was computed as previously described27. In brief, the estimated autoregressive matrices for each syllable were used to generate synthetic trajectories through principal component space (that is, in the space defined by the first ten principal components of the depth video). Then, the correlation distance between trajectories was computed for all pairs of syllables. Since the online and offline variants of MoSeq used the same autoregressive matrices, these distances are equivalent in the online and offline variants.

Analysing the relationship between dLight and syllable statistics within an experiment across syllables

The dLight fluorescence associated with syllable transitions was computed as the maximum z-scored dLight value from syllable onset to 300 ms after syllable onset for each syllable transition, to account for jitter in dopamine release or technical jitter in defining syllable changepoints. Throughout the text, we refer to syllable-associated waveform peak amplitudes in z-scored ΔF/F0 units as ‘syllable-associated dLight’. These dLight values were then averaged for each syllable and for each experiment. To assess the correlation between syllable-associated dLight and syllable counts, the dLight averages were z-scored across syllables in each experiment. These normalized dLight peaks represented whether a syllable had relatively higher or lower dLight during a given experiment. Finally, experiment-normalized dLight values along with syllable counts were then averaged across experiments for each mouse, thus leaving a value for each mouse and each syllable.

In order to measure the linear relationship between dLight peak values and syllable counts, a robust linear regression using the Huber regressor73 predicted average syllable counts from average dLight peaks. The regression model was evaluated using a fivefold cross-validation repeated 100 times. Reported correlation values in Figs. 1j and 2 were estimated over the held-out data. P-values were estimated by comparing held-out correlation values to those estimated from a linear model computed on shuffled data. To remove syllables that varied due to finite size effects, only syllables that occurred at least 100 times total across all experiments per mouse were included.

To compute syllable entropy (estimating the randomness of outgoing transitions associated with each syllable), the outgoing transition probabilities associated with each syllable for each mouse were computed by counting the number of occurrences a syllable transitions to all others within an experiment and expressing this as a probability distribution. Next, the Shannon entropy was estimated over the outgoing transition probabilities for each syllable. Finally, the linear regression was estimated using the exact same procedure used for syllable counts.
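A minimal sketch of the entropy calculation, assuming the input is a sequence of integer syllable labels with consecutive repeats already collapsed. The function name and the use of bits (log base 2) are assumptions.

```python
import numpy as np


def outgoing_entropy(sequence, n_syllables):
    """Shannon entropy (bits) of each syllable's outgoing transitions."""
    seq = np.asarray(sequence)
    counts = np.zeros((n_syllables, n_syllables))
    # counts[i, j] = number of times syllable i was followed by syllable j.
    np.add.at(counts, (seq[:-1], seq[1:]), 1)
    totals = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, totals, out=np.zeros_like(counts),
                      where=totals > 0)
    logp = np.log2(probs, out=np.zeros_like(probs), where=probs > 0)
    return -(probs * logp).sum(axis=1)
```

Syllables that never occur, or that always lead to the same successor, receive an entropy of zero in this sketch.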

Analysing the relationship between dLight and syllable statistics across experiments for each syllable

This series of analyses queried a total of 379 experiments. To capture the correlation between syllable-associated dLight peaks and syllable-associated behavioural features (syllable frequency, syllable entropy) within syllables but across experiments, first, the maximum z-scored dLight amplitude from onset to 300 ms after syllable onset at each syllable transition was computed. These syllable-associated dLight peaks were averaged for each experiment and syllable. Then, the dLight peak averages for each syllable and mouse were z-scored separately across experiments. Additionally, to put variation of each syllable across experiments on the same scale, syllable frequency, and syllable entropy were also z-scored for each syllable and mouse across experiments (Fig. 2b,i, bottom). Next, to remove variability in the calculation, values were pooled across syllables for each experiment, thus leaving a value for each experiment and mouse. To remove syllables that varied due to finite size effects, first only syllables that occurred at least 50 times per session on average were considered for downstream analysis. Linear models (Huber regressors) were fit to the resulting average dLight peaks, syllable frequency, and syllable entropy and evaluated as described in the previous section.

Analysing the moment-to-moment relationship between dLight and syllable statistics within an experiment

This series of analyses queried a total of 760 syllable–experiment pairs. dLight peak values were estimated by taking the maximum dLight value from onset to 300 ms after onset at each syllable transition. Velocity, syllable counts, and dLight peak values were averaged per syllable and per mouse over an expanding bin size; that is, velocity, syllable counts, and dLight peak values were estimated over the subsequent n syllables after the transition where the dLight value was calculated, where n varied from 5 syllables up to 400 (Fig. 2e). For sequence randomness, to avoid finite size effects, dLight values were binned into 20 equally spaced bins per syllable (Fig. 2k). Then, transition matrices were combined within each bin across all syllables per mouse and per time bin. Finally, Pearson correlation values were calculated between dLight values and the behavioural features estimated at each bin size. Pearson coefficients were z-scored using the mean and s.d. from Pearson coefficients estimated after shuffling dLight peak values.

Note that, in order to prevent the measurement from being influenced by consistent non-stationarities in behaviour, these correlations were computed within each of the five time segments shown by dashed lines in Extended Data Fig. 2e. Then, per-segment correlations were averaged.

Time-constants associated with the correlation between dLight values and behavioural features over increasing bin sizes were estimated by fitting an exponential decay curve to the correlation values at each bin size using SciPy’s curve_fit function74. Decay functions were fit over 1,000 bootstrap resamples of the data; the depicted distributions are taus fit over each resample.
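The bootstrap fit can be sketched as below. The offset term in the decay function, the initial guesses, the function names, and the choice to resample rows of a hypothetical `(n_observations, n_bin_sizes)` array are all assumptions, since the text does not specify the functional form or the unit of resampling.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp_decay(n, amplitude, tau, offset):
    """Exponential decay with an additive offset (assumed functional form)."""
    return amplitude * np.exp(-n / tau) + offset


def bootstrap_tau(bin_sizes, corr_samples, n_boot=1000, seed=0):
    """Fit tau on bootstrap resamples of the rows of `corr_samples`."""
    rng = np.random.default_rng(seed)
    n_obs = corr_samples.shape[0]
    taus = np.empty(n_boot)
    for k in range(n_boot):
        # Resample observations with replacement, then average per bin size.
        resampled = corr_samples[rng.integers(0, n_obs, n_obs)].mean(axis=0)
        p0 = (resampled[0] - resampled[-1], np.median(bin_sizes), resampled[-1])
        params, _ = curve_fit(exp_decay, bin_sizes, resampled,
                              p0=p0, maxfev=10000)
        taus[k] = params[1]
    return taus
```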

Analysing the cross-correlation between syllable-associated dLight and syllable usage

The dLight fluorescence associated with all instances of a given syllable was binned across a three-minute window (chosen based upon the decay in Fig. 2f) and correlated with the use of that same syllable across a 3-min window, with the windows shifted the indicated amounts (x-axis). Correlation values (in Fig. 2g,h) were z-scored using the mean and s.d. from shuffles. P-values were estimated via shuffle test.

Analysing the relationship between syllable-associated dLight and syllable classes

Syllables were manually classified into 6 classes by hand-labelling crowd videos summarizing model output4,27,60. Then, syllable-associated dLight was averaged for all syllables within each class.

Encoding model predicting average dLight from behaviour

As with the linear regression analysis (previous section), dLight peaks were estimated by taking the maximum z-scored dLight amplitude from syllable onset to 300 ms after onset. Behavioural features (entropy, velocity, and syllable counts) after each transition were computed across various bin sizes as described in ‘Analysing the relationship between dLight and syllable statistics within an experiment’. The bin sizes used were 5, 10, 25, 50, 100, 200, 300, 400, 800 and 1,600 syllables. Syllable frequency, syllable entropy and velocity were averaged for each experiment and syllable in each bin size. These syllable and experiment-wide average values were then z-scored separately for each mouse and then averaged for each mouse and each syllable. In order to remove correlations between behavioural features, they were whitened using zero-phase component analysis (ZCA) whitening. Whitened behavioural features were then fed to a Bayesian linear regression model to predict average dLight peak amplitudes per syllable and per mouse according to the following equation:

$$p(\,y|X,\beta ,{\sigma }^{2})=N({\beta }^{T}X,{\sigma }^{2})$$

where X is defined as features, β is regression coefficients, y is dLight peak values, σ is the s.d., and N is the normal distribution. A normal prior was placed on the regression coefficients, and an exponential prior was placed on the s.d. Samples from the posterior were drawn via the No-U-Turn Sampler (NUTS) using NumPyro (n = 1,000 warmup samples, then n = 3,000 samples)75. To assess the temporal relationship between behavioural features and dLight, a separate model was fit at each lag (here, features were whitened separately within each lag, Extended Data Fig. 6c). Overall model performance was quantified by feeding in features at their approximate best bin size to the model. For kinematic parameters and for entropy, this bin size (lag) was 10 timesteps; for syllable counts, this bin size (lag) was 100 timesteps (in syllable time). Then, each feature was fed in separately to quantify the performance of feature subsets.
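The ZCA whitening step applied to the behavioural features before regression can be sketched in NumPy. The function name and the small regularizer `eps` are assumptions.

```python
import numpy as np


def zca_whiten(features, eps=1e-8):
    """ZCA-whiten features of shape (n_observations, n_features)."""
    centred = features - features.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # W = V diag(1 / sqrt(lambda)) V^T rotates back into the original axes,
    # so whitened features stay as close as possible to the inputs.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centred @ W
```

After whitening, the features have approximately identity covariance, so each regression coefficient can be read without interference from correlated features.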

Encoding model predicting instantaneous dLight from behaviour

In order to predict instantaneous dLight amplitudes from syllable counts, syllable entropy, velocity (2D, angular and height velocity), and acceleration, a series of convolution kernels were estimated, each of which map from each behavioural feature to dLight amplitude. Mathematically, the model can be written as follows:

$${\rm{dLight}}\left(t\right)=\sum _{f\in F}\mathop{\sum }\limits_{t{\prime} =-2s}^{2s}{{\rm{\beta }}}_{f}\left({t}^{{\prime} }\right)f\left(t-{t}^{{\prime} }\right)$$

where dLight (t) corresponds to the dLight trace at time step t, f(t) is the behavioural feature at time step t, and β is the weight of the convolution kernel. Kernel weights were optimized using a Huber loss via the Jax library76. That is to say, the dLight amplitude at each time sample is predicted by convolving each behavioural feature (frequency, entropy, velocity, and acceleration) with a convolution kernel and then summing the result across features. The model was trained and evaluated using twofold cross-validation by recording experiment, and the Pearson correlation between predicted dLight amplitudes and actual amplitudes was assessed on held-out experiments. In order to remove the effects of high frequency noise on training and evaluation, the dLight traces were smoothed using a 60-sample (2-s) boxcar filter prior to training and evaluation.
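The forward pass of such a model (one kernel per feature, convolved and summed) can be sketched in NumPy. This shows only the prediction step and not the Huber-loss optimization; the dictionary interface, the function name and the 121-tap kernel length (±2 s at 30 Hz) are assumptions.

```python
import numpy as np


def predict_dlight(features, kernels):
    """Sum over features of each feature convolved with its kernel.

    features: dict mapping feature name -> 1D array (n_samples,)
    kernels:  dict mapping feature name -> 1D array with an odd number of
              taps; the centre tap corresponds to zero lag.
    """
    n = len(next(iter(features.values())))
    pred = np.zeros(n)
    for name, x in features.items():
        # mode="same" keeps the output aligned with the input samples.
        pred += np.convolve(x, kernels[name], mode="same")
    return pred
```

Kernel weights at positive lags let past behaviour predict current dLight, and weights at negative lags let dLight anticipate upcoming behaviour.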

Decoding model predicting behaviour from dLight

The decoding model was designed to capture the two main effects of dopamine on behavioural statistics—usage and sequencing. The goal of the decoding model is to predict the likelihood of a sequence of syllables given past dopamine. The model comprises two key features: (1) a component that scales syllable usage by past syllable-associated dopamine, and (2) a component that scales randomness of the next syllable choice by past global dopamine. This can be summed up with the following equation:

$$P({s}_{t}=i)\propto \exp \left(\frac{{\alpha }_{a}\mathop{\sum }\limits_{n=1}^{250}\left({\rm{d}}{a}_{t-n}\exp \left(\frac{-n}{{\tau }_{a}}\right)\delta \left({s}_{t-n}=i\right)\right)}{{\alpha }_{b}\mathop{\sum }\limits_{n=1}^{250}\left({\rm{d}}{a}_{t-n}\exp \left(\frac{-n}{{\tau }_{b}}\right)\right)}\right)$$

where st is the syllable a mouse performs at time t during a behaviour experiment, dat is the peak dLight recorded for syllable st, τa and τb describe the timescales of the usage and choice randomness components, respectively, αa and αb scale the usage and choice randomness components, respectively, and δ is the Kronecker delta function (that is, a one-hot encoding) that returns 1 when st−n = i and 0 otherwise.

The parameters αb, τa and τb were fixed using approximations of analysis of the behavioural data (Fig. 2), and only αa was learned by maximizing the likelihood of the function above given the sequence of syllables mice perform across a group of experiments and peak dLight measurements associated with the syllable sequence z-scored across each experiment. This was done via evaluating the likelihood of the function over multiple values of αa. τa (describing the effect of dopamine on future syllable usage/counts) was fixed at 100 syllable timesteps, and τb (describing the effect of dopamine on syllable sequence entropy) was fixed at 10 syllable timesteps. These values were approximated from the median τ values reported in Fig. 2.
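A literal NumPy transcription of the equation above is sketched here for a single time step. The function name and default parameter values are illustrative only, and the sketch assumes the summed global dopamine term in the denominator is non-zero.

```python
import numpy as np


def next_syllable_probs(syllables, da, n_syllables, alpha_a=1.0, alpha_b=1.0,
                        tau_a=100.0, tau_b=10.0, max_lag=250):
    """P(s_t = i) from the preceding syllables and their peak dLight values.

    syllables, da: history up to time t-1, oldest first.
    """
    past_s = np.asarray(syllables)[::-1][:max_lag]         # s_{t-1}, s_{t-2}, ...
    past_da = np.asarray(da, dtype=float)[::-1][:max_lag]
    n = np.arange(1, past_s.size + 1)
    # Numerator: syllable-specific, exponentially discounted dopamine.
    usage = np.zeros(n_syllables)
    np.add.at(usage, past_s, past_da * np.exp(-n / tau_a))
    # Denominator: global discounted dopamine, acting like a temperature.
    randomness = np.sum(past_da * np.exp(-n / tau_b))
    logits = (alpha_a * usage) / (alpha_b * randomness)
    logits -= logits.max()  # numerical stability before exponentiation
    p = np.exp(logits)
    return p / p.sum()
```

Summing the log of these probabilities over the observed sequence gives the likelihood that would be scanned over candidate values of αa.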

To test model performance, data were split into 5 folds of training and test experiments and repeated 100 times using repeated K-fold cross-validation. We then computed the Pearson correlation between syllable counts from model simulations and actual syllable counts after smoothing with a 50-point rolling average. The one free parameter was fit using the training dataset and assessed on the test dataset. To avoid degradation in performance due to syllable sparsity, the top 10 syllables were used. The model was compared to a suite of control models, each evaluated over the same folds. The dopamine phase shift model was evaluated on the same data, but with all dopamine traces circularly shifted by a random integer between 1 and 1,000, and the noise model was evaluated with dopamine traces replaced by numbers drawn from a unit variance random normal distribution (since the traces were z-scored). In order to determine the maximum possible performance, the per experiment number of counts per syllable was correlated with the across-experiment average. Here, the model performed significantly better than controls. Median Pearson correlation between held-out predictions and observed data: actual model r = 0.20, phase shift control r = 0.04, noise model r = 0.04. Comparison between actual model and controls, P = 7 × 10−18, U = 2,500, f = 1, Mann–Whitney U test, n = 50 model restarts.

To test the hypothesis that endogenous and exogenous dopamine linearly combine to alter the future usage of single syllables of behaviour, the present decoding model was modified. Maximal correlations were identified between predicted and observed syllable usages when adding (or subtracting) extra dopamine (termed ‘extra DA’) to the syllable-associated dopamine amplitudes observed on catch trials (Fig. 4g–i). Model-based log likelihoods of held-out syllable choices from Opto-DA stimulation day experiments were then computed. Other versions of this model (shown in Fig. 4h) included: (1) a control model in which no ‘extra DA’ is added to the model (‘no offset’), (2) a control that uses a phase-shifted version of the dLight trace (‘random shift’), and (3) a model that uses random numbers from a normal distribution with mean and variance matched to the dLight signal (‘noise’).

dLight calibration experiments

In order to characterize the speed and magnitude of evoked dopamine transients in the open field, dLight transients were elicited using brief optogenetic stimulation of SNc axons in the DLS expressing ChrimsonR while mice freely explored an open field arena77. A number of stimulation parameters were tested, varying the light intensity, the stimulation length, and whether the stimulus was delivered as a single continuous-wave pulse or as multiple rapid short pulses. A single, short (250 ms; roughly the timescale of syllables), continuous stimulation pulse of red light at 10 mW (Opto Engine MRL-III-635; SKU: RD-635-00500-CWM-SD-03-LED-0) most effectively matched the amplitude and dynamics of endogenous dLight transients observed in the open field. The mean Opto-DA peak was measured at 2.18 ± 0.85 ΔF/F0 (z), mean spontaneous peak = 2.23 ± 0.62 ΔF/F0 (z) and 99th percentile spontaneous peak = 3.40 ΔF/F0 (z). Pulsed stimulation was also disfavoured as numerous studies have shown that pulsed stimulation can cause synchrony in neural and axonal networks that can evoke prolonged release78,79,80. Note that when excited with 635 nm light, the efficiency with which light evokes spiking in neurons expressing ChrimsonR is similar to the efficiency with which blue light evokes spiking in neurons expressing ChR277.

Once a single 250 ms continuous pulse of 10 mW light was preliminarily chosen as the desired optogenetic stimulus to evoke dopamine release from DLS dopamine axons, another round of open-loop stimulation with these stimulation parameters was performed in the open field in two of the 10 total mice injected with dLight and ChrimsonR. In these two mice, the intervals between stimulation times were drawn by randomly choosing an integer delay between 6 and 17 s for each stimulation. This range was chosen to guarantee each animal received at least 100 stimulations during an experiment. This enabled analysis of more stimulation trials with intended parameters to verify that the amplitudes of evoked transients were within the same order of magnitude as those of spontaneous transients (Fig. 3c).

DMS dLight recordings

As a series of control experiments to establish the specificity of DLS dopamine encodings, dLight recordings were performed in the DMS using the same techniques described above. dLight stereotactic injections in wild-type mice of either sex (C57BL/6J, n = 8) were performed at AP: 0.26, ML: 1.5, and DV: −2.2. Fibres for photometry (in C57BL/6J mice, n = 8, n = 64 recording experiments) were implanted in the manner described above at coordinates: AP: 0.26, ML: 1.5, DV: −2.0. Open field behavioural recordings and encoding models were performed for these data exactly as described above.

Stereotaxic surgery for optogenetics

Eight- to fifteen-week-old DAT-IRES-cre::Ai32 mice resulting from the cross of DAT-IRES-cre mice (The Jackson Laboratory, 006660) and Ai32 mice (The Jackson Lab, 012569) of either sex were used. The double transgenic DAT-IRES-cre::Ai32 mouse line has previously been used to conduct specific dopaminergic neuron activation10,81,82. Similar surgical procedures were used as described above, except two 200 µm 0.37 NA multimode optical fibres were implanted bilaterally over DLS (AP 0.260; ML 2.550; DV −2.300), in DAT-IRES-cre::Ai32 mice (n = 20). Control animals (DAT-IRES-cre mice, n = 12) of either sex were implanted bilaterally at the same coordinates, with 6 of these animals implanted in the nucleus accumbens (AP 1.300; ML 1.000; DV −4.000). These animals are collectively termed ‘no-opsin controls’ throughout the manuscript. Medical-grade titanium headbars were secured to the skull using cyanoacrylate. Optical stimulation experiments were then performed 2–3 weeks post-surgery.

Closed-loop stimulation behavioural paradigm

For two days prior to the closed-loop stimulation schedule (Fig. 3d), mice were habituated to the bucket for two 30-min experiments on each day. To test the change in statistics of specific syllables via syllable-triggered optogenetic stimulation, experiments were performed in a three-day schedule for each of six chosen target syllables. On the first day, two 30-min experiments were run for each mouse to characterize baseline target syllable usage. On the second day, two 30-min ‘stimulation’ experiments were performed for each mouse. During these experiments blue light (470 nm, 10 mW, a single 250-ms continuous-wave pulse) was delivered on 75% of target syllable detections. Stimulation was not conditioned on syllables occurring before the target. Finally, on the third day, baseline experiment recordings were repeated to assess syllable usage memory and usage decay after reinforcement. For half of the targeted syllables for each mouse (randomized across mice), the pre-stimulation baseline experiment is the same experiment as the post-stimulation baseline experiment for a different syllable (see Fig. 3d). A three-day cadence with multiple, short behavioural recording experiments per day was chosen to both minimize non-stationarities in syllable usage within an experiment, as well as to not expose the mice to the behavioural arena for more than one total hour per day. To control for order effects on changes in target syllable usages over time, animals were randomly split into two groups, each of which had a unique ordering of target syllables across the six stimulation days of the three-week cadence. The time interval between the first experiment and the second experiment for the same mouse on each day (either recording or stimulation) was 195 min on average ±58 min (s.d.). Mice were euthanized following completion of behavioural tests, and histology was performed using procedures described above.

To assess the effect of increased dopamine release these experiments were repeated with 3-s pulsed stimulation (25 Hz, 5 ms pulse width) in n = 3 DAT-cre::Ai32 and n = 2 (DAT-IRES-cre) control animals.

Closed-loop velocity modulation experiments

DAT-IRES-cre::Ai32 mice (n = 5) of either sex underwent 90-min recording and manipulation experiments. For the first 30 min, we estimated the distribution of velocities for a specific target syllable. Then, for the next 30 min, optogenetic stimulation was triggered both when the syllable was expressed according to our closed-loop system and when the animal’s syllable-specific velocity exceeded the 75th percentile or went below the 25th percentile. Experiments were analysed only if the mouse received at least 50 stimulations and they increased the usage of the target syllable on average relative to their average baseline (established via separate recording experiments with no stimulation).

Quantifying changes in target syllable counts

First, the number of times the targeted syllable was performed in each non-overlapping 30-s window of each 30-min stimulation experiment was computed. Then, a cumulative sum was taken. To turn the result into an estimate of excess target counts, a cumulative sum was also computed from the morning and evening experiments from the most recent previous baseline day. Finally, the morning and evening baseline estimates were averaged and subtracted off.
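The excess-count estimate can be sketched as follows, assuming target-syllable onsets are available as times in seconds from the start of each 30-min experiment. The function and argument names are hypothetical.

```python
import numpy as np


def cumulative_counts(onset_times_s, duration_s=1800.0, bin_s=30.0):
    """Cumulative target-syllable count in non-overlapping time bins."""
    edges = np.arange(0.0, duration_s + bin_s, bin_s)
    counts, _ = np.histogram(onset_times_s, bins=edges)
    return np.cumsum(counts)


def excess_target_counts(stim_onsets, baseline_am_onsets, baseline_pm_onsets):
    """Stimulation-day cumulative counts minus the mean baseline curve."""
    baseline = 0.5 * (cumulative_counts(baseline_am_onsets)
                      + cumulative_counts(baseline_pm_onsets))
    return cumulative_counts(stim_onsets) - baseline
```

The final element of the returned array is the total number of target syllables performed in excess of baseline over the whole experiment.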

‘Learner’ mice were defined as mice whose average change in target counts above baseline across all syllables exceeded the maximum average change in target counts exhibited by no-opsin control animals. These n = 9 animals were used for subsequent analyses of target kinematics and learning specificity (Extended Data Fig. 10).

Quantifying effects on syllables near the target in time

To assess whether syllables temporally adjacent to the target were reinforced as a result of optogenetic stimulation, syllables were identified that, on average, were near to the target in time. Specifically, the average time between all non-targeted syllables and the target was computed, along with their change in counts above baseline. Then, syllables were binned by when they occurred on average relative to the target, in syllable units, using equally spaced bins from ten syllables before the target to ten syllables after. Finally, for each experiment, a weighted average of the change in counts above baseline for all syllables in each bin was computed, where a syllable’s weight was defined by its relative frequency in an experiment.

Quantifying effects on syllables whose velocity was similar to the target

To understand whether syllables with similar velocity profiles to the target were also reinforced, the average velocity from onset to offset for each syllable was computed and z-scored across instances within an experiment. Then, the average velocity of the target was subtracted from each syllable’s average velocity. Finally, the change in count above baseline for each syllable was binned by its target-velocity-difference.

Quantifying Opto-DA effects on movement parameters and sequence randomness

To quantify the effects of Opto-DA on movement parameters and sequence randomness over short timescales, sequence entropy, velocity (2D, angular and height velocity) and acceleration were estimated in five-syllable-long non-overlapping bins starting from stimulation onset. This window was chosen to minimize noise in downstream calculations while retaining reasonable time-resolution. To compensate for non-stationarities in behaviour across the experiments, mice, and targeted syllables, entropy, velocity and acceleration pre-stimulation-onset were subtracted from their values post-stimulation. Finally, these baseline-subtracted values were z-scored using the mean and s.d. estimated from catch trials.

Analysing the influence of dopamine on optogenetic reinforcement

Mice used to assess the influence of endogenous dopamine fluctuations on optogenetic reinforcement

As described above, eight mice injected with dLight and ChrimsonR were also run through closed-loop reinforcement experiments. The reinforcement experiment run with 250 ms, 10 mW continuous-wave stimulation enabled decoding analysis of how exogenous dopamine release altered usage of syllables during experiments in which ‘extra DA’ was added (Fig. 4g–i).

Predicting the amount of exogenously added dopamine during Opto-DA experiments using the decoding model

To predict the magnitude of exogenously evoked dLight fluorescence using the decoding model, dLight fluorescence on each instance in which the mouse expressed the target syllable and received stimulation was replaced with the average dLight fluorescence observed for the target syllable on catch trials during which there was no optogenetic stimulation. Then an offset (denoted as ‘extra DA’) was added to each syllable instance in which the mouse received stimulation. The likelihood of the syllable sequences expressed during Opto-DA experiments was computed for a range of extra DA offsets (and hence a range of exogenously added dopamine). The model was evaluated using the exact same procedure described in ‘Decoding model predicting behaviour from dLight’, except that the repeated K-fold splits (5-fold split repeated 100 times) were performed over stimulation experiments. The ‘extra DA’ outputs of the model were compared to empirical photometric data collected from animals expressing dLight that underwent ChrimsonR-mediated closed-loop reinforcement (Fig. 4i).

Using the influence of endogenous dopamine to predict Opto-DA reinforcement

To assess whether the impact of dopamine at baseline could predict Opto-DA reinforcement, we used the correlation between dopamine fluctuations and syllable statistics (usage and entropy) within an experiment. Specifically, we computed the correlation between dLight levels and usage as outlined in ‘Analysing the moment-to-moment relationship between dLight and syllable statistics within an experiment’ (Fig. 2e,k), except that correlations were assessed per mouse and per syllable. Values at each bin size were z-scored using the mean and s.d. of correlations computed over shuffled data. Here, n = 100 shuffles were used for the correlation with entropy for computational efficiency. To determine the modulation depth of these correlation curves for each mouse and syllable, we used the s.d. of the correlation values across bin sizes. This resulted in a value that reflected the short-term influence of dopamine on usage (Endo-DA count) and entropy (Endo-DA entropy) for all syllable–mouse pairs. Finally, these estimates were averaged per mouse for Fig. 4b,c, and per syllable for Fig. 4d. Then, the log2 fold change in target counts on stimulation days relative to baseline days was used as an estimate of Opto-DA learning. To mitigate mouse-to-mouse variability, the log2 fold change in target counts was normalized by computing the log2 fold change in target counts against all pairs of non-stimulation days per mouse. The mean and s.d. of this distribution were used to z-score Opto-DA learning per mouse.
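The two summary quantities described above, modulation depth and z-scored Opto-DA learning, can be sketched as follows. This is a minimal illustration, not the analysis code: the function names and toy values are hypothetical, the null distribution is built from unordered pairs of non-stimulation days, and the population s.d. is used throughout. Each of these choices is an assumption.

```python
import numpy as np
from itertools import combinations

def modulation_depth(corr, shuffled_corr):
    """s.d. across bin sizes of shuffle-z-scored correlations.

    corr          : (n_bins,) observed correlation at each bin size.
    shuffled_corr : (n_shuffles, n_bins) correlations from shuffled data.
    """
    corr = np.asarray(corr, dtype=float)
    shuffled_corr = np.asarray(shuffled_corr, dtype=float)
    z = (corr - shuffled_corr.mean(axis=0)) / shuffled_corr.std(axis=0)
    return z.std()

def zscored_log2_fold_change(stim_count, baseline_count, nonstim_counts):
    """log2 fold change in target counts, z-scored against day-to-day variation."""
    observed = np.log2(stim_count / baseline_count)
    # Null distribution: fold change between pairs of non-stimulation days
    null = np.array([np.log2(a / b) for a, b in combinations(nonstim_counts, 2)])
    return (observed - null.mean()) / null.std()

# Toy values, for illustration only
depth = modulation_depth(
    corr=[0.1, 0.3, 0.5],
    shuffled_corr=[[-0.1, -0.1, -0.1], [0.1, 0.1, 0.1]],
)
learning = zscored_log2_fold_change(
    stim_count=80, baseline_count=20, nonstim_counts=[20, 40, 20]
)
```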

Bayesian linear regression models were used in Fig. 4b,c. A normal prior was placed on the regression coefficients, and an exponential prior on the variance. Samples from the posterior were drawn via the No-U-Turn Sampler (NUTS) using NumPyro (n = 1,000 warmup samples, n = 2,000 samples)75. Performance was assessed using leave-two-out cross-validation. The linear regression model presented in Fig. 4f used a Huber regressor73. Performance of the Huber regressors was assessed using fivefold cross-validation repeated five times.

Applying RL models to open field behaviour

Reinforcement-only RL model

RL models have four key components: a reward signal, a state, a state-dependent set of available actions, and a policy (which governs how actions are chosen). Here, a simple Q-learning agent with a softmax policy was designed to model mouse behaviour in the open field as an RL process over endogenous dopamine levels44. The agent used endogenous dopamine (that is, syllable-associated dLight) as a reward signal, behavioural syllables as states, and transitions between behavioural syllables as actions. Given a syllable at time t + 1, the dLight peak occurring during the syllable at time t is considered the ‘reward’. The Q-table for the model was initialized with a uniform matrix with the diagonal set to 0, since by definition there are no self-transitions in our data. For every step of each simulation, given the currently expressed syllable (that is, the state), the model samples possible future syllables (actions) based on the behavioural policy and the expected dLight transient magnitude (expected reward, specified by the Q-table) associated with each syllable transition. The model then selects actions according to the softmax equation

$$p(a|s)=\frac{{e}^{{Q}_{s}(a)/\tau }}{\mathop{\sum }\limits_{b=1}^{n}{e}^{{Q}_{s}(b)/\tau }}$$

where τ is the temperature. The model is fed 30-min experiments of actual data, formatted as a sequence of states and syllable-associated dopamine. Given the current state, the model selects an action according to the softmax equation. To update the Q-table and simulate the effect of endogenous dopamine as reward, the syllable-associated dopamine is presented to the model as reward in a standard Q-learning equation. Specifically, the Q-table is then updated according to

$$Q({s}_{t},{a}_{t})\leftarrow Q({s}_{t},{a}_{t})+\alpha [{r}_{t+1}+\gamma \mathop{\max }\limits_{a}Q({s}_{t+1},a)-Q({s}_{t},{a}_{t})]$$

where Q is the Q-table that defines the probability of action a while in state s, α is the learning rate, r is the reward associated with action a and state s (the dLight peak value at the transition between syllable a and syllable s), and γ is the discount factor. Performance was assessed by taking the Pearson correlation between the model’s resulting Q-table at the end of the simulation and the empirical transition matrix observed in the experimental data. Here, each row of the empirical transition matrix and of the Q-table was separately z-scored before computing the Pearson correlation. Note that the learned Q-table is functionally equivalent to a transition matrix in this formulation. To avoid degradation in performance due to syllable sparsity, only the top 10 syllables were used.
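The softmax policy, the Q-learning update and the row-wise z-scored comparison described above can be sketched together as follows. This is a minimal illustration rather than the fitted model: the function names, the uniform initial value of 1, the parameter values and the three-syllable toy table are all assumptions. Because actions are transitions between syllables, the sketch also assumes that the next state equals the chosen action.

```python
import numpy as np

def softmax_policy(q_row, tau):
    """p(a|s) proportional to exp(Q_s(a) / tau)."""
    logits = np.asarray(q_row, dtype=float) / tau
    logits = logits - logits.max()   # numerical stability; result unchanged
    p = np.exp(logits)
    return p / p.sum()

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One standard Q-learning update, applied in place; returns the TD error."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

def rowwise_zscore(M):
    """z-score each row of a matrix separately."""
    M = np.asarray(M, dtype=float)
    return (M - M.mean(axis=1, keepdims=True)) / M.std(axis=1, keepdims=True)

def q_table_score(Q, transition_matrix):
    """Pearson correlation between row-z-scored Q-table and transition matrix."""
    zq = rowwise_zscore(Q).ravel()
    zt = rowwise_zscore(transition_matrix).ravel()
    return np.corrcoef(zq, zt)[0, 1]

# Toy example: uniform Q-table with no self-transitions (values are assumptions)
n_syllables = 3
Q = np.ones((n_syllables, n_syllables))
np.fill_diagonal(Q, 0.0)

p = softmax_policy(Q[0], tau=0.5)    # policy for the current syllable 0
# Syllable 0 -> syllable 1, with a dLight peak of 2.0 presented as reward
td = q_update(Q, s=0, a=1, r=2.0, s_next=1, alpha=0.1, gamma=0.9)
score = q_table_score(Q, Q)          # a table compared with itself
```

In practice the second argument of `q_table_score` would be the empirical transition matrix; the self-comparison here only confirms that the score is bounded at 1.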

Dynamic RL model

To account for the short-term effect of dopamine on sequence randomness, a dopamine-dependent term was added to the baseline model’s policy

$$p(a|s)=\frac{{e}^{{Q}_{s}(a)/\tau (t)}}{\mathop{\sum }\limits_{b=1}^{n}{e}^{{Q}_{s}(b)/\tau (t)}}$$

where temperature is now time-dependent and evolves according to,

$$\tau \left(t\right)=I\left(t\right)\exp \left(\frac{t-n}{{{\tau }}_{{\rm{decay}}}}\right)+\,{\tau }_{{\rm{baseline}}}$$

and,

$$I\left(t\right)=v\,\,\text{if}\,\,r\left(t\right)\ge \lambda $$

Here, τdecay corresponds to the time constant with which dopamine’s effect on temperature decays, τbaseline is the baseline temperature, v is the amount by which temperature is increased if r(t) reaches or exceeds the threshold λ, and n is the number of timesteps after the threshold has been crossed. Experiments were split into training and test datasets via twofold cross-validation, and the training set was used to fit all free parameters. To compare the dynamic model to the reinforcement-only model, v was set to 0, which turns off the temperature-varying component of the dynamic model. Note that we observe qualitatively similar results under an alternative formulation in which, rather than being fed 30-min sessions of actual data, the model freely selects actions and reward is randomly drawn from the dLight peaks associated with that action in actual data.
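The temperature equations above can be transcribed directly, which makes the reduction to the reinforcement-only model easy to check. This is a sketch under stated assumptions: the function names and parameter values are hypothetical, and I(t) is assumed to be 0 when r(t) is below the threshold, a case the equations do not specify.

```python
import numpy as np

def indicator(r_t, v, lam):
    """I(t) = v if r(t) >= lambda; assumed to be 0 otherwise."""
    return v if r_t >= lam else 0.0

def temperature(i_t, t, n, tau_decay, tau_baseline):
    """tau(t) = I(t) * exp((t - n) / tau_decay) + tau_baseline, as written."""
    return i_t * np.exp((t - n) / tau_decay) + tau_baseline

# Illustrative parameter values (assumptions)
tau_baseline, tau_decay, lam = 0.5, 3.0, 1.0

# Dynamic model: a suprathreshold transient raises the temperature
tau_dynamic = temperature(indicator(2.0, v=0.8, lam=lam), t=5, n=5,
                          tau_decay=tau_decay, tau_baseline=tau_baseline)

# Reinforcement-only model: v = 0 leaves the temperature at baseline
tau_static = temperature(indicator(2.0, v=0.0, lam=lam), t=5, n=2,
                         tau_decay=tau_decay, tau_baseline=tau_baseline)

# Subthreshold transient: temperature stays at baseline (by assumption)
tau_sub = temperature(indicator(0.3, v=0.8, lam=lam), t=5, n=2,
                      tau_decay=tau_decay, tau_baseline=tau_baseline)
```

The resulting value of τ(t) would replace the fixed τ in the softmax policy at each step, so a higher temperature flattens the distribution over next syllables.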

Reward-prediction error model variant

Models were fit using observed dopamine magnitude as either (1) the reward term (see above) or (2) the reward-prediction error term \([{r}_{t+1}+\gamma \mathop{\max }\limits_{a}Q({s}_{t+1},a)-Q({s}_{t},{a}_{t})]\). For each model type, a grid search was performed across values of α (learning rate), γ (discount factor, used in the reward model only), and temperature (randomness of the next action). Held-out log likelihood was computed for each fit and z-scored using the mean and variance of the held-out log likelihood from models fit to data shuffled between experiments (n = 10 shuffles). This comparison is only valid for our particular model formulation; there are alternative formulations for which dopamine acting as a reward-prediction error is consistent with our data.
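The difference between the two variants lies in where the observed dopamine value enters the update, as the following sketch shows. It is an illustration under stated assumptions: the function names and toy values are hypothetical, and the z-score uses the square root of the population variance of the shuffled fits.

```python
import numpy as np

def update_dopamine_as_reward(Q, s, a, da, s_next, alpha, gamma):
    """Variant 1: dopamine is the reward r in the standard Q-learning update."""
    Q[s, a] += alpha * (da + gamma * Q[s_next].max() - Q[s, a])

def update_dopamine_as_rpe(Q, s, a, da, alpha):
    """Variant 2: dopamine replaces the whole bracketed prediction-error term."""
    Q[s, a] += alpha * da

def zscore_against_shuffles(heldout_ll, shuffled_lls):
    """z-score a held-out log likelihood against fits to shuffled data."""
    shuffled_lls = np.asarray(shuffled_lls, dtype=float)
    return (heldout_ll - shuffled_lls.mean()) / np.sqrt(shuffled_lls.var())

# Toy Q-tables with no self-transitions (values are assumptions)
Q_reward = np.ones((3, 3))
np.fill_diagonal(Q_reward, 0.0)
Q_rpe = Q_reward.copy()

update_dopamine_as_reward(Q_reward, s=0, a=1, da=2.0, s_next=1,
                          alpha=0.1, gamma=0.9)
update_dopamine_as_rpe(Q_rpe, s=0, a=1, da=2.0, alpha=0.1)

z_ll = zscore_against_shuffles(-100.0, [-120.0, -110.0, -130.0])
```

Note that γ does not appear in the second update, which is why the grid search includes the discount factor for the reward model only.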

Statistics

All hypothesis tests were non-parametric. Effect sizes for Mann–Whitney U tests are presented as the common language effect size f. Correlations were established as significant by comparing to n = 1,000 shuffled correlations (referred to as the shuffle test throughout the manuscript). For shuffle tests in which the observed correlation exceeded all 1,000 shuffled correlations, the P-value is listed as P < 0.001 rather than P = 0. P-values were adjusted to account for multiple comparisons where appropriate using the Holm–Bonferroni step-down procedure. Sample sizes were not pre-determined but are consistent with sample sizes typically used in the field; for examples using similar techniques, see refs. 10,14. Blinding was not performed, but MoSeq-based analysis of behaviour was automated.
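The shuffle test and the multiple-comparison adjustment can be sketched as follows. This is a minimal illustration, not the code used for the reported statistics: the function names are hypothetical, and the shuffle test is assumed to be one-sided (the fraction of shuffled correlations at least as large as the observed one).

```python
import numpy as np

def shuffle_test_pvalue(observed, shuffled):
    """Fraction of shuffled correlations at least as large as the observed one."""
    shuffled = np.asarray(shuffled, dtype=float)
    return float(np.mean(shuffled >= observed))

def format_p(p, n_shuffles):
    """Report P < 1/n_shuffles instead of P = 0."""
    return f"P < {1 / n_shuffles:g}" if p == 0 else f"P = {p:g}"

def holm_bonferroni(pvals):
    """Holm step-down adjusted P values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        # Multiply the k-th smallest P value by (m - k), enforcing monotonicity
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

p_raw = shuffle_test_pvalue(0.9, [0.1, 0.2, 0.3])
label = format_p(p_raw, n_shuffles=1000)
p_adj = holm_bonferroni([0.01, 0.04, 0.03])
```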

Plotting

Box plots (here and throughout) obey standard conventions: edges represent the first and third quartiles, whereas whiskers extend to include the furthest data point within 1.5 interquartile ranges of either the first or third quartile.
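The box plot convention above corresponds to the following calculation. This is a sketch for illustration: the function name and data are hypothetical, and quartiles are computed with NumPy's default linear interpolation, which is an assumption about the plotting defaults.

```python
import numpy as np

def box_stats(x):
    """Return box edges (Q1, Q3) and whisker ends under the 1.5 x IQR rule."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    # Whiskers reach the furthest data point within 1.5 IQR of each box edge
    lower_whisker = x[x >= q1 - 1.5 * iqr].min()
    upper_whisker = x[x <= q3 + 1.5 * iqr].max()
    return q1, q3, lower_whisker, upper_whisker

q1, q3, lower_whisker, upper_whisker = box_stats([1, 2, 3, 4, 5, 100])
```

In this toy example the value 100 lies beyond the upper whisker and would be drawn as an outlier.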

Software packages

In addition to analysis-specific packages cited in the relevant sections above, the following packages were used for analysis: NumPy83, Python84, Seaborn85, Matplotlib86 and Python 3 (ref. 87).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.