xmdpy is an open source package that provides an xarray interface for molecular dynamics trajectory files to facilitate fast and easy analysis of large trajectories.
- Lazy loading trajectories
- Common analysis tools provided via the
.xmdaccessor on xarray objects - Parallel and memory-sensitive data manipulation with
dask - Powerful slicing/indexing of N-dimensional datasets
| Format | Description |
|---|---|
xyz |
Basic XYZ files; ignores information on comment line and anything beyond the 4th column |
xdatcar |
VASP XDATCAR trajectory files; compatible with both Direct and Cartesian coordinates; not compatible with trajectories with variable cell parameters (e.g., NPT simulations) |
Note
Other common formats to be implemented in the future include: extxyz, netcdf, hdf5, vaspout.h5 (VASP-specific format), lammpstrj, traj (binary ASE format), xtc, dcd
- From source:
git clone https://github.com/ldgibson/xmdpy.git
cd xmdpy
pip install .Warning
This package is in early stages of development and is subject to change without backward compatibility.
Standard XYZ files do not contain information about the cell parameters, so XYZ files that are loaded do not include a cell data variable.
>>> import xarray as xr
>>> import xmdpy
>>> traj = xr.open_dataset("trajectory.xyz", engine="xmdpy")
>>> print(traj)
<xarray.Dataset> Size: 5MB
Dimensions: (time: 1000, atom_id: 200, xyz_dim: 3)
Coordinates:
* time (time) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
* atom_id (atom_id) int64 2kB 0 1 2 3 4 5 6 ... 194 195 196 197 198 199
* atoms (atom_id) <U2 2kB 'Li' 'Li' 'Li' 'Li' ... 'Cl' 'Cl' 'Cl' 'Cl'
* xyz_dim (xyz_dim) <U1 12B 'x' 'y' 'z'
Data variables:
xyz (time, atom_id, xyz_dim) float64 5MB ...Cell parameters can be easily specified via the cell keyword argument, which accepts a scalar or various array shapes and nested list structures:
- Single cell length:
cell=L - Separate lengths of a single cell:
cell=[Lx, Ly, Lz] - Cell vectors for each dimension:
cell=[[Ax, Ay, Az], [Bx, By, Bz], [Cx, Cy, Cz]] - Variable cell vectors per frame:
cell=[
[[Ax, Ay, Az], [Bx, By, Bz], [Cx, Cy, Cz]],
[[Ax, Ay, Az], [Bx, By, Bz], [Cx, Cy, Cz]],
...,
[[Ax, Ay, Az], [Bx, By, Bz], [Cx, Cy, Cz]],
]When specified, xmdpy will attempt to reshape the input to represent cell vectors at every frame. If parameters for only a single cell are detected, the cell vectors will be broadcasted along the time index of the trajectory to minimize the memory footprint.
>>> traj = xr.open_dataset("trajectory.xyz", engine="xmdpy", cell=20.12)
>>> print(traj)
>>> traj
<xarray.Dataset> Size: 5MB
Dimensions: (time: 1000, atom_id: 200, xyz_dim: 3, cell_vector: 3)
Coordinates:
* time (time) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
* atom_id (atom_id) int64 2kB 0 1 2 3 4 5 6 ... 194 195 196 197 198 199
* atoms (atom_id) <U2 2kB 'Li' 'Li' 'Li' 'Li' ... 'Cl' 'Cl' 'Cl' 'Cl'
* xyz_dim (xyz_dim) <U1 12B 'x' 'y' 'z'
* cell_vector (cell_vector) <U1 12B 'A' 'B' 'C'
Data variables:
xyz (time, atom_id, xyz_dim) float64 5MB ...
cell (time, cell_vector, xyz_dim) float64 72kB ...Lazily compute all pairwise distances betwen two atom types.
First, load the trajectory and specify the chunk size along the time index, which loads the trajectory data as Dask arrays.
>>> traj = xr.open_dataset(
"trajectory.xyz",
engine="xmdpy",
cell=20.12,
chunks={"time": 100},
)
>>> print(traj)
<xarray.Dataset> Size: 5MB
Dimensions: (time: 1000, atom_id: 200, xyz_dim: 3, cell_vector: 3)
Coordinates:
* time (time) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
* atom_id (atom_id) int64 2kB 0 1 2 3 4 5 6 ... 194 195 196 197 198 199
* atoms (atom_id) <U2 2kB 'Li' 'Li' 'Li' 'Li' ... 'Cl' 'Cl' 'Cl' 'Cl'
* xyz_dim (xyz_dim) <U1 12B 'x' 'y' 'z'
* cell_vector (cell_vector) <U1 12B 'A' 'B' 'C'
Data variables:
xyz (time, atom_id, xyz_dim) float64 5MB dask.array<chunksize=(100, 200, 3), meta=np.ndarray>
cell (time, cell_vector, xyz_dim) float64 72kB dask.array<chunksize=(100, 3, 3), meta=np.ndarray>To compute the distances lazily, simply use the .xmd.get_distances() accessor method. The result will immediately return an xarray object with the distances stored in a Dask array.
>>> distances = traj.xmd.get_distances('Li', 'Cl')
>>> print(distances)
<xarray.DataArray (time: 1000, atom_id1: 100, atom_id2: 100)> Size: 80MB
dask.array<transpose, shape=(1000, 100, 100), dtype=float64, chunksize=(100, 100, 100), chunktype=numpy.ndarray>
Coordinates:
* time (time) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
* atom_id1 (atom_id1) int64 800B 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99
* atoms1 (atom_id1) <U2 800B 'Li' 'Li' 'Li' 'Li' ... 'Li' 'Li' 'Li' 'Li'
* atom_id2 (atom_id2) int64 800B 100 101 102 103 104 ... 195 196 197 198 199
* atoms2 (atom_id2) <U2 800B 'Cl' 'Cl' 'Cl' 'Cl' ... 'Cl' 'Cl' 'Cl' 'Cl'