November 29

Linux: Ceph Cheatsheet

Summary of some ops-oriented Ceph commands (using Jewel, might or might not work with others)

Monitoring and Health  ·  Working with Pools and OSDs  ·  Working with Placement Groups  ·  Interact with individual daemons  ·  Authentication and Authorization  ·  Object Store Utility  ·  RBD Block Storage  ·  Runtime Configurables

Monitoring and Health

ceph -s
 Status summary
# ceph -s
cluster 01234567-89ab-cdef-ghij-klmnopqrstuv
health HEALTH_OK
monmap e1: 3 mons at…

ceph -w
 Watch ongoing status
# ceph -w
… status omitted
2016-07-… mon.0 [INF] pgmap v34890186: 1992 pgs: …

ceph --watch-warn
 Watch ongoing status, only WARN messages

ceph df
 Disk usage overview, global and per pool
# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    8376G     8373G     2508M        0.03
POOLS:
    NAME       ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd        0      0        0         2782G         0
    glance     1      795M     0.03      2782G         114
    cinder     2      0        0         2782G         1

ceph health detail
 Details about health issues
# ceph health detail
HEALTH_WARN 1 pgs degraded; 3 pgs stuck unclean; recovery 23/20714847 objects degraded …
pg 5.8a is stuck unclean for 129.978642, current state active+remapped, last acting [35,19,24]
osd.388 is near full at 85%

ceph osd df tree
 Displays disk usage linked to the CRUSH tree, including weights and variance (non-uniform usage)
# ceph osd df tree
ID  WEIGHT   REWEIGHT  SIZE   USE    AVAIL  %USE  VAR   PGS  TYPE NAME
-1  8.15340  -         8376G  2508M  8373G  0.03  1.00  0    root default
-2  2.72670  -         2792G  836M   2791G  0.03  1.00  0        host foo
 0  2.72670  1.00000   2792G  836M   2791G  0.03  1.00  80           osd.0
-3  2.70000  -         2792G  836M   2791G  0.03  1.00  0        host bar
 1  2.70000  1.00000   2792G  836M   2791G  0.03  1.00  80           osd.1
-4  2.72670  -         2792G  835M   2791G  0.03  1.00  0        host quux
 2  2.72670  1.00000   2792G  835M   2791G  0.03  1.00  80           osd.2
             TOTAL     8376G  2508M  8373G  0.03
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
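
The %RAW USED column of ceph df is raw used space as a percentage of total raw capacity. As a quick sanity check, the sketch below recomputes it from the GLOBAL figures in the ceph df example (8376G total, 2508M used). The function name is made up for illustration, and the sketch assumes the G and M suffixes are binary units (1G = 1024M).

```python
def raw_used_percent(size_gib: float, raw_used_mib: float) -> float:
    """Return raw used space as a percentage of total raw capacity."""
    size_mib = size_gib * 1024  # assumption: 1G = 1024M
    return raw_used_mib / size_mib * 100


# Figures taken from the GLOBAL section of the ceph df example.
percent = raw_used_percent(size_gib=8376, raw_used_mib=2508)
print(f"{percent:.2f}")  # prints 0.03
```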

Working with Pools and OSDs

Subcommands of “ceph osd”.

ceph osd blocked-by
 Print histogram of which OSDs are blocking their peers
# ceph osd blocked-by
…

ceph osd crush reweight {id} {wght}
 Permanently set the CRUSH weight, overriding the system-assigned value. Weights typically correspond to the size of the disk in TB and are displayed eg. by “ceph osd tree”. Also cf. “ceph osd reweight” for temporary weight changes
# ceph osd crush reweight osd.1 2.7

ceph osd deep-scrub {osd}
 Instruct an OSD to perform a deep scrub (consistency check) on {osd}. Careful, deep scrubs are I/O intensive as they actually read all data on the OSD, and might therefore impact clients. Cf. the non-deep variant “ceph osd scrub” below
# ceph osd deep-scrub osd.2
osd.2 instructed to deep-scrub

ceph osd find {num}
 Display location of a given OSD (hostname, port, CRUSH details)
# ceph osd find 2
{
    "osd": 2,
    "ip": "\/50598",
    "crush_location": {
        "host": "foohost",
        "root": "default"
    }
}

ceph osd map {pool} {obj}
 Locate an object in a pool. Displays the placement group and the primary / replica OSDs for the object
# ceph osd map glance rbd_object_map.8e122e5a5dcd
osdmap e36 pool 'glance' (1) object 'rbd_object_map.8e122e5a5dcd' -> pg 1.19ac25a5 (1.5) -> up ([1,2,0], p1) acting ([1,2,0], p1)

ceph osd metadata {id}
 Display OSD metadata (host and host info)
# ceph osd metadata 1
{
    "id": 1,
    …
    "front_addr": "\/53448",
    …
    "hostname": "surskit",
    …
    "osd_data": "\/var\/lib\/ceph\/osd\/ceph-1",
    "osd_journal": "\/var\/lib\/ceph\/osd\/ceph-1\/journal",
    …
}

ceph osd out {num}
 Take an OSD out of the cluster, rebalancing its data to other OSDs. The inverse is “ceph osd in {num}”
# ceph osd out 123

ceph osd pool create {name} {num}
 Create a new replicated pool with {num} placement groups. Use eg. num=128 for a small cluster. See the Ceph docs for details on calculating placement group counts and on further options, eg. erasure code pools, custom rulesets, etc.
# ceph osd pool create test 128
pool 'test' created

ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
 Delete a pool. You have to give the pool name twice, followed by the confirmation flag. Deleted pools and their data are gone for good, so take care!
# ceph osd pool delete test test --yes-i-really-really-mean-it
pool 'test' removed

ceph osd pool get {name} all
 Get all parameters for a pool. Instead of ‘all’ you can also specify a single parameter name. Also cf. “ceph osd pool set {x}”
# ceph osd pool get test all
size: 3
min_size: 2
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_ruleset: 0
…

ceph osd pool ls [detail]
 List pools and optionally some details
# ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 …
pool 1 'glance' …

ceph osd pool set {name} {param} {val}
 Set a pool parameter, eg. size|min_size|pg_num
# ceph osd pool set cinder min_size 1
set pool 2 min_size to 1

ceph osd reweight {num} {wght}
 Temporarily override the weight (default 1) for this OSD. Also cf. “ceph osd crush reweight” above
# ceph osd reweight 123 0.8 # use 80% of default space

ceph osd reweight-by-utilization {percent}
 Ceph tries to balance disk usage evenly, but this does not always work that well: variations of +/-15% are not uncommon. reweight-by-utilization automatically reweights disks according to their utilization. The {percent} value is a threshold: OSDs that are not perfectly balanced (with perfect being defined as 100%) but fall below the {percent} threshold will not be reweighted. The {percent} value defaults to 120. Note that, as with all reweighting, this can kick off a lot of data shuffling, potentially impacting clients. See “Runtime Configurables” below for notes on how to minimize the impact. Also note that reweight-by-utilization doesn’t work well with low overall utilization. See the “test-reweight-by-utilization” subcommand, which is a dry-run version of this.
# ceph osd reweight-by-utilization 105
no change
moved 0 / 624 (0%)
avg 208
stddev 0 -> 0 (expected baseline 11.7757)
min osd.0 with 208 -> 208 pgs (1 -> 1 * mean)
…

ceph osd scrub {osd}
 Initiate a “regular”, non-deep scrub on {osd}. Also cf. “ceph osd deep-scrub” above
# ceph osd scrub osd.1
osd.1 instructed to scrub

ceph osd test-reweight-by-utilization {percent}
 This is a dry run of the reweight-by-utilization subcommand described above. It behaves the same way but does not actually initiate reweighting
# ceph osd test-reweight-by-utilization 101
no change
moved 0 / 624 (0%)
…

ceph osd set {flag}
 Set various flags on the OSD subsystem. Some useful flags:
  nodown: prevent OSDs from getting marked down
  noout: prevent OSDs from getting marked out (will inhibit rebalance)
  noin: prevent booting OSDs from getting marked in
  noscrub, nodeep-scrub: prevent the respective scrub type (regular or deep)
 More: full, pause, noup, nobackfill, norebalance, norecover. The inverse subcommand is “ceph osd unset {flag}”
# ceph osd set nodown
set nodown

ceph osd tree
 Lists hosts, their OSDs, up/down status, their weight and local reweight
# ceph osd tree
ID  WEIGHT   TYPE NAME       UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-1  8.18010  root default
-2  2.72670      host alpha
 0  2.72670          osd.0   up       1.00000   1.00000
…
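
To make the {percent} threshold of reweight-by-utilization more concrete, here is a rough sketch of how I read the description above: an OSD only becomes a candidate once its utilization exceeds {percent} % of the cluster average. This is a simplification for illustration, not Ceph's actual implementation (which also computes the new weights); the function name and the utilization figures are hypothetical.

```python
def overloaded_osds(utilization: dict[str, float], percent: float = 120) -> list[str]:
    """Return OSDs whose utilization exceeds `percent` % of the cluster average."""
    average = sum(utilization.values()) / len(utilization)
    threshold = average * percent / 100
    return sorted(osd for osd, used in utilization.items() if used > threshold)


# Hypothetical per-OSD utilization in percent (not taken from a real cluster).
usage = {"osd.0": 50.0, "osd.1": 60.0, "osd.2": 70.0}
print(overloaded_osds(usage))       # default threshold of 120: prints []
print(overloaded_osds(usage, 105))  # stricter threshold: prints ['osd.2']
```

With an average of 60% the default threshold sits at 72%, so nothing qualifies; lowering {percent} to 105 moves the threshold to 63% and osd.2 becomes a candidate.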

More examples

Query OSDs

 Many ceph commands can output json, which lends itself well to filtering through jq (requires the jq utility). For example, query ceph osd tree for OSDs that have been reweighted from the default (ie. 1):

# ceph osd tree -f json-pretty | jq '.nodes[]|select(.type=="osd")|select(.reweight != 1)|.id'
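
If jq is not at hand, the same filter can be written in a few lines of Python. The sketch below applies it to a hypothetical, heavily trimmed sample of the json output; only the “nodes”, “type”, “reweight” and “id” fields used by the jq filter above are assumed, and the function name is made up.

```python
import json

# Hypothetical, heavily trimmed output of "ceph osd tree -f json".
SAMPLE = """
{"nodes": [
  {"id": -1, "name": "default", "type": "root"},
  {"id": -2, "name": "alpha", "type": "host"},
  {"id": 0, "name": "osd.0", "type": "osd", "reweight": 1.0},
  {"id": 1, "name": "osd.1", "type": "osd", "reweight": 0.8}
]}
"""


def reweighted_osds(tree: dict) -> list[int]:
    """Return ids of OSD nodes whose reweight differs from the default of 1."""
    return [
        node["id"]
        for node in tree["nodes"]
        if node.get("type") == "osd" and node.get("reweight") != 1
    ]


print(reweighted_osds(json.loads(SAMPLE)))  # prints [1]
```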

Working with Placement Groups

Subcommands of “ceph pg”.

ceph pg {pg-id} query
 Query statistics and other metadata about a placement group. Often valuable info for troubleshooting, e.g. state of replicas, past events, etc.
# ceph pg 3.0 query
{
    "state": "active+clean",
    "epoch": 31,
    "up": [ 1, 2, 0 ],
    …
    "info": {
        "pgid": "3.0",
        "last_update": "29'188",
        …
        "history": {
        …

ceph pg {pg-id} list_missing
 If primary OSDs go down before writes are fully distributed, Ceph might miss some data (but knows it’s missing some). Ceph refers to those as “missing” or “unfound” objects. In those cases it will block writes to the respective objects, in the hope that the primary will come back eventually. The list_missing command lists those objects. You can find out more with “ceph pg {pg-id} query” about which OSDs were considered and what their state is. Note the “more” field: if it’s true, there are more objects which are not listed yet.
# ceph pg 3.0 list_missing
{
    …
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": 0
}

ceph pg {pg-id} mark_unfound_lost revert|delete
 See above under “list_missing” for missing/unfound objects. This command tells Ceph to either delete those objects or revert them to their previous versions. Note that it’s up to you to deal with potential data loss.
# ceph pg 3.0 mark_unfound_lost revert
pg has no unfound objects

ceph pg dump [--format {format}]
 Dump statistics and metadata for all placement groups. Outputs info about scrubs, last replication, current OSDs, blocking OSDs, etc. Format can be plain or json. Depending on the number of placement groups, the output can get large. The json output lends itself well to filtering/mangling, eg. with jq. The example uses this to extract a list of pg ids and timestamps of the last deep scrub
# ceph pg dump --format json | jq -r '.pg_stats[] | [.pgid, .last_deep_scrub_stamp ] | @csv' | head
dumped all in format json
"0.22","2016-11-04 00:31:24.035419"
"0.21","2016-11-04 00:31:24.035411"
"0.20","2016-11-04 00:31:24.035410"
…

ceph pg dump_stuck inactive | unclean | stale | undersized | degraded [--format <format>] [-t|--threshold <seconds>]
 Dump stuck placement groups, if any. Format can be plain or json. Threshold is the cutoff after which a pg is returned as stuck, with a default of 300s.
  inactive: the pg cannot be read from or written to (possibly a peering problem)
  unclean: the pg has not been able to complete recovery
  stale: the pg has not been updated by an OSD for an extended period of time; possibly all nodes which store that pg are down, overloaded or unreachable. The output will indicate which OSDs were last seen with that pg (“last acting”)
  undersized: the pg has fewer copies than the configured pool replication level
  degraded: Ceph has not yet replicated some objects in the pg the correct number of times
 The output will contain the placement group names, their full state and which OSDs hold (or held) the data. Also see the Ceph docs for details on pg states.
# ceph pg dump_stuck unclean
ok
pg_stat  state                       up     up_primary  acting  acting_primary
0.28     active+undersized+degraded  [1,2]  1           [1,2]   1
0.27     active+undersized+degraded  [1,2]  1           [1,2]   1
0.26     active+undersized+degraded  [1,2]  1           [1,2]   1

ceph pg scrub {pg-id}, ceph pg deep-scrub {pg-id}
 Initiate a (deep) scrub on the placement group’s contents. This enables very fine-tuned control over what gets scrubbed when (especially useful for the resource-hungry deep scrub).
# ceph pg deep-scrub 3.0
instructing pg 3.0 on osd.1 to deep-scrub

ceph pg repair {pg-id}
 A placement group in state “inconsistent” indicates that scrubbing found a possible error. “repair” instructs Ceph to, well, repair that pg
# ceph pg repair 3.0
instructing pg 3.0 on osd.1 to repair
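
Building on the “ceph pg dump” example, the following sketch picks the placement groups whose last deep scrub is oldest, eg. as candidates for a manual “ceph pg deep-scrub”. It runs on a hypothetical, trimmed sample and assumes the same layout as the jq filter above (a top-level “pg_stats” list with “pgid” and “last_deep_scrub_stamp”); other releases may structure the dump differently. The function name is made up.

```python
from datetime import datetime

# Hypothetical, trimmed "ceph pg dump --format json" output; the timestamps
# use the same format as the jq example.
SAMPLE = {
    "pg_stats": [
        {"pgid": "0.22", "last_deep_scrub_stamp": "2016-11-04 00:31:24.035419"},
        {"pgid": "0.21", "last_deep_scrub_stamp": "2016-11-01 12:00:00.000000"},
        {"pgid": "0.20", "last_deep_scrub_stamp": "2016-11-04 00:31:24.035410"},
    ]
}


def oldest_deep_scrubs(dump: dict, count: int = 2) -> list[str]:
    """Return the ids of the `count` pgs with the oldest deep scrub."""
    def stamp(pg: dict) -> datetime:
        return datetime.strptime(pg["last_deep_scrub_stamp"], "%Y-%m-%d %H:%M:%S.%f")

    return [pg["pgid"] for pg in sorted(dump["pg_stats"], key=stamp)[:count]]


print(oldest_deep_scrubs(SAMPLE))  # prints ['0.21', '0.20']
```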

Interact with individual daemons

Subcommands of “ceph daemon <daemonname>”. The “ceph daemon” commands interact with individual daemons on the current host. Typically this is used for low-level investigation and troubleshooting. The target daemon can be specified via name, eg. “osd.1”, or as a path to a socket, eg. “/var/run/ceph/ceph-osd.0.asok”

ceph daemon {osd} dump_ops_in_flight
 Dump a json list of currently active operations for an OSD. Useful if one or more ops are stuck
# ceph daemon osd.1 dump_ops_in_flight
{
    "ops": [
        {
            "description": "osd_op(xyz 8.a .dir.someuuid [call rgw.bucket_prepare_op] …"
            …

ceph daemon {daemon} help
 Print a list of commands the daemon supports
# ceph daemon osd.1 help
{
    "config diff": "dump diff of current config and default config",
    "config get": "config get <field>: get the config value",
    …

ceph daemon {mon} mon_status
 Print high level status info for this MON
# ceph daemon mon.foohost mon_status
{
    "name": "foohost",
    "rank": 0,
    "state": "leader",
    "election_epoch": 10,
    "quorum": [ 0, 1, 2 …]}

ceph daemon {osd} status
 Print high level status info for this OSD
# ceph daemon osd.2 status
{
    …
    "state": "active",
    …
    "num_pgs": 220
}

ceph daemon {osd|mon|radosgw} perf dump
 Print performance statistics. See the Ceph docs for details on the meaning of the counters
# ceph daemon client.radosgw.gateway perf dump
{
    "cct": {
        "total_workers": 32,
        "unhealthy_workers": 0
    },
    "client.radosgw.gateway": {
        "req": 156875425,…
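
The “perf dump” output is nested json (sections containing counters), which is awkward to grep or to feed into a metrics system. A small sketch of one way to flatten it is shown below; the sample is trimmed from the radosgw example above, and the function name and the choice of “/” as a separator when printing are my own assumptions.

```python
# Trimmed perf dump, based on the radosgw example.
SAMPLE = {
    "cct": {"total_workers": 32, "unhealthy_workers": 0},
    "client.radosgw.gateway": {"req": 156875425},
}


def flatten(counters: dict, prefix: tuple = ()) -> dict:
    """Flatten nested counter sections into {(section, ..., name): value}."""
    flat = {}
    for name, value in counters.items():
        key = prefix + (name,)
        if isinstance(value, dict):
            flat.update(flatten(value, key))
        else:
            flat[key] = value
    return flat


for key, value in sorted(flatten(SAMPLE).items()):
    print("/".join(key), value)
```

Keys are kept as tuples rather than dotted strings because section names such as “client.radosgw.gateway” already contain dots.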

Authentication and Authorization

Very briefly about users (typically non-human) and perms. Commands listed here are subcommands of “ceph auth”. I don’t have examples of keyring management here – please check docs on keyring management when adding or deleting users.

ceph auth list
 List users
# ceph auth list
client.nova-compute
    key: Axxxx==
    caps: [mon] allow rw
    caps: [osd] allow rwx
client.radosgw.gateway
    key: Ayyyyy==
    caps: [mon] allow rw
    caps: [osd] allow rwx

ceph auth get-or-create
 Get user details, or create the user if it doesn’t exist yet and return the details (you will likely need to distribute keys).
# ceph auth get-or-create client.alice mon 'allow r' osd 'allow rw pool=data'
key = Axxxxxxxxxxx==

ceph auth del
 Delete a user (you will likely also need to remove key material)
# ceph auth del {user}
updated

ceph auth caps
 Add or remove permissions for a user. Permissions are grouped per daemon type (eg. mon, osd, mds). Capabilities can be ‘r’, ‘w’, ‘x’ or ‘*’. For OSDs, capabilities can also be restricted per pool (note that if no pool is specified, the caps apply to all pools!). For more details refer to the docs. The example makes bob an admin.
# ceph auth caps client.bob mon 'allow *' osd 'allow *' mds 'allow *'
updated caps for client.bob
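
Since OSD caps without a pool apply to all pools, it can be worth wrapping user creation in a small helper that insists on a pool. The sketch below only builds the argument list for the “ceph auth get-or-create” call shown above and does not run anything; the helper name and its defaults (read-only mon caps, “rw” on the OSD side) are assumptions for illustration.

```python
def get_or_create_args(user: str, pool: str, osd_perms: str = "rw") -> list[str]:
    """Build the argument list for a pool-restricted "ceph auth get-or-create"."""
    if not pool:
        # Without a pool restriction the OSD caps would apply to all pools.
        raise ValueError("refusing to build caps without a pool")
    return [
        "ceph", "auth", "get-or-create", user,
        "mon", "allow r",
        "osd", f"allow {osd_perms} pool={pool}",
    ]


print(get_or_create_args("client.alice", "data"))
```

The resulting list can be handed to subprocess.run() as is, which avoids shell quoting issues with the caps strings.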

Object Store Utility

Rados object storage utility invocations

rados -p {pool} put {obj} {file}
 Upload a file into a pool and name the resulting object {obj}. Give ‘-’ as the file name to read from stdin
# rados -p test put foo foo.dat

rados -p {pool} ls
 List objects in a pool
# rados -p test ls
bar
foo

rados -p {pool} get {obj} {file}
 Download an object from a pool into a local file. Give ‘-’ as the file name to write to stdout
# rados -p test get foo -
hello world

rados -p {pool} rm {obj}
 Delete an object from a pool
# rados -p test rm foo

rados -p {pool} listwatchers {obj}
 List watchers of an object in a pool. For instance, the head object of a mapped rbd volume has its clients as watchers
# rados -p rbd listwatchers myvol.rbd
watcher= client.173295 cookie=1

rados -p {pool} bench {seconds} {mode} [ -b objsize ] [ -t threads ]
 Run the built-in benchmark for the given length in seconds. Mode can be write, seq, or rand (the latter two are read benchmarks). Before running one of the read benchmarks, run a write benchmark with the --no-cleanup option. The default object size is 4 MB, and the default number of simulated threads (parallel writes) is 16.
# rados bench -t 32 -p bench 1800 write --no-cleanup
Total time run:         1800.641454
Total writes made:      324623
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     721.127
…
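
The bandwidth figure in the rados bench summary can be cross-checked from the other fields: total writes times object size, divided by run time. The sketch below redoes that arithmetic with the numbers from the example output; the function name is made up, and 1 MB is assumed to mean 1024 * 1024 bytes (which is what reproduces the reported value).

```python
def bench_bandwidth_mb_s(writes: int, object_size: int, seconds: float) -> float:
    """Average write bandwidth in MB/s, with 1 MB taken as 1024 * 1024 bytes."""
    return writes * object_size / (1024 * 1024) / seconds


# Figures from the rados bench example output.
bandwidth = bench_bandwidth_mb_s(writes=324623, object_size=4194304, seconds=1800.641454)
print(f"{bandwidth:.3f}")  # prints 721.127
```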

RBD Block Storage

RBD block storage utility invocations

rbd create {volspec} --size {mb}
 Create a volume. Volspec can be of the form pool/volname, or given as “-p pool volname”. Pool defaults to “rbd”
# rbd create test/myimage --size 1024

rbd map [--read-only] {vol-or-snap}
 Map a volume or snapshot to a block device on the local machine
# rbd map test/myimage
/dev/rbd0

rbd showmapped
 Show mapped volumes / snapshots
# rbd showmapped
id  pool  image  snap  device
0   rbd   myvol  -     /dev/rbd0

rbd status {volspec}
 Show mapping status of a given volume
# rbd status myvol
Watchers:
    watcher= client.173424 cookie=1

rbd info {volspec}
 Print some metadata about a given volume.
# rbd info myvol
rbd image 'myvol':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rb.0.2a4e6.74b0dc51
    format: 1

rbd export {volspec} {destfile}
 Export image to a local file
# rbd export cinder/volid localvol.img
Exporting image: 100% complete…done.

rbd unmap {dev}
 Unmap a mapped rbd device
# rbd unmap /dev/rbd0

rbd rm {volspec}
 Delete a volume
# rbd rm myvol
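
The “order” and object count in the rbd info output are related by simple arithmetic: an object holds 2^order bytes, and the image is split into as many objects as needed to cover its size. The sketch below reproduces the numbers from the example (size 1024 MB, order 22); the function name is made up and MB is assumed to mean 1024 * 1024 bytes.

```python
import math


def rbd_object_layout(size_mb: int, order: int) -> tuple[int, int]:
    """Return (object size in kB, number of objects) for an image."""
    object_bytes = 2 ** order
    size_bytes = size_mb * 1024 * 1024
    return object_bytes // 1024, math.ceil(size_bytes / object_bytes)


# Values from the rbd info example: size 1024 MB, order 22.
print(rbd_object_layout(1024, 22))  # prints (4096, 256)
```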

More examples

Specify user for client

 This works similarly to other clients. You need a user keyring, eg. as /etc/ceph/ceph.client.nova-compute.keyring, and need to specify that user for the client:

# rbd --id nova-compute -p cinder-ceph export volume-xxx-yyy-zzz vol.img

Runtime Configurables

Here are some global configurables that I found useful to tweak. Many can be updated while Ceph daemons are running, but not all. Ceph is a complex system with a lot of knobs, so naturally there are many more configuration options than listed here. Note on formatting: the parts of a Ceph config name can be separated by spaces, dashes or underscores; Ceph doesn’t care. Eg. “osd-max-backfills”, “osd max backfills” and “osd_max_backfills” are equivalent.
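
When comparing settings from different sources (ceph.conf, injectargs, show-config output), it helps to normalize option names first. A minimal sketch, assuming the underscore form as the canonical spelling (that choice and the function name are mine, not Ceph’s):

```python
def normalize_option(name: str) -> str:
    """Map spaces and dashes in a config option name to underscores."""
    return name.strip().replace("-", "_").replace(" ", "_")


names = ["osd-max-backfills", "osd max backfills", "osd_max_backfills"]
print({normalize_option(name) for name in names})  # prints {'osd_max_backfills'}
```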

You can query current config of a daemon by issuing “show-config”, eg. ceph -n osd.123 --show-config

Setting, purpose, example setting:

osd-max-backfills
 Limit the maximum number of simultaneous backfill operations (a kind of recovery operation). Immensely useful for traffic shaping! Set to 1 to minimize client impact
# ceph tell osd.* injectargs '--osd-max-backfills 1'

osd-recovery-max-active
 Max. active recovery operations per OSD. It can be useful to tune this down (eg. to 1) to lessen the load on a busy cluster and avoid impacting clients
# ceph tell osd.* injectargs '--osd-recovery-max-active 1'

osd-recovery-op-priority
 Relative priority of recovery operations, default 10. The value is relative to osd-client-op-priority, which defaults to 63. Lower it to lessen the impact of recovery operations
# ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

filestore-max-sync-interval, filestore-min-sync-interval
 Max. and min. interval in which the filestore commits data. Defaults are 5s (max) and 0.01s (min). With small objects it might be more efficient to increase the interval. On the other hand, if objects are large on average, increasing the interval could create undesirable load spikes
# ceph tell osd.* injectargs '--filestore-max-sync-interval 20'

mon-pg-warn-max-per-osd
 Ceph warns if it thinks the placement-group-to-OSD ratio is outside sane boundaries. The thresholds are geared towards efficient operation of prod clusters; on test or dev clusters you may not care too much about those
# ceph tell mon.* injectargs '--mon-pg-warn-max-per-osd 4096'

mon-clock-drift-allowed
 Ceph relies on accurate clocks. It’ll warn when monitor hosts detect a relative time difference of >50ms among them. Use of NTP or similar on the monitors is highly recommended. If you want to relax this check (eg. for mons in KVMs for test or dev clouds where data safety is not a priority), set mon-clock-drift-allowed (in seconds)
# ceph tell mon.* injectargs '--mon-clock-drift-allowed 0.5'

By: S Gmbh

Copyright 2021. All rights reserved.

Posted November 29, 2021 by Timothy Conrad in category "Linux"
