NVIDIA NIM


NVIDIA NIM (NVIDIA Inference Microservices) is a set of microservices that accelerates the deployment of foundation models on any cloud or data center, optimizing AI infrastructure for efficiency while reducing hardware and operational costs.

NVIDIA NIM Dashboard

Explore the sample dashboard on the Dynatrace Playground.

Enable monitoring

Follow the Set up Dynatrace on Kubernetes guide to monitor your cluster.

Afterwards, add the following annotations to your NVIDIA NIM deployments, as shown in the sample manifest after the list:

  • metrics.dynatrace.com/scrape: "true"
  • metrics.dynatrace.com/port: "8000"
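
On a standard Kubernetes Deployment, the annotations belong on the pod template so the Dynatrace metric scraper can reach the NIM container's Prometheus endpoint. The following is a minimal sketch, not a production manifest: the deployment name, labels, and image are placeholders, and port 8000 assumes the default NIM HTTP port.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llm                      # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-llm
  template:
    metadata:
      labels:
        app: nim-llm
      annotations:
        metrics.dynatrace.com/scrape: "true"   # enable Dynatrace metric scraping
        metrics.dynatrace.com/port: "8000"     # port exposing the NIM metrics endpoint
    spec:
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama3-8b-instruct:latest   # example NIM image; use your own
          ports:
            - containerPort: 8000
```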

Spans

The following attributes are available for GenAI Spans.

| Attribute | Type | Description |
|---|---|---|
| gen_ai.completion.0.content | string | The full response received from the GenAI model. |
| gen_ai.completion.0.content_filter_results | string | The filter results of the response received from the GenAI model. |
| gen_ai.completion.0.finish_reason | string | The reason the GenAI model stopped producing tokens. |
| gen_ai.completion.0.role | string | The role used by the GenAI model. |
| gen_ai.openai.api_base | string | GenAI server address. |
| gen_ai.openai.api_version | string | GenAI API version. |
| gen_ai.openai.system_fingerprint | string | The fingerprint of the response generated by the GenAI model. |
| gen_ai.prompt.0.content | string | The full prompt sent to the GenAI model. |
| gen_ai.prompt.0.role | string | The role setting for the GenAI request. |
| gen_ai.prompt.prompt_filter_results | string | The filter results of the prompt sent to the GenAI model. |
| gen_ai.request.max_tokens | integer | The maximum number of tokens the model generates for a request. |
| gen_ai.request.model | string | The name of the GenAI model a request is being made to. |
| gen_ai.request.temperature | double | The temperature setting for the GenAI request. |
| gen_ai.request.top_p | double | The top_p sampling setting for the GenAI request. |
| gen_ai.response.model | string | The name of the model that generated the response. |
| gen_ai.system | string | The GenAI product as identified by the client or server instrumentation. |
| gen_ai.usage.completion_tokens | integer | The number of tokens used in the GenAI response (completion). |
| gen_ai.usage.prompt_tokens | integer | The number of tokens used in the GenAI input (prompt). |
| llm.request.type | string | The type of the operation being performed. |
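
Once spans are ingested, these attributes can be queried like any other span attribute. The following DQL query is a sketch, assuming span data is available in Grail in your environment; the start_time field is an assumption beyond the attribute table above. It lists recent model calls with their token usage:

```
fetch spans
| filter isNotNull(gen_ai.request.model)
| fields start_time, gen_ai.request.model, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens
| sort start_time desc
| limit 20
```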

Metrics

The following metrics are available:

| Metric | Type | Unit | Description |
|---|---|---|---|
| e2e_request_latency_seconds | histogram | s | Histogram of end-to-end request latency in seconds |
| generation_tokens_total | counter | integer | Number of generation tokens processed |
| gpu_cache_usage_perc | gauge | integer | GPU KV-cache usage; 1 means 100 percent usage |
| num_request_max | counter | integer | Maximum number of concurrently running requests |
| num_requests_running | counter | integer | Number of requests currently running on GPU |
| num_requests_waiting | counter | integer | Number of requests waiting to be processed |
| prompt_tokens_total | counter | integer | Number of prefill tokens processed |
| request_failure_total | counter | integer | Number of failed requests; requests with any other finish reason are counted |
| request_finish_total | counter | integer | Number of finished requests, with a label indicating the finish reason |
| request_generation_tokens | histogram | integer | Histogram of the number of generation tokens processed |
| request_prompt_tokens | histogram | integer | Histogram of the number of prefill tokens processed |
| request_success_total | counter | integer | Number of successful requests; requests with finish reason "stop" or "length" are counted |
| time_per_output_token_seconds | histogram | s | Histogram of time per output token in seconds |
| time_to_first_token_seconds | histogram | s | Histogram of time to first token in seconds |
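
As an illustration, the DQL timeseries sketch below charts token throughput from the two token counters. Treat it as a starting point rather than a definitive query; the exact metric keys can carry a prefix depending on how scraped Prometheus metrics are ingested in your environment.

```
timeseries prompt = sum(prompt_tokens_total, rate: 1m),
           generated = sum(generation_tokens_total, rate: 1m)
```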

Additionally, the following metrics are reported.

| Metric | Type | Unit | Description |
|---|---|---|---|
| gen_ai.client.generation.choices | counter | none | The number of choices returned by a chat completions call. |
| gen_ai.client.operation.duration | histogram | s | The GenAI operation duration. |
| gen_ai.client.token.usage | histogram | none | The number of input and output tokens used. |
| llm.openai.embeddings.vector_size | counter | none | The size of the returned vector. |
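
These client-side metrics can be charted the same way. For example, a sketch of the average operation duration; whether histogram metrics support this aggregation may vary by environment:

```
timeseries duration = avg(gen_ai.client.operation.duration)
```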