Scale LLM fine-tuning with Hugging Face and Amazon SageMaker AI

Enterprises are increasingly shifting from relying solely on large, general-purpose language models to developing specialized large language models (LLMs) fine-tuned on their own proprietary data. Although foundation models (FMs) offer impressive general capabilities, they often fall short when applied to the complexities of enterprise environments—where accuracy, security, compliance, and domain-specific knowledge are non-negotiable.

To meet these demands, organizations are adopting cost-efficient models tailored to their internal data and workflows. By fine-tuning on proprietary documents and domain-specific terminology, enterprises are building models that understand their unique context—resulting in more relevant outputs, tighter data governance, and simpler deployment across internal tools.

This shift is also a strategic move to reduce operational costs, improve inference latency, and maintain greater control over data privacy. As a result, enterprises are redefining their AI strategy around customized, right-sized models aligned to their business needs.

Scaling LLM fine-tuning for enterprise use cases presents real technical and operational hurdles, which are being overcome through the powerful partnership between Hugging Face and Amazon SageMaker AI.

Many organizations face fragmented toolchains and rising complexity when adopting advanced fine-tuning techniques like Low-Rank Adaptation (LoRA), QLoRA, and Reinforcement Learning from Human Feedback (RLHF). Additionally, the resource demands of large model training—including memory limitations and distributed infrastructure challenges—often slow down innovation and strain internal teams.

To overcome this, SageMaker AI and Hugging Face have joined forces to simplify and scale model customization. By integrating the Hugging Face Transformers libraries into SageMaker’s fully managed infrastructure, enterprises can now:

  • Run distributed fine-tuning jobs out of the box, with built-in support for parameter-efficient tuning methods
  • Use optimized compute and storage configurations that reduce training costs and improve GPU utilization
  • Accelerate time to value by using familiar open source libraries in a production-grade environment

This collaboration helps businesses focus on building domain-specific, right-sized LLMs, unlocking AI value faster while maintaining full control over their data and models.

In this post, we show how this integrated approach transforms enterprise LLM fine-tuning from a complex, resource-intensive challenge into a streamlined, scalable solution for achieving better model performance in domain-specific applications. We use the meta-llama/Llama-3.1-8B model, and execute a Supervised Fine-Tuning (SFT) job to improve the model’s reasoning capabilities on the MedReason dataset by using distributed training and optimization techniques, such as Fully-Sharded Data Parallel (FSDP) and LoRA with the Hugging Face Transformers library, executed with Amazon SageMaker Training Jobs.

Understanding the core concepts

The Hugging Face Transformers library is an open source toolkit for training, fine-tuning, and deploying popular transformer models, enabling seamless experimentation with LLMs.

The Transformers library provides the following key capabilities (a short usage sketch follows this list):

  • Thousands of pre-trained models – Access to a vast collection of models like BERT, Meta Llama, Qwen, T5, and more, which can be used for tasks such as text classification, translation, summarization, question answering, object detection, and speech recognition.
  • Pipelines API – Simplifies common tasks (such as sentiment analysis, summarization, and image segmentation) by handling tokenization, inference, and output formatting in a single call.
  • Trainer API – Provides a high-level interface for training and fine-tuning models, supporting features like mixed precision, distributed training, and integration with popular hardware accelerators.
  • Tokenization tools – Efficient and flexible tokenizers for converting raw text into model-ready inputs, supporting multiple languages and formats.
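
As a quick illustration of the Pipelines API listed above, the following minimal Python sketch (illustrative only, not part of this post's solution) runs the default sentiment-analysis pipeline:

from transformers import pipeline

# Create a sentiment-analysis pipeline; the library downloads a default
# pre-trained model on first use (the model choice here is illustrative).
classifier = pipeline("sentiment-analysis")

# Tokenization, inference, and output formatting happen in this single call
print(classifier("Fine-tuning domain-specific LLMs improves answer relevance."))
# Returns a list of dicts such as [{"label": ..., "score": ...}]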

SageMaker Training Jobs is a fully managed, on-demand machine learning (ML) service that runs remotely on AWS infrastructure to train a model using your data, code, and chosen compute resources. This service abstracts away the complexities of provisioning and managing the underlying infrastructure, so you can focus on developing and fine-tuning your ML and foundation models. Key capabilities offered by SageMaker training jobs are:

  • Fully managed – SageMaker handles resource provisioning, scaling, and management for your training jobs, so you don’t need to manually set up servers or clusters.
  • Flexible input – You can use built-in algorithms, pre-built containers, or bring your own custom training scripts and Docker containers to run training workloads with the most popular frameworks, such as the Hugging Face Transformers library.
  • Scalable – It supports single-node or distributed training across multiple instances, making it suitable for both small and large-scale ML workloads.
  • Integration with multiple data sources – Training data can be stored in Amazon Simple Storage Service (Amazon S3), Amazon FSx, and Amazon Elastic Block Store (Amazon EBS), and output model artifacts are saved back to Amazon S3 after training is complete.
  • Customizable – You can specify hyperparameters, resource types (such as GPU or CPU instances), and other settings for each training job.
  • Cost-efficient options – Features like managed Spot Instances, flexible training plans, and heterogeneous clusters help optimize training costs.

Solution overview

The following diagram illustrates the solution workflow of using the Hugging Face Transformers library with a SageMaker Training job.

The workflow consists of the following steps:

  1. The user prepares the dataset by formatting it with the specific prompt style used for the selected model.
  2. The user prepares the training script using the Hugging Face Transformers library, specifying the configuration for the selected distribution option, such as Distributed Data Parallel (DDP) or Fully-Sharded Data Parallel (FSDP).
  3. The user submits an API request to SageMaker AI, passing the location of the training script, the Hugging Face Training container URI, and the training configurations required, such as distribution algorithm, instance type, and instance count.
  4. SageMaker AI uses the training job launcher script to run the training workload on a managed compute cluster. Based on the selected configuration, SageMaker AI provisions the required infrastructure, orchestrates distributed training, and upon completion, automatically decommissions the cluster.

This streamlined architecture delivers a fully managed user experience, helping you quickly develop your training code, define training parameters, and select your preferred infrastructure. SageMaker AI handles the end-to-end infrastructure management with a pay-as-you-go pricing model that bills only for the net training time in seconds.

Prerequisites

You must complete the following prerequisites before you can run the Meta Llama 3.1 8B fine-tuning notebook:

  1. Make the following quota increase requests for SageMaker AI. For this use case, you will need to request a minimum of 1 p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs) and scale to more p4d.24xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case). To help determine the right cluster size for the fine-tuning workload, you can use tools like VRAM Calculator or "Can it run LLM". On the Service Quotas console, request the following SageMaker AI quotas:
    • P4D instances (p4d.24xlarge) for training job usage: 1
  2. Create an AWS Identity and Access Management (IAM) role with managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give required access to SageMaker AI to run the examples.
  3. Assign the following policy as a trust relationship to your IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "sagemaker.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    
  4. (Optional) Create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. You can also use JupyterLab in your local setup.

These permissions grant broad access and are not recommended for use in production environments. See the SageMaker Developer Guide for guidance on defining more fine-grained permissions.

Prepare the dataset

To prepare the dataset, you must load the UCSC-VLAA/MedReason dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in LLMs. The following table shows an example of the data.

| dataset_name | id_in_dataset | question | answer | reasoning | options |
| --- | --- | --- | --- | --- | --- |
| medmcqa | 7131 | Urogenital Diaphragm is made up of the following… | Colle’s fascia. Explanation: Colle’s fascia do… | Finding reasoning paths:\n1. Urogenital diaphr… | Answer Choices:\nA. Deep transverse Perineus\n… |
| medmcqa | 7133 | Child with Type I Diabetes. What is the advise… | After 5 years. Explanation: Screening for diab… | **Finding reasoning paths:**\n\n1. Type 1 Diab… | Answer Choices:\nA. After 5 years\nB. After 2 … |
| medmcqa | 7134 | Most sensitive test for H pylori is- | Biopsy urease test. Explanation: Davidson&… | **Finding reasoning paths:**\n\n1. Consider th… | Answer Choices:\nA. Fecal antigen test\nB. Bio… |

We want to use the following columns for preparing our dataset:

  • question – The question being posed
  • answer – The correct answer to the question
  • reasoning – A detailed, step-by-step logical explanation of how to arrive at the correct answer

We can use the following steps to format the input in the proper style used for Meta Llama 3.1, and configure the data channels for SageMaker training jobs on Amazon S3:

  1. Load the UCSC-VLAA/MedReason dataset, using the first 10,000 rows of the original dataset:
    from datasets import load_dataset
    dataset = load_dataset("UCSC-VLAA/MedReason", split="train[:10000]")
  2. Apply the proper chat template to the dataset by using the apply_chat_template method of the Tokenizer:
    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    def prepare_dataset(sample):
    
        system_text = (
            "You are a deep-thinking AI assistant.\n\n"
            "For every user question, first write your thoughts and reasoning inside <think>...</think> tags, then provide your answer."
        )

        messages = []

        messages.append({"role": "system", "content": system_text})
        messages.append({"role": "user", "content": sample["question"]})
        messages.append(
            {
                "role": "assistant",
                "content": f"<think>\n{sample['reasoning']}\n</think>\n\n{sample['answer']}",
            }
        )
    
        # Apply chat template
        sample["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False
        )
    
        return sample
    

    The prepare_dataset function is applied to each element of the dataset, using apply_chat_template to produce a prompt in the following form:

    system
    {{SYSTEM_PROMPT}}
    user
    {{QUESTION}}
    assistant
    <think>
    {{REASONING}}
    </think>

    {{FINAL_ANSWER}}

    The following code is an example of the formatted prompt:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|> 
    You are a deep-thinking AI assistant. 
    For every user question, first write your thoughts and reasoning inside <think>...</think> tags, then provide your answer.
    <|eot_id|><|start_header_id|>user<|end_header_id|> 
    A 66-year-old man presents to the emergency room with blurred vision, lightheadedness, and chest pain that started 30 minutes ago. The patient is awake and alert. 
    His history is significant for uncontrolled hypertension, coronary artery disease, and he previously underwent percutaneous coronary intervention. 
    He is afebrile. The heart rate is 102/min, the blood pressure is 240/135 mm Hg, and the O2 saturation is 100% on room air. 
    An ECG is performed and shows no acute changes. A rapid intravenous infusion of a drug that increases peripheral venous capacitance is started. 
    This drug has an onset of action that is less than 1 minute with rapid serum clearance that necessitates a continuous infusion. What is the most severe side effect of this medication?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    <think>
    ### Finding Reasoning Paths:
    1. **Blurred vision, lightheadedness, and chest pain** → Malignant hypertension → Rapid IV antihypertensive therapy.
    2. **Uncontrolled hypertension and coronary artery disease** → Malignant hypertension → Rapid IV antihypertensive therapy.
    3. **Severe hypertension (BP 240/135 mm Hg)** → Risk of end-organ damage → Malignant hypertension → Rapid IV antihypertensive therapy.
    4. **Chest pain and history of coronary artery disease** → Risk of myocardial ischemia → Malignant hypertension → Rapid IV antihypertensive therapy.
    ---

    ### Reasoning Process:
    1. **Clinical Presentation and Diagnosis**: - The patient presents with blurred vision...
    ...
    </think>

    Cyanide poisoning
    <|eot_id|><|end_of_text|>
    
  3. Split the dataset into train, validation, and test datasets (one way to produce the train, val, and test DataFrames used in this step is sketched after these steps):
    from datasets import Dataset, DatasetDict
    from random import randint
    
    # train, val, and test are pandas DataFrames produced by an earlier split
    train_dataset = Dataset.from_pandas(train)
    val_dataset = Dataset.from_pandas(val)
    test_dataset = Dataset.from_pandas(test)
    
    dataset = DatasetDict({"train": train_dataset, "val": val_dataset})
    train_dataset = dataset["train"].map(
        prepare_dataset, remove_columns=list(train_dataset.features)
    )
    
    val_dataset = dataset["val"].map(
        prepare_dataset, remove_columns=list(val_dataset.features)
    )
    
  4. Prepare the training and validation datasets for the SageMaker training job by saving them as JSON Lines (.jsonl) files and constructing the S3 paths where these files will be uploaded:
    ...
     
    train_dataset.to_json("./data/train/dataset.jsonl")
    val_dataset.to_json("./data/val/dataset.jsonl")
    
     
    s3_client.upload_file(
        "./data/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
    )
    s3_client.upload_file(
        "./data/val/dataset.jsonl", bucket_name, f"{input_path}/val/dataset.jsonl"
    )
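
Step 3 in the preceding list assumes that train, val, and test already exist as pandas DataFrames. The following is a minimal sketch of one way to produce them from the loaded dataset; the 80/10/10 proportions and random seed are assumptions for illustration and are not taken from the original notebook:

# Sketch (assumed): derive train/val/test pandas DataFrames with an 80/10/10 split
df = dataset.to_pandas()

train = df.sample(frac=0.8, random_state=42)
remaining = df.drop(train.index)
val = remaining.sample(frac=0.5, random_state=42)
test = remaining.drop(val.index)

print(len(train), len(val), len(test))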
    

Prepare the training script

To fine-tune meta-llama/Llama-3.1-8B with a SageMaker Training job, we prepared the train.py file, which serves as the entry point of the training job to execute the fine-tuning workload.

The training process can use the Trainer or SFTTrainer class to fine-tune our model. These classes streamline supervised fine-tuning and continued pre-training of LLMs, making it efficient to adapt pre-trained models to specific tasks or domains.

The Trainer and SFTTrainer classes both facilitate model training with Hugging Face transformers. The Trainer class is the standard high-level API for training and evaluating transformer models on a wide range of tasks, including text classification, sequence labeling, and text generation. The SFTTrainer is a subclass built specifically for supervised fine-tuning of LLMs, particularly for instruction-following or conversational tasks.
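
For comparison, the following is a minimal sketch of what an equivalent setup could look like with SFTTrainer from the Hugging Face TRL library. The exact arguments vary across TRL versions, and this is not the training script used in this post; it assumes a dataset exposing a text column, such as the one produced by the prepare_dataset function shown earlier:

from trl import SFTConfig, SFTTrainer

# Illustrative only: supervised fine-tuning on a dataset with a "text" column
training_args = SFTConfig(
    output_dir="/tmp/sft-output",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # model id or a preloaded model object
    args=training_args,
    train_dataset=train_dataset,      # dataset with a "text" column
)
trainer.train()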

To accelerate the model fine-tuning, we distribute the training workload by using the FSDP technique. It is an advanced parallelism technique designed to train large models that might not fit in the memory of a single GPU, with the following benefits:

  • Parameter sharding – Instead of replicating the entire model on each GPU, FSDP splits (shards) model parameters, optimizer states, and gradients across GPUs
  • Memory efficiency – By sharding, FSDP drastically reduces the memory footprint on each device, enabling training of larger models or larger batch sizes
  • Synchronization – During training, FSDP gathers only the necessary parameters for each computation step, then releases memory immediately after, further saving resources
  • CPU offload – Optionally, FSDP can offload some data to CPUs to save even more GPU memory
  1. In our example, we use the Trainer class and define the required TrainingArguments to execute the FSDP distributed workload:
    from transformers import (
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    trainer = Trainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=test_ds if test_ds is not None else None,
        args=TrainingArguments(
            **training_args,
        ),
        callbacks=callbacks,
        data_collator=DataCollatorForLanguageModeling(
            tokenizer, mlm=False
        ),
    )
    
  2. To further optimize the fine-tuning workload, we use the QLoRA technique, which quantizes a pre-trained language model to 4 bits and attaches small Low-Rank Adapters that are then fine-tuned (a sketch of the adapter attachment follows these steps):
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )
    
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id)
    
    # Define PAD token
    tokenizer.pad_token = tokenizer.eos_token
    
    # Configure quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_storage=torch.bfloat16
    )
    
    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_id,
        trust_remote_code=True,
        quantization_config=bnb_config,
        use_cache=not training_args.gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs,
    )
    
  3. The script_args and training_args are provided as hyperparameters for the SageMaker Training job in a configuration recipe .yaml file and parsed in the train.py file by using the TrlParser class provided by Hugging Face TRL:
    model_id: "meta-llama/Llama-3.1-8B-Instruct"      # Hugging Face model id
    # sagemaker specific parameters
    output_dir: "/opt/ml/model"                       # path to where SageMaker will upload the model 
    checkpoint_dir: "/opt/ml/checkpoints/"            # path to where SageMaker will upload the model checkpoints
    train_dataset_path: "/opt/ml/input/data/train/"   # path to where S3 saves train dataset
    val_dataset_path: "/opt/ml/input/data/val/"       # path to where S3 saves test dataset
    save_steps: 100                                   # Save checkpoint every this many steps
    token: ""
    # training parameters
    lora_r: 32
    lora_alpha: 64
    lora_dropout: 0.1                 
    learning_rate: 2e-4                    # learning rate scheduler
    num_train_epochs: 2                    # number of training epochs
    per_device_train_batch_size: 4         # batch size per device during training
    per_device_eval_batch_size: 2          # batch size for evaluation
    gradient_accumulation_steps: 4         # number of steps before performing a backward/update pass
    gradient_checkpointing: true           # use gradient checkpointing
    bf16: true                             # use bfloat16 precision
    tf32: false                            # use tf32 precision
    fsdp: "full_shard auto_wrap offload"   #FSDP configurations
    fsdp_config: 
        backward_prefetch: "backward_pre"
        cpu_ram_efficient_loading: true
        offload_params: true
        forward_prefetch: false
        use_orig_params: true
    warmup_steps: 100
    weight_decay: 0.01
    merge_weights: true                    # merge weights in the base model
    

    For the implemented use case, we decided to fine-tune the adapter with the following values:

    • lora_r: 32 – Allows the adapter to capture more complex reasoning transformations.
    • lora_alpha: 64 – Given the reasoning task we are trying to improve, this value allows the adapter to have a significant impact on the base model.
    • lora_dropout: 0.05 – We want to preserve reasoning connections and avoid dropping important ones.
    • warmup_steps: 100 – Gradually increases the learning rate to the specified value. For this reasoning task, we want the model to learn a new structure without forgetting the previous knowledge.
    • weight_decay: 0.01 – Maintains model generalization.
  4. Prepare the configuration file for the SageMaker Training job by saving it as a YAML file and constructing the S3 path where it will be uploaded:
    import os
    
    if default_prefix:
        input_path = f"{default_prefix}/datasets/llm-fine-tuning-modeltrainer-sft"
    else:
        input_path = f"datasets/llm-fine-tuning-modeltrainer-sft"
    
    train_config_s3_path = f"s3://{bucket_name}/{input_path}/config/args.yaml"
    
    # upload the model yaml file to s3
    model_yaml = "args.yaml"
    s3_client.upload_file(model_yaml, bucket_name, f"{input_path}/config/args.yaml")
    os.remove("./args.yaml")
    
    print(f"Training config uploaded to:")
    print(train_config_s3_path)
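
As mentioned in step 2, QLoRA attaches small low-rank adapters to the 4-bit quantized base model. That portion of train.py isn't reproduced above; the following is a minimal sketch, assuming the Hugging Face peft library and that the parsed script_args fields mirror the YAML keys, of how the adapters could be attached:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Sketch (assumed): prepare the quantized model and attach LoRA adapters
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=script_args.lora_r,               # for example, 32 as in the recipe above
    lora_alpha=script_args.lora_alpha,  # for example, 64
    lora_dropout=script_args.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",        # string shortcut supported in recent peft versions
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()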

SFT training using a SageMaker Training job

To run a fine-tuning workload using the SFT training script and SageMaker Training jobs, we use the ModelTrainer class.

The ModelTrainer class is a newer, more intuitive approach to model training that significantly enhances the user experience and supports distributed training, Bring Your Own Container (BYOC), and recipes. For additional information, refer to the SageMaker Python SDK documentation.

Set up the fine-tuning workload with the following steps:

  1. Specify the instance type and retrieve the container image for the training job:
    instance_type = "ml.p4d.24xlarge"
    instance_count = 1
    
    image_uri = image_uris.retrieve(
        framework="huggingface",
        region=sagemaker_session.boto_session.region_name,
        version="4.56.2",
        base_framework_version="pytorch2.8.0",
        instance_type=instance_type,
        image_scope="training",
    )
    
  2. Define the source code configuration by pointing to the created train.py:
    from sagemaker.train.configs import SourceCode
    
    source_code = SourceCode(
        source_dir="./scripts",
        requirements="requirements.txt",
        entry_script="train.py",
    )
    
  3. Configure the training compute, optionally providing the keep_alive_period_in_seconds parameter to use managed warm pools, which retain and reuse the cluster during the experimentation phase:
    from sagemaker.train.configs import Compute
    
    compute_configs = Compute(
        instance_type=instance_type,
        instance_count=instance_count,
        keep_alive_period_in_seconds=0,
    )
    
  4. Create the ModelTrainer object by providing the required training setup, and define the argument distributed=Torchrun() to use torchrun as a launcher to execute the training job in a distributed manner across the available GPUs in the selected instance:
    from sagemaker.train.configs import (
        CheckpointConfig,
        OutputDataConfig,
        StoppingCondition,
    )
    from sagemaker.train.distributed import Torchrun
    from sagemaker.train.model_trainer import ModelTrainer
    
    
    # define Training Job Name
    job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft"
    
    # define OutputDataConfig path
    output_path = f"s3://{bucket_name}/{job_name}"
    
    # Define the ModelTrainer
    model_trainer = ModelTrainer(
        training_image=image_uri,
        source_code=source_code,
        base_job_name=job_name,
        compute=compute_configs,
        distributed=Torchrun(),
        stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
        hyperparameters={
            "config": "/opt/ml/input/data/config/args.yaml"  # path to TRL config which was uploaded to s3
        },
        output_data_config=OutputDataConfig(s3_output_path=output_path),
        checkpoint_config=CheckpointConfig(
            s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
        ),
    ) 
    
  5. Set up the input channels for the ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation dataset, and for the configuration parameters:
    from sagemaker.train.configs import InputData
    # Pass the input data
    train_input = InputData(
        channel_name="train",
        data_source=train_dataset_s3_path, # S3 path where training data is stored
    )
    val_input = InputData(
        channel_name="val",
        data_source=val_dataset_s3_path, # S3 path where validation data is stored
    )
    config_input = InputData(
        channel_name="config",
        data_source=train_config_s3_path, # S3 path where configurations are stored
    )
    # Check input channels configured
    data = [train_input, val_input, config_input]
    
  6. Submit the training job:
    model_trainer.train(input_data_config=data, wait=False)

The training job with Flash Attention 2 for one epoch with a dataset of 10,000 samples takes approximately 18 minutes to complete.
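
Flash Attention 2 is enabled when the base model is loaded. The post doesn't show the contents of model_configs passed to from_pretrained, but a minimal assumed sketch looks like the following:

import torch

# Assumed contents of model_configs (not shown in the original snippet):
# enable Flash Attention 2 and load the weights in bfloat16
model_configs = {
    "attn_implementation": "flash_attention_2",
    "torch_dtype": torch.bfloat16,
}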

Deploy and test fine-tuned Meta Llama 3.1 8B on SageMaker AI

To evaluate your fine-tuned model, you have several options. You can use an additional SageMaker Training job to evaluate the model with Hugging Face Lighteval on SageMaker AI, or you can deploy the model to a SageMaker real-time endpoint and interactively test the model by using techniques like LLM as judge to compare generated content with ground truth content. For a more comprehensive evaluation that demonstrates the impact of fine-tuning on model performance, you can use the MedReason evaluation script to compare the base meta-llama/Llama-3.1-8B model with your fine-tuned version.

In this example, we use the deployment approach, iterating over the test dataset and evaluating the model on those samples using a simple loop.

  1. Select the instance type and the container image for the endpoint:
    import boto3
    
    sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)
    
    image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.13-gpu-py312"
    
  2. Create the SageMaker Model using the container URI for vLLM and the S3 path to your model. Set your vLLM configuration, including the number of GPUs and max input tokens. For a full list of configuration options, see vLLM engine arguments.
    import json

    env = {
        "SM_VLLM_MODEL": "/opt/ml/model",
        "SM_VLLM_DTYPE": "bfloat16",
        "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.8",
        "SM_VLLM_MAX_MODEL_LEN": json.dumps(1024 * 16),
        "SM_VLLM_MAX_NUM_SEQS": "1",
        "SM_VLLM_ENABLE_CHUNKED_PREFILL": "true",
        "SM_VLLM_KV_CACHE_DTYPE": "auto",
        "SM_VLLM_TENSOR_PARALLEL_SIZE": "4",
    }
    
    model_response = sm_client.create_model(
        ModelName=f"{model_id.split('/')[-1].replace('.', '-')}-model",
        ExecutionRoleArn=role,
        PrimaryContainer={
            "Image": image_uri,
            "Environment": env,
            "ModelDataSource": {
                "S3DataSource": {
                    "S3Uri": f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
                    "S3DataType": "S3Prefix",
                    "CompressionType": "Gzip",
                }
            },
        },
    )
    
  3. Create the endpoint configuration by specifying the type and number of instances:
    instance_count = 1
    instance_type = "ml.g5.12xlarge"
    health_check_timeout = 700
    
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": f"{model_id.split('/')[-1].replace('.', '-')}-model",
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                "ModelDataDownloadTimeoutInSeconds": health_check_timeout,
                "ContainerStartupHealthCheckTimeoutInSeconds": health_check_timeout,
                "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
            }
        ],
    )
    
  4. Deploy the model:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=f"{model_id.split('/')[-1].replace('.', '-')}-sft", 
        EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
    ) 
    

SageMaker AI will now create the endpoint and deploy the model to it. This can take 5–10 minutes. Afterwards, you can test the model by sending some example inputs to the endpoint. You can use the invoke_endpoint method of the sagemaker-runtime client to send the input to the model and get the output:

import json
import pandas as pd

eval_dataset = []

for index, el in enumerate(test_dataset, 1):
    print("Processing item ", index)

    payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are a deep-thinking AI assistant.nnFor every user question, first write your thoughts and reasoning inside <think>...</think> tags, then provide your answer.",
            },
            {"role": "user", "content": el["question"]},
        ],
        "max_tokens": 4096,
        "stop": ["<|eot_id|>", "<|end_of_text|>"],
        "temperature": 0.4,
        "top_p": 0.9,
        "repetition_penalty": 1.15,
        "no_repeat_ngram_size": 3,
        "do_sample": True,
    }

    # predictor here is a SageMaker Runtime client, for example boto3.client("sagemaker-runtime")
    response = predictor.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    result = json.loads(response["Body"].read().decode())
    eval_dataset.append([el["question"], result["choices"][0]["message"]["content"]])

    print("**********************************************")

eval_dataset_df = pd.DataFrame(
    eval_dataset, columns=["question", "answer"]
)

eval_dataset_df.to_json(
    "./eval_dataset_results.jsonl", orient="records", lines=True
)

The following are some examples of generated answers:

Question: "Perl's stain or prussion blue test is for:"

Answer Fine-tuned: """
<think>
The Perl's stain or Prussian blue test is used to detect the presence of iron in biological samples. 
It involves adding potassium ferrocyanide (K4[Fe(CN)6]) to the sample, 
which reacts with the iron ions present in it to form a dark blue-colored compound known as ferric ferrocyanide. 
This reaction can be observed visually, allowing researchers to determine if iron is present in the sample.
</think>

In simpler terms, the Perl's stain or Prussian blue test is used to identify iron in biological samples.
"""

The fine-tuned model shows strong reasoning capabilities by providing structured, detailed explanations with clear thought processes, breaking down the concepts step-by-step before arriving at the final answer. This example showcases the effectiveness of our fine-tuning approach using Hugging Face Transformers and a SageMaker Training job.

Clean up

To clean up your resources to avoid incurring additional charges, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. (Optional) Delete the SageMaker Studio domain.
  3. Verify that your training job isn’t running anymore. To do so, on the SageMaker console, under Training in the navigation pane, choose Training jobs.
  4. Delete the SageMaker endpoint.

Conclusion

In this post, we demonstrated how enterprises can efficiently scale fine-tuning of both small and large language models by using the integration between the Hugging Face Transformers library and SageMaker Training jobs. This powerful combination transforms traditionally complex and resource-intensive processes into streamlined, scalable, and production-ready workflows.

Using a practical example with the meta-llama/Llama-3.1-8B model and the MedReason dataset, we demonstrated how to apply advanced techniques like FSDP and LoRA to reduce training time and cost—without compromising model quality.

This solution highlights how enterprises can effectively address common LLM fine-tuning challenges such as fragmented toolchains, high memory and compute requirements, and multi-node scaling inefficiencies and GPU underutilization.

By using the integrated Hugging Face and SageMaker architecture, businesses can now build and deploy customized, domain-specific models faster—with greater control, cost-efficiency, and scalability.

To get started with your own LLM fine-tuning project, explore the code samples provided in our GitHub repository.


About the Authors

Florent Gbelidji is a Machine Learning Engineer for Customer Success at Hugging Face. Based in Paris, France, Florent joined Hugging Face 3.5 years ago as an ML Engineer in the Expert Acceleration Program, helping companies build solutions with open source AI. He is now the Cloud Partnership Tech Lead for the AWS account, driving integrations between the Hugging Face environment and AWS services.

Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and Amazon ML stack. His expertise includes distributed training and inference workloads, model customization, generative AI, and end-to-end ML. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

Louise Ping is a Senior Worldwide GenAI Specialist at AWS, where she helps partners build go-to-market strategies and leads cross-functional initiatives to expand opportunities and drive adoption. Drawing from her diverse AWS experience across Storage, APN Partner Marketing, and AWS Marketplace, she works closely with strategic partners like Hugging Face to drive technical collaborations. When not working at AWS, she attempts home improvement projects—ideally with limited mishaps.

Safir Alvi is a Worldwide GenAI/ML Go-To-Market Specialist at AWS based in New York. He focuses on advising strategic global customers on scaling their model training and inference workloads on AWS, and driving adoption of Amazon SageMaker AI Training Jobs and Amazon SageMaker HyperPod. He specializes in optimizing and fine-tuning generative AI and machine learning models across diverse industries, including financial services, healthcare, automotive, and manufacturing.