02 May 2025

The Lazy Solution!

While transferring large files from an SFTP server to Amazon S3, I faced a recurring issue: the program would frequently hang or crash while downloading files to a temporary folder. This was especially problematic once files grew past a certain size, because the system couldn't handle loading the entire content into memory at once.

After trying a few approaches, I landed on a solution that involved lazy reading — a technique that allowed me to stream large files in smaller chunks rather than loading them all at once.

Initial Approach: Naive File Download

My first version simply downloaded each file in full with sftp.get():

import os

DOWNLOAD_PATH = '/tmp'

def download_files_from_sftp(records):
    if not records:
        return []
    print(records)
    downloaded_files = []
    sftp = get_sftp_connection()
    try:
        local_path = DOWNLOAD_PATH
        os.makedirs(local_path, exist_ok=True)
        for record in records:
            filename = record['FILENAME']
            sftp_path_file = "some_path"
            print(f"Processing file: {filename}")
            print(f"SFTP path: {sftp_path_file}")
            try:
                sftp.stat(sftp_path_file)  # raises FileNotFoundError if the file is missing
                print(f"File exists on SFTP server: {sftp_path_file}")
                local_path_file = os.path.join(local_path, filename)
                print(f"Downloading to: {local_path_file}")
                sftp.get(sftp_path_file, local_path_file)  # fetches the whole file in one call
                print(f"Successfully downloaded: {local_path_file}")
                downloaded_files.append(local_path_file)
            except FileNotFoundError:
                print(f"ERROR: File not found on SFTP server: {sftp_path_file}")
                continue
    except Exception as e:
        print(f"Error downloading files from SFTP: {e}")
        return []
    finally:
        sftp.close()  # close once, whether the loop succeeded or failed
    return downloaded_files

However, this failed when the file size was large: the program would freeze after about 20MB. Clearly, reading the full file at once was not scalable.
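
Both versions assume a get_sftp_connection() helper that the post doesn't show. A minimal paramiko sketch, with the host and credentials as obvious placeholders:

import paramiko

def get_sftp_connection():
    # Placeholder host and credentials; in practice these come from config or a secrets store.
    transport = paramiko.Transport(("sftp.example.com", 22))
    transport.connect(username="user", password="secret")
    return paramiko.SFTPClient.from_transport(transport)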

Approach 2: The Lazy Solution

Lazy reading is a technique where data is read in small, manageable chunks rather than being loaded into memory all at once, which makes it especially useful for large files.
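
The pattern is easy to show in isolation. A minimal sketch (the name read_in_chunks and the 32 KB default are illustrative, not from the original code):

def read_in_chunks(file_obj, chunk_size=32768):
    # Yield the file lazily, one chunk at a time,
    # so only chunk_size bytes are ever held in memory.
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:  # an empty read means EOF
            return
        yield chunk

Any file-like object works here, including the remote handle paramiko returns, which is what the final version below relies on.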

To avoid the memory bottleneck, I applied this idea to the SFTP download: the file is read and written in fixed-size chunks of 32 KB. The program stays responsive and memory-efficient, even with very large files.


import traceback

def download_large_file(sftp, remote_path, local_path):
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with sftp.file(remote_path, 'rb') as remote_file, open(local_path, 'wb') as local_file:
        bytes_downloaded = 0
        while True:
            data = remote_file.read(32768)  # 32 KB chunks
            if not data:  # an empty read means EOF
                break
            local_file.write(data)
            bytes_downloaded += len(data)
            if bytes_downloaded % (5 * 1024 * 1024) < 32768:  # log roughly every 5 MB
                print(f"Downloaded {bytes_downloaded / (1024 * 1024):.2f} MB...")

def download_files_from_sftp(records):
    if not records:
        return []
    downloaded_files = []
    sftp = get_sftp_connection()
    try:
        for record in records:
            sftp_path_file = "some_path"
            filename = record['FILENAME']
            path = record["S3_PATH"]
            local_dir = os.path.join(DOWNLOAD_PATH, path)
            local_path_file = os.path.join(local_dir, filename)
            print("Starting download")
            download_large_file(sftp, sftp_path_file, local_path_file)
            print("Completed")
            downloaded_files.append(local_path_file)
    except Exception:
        print("Download failed with exception:")
        traceback.print_exc()
    finally:
        sftp.close()
    return downloaded_files
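
With the downloads stable, the last leg is the S3 upload. A hedged sketch using boto3 (the bucket name and key layout are assumptions, not part of the original pipeline):

import boto3

s3 = boto3.client("s3")

def upload_files_to_s3(local_files, bucket="my-example-bucket"):
    for local_path in local_files:
        # Mirror the local layout under DOWNLOAD_PATH as the S3 key.
        key = os.path.relpath(local_path, DOWNLOAD_PATH)
        # upload_file streams from disk and switches to multipart uploads
        # for large files, so this step is memory-friendly too.
        s3.upload_file(local_path, bucket, key)

Conveniently, boto3 applies the same chunked idea on the upload side, so neither half of the transfer needs the whole file in memory.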


Benefits:

- Constant memory use: only one 32 KB chunk is in memory at a time, regardless of file size.
- No more hangs or crashes on large files; the freeze at around 20MB disappeared.
- Progress can be logged while the download runs.

Trade-offs:

- More code to maintain than a single sftp.get() call.
- The chunk size needs tuning: smaller chunks mean more round trips, larger ones use more memory.
- The file still lands on local disk before the S3 upload, so temporary storage is still required.

You can reach out to me on LinkedIn / mail to discuss more!